Feature Selection and Dimensionality Reduction

Feature selection and dimensionality reduction address the same pressure from different directions: real data sets often have too many variables, many of which are noisy, redundant, expensive, or irrelevant. Aggarwal separates feature subset selection from transformations such as principal component analysis, singular value decomposition, latent semantic analysis, wavelets, multidimensional scaling, and graph spectral embeddings. The shared goal is to keep the structure that matters while reducing computational and statistical burden.

This page belongs with data preparation, but it also connects to nearly every later chapter. Clustering in high dimensions becomes unstable because distances concentrate. Text mining depends on sparse high-dimensional vectors, then often reduces them with SVD or topic models. Stream mining uses compact synopses. Graph mining uses spectral transformations to turn topology into coordinates.

Definitions

Feature subset selection chooses a subset of the original features. It preserves interpretability because selected features keep their original meaning.

Filter methods score features before fitting the final mining model. Scores may use variance, correlation, mutual information, entropy, term strength, or class-separation criteria.

Wrapper methods evaluate feature subsets by repeatedly fitting a model. They can match a target algorithm closely but are usually more expensive.

Embedded methods select features during model training, as in Lasso regression or tree-based split selection.

Dimensionality reduction maps data from $d$ dimensions to $k$ dimensions, where $k < d$. The new coordinates may be linear combinations of old features or nonlinear embeddings.

Principal component analysis (PCA) finds orthogonal directions of maximum variance. For a centered data matrix $X$ with $n$ rows, PCA uses the eigenvectors of the covariance matrix $X^\top X/(n-1)$.
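
As a minimal sketch of this definition, the covariance eigenvectors can be computed directly with NumPy; the small four-point matrix below is made up for illustration.

import numpy as np

# Minimal sketch: PCA as eigenvectors of the sample covariance matrix.
# The four two-dimensional points are illustrative, not from the text.
X = np.array([[2.0, 1.0], [3.0, 4.0], [5.0, 3.0], [6.0, 6.0]])

Xc = X - X.mean(axis=0)                   # center each column
C = Xc.T @ Xc / (Xc.shape[0] - 1)         # covariance matrix X^T X / (n - 1)
eigenvalues, eigenvectors = np.linalg.eigh(C)

top_direction = eigenvectors[:, -1]       # eigh sorts eigenvalues ascending
scores = Xc @ top_direction               # one-dimensional PCA coordinates
print(np.round(eigenvalues, 3), np.round(scores, 3))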

Singular value decomposition (SVD) writes a matrix as

$$X = U \Sigma V^\top,$$

where the columns of $U$ and $V$ are orthonormal and $\Sigma$ contains the singular values. Truncating to the largest $k$ singular values gives the best rank-$k$ approximation in squared Frobenius error.

Latent semantic analysis (LSA) applies truncated SVD to a term-document matrix so documents and terms are embedded in a lower-dimensional semantic space.

Haar wavelet transform represents a sequence by recursive averages and differences. It is useful when the signal has local changes.
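
A minimal sketch of one level of the transform, assuming the simple unnormalized averaging-and-differencing convention and a made-up length-eight sequence:

import numpy as np

# One level of the (unnormalized) Haar transform: pairwise averages keep the
# coarse shape, pairwise half-differences keep the local detail.
signal = np.array([8.0, 6.0, 2.0, 3.0, 9.0, 9.0, 7.0, 5.0])

pairs = signal.reshape(-1, 2)
averages = pairs.mean(axis=1)                  # coarse approximation, length 4
details = (pairs[:, 0] - pairs[:, 1]) / 2.0    # local differences, length 4
print(averages, details)
# Recursing on `averages` yields the full multilevel decomposition.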

Multidimensional scaling (MDS) embeds objects from pairwise distances into coordinates, usually by preserving distances as well as possible.
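
A minimal sketch of classical MDS, assuming the standard double-centering construction and an illustrative four-object distance matrix:

import numpy as np

# Classical MDS: double-center the squared distances, then use the top
# eigenvectors (scaled by the square roots of their eigenvalues) as coordinates.
D = np.array([
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
B = -0.5 * J @ (D ** 2) @ J               # inner-product (Gram) matrix
eigenvalues, eigenvectors = np.linalg.eigh(B)

order = np.argsort(eigenvalues)[::-1]     # largest eigenvalues first
top = order[:2]
coords = eigenvectors[:, top] * np.sqrt(np.maximum(eigenvalues[top], 0.0))
print(np.round(coords, 3))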

Key results

Feature selection reduces variance but may increase bias. Removing irrelevant variables helps models generalize and improves speed. Removing weak but genuinely useful variables can hide structure. Selection should be validated on held-out data or with stability checks.

PCA maximizes retained variance. The first principal component is the unit vector $v$ maximizing $\|Xv\|^2 = v^\top X^\top X v$. The solution is the top eigenvector of $X^\top X$. The next components repeat the same optimization subject to orthogonality constraints.

Truncated SVD gives the best low-rank reconstruction. If $X = U\Sigma V^\top$, then

$$X_k = U_k \Sigma_k V_k^\top$$

minimizes $\|X - Y\|_F$ over all matrices $Y$ with rank at most $k$. This is why SVD is central in text mining and matrix approximation.
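
A small numerical illustration of this optimality claim (the matrix and the competing rank-1 approximation below are made up): the truncated SVD reconstruction should have the smaller Frobenius error.

import numpy as np

# Compare the rank-1 truncated SVD reconstruction with an arbitrary rank-1 guess.
X = np.array([[3.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 1
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]              # best rank-1 approximation

naive = np.outer(X[:, 0], [1.0, 0.0, 0.0])        # a competing rank-1 matrix
print(np.linalg.norm(X - X_k, "fro"), np.linalg.norm(X - naive, "fro"))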

Reduction is not always supervised. PCA and SVD do not know the target label. They may keep high-variance directions that are irrelevant to classification and discard low-variance directions that separate classes. Supervised feature selection should be preferred when label prediction is the task and labels are reliable.

Graph embeddings use matrix structure. An adjacency matrix, Laplacian, or normalized Laplacian can be decomposed spectrally. The resulting eigenvectors provide coordinates that reflect graph connectivity, which can then feed clustering or classification.
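
A minimal sketch, assuming a small made-up path graph: eigenvectors of the unnormalized Laplacian give per-node coordinates.

import numpy as np

# Spectral embedding from the unnormalized graph Laplacian L = D - A.
# The adjacency matrix encodes a four-node path graph, for illustration only.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

L = np.diag(A.sum(axis=1)) - A
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Skip the constant eigenvector (eigenvalue 0); the next eigenvectors reflect
# connectivity and can feed clustering or classification.
embedding = eigenvectors[:, 1:3]
print(np.round(embedding, 3))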

Dimensionality reduction should be fitted as part of the training pipeline. The reducer learns means, variances, singular vectors, selected features, or neighborhood structure from data. If it is fitted before a train-test split, information from the test set influences the representation and the validation score becomes optimistic. This is especially easy to miss with PCA, SVD, feature scaling, and mutual-information filters because they feel like harmless preprocessing. In predictive work, fit the selector or reducer only on the training fold, then apply the learned transformation to validation and test data.
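
A minimal sketch of the leakage-safe pattern with scikit-learn (random data, illustrative only): the scaler and PCA are fitted on the training part and merely applied to the held-out part.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Fit the scaler and the reducer on training data only, then apply them to test data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)            # statistics from the training fold
pca = PCA(n_components=2).fit(scaler.transform(X_train))

Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))  # no test information used in fitting
print(Z_train.shape, Z_test.shape)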

Visual

| Method | Keeps original features? | Uses labels? | Strength | Main caution |
|---|---|---|---|---|
| Filter selection | Yes | Sometimes | Fast and scalable | Ignores feature interactions |
| Wrapper selection | Yes | Usually | Tuned to target model | Expensive and overfit-prone |
| PCA | No | No | Captures variance compactly | Components may be hard to interpret |
| SVD/LSA | No | No | Works well for sparse matrices | Requires choosing rank |
| Wavelets | No | No | Captures local signal changes | Best for ordered signals |
| MDS | No | No | Starts from distances only | Costly for many objects |

Worked example 1: PCA by hand on centered points

Problem. Reduce four two-dimensional points to one dimension:

$$(2,0),\ (0,2),\ (-2,0),\ (0,-2).$$

Can a single direction represent these points well, or is one dimension insufficient?

Method.

  1. The mean is $(0,0)$, so the data are already centered.

  2. Build the covariance matrix. The $x$ values are $(2,0,-2,0)$ and the $y$ values are $(0,2,0,-2)$.

$$\sum x_i^2 = 8,\quad \sum y_i^2 = 8,\quad \sum x_i y_i = 0.$$

Using $n-1 = 3$,

$$C=\begin{bmatrix}8/3 & 0\\ 0 & 8/3\end{bmatrix}.$$

  3. The eigenvalues are both $8/3$. Every unit direction has the same variance.

  4. Projecting onto any one-dimensional direction keeps only half of the total variance, because total variance is $16/3$ and one component retains $8/3$.

Checked answer. PCA has no unique preferred direction here. A one-dimensional embedding loses half the variance. The data form a symmetric cross, so reducing to one coordinate is not a strong representation.
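
A quick NumPy check of this example confirms the two equal eigenvalues.

import numpy as np

# Covariance of the symmetric cross: both eigenvalues are 8/3, so PCA has
# no preferred direction, matching the hand calculation above.
points = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
centered = points - points.mean(axis=0)
C = centered.T @ centered / (len(points) - 1)
print(np.round(C, 3))
print(np.round(np.linalg.eigvalsh(C), 3))   # both approximately 2.667 = 8/3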

Worked example 2: Feature selection with class separation

Problem. Choose between two features for a binary classifier.

| object | feature A | feature B | class |
|---|---|---|---|
| 1 | 0 | 10 | negative |
| 2 | 1 | 11 | negative |
| 3 | 5 | 10 | positive |
| 4 | 6 | 11 | positive |

Method.

  1. Compare class means for feature A:

    Negative mean is $(0+1)/2 = 0.5$. Positive mean is $(5+6)/2 = 5.5$. The difference is $5.0$.

  2. Compare within-class spread for feature A:

    Each class's values differ from its class mean by $0.5$, so within-class variation is small.

  3. Compare class means for feature B:

    Negative mean is $(10+11)/2 = 10.5$. Positive mean is $(10+11)/2 = 10.5$. The difference is $0$.

  4. Feature B varies, but it does not separate the classes.

Checked answer. Feature A is useful for classification; feature B is not useful for this label, even though it has nonzero variance. This shows why unsupervised variance filters and supervised selection can disagree.
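
The same comparison can be checked numerically in a few lines.

import numpy as np

# Between-class mean difference for each feature: A separates the classes
# by 5.0, B by 0.0, matching the hand calculation above.
feature_a = np.array([0.0, 1.0, 5.0, 6.0])
feature_b = np.array([10.0, 11.0, 10.0, 11.0])
labels = np.array([0, 0, 1, 1])

for name, values in [("A", feature_a), ("B", feature_b)]:
    gap = values[labels == 1].mean() - values[labels == 0].mean()
    print(f"feature {name}: between-class mean difference = {gap}")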

Code

Pseudocode for a conservative reduction workflow:

INPUT: feature matrix X, optional labels y, target dimension k
OUTPUT: reduced representation Z

split data into training and validation parts
if labels y are available and prediction is the goal:
    score features using training labels only
    keep a candidate subset
if features are still high-dimensional:
    fit PCA, SVD, or another reducer on training data only
    transform training and validation data
evaluate downstream mining task
check stability of selected features or components
return fitted reducer and transformed data

A concrete version of the same workflow, with selection and reduction fitted together inside a scikit-learn pipeline:

import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_selection import mutual_info_classif, SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small illustrative data set: six objects, three features, binary labels.
X = np.array(
    [
        [0, 10, 3],
        [1, 11, 2],
        [5, 10, 3],
        [6, 11, 2],
        [5, 12, 4],
        [0, 12, 1],
    ],
    dtype=float,
)
y = np.array([0, 0, 1, 1, 1, 0])

# Scale, keep the two most informative features, then project to one dimension.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("select", SelectKBest(mutual_info_classif, k=2)),
        ("pca", PCA(n_components=1)),
    ]
)

z = pipeline.fit_transform(X, y)
selected = pipeline.named_steps["select"].get_support(indices=True)
print("selected feature indices:", selected)
print("one-dimensional coordinates:", np.round(z.ravel(), 3))

# Truncated SVD of a tiny term-document matrix, as in latent semantic analysis.
term_doc = np.array([[3, 0, 1], [0, 4, 0], [1, 0, 2]], dtype=float)
lsa = TruncatedSVD(n_components=2, random_state=0)
print(np.round(lsa.fit_transform(term_doc), 3))

Common pitfalls

  • Selecting features on the full data set before train-test splitting.
  • Assuming high variance means high predictive value.
  • Keeping too many principal components because the scree plot looks smooth; use downstream validation.
  • Interpreting PCA components as original features without checking loadings.
  • Applying PCA to unscaled features when units differ.
  • Using SVD embeddings from a term-document matrix without controlling common stop words and document length effects.
  • Forgetting that nonlinear structures may need graph, manifold, or kernel methods rather than a global linear projection.

Connections