Feature Selection and Dimensionality Reduction
Feature selection and dimensionality reduction address the same pressure from different directions: real data sets often have too many variables, many of which are noisy, redundant, expensive, or irrelevant. Aggarwal separates feature subset selection from transformations such as principal component analysis, singular value decomposition, latent semantic analysis, wavelets, multidimensional scaling, and graph spectral embeddings. The shared goal is to keep the structure that matters while reducing computational and statistical burden.
This page belongs with data preparation, but it also connects to nearly every later chapter. Clustering in high dimensions becomes unstable because distances concentrate. Text mining depends on sparse high-dimensional vectors, then often reduces them with SVD or topic models. Stream mining uses compact synopses. Graph mining uses spectral transformations to turn topology into coordinates.
Definitions
Feature subset selection chooses a subset of the original features. It preserves interpretability because selected features keep their original meaning.
Filter methods score features before fitting the final mining model. Scores may use variance, correlation, mutual information, entropy, term strength, or class-separation criteria.
Wrapper methods evaluate feature subsets by repeatedly fitting a model. They can match a target algorithm closely but are usually more expensive.
Embedded methods select features during model training, as in Lasso regression or tree-based split selection.
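As a quick illustration of embedded selection (the synthetic data and the `alpha` value are my own choices, not from the text), an L1-penalized regression shrinks the coefficients of irrelevant features toward zero during training, so selection and fitting happen in one step:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
# three features: only the first drives the target, the rest are noise
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

model = Lasso(alpha=0.1).fit(X, y)
# the L1 penalty leaves the useful feature with a large coefficient
# and drives the irrelevant ones to (near) zero
print(np.round(model.coef_, 3))
```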
Dimensionality reduction maps data from $d$ dimensions to $k$ dimensions, where $k < d$. The new coordinates may be linear combinations of old features or nonlinear embeddings.
Principal component analysis (PCA) finds orthogonal directions of maximum variance. For a centered data matrix $X$ with $n$ rows, PCA uses the eigenvectors of the covariance matrix $C = \tfrac{1}{n} X^\top X$.
Singular value decomposition (SVD) writes a matrix as
$$A = U \Sigma V^\top,$$
where the columns of $U$ and $V$ are orthonormal and the diagonal matrix $\Sigma$ contains the singular values. Truncating to the $k$ largest singular values gives the best rank-$k$ approximation in squared Frobenius error.
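A short numpy check of the truncation claim (the matrix is an arbitrary example): the squared Frobenius error of the rank-$k$ truncation equals the sum of the squared discarded singular values.

```python
import numpy as np

A = np.array([[3.0, 0.0, 1.0],
              [0.0, 4.0, 0.0],
              [1.0, 0.0, 2.0]])

U, s, Vt = np.linalg.svd(A)
k = 1
A1 = U[:, :k] * s[:k] @ Vt[:k, :]   # best rank-1 approximation
err = np.sum((A - A1) ** 2)         # squared Frobenius error
# matches the sum of the squared discarded singular values
print(np.round(err, 6), np.round(np.sum(s[k:] ** 2), 6))
```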
Latent semantic analysis (LSA) applies truncated SVD to a term-document matrix so documents and terms are embedded in a lower-dimensional semantic space.
Haar wavelet transform represents a sequence by recursive averages and differences. It is useful when the signal has local changes.
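One level of the averages-and-differences step can be sketched directly (the example signal is invented):

```python
import numpy as np

def haar_step(x):
    """One level of the Haar transform: pairwise averages and differences."""
    x = np.asarray(x, dtype=float)
    avg = (x[0::2] + x[1::2]) / 2.0   # coarse approximation
    diff = (x[0::2] - x[1::2]) / 2.0  # local detail coefficients
    return avg, diff

signal = np.array([8.0, 6.0, 2.0, 3.0])
avg, diff = haar_step(signal)
print(avg, diff)   # averages [7. 2.5], differences [1. -0.5]
```

The transform is lossless at this level: `avg + diff` and `avg - diff` interleave back into the original signal, and applying the step recursively to the averages yields the full multi-resolution representation.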
Multidimensional scaling (MDS) embeds objects into coordinates given only their pairwise distances, usually by preserving those distances as well as possible.
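Classical MDS fits in a few lines of numpy (the distance matrix below describes three points on a line at positions 0, 3, and 7, my own example): double-center the squared distances to get a Gram matrix, then read coordinates off its top eigenvectors.

```python
import numpy as np

# pairwise distances between three collinear points at 0, 3, 7
D = np.array([[0.0, 3.0, 7.0],
              [3.0, 0.0, 4.0],
              [7.0, 4.0, 0.0]])

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
vals, vecs = np.linalg.eigh(B)        # eigenvalues in ascending order
coords = vecs[:, -1] * np.sqrt(vals[-1])  # 1-D embedding from top eigenpair
# pairwise distances of the embedding recover the original D
print(np.round(np.abs(coords[:, None] - coords[None, :]), 6))
```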
Key results
Feature selection reduces variance but may increase bias. Removing irrelevant variables helps models generalize and improves speed. Removing weak but genuinely useful variables can hide structure. Selection should be validated on held-out data or with stability checks.
PCA maximizes retained variance. The first principal component is the unit vector $w$ that maximizes the projected variance $w^\top C w$. The solution is the top eigenvector of the covariance matrix $C$. The next components repeat the same optimization subject to orthogonality constraints.
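This optimization can be sketched on synthetic anisotropic data (the axis scales are my own choices): the top eigenvector of the covariance matrix aligns with the high-variance axis.

```python
import numpy as np

rng = np.random.default_rng(1)
# much more variance along the first axis than the second
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
X = X - X.mean(axis=0)            # center the data

C = X.T @ X / len(X)              # covariance matrix
vals, vecs = np.linalg.eigh(C)    # eigenvalues in ascending order
w = vecs[:, -1]                   # top eigenvector = first principal component
print(np.round(np.abs(w), 2))    # close to [1, 0], the high-variance axis
```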
Truncated SVD gives the best low-rank reconstruction. If $A = U \Sigma V^\top$, then the rank-$k$ truncation
$$A_k = U_k \Sigma_k V_k^\top$$
minimizes $\|A - B\|_F^2$ over all matrices $B$ with rank at most $k$ (the Eckart-Young theorem). This is why SVD is central in text mining and matrix approximation.
Reduction is not always supervised. PCA and SVD do not know the target label. They may keep high-variance directions that are irrelevant to classification and discard low-variance directions that separate classes. Supervised feature selection should be preferred when label prediction is the task and labels are reliable.
Graph embeddings use matrix structure. An adjacency matrix, Laplacian, or normalized Laplacian can be decomposed spectrally. The resulting eigenvectors provide coordinates that reflect graph connectivity, which can then feed clustering or classification.
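A minimal sketch with numpy (the six-node graph, two triangles joined by one bridge edge, is an invented example): the eigenvector of the Laplacian with the second-smallest eigenvalue, the Fiedler vector, separates the two communities by sign.

```python
import numpy as np

# two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3
A = np.zeros((6, 6))
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))
L = D - A                          # combinatorial Laplacian
vals, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
fiedler = vecs[:, 1]               # eigenvector of second-smallest eigenvalue
# signs split the graph into its two triangles
print(np.sign(fiedler))
```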
Dimensionality reduction should be fitted as part of the training pipeline. The reducer learns means, variances, singular vectors, selected features, or neighborhood structure from data. If it is fitted before a train-test split, information from the test set influences the representation and the validation score becomes optimistic. This is especially easy to miss with PCA, SVD, feature scaling, and mutual-information filters because they feel like harmless preprocessing. In predictive work, fit the selector or reducer only on the training fold, then apply the learned transformation to validation and test data.
Visual
| Method | Keeps original features? | Uses labels? | Strength | Main caution |
|---|---|---|---|---|
| Filter selection | Yes | Sometimes | Fast and scalable | Ignores feature interactions |
| Wrapper selection | Yes | Usually | Tuned to target model | Expensive and overfit-prone |
| PCA | No | No | Captures variance compactly | Components may be hard to interpret |
| SVD/LSA | No | No | Works well for sparse matrices | Requires choosing rank |
| Wavelets | No | No | Captures local signal changes | Best for ordered signals |
| MDS | No | No | Starts from distances only | Costly for many objects |
Worked example 1: PCA by hand on centered points
Problem. Reduce four two-dimensional points to one dimension: $(1, 0)$, $(-1, 0)$, $(0, 1)$, $(0, -1)$. Should we rotate them first, or is one dimension simply insufficient?
Method.
- The mean is $(0, 0)$, so the data are already centered.
- Build the covariance matrix. The $x$ values are $1, -1, 0, 0$ and the $y$ values are $0, 0, 1, -1$. Using $C = \tfrac{1}{n} X^\top X$,
$$C = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.5 \end{pmatrix}.$$
- The eigenvalues are both $0.5$. Every unit direction has the same variance.
- Projecting onto any one-dimensional direction keeps only half of the total variance, because the total variance is $0.5 + 0.5 = 1$ and one component retains $0.5$.
Checked answer. PCA has no unique preferred direction here. A one-dimensional embedding loses half the variance. The data form a symmetric cross, so reducing to one coordinate is not a strong representation.
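The hand calculation can be checked numerically, assuming the four points are the symmetric cross $(1, 0)$, $(-1, 0)$, $(0, 1)$, $(0, -1)$ consistent with the stated mean and eigenvalues:

```python
import numpy as np

# the four cross points; their mean is (0, 0), so no centering is needed
X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
C = X.T @ X / len(X)              # covariance matrix of centered data
vals = np.linalg.eigvalsh(C)
print(np.round(C, 2))
print(np.round(vals, 2))          # both eigenvalues equal 0.5
```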
Worked example 2: Feature selection with class separation
Problem. Choose between two features for a binary classifier.
| object | feature A | feature B | class |
|---|---|---|---|
| 1 | 0 | 10 | negative |
| 2 | 1 | 11 | negative |
| 3 | 5 | 10 | positive |
| 4 | 6 | 11 | positive |
Method.
- Compare class means for feature A: the negative mean is $(0 + 1)/2 = 0.5$ and the positive mean is $(5 + 6)/2 = 5.5$, a difference of $5$.
- Compare within-class spread for feature A: each class's values differ from its mean by only $0.5$, so within-class variation is small.
- Compare class means for feature B: the negative mean is $10.5$ and the positive mean is $10.5$, a difference of $0$.
- Feature B varies, but it does not separate the classes.
Checked answer. Feature A is useful for classification; feature B is not useful for this label, even though it has nonzero variance. This shows why unsupervised variance filters and supervised selection can disagree.
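The class-separation scores can be verified directly from the table values:

```python
import numpy as np

A = np.array([0.0, 1.0, 5.0, 6.0])      # feature A for objects 1-4
B = np.array([10.0, 11.0, 10.0, 11.0])  # feature B for objects 1-4
y = np.array([0, 0, 1, 1])              # 0 = negative, 1 = positive

gaps = {}
for name, f in [("A", A), ("B", B)]:
    # difference of class means, a simple supervised separation score
    gaps[name] = float(f[y == 1].mean() - f[y == 0].mean())
print(gaps)   # {'A': 5.0, 'B': 0.0}
```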
Code
Pseudocode for a conservative reduction workflow:
```
INPUT: feature matrix X, optional labels y, target dimension k
OUTPUT: reduced representation Z

split data into training and validation parts
if labels y are available and prediction is the goal:
    score features using training labels only
    keep a candidate subset
if features are still high-dimensional:
    fit PCA, SVD, or another reducer on training data only
transform training and validation data
evaluate downstream mining task
check stability of selected features or components
return fitted reducer and transformed data
```
```python
from functools import partial

import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array(
    [
        [0, 10, 3],
        [1, 11, 2],
        [5, 10, 3],
        [6, 11, 2],
        [5, 12, 4],
        [0, 12, 1],
    ],
    dtype=float,
)
y = np.array([0, 0, 1, 1, 1, 0])

# supervised filter followed by an unsupervised projection, fitted as one
# pipeline so everything can later be fit on training folds only;
# fixing random_state makes the mutual-information estimate reproducible
selector = SelectKBest(partial(mutual_info_classif, random_state=0), k=2)
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("select", selector),
        ("pca", PCA(n_components=1)),
    ]
)
z = pipeline.fit_transform(X, y)

# the selector inside the pipeline is already fitted on the scaled data;
# do not refit it on raw X, which would score different inputs
print("selected feature indices:", selector.get_support(indices=True))
print("one-dimensional coordinates:", np.round(z.ravel(), 3))

# LSA: truncated SVD of a small term-document matrix
term_doc = np.array([[3, 0, 1], [0, 4, 0], [1, 0, 2]], dtype=float)
lsa = TruncatedSVD(n_components=2, random_state=0)
print(np.round(lsa.fit_transform(term_doc), 3))
```
Common pitfalls
- Selecting features on the full data set before train-test splitting.
- Assuming high variance means high predictive value.
- Keeping too many principal components because the scree plot looks smooth; use downstream validation.
- Interpreting PCA components as original features without checking loadings.
- Applying PCA to unscaled features when units differ.
- Using SVD embeddings from a term-document matrix without controlling common stop words and document length effects.
- Forgetting that nonlinear structures may need graph, manifold, or kernel methods rather than a global linear projection.