Dimensionality Reduction with PCA
Why this module exists
High-dimensional data is hard to visualize, slow to process, and prone to overfitting. Principal Component Analysis (PCA) finds the most important directions in data, letting you reduce dimensions while preserving maximum information.
1) The curse of dimensionality
As dimensions increase:
- Data becomes sparse (points are far apart)
- Distance metrics become less meaningful
- More data needed to cover the space
- Models overfit more easily
PCA helps by reducing to the dimensions that matter.
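To see the distance problem directly, here is a quick sketch (toy uniform data, not part of the lesson's code) that compares the nearest and farthest pairwise distances as the number of dimensions grows:

import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    X = rng.random((100, d))                     # 100 random points in d dimensions
    diffs = X[:, None, :] - X[None, :, :]        # pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise Euclidean distances
    dists = dists[np.triu_indices(100, k=1)]     # keep each pair once
    print(d, dists.min() / dists.max())          # ratio creeps toward 1 as d grows

As the ratio of nearest to farthest distance approaches 1, "close" and "far" stop meaning much, which is exactly why distance-based methods struggle in high dimensions.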
2) What is PCA?
PCA finds new axes (principal components) that:
- Capture maximum variance in the data
- Are orthogonal (perpendicular) to each other
- Are ordered by importance (first PC captures most variance)
You then keep only the top K components.
3) The intuition
Imagine a 3D cloud of points that's actually flat (like a tilted frisbee):
- The frisbee lives in 3D, but its "true" shape is 2D
- PCA finds the 2D plane that best fits the cloud
- You can describe points with 2 coordinates instead of 3
PCA finds the "natural" axes of your data.
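To make the frisbee picture concrete, here is a small sketch with made-up data: a nearly flat cloud is built in 3D, tilted by a random rotation, and the eigenvalues of its covariance show that only two directions carry real variance.

import numpy as np

rng = np.random.default_rng(1)

# A flat disc with a little thickness (noise) along the third axis
flat = np.column_stack([rng.normal(0, 3.0, 500),    # wide direction
                        rng.normal(0, 1.0, 500),    # narrower direction
                        rng.normal(0, 0.05, 500)])  # almost no thickness

# Tilt the disc with a random rotation so it is not axis-aligned
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))        # random orthogonal matrix
X = flat @ Q.T

cov = np.cov(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]
print(eigenvalues)   # two large values, one near zero: the cloud is essentially 2D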
4) The algorithm
import numpy as np

def pca(X, n_components):
    # 1. Center the data
    mean = X.mean(axis=0)
    X_centered = X - mean
    # 2. Compute covariance matrix (n_features x n_features)
    cov = X_centered.T @ X_centered / (len(X) - 1)
    # 3. Eigendecomposition (eigh: the covariance matrix is symmetric,
    #    so eigenvalues and eigenvectors are real)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort by eigenvalue (descending)
    idx = eigenvalues.argsort()[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    # 5. Select top components
    components = eigenvectors[:, :n_components]
    # 6. Project data
    X_transformed = X_centered @ components
    return X_transformed, components, eigenvalues
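A quick smoke test of the function on random data, just to confirm the shapes (the numbers themselves are arbitrary):

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))   # 100 samples, 5 features

X_transformed, components, eigenvalues = pca(X, n_components=2)
print(X_transformed.shape)   # (100, 2)
print(components.shape)      # (5, 2)
print(eigenvalues.shape)     # (5,) -- all eigenvalues, sorted descending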
5) Step 1: Center the data
Subtract the mean from each feature:
X_centered = X - X.mean(axis=0)
This moves the data to be centered at the origin. Centering is essential—PCA finds directions through the origin.
6) Step 2: Covariance matrix
The covariance matrix captures how features vary together:
# (n_features × n_features) matrix
cov = X_centered.T @ X_centered / (n_samples - 1)
- Diagonal: variance of each feature
- Off-diagonal: covariance between features
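As a sanity check, the explicit formula should agree with NumPy's built-in np.cov (a small sketch on made-up data; np.cov centers the data internally):

import numpy as np

X = np.random.default_rng(0).normal(size=(200, 4))
X_centered = X - X.mean(axis=0)

cov_manual = X_centered.T @ X_centered / (len(X) - 1)   # (4, 4)
cov_numpy = np.cov(X, rowvar=False)                     # same normalization (n - 1)

print(np.allclose(cov_manual, cov_numpy))   # True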
7) Step 3: Eigendecomposition
Find eigenvectors and eigenvalues of the covariance matrix:
Cov × v = λ × v
- Eigenvector v: a direction in feature space
- Eigenvalue λ: variance along that direction
Eigenvectors are the principal components. Eigenvalues tell you how much variance each captures.
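A minimal sketch, on toy correlated data, showing that an eigenvalue really is the variance of the data projected onto its eigenvector (numpy.linalg.eigh is used because the covariance matrix is symmetric):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])   # correlated features
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (len(X) - 1)

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order

v = eigenvectors[:, -1]                           # direction with the largest eigenvalue
projected = X_centered @ v                        # 1D coordinates along that direction
print(eigenvalues[-1], projected.var(ddof=1))     # the two numbers agree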
8) Step 4: Sort and select
Sort eigenvectors by eigenvalue (largest first):
# Most important component first
PC1: eigenvalue = 5.2 (captures most variance)
PC2: eigenvalue = 2.1
PC3: eigenvalue = 0.3 (captures little)
Keep top K components based on:
- Fixed K (e.g., K=2 for visualization)
- Variance threshold (keep 95% of variance)
9) Explained variance
How much variance does each component explain?
total_variance = sum(eigenvalues)
explained_ratio = eigenvalues / total_variance
# Example:
# PC1: 70%, PC2: 20%, PC3: 5%, PC4: 3%, ...
# Keeping PC1+PC2 retains 90% of variance
Plot cumulative explained variance to choose K.
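In code, choosing K from a variance threshold looks like this (a sketch with made-up eigenvalues; with real data you would use the sorted eigenvalues returned by pca above):

import numpy as np

# Toy eigenvalues, already sorted in descending order
eigenvalues = np.array([5.2, 2.1, 0.9, 0.4, 0.3, 0.1])

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest K whose cumulative explained variance reaches 95%
k = int(np.argmax(cumulative >= 0.95)) + 1
print(cumulative.round(3))   # [0.578 0.811 0.911 0.956 0.989 1.   ]
print(k)                     # 4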
10) Step 5: Project data
Transform original data to new coordinates:
# components is (n_features × n_components)
X_transformed = X_centered @ components
# Result: (n_samples × n_components)
Each sample is now represented in the lower-dimensional space.
11) Reconstruction
You can approximate the original data from the reduced representation:
X_reconstructed = X_transformed @ components.T + mean
Reconstruction error shows how much information was lost.
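A small sketch of measuring that loss as mean squared reconstruction error, reusing the pca function defined above on toy data:

import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))
X_transformed, components, eigenvalues = pca(X, n_components=2)

mean = X.mean(axis=0)                                  # same mean pca() subtracted internally
X_reconstructed = X_transformed @ components.T + mean

mse = np.mean((X - X_reconstructed) ** 2)              # average squared error per entry
print(mse)   # 0.0 when all components are kept; grows as components are dropped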
12) When to use PCA
Good uses:
- Visualization (reduce to 2D/3D)
- Noise reduction (drop low-variance components)
- Speed up training (fewer features)
- Decorrelate features
- Preprocessing for other algorithms
Limitations:
- Only linear relationships
- Hard to interpret components
- Sensitive to scaling (standardize first!)
13) Standardization before PCA
Standardize features before PCA whenever they are on different scales (which is almost always):
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
Otherwise, high-variance features (different units) dominate.
Example: If one feature is in millions and another in decimals, PCA will focus on the millions.
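A quick sketch of that effect with made-up data: two equally informative features, one of them rescaled into the millions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
X[:, 0] *= 1_000_000                       # same information, much bigger units

def top_component_share(data):
    # Fraction of total variance captured by the first principal component
    cov = np.cov(data, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(cov)
    return eigenvalues.max() / eigenvalues.sum()

print(top_component_share(X))                                      # ~1.0: PC1 is just the big feature
print(top_component_share((X - X.mean(axis=0)) / X.std(axis=0)))   # ~0.5: both features matter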
14) PCA vs other methods
| Method | Type | Preserves |
|---|---|---|
| PCA | Linear | Variance |
| t-SNE | Non-linear | Local structure |
| UMAP | Non-linear | Local + global |
| Autoencoders | Neural network | Learned representation |
PCA is simple and fast; use t-SNE/UMAP for complex non-linear visualization.
15) Practical example
# Original data: 100 samples × 50 features
X_original = load_data() # shape: (100, 50)
# Standardize
X_std = standardize(X_original)
# PCA to 10 components
X_reduced, components, eigenvalues = pca(X_std, n_components=10)
# X_reduced shape: (100, 10)
# Check variance retained
print(sum(eigenvalues[:10]) / sum(eigenvalues)) # e.g., 0.95 = 95%
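In practice you would rarely hand-roll this. scikit-learn's PCA covers the same steps (a sketch assuming scikit-learn is installed and X_std is the standardized matrix from the example above):

from sklearn.decomposition import PCA

model = PCA(n_components=10)
X_reduced = model.fit_transform(X_std)          # shape: (100, 10)
print(model.explained_variance_ratio_.sum())    # variance retained, e.g. ~0.95

Note that scikit-learn's PCA centers the data for you but does not standardize it, so scaling beforehand is still your responsibility.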
Key takeaways
- PCA finds directions of maximum variance
- Components are orthogonal, ordered by importance
- Center (and usually standardize) data first
- Eigenvalues tell you variance captured
- Keep enough components for desired variance (e.g., 95%)
- Use for visualization, noise reduction, speedup
- Only captures linear relationships