Dimensionality Reduction with PCA

Lesson, slides, and applied problem sets.


Lesson

Dimensionality Reduction with PCA

Why this module exists

High-dimensional data is hard to visualize, slow to process, and prone to overfitting. Principal Component Analysis (PCA) finds the most important directions in data, letting you reduce dimensions while preserving maximum information.


1) The curse of dimensionality

As dimensions increase:

  • Data becomes sparse (points are far apart)
  • Distance metrics become less meaningful
  • More data needed to cover the space
  • Models overfit more easily

PCA helps by reducing to the dimensions that matter.
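
A quick way to see the first two effects numerically: sample random points in the unit cube and compare the nearest and farthest pairwise distances as the dimension grows. This is a minimal sketch assuming numpy and scipy are available; the exact numbers vary with the random seed.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(200, d))  # 200 random points in the d-dimensional unit cube
    dists = pdist(X)                # all pairwise Euclidean distances
    # As d grows, the ratio approaches 1: "near" and "far" become hard to distinguish
    print(d, round(dists.min() / dists.max(), 3))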


2) What is PCA?

PCA finds new axes (principal components) that:

  1. Capture maximum variance in the data
  2. Are orthogonal (perpendicular) to each other
  3. Are ordered by importance (first PC captures most variance)

You then keep only the top K components.
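
These three properties are easy to verify empirically. The sketch below is one way to do it, assuming scikit-learn and numpy are installed: it checks that the components are orthonormal and that explained variance is sorted from largest to smallest.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

p = PCA().fit(X)

# Components form an orthonormal set: components_ @ components_.T is the identity
print(np.allclose(p.components_ @ p.components_.T, np.eye(5)))

# Explained variance is ordered from largest to smallest
print(np.all(np.diff(p.explained_variance_) <= 0))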


3) The intuition

Imagine a 3D cloud of points that's actually flat (like a tilted frisbee):

  • The frisbee lives in 3D, but its "true" shape is 2D
  • PCA finds the 2D plane that best fits the cloud
  • You can describe points with 2 coordinates instead of 3

PCA finds the "natural" axes of your data.
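
A small numerical version of this picture, assuming numpy: build a tilted, slightly noisy disc in 3D and check that almost all of the variance lies along two directions.

import numpy as np

rng = np.random.default_rng(0)

# Points on a flat disc in the xy-plane
theta = rng.uniform(0, 2 * np.pi, 500)
r = np.sqrt(rng.uniform(0, 1, 500))
disc = np.column_stack([r * np.cos(theta), r * np.sin(theta), np.zeros(500)])

# Tilt the disc and add a little out-of-plane noise
angle = np.pi / 6
rot = np.array([[1, 0, 0],
                [0, np.cos(angle), -np.sin(angle)],
                [0, np.sin(angle), np.cos(angle)]])
X = disc @ rot.T + rng.normal(scale=0.01, size=(500, 3))

# Variance along each principal direction = eigenvalues of the covariance matrix
X_centered = X - X.mean(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))[::-1]
print(eigenvalues / eigenvalues.sum())  # roughly [0.5, 0.5, ~0]: the cloud is essentially 2D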


4) The algorithm

import numpy as np

def pca(X, n_components):
    # 1. Center the data
    mean = X.mean(axis=0)
    X_centered = X - mean

    # 2. Compute covariance matrix (n_features × n_features)
    cov = X_centered.T @ X_centered / (len(X) - 1)

    # 3. Eigendecomposition (eigh, since the covariance matrix is symmetric)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort by eigenvalue (descending)
    idx = eigenvalues.argsort()[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]

    # 5. Select top components (one principal component per column)
    components = eigenvectors[:, :n_components]

    # 6. Project data onto the components
    X_transformed = X_centered @ components

    return X_transformed, components, eigenvalues

5) Step 1: Center the data

Subtract the mean from each feature:

X_centered = X - X.mean(axis=0)

This moves the data to be centered at the origin. Centering is essential because PCA finds directions that pass through the origin.


6) Step 2: Covariance matrix

The covariance matrix captures how features vary together:

# (n_features × n_features) matrix
cov = X_centered.T @ X_centered / (n_samples - 1)

  • Diagonal: variance of each feature
  • Off-diagonal: covariance between features
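
As a sanity check, the hand-computed matrix should match numpy's built-in covariance routine. A minimal sketch, assuming numpy:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X_centered = X - X.mean(axis=0)

cov_manual = X_centered.T @ X_centered / (len(X) - 1)
cov_numpy = np.cov(X, rowvar=False)  # rows are samples, columns are features

print(np.allclose(cov_manual, cov_numpy))  # True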


7) Step 3: Eigendecomposition

Find eigenvectors and eigenvalues of the covariance matrix:

Cov × v = λ × v

  • Eigenvector v: a direction in feature space
  • Eigenvalue λ: variance along that direction

Eigenvectors are the principal components. Eigenvalues tell you how much variance each captures.
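
A short check of both claims, assuming numpy (np.linalg.eigh is used because the covariance matrix is symmetric):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order
v, lam = eigenvectors[:, -1], eigenvalues[-1]    # largest eigenpair

# The defining equation: Cov @ v = lambda * v
print(np.allclose(cov @ v, lam * v))  # True

# The eigenvalue equals the variance of the data projected onto v
print(np.isclose((X_centered @ v).var(ddof=1), lam))  # True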


8) Step 4: Sort and select

Sort eigenvectors by eigenvalue (largest first):

# Most important component first
PC1: eigenvalue = 5.2 (captures most variance)
PC2: eigenvalue = 2.1
PC3: eigenvalue = 0.3 (captures little)

Keep top K components based on:

  • Fixed K (e.g., K=2 for visualization)
  • Variance threshold (keep 95% of variance)

9) Explained variance

How much variance does each component explain?

total_variance = sum(eigenvalues)
explained_ratio = eigenvalues / total_variance

# Example:
# PC1: 70%, PC2: 20%, PC3: 5%, PC4: 3%, ...
# Keeping PC1+PC2 retains 90% of variance

Plot cumulative explained variance to choose K.
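
A minimal sketch of choosing K this way, assuming eigenvalues is the descending-sorted array returned by the pca() function above:

import numpy as np

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest K that retains at least 95% of the variance
K = int(np.searchsorted(cumulative, 0.95) + 1)
print(K, cumulative[K - 1])

# Optional: visualize the curve to spot an "elbow"
# import matplotlib.pyplot as plt
# plt.plot(np.arange(1, len(cumulative) + 1), cumulative, marker="o"); plt.show()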


10) Step 5: Project data

Transform original data to new coordinates:

# components is (n_features × n_components)
X_transformed = X_centered @ components

# Result: (n_samples × n_components)

Each sample is now represented in the lower-dimensional space.


11) Reconstruction

You can approximate original data from reduced representation:

X_reconstructed = X_transformed @ components.T + mean

Reconstruction error shows how much information was lost.
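
A short sketch of measuring that loss, reusing the pca() function from section 4 on synthetic data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Reduce to 2 components, then map back to the original 5-dimensional space
X_transformed, components, eigenvalues = pca(X, n_components=2)
mean = X.mean(axis=0)
X_reconstructed = X_transformed @ components.T + mean

# Mean squared reconstruction error; it equals the variance held by the dropped components
mse = np.mean((X - X_reconstructed) ** 2)
print(mse, eigenvalues[2:].sum() * (len(X) - 1) / X.size)  # the two numbers match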


12) When to use PCA

Good uses:

  • Visualization (reduce to 2D/3D)
  • Noise reduction (drop low-variance components)
  • Speed up training (fewer features)
  • Decorrelate features
  • Preprocessing for other algorithms

Limitations:

  • Only linear relationships
  • Hard to interpret components
  • Sensitive to scaling (standardize first!)

13) Standardization before PCA

Standardize features before PCA whenever they are measured on different scales (in practice, almost always):

X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

Otherwise, high-variance features (different units) dominate.

Example: If one feature is in millions and another in decimals, PCA will focus on the millions.
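
A common way to handle this in practice is to put scaling and PCA in a pipeline. The sketch below assumes scikit-learn is installed and uses synthetic data in which one feature is on a much larger scale:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(scale=1e6, size=500),  # a feature measured in millions
    rng.normal(scale=0.1, size=500),  # features measured in decimals
    rng.normal(scale=0.1, size=500),
])

# Without scaling, PC1 is essentially just the "millions" feature
print(PCA().fit(X).explained_variance_ratio_)

# With scaling, variance is shared across components
pipeline = make_pipeline(StandardScaler(), PCA()).fit(X)
print(pipeline.named_steps["pca"].explained_variance_ratio_)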


14) PCA vs other methods

Method         Type             Preserves
------         ----             ---------
PCA            Linear           Variance
t-SNE          Non-linear       Local structure
UMAP           Non-linear       Local + global
Autoencoders   Neural network   Learned representation

PCA is simple and fast; use t-SNE/UMAP for complex non-linear visualization.


15) Practical example

# Original data: 100 samples × 50 features
X_original = load_data()  # shape: (100, 50)

# Standardize
X_std = standardize(X_original)

# PCA to 10 components
X_reduced, components, eigenvalues = pca(X_std, n_components=10)
# X_reduced shape: (100, 10)

# Check variance retained
print(sum(eigenvalues[:10]) / sum(eigenvalues))  # e.g., 0.95 = 95%
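
An optional cross-check against scikit-learn's PCA (assuming it is installed). Components can differ by sign, so the projected data is compared in absolute value:

import numpy as np
from sklearn.decomposition import PCA

sk = PCA(n_components=10).fit(X_std)

# Eigenvalue ratios should agree with sklearn's explained variance ratios
print(np.allclose(sk.explained_variance_ratio_, eigenvalues[:10] / eigenvalues.sum()))

# The projected data should agree too, up to the sign of each component
X_sklearn = sk.transform(X_std)
print(np.allclose(np.abs(X_sklearn), np.abs(X_reduced)))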

Key takeaways

  • PCA finds directions of maximum variance
  • Components are orthogonal, ordered by importance
  • Center (and usually standardize) data first
  • Eigenvalues tell you variance captured
  • Keep enough components for desired variance (e.g., 95%)
  • Use for visualization, noise reduction, speedup
  • Only captures linear relationships
