Dimensionality Reduction with PCA
Why this module exists
High-dimensional data is hard to visualize, slow to process, and prone to overfitting. Principal Component Analysis (PCA) finds the most important directions in data, letting you reduce dimensions while preserving maximum information.
1) The curse of dimensionality
As dimensions increase:
- Data becomes sparse (points are far apart)
- Distance metrics become less meaningful
- More data needed to cover the space
- Models overfit more easily
PCA helps by reducing to the dimensions that matter.
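To see the distance problem directly, here is a quick sketch (toy uniform data, not part of the lesson's code) that compares the nearest and farthest pairwise distances as the number of dimensions grows:

import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    X = rng.random((100, d))                     # 100 random points in d dimensions
    diffs = X[:, None, :] - X[None, :, :]        # pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise Euclidean distances
    dists = dists[np.triu_indices(100, k=1)]     # keep each pair once
    print(d, dists.min() / dists.max())          # ratio creeps toward 1 as d grows

As the ratio of nearest to farthest distance approaches 1, "close" and "far" stop meaning much, which is exactly why distance-based methods struggle in high dimensions.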
2) What is PCA?
PCA finds new axes (principal components) that:
- Capture maximum variance in the data
- Are orthogonal (perpendicular) to each other
- Are ordered by importance (first PC captures most variance)
You then keep only the top K components.
3) The intuition
Imagine a 3D cloud of points that's actually flat (like a tilted frisbee):
- The frisbee lives in 3D, but its "true" shape is 2D
- PCA finds the 2D plane that best fits the cloud
- You can describe points with 2 coordinates instead of 3
PCA finds the "natural" axes of your data.
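To make the frisbee picture concrete, here is a small sketch with made-up data: a nearly flat cloud is built in 3D, tilted by a random rotation, and the eigenvalues of its covariance show that only two directions carry real variance.

import numpy as np

rng = np.random.default_rng(1)

# A flat disc with a little thickness (noise) along the third axis
flat = np.column_stack([rng.normal(0, 3.0, 500),    # wide direction
                        rng.normal(0, 1.0, 500),    # narrower direction
                        rng.normal(0, 0.05, 500)])  # almost no thickness

# Tilt the disc with a random rotation so it is not axis-aligned
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))        # random orthogonal matrix
X = flat @ Q.T

cov = np.cov(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(cov))[::-1]
print(eigenvalues)   # two large values, one near zero: the cloud is essentially 2D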
4) The algorithm
import numpy as np

def pca(X, n_components):
    # 1. Center the data
    mean = X.mean(axis=0)
    X_centered = X - mean
    # 2. Compute covariance matrix (n_features x n_features)
    cov = X_centered.T @ X_centered / (len(X) - 1)
    # 3. Eigendecomposition (eigh: the covariance matrix is symmetric,
    #    so eigenvalues and eigenvectors are real)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort by eigenvalue (descending)
    idx = eigenvalues.argsort()[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    # 5. Select top components
    components = eigenvectors[:, :n_components]
    # 6. Project data
    X_transformed = X_centered @ components
    return X_transformed, components, eigenvalues
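A quick smoke test of the function on random data, just to confirm the shapes (the numbers themselves are arbitrary):

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))   # 100 samples, 5 features

X_transformed, components, eigenvalues = pca(X, n_components=2)
print(X_transformed.shape)   # (100, 2)
print(components.shape)      # (5, 2)
print(eigenvalues.shape)     # (5,) -- all eigenvalues, sorted descending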
5) Step 1: Center the data
Subtract the mean from each feature:
X_centered = X - X.mean(axis=0)
This moves the data to be centered at the origin. Centering is essential—PCA finds directions through the origin.
6) Step 2: Covariance matrix
The covariance matrix captures how features vary together:
# (n_features × n_features) matrix
cov = X_centered.T @ X_centered / (n_samples - 1)
- Diagonal: variance of each feature
- Off-diagonal: covariance between features
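As a sanity check, the explicit formula should agree with NumPy's built-in np.cov (a small sketch on made-up data; np.cov centers the data internally):

import numpy as np

X = np.random.default_rng(0).normal(size=(200, 4))
X_centered = X - X.mean(axis=0)

cov_manual = X_centered.T @ X_centered / (len(X) - 1)   # (4, 4)
cov_numpy = np.cov(X, rowvar=False)                     # same normalization (n - 1)

print(np.allclose(cov_manual, cov_numpy))   # True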
7) Step 3: Eigendecomposition
Find eigenvectors and eigenvalues of the covariance matrix:
Cov × v = λ × v
- Eigenvector v: a direction in feature space
- Eigenvalue λ: variance along that direction
Eigenvectors are the principal components. Eigenvalues tell you how much variance each captures.
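A minimal sketch, on toy correlated data, showing that an eigenvalue really is the variance of the data projected onto its eigenvector (numpy.linalg.eigh is used because the covariance matrix is symmetric):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])   # correlated features
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (len(X) - 1)

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order

v = eigenvectors[:, -1]                           # direction with the largest eigenvalue
projected = X_centered @ v                        # 1D coordinates along that direction
print(eigenvalues[-1], projected.var(ddof=1))     # the two numbers agree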
8) Step 4: Sort and select
Sort eigenvectors by eigenvalue (largest first):
# Most important component first
PC1: eigenvalue = 5.2 (captures most variance)
PC2: eigenvalue = 2.1
PC3: eigenvalue = 0.3 (captures little)
Keep top K components based on:
- Fixed K (e.g., K=2 for visualization)
- Variance threshold (keep 95% of variance)
9) Explained variance
How much variance does each component explain?
total_variance = sum(eigenvalues)
explained_ratio = eigenvalues / total_variance
# Example:
# PC1: 70%, PC2: 20%, PC3: 5%, PC4: 3%, ...
# Keeping PC1+PC2 retains 90% of variance
Plot cumulative explained variance to choose K.
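In code, choosing K from a variance threshold looks like this (a sketch with made-up eigenvalues; with real data you would use the sorted eigenvalues returned by pca above):

import numpy as np

# Toy eigenvalues, already sorted in descending order
eigenvalues = np.array([5.2, 2.1, 0.9, 0.4, 0.3, 0.1])

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest K whose cumulative explained variance reaches 95%
k = int(np.argmax(cumulative >= 0.95)) + 1
print(cumulative.round(3))   # [0.578 0.811 0.911 0.956 0.989 1.   ]
print(k)                     # 4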
10) Step 5: Project data
Transform original data to new coordinates:
# components is (n_features × n_components)
X_transformed = X_centered @ components
# Result: (n_samples × n_components)
Each sample is now represented in the lower-dimensional space.
11) Reconstruction
You can approximate the original data from the reduced representation:
X_reconstructed = X_transformed @ components.T + mean
Reconstruction error shows how much information was lost.
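A small sketch of measuring that loss as mean squared reconstruction error, reusing the pca function defined above on toy data:

import numpy as np

X = np.random.default_rng(0).normal(size=(100, 5))
X_transformed, components, eigenvalues = pca(X, n_components=2)

mean = X.mean(axis=0)                                  # same mean pca() subtracted internally
X_reconstructed = X_transformed @ components.T + mean

mse = np.mean((X - X_reconstructed) ** 2)              # average squared error per entry
print(mse)   # 0.0 when all components are kept; grows as components are dropped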
12) When to use PCA
Good uses:
- Visualization (reduce to 2D/3D)
- Noise reduction (drop low-variance components)
- Speed up training (fewer features)
- Decorrelate features
- Preprocessing for other algorithms
Limitations:
- Only linear relationships
- Hard to interpret components
- Sensitive to scaling (standardize first!)
13) Standardization before PCA
Standardize features before PCA whenever they are on different scales (which is almost always):
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
Otherwise, high-variance features (different units) dominate.
Example: If one feature is in millions and another in decimals, PCA will focus on the millions.
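A quick sketch of that effect with made-up data: two equally informative features, one of them rescaled into the millions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
X[:, 0] *= 1_000_000                       # same information, much bigger units

def top_component_share(data):
    # Fraction of total variance captured by the first principal component
    cov = np.cov(data, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(cov)
    return eigenvalues.max() / eigenvalues.sum()

print(top_component_share(X))                                      # ~1.0: PC1 is just the big feature
print(top_component_share((X - X.mean(axis=0)) / X.std(axis=0)))   # ~0.5: both features matter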
14) PCA vs other methods
| Method | Type | Preserves |
|---|---|---|
| PCA | Linear | Variance |
| t-SNE | Non-linear | Local structure |
| UMAP | Non-linear | Local + global |
| Autoencoders | Neural network | Learned representation |
PCA is simple and fast; use t-SNE/UMAP for complex non-linear visualization.
15) Practical example
# Original data: 100 samples × 50 features
X_original = load_data() # shape: (100, 50)
# Standardize
X_std = standardize(X_original)
# PCA to 10 components
X_reduced, components, eigenvalues = pca(X_std, n_components=10)
# X_reduced shape: (100, 10)
# Check variance retained
print(sum(eigenvalues[:10]) / sum(eigenvalues)) # e.g., 0.95 = 95%
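In practice you would rarely hand-roll this. scikit-learn's PCA covers the same steps (a sketch assuming scikit-learn is installed and X_std is the standardized matrix from the example above):

from sklearn.decomposition import PCA

model = PCA(n_components=10)
X_reduced = model.fit_transform(X_std)          # shape: (100, 10)
print(model.explained_variance_ratio_.sum())    # variance retained, e.g. ~0.95

Note that scikit-learn's PCA centers the data for you but does not standardize it, so scaling beforehand is still your responsibility.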
Key takeaways
- PCA finds directions of maximum variance
- Components are orthogonal, ordered by importance
- Center (and usually standardize) data first
- Eigenvalues tell you variance captured
- Keep enough components for desired variance (e.g., 95%)
- Use for visualization, noise reduction, speedup
- Only captures linear relationships