Regularization and Overfitting
Lesson, slides, and applied problem sets.
Why this module exists
A model that perfectly fits training data often fails on new data—it has memorized rather than learned. Regularization techniques prevent this overfitting, leading to models that generalize well.
1) Overfitting and underfitting
Underfitting (high bias):
- Model is too simple
- Poor performance on both train and test
- Example: linear model for curved data
Overfitting (high variance):
- Model is too complex
- Great on train, poor on test
- Memorizes noise instead of learning patterns
Good fit:
- Captures the underlying pattern
- Similar performance on train and test
2) The bias-variance tradeoff
Bias: error from wrong assumptions (model too simple)
Variance: error from sensitivity to the training data (model too complex)
Total Error = Bias² + Variance + Irreducible Noise
Increasing model complexity:
- Decreases bias (fits data better)
- Increases variance (more sensitive to training set)
The goal is to find the sweet spot.
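A self-contained numpy sketch (synthetic sine data, an illustration rather than lesson code) makes the tradeoff visible by fitting polynomials of increasing degree and comparing train and validation error:

import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)
x_train = np.linspace(0, 1, 20)
x_val = np.linspace(0, 1, 50)
y_train = true_fn(x_train) + rng.normal(0, 0.2, x_train.shape)
y_val = true_fn(x_val) + rng.normal(0, 0.2, x_val.shape)

for degree in [1, 3, 9]:
    coeffs = np.polyfit(x_train, y_train, degree)  # fit on training data only
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree={degree}  train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
# Degree 1 underfits (both errors high); degree 9 overfits
# (training error keeps shrinking while validation error grows).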
3) Detecting overfitting
Compare training and validation metrics:
train_loss = model.evaluate(train_data)
val_loss = model.evaluate(val_data)

# A validation loss much larger than the training loss signals overfitting
# (the 1.5x threshold is an arbitrary illustration, not a fixed rule).
if val_loss > 1.5 * train_loss:
    print("Overfitting detected!")
Plot learning curves:
- Training loss decreases
- Validation loss decreases, then increases ← overfitting starts here
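A sketch of recording and plotting these curves, reusing the lesson's placeholder train_one_epoch() and evaluate() helpers (they also appear in the early-stopping snippet later) and assuming matplotlib is available:

import matplotlib.pyplot as plt

max_epochs = 50
train_losses, val_losses = [], []
for epoch in range(max_epochs):
    train_losses.append(train_one_epoch())  # lesson's placeholder helper
    val_losses.append(evaluate())           # lesson's placeholder helper

plt.plot(train_losses, label="train loss")
plt.plot(val_losses, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
# The epoch where the validation curve turns upward while the training
# curve keeps falling is where overfitting begins.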
4) Regularization idea
Add a penalty to the loss function for model complexity:
Total Loss = Data Loss + λ × Complexity Penalty
Where λ controls regularization strength:
- λ = 0: No regularization
- λ large: Strong regularization (simpler model)
5) L2 regularization (Ridge)
Penalize the sum of squared weights:
def l2_penalty(weights):
    return sum(w ** 2 for w in weights)

total_loss = data_loss + lambda_ * l2_penalty(weights)
Effects:
- Shrinks weights toward zero
- Keeps all features, just with smaller coefficients
- Smooth penalty (differentiable everywhere)
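For linear regression the L2 penalty even admits a closed-form solution, w = (XᵀX + λI)⁻¹Xᵀy. A numpy sketch (synthetic data, hypothetical ridge_fit helper) shows the weights shrinking, but never reaching exactly zero, as lambda_ grows:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(0, 0.1, size=100)

def ridge_fit(X, y, lambda_):
    n_features = X.shape[1]
    # Larger lambda_ adds more to the diagonal, pulling the solution toward zero.
    return np.linalg.solve(X.T @ X + lambda_ * np.eye(n_features), X.T @ y)

for lambda_ in [0.0, 1.0, 100.0]:
    w = ridge_fit(X, y, lambda_)
    print(lambda_, np.round(w, 3), "||w|| =", round(float(np.linalg.norm(w)), 3))
# The weight norm decreases as lambda_ increases, but no coefficient becomes exactly zero.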
6) L1 regularization (Lasso)
Penalize the sum of absolute weights:
def l1_penalty(weights):
    return sum(abs(w) for w in weights)

total_loss = data_loss + lambda_ * l1_penalty(weights)
Effects:
- Drives some weights exactly to zero
- Automatic feature selection
- Sparse solutions
- Non-smooth at zero (subgradient needed)
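Because the penalty is non-smooth at zero, L1 solvers often use a proximal (soft-thresholding) step instead of a plain gradient step. A short sketch for intuition (not part of the lesson's code):

import numpy as np

def soft_threshold(w, threshold):
    # Shrink every weight toward zero; weights smaller than the threshold become exactly zero.
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

weights = np.array([0.8, -0.05, 0.3, -0.6, 0.02])
print(soft_threshold(weights, threshold=0.1))
# Large weights shrink by 0.1; the two small weights are zeroed out, which is where sparsity comes from.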
7) L1 vs L2 comparison
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Sparsity | Yes (zeros) | No (small but nonzero) |
| Feature selection | Yes | No |
| Multiple correlated features | Picks one | Shares weight |
| Computation | Harder (no closed form; needs iterative solvers) | Easier (closed form for linear models) |
Elastic Net: Combine both L1 and L2
penalty = α × L1 + (1 - α) × L2
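Following the formula above, a minimal sketch of the combined penalty in the same style as the earlier snippets (alpha is the mixing weight; alpha = 1 recovers pure L1, alpha = 0 pure L2):

def elastic_net_penalty(weights, alpha=0.5):
    l1 = sum(abs(w) for w in weights)   # sparsity-inducing part
    l2 = sum(w ** 2 for w in weights)   # shrinkage part
    return alpha * l1 + (1 - alpha) * l2

total_loss = data_loss + lambda_ * elastic_net_penalty(weights, alpha=0.5)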
8) Gradient with regularization
L2 regularization gradient:
gradient = data_gradient + 2 * lambda_ * weights
Update becomes:
weights = weights - lr * (data_gradient + 2 * lambda_ * weights)
# which rearranges to:
# weights = weights * (1 - 2 * lr * lambda_) - lr * data_gradient
The (1 - 2 lr λ) factor shrinks weights each step—called "weight decay."
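A tiny numeric check (made-up numbers, with the data gradient set to zero to isolate the decay term) shows the shrinkage factor at work:

import numpy as np

weights = np.array([1.0, -2.0, 0.5])
lr, lambda_ = 0.1, 0.01
data_gradient = np.zeros_like(weights)  # isolate the regularization term

for step in range(3):
    weights = weights * (1 - 2 * lr * lambda_) - lr * data_gradient
    print(step, weights)
# Each step multiplies the weights by 0.998, nudging them toward zero.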
9) Dropout (neural networks)
During training, randomly "drop" neurons (set to zero):
def dropout(layer_output, p=0.5):
    # p is the drop probability; each unit is kept with probability 1 - p
    mask = np.random.random(layer_output.shape) < (1 - p)
    return layer_output * mask / (1 - p)  # inverted dropout: rescale to keep the expected value
Effects:
- Forces redundancy (network can't rely on single neurons)
- Like training many smaller networks
- Regularizes without explicit penalty
At test time: use all neurons (no dropout).
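A quick self-contained numpy check (illustrative) that inverted dropout preserves the expected activation, which is why no rescaling is needed at test time:

import numpy as np

rng = np.random.default_rng(0)
activations = np.ones(100_000)
p = 0.5                                     # drop probability
mask = rng.random(activations.shape) < (1 - p)
dropped = activations * mask / (1 - p)      # inverted dropout
print(activations.mean(), round(dropped.mean(), 3))  # both close to 1.0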
10) Early stopping
Stop training when validation loss stops improving:
best_val_loss = float('inf')
patience = 5          # how many epochs to wait without improvement
patience_counter = 0

for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    val_loss = evaluate()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        save_model()  # Save the best model so far
    else:
        patience_counter += 1
        if patience_counter >= patience:
            break  # Stop training
Simple and effective regularization.
11) Data augmentation
Create synthetic training examples:
- Images: rotate, flip, crop, adjust brightness
- Text: synonym replacement, back-translation
- Audio: time stretch, pitch shift, add noise
More diverse training data → better generalization.
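An illustrative numpy-only sketch of simple image augmentations (a real pipeline would usually use a library such as torchvision or albumentations; the random image here is a stand-in for real data):

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                   # stand-in for an RGB image in [0, 1]

flipped = image[:, ::-1, :]                       # horizontal flip
top, left = rng.integers(0, 8, size=2)
cropped = image[top:top + 24, left:left + 24, :]  # random 24x24 crop
brighter = np.clip(image * 1.2, 0.0, 1.0)         # brightness adjustment

augmented_batch = [image, flipped, cropped, brighter]
# Each variant is a plausible new training example that keeps the original label.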
12) Batch normalization
Normalize layer activations during training:
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                  # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_normalized = (x - mean) / np.sqrt(var + eps)
    return gamma * x_normalized + beta     # learned scale and shift
Effects:
- Stabilizes training
- Allows higher learning rates
- Has regularization effect (noise from batch statistics)
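As a quick check, applying the batch_norm sketch above to a random batch should give per-feature mean near 0 and standard deviation near 1 (gamma and beta are placeholders here; in a real network they are learned):

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(64, 10))     # one batch of activations
out = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3))   # approximately 0 for every feature
print(out.std(axis=0).round(3))    # approximately 1 for every feature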
13) Choosing regularization strength
Cross-validation to find best λ:
lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
best_lambda = None
best_score = -float('inf')

for lambda_ in lambdas:
    score = cross_validate(model, lambda_)
    if score > best_score:
        best_score = score
        best_lambda = lambda_
Alternatively, tune λ on a single held-out validation set.
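A hedged sketch of the same search using scikit-learn (assumed available; λ is called alpha in sklearn's Ridge), scored by cross-validated R²:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.5, size=200)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
scores = [cross_val_score(Ridge(alpha=a), X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print("best alpha:", best_alpha)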
Key takeaways
- Overfitting: great train, poor test; model memorizes noise
- Regularization adds penalty for complexity
- L2: shrinks weights, smooth
- L1: creates sparsity, feature selection
- Dropout: random neuron dropping during training
- Early stopping: stop when validation loss increases
- Data augmentation: more diverse training examples
- Choose λ via cross-validation