Regularization and Overfitting

Lesson, slides, and applied problem sets.


Why this module exists

A model that fits its training data perfectly often fails on new data because it has memorized the examples rather than learned the underlying pattern. Regularization techniques limit this overfitting so that models generalize better to unseen data.


1) Overfitting and underfitting

Underfitting (high bias):

  • Model is too simple
  • Poor performance on both train and test
  • Example: linear model for curved data

Overfitting (high variance):

  • Model is too complex
  • Great on train, poor on test
  • Memorizes noise instead of learning patterns

Good fit:

  • Captures the underlying pattern
  • Similar performance on train and test
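The three regimes above can be reproduced on a toy problem. The following is a minimal sketch, assuming noisy samples of a sine curve and polynomial models of increasing degree; the data, the degrees, and the helper names (`true_fn`, `mse`) are illustrative choices, not part of the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical ground-truth curve used only for this illustration
    return np.sin(2 * np.pi * x)

# Small noisy training set and a larger held-out test set
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = true_fn(x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = true_fn(x_test) + rng.normal(0, 0.2, size=x_test.shape)

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit of the given degree
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = mse(y_train, np.polyval(coeffs, x_train))
    test_err = mse(y_test, np.polyval(coeffs, x_test))
    results[degree] = (train_err, test_err)
    print(f"degree={degree}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The straight line (degree 1) should score poorly on both sets, while the highest degree should show the lowest training error together with a visible gap to its test error.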

2) The bias-variance tradeoff

Bias: Error from wrong assumptions (model too simple)

Variance: Error from sensitivity to the training data (model too complex)

Total Error = Bias² + Variance + Irreducible Noise
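For squared-error loss, this decomposition can be written more formally. The notation is an assumption introduced here for illustration: the data follow y = f(x) + ε with noise variance σ², and f̂ is the model learned from a randomly drawn training set, so the expectations run over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible noise}}
```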

Increasing model complexity:

  • Decreases bias (fits data better)
  • Increases variance (more sensitive to training set)

The goal is to find the sweet spot.


3) Detecting overfitting

Compare training and validation metrics:

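train_loss = model.evaluate(train_data)
val_loss = model.evaluate(val_data)

# A validation loss well above the training loss suggests overfitting.
# The 1.5x ratio is an illustrative threshold, not a standard value.
if val_loss > 1.5 * train_loss:
    print("Overfitting detected!")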

Plot learning curves:

  • Training loss decreases
  • Validation loss decreases, then increases ← overfitting starts here
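The turning point in the validation curve can also be located programmatically. This is a small sketch using made-up loss values; the function name `overfitting_onset` and the numbers are hypothetical and serve only to illustrate the idea.

```python
def overfitting_onset(val_losses):
    """Return the index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

# Made-up per-epoch losses for illustration only
train_losses = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18, 0.12]
val_losses = [1.05, 0.80, 0.65, 0.60, 0.62, 0.70, 0.81]

best_epoch = overfitting_onset(val_losses)
# Gap between validation and training loss at each epoch
gaps = [v - t for t, v in zip(train_losses, val_losses)]

print(f"validation loss is lowest at epoch {best_epoch}")
print(f"train/validation gap grows from {gaps[0]:.2f} to {gaps[-1]:.2f}")
```

After the epoch returned by `overfitting_onset`, training loss keeps falling while validation loss rises, which is the pattern described in the bullets above.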

4) Regularization idea

Add a penalty to the loss function for model complexity:

Total Loss = Data Loss + λ × Complexity Penalty

Where λ controls regularization strength:

  • λ = 0: No regularization
  • λ large: Strong regularization (simpler model)

5) L2 regularization (Ridge)

Penalize the sum of squared weights:

def l2_penalty(weights):
    return sum(w ** 2 for w in weights)

total_loss = data_loss + lambda_ * l2_penalty(weights)

Effects:

  • Shrinks weights toward zero
  • Keeps all features, just with smaller coefficients
  • Smooth penalty (differentiable everywhere)
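The shrinking effect is easy to observe on a linear model. Below is a minimal sketch, assuming a squared-error data loss so that the L2-penalized problem has a closed-form solution; `ridge_fit` and the synthetic data are illustrative and not taken from the lesson.

```python
import numpy as np

def ridge_fit(X, y, lambda_):
    """Minimize ||Xw - y||^2 + lambda_ * ||w||^2 in closed form."""
    n_features = X.shape[1]
    A = X.T @ X + lambda_ * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])  # synthetic ground truth
y = X @ true_w + rng.normal(scale=0.1, size=50)

norms = []
for lambda_ in (0.0, 1.0, 10.0, 100.0):
    w = ridge_fit(X, y, lambda_)
    norms.append(float(np.linalg.norm(w)))
    print(f"lambda={lambda_:6.1f}  ||w||={norms[-1]:.3f}  w={np.round(w, 3)}")
```

As λ grows, the weight norm falls, but the individual coefficients generally stay nonzero, matching the "keeps all features" bullet.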

6) L1 regularization (Lasso)

Penalize the sum of absolute weights:

def l1_penalty(weights):
    return sum(abs(w) for w in weights)

total_loss = data_loss + lambda_ * l1_penalty(weights)

Effects:

  • Drives some weights exactly to zero
  • Automatic feature selection
  • Sparse solutions
  • Non-smooth at zero (subgradient needed)
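One common way to handle the kink at zero is proximal gradient descent, where each gradient step is followed by a soft-thresholding step that sets small weights exactly to zero. The sketch below assumes a linear model with squared-error data loss. The names `soft_threshold` and `lasso_ista`, the synthetic data, and the value of `lambda_` are illustrative assumptions, not the lesson's own method.

```python
import numpy as np

def soft_threshold(w, threshold):
    """Shrink each entry toward zero; entries within the threshold become exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

def lasso_ista(X, y, lambda_, n_iters=500):
    """Minimize 0.5 * ||Xw - y||^2 + lambda_ * ||w||_1 by proximal gradient descent."""
    # Step size 1/L, where L is the largest eigenvalue of X^T X
    lr = 1.0 / np.linalg.norm(X, ord=2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)  # gradient of the smooth data loss only
        w = soft_threshold(w - lr * grad, lr * lambda_)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])  # only two informative features
y = X @ true_w + rng.normal(scale=0.1, size=100)

w_lasso = lasso_ista(X, y, lambda_=20.0)
print("lasso weights:", np.round(w_lasso, 3))
print("exact zeros:", int(np.sum(w_lasso == 0.0)))
```

The uninformative features should end up with weights of exactly zero, which is the automatic feature selection described above.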

7) L1 vs L2 comparison

Aspect                        | L1 (Lasso)  | L2 (Ridge)
Sparsity                      | Yes (zeros) | No (small but nonzero)
Feature selection             | Yes         | No
Multiple correlated features  | Picks one   | Shares weight
Computation                   | Harder      | Easier

Elastic Net combines the L1 and L2 penalties, with α between 0 and 1 setting the mix:

penalty = α × L1 + (1 - α) × L2
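The combined penalty can be sketched in the same style as the L1 and L2 helpers. The function name `elastic_net_penalty` is hypothetical, and libraries may scale the two terms differently, so treat this as an illustration of the formula rather than any particular implementation.

```python
def elastic_net_penalty(weights, alpha):
    """Blend the two penalties: alpha=1 is pure L1, alpha=0 is pure L2."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w ** 2 for w in weights)
    return alpha * l1 + (1 - alpha) * l2

weights = [0.5, -2.0, 0.0]  # illustrative values
for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha}: penalty={elastic_net_penalty(weights, alpha)}")
```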

8) Gradient with regularization

L2 regularization gradient:

gradient = data_gradient + 2 * lambda_ * weights

Update becomes:

weights = weights - lr * (data_gradient + 2 * lambda_ * weights)
       = weights * (1 - 2 * lr * lambda_) - lr * data_gradient

The factor (1 - 2 × lr × λ) shrinks the weights at every step, which is why L2 regularization under plain gradient descent is also called "weight decay."
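The equivalence of the two update forms can be checked numerically. This is a small sketch with random stand-in values for the weights and the data gradient; the learning rate and λ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=4)
data_gradient = rng.normal(size=4)  # stand-in for the gradient of the data loss
lr, lambda_ = 0.1, 0.01

# Form 1: one gradient step on data loss plus the L2 penalty
step_with_penalty = weights - lr * (data_gradient + 2 * lambda_ * weights)

# Form 2: decay the weights first, then take a plain gradient step
step_with_decay = weights * (1 - 2 * lr * lambda_) - lr * data_gradient

print(np.allclose(step_with_penalty, step_with_decay))
```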


9) Dropout (neural networks)

During training, randomly "drop" neurons (set to zero):

import numpy as np

def dropout(layer_output, p=0.5):
    # p is the drop probability, so each unit is kept with probability 1 - p
    mask = np.random.binomial(1, 1 - p, size=layer_output.shape)
    return layer_output * mask / (1 - p)  # scale to maintain expected value

Effects:

  • Forces redundancy (network can't rely on single neurons)
  • Like training many smaller networks
  • Regularizes without explicit penalty

At test time: use all neurons (no dropout).
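The difference between training and test behavior can be sketched with an explicit mode flag. The function `dropout_forward`, its `training` and `rng` parameters, and the all-ones input are assumptions made for this illustration. Because the kept activations are rescaled during training, nothing needs to be rescaled at test time.

```python
import numpy as np

def dropout_forward(x, p=0.5, training=True, rng=None):
    """Inverted dropout: p is the probability of dropping each unit."""
    if not training or p == 0.0:
        return x  # test time: all units active, no rescaling needed
    rng = rng if rng is not None else np.random.default_rng()
    keep_mask = rng.random(x.shape) >= p  # True with probability 1 - p
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))  # illustrative activations

train_out = dropout_forward(x, p=0.3, training=True, rng=rng)
test_out = dropout_forward(x, p=0.3, training=False)

print("fraction dropped:", float(np.mean(train_out == 0.0)))
print("mean activation in training:", float(train_out.mean()))
print("mean activation at test time:", float(test_out.mean()))
```

The mean activation stays close to the same value in both modes, which is what the scaling by 1 / (1 - p) is for.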


10) Early stopping

Stop training when validation loss stops improving:

best_val_loss = float('inf')
patience = 5  # epochs to wait without improvement (illustrative value)
patience_counter = 0

for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    val_loss = evaluate()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        save_model()  # Save best model
    else:
        patience_counter += 1
        if patience_counter >= patience:
            break  # Stop training

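Early stopping is a simple and effective form of regularization: it caps how long the model can keep fitting noise in the training set, and the saved checkpoint is the model with the best validation loss.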
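The patience logic can be exercised without a real model by feeding it a recorded sequence of validation losses. In this sketch, the function `early_stopping_epoch` and the loss values are hypothetical; it mirrors the loop above but returns the stopping epoch and the best epoch instead of saving a model.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_val_loss = float("inf")
    best_epoch = -1
    patience_counter = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_epoch = epoch  # a real loop would save the model here
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                return epoch, best_epoch  # stop training
    return len(val_losses) - 1, best_epoch  # ran out of epochs

# Made-up validation losses for illustration only
val_losses = [0.90, 0.70, 0.60, 0.55, 0.57, 0.56, 0.58, 0.61, 0.65]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best checkpoint from epoch {best_epoch}")
```

A larger `patience` tolerates longer plateaus at the cost of extra training epochs before the stop.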


11) Data augmentation

Create synthetic training examples:

  • Images: rotate, flip, crop, adjust brightness
  • Text: synonym replacement, back-translation
  • Audio: time stretch, pitch shift, add noise

More diverse training data → better generalization.
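For images, two of the transformations listed above can be sketched with plain array operations. The function `augment_image`, the flip probability, and the brightness range are illustrative assumptions, and the random array stands in for a real image with pixel values in [0, 1].

```python
import numpy as np

def augment_image(image, rng):
    """Apply a random horizontal flip and brightness shift to an H x W x C image in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip: reverse the width axis
    brightness = rng.uniform(-0.1, 0.1)
    return np.clip(out + brightness, 0.0, 1.0)  # keep pixel values in range

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # random stand-in for a real image
original = image.copy()

augmented = [augment_image(image, rng) for _ in range(4)]
print([a.shape for a in augmented])
```

Each call produces a slightly different variant with the same shape and the same label as the original, which is why such transformations are safe to add to the training set.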


12) Batch normalization

Normalize layer activations during training:

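import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Training-mode forward pass: normalize each feature over the batch axis
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_normalized = (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero
    return gamma * x_normalized + beta  # learnable scale and shift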

Effects:

  • Stabilizes training
  • Allows higher learning rates
  • Has regularization effect (noise from batch statistics)
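Because batch statistics are only available during training, a common practice is to keep running averages of the mean and variance and use them at inference. The sketch below assumes that convention. The function names, the `momentum` value, and the synthetic input are illustrative and not taken from the lesson.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var, momentum=0.9, eps=1e-5):
    """Normalize with batch statistics and update running averages for later inference."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_normalized = (x - mean) / np.sqrt(var + eps)
    new_running_mean = momentum * running_mean + (1 - momentum) * mean
    new_running_var = momentum * running_var + (1 - momentum) * var
    return gamma * x_normalized + beta, new_running_mean, new_running_var

def batch_norm_eval(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Normalize with the stored running statistics instead of batch statistics."""
    return gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # batch of 64 examples, 4 features
gamma, beta = np.ones(4), np.zeros(4)

out, run_mean, run_var = batch_norm_train(x, gamma, beta, np.zeros(4), np.ones(4))
eval_out = batch_norm_eval(x, gamma, beta, run_mean, run_var)

print("per-feature mean:", np.round(out.mean(axis=0), 6))
print("per-feature std: ", np.round(out.std(axis=0), 3))
```

With `gamma` at one and `beta` at zero, the training-mode output has roughly zero mean and unit standard deviation per feature.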

13) Choosing regularization strength

Use cross-validation to find the best λ:

lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
best_lambda = None
best_score = -float('inf')

for lambda_ in lambdas:
    score = cross_validate(model, lambda_)
    if score > best_score:
        best_score = score
        best_lambda = lambda_

Alternatively, tune λ on a single held-out validation set.
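The search over λ can be made concrete with a small k-fold loop. This is a sketch under stated assumptions: the model is a closed-form ridge regression, the score is the negative mean validation error, and the folds are contiguous blocks of already-random synthetic data. The `cross_validate` signature here differs from the placeholder above and is hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lambda_):
    """Closed-form solution of the L2-penalized least-squares problem."""
    A = X.T @ X + lambda_ * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def cross_validate(X, y, lambda_, k=5):
    """Return the negative mean validation MSE over k folds (higher is better)."""
    indices = np.arange(len(y))
    errors = []
    for val_idx in np.array_split(indices, k):
        train_idx = np.setdiff1d(indices, val_idx)  # everything outside the fold
        w = ridge_fit(X[train_idx], y[train_idx], lambda_)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return -float(np.mean(errors))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=60)  # synthetic targets

lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
scores = {lambda_: cross_validate(X, y, lambda_) for lambda_ in lambdas}
best_lambda = max(scores, key=scores.get)

print(scores)
print("best lambda:", best_lambda)
```

Real data is often ordered, so shuffling the indices before splitting is usually advisable; it is omitted here because the synthetic rows are already random.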


Key takeaways

  • Overfitting: great train, poor test; model memorizes noise
  • Regularization adds penalty for complexity
  • L2: shrinks weights, smooth
  • L1: creates sparsity, feature selection
  • Dropout: random neuron dropping during training
  • Early stopping: stop when validation loss stops improving
  • Data augmentation: more diverse training examples
  • Choose λ via cross-validation
