Regularization and Overfitting
Lesson, slides, and applied problem sets.
Why this module exists
A model that perfectly fits training data often fails on new data—it has memorized rather than learned. Regularization techniques prevent this overfitting, leading to models that generalize well.
1) Overfitting and underfitting
Underfitting (high bias):
- Model is too simple
- Poor performance on both train and test
- Example: linear model for curved data
Overfitting (high variance):
- Model is too complex
- Great on train, poor on test
- Memorizes noise instead of learning patterns
Good fit:
- Captures the underlying pattern
- Similar performance on train and test
2) The bias-variance tradeoff
Bias: error from wrong assumptions (model too simple)
Variance: error from sensitivity to the training data (model too complex)
Total Error = Bias² + Variance + Irreducible Noise
Increasing model complexity:
- Decreases bias (fits data better)
- Increases variance (more sensitive to training set)
The goal is to find the sweet spot.
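A self-contained numpy sketch (synthetic sine data, an illustration rather than lesson code) makes the tradeoff visible by fitting polynomials of increasing degree and comparing train and validation error:

import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(2 * np.pi * x)
x_train = np.linspace(0, 1, 20)
x_val = np.linspace(0, 1, 50)
y_train = true_fn(x_train) + rng.normal(0, 0.2, x_train.shape)
y_val = true_fn(x_val) + rng.normal(0, 0.2, x_val.shape)

for degree in [1, 3, 9]:
    coeffs = np.polyfit(x_train, y_train, degree)  # fit on training data only
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree={degree}  train MSE={train_mse:.3f}  val MSE={val_mse:.3f}")
# Degree 1 underfits (both errors high); degree 9 overfits
# (training error keeps shrinking while validation error grows).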
3) Detecting overfitting
Compare training and validation metrics:
train_loss = model.evaluate(train_data)
val_loss = model.evaluate(val_data)

# A validation loss much larger than the training loss signals overfitting
# (the 1.5x threshold is an arbitrary illustration, not a fixed rule).
if val_loss > 1.5 * train_loss:
    print("Overfitting detected!")
Plot learning curves:
- Training loss decreases
- Validation loss decreases, then increases ← overfitting starts here
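A sketch of recording and plotting these curves, reusing the lesson's placeholder train_one_epoch() and evaluate() helpers (they also appear in the early-stopping snippet later) and assuming matplotlib is available:

import matplotlib.pyplot as plt

max_epochs = 50
train_losses, val_losses = [], []
for epoch in range(max_epochs):
    train_losses.append(train_one_epoch())  # lesson's placeholder helper
    val_losses.append(evaluate())           # lesson's placeholder helper

plt.plot(train_losses, label="train loss")
plt.plot(val_losses, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
# The epoch where the validation curve turns upward while the training
# curve keeps falling is where overfitting begins.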
4) Regularization idea
Add a penalty to the loss function for model complexity:
Total Loss = Data Loss + λ × Complexity Penalty
Where λ controls regularization strength:
- λ = 0: No regularization
- λ large: Strong regularization (simpler model)
5) L2 regularization (Ridge)
Penalize the sum of squared weights:
def l2_penalty(weights):
    return sum(w ** 2 for w in weights)

total_loss = data_loss + lambda_ * l2_penalty(weights)
Effects:
- Shrinks weights toward zero
- Keeps all features, just with smaller coefficients
- Smooth penalty (differentiable everywhere)
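For linear regression the L2 penalty even admits a closed-form solution, w = (XᵀX + λI)⁻¹Xᵀy. A numpy sketch (synthetic data, hypothetical ridge_fit helper) shows the weights shrinking, but never reaching exactly zero, as lambda_ grows:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(0, 0.1, size=100)

def ridge_fit(X, y, lambda_):
    n_features = X.shape[1]
    # Larger lambda_ adds more to the diagonal, pulling the solution toward zero.
    return np.linalg.solve(X.T @ X + lambda_ * np.eye(n_features), X.T @ y)

for lambda_ in [0.0, 1.0, 100.0]:
    w = ridge_fit(X, y, lambda_)
    print(lambda_, np.round(w, 3), "||w|| =", round(float(np.linalg.norm(w)), 3))
# The weight norm decreases as lambda_ increases, but no coefficient becomes exactly zero.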
6) L1 regularization (Lasso)
Penalize the sum of absolute weights:
def l1_penalty(weights):
    return sum(abs(w) for w in weights)

total_loss = data_loss + lambda_ * l1_penalty(weights)
Effects:
- Drives some weights exactly to zero
- Automatic feature selection
- Sparse solutions
- Non-smooth at zero (subgradient needed)
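Because the penalty is non-smooth at zero, L1 solvers often use a proximal (soft-thresholding) step instead of a plain gradient step. A short sketch for intuition (not part of the lesson's code):

import numpy as np

def soft_threshold(w, threshold):
    # Shrink every weight toward zero; weights smaller than the threshold become exactly zero.
    return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)

weights = np.array([0.8, -0.05, 0.3, -0.6, 0.02])
print(soft_threshold(weights, threshold=0.1))
# Large weights shrink by 0.1; the two small weights are zeroed out, which is where sparsity comes from.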
7) L1 vs L2 comparison
| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Sparsity | Yes (zeros) | No (small but nonzero) |
| Feature selection | Yes | No |
| Multiple correlated features | Picks one | Shares weight |
| Computation | Harder (no closed form; needs iterative solvers) | Easier (closed form for linear models) |
Elastic Net: Combine both L1 and L2
penalty = α × L1 + (1 - α) × L2
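Following the formula above, a minimal sketch of the combined penalty in the same style as the earlier snippets (alpha is the mixing weight; alpha = 1 recovers pure L1, alpha = 0 pure L2):

def elastic_net_penalty(weights, alpha=0.5):
    l1 = sum(abs(w) for w in weights)   # sparsity-inducing part
    l2 = sum(w ** 2 for w in weights)   # shrinkage part
    return alpha * l1 + (1 - alpha) * l2

total_loss = data_loss + lambda_ * elastic_net_penalty(weights, alpha=0.5)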
8) Gradient with regularization
L2 regularization gradient:
gradient = data_gradient + 2 * lambda_ * weights
Update becomes:
weights = weights - lr * (data_gradient + 2 * lambda_ * weights)
# which rearranges to:
# weights = weights * (1 - 2 * lr * lambda_) - lr * data_gradient
The (1 - 2 lr λ) factor shrinks weights each step—called "weight decay."
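A tiny numeric check (made-up numbers, with the data gradient set to zero to isolate the decay term) shows the shrinkage factor at work:

import numpy as np

weights = np.array([1.0, -2.0, 0.5])
lr, lambda_ = 0.1, 0.01
data_gradient = np.zeros_like(weights)  # isolate the regularization term

for step in range(3):
    weights = weights * (1 - 2 * lr * lambda_) - lr * data_gradient
    print(step, weights)
# Each step multiplies the weights by 0.998, nudging them toward zero.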
9) Dropout (neural networks)
During training, randomly "drop" neurons (set to zero):
def dropout(layer_output, p=0.5):
    # p is the drop probability; each unit is kept with probability 1 - p
    mask = np.random.random(layer_output.shape) < (1 - p)
    return layer_output * mask / (1 - p)  # inverted dropout: rescale to keep the expected value
Effects:
- Forces redundancy (network can't rely on single neurons)
- Like training many smaller networks
- Regularizes without explicit penalty
At test time: use all neurons (no dropout).
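A quick self-contained numpy check (illustrative) that inverted dropout preserves the expected activation, which is why no rescaling is needed at test time:

import numpy as np

rng = np.random.default_rng(0)
activations = np.ones(100_000)
p = 0.5                                     # drop probability
mask = rng.random(activations.shape) < (1 - p)
dropped = activations * mask / (1 - p)      # inverted dropout
print(activations.mean(), round(dropped.mean(), 3))  # both close to 1.0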
10) Early stopping
Stop training when validation loss stops improving:
best_val_loss = float('inf')
patience = 5          # how many epochs to wait without improvement
patience_counter = 0

for epoch in range(max_epochs):
    train_loss = train_one_epoch()
    val_loss = evaluate()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        save_model()  # Save the best model so far
    else:
        patience_counter += 1
        if patience_counter >= patience:
            break  # Stop training
Simple and effective regularization.
11) Data augmentation
Create synthetic training examples:
- Images: rotate, flip, crop, adjust brightness
- Text: synonym replacement, back-translation
- Audio: time stretch, pitch shift, add noise
More diverse training data → better generalization.
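An illustrative numpy-only sketch of simple image augmentations (a real pipeline would usually use a library such as torchvision or albumentations; the random image here is a stand-in for real data):

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                   # stand-in for an RGB image in [0, 1]

flipped = image[:, ::-1, :]                       # horizontal flip
top, left = rng.integers(0, 8, size=2)
cropped = image[top:top + 24, left:left + 24, :]  # random 24x24 crop
brighter = np.clip(image * 1.2, 0.0, 1.0)         # brightness adjustment

augmented_batch = [image, flipped, cropped, brighter]
# Each variant is a plausible new training example that keeps the original label.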
12) Batch normalization
Normalize layer activations during training:
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                  # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_normalized = (x - mean) / np.sqrt(var + eps)
    return gamma * x_normalized + beta     # learned scale and shift
Effects:
- Stabilizes training
- Allows higher learning rates
- Has regularization effect (noise from batch statistics)
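As a quick check, applying the batch_norm sketch above to a random batch should give per-feature mean near 0 and standard deviation near 1 (gamma and beta are placeholders here; in a real network they are learned):

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(64, 10))     # one batch of activations
out = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3))   # approximately 0 for every feature
print(out.std(axis=0).round(3))    # approximately 1 for every feature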
13) Choosing regularization strength
Cross-validation to find best λ:
lambdas = [0.001, 0.01, 0.1, 1.0, 10.0]
best_lambda = None
best_score = -float('inf')

for lambda_ in lambdas:
    score = cross_validate(model, lambda_)
    if score > best_score:
        best_score = score
        best_lambda = lambda_
Alternatively, tune λ on a single held-out validation set.
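A hedged sketch of the same search using scikit-learn (assumed available; λ is called alpha in sklearn's Ridge), scored by cross-validated R²:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.5, size=200)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
scores = [cross_val_score(Ridge(alpha=a), X, y, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print("best alpha:", best_alpha)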
Key takeaways
- Overfitting: great train, poor test; model memorizes noise
- Regularization adds penalty for complexity
- L2: shrinks weights, smooth
- L1: creates sparsity, feature selection
- Dropout: random neuron dropping during training
- Early stopping: stop when validation loss increases
- Data augmentation: more diverse training examples
- Choose λ via cross-validation