Loss Functions: Measuring Model Error
Why this module exists
A model is only as good as its ability to minimize error. Loss functions (also called cost functions or objective functions) quantify how wrong the model's predictions are. The choice of loss function shapes what the model learns.
Understanding loss functions is essential for understanding training dynamics and debugging models.
1) What is a loss function?
A loss function L(y_true, y_pred) takes:
- y_true: The correct answer (ground truth)
- y_pred: The model's prediction
And returns a scalar measuring how wrong the prediction is.
Properties:
- L ≥ 0 (non-negative)
- L = 0 when prediction is perfect
- Larger L = worse prediction
2) Mean Squared Error (MSE)
The default for regression tasks:
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y_true[i] - y_pred[i]) ** 2 for i in range(n)) / n
Formula: MSE = (1/n) Σ(yᵢ - ŷᵢ)²
Properties:
- Penalizes large errors heavily (squared term)
- Differentiable everywhere
- Sensitive to outliers
- Units are squared (not interpretable directly)
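A quick numeric check of the outlier sensitivity, using the mse function above (the values are arbitrary, for illustration only):
y_true = [3.0, 5.0, 7.0, 9.0]
close = [3.1, 4.8, 7.2, 9.1]
with_outlier = [3.1, 4.8, 7.2, 19.0]   # one badly missed prediction

print(mse(y_true, close))         # ≈ 0.025
print(mse(y_true, with_outlier))  # ≈ 25.02, the single large error dominates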
3) Mean Absolute Error (MAE)
An alternative for regression:
def mae(y_true, y_pred):
    n = len(y_true)
    return sum(abs(y_true[i] - y_pred[i]) for i in range(n)) / n
Formula: MAE = (1/n) Σ|yᵢ - ŷᵢ|
Properties:
- Less sensitive to outliers than MSE
- Same units as the target
- Not differentiable where the error is exactly zero (a subgradient works in practice)
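Running the same illustrative values through mae shows how much more gently the outlier is treated:
print(mae([3.0, 5.0, 7.0, 9.0], [3.1, 4.8, 7.2, 9.1]))    # ≈ 0.15
print(mae([3.0, 5.0, 7.0, 9.0], [3.1, 4.8, 7.2, 19.0]))   # ≈ 2.63, grows linearly rather than quadratically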
4) MSE vs MAE: When to use which
| Aspect | MSE | MAE |
|---|---|---|
| Outlier sensitivity | High | Low |
| Gradient at zero error | Smooth (shrinks toward 0) | Discontinuous (jumps between ±1) |
| Interpretation | Squared units of the target | Same units as the target |
| Optimal constant prediction | Mean | Median |
Use MSE when outliers are meaningful; use MAE when they're noise.
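The table in section 11 also lists Huber loss, which splits the difference: quadratic for small errors (like MSE) and linear for large ones (like MAE). A minimal sketch, assuming a hand-picked transition point delta:
def huber(y_true, y_pred, delta=1.0):
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2               # quadratic region, like MSE
        else:
            total += delta * (err - 0.5 * delta)  # linear region, like MAE
    return total / len(y_true)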
5) Binary Cross-Entropy (Log Loss)
The loss for binary classification (0 or 1):
from math import log

def binary_cross_entropy(y_true, y_pred):
    eps = 1e-15  # avoid log(0)
    n = len(y_true)
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip predictions into (0, 1)
        total += t * log(p) + (1 - t) * log(1 - p)
    return -total / n
Formula: BCE = -(1/n) Σ[yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]
Where ŷ is the predicted probability (0 to 1).
6) Understanding binary cross-entropy
When y_true = 1:
- Loss = -log(y_pred)
- If y_pred = 1: loss = 0 (correct, confident)
- If y_pred = 0.5: loss = 0.69 (uncertain)
- If y_pred = 0.01: loss = 4.6 (wrong, confident = very bad!)
When y_true = 0:
- Loss = -log(1 - y_pred)
- Mirrors the y_true = 1 case: near-zero loss when y_pred is close to 0, large loss when y_pred is confidently close to 1
Key insight: Confident wrong predictions are heavily penalized.
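These numbers come straight from the binary_cross_entropy function defined in section 5:
print(binary_cross_entropy([1], [0.5]))    # ≈ 0.693, uncertain
print(binary_cross_entropy([1], [0.01]))   # ≈ 4.61, confidently wrong
print(binary_cross_entropy([1], [0.99]))   # ≈ 0.01, confidently right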
7) Categorical Cross-Entropy
For multi-class classification (K classes):
from math import log

def categorical_cross_entropy(y_true, y_pred):
    # y_true: list of one-hot vectors, y_pred: list of probability distributions
    eps = 1e-15  # avoid log(0)
    n = len(y_true)
    loss = 0.0
    for i in range(n):
        for k in range(len(y_true[i])):
            loss -= y_true[i][k] * log(y_pred[i][k] + eps)
    return loss / n
Formula: CCE = -(1/n) Σᵢ Σₖ yᵢₖ log(ŷᵢₖ)
The prediction ŷ should be a probability distribution (from softmax).
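Because y_true is one-hot, only the true class's log-probability contributes for each sample. A quick check with the function above (prediction values are made up):
y_true = [[0, 1, 0]]        # one sample, true class is index 1
y_pred = [[0.2, 0.7, 0.1]]  # predicted distribution over 3 classes
print(categorical_cross_entropy(y_true, y_pred))  # ≈ 0.357, i.e. -log(0.7)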
8) Softmax: Turning scores into probabilities
Before cross-entropy, convert raw scores (logits) to probabilities:
from math import exp

def softmax(logits):
    exp_logits = [exp(x) for x in logits]
    sum_exp = sum(exp_logits)
    return [x / sum_exp for x in exp_logits]

# Example
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)  # ≈ [0.66, 0.24, 0.10]
Properties:
- All outputs in (0, 1)
- Sum to 1
- Preserves ordering
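A quick check of these properties on the example above:
print(sum(probs))                      # ≈ 1.0 (up to floating-point rounding)
print(probs[0] > probs[1] > probs[2])  # True, the ordering of the scores is preserved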
9) Logits and numerical stability
Raw model outputs (before softmax) are called logits.
For numerical stability, compute log-softmax and cross-entropy together:
from math import exp, log

def log_softmax(logits):
    max_logit = max(logits)  # subtract the max so exp() cannot overflow
    shifted = [x - max_logit for x in logits]
    log_sum_exp = log(sum(exp(x) for x in shifted))
    return [x - log_sum_exp for x in shifted]
Libraries handle this; understand why it matters.
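To see why, compare the two on large logits (values chosen so the naive softmax from section 8 overflows):
logits = [1000.0, 999.0, 998.0]
# softmax(logits) raises OverflowError: exp(1000.0) does not fit in a float
print(log_softmax(logits))  # ≈ [-0.408, -1.408, -2.408], computed safely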
10) Hinge loss (SVM)
For support vector machines:
def hinge_loss(y_true, y_pred):
    # y_true in {-1, +1}; y_pred is a raw score, not a probability
    return max(0, 1 - y_true * y_pred)
Properties:
- Zero loss when the prediction has the correct sign and y_true * y_pred ≥ 1 (outside the margin)
- Linear penalty for violations
- Leads to maximum margin classifiers
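A few per-sample values (scores chosen arbitrarily):
print(hinge_loss(+1, 2.5))   # 0.0, correct side, outside the margin
print(hinge_loss(+1, 0.3))   # 0.7, correct side but inside the margin
print(hinge_loss(-1, 0.8))   # 1.8, wrong side, penalized linearly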
11) Choosing the right loss function
| Task | Common Loss |
|---|---|
| Regression | MSE, MAE, Huber |
| Binary classification | Binary cross-entropy |
| Multi-class classification | Categorical cross-entropy |
| Ranking | Pairwise/Listwise losses |
| Generative | Reconstruction + KL divergence |
The loss function encodes your goals. Choose wisely.
12) Loss landscapes
The loss function creates a surface over parameter space. Training navigates this landscape:
- Convex loss (linear regression + MSE): One global minimum
- Non-convex loss (neural nets): Many local minima, saddle points
Visualization helps intuition, but real landscapes are high-dimensional.
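One low-dimensional way to build that intuition: sweep a single parameter of a one-parameter model y = w * x and evaluate the loss at each candidate value. This sketch reuses the mse function from section 2 on made-up data:
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                      # generated by the "true" slope w = 2
for w in [0.0, 1.0, 2.0, 3.0, 4.0]:
    preds = [w * x for x in xs]
    print(w, round(mse(ys, preds), 2))    # a convex bowl with its minimum at w = 2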
Key takeaways
- Loss functions measure prediction error; training minimizes them
- MSE for regression: penalizes large errors, sensitive to outliers
- MAE for regression: robust to outliers, less smooth
- Binary cross-entropy for binary classification: punishes confident mistakes
- Categorical cross-entropy for multi-class: requires softmax probabilities
- Softmax converts logits to probability distribution
- Loss choice shapes model behavior and gradients