Loss Functions: Measuring Model Error
Why this module exists
A model is only as good as its ability to minimize error. Loss functions (also called cost functions or objective functions) quantify how wrong the model's predictions are. The choice of loss function shapes what the model learns.
Understanding loss functions is essential for understanding training dynamics and debugging models.
1) What is a loss function?
A loss function L(y_true, y_pred) takes:
- y_true: The correct answer (ground truth)
- y_pred: The model's prediction
And returns a scalar measuring how wrong the prediction is.
Properties:
- L ≥ 0 (non-negative)
- L = 0 when prediction is perfect
- Larger L = worse prediction
2) Mean Squared Error (MSE)
The default for regression tasks:
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y_true[i] - y_pred[i]) ** 2 for i in range(n)) / n
Formula: MSE = (1/n) Σ(yᵢ - ŷᵢ)²
Properties:
- Penalizes large errors heavily (squared term)
- Differentiable everywhere
- Sensitive to outliers
- Units are squared (not interpretable directly)
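A quick numeric check of the outlier sensitivity, using the mse function above (the values are arbitrary, for illustration only):
y_true = [3.0, 5.0, 7.0, 9.0]
close = [3.1, 4.8, 7.2, 9.1]
with_outlier = [3.1, 4.8, 7.2, 19.0]   # one badly missed prediction

print(mse(y_true, close))         # ≈ 0.025
print(mse(y_true, with_outlier))  # ≈ 25.02, the single large error dominates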
3) Mean Absolute Error (MAE)
An alternative for regression:
def mae(y_true, y_pred):
    n = len(y_true)
    return sum(abs(y_true[i] - y_pred[i]) for i in range(n)) / n
Formula: MAE = (1/n) Σ|yᵢ - ŷᵢ|
Properties:
- Less sensitive to outliers than MSE
- Same units as the target
- Not differentiable where the error is exactly zero (a subgradient works in practice)
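Running the same illustrative values through mae shows how much more gently the outlier is treated:
print(mae([3.0, 5.0, 7.0, 9.0], [3.1, 4.8, 7.2, 9.1]))    # ≈ 0.15
print(mae([3.0, 5.0, 7.0, 9.0], [3.1, 4.8, 7.2, 19.0]))   # ≈ 2.63, grows linearly rather than quadratically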
4) MSE vs MAE: When to use which
| Aspect | MSE | MAE |
|---|---|---|
| Outlier sensitivity | High | Low |
| Gradient at zero error | Smooth (shrinks toward 0) | Discontinuous (jumps between ±1) |
| Interpretation | Squared units of the target | Same units as the target |
| Optimal constant prediction | Mean | Median |
Use MSE when outliers are meaningful; use MAE when they're noise.
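The table in section 11 also lists Huber loss, which splits the difference: quadratic for small errors (like MSE) and linear for large ones (like MAE). A minimal sketch, assuming a hand-picked transition point delta:
def huber(y_true, y_pred, delta=1.0):
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = abs(t - p)
        if err <= delta:
            total += 0.5 * err ** 2               # quadratic region, like MSE
        else:
            total += delta * (err - 0.5 * delta)  # linear region, like MAE
    return total / len(y_true)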
5) Binary Cross-Entropy (Log Loss)
The loss for binary classification (0 or 1):
from math import log

def binary_cross_entropy(y_true, y_pred):
    eps = 1e-15  # avoid log(0)
    n = len(y_true)
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip predictions into (0, 1)
        total += t * log(p) + (1 - t) * log(1 - p)
    return -total / n
Formula: BCE = -(1/n) Σ[yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]
Where ŷ is the predicted probability (0 to 1).
6) Understanding binary cross-entropy
When y_true = 1:
- Loss = -log(y_pred)
- If y_pred = 1: loss = 0 (correct, confident)
- If y_pred = 0.5: loss = 0.69 (uncertain)
- If y_pred = 0.01: loss = 4.6 (wrong, confident = very bad!)
When y_true = 0:
- Loss = -log(1 - y_pred)
- Mirrors the y_true = 1 case: near-zero loss when y_pred is close to 0, large loss when y_pred is confidently close to 1
Key insight: Confident wrong predictions are heavily penalized.
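These numbers come straight from the binary_cross_entropy function defined in section 5:
print(binary_cross_entropy([1], [0.5]))    # ≈ 0.693, uncertain
print(binary_cross_entropy([1], [0.01]))   # ≈ 4.61, confidently wrong
print(binary_cross_entropy([1], [0.99]))   # ≈ 0.01, confidently right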
7) Categorical Cross-Entropy
For multi-class classification (K classes):
from math import log

def categorical_cross_entropy(y_true, y_pred):
    # y_true: list of one-hot vectors, y_pred: list of probability distributions
    eps = 1e-15  # avoid log(0)
    n = len(y_true)
    loss = 0.0
    for i in range(n):
        for k in range(len(y_true[i])):
            loss -= y_true[i][k] * log(y_pred[i][k] + eps)
    return loss / n
Formula: CCE = -(1/n) Σᵢ Σₖ yᵢₖ log(ŷᵢₖ)
The prediction ŷ should be a probability distribution (from softmax).
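Because y_true is one-hot, only the true class's log-probability contributes for each sample. A quick check with the function above (prediction values are made up):
y_true = [[0, 1, 0]]        # one sample, true class is index 1
y_pred = [[0.2, 0.7, 0.1]]  # predicted distribution over 3 classes
print(categorical_cross_entropy(y_true, y_pred))  # ≈ 0.357, i.e. -log(0.7)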
8) Softmax: Turning scores into probabilities
Before cross-entropy, convert raw scores (logits) to probabilities:
from math import exp

def softmax(logits):
    exp_logits = [exp(x) for x in logits]
    sum_exp = sum(exp_logits)
    return [x / sum_exp for x in exp_logits]

# Example
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)  # ≈ [0.66, 0.24, 0.10]
Properties:
- All outputs in (0, 1)
- Sum to 1
- Preserves ordering
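A quick check of these properties on the example above:
print(sum(probs))                      # ≈ 1.0 (up to floating-point rounding)
print(probs[0] > probs[1] > probs[2])  # True, the ordering of the scores is preserved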
9) Logits and numerical stability
Raw model outputs (before softmax) are called logits.
For numerical stability, compute log-softmax and cross-entropy together:
from math import exp, log

def log_softmax(logits):
    max_logit = max(logits)  # subtract the max so exp() cannot overflow
    shifted = [x - max_logit for x in logits]
    log_sum_exp = log(sum(exp(x) for x in shifted))
    return [x - log_sum_exp for x in shifted]
Libraries handle this; understand why it matters.
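To see why, compare the two on large logits (values chosen so the naive softmax from section 8 overflows):
logits = [1000.0, 999.0, 998.0]
# softmax(logits) raises OverflowError: exp(1000.0) does not fit in a float
print(log_softmax(logits))  # ≈ [-0.408, -1.408, -2.408], computed safely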
10) Hinge loss (SVM)
For support vector machines:
def hinge_loss(y_true, y_pred):
    # y_true in {-1, +1}; y_pred is a raw score, not a probability
    return max(0, 1 - y_true * y_pred)
Properties:
- Zero loss when the prediction has the correct sign and y_true * y_pred ≥ 1 (outside the margin)
- Linear penalty for violations
- Leads to maximum margin classifiers
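A few per-sample values (scores chosen arbitrarily):
print(hinge_loss(+1, 2.5))   # 0.0, correct side, outside the margin
print(hinge_loss(+1, 0.3))   # 0.7, correct side but inside the margin
print(hinge_loss(-1, 0.8))   # 1.8, wrong side, penalized linearly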
11) Choosing the right loss function
| Task | Common Loss |
|---|---|
| Regression | MSE, MAE, Huber |
| Binary classification | Binary cross-entropy |
| Multi-class classification | Categorical cross-entropy |
| Ranking | Pairwise/Listwise losses |
| Generative | Reconstruction + KL divergence |
The loss function encodes your goals. Choose wisely.
12) Loss landscapes
The loss function creates a surface over parameter space. Training navigates this landscape:
- Convex loss (linear regression + MSE): One global minimum
- Non-convex loss (neural nets): Many local minima, saddle points
Visualization helps intuition, but real landscapes are high-dimensional.
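One low-dimensional way to build that intuition: sweep a single parameter of a one-parameter model y = w * x and evaluate the loss at each candidate value. This sketch reuses the mse function from section 2 on made-up data:
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                      # generated by the "true" slope w = 2
for w in [0.0, 1.0, 2.0, 3.0, 4.0]:
    preds = [w * x for x in xs]
    print(w, round(mse(ys, preds), 2))    # a convex bowl with its minimum at w = 2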
Key takeaways
- Loss functions measure prediction error; training minimizes them
- MSE for regression: penalizes large errors, sensitive to outliers
- MAE for regression: robust to outliers, less smooth
- Binary cross-entropy for binary classification: punishes confident mistakes
- Categorical cross-entropy for multi-class: requires softmax probabilities
- Softmax converts logits to probability distribution
- Loss choice shapes model behavior and gradients