Loss Functions: Measuring Model Error

Lesson, slides, and applied problem sets.

Lesson

Why this module exists

A model is only as good as its ability to minimize error. Loss functions (also called cost functions or objective functions) quantify how wrong the model's predictions are. The choice of loss function shapes what the model learns.

Understanding loss functions is essential for understanding training dynamics and debugging models.


1) What is a loss function?

A loss function L(y_true, y_pred) takes:

  • y_true: The correct answer (ground truth)
  • y_pred: The model's prediction

And returns a scalar measuring how wrong the prediction is.

Properties:

  • L ≥ 0 (non-negative)
  • L = 0 when prediction is perfect
  • Larger L = worse prediction
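
For example, a plain squared error on a single prediction already satisfies all three properties. A minimal sketch (the function name squared_error is just for illustration):

def squared_error(y_true, y_pred):
    # non-negative, zero only for a perfect prediction, larger for worse predictions
    return (y_true - y_pred) ** 2

print(squared_error(3.0, 3.0))  # 0.0  (perfect prediction)
print(squared_error(3.0, 2.0))  # 1.0
print(squared_error(3.0, 0.0))  # 9.0  (worse prediction, larger loss)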

2) Mean Squared Error (MSE)

The default for regression tasks:

def mse(y_true, y_pred):
    # average of squared differences between targets and predictions
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

Formula: MSE = (1/n) Σ(yᵢ - ŷᵢ)²

Properties:

  • Penalizes large errors heavily (squared term)
  • Differentiable everywhere
  • Sensitive to outliers
  • Units are squared (not interpretable directly)
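
As a quick usage sketch of the mse function above (the numbers are invented for illustration), note how a single large error dominates the average:

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 4.0]   # last prediction is off by 5

print(mse(y_true, y_pred))  # 6.375, of which 25/4 = 6.25 comes from the one bad prediction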

3) Mean Absolute Error (MAE)

An alternative for regression:

def mae(y_true, y_pred):
    # average of absolute differences between targets and predictions
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

Formula: MAE = (1/n) Σ|yᵢ - ŷᵢ|

Properties:

  • Less sensitive to outliers than MSE
  • Same units as the target
  • Not differentiable at zero (but subgradient works)

4) MSE vs MAE: When to use which

Aspect                        MSE              MAE
Outlier sensitivity           High             Low
Gradient near zero error      Smooth           Kink (non-differentiable)
Interpretation                Squared units    Same units as target
Optimal constant prediction   Mean             Median

Use MSE when outliers are meaningful; use MAE when they're noise.
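
To make the contrast concrete, here is a small comparison using the mse and mae functions above; the single outlier prediction is invented for illustration:

y_true = [10.0, 12.0, 11.0, 13.0]
clean  = [10.5, 12.0, 11.5, 13.0]
wild   = [10.5, 12.0, 11.5, 33.0]   # one prediction is off by 20

print(mse(y_true, clean), mae(y_true, clean))  # 0.125    0.25
print(mse(y_true, wild),  mae(y_true, wild))   # 100.125  5.25

The single outlier multiplies MSE by roughly 800× but MAE by only about 21×, which is the intuition behind the rule above.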


5) Binary Cross-Entropy (Log Loss)

The loss for binary classification (0 or 1):

from math import log

def binary_cross_entropy(y_true, y_pred):
    eps = 1e-15  # clip predictions away from 0 and 1 to avoid log(0)
    n = len(y_true)
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += t * log(p) + (1 - t) * log(1 - p)
    return -total / n

Formula: BCE = -(1/n) Σ[yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]

Where ŷᵢ is the predicted probability of the positive class (between 0 and 1).


6) Understanding binary cross-entropy

When y_true = 1:

  • Loss = -log(y_pred)
  • If y_pred = 1: loss = 0 (correct, confident)
  • If y_pred = 0.5: loss = 0.69 (uncertain)
  • If y_pred = 0.01: loss = 4.6 (wrong, confident = very bad!)

When y_true = 0:

  • Loss = -log(1 - y_pred)
  • Symmetric behavior

Key insight: Confident wrong predictions are heavily penalized.
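
A short sketch using the binary_cross_entropy function above on single-example batches reproduces the numbers in the bullets:

for p in (0.99, 0.5, 0.01):
    print(p, round(binary_cross_entropy([1], [p]), 3))
# 0.99 -> 0.01   nearly correct and confident: tiny loss
# 0.5  -> 0.693  uncertain
# 0.01 -> 4.605  confidently wrong: huge loss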


7) Categorical Cross-Entropy

For multi-class classification (K classes):

from math import log

def categorical_cross_entropy(y_true, y_pred):
    # y_true: list of one-hot vectors, y_pred: list of probability distributions
    eps = 1e-15  # avoid log(0)
    n = len(y_true)
    loss = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            loss -= t * log(p + eps)
    return loss / n

Formula: CCE = -(1/n) Σᵢ Σₖ yᵢₖ log(ŷᵢₖ)

The prediction ŷ should be a probability distribution (from softmax).
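
A minimal usage sketch with three classes (labels and probabilities invented for illustration); the second example is confidently wrong and dominates the average:

y_true = [[1, 0, 0],
          [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1],   # correct class gets most of the probability mass
          [0.8, 0.1, 0.1]]   # correct class gets only 0.1

print(categorical_cross_entropy(y_true, y_pred))  # (-log 0.7 - log 0.1) / 2 ≈ 1.33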


8) Softmax: Turning scores into probabilities

Before cross-entropy, convert raw scores (logits) to probabilities:

from math import exp

def softmax(logits):
    # exponentiate, then normalize so the outputs sum to 1
    exp_logits = [exp(x) for x in logits]
    sum_exp = sum(exp_logits)
    return [x / sum_exp for x in exp_logits]

# Example
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)  # ≈ [0.659, 0.242, 0.099]

Properties:

  • All outputs in (0, 1)
  • Sum to 1
  • Preserves ordering

9) Logits and numerical stability

Raw model outputs (before softmax) are called logits.

For numerical stability, compute log-softmax and cross-entropy together:

from math import exp, log

def log_softmax(logits):
    max_logit = max(logits)              # subtract the max for numerical stability
    shifted = [x - max_logit for x in logits]
    log_sum_exp = log(sum(exp(x) for x in shifted))
    return [x - log_sum_exp for x in shifted]

Libraries handle this; understand why it matters.
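
To see why, try both functions on large logits (the values are arbitrary): the naive softmax above overflows, while the shifted log-softmax stays finite.

big_logits = [1000.0, 999.0, 998.0]

# softmax(big_logits) raises OverflowError: exp(1000.0) does not fit in a float
print(log_softmax(big_logits))  # ≈ [-0.408, -1.408, -2.408], all finite

Because cross-entropy only needs the log-probability of the true class, the log-softmax and the loss are typically computed together in one step, as noted above.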


10) Hinge loss (SVM)

For support vector machines:

def hinge_loss(y_true, y_pred):
    # y_true in {-1, +1}
    return max(0, 1 - y_true * y_pred)

Properties:

  • Zero loss when the prediction has the correct sign and a margin of at least 1 (y · ŷ ≥ 1)
  • Linear penalty for violations
  • Leads to maximum margin classifiers
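
A small usage sketch of hinge_loss above, with invented labels and raw scores, shows how the margin requirement behaves:

examples = [(+1, 2.5), (+1, 0.3), (-1, -1.0), (-1, 0.4)]  # (label, raw score)

for y, score in examples:
    print(y, score, hinge_loss(y, score))
# +1,  2.5 -> loss 0    correct side, outside the margin
# +1,  0.3 -> loss 0.7  correct side, but inside the margin
# -1, -1.0 -> loss 0    exactly on the margin boundary
# -1,  0.4 -> loss 1.4  wrong side: penalty grows linearly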

11) Choosing the right loss function

Task                         Common Loss
Regression                   MSE, MAE, Huber
Binary classification        Binary cross-entropy
Multi-class classification   Categorical cross-entropy
Ranking                      Pairwise / listwise losses
Generative models            Reconstruction + KL divergence

The loss function encodes your goals. Choose wisely.


12) Loss landscapes

The loss function creates a surface over parameter space. Training navigates this landscape:

  • Convex loss (linear regression + MSE): One global minimum
  • Non-convex loss (neural nets): Many local minima, saddle points

Visualization helps intuition, but real landscapes are high-dimensional.
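
As a minimal 1-D sketch of the convex case: for a single-weight linear model trained with MSE, sweeping the weight traces out one bowl with a single minimum (the data and grid are invented for illustration):

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                # generated by y = 2x, so the best weight is 2

def loss_at(w):
    # MSE of the one-parameter model y_hat = w * x
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

for w in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(w, round(loss_at(w), 2))  # 18.67, 4.67, 0.0, 4.67, 18.67 — a single bowl

A neural network's loss over millions of weights has no such clean picture, which is exactly the non-convex case described above.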


Key takeaways

  • Loss functions measure prediction error; training minimizes them
  • MSE for regression: penalizes large errors, sensitive to outliers
  • MAE for regression: robust to outliers, less smooth
  • Binary cross-entropy for binary classification: punishes confident mistakes
  • Categorical cross-entropy for multi-class: requires softmax probabilities
  • Softmax converts logits to probability distribution
  • Loss choice shapes model behavior and gradients
