Model Evaluation Metrics

Lesson, slides, and applied problem sets.


Lesson


Why this module exists

Training a model is meaningless without knowing how well it performs. Evaluation metrics quantify model quality and guide improvement. Choosing the right metric depends on your task and what errors matter most.


1) Train/test split

Never evaluate on training data—you'll be fooled by overfitting.

# Split data: 80% train, 20% test
# (assumes the examples are already shuffled; shuffle first if they are ordered)
train_data = data[:int(0.8 * len(data))]
test_data = data[int(0.8 * len(data)):]

# Train on train_data, then evaluate on the held-out test_data
model.fit(train_data)
score = model.evaluate(test_data)

The model never sees test data during training.
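
In practice, libraries provide a shuffled, reproducible split. A minimal sketch using scikit-learn's train_test_split (assuming scikit-learn is installed and X, y hold features and labels):

from sklearn.model_selection import train_test_split

# Shuffle, then split: 80% train, 20% test; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)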


2) Accuracy

The simplest metric: fraction of correct predictions.

def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

Problems with accuracy:

  • Misleading with imbalanced classes
  • A model that always predicts the majority class scores 99% accuracy on a 99%-majority dataset yet is useless (see the sketch below)
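
A quick sketch of that failure mode, reusing the accuracy function above on a made-up imbalanced dataset:

# 1000 samples, only 10 positives
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000              # a "model" that always predicts the majority class

print(accuracy(y_true, y_pred))  # 0.99, yet it never detects a single positive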

3) Confusion matrix

A 2×2 table for binary classification:

                  Predicted
                  Neg    Pos
Actual   Neg      TN     FP
         Pos      FN     TP

TN = True Negative (correct rejection)
FP = False Positive (false alarm)
FN = False Negative (missed detection)
TP = True Positive (correct detection)

def confusion_matrix(y_true, y_pred):
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return [[TN, FP], [FN, TP]]
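
A quick usage sketch with hand-made labels:

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # [[3, 1], [1, 3]] = [[TN, FP], [FN, TP]]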

4) Precision

Of all positive predictions, how many are correct?

def precision(TP, FP):
    return TP / (TP + FP) if (TP + FP) > 0 else 0

Precision = TP / (TP + FP)

High precision: Few false positives (when you predict positive, you're usually right)

Important when false positives are costly:

  • Spam filters (don't send legitimate email to the spam folder)
  • Medical tests (don't tell a healthy person they're sick)

5) Recall (Sensitivity)

Of all actual positives, how many did we find?

def recall(TP, FN):
    return TP / (TP + FN) if (TP + FN) > 0 else 0

Recall = TP / (TP + FN)

High recall: Few false negatives (you find most of the positives)

Important when false negatives are costly:

  • Cancer detection (don't miss cancer cases)
  • Fraud detection (catch most fraud)
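
A small numeric sketch contrasting the two, using hypothetical counts for a screening-style model that raises many alarms:

TP, FP, FN = 8, 20, 2

print(precision(TP, FP))  # 0.29: most of its alarms are false positives
print(recall(TP, FN))     # 0.80: but it still finds 8 of the 10 actual positives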

6) The precision-recall tradeoff

You can't maximize both simultaneously:

  • Increase threshold → higher precision, lower recall
  • Decrease threshold → higher recall, lower precision

Threshold = 0.9:  P=0.95, R=0.40  (few predictions, but accurate)
Threshold = 0.5:  P=0.70, R=0.80  (balanced)
Threshold = 0.1:  P=0.30, R=0.98  (catch everything, many false alarms)

Choose based on which errors matter more.
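
A minimal sketch of the tradeoff, assuming the model outputs a probability score per sample (the scores and labels below are made up):

def predict_labels(scores, threshold):
    # Convert probability scores into hard 0/1 predictions
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.95, 0.80, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0]

for t in (0.9, 0.5, 0.1):
    preds = predict_labels(scores, t)
    TP = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    FP = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    FN = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    print(t, precision(TP, FP), recall(TP, FN))  # precision rises, recall falls as t grows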


7) F1 Score

Harmonic mean of precision and recall:

def f1_score(precision, recall):
    if precision + recall == 0:
        return 0
    return 2 * precision * recall / (precision + recall)

F1 = 2 × (P × R) / (P + R)

Properties:

  • Range: [0, 1]
  • High only if both P and R are high
  • Good single metric for imbalanced data
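
The harmonic mean punishes imbalance between P and R. For example, P = 1.0 and R = 0.2 gives F1 = 2 × 0.2 / 1.2 ≈ 0.33, far below the 0.6 an arithmetic mean would report:

print(f1_score(1.0, 0.2))  # 0.333..., high precision cannot hide poor recall
print(f1_score(0.8, 0.8))  # 0.8, balanced P and R give a high F1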

8) ROC curve and AUC

ROC (Receiver Operating Characteristic): Plot of TPR vs FPR at various thresholds.

TPR = TP / (TP + FN)  # True Positive Rate = Recall
FPR = FP / (FP + TN)  # False Positive Rate

AUC (Area Under Curve): Single number summarizing ROC.

  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: Random guessing
  • AUC < 0.5: Worse than random (the model's scores are systematically inverted)

AUC is threshold-independent—useful for comparing models.
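
A minimal sketch of computing both with scikit-learn (assuming it is installed; y_score holds the model's predicted probabilities for the positive class, made up here):

from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # the points that trace the ROC curve
print(roc_auc_score(y_true, y_score))               # about 0.89 for these made-up scores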


9) Regression metrics

MSE: Mean Squared Error

MSE = mean((y_true - y_pred) ** 2)

RMSE: Root Mean Squared Error (same units as target)

RMSE = sqrt(MSE)

MAE: Mean Absolute Error

MAE = mean(|y_true - y_pred|)

R² (Coefficient of determination):

R2 = 1 - (sum((y_true - y_pred)**2) / sum((y_true - mean(y_true))**2))

  • R² = 1: Perfect predictions
  • R² = 0: No better than always predicting the mean
  • R² < 0: Worse than predicting the mean
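
A minimal sketch computing all four with NumPy (assuming NumPy is available; the values are made up):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse  = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mae  = np.mean(np.abs(y_true - y_pred))
r2   = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, rmse, mae, r2)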

10) Cross-validation

One train/test split can be lucky or unlucky. Cross-validation is more robust:

K-fold cross-validation:

  1. Split data into K folds
  2. For each fold:
    • Train on K-1 folds
    • Test on remaining fold
  3. Average the K scores

from statistics import mean, stdev

def k_fold_cv(X, y, model, k=5):
    fold_size = len(X) // k
    scores = []

    for i in range(k):
        # Fold i is the held-out test set; everything else is training data
        test_start = i * fold_size
        test_end = test_start + fold_size

        X_test = X[test_start:test_end]
        y_test = y[test_start:test_end]
        # X and y are plain Python lists here, so + concatenates the remaining folds
        X_train = X[:test_start] + X[test_end:]
        y_train = y[:test_start] + y[test_end:]

        model.fit(X_train, y_train)
        scores.append(model.evaluate(X_test, y_test))

    return mean(scores), stdev(scores)
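
If scikit-learn is available, cross_val_score runs the same fold-train-score loop for you (a sketch, assuming X and y are arrays and a scikit-learn estimator; LogisticRegression is used purely for illustration):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)   # one score per fold
print(scores.mean(), scores.std())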

11) Choosing the right metric

Task                        Primary metric   Why
Balanced classification     Accuracy         All classes equal
Imbalanced classification   F1, AUC          Accuracy misleading
FP costly (spam)            Precision        Avoid false alarms
FN costly (cancer)          Recall           Don't miss positives
Regression                  RMSE, R²         Interpretable units

Always consider the business context.


12) Overfitting detection

Compare train and test metrics:

Train   Test   Diagnosis
Low     Low    Underfitting
High    Low    Overfitting
High    High   Good fit

Large gap between train and test = overfitting.
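
A sketch of the check, reusing the hypothetical model and split from section 1:

train_score = model.evaluate(train_data)
test_score = model.evaluate(test_data)

# A large gap (e.g. 0.99 train vs 0.70 test) suggests overfitting;
# low scores on both suggest underfitting.
print(train_score, test_score, train_score - test_score)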


Key takeaways

  • Never evaluate on training data
  • Accuracy is misleading for imbalanced data
  • Confusion matrix reveals error types
  • Precision: few false positives; Recall: few false negatives
  • F1 balances precision and recall
  • AUC is threshold-independent comparison metric
  • Cross-validation gives robust estimates
  • Choose metrics based on what errors matter
