Model Evaluation Metrics

Lesson, slides, and applied problem sets.


Lesson


Why this module exists

Training a model is meaningless without knowing how well it performs. Evaluation metrics quantify model quality and guide improvement. Choosing the right metric depends on your task and what errors matter most.


1) Train/test split

Never evaluate on training data—you'll be fooled by overfitting.

# Split data: 80% train, 20% test
# (assumes the examples are already shuffled; shuffle first if they are ordered)
train_data = data[:int(0.8 * len(data))]
test_data = data[int(0.8 * len(data)):]

# Train on train_data, then evaluate on the held-out test_data
model.fit(train_data)
score = model.evaluate(test_data)

The model never sees test data during training.
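
In practice, libraries provide a shuffled, reproducible split. A minimal sketch using scikit-learn's train_test_split (assuming scikit-learn is installed and X, y hold features and labels):

from sklearn.model_selection import train_test_split

# Shuffle, then split: 80% train, 20% test; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)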


2) Accuracy

The simplest metric: fraction of correct predictions.

def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

Problems with accuracy:

  • Misleading with imbalanced classes
  • A model that always predicts the majority class scores 99% accuracy on a 99%-majority dataset yet is useless (see the sketch below)
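
A quick sketch of that failure mode, reusing the accuracy function above on a made-up imbalanced dataset:

# 1000 samples, only 10 positives
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000              # a "model" that always predicts the majority class

print(accuracy(y_true, y_pred))  # 0.99, yet it never detects a single positive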

3) Confusion matrix

A 2×2 table for binary classification:

                  Predicted
                  Neg    Pos
Actual   Neg      TN     FP
         Pos      FN     TP

TN = True Negative (correct rejection)
FP = False Positive (false alarm)
FN = False Negative (missed detection)
TP = True Positive (correct detection)

def confusion_matrix(y_true, y_pred):
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return [[TN, FP], [FN, TP]]
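
A quick usage sketch with hand-made labels:

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # [[3, 1], [1, 3]] = [[TN, FP], [FN, TP]]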

4) Precision

Of all positive predictions, how many are correct?

def precision(TP, FP):
    return TP / (TP + FP) if (TP + FP) > 0 else 0

Precision = TP / (TP + FP)

High precision: Few false positives (when you predict positive, you're usually right)

Important when false positives are costly:

  • Spam filters (don't send legitimate email to the spam folder)
  • Medical tests (don't tell a healthy person they're sick)

5) Recall (Sensitivity)

Of all actual positives, how many did we find?

def recall(TP, FN):
    return TP / (TP + FN) if (TP + FN) > 0 else 0

Recall = TP / (TP + FN)

High recall: Few false negatives (you find most of the positives)

Important when false negatives are costly:

  • Cancer detection (don't miss cancer cases)
  • Fraud detection (catch most fraud)
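
A small numeric sketch contrasting the two, using hypothetical counts for a screening-style model that raises many alarms:

TP, FP, FN = 8, 20, 2

print(precision(TP, FP))  # 0.29: most of its alarms are false positives
print(recall(TP, FN))     # 0.80: but it still finds 8 of the 10 actual positives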

6) The precision-recall tradeoff

You can't maximize both simultaneously:

  • Increase threshold → higher precision, lower recall
  • Decrease threshold → higher recall, lower precision

Threshold = 0.9:  P=0.95, R=0.40  (few predictions, but accurate)
Threshold = 0.5:  P=0.70, R=0.80  (balanced)
Threshold = 0.1:  P=0.30, R=0.98  (catch everything, many false alarms)

Choose based on which errors matter more.
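
A minimal sketch of the tradeoff, assuming the model outputs a probability score per sample (the scores and labels below are made up):

def predict_labels(scores, threshold):
    # Convert probability scores into hard 0/1 predictions
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.95, 0.80, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0]

for t in (0.9, 0.5, 0.1):
    preds = predict_labels(scores, t)
    TP = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    FP = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    FN = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    print(t, precision(TP, FP), recall(TP, FN))  # precision rises, recall falls as t grows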


7) F1 Score

Harmonic mean of precision and recall:

def f1_score(precision, recall):
    if precision + recall == 0:
        return 0
    return 2 * precision * recall / (precision + recall)

F1 = 2 × (P × R) / (P + R)

Properties:

  • Range: [0, 1]
  • High only if both P and R are high
  • Good single metric for imbalanced data
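
The harmonic mean punishes imbalance between P and R. For example, P = 1.0 and R = 0.2 gives F1 = 2 × 0.2 / 1.2 ≈ 0.33, far below the 0.6 an arithmetic mean would report:

print(f1_score(1.0, 0.2))  # 0.333..., high precision cannot hide poor recall
print(f1_score(0.8, 0.8))  # 0.8, balanced P and R give a high F1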

8) ROC curve and AUC

ROC (Receiver Operating Characteristic): Plot of TPR vs FPR at various thresholds.

TPR = TP / (TP + FN)  # True Positive Rate = Recall
FPR = FP / (FP + TN)  # False Positive Rate

AUC (Area Under Curve): Single number summarizing ROC.

  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: Random guessing
  • AUC < 0.5: Worse than random (the model's scores are systematically inverted)

AUC is threshold-independent—useful for comparing models.
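
A minimal sketch of computing both with scikit-learn (assuming it is installed; y_score holds the model's predicted probabilities for the positive class, made up here):

from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # the points that trace the ROC curve
print(roc_auc_score(y_true, y_score))               # about 0.89 for these made-up scores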


9) Regression metrics

MSE: Mean Squared Error

MSE = mean((y_true - y_pred) ** 2)

RMSE: Root Mean Squared Error (same units as target)

RMSE = sqrt(MSE)

MAE: Mean Absolute Error

MAE = mean(|y_true - y_pred|)

R² (Coefficient of determination):

R2 = 1 - (sum((y_true - y_pred)**2) / sum((y_true - mean(y_true))**2))

  • R² = 1: Perfect predictions
  • R² = 0: No better than always predicting the mean
  • R² < 0: Worse than predicting the mean
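
A minimal sketch computing all four with NumPy (assuming NumPy is available; the values are made up):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse  = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)
mae  = np.mean(np.abs(y_true - y_pred))
r2   = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, rmse, mae, r2)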

10) Cross-validation

One train/test split can be lucky or unlucky. Cross-validation is more robust:

K-fold cross-validation:

  1. Split data into K folds
  2. For each fold:
    • Train on K-1 folds
    • Test on remaining fold
  3. Average the K scores

from statistics import mean, stdev

def k_fold_cv(X, y, model, k=5):
    fold_size = len(X) // k
    scores = []

    for i in range(k):
        # Fold i is the held-out test set; everything else is training data
        test_start = i * fold_size
        test_end = test_start + fold_size

        X_test = X[test_start:test_end]
        y_test = y[test_start:test_end]
        # X and y are plain Python lists here, so + concatenates the remaining folds
        X_train = X[:test_start] + X[test_end:]
        y_train = y[:test_start] + y[test_end:]

        model.fit(X_train, y_train)
        scores.append(model.evaluate(X_test, y_test))

    return mean(scores), stdev(scores)
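
If scikit-learn is available, cross_val_score runs the same fold-train-score loop for you (a sketch, assuming X and y are arrays and a scikit-learn estimator; LogisticRegression is used purely for illustration):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=5)   # one score per fold
print(scores.mean(), scores.std())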

11) Choosing the right metric

Task                        Primary metric   Why
Balanced classification     Accuracy         All classes equal
Imbalanced classification   F1, AUC          Accuracy misleading
FP costly (spam)            Precision        Avoid false alarms
FN costly (cancer)          Recall           Don't miss positives
Regression                  RMSE, R²         Interpretable units

Always consider the business context.


12) Overfitting detection

Compare train and test metrics:

Train   Test   Diagnosis
Low     Low    Underfitting
High    Low    Overfitting
High    High   Good fit

Large gap between train and test = overfitting.
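
A sketch of the check, reusing the hypothetical model and split from section 1:

train_score = model.evaluate(train_data)
test_score = model.evaluate(test_data)

# A large gap (e.g. 0.99 train vs 0.70 test) suggests overfitting;
# low scores on both suggest underfitting.
print(train_score, test_score, train_score - test_score)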


Key takeaways

  • Never evaluate on training data
  • Accuracy is misleading for imbalanced data
  • Confusion matrix reveals error types
  • Precision: few false positives; Recall: few false negatives
  • F1 balances precision and recall
  • AUC is threshold-independent comparison metric
  • Cross-validation gives robust estimates
  • Choose metrics based on what errors matter
