Model Evaluation Metrics
Lesson, slides, and applied problem sets.
Why this module exists
Training a model is meaningless without knowing how well it performs. Evaluation metrics quantify model quality and guide improvement. Choosing the right metric depends on your task and what errors matter most.
1) Train/test split
Never evaluate on training data—you'll be fooled by overfitting.
# Split data: 80% train, 20% test
train_data = data[:int(0.8 * len(data))]
test_data = data[int(0.8 * len(data)):]
# Train on train_data, evaluate on test_data
model.fit(train_data)
score = model.evaluate(test_data)
The model never sees test data during training.
2) Accuracy
The simplest metric: fraction of correct predictions.
def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
Problems with accuracy:
- Misleading with imbalanced classes
- A model that always predicts the majority class scores 99% accuracy on a 99/1 split while learning nothing
3) Confusion matrix
A 2×2 table for binary classification:
| | Predicted Neg | Predicted Pos |
|---|---|---|
| Actual Neg | TN | FP |
| Actual Pos | FN | TP |
TN = True Negative (correct rejection)
FP = False Positive (false alarm)
FN = False Negative (missed detection)
TP = True Positive (correct detection)
def confusion_matrix(y_true, y_pred):
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return [[TN, FP], [FN, TP]]
4) Precision
Of all positive predictions, how many are correct?
def precision(TP, FP):
    return TP / (TP + FP) if (TP + FP) > 0 else 0
Precision = TP / (TP + FP)
High precision: Few false positives (when you predict positive, you're usually right)
Important when false positives are costly:
- Spam filtering (don't send legitimate email to the spam folder)
- Medical testing (don't tell a healthy person they're sick)
5) Recall (Sensitivity)
Of all actual positives, how many did we find?
def recall(TP, FN):
    return TP / (TP + FN) if (TP + FN) > 0 else 0
Recall = TP / (TP + FN)
High recall: Few false negatives (you find most of the positives)
Important when false negatives are costly:
- Cancer detection (don't miss cancer cases)
- Fraud detection (catch most fraud)
6) The precision-recall tradeoff
You can't maximize both simultaneously:
- Increase threshold → higher precision, lower recall
- Decrease threshold → higher recall, lower precision
For example, with illustrative numbers for one model:
- Threshold = 0.9: P = 0.95, R = 0.40 (few predictions, but accurate)
- Threshold = 0.5: P = 0.70, R = 0.80 (balanced)
- Threshold = 0.1: P = 0.30, R = 0.98 (catch everything, many false alarms)
Choose based on which errors matter more.
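A minimal sketch of the tradeoff, assuming the model outputs a probability score per example; the scores and labels below are made-up toy values, not from a real model:

def precision_recall_at(threshold, y_true, y_scores):
    # Turn scores into hard predictions at this threshold, then count error types
    y_pred = [1 if s >= threshold else 0 for s in y_scores]
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    P = TP / (TP + FP) if (TP + FP) > 0 else 0
    R = TP / (TP + FN) if (TP + FN) > 0 else 0
    return P, R

y_scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]  # toy data
y_true = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
for threshold in (0.9, 0.5, 0.1):
    P, R = precision_recall_at(threshold, y_true, y_scores)
    print(f"threshold={threshold}: precision={P:.2f}, recall={R:.2f}")

Raising the threshold shrinks the set of positive predictions (precision up, recall down); lowering it does the opposite.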
7) F1 Score
Harmonic mean of precision and recall:
def f1_score(precision, recall):
    if precision + recall == 0:
        return 0
    return 2 * precision * recall / (precision + recall)
F1 = 2 × (P × R) / (P + R)
Properties:
- Range: [0, 1]
- High only if both P and R are high
- Good single metric for imbalanced data
8) ROC curve and AUC
ROC (Receiver Operating Characteristic): Plot of TPR vs FPR at various thresholds.
TPR = TP / (TP + FN) # True Positive Rate = Recall
FPR = FP / (FP + TN) # False Positive Rate
AUC (Area Under Curve): Single number summarizing ROC.
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random guessing
- AUC < 0.5: Worse than random (the ranking is inverted; flipping the predicted labels would beat chance)
AUC is threshold-independent—useful for comparing models.
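A minimal sketch of tracing the ROC curve and approximating AUC with the trapezoidal rule, assuming score-based predictions and that both classes appear in y_true (the roc_auc name and structure here are just for this example):

def roc_auc(y_true, y_scores):
    # Sweep every observed score as a threshold, from high to low
    thresholds = sorted(set(y_scores), reverse=True)
    P = sum(1 for t in y_true if t == 1)  # actual positives
    N = len(y_true) - P                   # actual negatives
    points = [(0.0, 0.0)]                 # (FPR, TPR) pairs
    for thr in thresholds:
        y_pred = [1 if s >= thr else 0 for s in y_scores]
        TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        points.append((FP / N, TP / P))
    points.append((1.0, 1.0))
    # Area under the piecewise-linear curve (trapezoidal rule)
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc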
9) Regression metrics
MSE: Mean Squared Error
MSE = mean((y_true - y_pred) ** 2)
RMSE: Root Mean Squared Error (same units as target)
RMSE = sqrt(MSE)
MAE: Mean Absolute Error
MAE = mean(|y_true - y_pred|)
R² (Coefficient of determination):
R2 = 1 - (sum((y_true - y_pred)**2) / sum((y_true - mean(y_true))**2))
- R² = 1: Perfect predictions
- R² = 0: No better than always predicting the mean
- R² < 0: Worse than predicting the mean
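A minimal sketch computing all four regression metrics from plain Python lists:

from math import sqrt

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    rmse = sqrt(mse)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    # R^2 compares the model's squared error to that of always predicting the mean
    y_mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}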
10) Cross-validation
One train/test split can be lucky or unlucky. Cross-validation is more robust:
K-fold cross-validation:
- Split data into K folds
- For each fold:
  - Train on the other K-1 folds
  - Test on the held-out fold
- Average the K scores
from statistics import mean, stdev

def k_fold_cv(X, y, model, k=5):
    fold_size = len(X) // k
    scores = []
    for i in range(k):
        # Fold i is the test set; everything else is the training set
        test_start = i * fold_size
        test_end = test_start + fold_size
        X_test, y_test = X[test_start:test_end], y[test_start:test_end]
        X_train, y_train = X[:test_start] + X[test_end:], y[:test_start] + y[test_end:]
        model.fit(X_train, y_train)
        scores.append(model.evaluate(X_test, y_test))
    return mean(scores), stdev(scores)
11) Choosing the right metric
| Task | Primary Metric | Why |
|---|---|---|
| Balanced classification | Accuracy | All classes equal |
| Imbalanced classification | F1, AUC | Accuracy misleading |
| FP costly (spam) | Precision | Avoid false alarms |
| FN costly (cancer) | Recall | Don't miss positives |
| Regression | RMSE, R² | Interpretable units |
Always consider the business context.
12) Overfitting detection
Compare train and test metrics:
| Train | Test | Diagnosis |
|---|---|---|
| Low | Low | Underfitting |
| High | Low | Overfitting |
| High | High | Good fit |
Large gap between train and test = overfitting.
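A minimal sketch of the check, assuming a trained model with a predict method (an assumption, not shown earlier), the accuracy function from section 2, and the held-out splits from section 1; the 0.10 gap cutoff is an arbitrary rule of thumb:

train_acc = accuracy(y_train, model.predict(X_train))  # model.predict is assumed here
test_acc = accuracy(y_test, model.predict(X_test))
gap = train_acc - test_acc
print(f"train={train_acc:.2f}, test={test_acc:.2f}, gap={gap:.2f}")
if gap > 0.10:  # arbitrary rule-of-thumb cutoff, tune for your task
    print("Large train-test gap: likely overfitting")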
Key takeaways
- Never evaluate on training data
- Accuracy is misleading for imbalanced data
- Confusion matrix reveals error types
- Precision: few false positives; Recall: few false negatives
- F1 balances precision and recall
- AUC is threshold-independent comparison metric
- Cross-validation gives robust estimates
- Choose metrics based on what errors matter