Classification: Decision Boundaries
Why this module exists
Classification is predicting categories: spam or not spam, which digit, what disease. It's one of the most common ML tasks. Understanding classification means understanding decision boundaries, probability outputs, and how models separate classes.
1) Classification vs regression
- Regression: predict continuous values (price, temperature)
- Classification: predict discrete categories (spam vs. not spam, cat vs. dog, digits 0-9)
Classification can be:
- Binary: Two classes (yes/no, positive/negative)
- Multi-class: Many classes (digit recognition: 0-9)
- Multi-label: Multiple labels per sample (tags on a photo)
2) The decision boundary
A classifier learns a boundary that separates classes in feature space.
   Class A      |      Class B
                |
   x   x        |   o   o   o
    x  x   x    |     o   o
   x    x       |   o     o
                |
        (decision boundary)
The boundary can be:
- Linear: A straight line (or hyperplane in higher dimensions)
- Non-linear: Curves, complex shapes
3) Linear classifiers
A linear classifier uses a weighted sum of features:
score = b + w1*x1 + w2*x2 + ... + wn*xn
      = w · x + b        # dot product of weights and features, plus a bias term
Decision rule:
- If score > 0: predict class 1
- If score ≤ 0: predict class 0
The weights define the decision boundary orientation.
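A minimal sketch in plain Python, assuming nothing beyond the standard library (the dot and predict_binary helpers are introduced here for illustration):

def dot(w, x):
    # Dot product of two equal-length lists
    return sum(wi * xi for wi, xi in zip(w, x))

def predict_binary(x, w, b):
    # Linear score: w · x + b
    score = dot(w, x) + b
    # Decision rule: class 1 if the score is positive, else class 0
    return 1 if score > 0 else 0

# Example with a made-up boundary x1 + x2 - 1 = 0:
# predict_binary([0.9, 0.5], w=[1.0, 1.0], b=-1.0)  -> 1
# predict_binary([0.2, 0.3], w=[1.0, 1.0], b=-1.0)  -> 0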
4) From scores to probabilities
Raw scores are hard to interpret. We convert to probabilities.
Sigmoid function (for binary classification):
from math import exp

def sigmoid(z):
    # Squashes any real-valued score into (0, 1)
    return 1 / (1 + exp(-z))

# Properties:
# sigmoid(0) = 0.5
# sigmoid(large positive) → 1
# sigmoid(large negative) → 0
Output is P(class=1 | features).
5) Logistic regression
Despite the name, it's a classification algorithm:
def logistic_regression_predict(x, w, b):
    # Linear score, then squash with sigmoid (uses the dot and sigmoid helpers above)
    z = dot(w, x) + b
    prob = sigmoid(z)
    return prob  # P(class=1 | x)
Training minimizes binary cross-entropy:
L = -[y log(p) + (1-y) log(1-p)]
Gradient update:
∂L/∂w = (p - y) × x
w = w - lr × (p - y) × x
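Written out as code, one stochastic gradient step for a single example might look like this (sgd_step is an illustrative helper built on the dot and sigmoid functions above, not a standard API):

def sgd_step(x, y, w, b, lr=0.1):
    # Forward pass: predicted probability for this one example
    p = sigmoid(dot(w, x) + b)
    # Gradient of the binary cross-entropy w.r.t. the score is (p - y)
    error = p - y
    # Move weights and bias a small step against the gradient
    w = [wj - lr * error * xj for wj, xj in zip(w, x)]
    b = b - lr * error
    return w, b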
6) Multi-class classification
For K classes, predict a probability distribution:
scores = [w1·x, w2·x, ..., wK·x]   # one score per class (K scores in total)
probs = softmax(scores) # sum to 1
prediction = argmax(probs) # class with highest prob
Softmax:
def softmax(scores):
    # Exponentiate each score, then normalize so the outputs sum to 1
    # (in practice, subtract max(scores) first for numerical stability)
    exp_scores = [exp(s) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]
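Putting scores, softmax, and argmax together for one input (the weight matrix W and biases below are made-up values for illustration):

# Three classes, each with its own weight vector and bias
W = [[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]]
biases = [0.0, 0.1, -0.2]
x = [0.5, 1.0]

scores = [dot(w_k, x) + b_k for w_k, b_k in zip(W, biases)]
probs = softmax(scores)                                       # sums to 1
prediction = max(range(len(probs)), key=lambda k: probs[k])   # argmax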
7) Cross-entropy loss for multi-class
from math import log

def cross_entropy(y_true, probs):
    # y_true is the index of the correct class
    return -log(probs[y_true])
This encourages high probability on the correct class.
Over a batch:
L = -(1/n) Σ log(p[y_true_i])
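A small sketch of the batch-averaged version (batch_cross_entropy is an illustrative name, not a library function):

def batch_cross_entropy(y_true_batch, probs_batch):
    # Mean negative log-probability assigned to each correct class
    losses = [-log(probs[y]) for y, probs in zip(y_true_batch, probs_batch)]
    return sum(losses) / len(losses)

# batch_cross_entropy([2, 0], [[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]])
# = -(log(0.7) + log(0.8)) / 2  ≈ 0.29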
8) One-vs-all (OvA) classification
Another approach for multi-class: train K binary classifiers.
- Classifier 1: class 0 vs rest
- Classifier 2: class 1 vs rest
- ...
- Classifier K: class K-1 vs rest
At prediction, use the classifier with highest confidence.
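A minimal sketch of OvA prediction, assuming each binary model exposes a predict_proba(x) method that returns P(its class | x); that interface is an assumption for illustration, not a specific library's API:

def one_vs_all_predict(x, binary_models):
    # Score x with every "class k vs. rest" classifier
    confidences = [model.predict_proba(x) for model in binary_models]
    # Predict the class whose classifier is most confident
    return max(range(len(confidences)), key=lambda k: confidences[k])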
9) Training a logistic regression model
def train_logistic_regression(X, y, lr=0.01, epochs=1000):
    # X: list of feature vectors, y: list of 0/1 labels
    # Uses the dot and sigmoid helpers defined above
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for epoch in range(epochs):
        for i in range(len(X)):
            # Forward pass
            z = dot(w, X[i]) + b
            p = sigmoid(z)
            # Gradient of binary cross-entropy w.r.t. the score
            error = p - y[i]
            # Stochastic gradient descent update
            for j in range(n_features):
                w[j] -= lr * error * X[i][j]
            b -= lr * error
    return w, b
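Trying it on a tiny made-up dataset (values chosen only for illustration):

# Toy data: label 1 roughly when x1 + x2 is large
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]

w, b = train_logistic_regression(X, y, lr=0.1, epochs=2000)
print(logistic_regression_predict([0.1, 0.1], w, b))   # low probability
print(logistic_regression_predict([0.9, 0.9], w, b))   # high probability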
10) Non-linear decision boundaries
Linear models can't separate non-linear patterns (like XOR).
Solutions:
- Feature engineering: add x², x×y, etc. (see the sketch after this list)
- Kernel methods: Implicit high-dimensional mapping
- Neural networks: Learn non-linear transformations
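As an example of feature engineering, XOR-like data is not linearly separable in (x1, x2), but adding the product feature x1×x2 makes it separable; add_interaction below is a hypothetical helper:

def add_interaction(x):
    # Expand [x1, x2] into [x1, x2, x1*x2]
    x1, x2 = x
    return [x1, x2, x1 * x2]

# XOR pattern: label 1 when exactly one input is 1
X_xor = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_xor = [0, 1, 1, 0]
X_expanded = [add_interaction(x) for x in X_xor]

# In the expanded space the classes are linearly separable, e.g. the score
# x1 + x2 - 2*(x1*x2) - 0.5 is positive exactly for the two XOR-true points.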
11) Margin and confidence
The margin is the distance from a point to the decision boundary.
- Large margin: confident prediction
- Small margin: uncertain (near the boundary)
Support Vector Machines (SVMs) maximize the margin—they find the boundary with maximum separation.
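For a linear classifier, the geometric distance from a point to the boundary w·x + b = 0 is |w·x + b| / ||w||; a quick sketch using the dot helper from above:

from math import sqrt

def distance_to_boundary(x, w, b):
    # Geometric distance |w · x + b| / ||w|| for a linear boundary
    score = dot(w, x) + b
    norm_w = sqrt(sum(wj * wj for wj in w))
    return abs(score) / norm_w

# Large distance -> confident prediction; near zero -> close to the boundary.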
12) Probabilistic interpretation
Logistic regression tends to produce well-calibrated probabilities:
- P(spam | email) = 0.95 means the model estimates a 95% chance the email is spam
This is useful for:
- Ranking (sort by probability)
- Threshold adjustment (flag if p > 0.9)
- Combining with other information
Not all classifiers produce well-calibrated probabilities; logistic regression usually comes close out of the box.
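A small sketch of ranking and threshold adjustment on top of predicted probabilities (all names and values below are made up for illustration):

# Suppose probs[i] = P(spam | emails[i]) from a trained classifier
emails = ["a", "b", "c", "d"]
probs = [0.05, 0.97, 0.60, 0.92]

# Ranking: most likely spam first
ranked = sorted(zip(emails, probs), key=lambda pair: pair[1], reverse=True)

# Threshold adjustment: only flag when the model is very confident
flagged = [e for e, p in zip(emails, probs) if p > 0.9]   # ["b", "d"]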
13) Class imbalance
When one class is much more common:
- The model may learn to always predict the majority class
- Accuracy becomes misleading (a model that always predicts class 0 scores 99% accuracy when 99% of samples are class 0)
Solutions:
- Resampling: Oversample minority or undersample majority
- Class weights: weight minority-class mistakes more heavily (equivalently, down-weight majority-class mistakes); see the sketch after this list
- Different metrics: Use F1, AUC instead of accuracy
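One way to apply class weights is to scale the per-example loss; a sketch based on the binary cross-entropy from section 5, with illustrative weight values:

def weighted_bce(y, p, weight_pos=10.0, weight_neg=1.0):
    # Up-weight mistakes on the rare positive class
    if y == 1:
        return -weight_pos * log(p)
    return -weight_neg * log(1 - p)

# With a 1:99 imbalance, setting weight_pos near 99 roughly balances the
# two classes' total contribution to the loss.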
Key takeaways
- Classification predicts discrete categories
- Decision boundaries separate classes in feature space
- Sigmoid converts scores to binary probabilities
- Softmax converts scores to multi-class probabilities
- Logistic regression: simple, interpretable, usually well-calibrated
- Cross-entropy loss trains classifiers
- Handle class imbalance carefully