Classification: Decision Boundaries

Lesson, slides, and applied problem sets.

View Slides

Lesson

Classification: Decision Boundaries

Why this module exists

Classification is predicting categories: spam or not spam, which digit, what disease. It's one of the most common ML tasks. Understanding classification means understanding decision boundaries, probability outputs, and how models separate classes.


1) Classification vs regression

  • Regression: Predict continuous values (price, temperature)
  • Classification: Predict discrete categories (spam or not, cat vs. dog, digits 0-9)

Classification can be:

  • Binary: Two classes (yes/no, positive/negative)
  • Multi-class: Many classes (digit recognition: 0-9)
  • Multi-label: Multiple labels per sample (tags on a photo)
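
As an illustrative sketch of how the targets differ between these setups, the label arrays might look like this (the values and tag names are made up):

# Binary: one label per sample, 0 or 1
y_binary = [0, 1, 1, 0]

# Multi-class: one label per sample, drawn from K classes (digits 0-9 here)
y_multiclass = [3, 7, 0, 9]

# Multi-label: a set of tags per sample (can be empty or have several)
y_multilabel = [{"beach", "sunset"}, {"dog"}, set(), {"dog", "beach"}]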

2) The decision boundary

A classifier learns a boundary that separates classes in feature space.

   Class A  |  Class B
            |
     x x    |  o o o
    x x x   | o o
   x x      |   o o
            |
      decision boundary

The boundary can be:

  • Linear: A straight line (or hyperplane in higher dimensions)
  • Non-linear: Curves, complex shapes

3) Linear classifiers

A linear classifier uses a weighted sum of features:

score = w1*x1 + w2*x2 + ... + wn*xn + b
      = w · x + b  # dot product + bias

Decision rule:

  • If score > 0: predict class 1
  • If score ≤ 0: predict class 0

The weights define the decision boundary orientation.
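
As a minimal sketch of this decision rule in plain Python (the function name is illustrative):

def linear_classifier_predict(x, w, b):
    # Score: weighted sum of features (dot product) plus bias
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Decision rule: positive score -> class 1, otherwise class 0
    return 1 if score > 0 else 0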


4) From scores to probabilities

Raw scores are hard to interpret. We convert to probabilities.

Sigmoid function (for binary classification):

from math import exp

def sigmoid(z):
    # Squash any real-valued score into the range (0, 1)
    return 1 / (1 + exp(-z))

# Properties:
# sigmoid(0) = 0.5
# sigmoid(large positive) → 1
# sigmoid(large negative) → 0

Output is P(class=1 | features).


5) Logistic regression

Despite the name, it's a classification algorithm:

def logistic_regression_predict(x, w, b):
    # Weighted sum of features plus bias, then sigmoid
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    prob = sigmoid(z)
    return prob  # P(class=1 | x)

Training minimizes binary cross-entropy:

L = -[y log(p) + (1-y) log(1-p)]

Gradient update:

∂L/∂w = (p - y) × x
w = w - lr × (p - y) × x
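
Putting the loss and the gradient together, one stochastic-gradient step on a single example can be sketched as follows (this reuses the sigmoid defined above; the function name is illustrative):

from math import log

def logistic_step(x, y, w, b, lr=0.1):
    # Forward pass
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)

    # Binary cross-entropy for this example
    loss = -(y * log(p) + (1 - y) * log(1 - p))

    # dL/dz = p - y, so dL/dw_j = (p - y) * x_j and dL/db = (p - y)
    error = p - y
    w = [wj - lr * error * xj for wj, xj in zip(w, x)]
    b = b - lr * error
    return w, b, loss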

6) Multi-class classification

For K classes, predict a probability distribution:

scores = [w0·x, w1·x, ..., w(K-1)·x]  # one score per class (K classes)
probs = softmax(scores)           # sum to 1
prediction = argmax(probs)        # class with highest prob

Softmax:

from math import exp

def softmax(scores):
    # Subtract the max score for numerical stability (the result is unchanged)
    exp_scores = [exp(s - max(scores)) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]
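
Putting the pieces together, the scores → probabilities → prediction pipeline can be sketched like this (it reuses the softmax above; W as a list of K weight vectors and b as a list of K biases are illustrative choices):

def multiclass_predict(x, W, b):
    # One linear score per class
    scores = [sum(wi * xi for wi, xi in zip(w_k, x)) + b_k
              for w_k, b_k in zip(W, b)]
    probs = softmax(scores)   # probabilities summing to 1
    # argmax: index of the class with the highest probability
    return max(range(len(probs)), key=lambda k: probs[k])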

7) Cross-entropy loss for multi-class

from math import log

def cross_entropy(y_true, probs):
    # y_true is the index of the correct class
    return -log(probs[y_true])

This encourages high probability on the correct class.

Over a batch:

L = -(1/n) Σ log(p[y_true_i])
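
The batch version is a direct translation of that formula (a sketch, reusing the cross_entropy function above):

def batch_cross_entropy(y_true_batch, probs_batch):
    # Average the per-example losses over the batch
    losses = [cross_entropy(y, p) for y, p in zip(y_true_batch, probs_batch)]
    return sum(losses) / len(losses)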

8) One-vs-all (OvA) classification

Another approach for multi-class: train K binary classifiers.

  • Classifier 1: class 0 vs rest
  • Classifier 2: class 1 vs rest
  • ...
  • Classifier K: class K-1 vs rest

At prediction, use the classifier with highest confidence.
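
A minimal sketch of OvA prediction, assuming each binary classifier is stored as a (w, b) pair and scored linearly as in section 3 (names are illustrative):

def ova_predict(x, classifiers):
    # classifiers: list of (w, b) pairs, one per class; list index = class label
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in classifiers]
    # Pick the class whose binary classifier is most confident
    return max(range(len(scores)), key=lambda k: scores[k])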


9) Training a logistic regression model

def train_logistic_regression(X, y, lr=0.01, epochs=1000):
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0

    for epoch in range(epochs):
        for i in range(len(X)):
            # Forward pass: linear score, then sigmoid (defined above)
            z = sum(w[j] * X[i][j] for j in range(n_features)) + b
            p = sigmoid(z)

            # Gradient of binary cross-entropy w.r.t. the score: (p - y)
            error = p - y[i]

            # Stochastic gradient descent update
            for j in range(n_features):
                w[j] -= lr * error * X[i][j]
            b -= lr * error

    return w, b
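
A quick usage sketch on a tiny, made-up dataset (values chosen only for illustration):

# Two features; class 1 roughly when x1 + x2 is large
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]

w, b = train_logistic_regression(X, y, lr=0.1, epochs=2000)
print(logistic_regression_predict([0.1, 0.1], w, b))  # low P(class=1)
print(logistic_regression_predict([0.9, 0.9], w, b))  # high P(class=1)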

10) Non-linear decision boundaries

Linear models can't separate non-linear patterns (like XOR).

Solutions:

  • Feature engineering: Add derived features such as x², x×y (see the sketch after this list)
  • Kernel methods: Implicit high-dimensional mapping
  • Neural networks: Learn non-linear transformations
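
For instance, XOR becomes linearly separable once a product feature is added (a sketch; the boundary weights below are just one choice that happens to work):

def add_product_feature(x):
    # Original features plus the interaction term x1 * x2
    x1, x2 = x
    return [x1, x2, x1 * x2]

# XOR: label is 1 when exactly one input is 1 (not linearly separable in 2-D)
X_xor = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_xor = [0, 1, 1, 0]

# In the 3-D space [x1, x2, x1*x2] a linear boundary exists, e.g.
# score = x1 + x2 - 2*(x1*x2) - 0.5 is positive exactly for the XOR=1 points.
for x, label in zip(X_xor, y_xor):
    x1, x2, x1x2 = add_product_feature(x)
    score = x1 + x2 - 2 * x1x2 - 0.5
    print(label, score > 0)   # the predicted side matches the label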

11) Margin and confidence

The margin is the distance from a point to the decision boundary.

  • Large margin: confident prediction
  • Small margin: uncertain (near the boundary)

Support Vector Machines (SVMs) maximize the margin—they find the boundary with maximum separation.
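
For a linear classifier, the geometric distance from a point to the boundary w · x + b = 0 has a closed form, |w · x + b| / ||w||, sketched below:

from math import sqrt

def distance_to_boundary(x, w, b):
    # Geometric margin of point x: |w · x + b| / ||w||
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = sqrt(sum(wi * wi for wi in w))
    return abs(score) / norm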


12) Probabilistic interpretation

Logistic regression tends to give well-calibrated probabilities:

  • P(spam | email) = 0.95 means the model estimates a 95% chance the email is spam

This is useful for:

  • Ranking (sort by probability)
  • Threshold adjustment (flag if p > 0.9)
  • Combining with other information

Not all classifiers produce calibrated probabilities; logistic regression usually comes close.
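
A small sketch of the ranking and thresholding uses (the probabilities and the 0.9 threshold are illustrative):

emails = ["offer.txt", "invoice.txt", "newsletter.txt"]
probs = [0.95, 0.40, 0.72]   # P(spam | email) from the classifier

# Ranking: sort by probability, most likely spam first
ranked = sorted(zip(emails, probs), key=lambda pair: pair[1], reverse=True)

# Threshold adjustment: only flag when the model is very confident
flagged = [e for e, p in zip(emails, probs) if p > 0.9]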


13) Class imbalance

When one class is much more common:

  • The model may learn to always predict the majority class
  • Accuracy can be misleading (99% accuracy if 99% are class 0)

Solutions:

  • Resampling: Oversample minority or undersample majority
  • Class weights: Weight minority-class mistakes more heavily in the loss (see the sketch after this list)
  • Different metrics: Use F1, AUC instead of accuracy
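
A sketch of class weights applied to the binary cross-entropy loss (the weight values are illustrative; one common heuristic is inverse class frequency):

from math import log

def weighted_bce(y, p, w_pos=10.0, w_neg=1.0):
    # Up-weight the rare positive class so its mistakes cost more
    return -(w_pos * y * log(p) + w_neg * (1 - y) * log(1 - p))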

Key takeaways

  • Classification predicts discrete categories
  • Decision boundaries separate classes in feature space
  • Sigmoid converts scores to binary probabilities
  • Softmax converts scores to multi-class probabilities
  • Logistic regression: simple, interpretable, well-calibrated
  • Cross-entropy loss trains classifiers
  • Handle class imbalance carefully

Module Items