From Logistic Regression to Neural Networks

Lesson, slides, and applied problem sets.


Lesson


Why this module exists

You've learned logistic regression. It works great for many problems. But there are patterns it simply cannot learn. This capstone shows you why neural networks exist—not as magic, but as the logical next step when linear models fail.

By the end, you'll build a neural network from scratch and understand every line.


1) The XOR problem: Where logistic regression fails

Consider XOR (exclusive or):

Input      Output
(0, 0)  →    0
(0, 1)  →    1
(1, 0)  →    1
(1, 1)  →    0

Plot these points:

    1 |  X         O
      |
    0 |  O         X
      +-------------
         0         1

Try to draw ONE straight line that separates O's from X's. You can't.

This is why logistic regression fails on XOR. It can only learn linear decision boundaries.

# Logistic regression on XOR - will fail!
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# No matter how long you train, accuracy hovers around 50%,
# because no linear boundary can separate these classes
# (the best any single line can do is get 3 of the 4 points right)
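
If you want to watch this failure happen, here is a quick sanity check. Using scikit-learn is my own shortcut for illustration, not part of the lesson's from-scratch code; any logistic regression implementation behaves the same way on this data.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))        # typically 0.5 - no better than guessing
print(clf.predict_proba(X))   # every row hovers near [0.5, 0.5]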

2) The insight: Transform the space

What if we could transform the inputs into a new space where they ARE linearly separable?

That's exactly what a hidden layer does.

Original space:          Transformed space:
                         (after hidden layer)
  1 |  X     O              1 |        O  O
    |                         |
  0 |  O     X              0 |  X  X
    +--------                 +------------
       0     1                   (new features)

The hidden layer learns a transformation that makes the problem solvable.
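
Before training anything, you can fake this transformation by hand. The feature choice below (OR and AND of the two inputs) is mine, purely for illustration; a trained hidden layer finds its own version of the same idea (see section 14).

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Hand-crafted "hidden layer": feature 1 = OR(x1, x2), feature 2 = AND(x1, x2)
H = np.column_stack([X.max(axis=1), X.min(axis=1)])
print(H)   # [[0 0], [1 0], [1 0], [1 1]]

# In the new space, "OR minus AND" is a single linear threshold:
print((H[:, 0] - H[:, 1] > 0.5).astype(int))   # [0 1 1 0] - exactly XOR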


3) Anatomy of a single-layer neural network

Input Layer      Hidden Layer        Output Layer
   x₁ ─────────┐
               ├──→ h₁ ─────────┐
   x₂ ─────────┤               ├──→ ŷ
               ├──→ h₂ ─────────┘
   (bias) ─────┘

Three stages:

  1. Linear transformation: z = Wx + b
  2. Activation function: h = σ(z)
  3. Output layer: ŷ = σ(W₂h + b₂)

The activation function (σ) is crucial—it introduces non-linearity.
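
To pin down the dimensions, here is a throwaway sketch of those three stages for the 2-input, 2-hidden-unit, 1-output network drawn above. The weight values (all 0.5) are made up; σ is the sigmoid defined in section 5; the code uses the row-vector convention (x @ W) that the rest of the lesson's code follows.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([[0., 1.]])        # one example, shape (1, 2)
W1 = np.full((2, 2), 0.5)       # (inputs, hidden units)
b1 = np.zeros(2)
W2 = np.full((2, 1), 0.5)       # (hidden units, outputs)
b2 = np.zeros(1)

h = sigmoid(x @ W1 + b1)        # stages 1 + 2, shape (1, 2)
y_hat = sigmoid(h @ W2 + b2)    # stage 3, shape (1, 1)
print(h.shape, y_hat.shape)     # (1, 2) (1, 1)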


4) Why activation functions matter

Without activation:

# Layer 1: z1 = W1 @ x + b1
# Layer 2: z2 = W2 @ z1 + b2
#        = W2 @ (W1 @ x + b1) + b2
#        = (W2 @ W1) @ x + (W2 @ b1 + b2)
#        = W_combined @ x + b_combined  # Still linear!

Multiple linear layers collapse into one. No matter how many layers, it's still just linear regression.
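
You can confirm the collapse numerically. This sketch uses random weights and arbitrary shapes of my own choosing, written in the row-vector convention the rest of the code uses:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                            # 5 examples, 3 features
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

two_layers = (x @ W1 + b1) @ W2 + b2                   # no activation in between
one_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)             # one merged linear layer

print(np.allclose(two_layers, one_layer))              # True - still just linear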

With activation (sigmoid):

# Layer 1: h1 = sigmoid(W1 @ x + b1)  # Non-linear!
# Layer 2: y  = sigmoid(W2 @ h1 + b2)

Now we can learn non-linear patterns.


5) The sigmoid activation

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Properties:

  • Output always between 0 and 1
  • S-shaped curve
  • Derivative: σ'(z) = σ(z) * (1 - σ(z))

       1 |         ___________
         |       /
     0.5 |------/------
         |    /
       0 |___/
         +------------------→ z
           -4   0   4

The simple derivative is why sigmoid was historically popular.
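
The backward pass in section 10 calls a helper named sigmoid_derivative; it falls straight out of the property above:

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)   # σ'(z) = σ(z) * (1 - σ(z))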


6) Forward pass: Computing predictions

def forward(x, W1, b1, W2, b2):
    # Hidden layer: x is (batch, inputs), W1 is (inputs, hidden)
    z1 = x @ W1 + b1         # Linear
    h = sigmoid(z1)          # Activation

    # Output layer: W2 is (hidden, outputs)
    z2 = h @ W2 + b2         # Linear
    y_pred = sigmoid(z2)     # Activation

    return y_pred, h, z1

Data flows forward through the network:

x → [W1, b1] → z1 → sigmoid → h → [W2, b2] → z2 → sigmoid → ŷ

We save intermediate values (h, z1) because we need them for backprop.


7) Loss function

Same as logistic regression—binary cross-entropy:

def binary_cross_entropy(y_true, y_pred):
    epsilon = 1e-15  # Prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

We want to minimize this loss by adjusting W1, b1, W2, b2.
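
A quick worked example using the function above (values rounded): a confident correct prediction is cheap, a confident wrong one is expensive.

# Confident and correct: y_true = 1, y_pred = 0.9  →  -log(0.9) ≈ 0.105
# Confident and wrong:   y_true = 1, y_pred = 0.1  →  -log(0.1) ≈ 2.303
print(binary_cross_entropy(np.array([1.0]), np.array([0.9])))   # ≈ 0.105
print(binary_cross_entropy(np.array([1.0]), np.array([0.1])))   # ≈ 2.303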


8) Backpropagation: The chain rule in action

To update weights, we need gradients: ∂Loss/∂W1, ∂Loss/∂b1, ∂Loss/∂W2, ∂Loss/∂b2.

The chain rule lets us compute these by working backwards:

Loss
  ↑
  ∂L/∂ŷ
  ↑
  ŷ = sigmoid(z2)
  ↑
  ∂z2/∂W2, ∂z2/∂b2, ∂z2/∂h
  ↑
  h = sigmoid(z1)
  ↑
  ∂z1/∂W1, ∂z1/∂b1
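
Reading the chain from top to bottom, each gradient we need is the product of the local derivatives along its path:

∂Loss/∂W2 = ∂L/∂ŷ · ∂ŷ/∂z2 · ∂z2/∂W2
∂Loss/∂W1 = ∂L/∂ŷ · ∂ŷ/∂z2 · ∂z2/∂h · ∂h/∂z1 · ∂z1/∂W1

(and likewise for b2 and b1, with ∂z2/∂b2 and ∂z1/∂b1 as the last factor)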

9) Backprop step by step

Output layer gradients:

# dL/dy_pred (derivative of BCE)
dL_dy = -(y_true / y_pred - (1 - y_true) / (1 - y_pred))

# dy/dz2 (derivative of sigmoid)
dy_dz2 = y_pred * (1 - y_pred)

# Combined: dL/dz2
dz2 = dL_dy * dy_dz2  # Often simplified to: y_pred - y_true

# Gradients for W2 and b2
dW2 = h.T @ dz2
db2 = np.sum(dz2, axis=0)
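
The simplification noted above is worth verifying once by hand. Multiplying the BCE derivative by the sigmoid derivative, the denominators cancel:

# dz2 = dL_dy * dy_dz2
#     = -(y_true / y_pred - (1 - y_true) / (1 - y_pred)) * y_pred * (1 - y_pred)
#     = -(y_true * (1 - y_pred) - (1 - y_true) * y_pred)
#     = y_pred - y_true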

Hidden layer gradients:

# Backprop through W2
dh = dz2 @ W2.T

# Through sigmoid activation
dz1 = dh * (h * (1 - h))  # sigmoid derivative

# Gradients for W1 and b1
dW1 = x.T @ dz1
db1 = np.sum(dz1, axis=0)

10) The backward function

def backward(x, y_true, y_pred, h, z1, W2):
    m = len(x)  # batch size

    # Output layer
    dz2 = y_pred - y_true  # Simplified gradient for BCE + sigmoid
    dW2 = (1/m) * h.T @ dz2
    db2 = (1/m) * np.sum(dz2, axis=0)

    # Hidden layer
    dh = dz2 @ W2.T
    dz1 = dh * sigmoid_derivative(z1)
    dW1 = (1/m) * x.T @ dz1
    db1 = (1/m) * np.sum(dz1, axis=0)

    return dW1, db1, dW2, db2

11) Gradient descent update

Same as before, just more parameters:

def update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2, lr):
    W1 = W1 - lr * dW1
    b1 = b1 - lr * db1
    W2 = W2 - lr * dW2
    b2 = b2 - lr * db2
    return W1, b1, W2, b2

12) The training loop

def train(X, y, hidden_size, epochs, lr):
    # Initialize weights
    input_size = X.shape[1]
    output_size = 1

    # Small random values break symmetry (see section 15)
    W1 = np.random.randn(input_size, hidden_size) * 0.5
    b1 = np.zeros(hidden_size)
    W2 = np.random.randn(hidden_size, output_size) * 0.5
    b2 = np.zeros(output_size)

    for epoch in range(epochs):
        # Forward
        y_pred, h, z1 = forward(X, W1, b1, W2, b2)

        # Compute loss
        loss = binary_cross_entropy(y, y_pred)

        # Backward
        dW1, db1, dW2, db2 = backward(X, y, y_pred, h, z1, W2)

        # Update
        W1, b1, W2, b2 = update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2, lr)

        if epoch % 1000 == 0:
            print(f"Epoch {epoch}, Loss: {loss:.4f}")

    return W1, b1, W2, b2

13) Solving XOR

# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Train neural network
W1, b1, W2, b2 = train(X, y, hidden_size=4, epochs=10000, lr=1.0)

# Test
for i in range(len(X)):
    pred, _, _ = forward(X[i:i+1], W1, b1, W2, b2)
    print(f"Input: {X[i]}, True: {y[i][0]}, Pred: {pred[0][0]:.3f}")

# Output:
# Input: [0 0], True: 0, Pred: 0.023
# Input: [0 1], True: 1, Pred: 0.978
# Input: [1 0], True: 1, Pred: 0.977
# Input: [1 1], True: 0, Pred: 0.025

It works! The network learned XOR—something logistic regression cannot do.


14) What the hidden layer learned

After training, examine the hidden layer activations:

# For each input, compute hidden layer output
for i in range(len(X)):
    z1 = X[i] @ W1 + b1
    h = sigmoid(z1)
    print(f"Input: {X[i]} → Hidden: {h}")

# The hidden layer has transformed the space!
# Points that couldn't be separated are now separated.

The hidden neurons have learned features:

  • One might detect "at least one input is 1"
  • Another might detect "both inputs are 1"
  • Combined, these features make XOR linearly separable
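
You can check that claim directly by reusing the weights trained in section 13 (this check is my addition, not part of the original walkthrough): compute the hidden activations for all four inputs and note that a single linear layer, the trained output layer itself, now separates the classes.

H = sigmoid(X @ W1 + b1)                              # hidden activations, one row per input
scores = H @ W2 + b2                                  # one linear layer on the new features
print((sigmoid(scores) > 0.5).astype(int).ravel())    # [0 1 1 0] if training converged as above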

15) Weight initialization matters

Bad initialization → training fails.

# BAD: All zeros (every hidden neuron computes the same thing)
W1 = np.zeros((2, 4))  # Symmetry problem!

# BAD: Too large (activations saturate)
W1 = np.random.randn(2, 4) * 100  # Gradients vanish

# GOOD: Small random values
W1 = np.random.uniform(-0.25, 0.25, size=(2, 4))  # Between -0.25 and 0.25

# BETTER: Xavier initialization (fan_in = inputs, fan_out = hidden units)
W1 = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / (fan_in + fan_out))

16) Debugging neural networks

Common issues:

Loss not decreasing:

  • Learning rate too low → increase it
  • Learning rate too high → decrease it
  • Weights initialized wrong → use Xavier init

Loss is NaN:

  • Numerical overflow → clip gradients
  • Log of zero → add epsilon to predictions

Accuracy stuck at 50%:

  • Network too small → add more hidden neurons
  • Not training long enough → more epochs
  • Bug in backprop → check gradients numerically

17) Gradient checking

Verify your backprop is correct:

def gradient_check(X, y, W1, b1, W2, b2, epsilon=1e-7):
    # Compute analytical gradients
    y_pred, h, z1 = forward(X, W1, b1, W2, b2)
    dW1, db1, dW2, db2 = backward(X, y, y_pred, h, z1, W2)

    # Compute numerical gradients for W1
    numerical_dW1 = np.zeros_like(W1)
    for i in range(W1.shape[0]):
        for j in range(W1.shape[1]):
            W1_plus = W1.copy()
            W1_plus[i, j] += epsilon
            loss_plus = compute_loss(X, y, W1_plus, b1, W2, b2)

            W1_minus = W1.copy()
            W1_minus[i, j] -= epsilon
            loss_minus = compute_loss(X, y, W1_minus, b1, W2, b2)

            numerical_dW1[i, j] = (loss_plus - loss_minus) / (2 * epsilon)

    # Compare
    diff = np.linalg.norm(dW1 - numerical_dW1) / (np.linalg.norm(dW1) + np.linalg.norm(numerical_dW1))
    print(f"Gradient difference: {diff}")  # Should be < 1e-7

18) Beyond XOR: What can one hidden layer learn?

Universal approximation theorem: a single hidden layer with enough neurons can approximate any continuous function on a bounded input region to arbitrary accuracy.

But:

  • "Enough neurons" might be exponentially many
  • Training might be hard
  • Deep networks are more practical

Still, one hidden layer can learn:

  • XOR and other non-linear patterns
  • Simple image features
  • Text patterns
  • Most "reasonable" functions

19) The bridge to deep learning

What you've built is the foundation of all neural networks:

Your network:
Input → [Linear → Activation] → [Linear → Activation] → Output

Deep network (same pattern, more layers):
Input → [Linear → Activation] → [Linear → Activation] → ... → Output

Key ideas that transfer:

  • Forward pass: Compute predictions layer by layer
  • Backpropagation: Compute gradients layer by layer (in reverse)
  • Gradient descent: Update all weights to minimize loss

Modern frameworks (PyTorch, TensorFlow) automate the gradient computation, but the principles are exactly what you implemented.


Key takeaways

  1. Linear models have limits - XOR cannot be solved by logistic regression
  2. Hidden layers transform the space - Making non-linear problems linearly separable
  3. Activation functions add non-linearity - Without them, deep networks collapse to linear
  4. Backpropagation = chain rule - Compute gradients by working backwards
  5. You can build neural networks from scratch - No magic, just math
  6. This is the foundation of deep learning - Same ideas scale to massive networks

You're now ready for deep learning. The path from here:

  • More layers (deep networks)
  • Different architectures (CNNs, RNNs, Transformers)
  • Better optimizers (Adam, RMSprop)
  • Regularization techniques (dropout, batch norm)

But the core—forward pass, backprop, gradient descent—remains the same.

