From Logistic Regression to Neural Networks
Lesson, slides, and applied problem sets.
Why this module exists
You've learned logistic regression. It works great for many problems. But there are patterns it simply cannot learn. This capstone shows you why neural networks exist—not as magic, but as the logical next step when linear models fail.
By the end, you'll build a neural network from scratch and understand every line.
1) The XOR problem: Where logistic regression fails
Consider XOR (exclusive or):
Input Output
(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0
Plot these points:
 1 |  X     O
   |
 0 |  O     X
   +------------
      0     1
Try to draw ONE straight line that separates O's from X's. You can't.
This is why logistic regression fails on XOR. It can only learn linear decision boundaries.
# Logistic regression on XOR - will fail!
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
# No matter how long you train, accuracy stays ~50%
# Because no linear boundary can separate these classes
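You can verify this claim directly. Here is a minimal sketch using scikit-learn's LogisticRegression (an assumed dependency; any logistic regression implementation behaves the same way on XOR):
# Sketch: logistic regression really does fail on XOR (assumes NumPy + scikit-learn)
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))  # Typically 0.5 accuracy: no linear boundary separates the classes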
2) The insight: Transform the space
What if we could transform the inputs into a new space where they ARE linearly separable?
That's exactly what a hidden layer does.
Original space:              Transformed space (after hidden layer):

 1 |  X     O                 1 |  O     O
   |                            |
 0 |  O     X                 0 |  X     X
   +------------                +--------------
      0     1                    (new features)
The hidden layer learns a transformation that makes the problem solvable.
3) Anatomy of a single-layer neural network
Input Layer        Hidden Layer        Output Layer

 x₁ ─────────┐
             ├──→ h₁ ─────────┐
 x₂ ─────────┤                ├──→ ŷ
             ├──→ h₂ ─────────┘
 (bias) ─────┘
Three stages:
- Linear transformation: z₁ = W₁x + b₁
- Activation function: h = σ(z₁)
- Output layer: ŷ = σ(W₂h + b₂)
The activation function (σ) is crucial—it introduces non-linearity.
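To make the three stages concrete, here is a minimal NumPy sketch for the 2 → 2 → 1 network drawn above (NumPy and the hidden size of 2 are illustrative assumptions; later sections use 4 hidden units):
# Sketch: the three stages for a 2-input, 2-hidden-unit, 1-output network
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.0, 1.0])           # one input example, shape (2,)
W1 = np.random.randn(2, 2) * 0.5   # input → hidden weights
b1 = np.zeros(2)                   # hidden biases
W2 = np.random.randn(2, 1) * 0.5   # hidden → output weights
b2 = np.zeros(1)                   # output bias

z1 = x @ W1 + b1              # 1) linear transformation
h = sigmoid(z1)               # 2) activation
y_hat = sigmoid(h @ W2 + b2)  # 3) output layer
print(h.shape, y_hat.shape)   # (2,) (1,)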
4) Why activation functions matter
Without activation:
# Layer 1: z1 = W1 @ x + b1
# Layer 2: z2 = W2 @ z1 + b2
#             = W2 @ (W1 @ x + b1) + b2
#             = (W2 @ W1) @ x + (W2 @ b1 + b2)
#             = W_combined @ x + b_combined   # Still linear!
Multiple linear layers collapse into one. No matter how many layers, it's still just linear regression.
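A quick numerical check of this collapse, with arbitrary weights (a sketch assuming NumPy):
# Sketch: two linear layers compose into a single linear layer (NumPy assumed)
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))  # True: still a single linear map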
With activation (sigmoid):
# Layer 1: h1 = sigmoid(W1 @ x + b1) # Non-linear!
# Layer 2: y = sigmoid(W2 @ h1 + b2)
Now we can learn non-linear patterns.
5) The sigmoid activation
def sigmoid(z):
    return 1 / (1 + exp(-z))
Properties:
- Output always between 0 and 1
- S-shaped curve
- Derivative: σ'(z) = σ(z) * (1 - σ(z))
  1 |               _________
    |             _/
0.5 |- - - - - - /- - - - - -
    |          _/
  0 |________/
    +------------------------→ z
             -4     0     4
The simple derivative is why sigmoid was historically popular.
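Here is a NumPy version of sigmoid together with the sigmoid_derivative helper used later in backward(), plus a numerical check of the σ'(z) = σ(z) * (1 - σ(z)) identity (NumPy is an assumption):
# Sketch: sigmoid and its derivative, verified with a central difference (NumPy assumed)
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-4, 4, 9)
eps = 1e-6
numerical = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(sigmoid_derivative(z), numerical))  # True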
6) Forward pass: Computing predictions
def forward(x, W1, b1, W2, b2):
    # Hidden layer
    z1 = x @ W1 + b1        # Linear (x has one example per row)
    h = sigmoid(z1)         # Activation
    # Output layer
    z2 = h @ W2 + b2        # Linear
    y_pred = sigmoid(z2)    # Activation
    return y_pred, h, z1
Data flows forward through the network:
x → [W1, b1] → z1 → sigmoid → h → [W2, b2] → z2 → sigmoid → ŷ
We save intermediate values (h, z1) because we need them for backprop.
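As a sanity check, here is the forward pass run on the whole XOR batch at once, with the shapes written out (a self-contained sketch assuming NumPy and a hidden size of 4):
# Sketch: forward pass shapes on the XOR batch (NumPy assumed, hidden size 4)
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # (4, 2): 4 examples
W1, b1 = np.random.randn(2, 4) * 0.5, np.zeros(4)            # 2 inputs → 4 hidden units
W2, b2 = np.random.randn(4, 1) * 0.5, np.zeros(1)            # 4 hidden units → 1 output

z1 = X @ W1 + b1       # (4, 4)
h = sigmoid(z1)        # (4, 4), saved for backprop
z2 = h @ W2 + b2       # (4, 1)
y_pred = sigmoid(z2)   # (4, 1): one prediction per example
print(z1.shape, h.shape, y_pred.shape)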
7) Loss function
Same as logistic regression—binary cross-entropy:
def binary_cross_entropy(y_true, y_pred):
    epsilon = 1e-15  # Prevent log(0)
    y_pred = clip(y_pred, epsilon, 1 - epsilon)
    return -mean(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
We want to minimize this loss by adjusting W1, b1, W2, b2.
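A tiny worked example shows how the loss rewards confident correct predictions and punishes confident wrong ones (NumPy assumed):
# Sketch: binary cross-entropy on hand-picked predictions (NumPy assumed)
import numpy as np

bce = lambda t, p: -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

y_true = np.array([1.0, 0.0])
print(bce(y_true, np.array([0.9, 0.1])))  # ≈ 0.105: confident and correct
print(bce(y_true, np.array([0.1, 0.9])))  # ≈ 2.303: confident and wrong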
8) Backpropagation: The chain rule in action
To update weights, we need gradients: ∂Loss/∂W1, ∂Loss/∂b1, ∂Loss/∂W2, ∂Loss/∂b2.
The chain rule lets us compute these by working backwards:
Loss
  ↑  ∂L/∂ŷ
ŷ = sigmoid(z2)
  ↑  ∂ŷ/∂z2, then ∂z2/∂W2, ∂z2/∂b2, ∂z2/∂h
h = sigmoid(z1)
  ↑  ∂h/∂z1, then ∂z1/∂W1, ∂z1/∂b1
x
9) Backprop step by step
Output layer gradients:
# dL/dy_pred (derivative of BCE)
dL_dy = -(y_true / y_pred - (1 - y_true) / (1 - y_pred))
# dy/dz2 (derivative of sigmoid)
dy_dz2 = y_pred * (1 - y_pred)
# Combined: dL/dz2
dz2 = dL_dy * dy_dz2 # Often simplified to: y_pred - y_true
# Gradients for W2 and b2
dW2 = h.T @ dz2
db2 = sum(dz2, axis=0)
Hidden layer gradients:
# Backprop through W2
dh = dz2 @ W2.T
# Through sigmoid activation
dz1 = dh * (h * (1 - h)) # sigmoid derivative
# Gradients for W1 and b1
dW1 = x.T @ dz1
db1 = sum(dz1, axis=0)
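The simplification noted above is exact: multiplying the BCE derivative by the sigmoid derivative cancels the denominators, leaving y_pred - y_true. A quick numerical check (NumPy assumed, with made-up predictions):
# Sketch: dL/dy * dy/dz2 really equals y_pred - y_true (NumPy assumed)
import numpy as np

y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_pred = np.array([0.2, 0.7, 0.9, 0.4])   # arbitrary sigmoid outputs in (0, 1)

dL_dy = -(y_true / y_pred - (1 - y_true) / (1 - y_pred))  # BCE derivative
dy_dz2 = y_pred * (1 - y_pred)                            # sigmoid derivative
print(np.allclose(dL_dy * dy_dz2, y_pred - y_true))       # True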
10) The backward function
def backward(x, y_true, y_pred, h, z1, W2):
    m = len(x)  # batch size
    # Output layer
    dz2 = y_pred - y_true  # Simplified gradient for BCE + sigmoid
    dW2 = (1/m) * h.T @ dz2
    db2 = (1/m) * sum(dz2, axis=0)
    # Hidden layer
    dh = dz2 @ W2.T
    dz1 = dh * sigmoid_derivative(z1)
    dW1 = (1/m) * x.T @ dz1
    db1 = (1/m) * sum(dz1, axis=0)
    return dW1, db1, dW2, db2
11) Gradient descent update
Same as before, just more parameters:
def update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2, lr):
    W1 = W1 - lr * dW1
    b1 = b1 - lr * db1
    W2 = W2 - lr * dW2
    b2 = b2 - lr * db2
    return W1, b1, W2, b2
12) The training loop
def train(X, y, hidden_size, epochs, lr):
    # Initialize weights
    input_size = X.shape[1]
    output_size = 1
    W1 = random_init((input_size, hidden_size))  # small random values (see section 15)
    b1 = zeros(hidden_size)
    W2 = random_init((hidden_size, output_size))
    b2 = zeros(output_size)
    for epoch in range(epochs):
        # Forward
        y_pred, h, z1 = forward(X, W1, b1, W2, b2)
        # Compute loss
        loss = binary_cross_entropy(y, y_pred)
        # Backward
        dW1, db1, dW2, db2 = backward(X, y, y_pred, h, z1, W2)
        # Update
        W1, b1, W2, b2 = update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2, lr)
        if epoch % 1000 == 0:
            print(f"Epoch {epoch}, Loss: {loss:.4f}")
    return W1, b1, W2, b2
13) Solving XOR
# XOR dataset
X = array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = array([[0], [1], [1], [0]])
# Train neural network
W1, b1, W2, b2 = train(X, y, hidden_size=4, epochs=10000, lr=1.0)
# Test
for i in range(len(X)):
    pred, _, _ = forward(X[i:i+1], W1, b1, W2, b2)
    print(f"Input: {X[i]}, True: {y[i][0]}, Pred: {pred[0][0]:.3f}")
# Output:
# Input: [0 0], True: 0, Pred: 0.023
# Input: [0 1], True: 1, Pred: 0.978
# Input: [1 0], True: 1, Pred: 0.977
# Input: [1 1], True: 0, Pred: 0.025
It works! The network learned XOR—something logistic regression cannot do.
14) What the hidden layer learned
After training, examine the hidden layer activations:
# For each input, compute the hidden layer output
for i in range(len(X)):
    z1 = X[i] @ W1 + b1
    h = sigmoid(z1)
    print(f"Input: {X[i]} → Hidden: {h}")
# The hidden layer has transformed the space!
# Points that couldn't be separated are now separated.
The hidden neurons have learned features:
- One might detect "at least one input is 1"
- Another might detect "both inputs are 1"
- Combined, these features make XOR linearly separable
15) Weight initialization matters
Bad initialization → training fails.
# BAD: All zeros (neurons learn the same thing)
W1 = zeros((2, 4))  # Symmetry problem!
# BAD: Too large (activations saturate)
W1 = random((2, 4)) * 100  # Gradients vanish
# GOOD: Small random values
W1 = random((2, 4)) * 0.5 - 0.25  # Between -0.25 and 0.25
# BETTER: Xavier initialization (zero-mean, scaled by layer sizes)
W1 = randn(fan_in, fan_out) * sqrt(2.0 / (fan_in + fan_out))
16) Debugging neural networks
Common issues:
Loss not decreasing:
- Learning rate too low → increase it
- Learning rate too high → decrease it
- Weights initialized wrong → use Xavier init
Loss is NaN:
- Numerical overflow → clip gradients
- Log of zero → add epsilon to predictions
Accuracy stuck at 50%:
- Network too small → add more hidden neurons
- Not training long enough → more epochs
- Bug in backprop → check gradients numerically
17) Gradient checking
Verify your backprop is correct:
def compute_loss(X, y, W1, b1, W2, b2):
    # Helper: forward pass followed by the BCE loss
    y_pred, _, _ = forward(X, W1, b1, W2, b2)
    return binary_cross_entropy(y, y_pred)

def gradient_check(X, y, W1, b1, W2, b2, epsilon=1e-7):
    # Compute analytical gradients
    y_pred, h, z1 = forward(X, W1, b1, W2, b2)
    dW1, db1, dW2, db2 = backward(X, y, y_pred, h, z1, W2)
    # Compute numerical gradients for W1 (central differences)
    numerical_dW1 = zeros_like(W1)
    for i in range(W1.shape[0]):
        for j in range(W1.shape[1]):
            W1_plus = W1.copy()
            W1_plus[i, j] += epsilon
            loss_plus = compute_loss(X, y, W1_plus, b1, W2, b2)
            W1_minus = W1.copy()
            W1_minus[i, j] -= epsilon
            loss_minus = compute_loss(X, y, W1_minus, b1, W2, b2)
            numerical_dW1[i, j] = (loss_plus - loss_minus) / (2 * epsilon)
    # Compare the relative difference between analytical and numerical gradients
    diff = norm(dW1 - numerical_dW1) / (norm(dW1) + norm(numerical_dW1))
    print(f"Gradient difference: {diff}")  # Should be < 1e-7
18) Beyond XOR: What can one hidden layer learn?
Universal approximation theorem: a single hidden layer with enough neurons can approximate any continuous function on a bounded input region to arbitrary accuracy.
But:
- "Enough neurons" might be exponentially many
- Training might be hard
- Deep networks are more practical
Still, one hidden layer can learn:
- XOR and other non-linear patterns
- Simple image features
- Text patterns
- Most "reasonable" functions
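As a quick illustration of that flexibility, here is a hedged sketch that fits sin(x) with a single hidden layer, using scikit-learn's MLPRegressor (an assumed dependency; the layer size and solver are illustrative, not tuned):
# Sketch: one hidden layer approximating sin(x) (assumes NumPy + scikit-learn)
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel()

net = MLPRegressor(hidden_layer_sizes=(50,), activation='tanh',
                   solver='lbfgs', max_iter=5000, random_state=0)
net.fit(X, y)
print(np.max(np.abs(net.predict(X) - y)))  # Small error, typically well under 0.1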
19) The bridge to deep learning
What you've built is the foundation of all neural networks:
Your network:
Input → [Linear → Activation] → [Linear → Activation] → Output
Deep network (same pattern, more layers):
Input → [Linear → Activation] → [Linear → Activation] → ... → Output
Key ideas that transfer:
- Forward pass: Compute predictions layer by layer
- Backpropagation: Compute gradients layer by layer (in reverse)
- Gradient descent: Update all weights to minimize loss
Modern frameworks (PyTorch, TensorFlow) automate the gradient computation, but the principles are exactly what you implemented.
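For comparison, here is a hedged sketch of the same 2 → 4 → 1 sigmoid network in PyTorch (assuming it is installed). Autograd replaces the hand-written backward(), but the loop is the same forward / loss / backward / update cycle:
# Sketch: the XOR network in PyTorch; autograd computes the gradients (PyTorch assumed)
import torch
import torch.nn as nn

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

model = nn.Sequential(nn.Linear(2, 4), nn.Sigmoid(),
                      nn.Linear(4, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()                              # binary cross-entropy, as above
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

for epoch in range(10000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)                     # forward pass + loss
    loss.backward()                                 # backprop via autograd
    optimizer.step()                                # gradient descent update

print(model(X).detach().squeeze())  # should approach [0, 1, 1, 0], mirroring the from-scratch results above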
Key takeaways
- Linear models have limits - XOR cannot be solved by logistic regression
- Hidden layers transform the space - Making non-linear problems linearly separable
- Activation functions add non-linearity - Without them, deep networks collapse to linear
- Backpropagation = chain rule - Compute gradients by working backwards
- You can build neural networks from scratch - No magic, just math
- This is the foundation of deep learning - Same ideas scale to massive networks
You're now ready for deep learning. The path from here:
- More layers (deep networks)
- Different architectures (CNNs, RNNs, Transformers)
- Better optimizers (Adam, RMSprop)
- Regularization techniques (dropout, batch norm)
But the core—forward pass, backprop, gradient descent—remains the same.