Neural Network from Scratch: Solving XOR
This capstone project brings together everything you've learned. You'll build a neural network from scratch that can solve the XOR problem—something logistic regression cannot do.
Background
The XOR (exclusive or) function:
Input → Output
(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0
This is not linearly separable. No straight line can separate the 0s from the 1s. A neural network with a hidden layer can learn to transform the space and solve this.
Functions to implement
1. sigmoid(z) and sigmoid_derivative(z)
The activation function and its derivative.
sigmoid(z) = 1 / (1 + exp(-z))
sigmoid_derivative(z) = sigmoid(z) * (1 - sigmoid(z))
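A minimal NumPy sketch of these two helpers (vectorized so they work element-wise on arrays):

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of sigmoid with respect to its input z."""
    s = sigmoid(z)
    return s * (1.0 - s)
```

Note that the derivative peaks at 0.25 when z = 0 and shrinks toward 0 for large |z|, which is why gradients vanish when units saturate.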
2. initialize_weights(input_size, hidden_size, output_size)
Initialize network weights randomly.
- W1: (input_size, hidden_size) - small random values
- b1: (hidden_size,) - zeros
- W2: (hidden_size, output_size) - small random values
- b2: (output_size,) - zeros
- Use values between -0.5 and 0.5 for weights
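One way to sketch this initialization with NumPy (the `seed` parameter is an addition for reproducibility, not part of the required signature):

```python
import numpy as np

def initialize_weights(input_size, hidden_size, output_size, seed=None):
    rng = np.random.default_rng(seed)
    # Weights: small uniform random values in [-0.5, 0.5); biases start at zero.
    W1 = rng.uniform(-0.5, 0.5, size=(input_size, hidden_size))
    b1 = np.zeros(hidden_size)
    W2 = rng.uniform(-0.5, 0.5, size=(hidden_size, output_size))
    b2 = np.zeros(output_size)
    return W1, b1, W2, b2
```

Random (rather than zero) weights break the symmetry between hidden units; if all weights started equal, every hidden neuron would compute the same thing and receive the same gradient.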
3. forward(X, W1, b1, W2, b2)
Compute forward pass through the network.
- z1 = X @ W1 + b1
- h = sigmoid(z1)
- z2 = h @ W2 + b2
- y_pred = sigmoid(z2)
- Return: (y_pred, h, z1)
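The four lines above translate almost directly into NumPy; a self-contained sketch (sigmoid inlined so the example runs on its own):

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z1 = X @ W1 + b1       # hidden pre-activation, shape (m, hidden_size)
    h = sigmoid(z1)        # hidden activations
    z2 = h @ W2 + b2       # output pre-activation, shape (m, output_size)
    y_pred = sigmoid(z2)   # predicted probabilities in (0, 1)
    return y_pred, h, z1
```

Returning `h` and `z1` alongside `y_pred` matters: backpropagation needs those intermediate values, so caching them here avoids recomputing them.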
4. binary_cross_entropy(y_true, y_pred)
Compute the loss.
- Add epsilon (1e-15) to prevent log(0)
- Return mean of: -[y log(p) + (1-y) log(1-p)]
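A sketch of the loss; here `np.clip` with the same epsilon is used as the guard against log(0), a common equivalent of adding epsilon inside the logs:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    eps = 1e-15
    # Keep predictions strictly inside (0, 1) so both logs are finite.
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
```

A perfect prediction gives a loss near 0, and a maximally uncertain prediction of 0.5 gives log 2 ≈ 0.693 per example.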
5. backward(X, y_true, y_pred, h, z1, W2)
Compute gradients via backpropagation.
- dz2 = y_pred - y_true
- dW2 = (1/m) * h.T @ dz2
- db2 = (1/m) * sum(dz2, axis=0)
- dh = dz2 @ W2.T
- dz1 = dh * sigmoid_derivative(z1)
- dW1 = (1/m) * X.T @ dz1
- db1 = (1/m) * sum(dz1, axis=0)
- Return: (dW1, db1, dW2, db2)
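These gradient formulas can be sketched as follows (sigmoid derivative inlined so the block is self-contained):

```python
import numpy as np

def backward(X, y_true, y_pred, h, z1, W2):
    m = X.shape[0]
    dz2 = y_pred - y_true              # (m, output): combined sigmoid + BCE gradient
    dW2 = h.T @ dz2 / m
    db2 = dz2.sum(axis=0) / m
    dh = dz2 @ W2.T                    # chain rule: propagate back into hidden layer
    s = 1.0 / (1.0 + np.exp(-z1))      # sigmoid(z1)
    dz1 = dh * s * (1.0 - s)           # multiply by sigmoid'(z1)
    dW1 = X.T @ dz1 / m
    db1 = dz1.sum(axis=0) / m
    return dW1, db1, dW2, db2
```

The tidy `dz2 = y_pred - y_true` is not an approximation: the sigmoid derivative and the cross-entropy derivative cancel when composed, which is exactly why this pairing of output activation and loss is standard.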
6. train(X, y, hidden_size, epochs, learning_rate)
Train the neural network.
- Initialize weights
- For each epoch: forward → loss → backward → update weights
- Return: (W1, b1, W2, b2, losses) where losses is a list of loss values
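Putting the pieces together, one possible self-contained sketch of the loop (helpers inlined; the `seed` argument is an addition for reproducibility, not part of the required signature):

```python
import numpy as np

def train(X, y, hidden_size, epochs, learning_rate, seed=0):
    rng = np.random.default_rng(seed)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    m, n_in = X.shape
    n_out = y.shape[1]
    W1 = rng.uniform(-0.5, 0.5, (n_in, hidden_size)); b1 = np.zeros(hidden_size)
    W2 = rng.uniform(-0.5, 0.5, (hidden_size, n_out)); b2 = np.zeros(n_out)
    losses = []
    for _ in range(epochs):
        # Forward pass
        h = sigmoid(X @ W1 + b1)
        y_pred = sigmoid(h @ W2 + b2)
        # Loss (clip guards log(0))
        p = np.clip(y_pred, 1e-15, 1 - 1e-15)
        losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
        # Backward pass
        dz2 = y_pred - y
        dW2 = h.T @ dz2 / m
        db2 = dz2.sum(axis=0) / m
        dz1 = (dz2 @ W2.T) * h * (1 - h)   # sigmoid'(z1) = h * (1 - h)
        dW1 = X.T @ dz1 / m
        db1 = dz1.sum(axis=0) / m
        # Gradient descent update
        W1 -= learning_rate * dW1; b1 -= learning_rate * db1
        W2 -= learning_rate * dW2; b2 -= learning_rate * db2
    return W1, b1, W2, b2, losses
```

Note the trick in the backward pass: since `h = sigmoid(z1)` was already computed, `sigmoid'(z1)` is just `h * (1 - h)`, so `z1` never needs to be stored separately here.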
7. predict(X, W1, b1, W2, b2)
Make predictions (0 or 1) using trained weights.
- Run forward pass
- Return 1 if y_pred >= 0.5, else 0
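A sketch of the prediction step, rerunning the forward pass and thresholding at 0.5:

```python
import numpy as np

def predict(X, W1, b1, W2, b2):
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(X @ W1 + b1)
    y_pred = sigmoid(h @ W2 + b2)
    return (y_pred >= 0.5).astype(int)   # hard 0/1 labels
```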
Example
import numpy as np

# XOR dataset (NumPy arrays, so the matrix operations work)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
# Train
W1, b1, W2, b2, losses = train(X, y, hidden_size=4, epochs=10000, learning_rate=1.0)
# Predict
predictions = predict(X, W1, b1, W2, b2)
# Should be [[0], [1], [1], [0]] (or close to it)
# Loss should decrease
assert losses[-1] < losses[0]
Hints
- Matrix dimensions matter! Double-check shapes at each step.
- Use sum(..., axis=0, keepdims=True) for bias gradients if needed.
- The learning rate for XOR can be high (0.5 to 2.0).
- 4 hidden neurons is enough for XOR, but more is fine.
- 5000-10000 epochs should be sufficient for convergence.
What you'll prove
By completing this capstone, you demonstrate understanding of:
- Why linear models fail on non-linear problems
- How hidden layers transform feature space
- Forward propagation through a network
- Backpropagation and the chain rule
- Gradient descent for neural networks
This is the foundation of all deep learning.