Neural Network from Scratch: Solving XOR

hard · neural-networks, backpropagation, capstone

This capstone project brings together everything you've learned. You'll build a neural network from scratch that can solve the XOR problem—something logistic regression cannot do.

Background

The XOR (exclusive or) function:

Input      Output
(0, 0)  →    0
(0, 1)  →    1
(1, 0)  →    1
(1, 1)  →    0

XOR is not linearly separable: no single straight line can separate the inputs that map to 0 from those that map to 1. A neural network with a hidden layer can learn to transform the input space so that the two classes become separable.

Functions to implement

1. sigmoid(z) and sigmoid_derivative(z)

The activation function and its derivative.

  • sigmoid(z) = 1 / (1 + exp(-z))
  • sigmoid_derivative(z) = sigmoid(z) * (1 - sigmoid(z))
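
A minimal sketch of these two functions, assuming the inputs are NumPy arrays (the vectorized form below is one reasonable way to write them):

import numpy as np

def sigmoid(z):
    # Element-wise logistic function; works on scalars and arrays.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # Derivative with respect to the pre-activation z.
    s = sigmoid(z)
    return s * (1.0 - s)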

2. initialize_weights(input_size, hidden_size, output_size)

Initialize network weights randomly.

  • W1: (input_size, hidden_size) - small random values
  • b1: (hidden_size,) - zeros
  • W2: (hidden_size, output_size) - small random values
  • b2: (output_size,) - zeros
  • Use values between -0.5 and 0.5 for weights
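
One possible implementation; np.random.uniform is just one reasonable way to draw small random values in the stated range:

import numpy as np

def initialize_weights(input_size, hidden_size, output_size):
    # Small random weights break symmetry; zero biases are fine to start with.
    W1 = np.random.uniform(-0.5, 0.5, size=(input_size, hidden_size))
    b1 = np.zeros(hidden_size)
    W2 = np.random.uniform(-0.5, 0.5, size=(hidden_size, output_size))
    b2 = np.zeros(output_size)
    return W1, b1, W2, b2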

3. forward(X, W1, b1, W2, b2)

Compute the forward pass through the network.

  • z1 = X @ W1 + b1
  • h = sigmoid(z1)
  • z2 = h @ W2 + b2
  • y_pred = sigmoid(z2)
  • Return: (y_pred, h, z1)
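
A direct translation of the steps above into NumPy, assuming X has shape (m, input_size) and the sigmoid sketch from step 1:

def forward(X, W1, b1, W2, b2):
    # Hidden layer: linear transform followed by sigmoid.
    z1 = X @ W1 + b1          # shape (m, hidden_size)
    h = sigmoid(z1)
    # Output layer: another linear transform plus sigmoid.
    z2 = h @ W2 + b2          # shape (m, output_size)
    y_pred = sigmoid(z2)
    return y_pred, h, z1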

4. binary_cross_entropy(y_true, y_pred)

Compute the loss.

  • Add epsilon (1e-15) to prevent log(0)
  • Return mean of: -[y log(p) + (1-y) log(1-p)]
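
A short sketch, assuming NumPy arrays; clipping with the same epsilon is one way to keep log() finite (adding epsilon inside the logs works too):

import numpy as np

def binary_cross_entropy(y_true, y_pred):
    eps = 1e-15
    # Nudge predictions away from exactly 0 and 1 so the logs stay finite.
    p = np.clip(y_pred, eps, 1 - eps)
    return np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))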

5. backward(X, y_true, y_pred, h, z1, W2)

Compute gradients via backpropagation.

  • dz2 = y_pred - y_true
  • dW2 = (1/m) * h.T @ dz2
  • db2 = (1/m) * sum(dz2, axis=0)
  • dh = dz2 @ W2.T
  • dz1 = dh * sigmoid_derivative(z1)
  • dW1 = (1/m) * X.T @ dz1
  • db1 = (1/m) * sum(dz1, axis=0)
  • Return: (dW1, db1, dW2, db2)
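
The bullets translate almost line for line into NumPy; this sketch assumes numpy is imported as np and sigmoid_derivative from step 1:

def backward(X, y_true, y_pred, h, z1, W2):
    m = X.shape[0]
    # Output layer: the gradient of BCE through a sigmoid simplifies to (prediction - target).
    dz2 = y_pred - y_true
    dW2 = (1 / m) * h.T @ dz2
    db2 = (1 / m) * np.sum(dz2, axis=0)
    # Hidden layer: chain rule back through W2 and the sigmoid.
    dh = dz2 @ W2.T
    dz1 = dh * sigmoid_derivative(z1)
    dW1 = (1 / m) * X.T @ dz1
    db1 = (1 / m) * np.sum(dz1, axis=0)
    return dW1, db1, dW2, db2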

6. train(X, y, hidden_size, epochs, learning_rate)

Train the neural network.

  • Initialize weights
  • For each epoch: forward → loss → backward → update weights
  • Return: (W1, b1, W2, b2, losses) where losses is a list of loss values
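
One way to wire the pieces together, assuming the functions sketched in the earlier steps; the np.asarray conversion is an added convenience, not part of the spec:

import numpy as np

def train(X, y, hidden_size, epochs, learning_rate):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    W1, b1, W2, b2 = initialize_weights(X.shape[1], hidden_size, y.shape[1])
    losses = []
    for _ in range(epochs):
        y_pred, h, z1 = forward(X, W1, b1, W2, b2)
        losses.append(binary_cross_entropy(y, y_pred))
        dW1, db1, dW2, db2 = backward(X, y, y_pred, h, z1, W2)
        # Vanilla gradient descent update.
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2
    return W1, b1, W2, b2, losses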

7. predict(X, W1, b1, W2, b2)

Make predictions (0 or 1) using trained weights.

  • Run forward pass
  • Return 1 if y_pred >= 0.5, else 0
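
A short sketch, reusing forward from step 3 (the np.asarray call is again an added convenience):

import numpy as np

def predict(X, W1, b1, W2, b2):
    y_pred, _, _ = forward(np.asarray(X, dtype=float), W1, b1, W2, b2)
    # Threshold the probabilities at 0.5 to get hard 0/1 labels.
    return (y_pred >= 0.5).astype(int)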

Example

import numpy as np

# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Train
W1, b1, W2, b2, losses = train(X, y, hidden_size=4, epochs=10000, learning_rate=1.0)

# Predict
predictions = predict(X, W1, b1, W2, b2)
# Should be [[0], [1], [1], [0]] (rerun if an unlucky initialization fails to converge)

# Loss should decrease
assert losses[-1] < losses[0]

Hints

  1. Matrix dimensions matter! Double-check shapes at each step.
  2. Use np.sum(..., axis=0) for the bias gradients; add keepdims=True only if you store the biases as 2-D row vectors instead of the 1-D shapes above.
  3. The learning rate for XOR can be high (0.5 to 2.0).
  4. 4 hidden neurons is enough for XOR, but more is fine.
  5. 5000-10000 epochs should be sufficient for convergence.

What you'll prove

By completing this capstone, you demonstrate understanding of:

  • Why linear models fail on non-linear problems
  • How hidden layers transform feature space
  • Forward propagation through a network
  • Backpropagation and the chain rule
  • Gradient descent for neural networks

This is the foundation of all deep learning.
