Neural Network from Scratch: Solving XOR
This capstone project brings together everything you've learned. You'll build a neural network from scratch that can solve the XOR problem—something logistic regression cannot do.
Background
The XOR (exclusive or) function:
Input → Output
(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0
This is not linearly separable. No straight line can separate the 0s from the 1s. A neural network with a hidden layer can learn to transform the space and solve this.
Functions to implement
1. sigmoid(z) and sigmoid_derivative(z)
The activation function and its derivative.
sigmoid(z) = 1 / (1 + exp(-z))
sigmoid_derivative(z) = sigmoid(z) * (1 - sigmoid(z))
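A minimal NumPy sketch of these two helpers (vectorized so they work element-wise on arrays):

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of sigmoid with respect to its input z."""
    s = sigmoid(z)
    return s * (1.0 - s)
```

Note that the derivative peaks at 0.25 when z = 0 and shrinks toward 0 for large |z|, which is why gradients vanish when units saturate.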
2. initialize_weights(input_size, hidden_size, output_size)
Initialize network weights randomly.
- W1: (input_size, hidden_size) - small random values
- b1: (hidden_size,) - zeros
- W2: (hidden_size, output_size) - small random values
- b2: (output_size,) - zeros
- Use values between -0.5 and 0.5 for weights
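One way to sketch this initialization with NumPy (the `seed` parameter is an addition for reproducibility, not part of the required signature):

```python
import numpy as np

def initialize_weights(input_size, hidden_size, output_size, seed=None):
    rng = np.random.default_rng(seed)
    # Weights: small uniform random values in [-0.5, 0.5); biases start at zero.
    W1 = rng.uniform(-0.5, 0.5, size=(input_size, hidden_size))
    b1 = np.zeros(hidden_size)
    W2 = rng.uniform(-0.5, 0.5, size=(hidden_size, output_size))
    b2 = np.zeros(output_size)
    return W1, b1, W2, b2
```

Random (rather than zero) weights break the symmetry between hidden units; if all weights started equal, every hidden neuron would compute the same thing and receive the same gradient.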
3. forward(X, W1, b1, W2, b2)
Compute forward pass through the network.
- z1 = X @ W1 + b1
- h = sigmoid(z1)
- z2 = h @ W2 + b2
- y_pred = sigmoid(z2)
- Return: (y_pred, h, z1)
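The four lines above translate almost directly into NumPy; a self-contained sketch (sigmoid inlined so the example runs on its own):

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z1 = X @ W1 + b1       # hidden pre-activation, shape (m, hidden_size)
    h = sigmoid(z1)        # hidden activations
    z2 = h @ W2 + b2       # output pre-activation, shape (m, output_size)
    y_pred = sigmoid(z2)   # predicted probabilities in (0, 1)
    return y_pred, h, z1
```

Returning `h` and `z1` alongside `y_pred` matters: backpropagation needs those intermediate values, so caching them here avoids recomputing them.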
4. binary_cross_entropy(y_true, y_pred)
Compute the loss.
- Add epsilon (1e-15) to prevent log(0)
- Return mean of: -[y log(p) + (1-y) log(1-p)]
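A sketch of the loss; here `np.clip` with the same epsilon is used as the guard against log(0), a common equivalent of adding epsilon inside the logs:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    eps = 1e-15
    # Keep predictions strictly inside (0, 1) so both logs are finite.
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
```

A perfect prediction gives a loss near 0, and a maximally uncertain prediction of 0.5 gives log 2 ≈ 0.693 per example.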
5. backward(X, y_true, y_pred, h, z1, W2)
Compute gradients via backpropagation.
- dz2 = y_pred - y_true
- dW2 = (1/m) * h.T @ dz2
- db2 = (1/m) * sum(dz2, axis=0)
- dh = dz2 @ W2.T
- dz1 = dh * sigmoid_derivative(z1)
- dW1 = (1/m) * X.T @ dz1
- db1 = (1/m) * sum(dz1, axis=0)
- Return: (dW1, db1, dW2, db2)
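These gradient formulas can be sketched as follows (sigmoid derivative inlined so the block is self-contained):

```python
import numpy as np

def backward(X, y_true, y_pred, h, z1, W2):
    m = X.shape[0]
    dz2 = y_pred - y_true              # (m, output): combined sigmoid + BCE gradient
    dW2 = h.T @ dz2 / m
    db2 = dz2.sum(axis=0) / m
    dh = dz2 @ W2.T                    # chain rule: propagate back into hidden layer
    s = 1.0 / (1.0 + np.exp(-z1))      # sigmoid(z1)
    dz1 = dh * s * (1.0 - s)           # multiply by sigmoid'(z1)
    dW1 = X.T @ dz1 / m
    db1 = dz1.sum(axis=0) / m
    return dW1, db1, dW2, db2
```

The tidy `dz2 = y_pred - y_true` is not an approximation: the sigmoid derivative and the cross-entropy derivative cancel when composed, which is exactly why this pairing of output activation and loss is standard.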
6. train(X, y, hidden_size, epochs, learning_rate)
Train the neural network.
- Initialize weights
- For each epoch: forward → loss → backward → update weights
- Return: (W1, b1, W2, b2, losses) where losses is a list of loss values
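Putting the pieces together, one possible self-contained sketch of the loop (helpers inlined; the `seed` argument is an addition for reproducibility, not part of the required signature):

```python
import numpy as np

def train(X, y, hidden_size, epochs, learning_rate, seed=0):
    rng = np.random.default_rng(seed)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    m, n_in = X.shape
    n_out = y.shape[1]
    W1 = rng.uniform(-0.5, 0.5, (n_in, hidden_size)); b1 = np.zeros(hidden_size)
    W2 = rng.uniform(-0.5, 0.5, (hidden_size, n_out)); b2 = np.zeros(n_out)
    losses = []
    for _ in range(epochs):
        # Forward pass
        h = sigmoid(X @ W1 + b1)
        y_pred = sigmoid(h @ W2 + b2)
        # Loss (clip guards log(0))
        p = np.clip(y_pred, 1e-15, 1 - 1e-15)
        losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
        # Backward pass
        dz2 = y_pred - y
        dW2 = h.T @ dz2 / m
        db2 = dz2.sum(axis=0) / m
        dz1 = (dz2 @ W2.T) * h * (1 - h)   # sigmoid'(z1) = h * (1 - h)
        dW1 = X.T @ dz1 / m
        db1 = dz1.sum(axis=0) / m
        # Gradient descent update
        W1 -= learning_rate * dW1; b1 -= learning_rate * db1
        W2 -= learning_rate * dW2; b2 -= learning_rate * db2
    return W1, b1, W2, b2, losses
```

Note the trick in the backward pass: since `h = sigmoid(z1)` was already computed, `sigmoid'(z1)` is just `h * (1 - h)`, so `z1` never needs to be stored separately here.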
7. predict(X, W1, b1, W2, b2)
Make predictions (0 or 1) using trained weights.
- Run forward pass
- Return 1 if y_pred >= 0.5, else 0
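A sketch of the prediction step, rerunning the forward pass and thresholding at 0.5:

```python
import numpy as np

def predict(X, W1, b1, W2, b2):
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sigmoid(X @ W1 + b1)
    y_pred = sigmoid(h @ W2 + b2)
    return (y_pred >= 0.5).astype(int)   # hard 0/1 labels
```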
Example
import numpy as np

# XOR dataset (NumPy arrays, so the matrix operations work)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
# Train
W1, b1, W2, b2, losses = train(X, y, hidden_size=4, epochs=10000, learning_rate=1.0)
# Predict
predictions = predict(X, W1, b1, W2, b2)
# Should be [[0], [1], [1], [0]] (or close to it)
# Loss should decrease
assert losses[-1] < losses[0]
Hints
- Matrix dimensions matter! Double-check shapes at each step.
- Use sum(..., axis=0, keepdims=True) for bias gradients if needed.
- The learning rate for XOR can be high (0.5 to 2.0).
- 4 hidden neurons is enough for XOR, but more is fine.
- 5000-10000 epochs should be sufficient for convergence.
What you'll prove
By completing this capstone, you demonstrate understanding of:
- Why linear models fail on non-linear problems
- How hidden layers transform feature space
- Forward propagation through a network
- Backpropagation and the chain rule
- Gradient descent for neural networks
This is the foundation of all deep learning.