Gradient Descent & Optimization
Core Update
- θ = θ - lr × ∇L(θ)
- Move in the direction opposite to the gradient (steepest descent)
- Repeat until convergence
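A minimal NumPy sketch of this update on a toy quadratic loss (the loss, step count, and learning rate here are illustrative, not from the notes):
```python
import numpy as np

# Repeatedly apply theta <- theta - lr * grad(theta) on a toy quadratic
# loss L(theta) = ||theta - target||^2, whose gradient is 2 * (theta - target).
def gradient_descent(grad_fn, theta, lr=0.1, steps=100):
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)   # step against the gradient
    return theta

target = np.array([3.0, -2.0])
grad = lambda th: 2 * (th - target)           # gradient of the quadratic
print(gradient_descent(grad, np.zeros(2)))    # converges toward [3, -2]
```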
Learning Rate
- Too small: slow
- Too large: diverge
- Start around 0.001 (typical Adam default); plain SGD often needs a larger rate (0.01-0.1)
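A toy 1-D illustration of how the learning rate controls convergence; the quadratic loss and the specific rates are assumptions chosen for demonstration:
```python
# Minimize L(x) = x^2, whose gradient is 2x, so each update is
# x <- x - lr * 2x = (1 - 2*lr) * x.
def run(lr, x=1.0, steps=20):
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(run(0.001))  # too small: barely moves toward the minimum at 0
print(run(0.4))    # reasonable: very close to 0 after 20 steps
print(run(1.1))    # too large: |1 - 2*lr| > 1, the iterate blows up
```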
Batch Types
- Batch: all data, stable, slow
- Stochastic: one sample, noisy, fast
- Mini-batch: e.g. 32-128 samples, balances stability and speed
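A sketch of one epoch of mini-batch SGD on synthetic linear-regression data (the data X, y, the batch size, and the learning rate are made up for illustration):
```python
import numpy as np

# One epoch of mini-batch SGD for least-squares linear regression.
def sgd_epoch(X, y, w, lr=0.01, batch_size=32):
    idx = np.random.permutation(len(X))               # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # MSE gradient on the batch
        w = w - lr * grad
    return w

X = np.random.randn(256, 5)
y = X @ np.arange(1.0, 6.0)                           # true weights [1..5]
w = np.zeros(5)
for _ in range(50):
    w = sgd_epoch(X, y, w)
print(w)                                              # approaches [1, 2, 3, 4, 5]
```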
Momentum
- velocity = β × velocity + gradient
- θ = θ - lr × velocity
- Smooths oscillations
- Typical β = 0.9
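The same toy quadratic optimized with the momentum rule above (the learning rate and step count are illustrative):
```python
import numpy as np

# Momentum: the velocity accumulates past gradients, which damps
# oscillations and speeds up progress along consistent directions.
def momentum_descent(grad_fn, theta, lr=0.01, beta=0.9, steps=200):
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        velocity = beta * velocity + grad_fn(theta)  # accumulate gradients
        theta = theta - lr * velocity                # step along the velocity
    return theta

target = np.array([3.0, -2.0])
grad = lambda th: 2 * (th - target)
print(momentum_descent(grad, np.zeros(2)))           # converges toward [3, -2]
```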
Adam Optimizer
- Momentum + adaptive learning rate
- Default for deep learning
- lr=0.001, β1=0.9, β2=0.999
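A NumPy sketch of the Adam update with the defaults listed above, plus the standard eps=1e-8 smoothing term:
```python
import numpy as np

# Adam: momentum (first moment) plus per-parameter adaptive scaling
# (second moment), with bias correction for the zero-initialized moments.
def adam(grad_fn, theta, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    m = np.zeros_like(theta)                  # first moment (momentum)
    v = np.zeros_like(theta)                  # second moment (adaptive scale)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)            # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

target = np.array([3.0, -2.0])
grad = lambda th: 2 * (th - target)
print(adam(grad, np.zeros(2)))                # approaches [3, -2]
```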
Learning Rate Schedules
- Reduce lr during training
- Step decay, cosine annealing
- Warmup for stability
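Sketches of the three schedule shapes as plain functions of the step or epoch index; the base rates, decay factor, and warmup length are illustrative choices:
```python
import math

def step_decay(epoch, base_lr=0.1, drop=0.5, every=10):
    return base_lr * (drop ** (epoch // every))       # halve every 10 epochs

def cosine_annealing(step, total_steps, base_lr=0.001, min_lr=0.0):
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cos          # smooth decay to min_lr

def linear_warmup(step, warmup_steps=1000, base_lr=0.001):
    return base_lr * min(1.0, step / warmup_steps)    # ramp up, then constant

print(step_decay(25), cosine_annealing(500, 1000), linear_warmup(100))
```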
Practical Tips
- Start with Adam
- Monitor loss curves
- Use early stopping
- Normalize inputs
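A minimal early-stopping loop that monitors validation loss; train_epoch and validate are hypothetical stand-ins for real training and evaluation code:
```python
import random

# Stop when the validation loss has not improved for `patience` epochs.
def train_with_early_stopping(train_epoch, validate, max_epochs=100, patience=5):
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch     # new best: keep going
        elif epoch - best_epoch >= patience:            # stalled: stop early
            print(f"stopping at epoch {epoch}")
            break
    return best_loss

# Stand-in "training": validation loss decays, then plateaus with noise.
losses = (1.0 / (e + 1) + random.uniform(0, 0.02) for e in range(100))
print(train_with_early_stopping(lambda: None, lambda: next(losses)))
```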