Gradient Descent & Optimization
Core Update
- θ = θ - lr × ∇L(θ)
- Move in the direction opposite to the gradient (steepest descent)
- Repeat until convergence
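A minimal NumPy sketch of this update on a toy quadratic loss (the loss, step count, and learning rate here are illustrative, not from the notes):
```python
import numpy as np

# Repeatedly apply theta <- theta - lr * grad(theta) on a toy quadratic
# loss L(theta) = ||theta - target||^2, whose gradient is 2 * (theta - target).
def gradient_descent(grad_fn, theta, lr=0.1, steps=100):
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)   # step against the gradient
    return theta

target = np.array([3.0, -2.0])
grad = lambda th: 2 * (th - target)           # gradient of the quadratic
print(gradient_descent(grad, np.zeros(2)))    # converges toward [3, -2]
```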
Learning Rate
- Too small: slow
- Too large: diverge
- Start around 0.001 (typical Adam default); plain SGD often needs a larger rate (0.01-0.1)
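A toy 1-D illustration of how the learning rate controls convergence; the quadratic loss and the specific rates are assumptions chosen for demonstration:
```python
# Minimize L(x) = x^2, whose gradient is 2x, so each update is
# x <- x - lr * 2x = (1 - 2*lr) * x.
def run(lr, x=1.0, steps=20):
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(run(0.001))  # too small: barely moves toward the minimum at 0
print(run(0.4))    # reasonable: very close to 0 after 20 steps
print(run(1.1))    # too large: |1 - 2*lr| > 1, the iterate blows up
```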
Batch Types
- Batch: all data, stable, slow
- Stochastic: one sample, noisy, fast
- Mini-batch: e.g. 32-128 samples, balances stability and speed
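A sketch of one epoch of mini-batch SGD on synthetic linear-regression data (the data X, y, the batch size, and the learning rate are made up for illustration):
```python
import numpy as np

# One epoch of mini-batch SGD for least-squares linear regression.
def sgd_epoch(X, y, w, lr=0.01, batch_size=32):
    idx = np.random.permutation(len(X))               # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # MSE gradient on the batch
        w = w - lr * grad
    return w

X = np.random.randn(256, 5)
y = X @ np.arange(1.0, 6.0)                           # true weights [1..5]
w = np.zeros(5)
for _ in range(50):
    w = sgd_epoch(X, y, w)
print(w)                                              # approaches [1, 2, 3, 4, 5]
```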
Momentum
- velocity = β × velocity + gradient
- θ = θ - lr × velocity
- Smooths oscillations
- Typical β = 0.9
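The same toy quadratic optimized with the momentum rule above (the learning rate and step count are illustrative):
```python
import numpy as np

# Momentum: the velocity accumulates past gradients, which damps
# oscillations and speeds up progress along consistent directions.
def momentum_descent(grad_fn, theta, lr=0.01, beta=0.9, steps=200):
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        velocity = beta * velocity + grad_fn(theta)  # accumulate gradients
        theta = theta - lr * velocity                # step along the velocity
    return theta

target = np.array([3.0, -2.0])
grad = lambda th: 2 * (th - target)
print(momentum_descent(grad, np.zeros(2)))           # converges toward [3, -2]
```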
Adam Optimizer
- Momentum + adaptive learning rate
- Default for deep learning
- lr=0.001, β1=0.9, β2=0.999
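A NumPy sketch of the Adam update with the defaults listed above, plus the standard eps=1e-8 smoothing term:
```python
import numpy as np

# Adam: momentum (first moment) plus per-parameter adaptive scaling
# (second moment), with bias correction for the zero-initialized moments.
def adam(grad_fn, theta, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    m = np.zeros_like(theta)                  # first moment (momentum)
    v = np.zeros_like(theta)                  # second moment (adaptive scale)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)            # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

target = np.array([3.0, -2.0])
grad = lambda th: 2 * (th - target)
print(adam(grad, np.zeros(2)))                # approaches [3, -2]
```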
Learning Rate Schedules
- Reduce lr during training
- Step decay, cosine annealing
- Warmup for stability
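Sketches of the three schedule shapes as plain functions of the step or epoch index; the base rates, decay factor, and warmup length are illustrative choices:
```python
import math

def step_decay(epoch, base_lr=0.1, drop=0.5, every=10):
    return base_lr * (drop ** (epoch // every))       # halve every 10 epochs

def cosine_annealing(step, total_steps, base_lr=0.001, min_lr=0.0):
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cos          # smooth decay to min_lr

def linear_warmup(step, warmup_steps=1000, base_lr=0.001):
    return base_lr * min(1.0, step / warmup_steps)    # ramp up, then constant

print(step_decay(25), cosine_annealing(500, 1000), linear_warmup(100))
```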
Practical Tips
- Start with Adam
- Monitor loss curves
- Use early stopping
- Normalize inputs
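A minimal early-stopping loop that monitors validation loss; train_epoch and validate are hypothetical stand-ins for real training and evaluation code:
```python
import random

# Stop when the validation loss has not improved for `patience` epochs.
def train_with_early_stopping(train_epoch, validate, max_epochs=100, patience=5):
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch     # new best: keep going
        elif epoch - best_epoch >= patience:            # stalled: stop early
            print(f"stopping at epoch {epoch}")
            break
    return best_loss

# Stand-in "training": validation loss decays, then plateaus with noise.
losses = (1.0 / (e + 1) + random.uniform(0, 0.02) for e in range(100))
print(train_with_early_stopping(lambda: None, lambda: next(losses)))
```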