Gradient Descent & Optimization

Core Update

  • θ ← θ − lr × ∇L(θ)
  • Step opposite the gradient (direction of steepest descent)
  • Repeat until convergence (see the sketch below)
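
A minimal sketch of the update rule on a toy quadratic loss L(θ) = (θ − 3)²; the loss, starting point, and learning rate are illustrative assumptions, not from the slides:

```python
# Toy loss: L(theta) = (theta - 3)^2, so grad L(theta) = 2 * (theta - 3).
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter (illustrative)
lr = 0.1      # learning rate (illustrative)

for step in range(100):
    theta = theta - lr * grad(theta)   # theta <- theta - lr * grad L(theta)

print(theta)  # converges toward the minimizer theta = 3
```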

Learning Rate

  • Too small: slow convergence
  • Too large: loss oscillates or diverges
  • 0.001 is a common starting point (compare rates in the sketch below)
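
The same toy quadratic shows all three regimes; the specific rates and step count below are illustrative assumptions:

```python
# Toy loss as before: L(theta) = (theta - 3)^2, grad = 2 * (theta - 3).
def grad(theta):
    return 2.0 * (theta - 3.0)

def run(lr, steps=20, theta0=0.0):
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

print(run(lr=0.001))  # too small: barely moves toward 3 in 20 steps
print(run(lr=0.1))    # reasonable: ends up close to 3
print(run(lr=1.1))    # too large: overshoots and diverges
```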

Batch Types

  • Batch: all data per update; stable but slow
  • Stochastic: one sample per update; noisy but fast
  • Mini-batch: 32-128 samples per update; the usual balance (see the sketch below)
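
A minimal mini-batch SGD sketch on a toy linear-regression problem; the synthetic data, batch size of 64, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative): y = X @ w_true + noise.
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.01, 64   # mini-batch size in the 32-128 range

for epoch in range(20):
    idx = rng.permutation(len(X))            # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]    # one mini-batch of indices
        grad = 2.0 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # MSE gradient on the batch
        w -= lr * grad

print(np.allclose(w, w_true, atol=0.05))     # True: w recovers w_true closely
```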

Momentum

  • velocity = β × velocity + gradient, then θ = θ − lr × velocity
  • Smooths oscillations across updates
  • Typical β = 0.9 (see the sketch below)
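
A sketch of the momentum update on the same toy quadratic, using the velocity form above plus the parameter step; lr and β values here are illustrative:

```python
# Toy loss: L(theta) = (theta - 3)^2.
def grad(theta):
    return 2.0 * (theta - 3.0)

theta, velocity = 0.0, 0.0
lr, beta = 0.1, 0.9

for _ in range(100):
    velocity = beta * velocity + grad(theta)   # exponentially decaying average of gradients
    theta = theta - lr * velocity              # step along the smoothed direction

print(theta)   # approaches 3; the velocity term pays off most on ill-conditioned losses
```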

Adam Optimizer

  • Momentum plus a per-parameter adaptive learning rate
  • Common default optimizer for deep learning
  • lr=0.001, β1=0.9, β2=0.999 (see the sketch below)
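
A minimal sketch of the Adam update, including bias correction, on the same toy quadratic; the step count is an illustrative assumption:

```python
import math

# Toy loss: L(theta) = (theta - 3)^2.
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0
m, v = 0.0, 0.0                      # first- and second-moment estimates
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 5001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # momentum-like average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2     # average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)   # adaptive per-parameter step

print(theta)   # converges toward 3
```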

Learning Rate Schedules

  • Reduce the learning rate as training progresses
  • Common choices: step decay, cosine annealing
  • Add a warmup phase for stability early in training (see the sketch below)
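
A sketch of one common combination, linear warmup followed by cosine annealing; the function name and all parameter values are illustrative assumptions:

```python
import math

def lr_at(step, total_steps=10_000, warmup_steps=500,
          base_lr=1e-3, min_lr=1e-5):
    """Linear warmup followed by cosine annealing (illustrative values)."""
    if step < warmup_steps:
        # Warmup: ramp linearly from ~0 up to base_lr for early stability.
        return base_lr * (step + 1) / warmup_steps
    # Cosine annealing: decay smoothly from base_lr down to min_lr.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(0))       # ~2e-6, start of warmup
print(lr_at(500))     # 1e-3, peak after warmup
print(lr_at(10_000))  # ~1e-5, fully annealed
```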

Practical Tips

  • Start with Adam
  • Monitor training and validation loss curves
  • Use early stopping
  • Normalize inputs (these tips are combined in the sketch below)
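
A minimal PyTorch-style sketch combining these tips on a toy regression task; the model, data, patience, and improvement threshold are illustrative assumptions, and it trains full-batch for brevity:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy regression data (illustrative); inputs already have zero mean, unit variance.
X = torch.randn(1000, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(1000, 1)
X_train, y_train, X_val, y_val = X[:800], y[:800], X[800:], y[800:]

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # start with Adam defaults
loss_fn = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    print(f"epoch {epoch}: train {loss.item():.4f}  val {val_loss:.4f}")  # monitor both curves

    # Early stopping: halt once validation loss stops improving.
    if val_loss < best_val - 1e-4:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```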