Gradient descent
Purpose: Minimize a loss function
Update Rule: θ ← θ − η ∇_θ L(θ)
η: Learning rate (step size). ∇_θ L(θ): Gradient of the loss w.r.t. the parameters θ.
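A minimal sketch of this update loop in Python/NumPy; the quadratic loss and its gradient are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, steps=100):
    """Repeatedly apply theta <- theta - lr * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# Illustrative loss L(theta) = ||theta - 3||^2, whose gradient is 2*(theta - 3).
grad = lambda theta: 2.0 * (theta - 3.0)
print(gradient_descent(grad, theta0=[0.0], lr=0.1, steps=200))  # ~[3.0]
```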
Variants:
- Batch GD: Uses full dataset (slow, precise).
- Stochastic GD (SGD): One random sample per step (fast, noisy).
- Mini-batch SGD: Small batches per step; a compromise between the two (standard in deep learning); see the sketch after this list.
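A hedged sketch of mini-batch SGD on least-squares linear regression; the synthetic data, batch size, and learning rate below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus noise (illustrative).
X = rng.uniform(-1, 1, size=(256, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.standard_normal(256)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32

for epoch in range(50):
    idx = rng.permutation(len(X))          # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = w * xb + b - yb
        # Gradients of mean squared error computed on this mini-batch only.
        grad_w = 2.0 * np.mean(err * xb)
        grad_b = 2.0 * np.mean(err)
        w -= lr * grad_w
        b -= lr * grad_b

print(w, b)  # should approach 2.0 and 1.0
```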
Key Idea: Follow the steepest descent to find a (local) minimum.
Challenge: Choosing the learning rate η (too small: slow convergence; too large: divergence or oscillation).
Advanced Optimizers (Adam, RMSprop) adapt the effective learning rate per parameter during training.
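As an illustration of how such optimizers adapt the step size, a minimal Adam-style update; the hyperparameters are the commonly cited defaults, used here as assumptions:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: running averages of the gradient (m) and squared
    gradient (v) rescale the step for each parameter individually."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([0.0])
m = v = np.zeros_like(theta)
grad_fn = lambda th: 2.0 * (th - 3.0)      # same illustrative quadratic loss
for t in range(1, 5001):
    theta, m, v = adam_step(theta, grad_fn(theta), m, v, t, lr=0.01)
print(theta)  # approaches 3.0
```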