Gradient descent

Purpose: Minimize a loss function C(w) by iteratively updating parameters w (weights/biases) in the direction of the negative gradient.

Update Rule:

$w \leftarrow w - \alpha \, \nabla_w C(w)$
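
A minimal sketch of this update rule in NumPy, applied to a least-squares loss C(w) = ||Xw − y||²/n. The data, learning rate, and iteration count here are illustrative choices, not prescribed values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # noisy linear data

w = np.zeros(3)        # initial parameters
alpha = 0.1            # learning rate
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)     # gradient of the loss w.r.t. w
    w = w - alpha * grad                      # w ← w − α ∇_w C(w)

print(w)  # approaches true_w
```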

Variants:

  1. Batch GD: computes the gradient over the full dataset per update (accurate gradient, slow per step).
  2. Stochastic GD (SGD): uses one random sample per update (fast, noisy).
  3. Mini-batch SGD: averages the gradient over a small random batch; a compromise between the two and the default in deep learning (see the sketch after this list).
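
A hedged sketch of mini-batch SGD on the same illustrative least-squares loss: setting batch_size equal to the dataset size recovers batch GD, and batch_size = 1 recovers plain SGD. The hyperparameters are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

def minibatch_sgd(X, y, alpha=0.05, batch_size=16, epochs=50):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)  # gradient on the batch
            w -= alpha * grad                          # same update rule, noisier gradient
    return w

print(minibatch_sgd(X, y))
```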

Key Idea: Repeatedly step in the direction of steepest descent (the negative gradient) until reaching a (local) minimum.

Challenge: Choosing α (too small → slow convergence; too large → overshooting or divergence).
Adaptive optimizers (Adam, RMSprop) scale the effective step size per parameter using running gradient statistics (see the sketch below).
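
A minimal sketch of an RMSprop-style update, assuming standard default hyperparameters (alpha, rho, eps are illustrative): the raw step α is divided by a running root-mean-square of past gradients, so each parameter gets its own effective step size.

```python
import numpy as np

def rmsprop_step(w, grad, state, alpha=0.01, rho=0.9, eps=1e-8):
    state = rho * state + (1 - rho) * grad**2        # running mean of squared gradients
    w = w - alpha * grad / (np.sqrt(state) + eps)    # per-parameter scaled step
    return w, state

w = np.zeros(3)
state = np.zeros(3)
grad = np.array([0.5, -2.0, 0.1])                    # example gradient (hypothetical values)
w, state = rmsprop_step(w, grad, state)
print(w)
```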