Gradient descent

Purpose: Minimize a loss function C(w) by iteratively updating parameters w (weights/biases) in the direction of the negative gradient.

Update Rule:

$w \leftarrow w - \alpha \, \nabla_w C(w)$
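
A minimal sketch of this update rule in NumPy, applied to a least-squares loss C(w) = ||Xw − y||²/n. The data, learning rate, and iteration count here are illustrative choices, not prescribed values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)   # noisy linear data

w = np.zeros(3)        # initial parameters
alpha = 0.1            # learning rate
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)     # gradient of the loss w.r.t. w
    w = w - alpha * grad                      # w ← w − α ∇_w C(w)

print(w)  # approaches true_w
```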

Variants:

  1. Batch GD: computes the gradient over the full dataset per update (accurate gradient, slow per step).
  2. Stochastic GD (SGD): uses one random sample per update (fast, noisy).
  3. Mini-batch SGD: averages the gradient over a small random batch; a compromise between the two and the default in deep learning (see the sketch after this list).
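
A hedged sketch of mini-batch SGD on the same illustrative least-squares loss: setting batch_size equal to the dataset size recovers batch GD, and batch_size = 1 recovers plain SGD. The hyperparameters are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

def minibatch_sgd(X, y, alpha=0.05, batch_size=16, epochs=50):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)  # gradient on the batch
            w -= alpha * grad                          # same update rule, noisier gradient
    return w

print(minibatch_sgd(X, y))
```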

Key Idea: Repeatedly step in the direction of steepest descent (the negative gradient) until reaching a (local) minimum.

Challenge: Choosing α (too small → slow convergence; too large → overshooting or divergence).
Adaptive optimizers (Adam, RMSprop) scale the effective step size per parameter using running gradient statistics (see the sketch below).
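
A minimal sketch of an RMSprop-style update, assuming standard default hyperparameters (alpha, rho, eps are illustrative): the raw step α is divided by a running root-mean-square of past gradients, so each parameter gets its own effective step size.

```python
import numpy as np

def rmsprop_step(w, grad, state, alpha=0.01, rho=0.9, eps=1e-8):
    state = rho * state + (1 - rho) * grad**2        # running mean of squared gradients
    w = w - alpha * grad / (np.sqrt(state) + eps)    # per-parameter scaled step
    return w, state

w = np.zeros(3)
state = np.zeros(3)
grad = np.array([0.5, -2.0, 0.1])                    # example gradient (hypothetical values)
w, state = rmsprop_step(w, grad, state)
print(w)
```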