Cross-Entropy and LLM Training
To understand how Large Language Models learn, we can map the mathematical concepts of information theory directly onto the mechanics of a guessing game. What follows is a complete breakdown of how Shannon entropy, cross-entropy, and perplexity govern the training of modern AI, using an extended version of "Guess Who?" as our guide.
1. The Game Board: Shannon Entropy
At the heart of any guessing game is uncertainty. In "Guess Who?", you face a board of 24 characters, any one of whom could be the secret answer.

If you have no clues, your total uncertainty is captured by Shannon entropy. For a perfectly uniform distribution where every answer is equally likely, the initial entropy is:

$$H = -\sum_{i=1}^{24} \frac{1}{24} \log_2 \frac{1}{24} = \log_2 24 \approx 4.58 \ \text{bits}$$
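As a quick check, the opening entropy can be verified in a few lines of Python:

```python
# Shannon entropy of the opening position: 24 characters, all equally likely.
import math

n = 24
p = 1 / n
H = -sum(p * math.log2(p) for _ in range(n))
print(f"{H:.2f} bits")        # 4.58 bits
print(f"{math.log2(n):.2f}")  # same number, computed directly as log2(24)
```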
As a sentence unfolds ("Maria was preparing..."), the context acts like a stream of signals, giving rise to a probability distribution over the next word.
2. The Two Players: Reality ($P$) vs. The Model ($Q$)
To understand how an LLM is trained, we introduce a twist: a probabilistic opponent.
Reality ($P$). Your opponent does not hold a single secret character in mind; they have hidden probabilistic preferences over the whole board. This is the true distribution $P$: the actual probability of each possible answer. Moreover, $P$ is never directly observable; all you ever see are the individual picks it produces.

The Model's Belief ($Q$). You cannot read your opponent's mind, but you can read their behaviour. You watch for signals:
- Are they sweating?
- How frequently are they blinking?
- Did they hesitate before sitting down?
- Are they leaning slightly to the left?
Call this vector of observations $x$.

If your opponent is sweating heavily and blinking rapidly, you might reason: historically, this combination of signals happens when they have picked someone they consider a risky, unusual choice: probably Maria or Carlos. Your belief $Q(\cdot \mid x)$ sharpens accordingly, concentrating its probability mass on those two characters.
3. The Mapping: From Signals to Beliefs

This is precisely what a neural network does. The "visual signals" $x$ are the tokens of the context read so far, and the belief $Q(\cdot \mid x)$ is the probability distribution the network outputs over its vocabulary.
The entire weight matrix of the neural network (potentially hundreds of billions of parameters) exists solely to implement this mapping:

$$x \;\longmapsto\; Q(\cdot \mid x)$$

as accurately as possible.
The training problem is, in its entirety, the problem of making $Q(\cdot \mid x)$ match the true distribution $P(\cdot \mid x)$ as closely as possible, for every context $x$.
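To make the mapping concrete, here is a minimal sketch in Python. The characters, signal features, and weight matrix `W` are invented for illustration; in a real LLM the single linear layer becomes billions of parameters arranged in attention layers:

```python
# A toy "signal-reader": observations x -> belief Q(. | x) via softmax.
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Turn raw scores into a probability distribution."""
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

characters = ["Maria", "Carlos", "Anna", "Peter"]
x = np.array([1.0, 1.0, 0.0])  # observations: [sweating, rapid blinking, leaning left]

# One weight row per character: how each signal shifts belief toward that character.
W = np.array([
    [ 1.2,  0.9, -0.1],   # Maria: a risky pick, correlates with sweating/blinking
    [ 1.0,  0.8,  0.2],   # Carlos: a similar profile
    [-0.5, -0.3,  0.1],   # Anna
    [-0.6, -0.4,  0.0],   # Peter
])

Q = softmax(W @ x)  # the belief Q(. | x): a distribution over all candidates
for name, q in zip(characters, Q):
    print(f"{name}: {q:.2f}")  # mass concentrates on Maria and Carlos
```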
4. The Collision: Cross-Entropy
If you already knew your opponent's true distribution $P$, your average surprise would simply be the entropy $H(P)$: the irreducible uncertainty of the game, which no strategy can beat.

But you don't know $P$. All you have is your model $Q$, built from the signals you can read.
Cross-entropy is your average surprise when outcomes are drawn from reality $P$ but surprise is measured by your imperfect model $Q$:

$$H(P, Q) = -\sum_{w} P(w) \log_2 Q(w)$$
It sums, over all possible characters (or words), the true probability of each outcome multiplied by the surprise ($-\log_2 Q$) your strategy would feel upon seeing it. When $Q = P$, the cross-entropy collapses to the true entropy $H(P)$.
Cross-entropy is always at least as large as the true entropy: $H(P, Q) \ge H(P)$, with equality exactly when $Q = P$. The gap between the two is the Kullback–Leibler divergence $D_{\mathrm{KL}}(P \,\|\, Q)$, the price paid for the mismatch.
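A small numerical check, using invented toy distributions, makes the inequality visible:

```python
# Verifying H(P, Q) >= H(P), with equality only when the model matches reality.
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy H(P) in bits."""
    return -np.sum(p * np.log2(p))

def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """Cross-entropy H(P, Q) in bits: average surprise of Q under reality P."""
    return -np.sum(p * np.log2(q))

P = np.array([0.5, 0.25, 0.125, 0.125])  # reality (illustrative)
Q = np.array([0.25, 0.25, 0.25, 0.25])   # a model that hasn't learned the signals

print(entropy(P))           # 1.75 bits: the irreducible uncertainty
print(cross_entropy(P, Q))  # 2.00 bits: extra surprise from the mismatch
print(cross_entropy(P, P))  # 1.75 bits: equality when Q = P
```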
The Empirical Trick: How We Compute It Without Knowing $P$
We never have access to the true distribution $P$.
When the model reads a training sentence, the actual next word, written by a human author, is treated as the ground truth for that moment. For that specific prediction, $P$ collapses to a one-hot distribution: probability 1 on the word that actually appeared and 0 everywhere else, so the cross-entropy sum reduces to a single term:

$$\text{loss} = -\log_2 Q(w_{\text{actual}} \mid x)$$
This is the loss for a single training step. If the model assigned a tiny probability to the word that actually came next, this number is huge — the model was highly surprised. If it assigned high probability, the number is small — the model was well-calibrated for this context.
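Here is that single-step loss in a few lines of Python; the four-word vocabulary and the probabilities are illustrative stand-ins:

```python
# The cross-entropy sum collapses to a single term: -log2 Q(actual word | context).
import numpy as np

vocab = ["dinner", "taxes", "report", "dough"]
Q = np.array([0.70, 0.05, 0.10, 0.15])  # model's belief after "Maria was preparing the..."

actual = vocab.index("dinner")    # the word the human author actually wrote
loss = -np.log2(Q[actual])
print(f"loss = {loss:.2f} bits")  # ~0.51 bits: low surprise, well-calibrated

loss_bad = -np.log2(Q[vocab.index("taxes")])  # if "taxes" had actually come next
print(f"loss = {loss_bad:.2f} bits")          # ~4.32 bits: high surprise
```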
5. Training: Learning to Read the Signals
Before neural networks, AI minimised surprise by physically counting historical sequences (N-gram models): to predict what follows "Maria was preparing the...", you looked up a database to see how often "dinner" appeared versus "taxes" in that exact context. This fails because of the curse of dimensionality — language is too vast, and almost any specific long context will have been seen exactly zero times.
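A minimal sketch of what an N-gram lookup amounts to, using an invented three-sentence corpus; note how a context never seen verbatim returns no counts at all:

```python
# Counting-based prediction: look up what followed each exact 4-word context.
from collections import Counter, defaultdict

corpus = [
    "maria was preparing the dinner",
    "maria was preparing the taxes",
    "maria was preparing the dinner",
]

counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(4, len(words)):
        counts[tuple(words[i - 4:i])][words[i]] += 1

ctx = ("maria", "was", "preparing", "the")
total = sum(counts[ctx].values())
print({w: c / total for w, c in counts[ctx].items()})
# {'dinner': 0.67, 'taxes': 0.33}

# The curse of dimensionality: an unseen context yields zero counts.
print(counts[("carlos", "was", "preparing", "the")])  # Counter() -- no data at all
```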
Modern LLMs do not count. Instead, they use cross-entropy as a grading system to iteratively improve the signal-reading function $x \mapsto Q(\cdot \mid x)$. Each training step has three moves (a runnable sketch follows the list):
- Forward pass. The model observes the context $x$ (everything before the next word), processes the relationships between concepts through its layers of attention and nonlinear transformations, and outputs a distribution $Q(\cdot \mid x)$.
- Calculate loss. The system evaluates $-\log_2 Q(w_{\text{actual}} \mid x)$, checking how surprised the model was by what actually came next.
- Backpropagation. If the loss is high, calculus is used to compute how each weight in the network contributed to the error, and all weights are nudged slightly in the direction that would reduce it.
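Here is the promised sketch of a single training step, assuming PyTorch and a deliberately tiny stand-in model fed random data. Note that PyTorch's `cross_entropy` measures surprise in nats (natural log) rather than bits, which changes only the unit, not the direction of the gradient:

```python
# One training step: forward pass, loss, backpropagation, weight update.
# Real LLMs use transformers and real text; the shapes here are purely illustrative.
import math
import torch
import torch.nn as nn

VOCAB, CTX, DIM = 50_000, 8, 64
model = nn.Sequential(
    nn.Embedding(VOCAB, DIM),    # token ids -> vectors (the "signals" x)
    nn.Flatten(),                # concatenate the 8 context vectors
    nn.Linear(CTX * DIM, VOCAB)  # a score for every word in the vocabulary
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

context = torch.randint(0, VOCAB, (1, CTX))  # stand-in for a real sentence prefix
actual_next = torch.randint(0, VOCAB, (1,))  # stand-in for the author's actual word

logits = model(context)                                   # 1. forward pass
loss = nn.functional.cross_entropy(logits, actual_next)   # 2. loss: -log Q(actual | x)

optimizer.zero_grad()
loss.backward()    # 3. backpropagation: how much each weight contributed to the surprise
optimizer.step()   #    nudge every weight in the direction that reduces it
print(f"loss = {loss.item():.2f} nats = {loss.item() / math.log(2):.2f} bits")
```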
In the Guess Who? analogy: after each round, instead of tallying "how many times did sweating correlate with Maria?", you update your interpretation rules. You refine the function that maps signals to beliefs. After thousands of rounds, you have learned non-obvious conjunctions: sweating alone means little, but sweating combined with a leftward glance and a short hesitation is a very sharp signal for exactly two characters. The LLM equivalent is learning that individual words matter less than specific combinations of syntactic structure, semantic field, and discourse position — patterns that narrow the next-word distribution dramatically.
What is being learned, in both cases, is not a lookup table. It is a function that generalises to contexts never seen before.
6. The Final Score: Perplexity
Cross-entropy gives us the penalty in bits, which is not intuitively interpretable. To fix this, we exponentiate:

$$\text{Perplexity} = 2^{H(P, Q)}$$
If cross-entropy is the average number of binary questions you must ask after processing your signals, perplexity is the effective number of characters still on the board.
- A cross-entropy loss of 3 bits gives a perplexity of $2^3 = 8$: on average, after reading the full context and processing all the signals, the model is as uncertain as if it were choosing uniformly among 8 equally plausible words.
- A model that assigned probability 1 to the actual next word, every time, would have a perplexity of $2^0 = 1$: absolute certainty.
- A completely untrained model guessing uniformly over a 50,000-word vocabulary has a perplexity of $2^{\log_2 50{,}000} = 50{,}000$.
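The conversion is one line of code; the three cases above check out numerically. (One practical note: deep-learning frameworks report the loss in nats, so in practice perplexity is computed as $e^{\text{loss}}$ rather than $2^{\text{loss}}$.)

```python
# Converting cross-entropy (in bits) to perplexity.
import math

def perplexity(bits: float) -> float:
    return 2 ** bits

print(perplexity(3.0))                       # 8.0 -> choosing among 8 plausible words
print(perplexity(0.0))                       # 1.0 -> absolute certainty
print(round(perplexity(math.log2(50_000))))  # 50000 -> uniform guessing over the vocabulary
```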
In the extended Guess Who? analogy, perplexity tells you how many characters remain on your board after you have processed all the visible signals from your opponent. A well-trained player has driven that effective board size down from 24 to perhaps 2 or 3. A well-trained LLM has driven it from 50,000 to somewhere in the single digits for predictable text, rising for genuinely ambiguous or creative continuations.
The entire goal of pre-training a large language model is to drive perplexity as low as possible — which means learning, from the structure of billions of sentences, the most accurate possible mapping from observed context to predicted distribution.
7. A Note on the Limits of the Analogy
The extended Guess Who? analogy holds cleanly in one direction: the model's job is to learn a belief-generating function, not a fixed belief, and training is the process of improving that function through repeated exposure to ground-truth outcomes.
One seam worth noting: in the game, your opponent is imagined to have a genuine probabilistic disposition. They do not pick a single secret character; they are a distribution. This is the right picture for language too. Many words would be natural continuations of "Maria was preparing the..." (dinner, taxes, report, dough), and $P$ assigns each of them real probability mass; the model's target is that whole distribution, not any single word.
This is why training requires not just many sentences, but many different sentences — each one is a single sample from the true distribution of language, and only by aggregating millions of samples does the empirical loss converge to the true cross-entropy $H(P, Q)$.
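A minimal simulation, with invented toy distributions standing in for language, shows the empirical average converging to $H(P, Q)$:

```python
# The empirical loss, averaged over samples drawn from P, converges to H(P, Q).
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.6, 0.25, 0.1, 0.05])  # "true" next-word distribution (illustrative)
Q = np.array([0.5, 0.3, 0.15, 0.05])  # the model's imperfect belief

true_ce = -np.sum(P * np.log2(Q))  # H(P, Q), computable only because we invented P

samples = rng.choice(len(P), size=100_000, p=P)  # "training sentences": draws from P
empirical = -np.log2(Q[samples]).mean()          # average surprise over the samples

print(f"true H(P,Q) = {true_ce:.4f} bits, empirical = {empirical:.4f} bits")
```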
Summary
| Concept | Guess Who? | LLM |
|---|---|---|
| Uncertainty | Characters on the board | Words in the vocabulary |
| Reality $P$ | Opponent's hidden probabilistic preferences | True distribution of language |
| Signals $x$ | Visible signals (sweating, blinking, ...) | Token embeddings, full context |
| Belief $Q(\cdot \mid x)$ | Your conditional belief given signals | Softmax output of the network |
| Cross-entropy $H(P, Q)$ | Expected questions with your strategy | Training loss |
| Backpropagation | Updating your signal-interpretation rules | Adjusting network weights |
| Perplexity | Effective characters left after reading signals | Effective vocabulary size at prediction time |