Cross-entropy and LLM training

To understand how Large Language Models learn, we can map the mathematical concepts of information theory directly onto the mechanics of a guessing game. What follows is a complete breakdown of how Shannon entropy, cross-entropy, and perplexity govern the training of modern AI, using an extended version of "Guess Who?" as our guide.


1. The Game Board: Shannon Entropy

At the heart of any guessing game is uncertainty. In "Guess Who?", you face a board of N characters. In an LLM, the model faces a vocabulary of N possible next words — typically around 50,000.

If you have no clues, your total uncertainty is captured by Shannon entropy. For a perfectly uniform distribution where every answer is equally likely, the initial entropy is:

H_{\text{initial}} = \log_2(N)

As a sentence unfolds ("Maria was preparing..."), context acts as a stream of signals, reshaping the probability distribution over the following word.
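To make this concrete, here is a minimal Python sketch; the 24-character board and the skewed probabilities are invented for illustration:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over nonzero p."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform board of 24 "Guess Who?" characters: H = log2(24) ~ 4.58 bits.
uniform = [1 / 24] * 24
print(shannon_entropy(uniform))  # ~4.585

# Context skews the distribution, and the uncertainty drops.
skewed = [0.50, 0.25] + [0.25 / 22] * 22
print(shannon_entropy(skewed))   # ~2.62 bits
```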

2. The Two Players: Reality (P) vs. The Model (Q)

To understand how an LLM is trained, we introduce a twist: a probabilistic opponent.

Reality (P): Your opponent does not pick a character and hide it as a single fixed secret. Instead, they are themselves probabilistic — on any given round, they might pick "Maria" with probability 0.50, "Alex" with 0.25, and so on, according to some hidden, complex set of preferences. You will never observe P directly. It governs which words naturally follow others in human language, emerging from the full complexity of grammar, meaning, culture, and context. It is the true distribution of language.
Moreover, P is not fixed. In Guess Who?, the distribution could depend on the internal state of the player, as read from external signals such as sweating or blinking; in LLMs, it depends on the preceding words. So we really have something like P(x): a distribution conditioned on this vector of observations x.

The Model's Belief (Q): You are the other player. You don't know your opponent's preferences. You hold your own probability distribution over the 24 characters, based on whatever reasoning you have managed to develop so far. In an LLM, Q is the predicted distribution over the vocabulary, generated by the neural network's current weights.
Q is not a static table of guesses that you carry into every round unchanged. Before each round, you observe a set of visible signals from your opponent: sweating, blinking, hesitation, and so on.

Call this vector of observations x. Your distribution is now a conditional belief Q(x): given everything I am currently observing, what is my probability distribution over the 24 characters?

If your opponent is sweating heavily and blinking rapidly, you might reason: historically, this combination of signals happens when they have picked someone they consider a risky, unusual choice — probably Maria or Carlos. Your Q shifts accordingly. A different pattern of signals produces a different Q. The same person, the same board, but a completely different distribution — because the context is different.

This is precisely what a neural network does. The "visible signals" x are the token embeddings and positional encodings fed into the model. The "reasoning about what those signals mean" is the cascade of matrix multiplications and attention operations. The final softmax layer is Q(x): a full probability distribution over the vocabulary, computed fresh for this specific context. Every single token prediction involves recomputing Q from scratch based on everything seen so far.
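A toy sketch of this belief-generating function may help. The characters, signals, and interpretation rules below are invented; a real network learns its rules from data rather than having them hand-coded:

```python
# Toy stand-in for the mapping x -> Q(x): signals in, distribution out.
def conditional_belief(signals, characters):
    """Return a probability distribution over characters given observed signals."""
    scores = {c: 1.0 for c in characters}  # start from a uniform prior
    if signals.get("sweating") and signals.get("rapid_blinking"):
        # Invented rule: this combination historically means a "risky pick".
        scores["Maria"] *= 8.0
        scores["Carlos"] *= 8.0
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}  # normalise, like a softmax

characters = ["Maria", "Carlos", "Alex", "Sam"]
print(conditional_belief({"sweating": True, "rapid_blinking": True}, characters))
print(conditional_belief({}, characters))  # different x, different Q(x)
```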

The entire weight matrix of the neural network — potentially hundreds of billions of parameters — exists solely to implement this mapping:

x_{\text{context}} \mapsto Q(x_{\text{context}})

as accurately as possible.

The training problem is, in its entirety, the problem of making Q(x_context) as close to P(x_context) as possible.

3. The Collision: Cross-Entropy

If you already knew your opponent's true distribution P, you could design a perfectly optimal questioning strategy. The average cost of that optimal strategy is the true entropy H(P). In the LLM case, you would simply predict with the true next-word distribution itself.

But you don't know P. You play using a strategy based on your conditional belief Q(x).

Cross-entropy is the average surprise you incur because your mental model does not perfectly match reality:

H(P, Q) = -\sum_i P(x_i) \log_2 Q(x_i \mid x_{\text{context}})

It sums, over all possible characters (or words), the surprise your strategy would feel at each outcome, weighted by that outcome's true probability. When Q assigns low probability to events that P says are common, the penalty is enormous.

Cross-entropy is always at least as large as the true entropy H(P). The gap between them is the KL divergence D_KL(P ‖ Q): the extra cost attributable purely to the mismatch between Q and P, and a central object in information geometry.
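A small numerical check of this decomposition, with invented distributions, shows that H(P, Q) ≥ H(P) and that the gap is exactly D_KL(P ‖ Q):

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits: reality P grades your belief Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.50, 0.25, 0.15, 0.10]  # reality's hidden preferences
q = [0.25, 0.25, 0.25, 0.25]  # your uninformed belief

h_p, h_pq = entropy(p), cross_entropy(p, q)
print(h_p, h_pq, h_pq - h_p)  # ~1.74, 2.00, 0.26: the gap is D_KL(P || Q)
```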

The Empirical Trick: How We Compute It Without Knowing P

We never have access to the true distribution P of human language. So we use a practical approximation: billions of pages of human-written text serve as a stand-in for reality.

When the model reads a training sentence, the actual next word — written by a human author — is treated as the ground truth for that moment. For that specific prediction, P=1 for the word that actually appeared, and P=0 for everything else. This collapses the full cross-entropy formula into:

\mathcal{L} = -\log_2 Q(\text{actual word} \mid x_{\text{context}})

This is the loss for a single training step. If the model assigned a tiny probability to the word that actually came next, this number is huge — the model was highly surprised. If it assigned high probability, the number is small — the model was well-calibrated for this context.
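In code, the single-step loss is one line. The toy vocabulary and the model's probabilities below are invented:

```python
import math

# The model's predicted distribution Q(. | context) over a tiny vocabulary.
q = {"dinner": 0.60, "taxes": 0.05, "report": 0.20, "dough": 0.15}

def token_loss(actual_word, q):
    """Loss for one training step: -log2 Q(actual word | context)."""
    return -math.log2(q[actual_word])

print(token_loss("dinner", q))  # ~0.74 bits: the model was well-calibrated
print(token_loss("taxes", q))   # ~4.32 bits: the model was highly surprised
```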


4. Training: Learning to Read the Signals

Before neural networks, AI minimised surprise by physically counting historical sequences (N-gram models): to predict what follows "Maria was preparing the...", you consulted a database to see how often "dinner" appeared versus "taxes" in that exact context. This fails because of the curse of dimensionality: language is too vast, and almost any specific long context will have been seen exactly zero times.
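A minimal sketch of that counting approach, on an invented two-sentence corpus, shows both the lookup and the zero-count failure mode:

```python
from collections import Counter, defaultdict

# The pre-neural "count and look up" approach to next-word prediction.
corpus = ("maria was preparing the dinner . "
          "maria was preparing the report .").split()

n = 3  # context length in tokens
counts = defaultdict(Counter)
for i in range(len(corpus) - n):
    context, nxt = tuple(corpus[i:i + n]), corpus[i + n]
    counts[context][nxt] += 1

print(counts[("was", "preparing", "the")])  # Counter({'dinner': 1, 'report': 1})
# Any context never seen verbatim has no counts at all: the curse of
# dimensionality described above.
print(counts[("maria", "was", "cooking")])  # Counter()
```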

Modern LLMs do not count. Instead, they use cross-entropy as a grading system to iteratively improve the signal-reading function x ↦ Q(x):

  1. Forward pass. The model observes the context x (everything before the next word), processes the relationships between concepts through its layers of attention and nonlinear transformations, and outputs a distribution Q(x).

  2. Calculate loss. The system evaluates -log_2 Q(actual word | x), checking how surprised the model was by what actually came next.

  3. Backpropagation. If the loss is high, calculus is used to compute how each weight in the network contributed to the error, and all weights are nudged slightly in the direction that would reduce it.
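In a framework like PyTorch, these three steps become a short loop. The sketch below uses an invented toy model (an embedding plus a linear layer over a 4-token context) trained on random data; note that PyTorch's cross_entropy is measured in nats (natural log) rather than bits, a difference of only a constant factor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 100, 32
# Toy next-token model: embed a 4-token context, flatten, map to logits.
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Flatten(),
                      nn.Linear(4 * dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

context = torch.randint(0, vocab_size, (8, 4))  # batch of 8 contexts, 4 tokens each
target = torch.randint(0, vocab_size, (8,))     # the word that actually came next

for step in range(3):
    logits = model(context)                 # 1. forward pass: unnormalised Q(x)
    loss = F.cross_entropy(logits, target)  # 2. loss: -log Q(actual word | x)
    optimizer.zero_grad()
    loss.backward()                         # 3. backpropagation
    optimizer.step()                        #    nudge weights to reduce surprise
    print(f"step {step}: loss {loss.item():.3f}")
```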

In the Guess Who? analogy: after each round, instead of tallying "how many times did sweating correlate with Maria?", you update your interpretation rules. You refine the function that maps signals to beliefs. After thousands of rounds, you have learned non-obvious conjunctions: sweating alone means little, but sweating combined with a leftward glance and a short hesitation is a very sharp signal for exactly two characters. The LLM equivalent is learning that individual words matter less than specific combinations of syntactic structure, semantic field, and discourse position — patterns that narrow the next-word distribution dramatically.

What is being learned, in both cases, is not a lookup table. It is a function that generalises to contexts never seen before.


5. The Final Score: Perplexity

Cross-entropy gives us the penalty in bits, which is not intuitively interpretable. To fix this, we exponentiate:

\text{Perplexity} = 2^{H(P, Q)}

If cross-entropy is the average number of binary questions you must ask after processing your signals, perplexity is the effective number of characters still on the board.

In the extended Guess Who? analogy, perplexity tells you how many characters remain on your board after you have processed all the visible signals from your opponent. A well-trained player has driven that effective board size down from 24 to perhaps 2 or 3. A well-trained LLM has driven it from 50,000 to somewhere in the single digits for predictable text, rising for genuinely ambiguous or creative continuations.
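A short sketch, with invented distributions, showing perplexity as the effective board size:

```python
import math

def perplexity(p, q):
    """Perplexity = 2 ** H(P, Q): the effective number of options left."""
    h = -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)
    return 2 ** h

n = 24
uniform = [1 / n] * n
print(perplexity(uniform, uniform))  # 24.0: no signals read, the whole board

# After strong signals, reality is sharply peaked and the model tracks it well.
sharp = [0.70, 0.20, 0.05, 0.05] + [0.0] * 20
good_model = [0.60, 0.25, 0.05, 0.04] + [0.003] * 20
print(perplexity(sharp, good_model))  # ~2.6 "characters left on the board"
```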

The entire goal of pre-training a large language model is to drive perplexity as low as possible — which means learning, from the structure of billions of sentences, the most accurate possible mapping from observed context to predicted distribution.


6. A Note on the Limits of the Analogy

The extended Guess Who? analogy holds cleanly in one direction: the model's job is to learn a belief-generating function, not a fixed belief, and training is the process of improving that function through repeated exposure to ground-truth outcomes.

One seam worth noting: in the game, your opponent is imagined to have a genuine probabilistic disposition — they do not pick a single secret character; they are a distribution. This is the right picture for language too. Many words would be natural continuations of "Maria was preparing the..." — dinner, taxes, report, dough — and P genuinely assigns nonzero probability to all of them. The training trick of treating the observed word as P=1 is an approximation: we see one draw from P, not P itself.

This is why training requires not just many sentences, but many different sentences — each one is a single sample from the true distribution of language, and only by aggregating millions of samples does the empirical loss converge to the true cross-entropy H(P,Q) that we actually want to minimise.
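A quick Monte Carlo sketch of this convergence, with invented P and Q: the average of one-sample losses over draws from P approaches the true H(P, Q):

```python
import math
import random

random.seed(0)

p = [0.50, 0.25, 0.15, 0.10]  # the true (hidden) distribution
q = [0.40, 0.30, 0.20, 0.10]  # the model's belief

true_h_pq = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

# Each training sentence contributes one draw from P; average the one-hot losses.
for n in (10, 1_000, 100_000):
    draws = random.choices(range(4), weights=p, k=n)
    empirical = sum(-math.log2(q[i]) for i in draws) / n
    print(f"{n:>7} samples: empirical {empirical:.4f}  vs  H(P,Q) {true_h_pq:.4f}")
```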


Summary

| Concept | Guess Who? | LLM |
|---|---|---|
| Uncertainty | Characters on the board | Words in the vocabulary |
| P | Opponent's hidden probabilistic preferences | True distribution of language |
| x | Visible signals (sweating, blinking, ...) | Token embeddings, full context |
| Q(x) | Your conditional belief given signals | Softmax output of the network |
| Cross-entropy loss | Expected questions with your strategy | -log_2 Q(actual word \| x) |
| Backpropagation | Updating your signal-interpretation rules | Adjusting network weights |
| Perplexity | Effective characters left after reading signals | Effective vocabulary size at prediction time |