Shannon entropy

Continuous case: differential entropy.

Given a discrete probability distribution $(p_1, p_2, \dots, p_n)$, the Shannon entropy measures the expected information content:

$$H = -\sum_i p_i \log_2 p_i.$$

It quantifies the uncertainty or randomness in the distribution. A higher entropy indicates a more uniform (less predictable) distribution. In a sense, it is the expected "surprise" of a probability distribution.
In the discrete case, $H = 0$ means the distribution is fully deterministic.
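
A minimal sketch of this formula (Python with NumPy; the function name `shannon_entropy` is just illustrative), comparing a uniform and a deterministic distribution:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H = -sum_i p_i log2 p_i of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # terms with p_i = 0 contribute nothing
    return -np.sum(p * np.log2(p))

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes -> 2.0 bits
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # deterministic -> 0.0 bits
```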

According to Maldacena in this video, it is a measure of the missing information in a message. For example, the sentence "Maria was preparing..." admits many different continuations, while "Maria was preparing the dinner" is closed. The first one has entropy.
Formally, this “missing information” corresponds to the probability distribution over possible continuations of the message: when many continuations are plausible with comparable weights, the entropy is high; when the continuation is essentially fixed, the entropy is low (ideally zero). Thus Maldacena’s description is an intuitive restatement of Shannon’s definition, where entropy quantifies the uncertainty in which outcome (or continuation) will occur.

Entropy and guessing games

In a guessing game ("Wordle" or "Guess Who?", for example), we have a space of answers $S_0$, which has a certain measure $\mu$: for example, the list of 5-letter words or the list of characters, whose measure is obtained by counting.
At each stage of the game we say we gain some information, meaning that we reduce the space of possible answers to a smaller subset $S_1 \subset S_0$. The information is quantified by the formula

$$I = -\log_2 \frac{\mu(S_1)}{\mu(S_0)}.$$

For example, if the information we obtain reduces the space of possible answers to one half, we say we have obtained 1 bit:

$$I = -\log_2 \frac{1}{2} = 1.$$
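
As a sketch (Python; the helper name `information_gain` is just illustrative), the same computation from the sizes of the answer sets, with the measure given by counting:

```python
import math

def information_gain(size_before, size_after):
    """Bits of information gained by shrinking the answer space,
    I = -log2(mu(S1) / mu(S0)), with the measure obtained by counting."""
    return -math.log2(size_after / size_before)

print(information_gain(32, 16))  # halving the space -> 1.0 bit
print(information_gain(24, 6))   # 24 characters down to 6 -> 2.0 bits
```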

![[Pasted image 20250821160743.png]]

For that purpose we make a guess (a question), which induces a partition of the space $S_0$. The other player, or the computer, tells us which element of the partition the answer belongs to. The quality of our guess, i.e., of the partition $\{A_i\}$ it induces, is the expected value of the information obtained from it, that is, the entropy, given by

$$H = \sum_i p_i \left(-\log_2 p_i\right),$$

where $p_i = \frac{\mu(A_i)}{\mu(S_0)}$. Observe that we are assuming all the final answers have the same probability, so the probability of belonging to an element of the partition is calculated simply by Laplace's rule. Underlying this idea is the principle of maximum entropy, I think.
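
A sketch of this scoring (Python; `question_entropy` is an illustrative name, and the 5/19 "glasses" split is just an invented example) that rates a question by the entropy of the partition it induces, assuming all answers are equally likely:

```python
import math

def question_entropy(partition_sizes):
    """Expected information (in bits) of a question, given the sizes of the
    partition classes it induces on the answer space.
    Assumes all answers are equally likely (Laplace's rule)."""
    total = sum(partition_sizes)
    h = 0.0
    for size in partition_sizes:
        if size == 0:
            continue
        p = size / total
        h += p * (-math.log2(p))
    return h

# "Does your character wear glasses?" splitting 24 characters into 5 / 19:
print(question_entropy([5, 19]))   # ~0.74 bits expected
# A perfectly balanced yes/no question:
print(question_entropy([12, 12]))  # 1.0 bit expected
```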

The central challenge in games like "Wordle" or "Guess Who?" is to acquire a specific amount of information while being constrained by the types of questions you're allowed to ask. At the heart of any guessing game is one ultimate, "master question" that would solve it instantly. For "Guess Who?", this question is "Which of the N characters is it?"

This question creates the most granular partition possible, where each of the N possible answers is its own unique subset. The informational value of this question is its entropy, calculated as:

$$H_{\text{master}} = -\sum_{i=1}^{N} \frac{1}{N}\log_2 \frac{1}{N} = \log_2(N)$$

This value represents the total information required to win the game. For a 24-character board, this is $H = \log_2(24) \approx 4.585$ bits. This quantity is also known as the initial entropy of the game state, as it measures our total uncertainty at the start.
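
Continuing the sketch above (reusing the illustrative `question_entropy` helper), the master question corresponds to the finest partition, one class per character:

```python
import math

N = 24
print(question_entropy([1] * N))  # finest partition: log2(24) ~ 4.585 bits
print(math.log2(N))               # same value, the initial entropy of the game
```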

The crucial rule of the game is that this master question is forbidden. The entire game is about substituting this single, powerful question with a series of smaller, less informative (but legal) questions.

For a mathematical proof that the minimum average number of binary questions required to determine an outcome of a random variable $X$ lies between $H$ and $H+1$, see Cover, T. M. and Thomas, J. A. (1991), *Elements of Information Theory*.
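
As an illustrative check (not the proof): an optimal sequence of yes/no questions corresponds to a binary Huffman code, whose expected length lies in $[H, H+1)$. A minimal Python sketch with a toy distribution:

```python
import heapq
import itertools
import math

def huffman_expected_length(probs):
    """Expected number of yes/no questions under an optimal strategy,
    i.e. the expected length of a binary Huffman code for `probs`."""
    counter = itertools.count()            # tie-breaker so heap tuples compare cleanly
    heap = [(p, next(counter)) for p in probs]
    heapq.heapify(heap)
    expected_length = 0.0
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        expected_length += p1 + p2         # each merge adds one question for this mass
        heapq.heappush(heap, (p1 + p2, next(counter)))
    return expected_length

probs = [0.5, 0.25, 0.125, 0.125]
H = -sum(p * math.log2(p) for p in probs)
L = huffman_expected_length(probs)
print(H, L)            # dyadic probabilities: H = L = 1.75, the lower bound is attained
assert H <= L < H + 1
```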

Related: