LLMs training

Here I am developing my understanding of AI, and of LLMs in particular. For the moment:

The forward pass

1. The Embedding Lookup

Suppose we want to feed in the sentence "Hi world". We have an Embedding Matrix ( $W_{E}$ ), a massive table where each row is a vector representing a word. At first, it is filled with random values. Suppose it has dimension $10000 \times 512$ , that is, we have 10000 words/tokens, each of which is encoded by 512 "characteristics".
Then, we proceed this way:

Input Preparation: You represent "Hi" and "world" as two One-Hot vectors ( $v_{1}$ and $v_{2}$ ). If your vocabulary size is 10,000, these vectors have 10,000 elements, all zeros except for a $1$ at index $15$ and index $3{,}256$ , respectively. You stack them as the rows of a matrix.
The Operation: Then you multiply your $2 \times 10{,}000$ input matrix by the $10{,}000 \times 512$ Embedding Matrix. The "1s" act as a switch, extracting exactly the rows at index 15 and index 3,256.

The result is a Dense Matrix ( $2 \times 512$ ) where each word is now a point in a 512-dimensional space. These points become meaningful as training adjusts the initially random values.
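The lookup above can be sketched in a few lines of NumPy. This is only an illustration under the assumptions already stated in the text (vocabulary of 10,000, 512 features, token indices 15 and 3,256); the variable names are mine. It also shows that multiplying by one-hot rows gives the same result as simply indexing the rows, which is why implementations usually skip the big multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 10_000, 512
# Embedding matrix W_E, filled with random values before any training.
W_E = rng.normal(size=(vocab_size, d_model))

# Token indices taken from the text: "Hi" -> 15, "world" -> 3256.
token_ids = np.array([15, 3256])

# One-hot input matrix of shape (2, 10000): one row per token.
one_hot = np.zeros((len(token_ids), vocab_size))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0

# The multiplication "switches on" exactly rows 15 and 3256 of W_E.
embedded = one_hot @ W_E

# Plain row indexing returns the same rows without the large product.
lookup = W_E[token_ids]

print(embedded.shape)                 # (2, 512)
print(np.allclose(embedded, lookup))  # True
```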

2. Positional Encoding

Now, here is the catch: In a Transformer, the model does not know which word comes first. If you feed the matrix for "Hi world" or "world Hi", the Transformer (at this stage) sees the exact same set of vectors. It has no sense of time or sequence. To fix this without using Recurrent Networks (RNNs), we use Positional Encoding.

Before the vectors reach the "Perceptron-like" layers, we add a unique mathematical signal to each row:

We take the vector for "Hi" and add a specific "Position 1" vector to it.
We take the vector for "world" and add a "Position 2" vector to it.

Now, the vector for "Hi" carries two pieces of information: what the word is (meaning) and where it is (order).
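As a rough sketch of this step, the snippet below adds a position vector to each row. The text does not say which position vectors are used, so the sinusoidal scheme here is an assumption (it is one common choice), and the two embedding rows are random stand-ins. Note that the code counts positions from 0, so "Position 1" in the text is row 0.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal position vectors (an assumed scheme, not fixed by the text)."""
    positions = np.arange(seq_len)[:, None]      # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]     # shape (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even feature indices
    pe[:, 1::2] = np.cos(angles)                 # odd feature indices
    return pe

rng = np.random.default_rng(0)
d_model = 512
# Random stand-ins for the embedding rows of "Hi" and "world".
hi, world = rng.normal(size=(2, d_model))

pe = sinusoidal_positions(seq_len=2, d_model=d_model)

x_hi_world = np.stack([hi, world]) + pe   # "Hi world"
x_world_hi = np.stack([world, hi]) + pe   # "world Hi"

# "Hi" in first place no longer equals "Hi" in second place.
same_hi = np.allclose(x_hi_world[0], x_world_hi[1])
print(same_hi)  # False
```

Without the added `pe` rows, the two orderings would contain exactly the same two vectors; with them, the same word produces a different input depending on where it sits.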

3. Moving into Self-Attention

Once the words have their "positional tags," they enter the Self-Attention mechanism. This is where the magic happens: instead of processing tokens in isolation, the model calculates a Weighted Sum of the values in the sequence.

To understand how this "conversation" between words is actually computed, we must distinguish between what a token is and what it is looking for.

A. The Starting Point: "What I am"

Consider the sentence "The dog chased the black cat through the garden".
Before any transformation, a token's vector is the sum of its Semantic Embedding and its Positional Vector; call the result $x$ . This combined vector represents its static identity:
"I am the word 'black' (concept of dark color) and I am at position 5 of the sentence."

B. The $W_{Q}$ Transformation: "What I seek" (Query Mode)

The static identity isn't a "question." To turn it into one, the model multiplies the vector by the weight matrix $W_{Q}$ . This transforms the identity into a specific Query (Q):

"I am a masculine singular adjective at position 5... therefore, I am looking for a masculine singular noun nearby (likely at position 6) to modify and describe."

In effect, the $W_{Q}$ matrix switches the original embedding+position vector into "search mode".

C. The $W_{K}$ Transformation: "The Shop Window" (Key Mode)

For a query to succeed, it needs a match. Simultaneously, every token in the sequence is multiplied by $W_{K}$ to generate a Key (K). This puts tokens in "label mode" or "shop window mode." If position 6 contains the word "cat," its Key vector will effectively announce:

"Hey! I am a masculine singular noun at position 6."

D. The Match: The Dot Product

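When the model computes the dot product between the Query of "black" and the Key of "cat," the vectors align. The mathematical "click" results in a high attention score, meaning "black" will assign significant weight to "cat." Computing this for every pair of tokens at once ( $Q \cdot K^{T}$ ) gives a matrix of scores. After scaling and a row-wise softmax, so that each row sums to 1, this matrix is called the attention weights.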
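Steps B to D can be sketched as follows. The inputs and weight matrices are random stand-ins, so no real "black"/"cat" match will show up; the point is only the shapes and the order of operations. The key size `d_k = 64`, one token per word, and the scaling by $\sqrt{d_k}$ followed by softmax (the usual scaled dot-product formulation) are assumptions on my part.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# 9 tokens: "The dog chased the black cat through the garden".
seq_len, d_model, d_k = 9, 512, 64

x = rng.normal(size=(seq_len, d_model))                    # embedding + position
W_Q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)   # untrained stand-in
W_K = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)   # untrained stand-in

Q = x @ W_Q   # "what I seek", one row per token
K = x @ W_K   # "shop window", one row per token

# scores[i, j] is the dot product of token i's Query with token j's Key.
scores = Q @ K.T / np.sqrt(d_k)

# Each row becomes a set of non-negative weights that sum to 1.
attention_weights = softmax(scores)

print(attention_weights.shape)                           # (9, 9)
print(np.allclose(attention_weights.sum(axis=-1), 1.0))  # True
```

Row 4 of `attention_weights` (counting from 0) is the row for "black"; in a trained model one would expect its entry in column 5, "cat", to be large.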

E. The weighted sum

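From $x$ we also obtain a third vector, the Value $V$ , by multiplying $x$ by a weight matrix $W_{V}$ . It is a kind of summary of the information in $x$ that the token offers to the others. We then replace $x$ by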

$$x_{\text{new},i} = \sum_{j} \text{attention\_weight}_{i,j} \cdot V_{j}$$

So we obtain a kind of modified sentence in which each token has absorbed information about its own position and its relation to the other words, by means of

$$x_{\text{end}} = x_{\text{original}} + x_{\text{new}}$$
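Putting the pieces together, here is a minimal single-head sketch of the weighted sum and the final addition. It rests on several assumptions: random stand-in weights, a softmax over scaled scores, and square weight matrices so that the new vectors have the same size as $x$ and can be added to it. Real Transformers use several heads plus an output projection, which this sketch leaves out. The function name is hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention_with_residual(x, W_Q, W_K, W_V):
    """Single-head self-attention followed by x_end = x_original + x_new."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    x_new = weights @ V          # x_new[i] = sum_j weights[i, j] * V[j]
    return x + x_new, weights

rng = np.random.default_rng(0)
seq_len, d_model = 9, 512
x = rng.normal(size=(seq_len, d_model))
# Square, untrained stand-in matrices so that x_new matches the shape of x.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

x_end, weights = self_attention_with_residual(x, W_Q, W_K, W_V)

# Check row 4 ("black", counting from 0) against the explicit sum.
i = 4
V = x @ W_V
explicit = sum(weights[i, j] * V[j] for j in range(seq_len))
print(x_end.shape)                             # (9, 512)
print(np.allclose(x_end[i], x[i] + explicit))  # True
```

The matrix product `weights @ V` computes the sum over $j$ for every token at once, which is why the per-token formula rarely appears as an explicit loop in practice.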

The training

See cross-entropy and LLMs training