LLMs training

Here I will be developing my understanding of AI, and of LLMs in particular. For the moment:

The forward pass

1. The Embedding Lookup

Suppose we want to feed the sentence "Hi world" into the model. We have an Embedding Matrix (WE), a massive table where each row is a vector representing one word (token). At first, it is filled with random values. Suppose its dimension is 10,000×512: that is, we have 10,000 words/tokens, each of which is encoded by 512 "characteristics".
Then we proceed as follows:

  1. Input Preparation: You represent "Hi" and "world" as two One-Hot vectors (v1 and v2). If your vocabulary size is 10,000, these vectors have 10,000 elements each, all zeros except for a 1 at index 15 and at index 3,256, respectively. You stack them as the rows of a 2×10,000 matrix.
  2. The Operation: You then multiply your 2×10,000 input matrix by the 10,000×512 Embedding Matrix. The "1s" act as switches, extracting exactly the rows at index 15 and index 3,256.

The result is a Dense Matrix (2×512) where each word is now a meaningful point in a high-dimensional space.
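
To convince myself that the one-hot multiplication really is just a row lookup, here is a minimal sketch in plain Python. The sizes are shrunk (10 tokens and 4 "characteristics" instead of 10,000 and 512), the token ids 1 and 7 are made up, and the helper names one_hot and matmul are my own, not from any library.

```python
import random

VOCAB_SIZE = 10   # stands in for the 10,000 of the text
EMBED_DIM = 4     # stands in for the 512 of the text

random.seed(0)
# Embedding matrix WE: one row per token, random at the start of training.
WE = [[random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
      for _ in range(VOCAB_SIZE)]


def one_hot(index, size):
    """Return a vector of zeros with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matmul(a, b):
    """Multiply a (n x k) by b (k x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Hypothetical token ids for "Hi" and "world" (15 and 3,256 in the text).
token_ids = [1, 7]
X = [one_hot(t, VOCAB_SIZE) for t in token_ids]   # 2 x VOCAB_SIZE

dense = matmul(X, WE)                 # 2 x EMBED_DIM, via multiplication
lookup = [WE[t] for t in token_ids]   # the same rows, selected directly
```

Both dense and lookup hold the same two rows of WE, which is why real implementations usually skip the multiplication and index the table directly.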

2. Positional Encoding

Now, here is the catch: In a Transformer, the model does not know which word comes first. If you feed the matrix for "Hi world" or "world Hi", the Transformer (at this stage) sees the exact same set of vectors. It has no sense of time or sequence. To fix this without using Recurrent Networks (RNNs), we use Positional Encoding.

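Before the vectors reach the "Perceptron-like" layers, we add a unique position vector to each row, where e_i is the embedding of the i-th token and p_i is the vector for position i:

$$ x_i = e_i + p_i $$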

Now, the vector for "Hi" carries two pieces of information: what the word is (meaning) and where it is (order).
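
These notes do not say which signal is added, so the sketch below assumes the sinusoidal encoding (sine on even indices, cosine on odd ones), which is one common choice. The embeddings for "Hi" and "world" are made-up 4-dimensional vectors, positions are counted from 0, and the function names are my own.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal position vector: sin on even indices, cos on odd ones."""
    vec = []
    for i in range(dim):
        # Indices 2k and 2k+1 share the same frequency 1 / 10000^(2k/dim).
        angle = position / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_positions(embeddings):
    """Add the vector for position p to the embedding stored in row p."""
    dim = len(embeddings[0])
    return [[e + p for e, p in zip(row, positional_encoding(pos, dim))]
            for pos, row in enumerate(embeddings)]


# Made-up embeddings, only for illustration.
hi = [0.5, -1.0, 0.3, 0.8]
world = [-0.2, 0.4, 1.1, -0.7]

hi_world = add_positions([hi, world])   # "Hi world"
world_hi = add_positions([world, hi])   # "world Hi"
```

Without the added signal, both orderings contain the same two vectors. With it, the row for "Hi" in hi_world (position 0) differs from the row for "Hi" in world_hi (position 1), so the order is now visible to the model.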

3. Moving into Self-Attention

Once the words have their "positional tags," they enter the Self-Attention mechanism. This is where the magic happens: instead of processing tokens in isolation, the model calculates a Weighted Sum of the values in the sequence.

To understand how this "conversation" between words is actually computed, we must distinguish between what a token is and what it is looking for.

A. The Starting Point: "What I am"

Consider the sentence "The dog chased the black cat through the garden".
Before any transformation, a token's vector is the sum of its Semantic Embedding and its Positional Vector; let's call the result x. This combined vector represents its static identity:
"I am the word 'black' (concept of dark color) and I am at position 5 of the sentence."

B. The WQ Transformation: "What I seek" (Query Mode)

The static identity isn't a "question." To turn it into one, the model multiplies the vector x by the weight matrix WQ. This transforms the identity into a specific Query (Q):

"I am an adjective at position 5... therefore, I am looking for a singular noun nearby (likely at position 6) to modify and describe."

In effect, the WQ matrix switches the original embedding-plus-position vector into "search mode".

C. The WK Transformation: "The Shop Window" (Key Mode)

For a query to succeed, it needs a match. Simultaneously, every token in the sequence is multiplied by WK to generate a Key (K). This puts tokens in "label mode" or "shop window mode." If position 6 contains the word "cat," its Key vector will effectively announce:

"Hey! I am a singular noun at position 6."

D. The Match: The Dot Product

When the model computes the dot product between the Query of "black" and the Key of "cat," the two vectors align. This mathematical "click" produces a high attention score, meaning "black" will assign significant weight to "cat." Doing this for every pair of tokens at once is the matrix product QK^T; after scaling and a row-wise softmax, the resulting matrix is called the attention weights.
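
Steps B to D can be sketched in a few lines of plain Python. This is only a toy under my own assumptions: 3 tokens, random x, WQ and WK, and the standard scaled dot-product recipe (divide the scores by the square root of the Key dimension, then apply a softmax to each row), which goes slightly beyond what is written above. The helper names are hypothetical.

```python
import math
import random

random.seed(0)
D_MODEL, D_K, SEQ_LEN = 4, 3, 3   # toy sizes, chosen only for illustration


def rand_matrix(rows, cols):
    """Random matrix as a list of rows."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)]
            for _ in range(rows)]


def matmul(a, b):
    """Multiply a (n x k) by b (k x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Turn a row of scores into positive weights that sum to 1."""
    m = max(row)   # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


x = rand_matrix(SEQ_LEN, D_MODEL)   # embedding + position, one row per token
WQ = rand_matrix(D_MODEL, D_K)      # "search mode" matrix
WK = rand_matrix(D_MODEL, D_K)      # "shop window mode" matrix

Q = matmul(x, WQ)                          # one Query per token
K = matmul(x, WK)                          # one Key per token
K_T = [list(col) for col in zip(*K)]       # transpose of K
scores = matmul(Q, K_T)                    # scores[i][j] = Q_i . K_j
weights = [softmax([s / math.sqrt(D_K) for s in row]) for row in scores]
```

Row i of weights says how much token i listens to every token j. In a trained model, the row for "black" would have a large entry in the column for "cat".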

E. The weighted sum

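From x we also obtain a third vector for each token, the Value (V), by multiplying x by another weight matrix (WV), just as we did for Q and K. It is a kind of summary of the information the token has to offer. We then replace x with the weighted sum of the Values: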

$$ x^{new}_i = \sum_j \text{attention\_weight}_{i,j} \, V_j $$

The result is a kind of modified sentence in which each token has absorbed information about its own position and its relation to the other words. Finally, the new vector is added back to the original one:

$$ x^{end} = x^{original} + x^{new} $$
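
The two formulas above can be checked by hand on a tiny example. In this sketch the numbers are invented: three tokens of dimension 2, a hand-written attention weight matrix whose rows sum to 1, and a hypothetical matrix WV used to obtain V from x (the usual construction, assumed here).

```python
def matmul(a, b):
    """Multiply a (n x k) by b (k x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Toy inputs, invented for illustration.
x = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]            # 3 tokens, dimension 2
WV = [[0.5, 0.0],
      [0.0, 2.0]]           # hypothetical value matrix
weights = [[0.7, 0.2, 0.1],
           [0.1, 0.8, 0.1],
           [0.3, 0.3, 0.4]]  # attention weights, each row sums to 1

V = matmul(x, WV)            # one Value vector per token
x_new = matmul(weights, V)   # x_new[i] = sum_j weights[i][j] * V[j]

# Add the attended information back onto the original vectors.
x_end = [[orig + new for orig, new in zip(row_o, row_n)]
         for row_o, row_n in zip(x, x_new)]
```

Writing the weighted sum as a matrix product (weights times V) computes the sum over j for every token i in one go.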

The training

See cross-entropy and LLMs training