LLM training
Here, I will be developing my understanding of AI and LLMs in particular. For the moment:
The forward pass
1. The Embedding Lookup
Suppose we want to feed the sentence "Hi world" into the model. We have an Embedding Matrix with one row for each word in the vocabulary.
Then, we proceed this way:
- Input Preparation: You represent "Hi" and "world" as two One-Hot vectors. If your vocabulary size is 10,000, these vectors have 10,000 elements: all zeros except for a 1 at index 15 and index 3,256, respectively. You put them in matrix form.
- The Operation: Then you multiply your input matrix by the Embedding Matrix. The "1s" act as a switch, extracting exactly the 15th and 3,256th rows.
The result is a Dense Matrix with two rows: one embedding vector for "Hi" and one for "world".
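The lookup described above can be sketched in a few lines of plain Python. This is only an illustration under made-up assumptions: a vocabulary of 6 words instead of 10,000, an embedding width of 3, and hypothetical token indices 1 and 4 standing in for "Hi" and "world". The helper names `matmul` and `one_hot` are mine, not part of any library.

```python
# Toy embedding lookup: a one-hot row times the embedding matrix selects one row.
# Vocabulary size, embedding width and token indices are invented for illustration.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

VOCAB_SIZE, EMBED_DIM = 6, 3

# Embedding matrix: one row per vocabulary entry (values are arbitrary).
embedding = [[float(10 * r + c) for c in range(EMBED_DIM)] for r in range(VOCAB_SIZE)]

token_ids = [1, 4]  # hypothetical indices for "Hi" and "world"
inputs = [one_hot(t, VOCAB_SIZE) for t in token_ids]  # 2 x 6 one-hot matrix

dense = matmul(inputs, embedding)           # 2 x 3 dense matrix
lookup = [embedding[t] for t in token_ids]  # the same rows, read directly

print(dense)            # [[10.0, 11.0, 12.0], [40.0, 41.0, 42.0]]
print(dense == lookup)  # True
```

Since the product only ever copies rows, practical implementations usually skip the one-hot matrix and index the rows directly, as the `lookup` line does.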
2. Positional Encoding
Now, here is the catch: In a Transformer, the model does not know which word comes first. If you feed the matrix for "Hi world" or "world Hi", the Transformer (at this stage) sees the exact same set of vectors. It has no sense of time or sequence. To fix this without using Recurrent Networks (RNNs), we use Positional Encoding.
Before the vectors reach the "Perceptron-like" layers, we add a unique mathematical signal to each row:
- We take the vector for "Hi" and add a specific "Position 1" vector to it.
- We take the vector for "world" and add a "Position 2" vector to it.
Now, the vector for "Hi" carries two pieces of information: what the word is (meaning) and where it is (order).
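As a rough sketch of this step, the snippet below adds a position vector to each embedding and checks that the two word orders stop looking identical. The notes do not say which position signal is used, so the sinusoidal scheme here is an assumption (it is the one from the original Transformer design; many models learn the position vectors instead). The embeddings for "Hi" and "world" are invented numbers.

```python
import math

def positional_vector(position, dim):
    """Sinusoidal position signal (assumed scheme): sin on even slots, cos on odd."""
    vec = []
    for i in range(dim):
        angle = position / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_positions(rows):
    """Add the vector for position 1, 2, ... to the matching embedding row."""
    dim = len(rows[0])
    return [[x + p for x, p in zip(row, positional_vector(pos, dim))]
            for pos, row in enumerate(rows, start=1)]

hi = [0.2, -0.1, 0.5, 0.3]     # made-up embedding for "Hi"
world = [0.7, 0.4, -0.6, 0.1]  # made-up embedding for "world"

hi_world = add_positions([hi, world])
world_hi = add_positions([world, hi])

# Without positions both orders hold the same set of vectors; with them they differ.
same_set_before = sorted([hi, world]) == sorted([world, hi])
same_set_after = sorted(hi_world) == sorted(world_hi)
print(same_set_before, same_set_after)  # True False
```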
3. Moving into Self-Attention
Once the words have their "positional tags," they enter the Self-Attention mechanism. This is where the magic happens: instead of processing tokens in isolation, the model calculates a Weighted Sum of the values in the sequence.
To understand how this "conversation" between words is actually computed, we must distinguish between what a token is and what it is looking for.
A. The Starting Point: "What I am"
Consider the sentence "The dog chased the black cat through the garden".
Before any transformation, a token's vector is the sum of its Semantic Embedding and its Positional Vector; let's call this its static identity. For "black", it says:
"I am the word 'black' (concept of dark color) and I am at position 5 of the sentence."
B. The Transformation: "What I seek" (Query Mode)
The static identity isn't a "question." To turn it into one, the model multiplies the vector by the weight matrix $W_Q$, which is learned during training. The result reads like:
"I am an adjective at position 5... therefore, I am looking for a noun nearby (likely at position 6) to modify and describe."
The resulting vector is the token's Query.
C. The Transformation: "The Shop Window" (Key Mode)
For a query to succeed, it needs a match. Simultaneously, every token in the sequence is multiplied by a second learned weight matrix, $W_K$, to produce its Key: an advertisement of what it has to offer. For "cat":
"Hey! I am a singular noun at position 6."
D. The Match: The Dot Product
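When the model computes the dot product between a Query and a Key ($q \cdot k$), the result is a score: it is large when the Key advertises what the Query is looking for, and close to zero (or negative) when it does not. Each token's Query is compared against the Key of every token in the sequence.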
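To make the Query/Key match concrete, here is a hand-built toy, not a trained model. It assumes three tokens ("chased", "black", "cat") whose input vectors simply flag adjective, noun, or verb, and two hand-picked matrices `W_Q` and `W_K` chosen so that an adjective's Query points in the same direction as a noun's Key. Real matrices are learned and far less readable.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

tokens = ["chased", "black", "cat"]

# Made-up input vectors; columns mean [is_adjective, is_noun, is_verb].
X = [[0.0, 0.0, 1.0],   # chased
     [1.0, 0.0, 0.0],   # black
     [0.0, 1.0, 0.0]]   # cat

# Hand-picked (not learned) projections into a 2-dimensional Query/Key space.
W_Q = [[1.0, 0.0],   # an adjective asks for a noun
       [0.0, 1.0],   # a noun asks for a verb
       [0.0, 0.0]]   # a verb asks for nothing in this toy
W_K = [[0.0, 0.0],   # an adjective offers nothing in this toy
       [1.0, 0.0],   # a noun offers "I am a noun"
       [0.0, 1.0]]   # a verb offers "I am a verb"

Q = matmul(X, W_Q)  # one Query per token
K = matmul(X, W_K)  # one Key per token

# scores[i][j] is the dot product of token i's Query with token j's Key.
scores = matmul(Q, transpose(K))

black_scores = dict(zip(tokens, scores[tokens.index("black")]))
best_match = max(black_scores, key=black_scores.get)
print(black_scores)  # {'chased': 0.0, 'black': 0.0, 'cat': 1.0}
print(best_match)    # cat
```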
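E. The Weighted Sum
We obtain a new vector for each token: its scores are normalized into weights that sum to 1 (with a softmax), and the new vector is the sum of the Value vectors of all tokens, each multiplied by its weight.
So we obtain a kind of modified sentence where each token has absorbed the information of its own position and its relation to the other words, by means of these attention weights.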
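The last step can be sketched as follows, again with invented numbers. The raw scores for "black" and the Value vectors `V` are assumptions made up for the example; in a real model the Values come from a third learned matrix (usually written $W_V$), and the scores are divided by the square root of the Key dimension before the softmax, which this sketch leaves out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtracting the max keeps exp() numerically safe
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    """Return the attention weights and the weighted sum of the value vectors."""
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Made-up raw scores of "black" against "chased", "black", "cat".
scores_for_black = [0.0, 0.0, 2.0]

# Made-up Value vectors, one per token.
V = [[1.0, 0.0],   # chased
     [0.0, 1.0],   # black
     [4.0, 4.0]]   # cat

weights, new_black = attend(scores_for_black, V)
print([round(w, 3) for w in weights])    # [0.107, 0.107, 0.787]
print([round(x, 3) for x in new_black])  # [3.254, 3.254]
```

The new vector for "black" sits close to the Value of "cat" because "cat" received most of the weight, which is the sense in which a token absorbs information from the words it attends to.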