Understanding GPT-3: From Embeddings to Predictions | Transformer
Brain Dump
Jan 1, 2025
Understanding how GPT-3 works gives us a glimpse into the cutting-edge technology shaping how we interact with the world.
From powering chatbots and virtual assistants to helping businesses automate tasks and individuals learn more efficiently, GPT-3 is everywhere. Knowing how it transforms simple text into meaningful predictions can spark ideas for applying it in innovative ways and helps us appreciate the complexity behind the AI we use daily.
GPT-3 starts by breaking down text into tiny pieces, called tokens, and assigning each a unique meaning through a massive dictionary of vectors. These vectors soak up context and relationships between words. This thread takes you on a journey through how GPT-3 transforms simple text into intelligent, context-aware predictions.
In this thread, let's look behind the curtain at what happens in the input and output layers of this transformer model.
Almost all of the actual computation consists of matrix-vector multiplications. The weights are the model's brains: they are learned during training and determine how the model behaves.
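As a minimal sketch of the kind of operation that dominates this computation (the sizes here are purely illustrative, not GPT-3's real ones):

```python
import numpy as np

# A toy "layer": a learned weight matrix multiplying an incoming vector.
# In GPT-3 the matrices are vastly larger (vectors are 12,288-dimensional).
W = np.random.randn(4, 3)   # the weights, which would be learned during training
x = np.random.randn(3)      # an incoming vector
y = W @ x                   # matrix-vector multiplication
print(y.shape)              # (4,)
```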
First step of Text Processing:
Break the input up into little chunks and turn those chunks into vectors. These chunks are called tokens, i.e. pieces of words or punctuation.
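As an illustration, here is a minimal tokenization sketch. It assumes the tiktoken library is available; "r50k_base" is the byte-pair encoding associated with GPT-3, and the exact token ids are whatever that encoding happens to produce:

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")         # GPT-3's byte-pair encoding

tokens = enc.encode("Harry Potter's least favorite teacher")
print(tokens)                                    # a list of integer token ids
print([enc.decode([t]) for t in tokens])         # the piece of text each id stands for
```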
The model has a predefined vocabulary, a list of all possible words.
The first matrix we encounter, the embedding matrix, has a single column for each of these words. These columns determine what vector each word turns into in that first step.
Turning words into vectors.
This embedding matrix, whose columns tell us what happens to each word, is the first pile of weights in our model.
It adds 617 million weights to GPT-3's pool of parameters, from the embedding matrix alone. In the case of the transformer, you want to think of the vectors in the embedding space as not merely representing individual words; they also encode information about the position of each word. And we should think of these vectors as having the capacity to soak in context.
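A minimal numpy sketch of this lookup, using GPT-3's published sizes (a vocabulary of 50,257 tokens and 12,288-dimensional embeddings); the token ids here are made up for illustration:

```python
import numpy as np

vocab_size, d_model = 50_257, 12_288
W_E = np.random.randn(d_model, vocab_size)   # embedding matrix: one column per token
print(W_E.size)                              # 617,558,016 -- roughly 617 million weights

token_ids = [15496, 995]                     # illustrative token ids for an input text
embeddings = W_E[:, token_ids]               # pluck out one column per token
print(embeddings.shape)                      # (12288, 2)
```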
Think of it as mirroring our own understanding of a given word: its meaning is clearly informed by the surroundings, sometimes by context from a long distance away. So when putting together a model that can predict what word comes next, the goal is to empower it to incorporate context efficiently.
In the very first step, when we convert the sentence into an array of vectors using the embedding matrix, each vector is simply plucked out of the embedding matrix, so initially each one can only encode the meaning of a single word, without any input from its surroundings.
The primary goal of the network these vectors flow through is to enable each of them to soak up a meaning that is far richer and more specific than what individual words alone could represent.
The network can only process a fixed number of vectors at a time, known as the context size.
For GPT-3, the data flowing through the network always looks like an array of 2,048 columns, each with 12,288 dimensions.
The context size limits how much text the transformer can incorporate when it's making a prediction of the next word.
This is why long conversations with certain chatbots lose the thread of the conversation once it grows too long.
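A minimal sketch of that fixed-size constraint, truncating input to the context window (shapes follow GPT-3's 2,048-token context and 12,288-dimensional vectors; the function name is just for illustration):

```python
import numpy as np

context_size, d_model = 2_048, 12_288

def to_model_input(embeddings: np.ndarray) -> np.ndarray:
    """Keep only the most recent context_size vectors.

    Anything beyond the context window is dropped, which is why a very long
    conversation eventually loses its earlier context.
    """
    return embeddings[:, -context_size:]

too_long = np.random.randn(d_model, 3_000)    # 3,000 tokens' worth of conversation
print(to_model_input(too_long).shape)         # (12288, 2048)
```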
What happens in the output layer?
The desired output is a probability distribution over all tokens that might come next.
Suppose we have a seed text about Harry Potter that ends just before a teacher's name. A well-trained network that has built up knowledge of Harry Potter assigns high probability to the word "Snape". This involves two different steps.
Why are we using only the last vector's embedding to make the prediction?
After all, in the final layer there are thousands of other vectors just sitting there with their own context-rich meanings.
During training, it turns out to be much more efficient to use each of those vectors in the final layer to simultaneously make a prediction for what comes immediately after it.
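A minimal sketch of why that helps, with hypothetical logits at every position; position i's prediction is compared against the token that actually follows it:

```python
import numpy as np

seq_len, vocab_size = 8, 50_257
token_ids = np.random.randint(0, vocab_size, size=seq_len)   # a training sequence
logits = np.random.randn(seq_len, vocab_size)                # one prediction per position (hypothetical)

# Position i predicts token i+1, so every vector in the final layer contributes
# a training signal, not just the last one.
predictions = logits[:-1]                  # predictions made at positions 0 .. seq_len-2
targets = token_ids[1:]                    # the tokens that actually come next
print(predictions.shape, targets.shape)    # (7, 50257) (7,)
```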
Another matrix, called the Unembedding matrix (W_U), maps the very last vector in that context to a list of roughly 50,000 values, one for each token in the vocabulary. Like all the weight matrices we encounter, its values begin at random and are learned during the training process.
This adds another 617 million parameters to the network's pool of 175 billion.
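A minimal sketch of the unembedding step, reusing the illustrative sizes from above:

```python
import numpy as np

vocab_size, d_model = 50_257, 12_288
W_U = np.random.randn(vocab_size, d_model)   # unembedding matrix: another ~617 million weights

last_vector = np.random.randn(d_model)       # the final-layer vector at the last position
logits = W_U @ last_vector                   # one raw score per token in the vocabulary
print(logits.shape)                          # (50257,)
```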
Application of Softmax function:
If we want a sequence of numbers to act as a probability distribution over all possible next words, then each value has to be between 0 and 1, and all of them need to add up to 1.
However, in deep learning, where each calculation is a matrix-vector multiplication, the outputs you get by default don't abide by this at all. Softmax is the standard way to turn an arbitrary list of numbers into a valid distribution, in such a way that the largest values end up closest to 1, the smallest end up closest to 0, and everything adds up to 1.
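A minimal numpy sketch of softmax (subtracting the maximum before exponentiating is a standard numerical-stability trick, not something specific to GPT-3):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn an arbitrary list of numbers into a valid probability distribution."""
    shifted = logits - logits.max()    # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()           # each value lies in (0, 1) and they sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))   # approximately [0.66, 0.24, 0.10]
```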
Temperature in the softmax:
With T = 0 (or T approaching 0), all the weight goes to the maximum value in the probability distribution; higher values of T give more weight to the less likely tokens, making the output more varied.
This is why the API caps the temperature at 2: push it higher than that and the distribution flattens out so much that the output degenerates into nonsense.
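Extending the softmax sketch above with a temperature parameter makes the effect visible (the logit values are purely illustrative):

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, T: float) -> np.ndarray:
    """Softmax with temperature: small T sharpens the distribution, large T flattens it."""
    scaled = logits / max(T, 1e-6)     # as T -> 0, nearly all weight goes to the largest logit
    shifted = scaled - scaled.max()
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])
print(softmax_with_temperature(logits, 0.01))   # ~[1, 0, 0, 0]: always pick the top token
print(softmax_with_temperature(logits, 1.0))    # the standard softmax distribution
print(softmax_with_temperature(logits, 2.0))    # flatter: unlikely tokens get more weight
```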
GPT-3's ability to weave context into predictions creates outputs that feel natural and insightful. Its memory lies in vast matrices of learned weights, and its power comes from turning raw numbers into probabilities through the softmax function.
#AI #MachineLearning #GPT3 #NaturalLanguageProcessing #Tokenization #WordEmbeddings #ContextualUnderstanding #Softmax #LanguageModels #AIExplained #DeepLearning #ArtificialIntelligence #TechInnovation #AIApplications