An attention score is a numerical value that quantifies how much focus or importance a model assigns to a specific element within an input sequence.
In a transformer-based model (e.g. BERT, GPT), it represents how much attention the model should give to each token when generating output, helping it decide which words in the input are most relevant when predicting the next word.
Does the attention score directly predict the next word?
NO!
It helps the model determine which words in the input sequence are most relevant to the current word when making a prediction.
If the input is:
"The cat sat on the mat."
When predicting "mat", the model may assign high attention scores to "on" and "the", since they are most relevant.
When predicting "cat", the model assigns a higher attention score to "The", since it provides important context.
When Does Attention Predict the Next Word?
In Autoregressive Models (like GPT)
These models generate text one token at a time, using attention over the previous tokens to decide which of them matter for the next-word prediction.
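As a rough NumPy sketch of this idea (toy sizes and random vectors rather than real embeddings), a causal mask is what restricts attention to past tokens, so the weights for each position only cover the tokens generated so far:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 8                      # 4 tokens, 8-dim vectors (toy sizes)
Q = rng.normal(size=(seq_len, d))      # one query vector per token
K = rng.normal(size=(seq_len, d))      # one key vector per token

scores = Q @ K.T / np.sqrt(d)          # similarity of every token pair

# Causal mask: position i may only attend to positions j <= i.
# Future positions get -inf, so softmax gives them zero weight.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))            # upper triangle is all zeros
```

After masking, each row of the weight matrix is a probability distribution over only the current and earlier positions.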
In Encoder-Decoder Models (like Transformers in Machine Translation)
The encoder learns contextual embeddings by attending to all words in the sentence.
The decoder then generates words one by one, attending to relevant encoder outputs.
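A hedged NumPy sketch of that decoding step (toy shapes; real models would also project the encoder outputs into keys and values):

```python
import numpy as np

rng = np.random.default_rng(1)
src_len, d = 5, 8                        # 5 source tokens, 8-dim states (toy)
enc_out = rng.normal(size=(src_len, d))  # encoder outputs for the source sentence

# In cross-attention, keys and values come from the encoder
# (used directly here for brevity), while the query comes from
# the decoder's current position.
q = rng.normal(size=(d,))                # decoder query for the next target word
scores = enc_out @ q / np.sqrt(d)        # one score per source token
weights = np.exp(scores - scores.max())
weights /= weights.sum()

context = weights @ enc_out              # weighted mix of encoder outputs
print(np.round(weights, 2), context.shape)   # weights sum to 1; context is (8,)
```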
Why Are Attention Scores Required in LLMs (Large Language Models)?
Traditional models used fixed word embeddings like Word2Vec or GloVe, which give each word the same vector in every sentence and so cannot adapt a word's meaning to its context.
Without attention, words have static meanings and long-range dependencies are hard to capture.
With attention scores, Large Language Models (LLMs) like GPT, BERT, and T5 can process and generate text efficiently. The attention mechanism enables LLMs to understand long-range dependencies, assign contextual relevance, and improve comprehension of complex sentences.
How is an Attention Score Computed?
Query, Key, and Value Vectors
Each input token (word, pixel, etc.) is mapped into three vectors:
Query (Q): Represents the current token trying to find relevant information.
Key (K): Represents what each token offers; queries are matched against keys to measure relevance.
Value (V): The actual information stored in the token.
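A minimal sketch of this mapping, assuming learned projection matrices W_q, W_k, and W_v (random stand-ins below, since the real ones are learned during training):

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, d_model, d_k = 6, 16, 8          # toy dimensions
X = rng.normal(size=(seq_len, d_model))   # one embedding per input token

# Learned projection matrices (random stand-ins here).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # queries: what each token is looking for
K = X @ W_k   # keys: what each token offers for matching
V = X @ W_v   # values: the content each token carries
print(Q.shape, K.shape, V.shape)          # (6, 8) for each
```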
Score Calculation
The attention score is computed by taking the dot product of the Query and Key vectors:
Score = Q⋅K^T
This measures the similarity between each query and key vector. In Transformers, the scores are additionally divided by √d_k (the dimension of the key vectors) before the softmax step, which keeps their magnitudes stable as the vectors get larger.
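In NumPy, with toy random vectors, the entire score matrix comes from a single matrix product:

```python
import numpy as np

rng = np.random.default_rng(3)
seq_len, d_k = 4, 8                # toy sizes
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))

# scores[i, j] is the dot product of query i with key j:
# a similarity measure telling token i how relevant token j is.
scores = Q @ K.T / np.sqrt(d_k)    # scaled as in the Transformer paper
print(scores.shape)                # (4, 4): every token scored against every other
```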
Softmax Normalization
The raw attention scores computed using the dot-product of Query (Q) and Key (K) can be any real number, including negative values. However, we need these scores to be meaningful when deciding how much attention should be paid to each token.
Softmax ensures all attention scores sum to 1, making them interpretable as probabilities.
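A small sketch with made-up raw scores shows the effect:

```python
import numpy as np

scores = np.array([2.0, -1.0, 0.5, 3.0])   # raw scores: any real values

# Softmax: exponentiate (shifting by the max for numerical stability)
# and normalize, so every weight is positive and the row sums to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(np.round(weights, 3))                 # [0.251 0.012 0.056 0.681]
```

The largest raw score (3.0) ends up with the largest share of attention, and negative scores shrink toward zero instead of staying negative.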
Weighted Sum of Values
The final attention output is computed as a weighted sum of Value (V) vectors, using attention scores as weights.
A high attention score indicates that a token is highly relevant to another token in the sequence.
A low attention score suggests minimal influence between tokens.
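Putting the steps together, here is a minimal single-head version of the standard formula softmax(Q⋅K^T / √d_k)⋅V as a NumPy sketch (toy sizes and random inputs, not any particular library's implementation):

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # each row now sums to 1
    return weights @ V, weights                      # weighted sum of values + the weights

rng = np.random.default_rng(4)
seq_len, d_k = 5, 8
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out, w = attention(Q, K, V)
print(out.shape, np.round(w.sum(axis=-1), 6))        # (5, 8); every row of weights sums to 1
```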