Transformer Part I – Theory, Original Paper etc.

Resources:

  • A. Vaswani et al., “Attention Is All You Need,” 2017.

Transformer

The Transformer is an encoder-decoder model built around a mechanism called Attention.

Analogy

Imagine you are at a loud party. You hear fragments of conversations. To understand a specific sentence, your brain does three things instantly:

  1. Focus (Attention): You filter out background noise and focus on specific words.
  2. Contextualize (Self-Attention): If someone says, “I poured the money into the bank,” you look at the word “money” to understand that “bank” means a financial institution, not a river bank. The Transformer does this for every word in a sentence simultaneously.
  3. Parallel Processing: Unlike previous AI (RNNs) that read one word at a time (like a human reading a book), a Transformer looks at the entire sentence at once (like looking at a painting). This makes it incredibly fast.

Transformer Architecture

The Transformer architecture is an Encoder-Decoder structure. However, unlike its predecessors, it relies entirely on Attention mechanisms without any Recurrent (RNN) or Convolutional (CNN) layers.

1. Input Embeddings & Positional Encoding
These two components work together to translate human language (words and order) into a format that a machine can process (numbers and patterns).

  • Embeddings: The model converts words into vectors (lists of numbers) where similar words have similar numbers (e.g., “King” and “Queen” are mathematically close).
  • Positional Encoding: Since the Transformer processes all words simultaneously (in parallel), it has no inherent sense of order (unlike an RNN, which knows word #1 comes before word #2). Positional Encoding fixes this by adding a position-dependent signal (sine and cosine waves of different frequencies in the original paper) to each embedding, so the model can still tell which word came first.

Embeddings: example code and output.
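A minimal PyTorch sketch of such an embedding layer, assuming a tiny toy vocabulary and d_model = 4. The self.lut name and the multiplication by sqrt(d_model) match the walkthrough below; the token IDs are illustrative, and the exact printed numbers will differ because the matrix is randomly initialized.

```python
import math
import torch
import torch.nn as nn

class Embeddings(nn.Module):
    """Token embedding lookup, scaled by sqrt(d_model) as in the original paper."""
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)  # learnable [vocab x d_model] matrix
        self.d_model = d_model

    def forward(self, x):
        # x holds token IDs; look up each row and scale by sqrt(d_model)
        return self.lut(x) * math.sqrt(self.d_model)

# Hypothetical toy vocabulary: 0 = <pad>, 1 = "AI", 2 = "is", 3 = "Future"
emb = Embeddings(d_model=4, vocab=10)
tokens = torch.tensor([[1, 2, 3]])   # 1 sentence, 3 token IDs
out = emb(tokens)
print(out.shape)                     # torch.Size([1, 3, 4])
print(out)                           # random at initialization, e.g. [-0.3948, ...]
```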

nn.Embedding is just a learnable matrix (a grid of numbers):

  • Rows: The number of words in your vocabulary (vocab).
  • Columns: The number of dimensions per word (d_model).

The tensor shape is [1, 3, 4].

  • 1: Batch (1 sentence)
  • 3: Sequence (3 words: “AI”, “is”, “Future”)
  • 4: Dimension (4 numbers per word)

The tensor lists the words in order:

  1. Row 1 (First List): [-0.3948, -1.9287, -1.0266, 5.2556] → “AI”
  2. Row 2 (Second List): [-1.4930, 2.0102, -0.5137, 0.9530] → “is”
  3. Row 3 (Third List): [-1.3304, -0.7253, -2.9007, -0.4992] → “Future”

What self.lut(x) did:

  1. The code saw the ID for “AI” (which was 1).
  2. It went to the Embedding Matrix (the LUT).
  3. It grabbed Row #1.
  4. It multiplied those numbers by $\sqrt{4}$ (which is 2).
  5. The result is the vector you see: [-0.3948, ...].

Positional Encoding: example code and output.
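A minimal PyTorch sketch of the sinusoidal positional encoding described in the original paper. The class layout mirrors common reference implementations, and max_len is an arbitrary assumed cap on sequence length.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding (no learned parameters)."""
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sine
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cosine
        self.register_buffer("pe", pe.unsqueeze(0))    # shape [1, max_len, d_model]

    def forward(self, x):
        # Add the position signal to the (already scaled) embeddings
        return x + self.pe[:, : x.size(1)]

pos = PositionalEncoding(d_model=4)
print(pos(torch.zeros(1, 3, 4)))   # each of the 3 positions gets a distinct sin/cos pattern
```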

2. The Core Mechanism: Scaled Dot-Product Attention

The Attention mechanism is the core “brain” of the Transformer: it is where the model compares words to each other to understand context. As the engine of the model, it calculates how much focus (weight) one word should put on other words. It uses three vectors for every token:

  • Query (Q): What the token is looking for.
  • Key (K): What the token offers to others.
  • Value (V): The actual information the token holds.

The model calculates the match between the Query and the Key. If they match well, it extracts the Value.
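In formula form, the paper defines this as $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of the Key vectors. Below is a minimal PyTorch sketch; the function name and toy tensor shapes are illustrative assumptions.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # how well each Query matches each Key
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # attention weights sum to 1 per Query
    return weights @ v, weights                          # weighted sum of the Values

# Toy check: 1 sentence, 3 tokens, d_k = 4
q = k = v = torch.randn(1, 3, 4)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)   # torch.Size([1, 3, 4]) torch.Size([1, 3, 3])
```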

https://introml.mit.edu/notes/transformers.html

3. Multi-Head Attention

Looking at a sentence in just one way isn’t enough.

  • Concept: Instead of one “brain” looking at the sentence, the model uses multiple “heads” (usually 8 or more).
  • Why?
    • Head 1 might focus on syntax (noun-verb relationship).
    • Head 2 might focus on semantics (meaning of “it” referring to “animal”).
    • Head 3 might focus on tone.
  • The results of all heads are concatenated (joined) and projected to a final output, as the sketch below shows.
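A minimal PyTorch sketch of multi-head attention using the paper’s defaults (d_model = 512, h = 8); the layer names w_q, w_k, w_v, w_o are illustrative labels for the projection matrices.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Split d_model across h heads, attend in each head, then concatenate and project."""
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # final projection after concatenation

    def forward(self, q, k, v):
        batch, seq_len, _ = q.shape

        def split(x, lin):
            # Project, then reshape to [batch, heads, seq_len, d_k] so each head attends separately
            return lin(x).view(batch, -1, self.h, self.d_k).transpose(1, 2)

        q, k, v = split(q, self.w_q), split(k, self.w_k), split(v, self.w_v)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        weights = torch.softmax(scores, dim=-1)
        out = weights @ v                                      # [batch, heads, seq_len, d_k]
        out = out.transpose(1, 2).reshape(batch, seq_len, -1)  # concatenate the heads
        return self.w_o(out)

mha = MultiHeadAttention(d_model=512, h=8)
x = torch.randn(1, 3, 512)
print(mha(x, x, x).shape)   # torch.Size([1, 3, 512])
```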

https://introml.mit.edu/notes/transformers.html

4. Feed-Forward Networks

The Feed-Forward Network (FFN) is the second major sub-layer in the Transformer block. If the Attention mechanism is the “social” part of the model (words talking to each other), the FFN is the “introverted” part (each word thinking about what it just learned). After attention, the data passes through a standard fully connected neural network. This transforms the attention information into a format the next layer can use.

  • After the Multi-Head Attention step, every word vector has been updated with information from other words.
    • Before Attention: “Bank” just meant “Bank.”
    • After Attention: “Bank” now contains context from “River” (so it knows it’s a river bank, not a financial bank).
  • Now, the FFN takes this new “context-aware” vector and processes it to extract deeper features. Crucially, this happens Position-wise, meaning the FFN looks at each word individually: the word “Bank” is processed without looking at “River” or any other word at this stage (see the sketch below).
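A minimal PyTorch sketch of the position-wise FFN with the paper’s default sizes (d_model = 512, d_ff = 2048); the class name and the dropout placement are assumptions.

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """Two linear layers with a ReLU in between, applied to each position independently."""
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # project back
        )

    def forward(self, x):
        # x: [batch, seq_len, d_model]; the same weights are applied to every position
        return self.net(x)

ffn = PositionwiseFeedForward()
print(ffn(torch.randn(1, 3, 512)).shape)   # torch.Size([1, 3, 512])
```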

5. Add & Norm (Residual Connections)

To prevent the model from “forgetting” the original input as it goes deeper, the input of a layer is added to its output (Residual connection), followed by Layer Normalization. This stabilizes training.

https://apxml.com/courses/introduction-to-transformer-models/chapter-3-transformer-encoder-decoder-architecture/add-norm-layers
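A minimal PyTorch sketch of this wrapper, i.e. LayerNorm(x + Sublayer(x)); the AddNorm class name is an illustrative label, and the dropout on the sub-layer output follows the paper.

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model=512, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # 'sublayer' is the attention or FFN module wrapped by this connection
        return self.norm(x + self.dropout(sublayer(x)))

add_norm = AddNorm()
x = torch.randn(1, 3, 512)
sub = nn.Linear(512, 512)          # stand-in for an attention or FFN sub-layer
print(add_norm(x, sub).shape)      # torch.Size([1, 3, 512])
```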

Other Resources:

YouTube, https://www.youtube.com/watch?v=XfpMkf4rD6E (accessed Jan. 27, 2026).
