Deep Dive into LLMs: Under the Hood

December 28, 2025 | Series: AI Engineering 101 | Tags: LLM, Transformers, Architecture, Inference, Fine-tuning

Large Language Models have gone from research curiosities to critical infrastructure. If you’re building with them, you can’t treat them as magic boxes forever. At some point, you’ll hit a latency issue, a cost problem, or an output quality bug - and you’ll need to understand what’s actually happening inside.

This note is my attempt to unpack that.

1. Neural Networks: The Foundation

Before we can understand LLMs, we need to understand neural networks. LLMs are just very large, specialized neural networks - so let’s build up from first principles.

Neural networks are the core of Deep Learning - the subset of Machine Learning that uses multi-layered networks to learn representations from data. If you’re unclear on where Deep Learning fits in the AI landscape, see my note on AI vs ML vs DL.

What is a Neural Network?

A neural network is a function that transforms inputs into outputs through layers of simple operations. Think of it as a pipeline: data goes in, gets transformed step by step, and a result comes out.

The basic unit is a neuron. A neuron does three things:

Takes multiple inputs (numbers)
Multiplies each by a weight (importance)
Adds them up, applies an activation function, and outputs a number

output = activation(w1*x1 + w2*x2 + w3*x3 + ... + bias)

Weights (together with the biases) are the learnable parameters - the “knowledge” of the network. When we say a model has 70 billion parameters, we mean 70 billion of these learned numbers.
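To make the formula concrete, here is a minimal sketch of one neuron in plain Python. The sigmoid activation and the specific numbers are assumptions chosen for illustration; nothing above fixes a particular activation.

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)

# Illustrative values only: three inputs, three weights, one bias.
out = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2)
```

Here the weighted sum is 0.5, so the neuron outputs sigmoid(0.5), roughly 0.62.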

Layers: Stacking Neurons

A layer is a collection of neurons that process input together. Neural networks stack multiple layers:

Input Layer: Receives raw data (for LLMs, this is token embeddings)
Hidden Layers: Transform data progressively. Each layer learns more abstract patterns.
Output Layer: Produces the final result (for LLMs, probabilities over vocabulary)

The “deep” in deep learning means many hidden layers. GPT-3 has 96 layers. More layers = more capacity to learn complex patterns.

How Learning Works

Initially, weights are random - the network outputs garbage. Training adjusts these weights to minimize errors.

The process:

Forward pass: Input flows through the network, producing an output
Loss calculation: Compare output to the correct answer. How wrong are we?
Backward pass: Calculate how each weight contributed to the error (using calculus - specifically, gradients)
Update weights: Nudge each weight slightly to reduce the error

Repeat this millions of times on millions of examples. Gradually, the weights settle into values that make good predictions.

This is called gradient descent - we’re descending down a “landscape” of errors to find the lowest point.
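The four steps can be sketched on the smallest possible "network": a single weight w that should learn y = 2x. The data, learning rate, and step count are made up for illustration; a real LLM runs the same loop over billions of weights.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # targets generated by y = 2x

def loss(w):
    """Mean squared error of the prediction w * x (forward pass + loss)."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w):
    """Derivative of the loss with respect to w (the backward pass)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0                      # stand-in for a "random" initial weight
learning_rate = 0.01
for step in range(200):
    w -= learning_rate * grad(w)   # update: nudge w against the gradient
```

After the loop, w has settled very close to 2.0, the value that makes the loss zero.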

Optimizers used in LLMs: Pure gradient descent is slow. Modern LLMs use sophisticated variants:

Adam (Adaptive Moment Estimation): Maintains per-parameter learning rates and momentum. The workhorse of deep learning.
AdamW: Adam with proper weight decay (regularization). Used by GPT, Llama, and most modern LLMs.
Learning rate schedulers: Ramp the learning rate up over a short warmup, then decay it over the rest of training (warmup + cosine decay is common).
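As a rough sketch of the "warmup + cosine decay" schedule mentioned in the list above: the function name lr_at and every default value here are hypothetical, picked only to make the shape visible.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Learning rate for a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp from near zero up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes 1 -> 0
    return min_lr + (max_lr - min_lr) * cosine
```

The rate climbs for the first 100 steps, peaks at max_lr, and then glides down to min_lr by the last step.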

Matrix Operations: The Real Computation

In practice, we don’t compute neurons one by one. We use matrices (2D arrays of numbers) and matrix multiplication to process everything in parallel.

If we have:

Input: a matrix X with shape [batch_size, input_features]
Weights: a matrix W with shape [input_features, output_features]

Then:

Output = X × W + bias

This single matrix multiplication replaces thousands of individual neuron computations. GPUs are extremely good at matrix multiplication - that’s why they’re essential for deep learning.
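A small sketch of Output = X × W + bias, written with plain Python lists so the arithmetic stays visible. The helper names matmul and linear are mine; real code would hand this to a GPU library.

```python
def matmul(a, b):
    """Multiply an [n, k] matrix by a [k, m] matrix (lists of rows)."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def linear(x, w, bias):
    """One layer for a whole batch: X × W, then add the bias to every row."""
    return [[v + b for v, b in zip(row, bias)] for row in matmul(x, w)]

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]        # batch_size=2, input_features=3
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]             # input_features=3, output_features=2
bias = [0.5, -0.5]

out = linear(X, W, bias)     # shape [2, 2]
```

Each output row holds the results of two neurons for one input, all computed in a single multiplication.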

Key insight for understanding LLMs: Everything that happens inside is matrix operations. Text becomes matrices, gets transformed by multiplying with weight matrices, and produces output matrices.

2. Tokenization: Text to Numbers

Neural networks need numbers. Text is not numbers. Tokenization bridges this gap.

Tokenization breaks text into chunks (tokens) and maps each to an integer ID.

"Hello world" → ["Hello", " world"] → [15496, 995]

Tokens aren’t always words. Common words might be single tokens; rare words get split:

"unhappiness" → ["un", "happiness"] → 2 tokens
"the" → ["the"] → 1 token

Byte-Pair Encoding (BPE) is the most widely used algorithm. It starts with characters (or bytes) and iteratively merges the most frequent adjacent pair until it hits a target vocabulary size (typically 32k-100k tokens, with some recent models going higher).
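The merge loop can be sketched on a toy corpus. This is a simplified illustration of the BPE idea, not a production tokenizer: the three-word corpus, its counts, and the choice of two merges are all made up.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

corpus = {"low": 5, "lower": 2, "lowest": 3}          # word -> count
words = {tuple(word): count for word, count in corpus.items()}
merges = []
for _ in range(2):                                    # learn two merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)
```

After two merges the frequent stem "low" has become a single token, while the rarer suffixes are still split into characters.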

Why this matters to you:

Billing is per token, not per word
Context limits are in tokens (~1000 tokens ≈ 750 words)
Rare words cost more tokens (bad for non-English languages, code, etc.)

3. Embeddings: Building the Input Matrix

Now we have token IDs like [15496, 995]. But these are just arbitrary integers - the model doesn’t know that 15496 means “Hello”.

We need to convert these IDs into vectors that capture meaning. This is where embeddings come in.

The Embedding Matrix

The model has a learned Embedding Matrix of shape [vocab_size, embedding_dim].

vocab_size: number of tokens in vocabulary (e.g., 50,000)
embedding_dim: size of each embedding vector (e.g., 4096)

Each row of this matrix is the embedding for one token. To get an embedding, we simply look up the row for that token ID.

Token ID 15496 → look up row 15496 → [0.12, -0.34, 0.98, ...] (4096 numbers)

From Tokens to Input Matrix

If our input has 5 tokens, we look up 5 rows, giving us:

Input Matrix X: shape [5, 4096]
- Row 0: embedding for token 0
- Row 1: embedding for token 1
- ...
- Row 4: embedding for token 4
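The lookup itself is just row indexing. In this sketch the sizes are tiny stand-ins for the 50,000 × 4096 example, and the random values stand in for embeddings that would really be learned during training.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10, 4      # tiny stand-ins for 50,000 and 4096

# One row per token in the vocabulary.
embedding_matrix = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

def embed(token_ids):
    """Build the input matrix X by looking up one row per token ID."""
    return [embedding_matrix[token_id] for token_id in token_ids]

token_ids = [3, 7, 7, 1, 0]            # 5 tokens
X = embed(token_ids)                   # shape [5, 4]
```

Note that the two occurrences of token 7 get identical rows: at this stage the vector depends only on the token, not on where it appears.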

This input matrix X is what flows through the entire Transformer. Every subsequent operation transforms this matrix - changing its values but keeping its shape (roughly).

Why Embeddings Work

These embedding vectors are learned during training. The model adjusts them so that:

Similar words have similar vectors
Relationships are preserved geometrically

Famous example: King - Man + Woman ≈ Queen

This vector arithmetic works because the model learned that the direction from “Man” to “Woman” is similar to the direction from “King” to “Queen”. Meaning becomes geometry.

4. The Transformer: Transforming the Input Matrix

The Transformer architecture was introduced in the landmark 2017 paper “Attention Is All You Need” by Vaswani et al. It replaced previous sequence models (RNNs, LSTMs) with a purely attention-based architecture that could be parallelized efficiently on GPUs. Every modern LLM - GPT, Llama, Claude, Gemini - is built on this foundation.

Here’s the big picture of what happens inside an LLM:

Input Text
    ↓
[Tokenization] → Token IDs: [1532, 38, 2898, 318, 1049]
    ↓
[Embedding Lookup] → Input Matrix X: shape [5, 4096]
    ↓
[Transformer Layer 1] → Transformed Matrix: shape [5, 4096]
    ↓
[Transformer Layer 2] → Transformed Matrix: shape [5, 4096]
    ↓
... (repeat 96 times for GPT-3)
    ↓
[Final Layer Norm] → Output Matrix: shape [5, 4096]
    ↓
[Output Projection] → Logits: shape [5, 50000]
    ↓
[Softmax on last row] → Probabilities for next token

Each Transformer layer refines the representations. Early layers might capture syntax; later layers capture meaning, context, and reasoning.

Important: These Transformer layers ARE the “layers” of the neural network. When we say “GPT-3 has 96 layers,” we mean 96 Transformer layers stacked on top of each other.

Let’s break down what happens inside each Transformer layer.

4.1 Positional Encoding: Knowing Word Order

There’s a problem: nothing so far tells the model that “Dog bites man” and “Man bites dog” are different. The embedding for “Dog” is the same regardless of position.

Positional Encoding adds position information to each embedding. Before entering the Transformer, we modify our input matrix:

X = X + PositionalEncoding

The original Transformer used fixed mathematical functions (sine and cosine waves at different frequencies). Modern LLMs often learn position embeddings directly or use techniques like Rotary Position Embeddings (RoPE), which rotates the Query and Key vectors inside attention instead of adding a vector to X.
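Here is a sketch of the sine/cosine scheme from the original Transformer paper, where even dimensions use sin(pos / 10000^(i/d_model)) and the following odd dimension uses the matching cosine. The function names are mine, and this covers only the fixed sinusoidal variant, not learned embeddings or RoPE.

```python
import math

def sinusoidal_encoding(seq_len, d_model):
    """Return a [seq_len, d_model] matrix of position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):            # i is the even dimension
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positions(x):
    """X = X + PositionalEncoding, element by element."""
    pe = sinusoidal_encoding(len(x), len(x[0]))
    return [[v + p for v, p in zip(row, pe_row)]
            for row, pe_row in zip(x, pe)]
```

Two identical embeddings at different positions come out of add_positions as different vectors, which is what lets the model tell “Dog bites man” from “Man bites dog”.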

Why this matters: Positional encoding is one reason models have context limits. A model only reliably handles positions it saw during training, and quality tends to degrade beyond that length.

4.2 Self-Attention: How Tokens Talk to Each Other

This is the heart of the Transformer. Self-attention lets each token “look at” all other tokens and gather relevant information.

The Intuition

Consider: “The cat sat on the mat because it was tired.”

What does “it” refer to? The cat. To understand “it”, the model needs to attend to “cat” - to pull information from that position into the current position.

Self-attention computes, for every token, which other tokens it should pay attention to.

The Math: From Input to Output

Let’s trace exactly what happens. We start with our input matrix X with shape [seq_len, d_model] (e.g., [5, 4096]).

Step 1: Create Q, K, V matrices

We project X into three different representations using learned weight matrices:

Q = X × W_Q    (Query: what am I looking for?)
K = X × W_K    (Key: what do I contain?)  
V = X × W_V    (Value: what information to pass along?)

Where:

W_Q, W_K, W_V are learned weight matrices, each of shape [d_model, d_k]
Result: Q, K, V each have shape [seq_len, d_k]

Think of it this way:

Query for each token asks: “What information do I need?”
Key for each token advertises: “Here’s what I have”
Value for each token holds: “Here’s the actual information”

Step 2: Compute attention scores

How much should token i attend to token j? We measure similarity between token i’s Query and token j’s Key using dot product:

Attention Scores = Q × K^T

Where K^T means K transposed (rows become columns).

Result: a matrix of shape [seq_len, seq_len] where position (i, j) tells us how much token i should attend to token j.

Step 3: Scale the scores

The dot products grow with d_k, and very large scores push softmax toward putting almost all weight on one token, which leaves near-zero gradients for training. We scale down:

Scaled Scores = Attention Scores / √d_k

Dividing by √d_k (square root of the key dimension) keeps values in a reasonable range.

Step 4: Apply softmax

Convert scores to probabilities. For each row (each token), softmax makes the values sum to 1:

Attention Weights = softmax(Scaled Scores)

Now each row contains a probability distribution over all tokens - how much attention to pay to each.

Step 5: Weighted sum of values

Finally, use these attention weights to compute a weighted average of the Value vectors:

Attention Output = Attention Weights × V

Result: shape [seq_len, d_k] - a new representation where each token has gathered information from other tokens based on relevance.

The full formula:

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
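Steps 2 to 5 fit in a few lines. This is a minimal single-head sketch in plain Python with toy Q, K, V matrices typed in by hand; in a real model they would come from X × W_Q, X × W_K, and X × W_V.

```python
import math

def matmul(a, b):
    """[n, k] x [k, m] -> [n, m], matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]                  # K^T
    scores = matmul(q, k_t)                               # step 2
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # step 3
    weights = [softmax(row) for row in scaled]            # step 4
    return matmul(weights, v), weights                    # step 5

# Toy inputs: seq_len=3, d_k=2.
Q = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
output, weights = attention(Q, K, V)
```

The first query is all zeros, so it matches every key equally and its output is simply the average of the three Value rows.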

Masked Attention

For language models, we add one constraint: a token can’t see future tokens (that would be cheating - we’re trying to predict them!).

Masking sets attention scores for future positions to negative infinity before softmax. This makes those weights become 0 after softmax.
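A sketch of the masking step, assuming the scores arrive as a square [seq_len, seq_len] list of lists. The function name causal_softmax is mine.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where token i may only attend to tokens 0..i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) get -inf before softmax.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]   # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]
weights = causal_softmax(scores)
```

The first token can only attend to itself, so its row is [1.0, 0.0, 0.0]; only the last row spreads weight across all three positions.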

4.3 Multi-Head Attention: Multiple Perspectives

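A single attention operation produces just one set of attention weights per token, which makes it hard to track several kinds of relationship at once. But language is complex - we might need to track syntax, semantics, and coreference simultaneously.

Multi-Head Attention runs several attention operations in parallel, each with its own W_Q, W_K, W_V. Each head usually works in a smaller dimension (d_k = d_model / num_heads), so the total cost stays close to that of one full-width head. Each “head” can learn to attend to different patterns: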

One head might track subject-verb relationships
Another might focus on nearby words
Another might handle coreference (like “it” referring to “cat”)

The outputs from all heads are concatenated and projected:

MultiHead = Concat(head_1, head_2, ..., head_h) × W_O

Typical LLMs use 32-128 heads.
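The concat-and-project step can be sketched as follows. Sizes are tiny (d_model=8, 2 heads), the weights are random placeholders rather than trained values, and the split d_k = d_model / num_heads is the usual convention, assumed here rather than taken from the formula above.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    """`heads` is a list of (W_Q, W_K, W_V) triples, one per head."""
    head_outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                    for w_q, w_k, w_v in heads]
    # Concat: row i is head_1[i] followed by head_2[i], and so on.
    concat = [sum((head[i] for head in head_outputs), [])
              for i in range(len(x))]
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, num_heads = 3, 8, 2
d_k = d_model // num_heads
x = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3))
         for _ in range(num_heads)]
w_o = random_matrix(d_model, d_model, rng)
out = multi_head_attention(x, heads, w_o)     # shape [3, 8]
```

Each head produces a [seq_len, d_k] block; concatenating the heads restores width d_model before the W_O projection.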

4.4 Feed-Forward Network: Processing Each Position

After attention, each position passes through a Feed-Forward Network (FFN) - a simple two-layer neural network applied independently to each token:

FFN(x) = ReLU(x × W1 + b1) × W2 + b2

Where:

W1 expands the dimension (e.g., 4096 → 16384)
ReLU is an activation function: ReLU(x) = max(0, x). Many modern LLMs swap in smoother variants such as GELU or SwiGLU.
W2 projects back down (16384 → 4096)
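A sketch of the FFN formula for a single token vector, shrunk to d_model=2 and a hidden size of 4. The weights are hand-picked (not learned) so that the result is easy to check: this particular FFN computes the absolute value of each input, something no purely linear layer can do.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def linear(x, w, b):
    """x × W + b for one vector x and a [len(x), len(b)] matrix W."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def ffn(x, w1, b1, w2, b2):
    hidden = relu(linear(x, w1, b1))    # expand, then apply ReLU
    return linear(hidden, w2, b2)       # project back down

w1 = [[1.0, -1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, -1.0]]            # 2 -> 4
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0],
      [1.0, 0.0],
      [0.0, 1.0],
      [0.0, 1.0]]                       # 4 -> 2
b2 = [0.0, 0.0]

out = ffn([2.0, -3.0], w1, b1, w2, b2)
```

For the input [2.0, -3.0] the hidden layer is [2, 0, 0, 3] and the output is [2.0, 3.0].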

The FFN is often described as where the model stores “knowledge”. The expanded middle layer creates space for encoding facts and patterns.

Why do we need FFN? Attention alone isn’t enough:

Attention = “gather” - it combines information from different positions (weighted averaging), but doesn’t really process that information
FFN = “think” - it’s a non-linear transformation that processes the combined information
Interpretability research suggests that much factual knowledge (“Paris is the capital of France”) is stored in FFN weights rather than attention weights
Without FFN, the model would just be mixing tokens without doing any real computation

4.5 Residual Connections & Layer Normalization

After each sub-layer (attention or FFN), we add the input back to the output:

output = LayerNorm(x + Sublayer(x))

Two things are happening here:

Residual Connection (x + Sublayer(x)): We add the original input to the output. This helps gradients flow during training - without it, gradients vanish as they propagate backward through 96 layers.

LayerNorm is a function (not just addition) that normalizes the values:

LayerNorm(x) = γ × (x - mean) / √(variance + ε) + β

It computes the mean and variance of x, normalizes to mean=0 and variance=1, then applies learnable scale (γ) and shift (β). This keeps activations in a reasonable range - without it, values would explode or vanish through many layers.
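The LayerNorm formula and the residual "add, then normalize" step, sketched for one token vector. The function names and the eps default are my choices.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one vector to mean 0, variance 1, then scale and shift."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(variance + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def add_and_norm(x, sublayer_out, gamma, beta):
    """output = LayerNorm(x + Sublayer(x)) for one token vector."""
    summed = [a + b for a, b in zip(x, sublayer_out)]   # residual connection
    return layer_norm(summed, gamma, beta)

y = layer_norm([1.0, 2.0, 3.0, 4.0], gamma=[1.0] * 4, beta=[0.0] * 4)
```

With γ = 1 and β = 0, the result has mean 0 and variance very close to 1, whatever the scale of the input.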

Putting It Together: One Transformer Layer

One Transformer layer does:

1. attention_out = MultiHeadAttention(X)
2. X = LayerNorm(X + attention_out)     ← residual connection
3. ffn_out = FFN(X)
4. X = LayerNorm(X + ffn_out)           ← residual connection
5. Return X

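The input and output have the same shape. Stack 96 of these layers (for GPT-3) and you have an LLM. Note that this is the post-norm ordering from the original Transformer; GPT-style models and Llama normalize the input of each sub-layer instead (pre-norm: x + Sublayer(LayerNorm(x))), which is why the big-picture diagram ends with a final layer norm.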
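The five steps translate almost line for line into code. In this structural sketch the sub-layers are passed in as functions, and the ones used at the bottom are trivial placeholders (not real attention or FFN) so the data flow is easy to follow.

```python
def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

def transformer_layer(x, attention, ffn, norm1, norm2):
    attention_out = attention(x)
    x = norm1(add(x, attention_out))    # residual connection + norm
    ffn_out = ffn(x)
    x = norm2(add(x, ffn_out))          # residual connection + norm
    return x

def run_stack(x, layers):
    """Feed the output of each layer into the next one."""
    for attention, ffn, norm1, norm2 in layers:
        x = transformer_layer(x, attention, ffn, norm1, norm2)
    return x

# Placeholder sub-layers, for illustration only.
fake_attention = lambda m: [[1.0 for _ in row] for row in m]
fake_ffn = lambda m: [[2.0 * v for v in row] for row in m]
identity_norm = lambda m: m

layer = (fake_attention, fake_ffn, identity_norm, identity_norm)
out = run_stack([[1.0, 2.0]], [layer])
```

With these placeholders, [[1.0, 2.0]] becomes [[2.0, 3.0]] after the first residual and [[6.0, 9.0]] after the second, with the shape unchanged throughout.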

What Gets Learned During Training (The Parameters)

When we say a model has “70 billion parameters,” we mean all the learnable weights across the entire network. For each Transformer layer, the model learns:

Component	What’s Learned	Count per Layer
Attention	W_Q, W_K, W_V for each head	3 × num_heads matrices
Attention	W_O (output projection)	1 matrix
FFN	W1 (expand) and W2 (contract)	2 large matrices
LayerNorm	γ (scale) and β (shift)	2 vectors per norm

Plus the Embedding Matrix (one copy for the whole model) and the Output Projection, which some models tie to the Embedding Matrix so that it adds no extra parameters.

Multiply by 96 layers and you get billions of parameters. All of these are randomly initialized, then adjusted through training to minimize prediction error.
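A back-of-the-envelope count shows where the billions come from. This sketch assumes the FFN is 4× wider than d_model (as in the 4096 → 16384 example) and ignores biases and position embeddings. The 96 layers come from this note; d_model = 12288 and a vocabulary of 50257 are the published GPT-3 values, not figures stated above.

```python
def layer_params(d_model, d_ff):
    attention = 4 * d_model * d_model   # W_Q, W_K, W_V (all heads) + W_O
    ffn = 2 * d_model * d_ff            # W1 (expand) and W2 (contract)
    layer_norms = 2 * 2 * d_model       # gamma and beta for two norms
    return attention + ffn + layer_norms

def model_params(n_layers, d_model, vocab_size, d_ff=None):
    d_ff = d_ff if d_ff is not None else 4 * d_model
    embedding = vocab_size * d_model
    return n_layers * layer_params(d_model, d_ff) + embedding

total = model_params(n_layers=96, d_model=12288, vocab_size=50257)
billions = total / 1e9
```

The estimate lands at roughly 175 billion, and almost all of it sits in the attention and FFN matrices (about 12 × d_model² per layer).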

5. Getting the Output: From Matrix to Words

After all Transformer layers, we have an output matrix of shape [seq_len, d_model]. How do we get the next word?

The Output Projection

During generation, we only care about the last position (that’s where we predict the next token). We take that row and multiply by an output projection matrix:

logits = output_row × W_vocab

Where W_vocab has shape [d_model, vocab_size] (e.g., [4096, 50000]).

Result: a vector of 50,000 numbers (one per possible token) - these are called logits.

Softmax: Logits to Probabilities

Apply softmax to convert logits to probabilities:

P(token_i) = exp(logit_i) / Σ exp(logit_j)

Now we have a probability distribution over all possible next tokens.
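The formula in code, with one practical twist that is not in the formula itself: subtracting the largest logit first. That leaves the result unchanged mathematically but stops exp() from overflowing on large logits.

```python
import math

def softmax(logits):
    """P(token_i) = exp(logit_i) / sum_j exp(logit_j), computed stably."""
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # three made-up logits
```

The probabilities sum to 1 and keep the ordering of the logits: the largest logit gets the largest share.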

Choosing the Next Token

Several strategies:

Greedy: Pick the highest probability token. Deterministic but often repetitive.

Temperature: Divide logits by a temperature T before softmax. T > 1 flattens the distribution (more randomness); T < 1 sharpens it (closer to greedy).

Top-K: Only consider the K most likely tokens, then sample.

Top-P (Nucleus): Sort tokens by probability, keep the smallest set whose cumulative probability reaches P (e.g., 0.9), then sample from that set.

In practice, APIs use combinations of these.
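One way these strategies can be combined, as a sketch. The function name and parameters are hypothetical, and treating temperature 0 as "greedy" is a convention I am assuming here; real APIs differ in the order and details of these filters.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None,
                      rng=random):
    if temperature == 0:                       # greedy: highest logit wins
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax([logit / temperature for logit in logits])
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:                      # keep the K most likely
        ranked = ranked[:top_k]
    if top_p is not None:                      # keep the nucleus
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept
    # random.choices renormalizes the surviving probabilities for us.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]
```

For example, sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9) would first sharpen the distribution, then cut it to 50 candidates, then to the nucleus, and only then sample.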

Autoregressive Generation

LLMs generate one token at a time:

Process prompt → get probability for next token
Sample a token
Append to input
Repeat

This is autoregressive - each token depends on all previous tokens.
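The loop itself is short. In this sketch toy_model is a hypothetical stand-in for a real LLM: it returns logits over a 4-token vocabulary and always prefers "last token + 1". Greedy selection is used to keep the output deterministic.

```python
def toy_model(token_ids):
    """Hypothetical stand-in for an LLM: logits over a 4-token vocabulary."""
    vocab_size = 4
    preferred = (token_ids[-1] + 1) % vocab_size
    return [5.0 if t == preferred else 0.0 for t in range(vocab_size)]

def generate(model, prompt_ids, max_new_tokens, eos_id=None):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)                                   # process input
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # pick
        ids.append(next_id)                                   # append
        if next_id == eos_id:                                 # optional stop
            break
    return ids                                                # else repeat

out = generate(toy_model, prompt_ids=[0], max_new_tokens=5)
```

Starting from [0], the loop produces [0, 1, 2, 3, 0, 1], calling the model once per new token with the full sequence so far.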

The KV Cache

Here’s an optimization insight: when generating token 10, we need attention with tokens 1-9. But tokens 1-9 haven’t changed! Why recompute their Keys and Values?

The KV Cache stores K and V for all previous tokens. For each new token, we only compute Q, K, V for that token, then reuse cached values.
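A minimal sketch of the idea for a single attention head. The class name KVCache and its attend method are hypothetical; real implementations keep one such cache per layer and per head, stored as tensors.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

class KVCache:
    """Keeps the Key and Value vectors of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def attend(self, q, k, v):
        """Store the new token's K and V, then attend over the whole cache."""
        self.keys.append(k)
        self.values.append(v)
        d_k = len(k)
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d_k)
                  for key in self.keys]
        weights = softmax(scores)
        return [sum(w * value[j] for w, value in zip(weights, self.values))
                for j in range(len(v))]

cache = KVCache()
first = cache.attend(q=[1.0, 0.0], k=[1.0, 0.0], v=[1.0, 2.0])
second = cache.attend(q=[0.0, 0.0], k=[0.0, 1.0], v=[3.0, 4.0])
```

Each call does work proportional to the number of cached tokens, and nothing is recomputed for earlier positions.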

Trade-off:

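Without cache: all n tokens are reprocessed for every new token, so attention costs O(n²) per generated token
With cache: only the new token is processed, so attention costs O(n) per generated token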
Cost: Memory grows with sequence length

This is why long-context models need massive VRAM - they’re caching KV for the entire context.
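A rough size estimate makes the VRAM point concrete. This sketch assumes standard multi-head attention (each layer stores K and V vectors of width d_model per token) and FP16 values at 2 bytes each. The 96 layers come from this note; d_model = 12288 is the published GPT-3 width, not a figure stated above.

```python
def kv_cache_bytes(n_layers, seq_len, d_model, bytes_per_value=2,
                   batch_size=1):
    """Memory for cached K and V across all layers."""
    # 2 tensors (K and V) per layer, each [seq_len, d_model] per sequence.
    return 2 * n_layers * seq_len * d_model * bytes_per_value * batch_size

gib = kv_cache_bytes(n_layers=96, seq_len=2048, d_model=12288) / 2**30
```

Under these assumptions a single 2048-token sequence needs 9 GiB of cache, and the figure grows linearly with both context length and batch size.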

6. Training: From Random to Intelligent

LLM training happens in phases.

Phase 1: Pre-training

Goal: Learn language patterns and world knowledge.

Data: Trillions of tokens - web scrapes, books, Wikipedia, GitHub.

Objective: Next-token prediction. Given tokens 1 to n, predict token n+1.

Result: A “base model” good at completing text but not following instructions. Ask “What is 2+2?” and it might respond “What is 3+3?” - continuing the pattern rather than answering.

Scale: This is expensive. GPT-4 scale pre-training is estimated to cost tens of millions of dollars or more.
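The next-token objective above can be sketched in two small functions: one that turns a token sequence into (context, target) training pairs, and one that scores a prediction with cross-entropy, the usual loss for this setup. Function names are mine.

```python
import math

def training_pairs(token_ids):
    """Every prefix of the sequence predicts the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def cross_entropy(logits, target_id):
    """-log P(target), computed stably from raw logits."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[target_id]

pairs = training_pairs([5, 6, 7])
uncertain = cross_entropy([0.0, 0.0], target_id=0)          # 50/50 guess
confident_right = cross_entropy([10.0, 0.0, 0.0], target_id=0)
confident_wrong = cross_entropy([10.0, 0.0, 0.0], target_id=1)
```

The loss is near zero when the model puts most probability on the correct token and large when it confidently picks the wrong one. Pre-training minimizes the average of this loss over trillions of tokens.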

Phase 2: Supervised Fine-Tuning (SFT)

Goal: Teach instruction-following.

Data: Curated (prompt, response) pairs.

Process: Continue training on this data.

Result: An “Instruct” model that answers questions.

Phase 3: Alignment (RLHF / DPO)

Goal: Align with human preferences - helpful, harmless, honest.

RLHF:

Generate multiple responses
Humans rank them
Train a reward model on rankings
Use RL to optimize the LLM for high reward

DPO: Simpler alternative - optimize directly on preference pairs without a reward model.
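For one preference pair, the DPO loss from the original DPO paper is -log σ(β × [(log π(chosen) - log π_ref(chosen)) - (log π(rejected) - log π_ref(rejected))]). The sketch below computes it from four log-probabilities; the argument names and the β = 0.1 default are illustrative choices, and the numbers are made up.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single (chosen, rejected) response pair."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference model: no preference learned yet.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

The loss starts at log 2 when the policy equals the reference and drops as the policy raises the chosen response relative to the rejected one. No reward model appears anywhere in the computation.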

7. Scaling Laws

Performance scales predictably with compute, data, and parameters.

Chinchilla Scaling Laws (2022):

For fixed compute, there’s an optimal model size / data ratio
Many early LLMs were “undertrained” - too big for their data
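Llama models apply this lesson, and the smaller ones go further: they train on far more tokens than the compute-optimal ratio, trading extra training compute for cheaper inference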
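To get a feel for the trade-off, here is a sketch built on two widely quoted approximations that are not stated in this note: training compute C ≈ 6 × N × D FLOPs (N parameters, D tokens), and a compute-optimal ratio of roughly 20 tokens per parameter. Both are rules of thumb, so treat the outputs as ballpark figures.

```python
import math

TOKENS_PER_PARAM = 20        # rule-of-thumb reading of the Chinchilla result

def training_flops(n_params, n_tokens):
    """Approximate training compute: C = 6 * N * D."""
    return 6 * n_params * n_tokens

def compute_optimal(flops):
    """Split a compute budget into (parameters, tokens) under D = 20 * N."""
    n_params = math.sqrt(flops / (6 * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# Chinchilla itself: 70B parameters trained on 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
n_params, n_tokens = compute_optimal(budget)
```

Feeding Chinchilla's own budget back in recovers about 70B parameters and 1.4T tokens, which is the sense in which larger models trained on fewer tokens were "undertrained".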

Emergent Abilities: Some capabilities seem to appear only at scale - for example, chain-of-thought prompting helps large models far more than small ones. This makes capability prediction hard.

8. Practical Considerations

Quantization: Reduce precision (FP32 → FP16 → INT8 → INT4) to shrink models and speed up inference, often with only a small quality loss.
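The simplest form of INT8 quantization can be sketched as follows: symmetric "absmax" scaling, where one scale factor maps the largest weight to 127. This is only one of several schemes, and the four weights are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(v) for v in values) / 127 or 1.0   # avoid a zero scale
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight now fits in one byte instead of four, and the round trip is off by at most half a quantization step (scale / 2).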

Distillation: Train a small “student” to mimic a large “teacher”. The student learns from soft probability distributions.

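Context Length: Attention compute is O(n²) in sequence length, and KV cache memory grows linearly with it. Sparse attention and sliding windows cut the compute; grouped-query attention (GQA) shrinks the KV cache by sharing Keys and Values across heads.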

Summary

Concept	What It Is	Why It Matters
Neural Network	Layers of matrix operations	The fundamental compute unit
Tokenization	Text → integer IDs	Determines cost and limits
Embeddings	IDs → Input Matrix	Meaning as geometry
Self-Attention	Q, K, V → weighted combination	Core reasoning mechanism
FFN	Two-layer network per position	Where knowledge is stored
Residual Connections	Add input to output	Makes deep networks trainable
KV Cache	Store K, V for past tokens	Makes generation fast
Pre-training → SFT → RLHF	Three-phase training	Random → smart → aligned

This covers the core mechanics. There’s more to explore - mixture of experts, efficient attention variants, multimodal extensions - but if you understand what’s here, you can read the papers and follow along.

References & Further Reading

Videos

3Blue1Brown - “But what is a neural network?” - YouTube - The best visual introduction to neural networks.
3Blue1Brown - “Attention in transformers, visually explained” - YouTube - Beautiful visualization of how attention works.
Andrej Karpathy - “Let’s build GPT: from scratch, in code” - YouTube - 2-hour walkthrough building a GPT from scratch. Highly recommended.
Andrej Karpathy - “Intro to Large Language Models” - YouTube - 1-hour overview of what LLMs are and how they work.
StatQuest - “Transformer Neural Networks Clearly Explained” - YouTube - Clear, step-by-step explanation with visuals.

Articles & Blog Posts

Jay Alammar - “The Illustrated Transformer” - Blog - The go-to visual guide for understanding Transformers.
Lilian Weng - “The Transformer Family” - Blog - Comprehensive overview of Transformer variants.
Chip Huyen - “Building LLM applications for production” - Blog - Practical engineering considerations.
Sebastian Raschka - “Understanding Large Language Models” - Blog - In-depth technical breakdown.

Papers (for the curious)

“Attention Is All You Need” (2017) - arXiv - The original Transformer paper.
“Language Models are Few-Shot Learners” (GPT-3, 2020) - arXiv - Few-shot learning emerging at scale.
“Training Compute-Optimal Large Language Models” (Chinchilla, 2022) - arXiv - Scaling laws.