Deep Dive into LLMs: Under the Hood

Series: AI Engineering 101 | Tags: LLM, Transformers, Architecture, Inference, Fine-tuning

Large Language Models have gone from research curiosities to critical infrastructure. If you’re building with them, you can’t treat them as magic boxes forever. At some point, you’ll hit a latency issue, a cost problem, or an output quality bug - and you’ll need to understand what’s actually happening inside.

This note is my attempt to unpack that.

1. Neural Networks: The Foundation

Before we can understand LLMs, we need to understand neural networks. LLMs are just very large, specialized neural networks - so let’s build up from first principles.

Neural networks are the core of Deep Learning - the subset of Machine Learning that uses multi-layered networks to learn representations from data. If you’re unclear on where Deep Learning fits in the AI landscape, see my note on AI vs ML vs DL.

What is a Neural Network?

A neural network is a function that transforms inputs into outputs through layers of simple operations. Think of it as a pipeline: data goes in, gets transformed step by step, and a result comes out.

The basic unit is a neuron. A neuron does three things:

  1. Takes multiple inputs (numbers)
  2. Multiplies each by a weight (importance)
  3. Adds them up along with a bias term, applies an activation function, and outputs a number

output = activation(w1*x1 + w2*x2 + w3*x3 + ... + bias)
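To make the formula concrete, here is a minimal sketch of one neuron in plain Python. The sigmoid activation and the example numbers are my own assumptions for illustration; the formula above doesn't commit to a particular activation.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)

# Illustrative values only: three inputs, three weights, one bias.
out = neuron(inputs=[1.0, 2.0, 3.0], weights=[0.5, -0.25, 0.1], bias=0.2)
```

Here the weighted sum is 0.5, so the neuron outputs sigmoid(0.5), roughly 0.62.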

Weights (together with biases) are the learnable parameters - the “knowledge” of the network. When we say a model has 70 billion parameters, we mean 70 billion of these learned numbers.

Layers: Stacking Neurons

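A layer is a collection of neurons that process input together. Neural networks stack multiple layers: an input layer that receives the data, one or more hidden layers that transform it, and an output layer that produces the result.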

The “deep” in deep learning means many hidden layers. GPT-3 has 96 layers. More layers = more capacity to learn complex patterns.

How Learning Works

Initially, weights are random - the network outputs garbage. Training adjusts these weights to minimize errors.

The process:

  1. Forward pass: Input flows through the network, producing an output
  2. Loss calculation: Compare output to the correct answer. How wrong are we?
  3. Backward pass: Calculate how each weight contributed to the error (using calculus - specifically, gradients)
  4. Update weights: Nudge each weight slightly to reduce the error

Repeat this millions of times on millions of examples. Gradually, the weights settle into values that make good predictions.

This is called gradient descent - we’re descending down a “landscape” of errors to find the lowest point.
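The four-step loop can be sketched on the smallest possible "network": a single weight w fitting y = w * x. The data, learning rate, and step count are toy assumptions, and the gradient is worked out by hand rather than by backpropagation through layers.

```python
# Toy data generated from y = 3 * x, so the best value for w is 3.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0               # start from a poor guess
learning_rate = 0.01  # how big each nudge is

for step in range(200):
    # Forward pass and loss gradient for mean squared error:
    # loss = mean((w*x - y)^2)  =>  d(loss)/dw = mean(2 * (w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad  # step against the gradient to reduce the loss
```

After the loop, w ends up very close to 3.0. Real training does the same thing for billions of weights at once, with gradients computed automatically.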

Optimizers used in LLMs: Pure gradient descent is slow. Modern LLMs use more sophisticated variants, most commonly Adam and AdamW, which adapt the step size for each parameter.

Matrix Operations: The Real Computation

In practice, we don’t compute neurons one by one. We use matrices (2D arrays of numbers) and matrix multiplication to process everything in parallel.

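If we have an input matrix X of shape [n, d_in] (n inputs, d_in features each) and a weight matrix W of shape [d_in, d_out] (one column per neuron), then: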

Output = X × W + bias

This single matrix multiplication replaces thousands of individual neuron computations. GPUs are extremely good at matrix multiplication - that’s why they’re essential for deep learning.
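Here is a small sketch of Output = X × W + bias in plain Python, with tiny hand-picked matrices (my own values). A real implementation would hand this to a GPU library instead of looping, but the arithmetic is the same.

```python
def linear(X, W, bias):
    """Compute X × W + bias for X: [n, d_in], W: [d_in, d_out], bias: [d_out]."""
    return [
        [sum(x * W[k][j] for k, x in enumerate(row)) + bias[j]
         for j in range(len(bias))]
        for row in X
    ]

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 inputs, 2 features each
W = [[1.0, 0.0, -1.0], [0.5, 1.0, 2.0]]    # 2 features -> 3 neurons (one per column)
bias = [0.0, 1.0, 0.0]
out = linear(X, W, bias)                   # shape [3, 3]: every neuron for every input
```

Each output row holds the pre-activation values of all three neurons for one input, computed in one pass.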

Key insight for understanding LLMs: Everything that happens inside is matrix operations. Text becomes matrices, gets transformed by multiplying with weight matrices, and produces output matrices.

2. Tokenization: Text to Numbers

Neural networks need numbers. Text is not numbers. Tokenization bridges this gap.

Tokenization breaks text into chunks (tokens) and maps each to an integer ID.

Tokens aren’t always words. Common words might be single tokens; rare words get split into several subword pieces.

Byte-Pair Encoding (BPE) is the standard algorithm. It starts with characters and iteratively merges the most frequent adjacent pair until it hits a target vocabulary size (typically 32k-100k tokens, with some newer models going larger).
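The merge loop is easier to see in code. This is a toy sketch of the BPE training step, not any particular tokenizer's implementation: the corpus, the helper names most_frequent_pair and merge_pair, and the choice of three merges are all assumptions for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word as a tuple of characters, mapped to its frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
for _ in range(3):  # three merges; a real tokenizer runs tens of thousands
    words = merge_pair(words, most_frequent_pair(words))
```

After a few merges the frequent suffix "est" has become a single symbol, which is how common fragments end up as single tokens.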

[Figure: Tokenization example]

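Why this matters to you: API pricing and context limits are both counted in tokens, not words or characters, so how your text tokenizes determines both cost and how much fits in a prompt.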

3. Embeddings: Building the Input Matrix

Now we have token IDs like [15496, 995]. But these are just arbitrary integers - the model doesn’t know that 15496 means “Hello”.

We need to convert these IDs into vectors that capture meaning. This is where embeddings come in.

The Embedding Matrix

The model has a learned Embedding Matrix of shape [vocab_size, embedding_dim].

Each row of this matrix is the embedding for one token. To get an embedding, we simply look up the row for that token ID.

Token ID 15496 → look up row 15496 → [0.12, -0.34, 0.98, ...] (4096 numbers)

[Figure: Embedding Lookup]

From Tokens to Input Matrix

If our input has 5 tokens, we look up 5 rows, giving us:

Input Matrix X: shape [5, 4096]
- Row 0: embedding for token 0
- Row 1: embedding for token 1
- ...
- Row 4: embedding for token 4

This input matrix X is what flows through the entire Transformer. Every subsequent operation transforms this matrix - changing its values but keeping its shape (roughly).
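The lookup is nothing more than row indexing. In this sketch the sizes are shrunk to 10 × 4, the token IDs are made up, and random numbers stand in for learned embeddings.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 10, 4  # tiny illustrative sizes, not a real model's

# The embedding matrix: one row of embedding_dim numbers per token ID.
# Real values are learned during training; random ones stand in for them here.
embedding_matrix = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

token_ids = [3, 7, 7, 1, 0]                    # hypothetical IDs for a 5-token input
X = [embedding_matrix[t] for t in token_ids]   # input matrix X: shape [5, 4]
```

Note that the repeated token ID 7 produces two identical rows: at this stage nothing encodes where a token sits in the sequence.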

Why Embeddings Work

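These embedding vectors are learned during training. The model adjusts them so that tokens used in similar contexts end up with similar vectors, and relationships between tokens show up as consistent directions in the space.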

Famous example: King - Man + Woman ≈ Queen

This vector arithmetic works because the model learned that the direction from “Man” to “Woman” is similar to the direction from “King” to “Queen”. Meaning becomes geometry.

4. The Transformer: Transforming the Input Matrix

The Transformer architecture was introduced in the landmark 2017 paper “Attention Is All You Need” by Vaswani et al. It replaced previous sequence models (RNNs, LSTMs) with a purely attention-based architecture that could be parallelized efficiently on GPUs. Every modern LLM - GPT, Llama, Claude, Gemini - is built on this foundation.

Here’s the big picture of what happens inside an LLM:

Input Text
    ↓
[Tokenization] → Token IDs: [1532, 38, 2898, 318, 1049]
    ↓
[Embedding Lookup] → Input Matrix X: shape [5, 4096]
    ↓
[Transformer Layer 1] → Transformed Matrix: shape [5, 4096]
    ↓
[Transformer Layer 2] → Transformed Matrix: shape [5, 4096]
    ↓
... (repeat 96 times for GPT-3)
    ↓
[Final Layer Norm] → Output Matrix: shape [5, 4096]
    ↓
[Output Projection] → Logits: shape [5, 50000]
    ↓
[Softmax on last row] → Probabilities for next token

Each Transformer layer refines the representations. Early layers might capture syntax; later layers capture meaning, context, and reasoning.

Important: These Transformer layers ARE the “layers” of the neural network. When we say “GPT-3 has 96 layers,” we mean 96 Transformer layers stacked on top of each other.

[Figure: Transformer Architecture]

Let’s break down what happens inside each Transformer layer.

4.1 Positional Encoding: Knowing Word Order

There’s a problem: nothing so far tells the model that “Dog bites man” and “Man bites dog” are different. The embedding for “Dog” is the same regardless of position.

Positional Encoding adds position information to each embedding. Before entering the Transformer, we modify our input matrix:

X = X + PositionalEncoding

The original Transformer used mathematical functions (sine and cosine waves at different frequencies). Modern LLMs often learn position embeddings directly or use techniques like Rotary Position Embeddings (RoPE).
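For the sine/cosine scheme from the original paper, a minimal sketch looks like this. It follows the paper's formula, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) with the cosine of the same angle in the odd column; the function names are mine. Learned embeddings and RoPE work differently and are not shown.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos][2i] = sin(pos / 10000^(2i/d_model)); PE[pos][2i+1] = cos(same angle)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):           # i runs over the even column indices
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positions(X):
    """X = X + PositionalEncoding, element by element."""
    pe = sinusoidal_positional_encoding(len(X), len(X[0]))
    return [[x + p for x, p in zip(row, pe_row)] for row, pe_row in zip(X, pe)]
```

Two identical embedding rows come out of add_positions with different values, which is exactly what lets the model tell “Dog bites man” from “Man bites dog”.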

Why this matters: Positional encoding is one reason models have context limits. A model is only reliable at positions it saw during training, so inputs longer than the training context tend to degrade.

4.2 Self-Attention: How Tokens Talk to Each Other

This is the heart of the Transformer. Self-attention lets each token “look at” all other tokens and gather relevant information.

The Intuition

Consider: “The cat sat on the mat because it was tired.”

What does “it” refer to? The cat. To understand “it”, the model needs to attend to “cat” - to pull information from that position into the current position.

[Figure: Self-Attention Visualization]

Self-attention computes, for every token, which other tokens it should pay attention to.

The Math: From Input to Output

Let’s trace exactly what happens. We start with our input matrix X with shape [seq_len, d_model] (e.g., [5, 4096]).

Step 1: Create Q, K, V matrices

We project X into three different representations using learned weight matrices:

Q = X × W_Q    (Query: what am I looking for?)
K = X × W_K    (Key: what do I contain?)  
V = X × W_V    (Value: what information to pass along?)

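Where W_Q, W_K, and W_V are learned weight matrices of shape [d_model, d_k], so Q, K, and V each have shape [seq_len, d_k].

Think of it this way: each token’s Query is compared against every token’s Key, and the strength of each match decides how much of that token’s Value gets pulled in.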

[Figure: Q, K, V Computation]

Step 2: Compute attention scores

How much should token i attend to token j? We measure similarity between token i’s Query and token j’s Key using dot product:

Attention Scores = Q × K^T

Where K^T means K transposed (rows become columns).

Result: a matrix of shape [seq_len, seq_len] where position (i, j) tells us how much token i should attend to token j.

Step 3: Scale the scores

The dot products can get very large when d_k is large, which pushes the softmax in the next step toward a near one-hot output with tiny gradients. We scale down:

Scaled Scores = Attention Scores / √d_k

Dividing by √d_k (square root of the key dimension) keeps values in a reasonable range.

Step 4: Apply softmax

Convert scores to probabilities. For each row (each token), softmax makes the values sum to 1:

Attention Weights = softmax(Scaled Scores)

Now each row contains a probability distribution over all tokens - how much attention to pay to each.

Step 5: Weighted sum of values

Finally, use these attention weights to compute a weighted average of the Value vectors:

Attention Output = Attention Weights × V

Result: shape [seq_len, d_k] - a new representation where each token has gathered information from other tokens based on relevance.

The full formula:

Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
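Steps 2 through 5 fit in a few lines. This is a plain-Python sketch with hand-picked 3 × 2 matrices standing in for Q, K, and V (in a real model they come from X × W_Q and friends); the helper names are mine.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)                            # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]                        # transpose K
    scores = matmul(Q, K_T)                                     # [seq_len, seq_len]
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]                  # each row sums to 1
    return matmul(weights, V), weights                          # output: [seq_len, d_k]

# Illustrative values only.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

Token 0's query matches keys 0 and 2 better than key 1, so its row of weights puts more mass on those two positions, and its output row is pulled toward their Value vectors.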

Masked Attention

For language models, we add one constraint: a token can’t see future tokens (that would be cheating - we’re trying to predict them!).

Masking sets attention scores for future positions to negative infinity before softmax. This makes those weights become 0 after softmax.
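A sketch of the masking step, assuming we already have a [seq_len, seq_len] matrix of scaled scores (the numbers below are made up):

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where token i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)                           # position i itself is never masked
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.5, 1.0, 2.0],
          [1.0, 1.0, 0.3],
          [0.2, 0.4, 0.6]]
weights = causal_softmax(scores)
```

The first token can only attend to itself, so its row becomes [1.0, 0.0, 0.0]; the last token sees everything.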

4.3 Multi-Head Attention: Multiple Perspectives

A single attention operation produces one set of attention weights per token, so it tends to capture one kind of relationship at a time. But language is complex - we might need to track syntax, semantics, and coreference simultaneously.

Multi-Head Attention runs several attention operations in parallel, each with its own W_Q, W_K, W_V. Each “head” can learn to attend to different patterns: one might track syntactic structure, for example, while another tracks coreference.

The outputs from all heads are concatenated and projected:

MultiHead = Concat(head_1, head_2, ..., head_h) × W_O

Typical LLMs use 32-128 heads.
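Here is a rough sketch of the concat-and-project formula with 2 heads and d_model = 8. Everything is toy-sized, the weights are random stand-ins for learned ones, and setting d_k = d_model / num_heads is a common convention I'm assuming rather than something stated above.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    # Each head has its own (W_Q, W_K, W_V) and produces [seq_len, d_k].
    head_outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for W_Q, W_K, W_V in heads]
    # Concatenate along the feature axis: [seq_len, num_heads * d_k].
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)  # project back to [seq_len, d_model]

rng = random.Random(0)
seq_len, d_model, num_heads = 3, 8, 2
d_k = d_model // num_heads
X = rand_matrix(seq_len, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_k, rng) for _ in range(3))
         for _ in range(num_heads)]
W_O = rand_matrix(num_heads * d_k, d_model, rng)
out = multi_head_attention(X, heads, W_O)
```

The output has the same shape as X, which is what lets the layers stack.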

4.4 Feed-Forward Network: Processing Each Position

After attention, each position passes through a Feed-Forward Network (FFN) - a simple two-layer neural network applied independently to each token:

FFN(x) = ReLU(x × W1 + b1) × W2 + b2

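Where W1 expands each token’s vector from d_model to a larger hidden size (4 × d_model in the original Transformer) and W2 projects it back down to d_model. The original Transformer used ReLU as the non-linearity; many newer models use GELU or SwiGLU instead.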

The FFN is often described as where the model stores “knowledge”. The expanded middle layer creates space for encoding facts and patterns.
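The FFN formula in code, as a toy sketch with d_model = 2 and a hidden size of 4. The weights are picked by hand so the ReLU visibly zeroes some hidden units.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = ReLU(x × W1 + b1) × W2 + b2 for one token vector x."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]                # expand, then ReLU
    return [sum(h * W2[j][k] for j, h in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]                  # contract back to d_model

# Illustrative weights: W1 is [2, 4] (expand), W2 is [4, 2] (contract).
W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.0, -1.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
b2 = [0.1, 0.2]

X = [[1.0, 2.0], [3.0, -1.0]]
out = [ffn(row, W1, b1, W2, b2) for row in X]  # each position processed independently
```

Unlike attention, no information moves between rows here: each token is transformed on its own.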

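Why do we need FFN? Attention alone isn’t enough: it mixes information across tokens, but its output is essentially a weighted average of Value vectors. The FFN adds a non-linear transformation at each position, which gives the layer far more expressive power.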

4.5 Residual Connections & Layer Normalization

After each sub-layer (attention or FFN), we add the input back to the output:

output = LayerNorm(x + Sublayer(x))

Two things are happening here:

Residual Connection (x + Sublayer(x)): We add the original input to the output. This helps gradients flow during training - without it, gradients vanish as they propagate backward through 96 layers.

LayerNorm is a function (not just addition) that normalizes the values:

LayerNorm(x) = γ × (x - mean) / √(variance + ε) + β

It computes the mean and variance of x, normalizes to mean=0 and variance=1, then applies learnable scale (γ) and shift (β). This keeps activations in a reasonable range - without it, values would explode or vanish through many layers.
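The LayerNorm formula translates almost directly. In this sketch γ is set to all ones and β to all zeros so the normalization itself is visible; the ε value of 1e-5 is a common default I'm assuming.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector to mean 0 / variance 1, then scale and shift."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(variance + eps) + b
            for v, g, b in zip(x, gamma, beta)]

x = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
```

The result has mean 0 and variance very close to 1, regardless of how large the incoming values were.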

Putting It Together: One Transformer Layer

One Transformer layer does:

1. attention_out = MultiHeadAttention(X)
2. X = LayerNorm(X + attention_out)     ← residual connection
3. ffn_out = FFN(X)
4. X = LayerNorm(X + ffn_out)           ← residual connection
5. Return X

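The input and output have the same shape. Stack 96 of these layers (for GPT-3) and you have an LLM. GPT-2, GPT-3, and most recent models actually use a “pre-norm” variant, X = X + Sublayer(LayerNorm(X)), which is why the big-picture diagram ends with a final layer norm. The idea is the same.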

What Gets Learned During Training (The Parameters)

When we say a model has “70 billion parameters,” we mean all the learnable weights across the entire network. For each Transformer layer, the model learns:

| Component | What’s Learned | Count per Layer |
| --- | --- | --- |
| Attention | W_Q, W_K, W_V for each head | 3 × num_heads matrices |
| Attention | W_O (output projection) | 1 matrix |
| FFN | W1 (expand) and W2 (contract) | 2 large matrices |
| LayerNorm | γ (scale) and β (shift) | 2 vectors per norm |

Plus the Embedding Matrix (used once at the input, not per layer) and the Output Projection, which some models tie to the Embedding Matrix so the two share weights.

Multiply by 96 layers and you get billions of parameters. All of these are randomly initialized, then adjusted through training to minimize prediction error.
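As a sanity check, we can do the arithmetic for GPT-3. The sizes d_model = 12288 and an FFN hidden size of 4 × d_model come from the GPT-3 paper (not from the 4096-wide running example in this note). I'm assuming d_k = d_model / num_heads, so all heads together make each of W_Q, W_K, W_V a d_model × d_model block, and I'm ignoring biases, LayerNorm vectors, and embeddings because they are comparatively tiny.

```python
def params_per_layer(d_model, d_ff):
    attention = 4 * d_model * d_model   # W_Q, W_K, W_V (all heads combined) + W_O
    ffn = 2 * d_model * d_ff            # W1 (expand) + W2 (contract)
    return attention + ffn              # biases and LayerNorm vectors omitted

d_model, n_layers = 12288, 96           # GPT-3 175B sizes from the GPT-3 paper
per_layer = params_per_layer(d_model, d_ff=4 * d_model)
total = per_layer * n_layers
print(f"{per_layer:,} per layer, {total:,} across {n_layers} layers")
# prints "1,811,939,328 per layer, 173,946,175,488 across 96 layers"
```

That is about 174 billion, which lands right next to the advertised 175 billion once embeddings are added back in. Note that the FFN accounts for two thirds of each layer.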

5. Getting the Output: From Matrix to Words

After all Transformer layers, we have an output matrix of shape [seq_len, d_model]. How do we get the next word?

The Output Projection

At generation time, we only care about the last position (that’s where we predict the next token). During training, every position is scored at once. We take that last row and multiply by an output projection matrix:

logits = output_row × W_vocab

Where W_vocab has shape [d_model, vocab_size] (e.g., [4096, 50000]).

Result: a vector of 50,000 numbers (one per possible token) - these are called logits.

Softmax: Logits to Probabilities

Apply softmax to convert logits to probabilities:

P(token_i) = exp(logit_i) / Σ exp(logit_j)

Now we have a probability distribution over all possible next tokens.

Choosing the Next Token

Several strategies:

Greedy: Pick the highest probability token. Deterministic but often repetitive.

Temperature: Divide logits by a temperature T before softmax. T > 1 gives a flatter distribution (more randomness); T < 1 gives a sharper one (closer to greedy).

Top-K: Only consider the K most likely tokens, then sample.

Top-P (Nucleus): Sample only from the smallest set of most likely tokens whose cumulative probability exceeds P (e.g., 0.9).

In practice, APIs use combinations of these.
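Here is one way the strategies might be combined, as a plain-Python sketch. The function names, the order of the filters, and the example logits are my assumptions; real APIs differ in the details.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [v / temperature for v in logits]  # T > 1 flattens, T < 1 sharpens
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero the rest, renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # illustrative scores for a 4-token vocabulary
probs = top_p_filter(top_k_filter(softmax(logits, temperature=0.8), k=3), p=0.9)
token = sample(probs, random.Random(0))
```

With k=3 the least likely token gets probability zero, so it can never be sampled no matter how the random draw goes.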

Autoregressive Generation

LLMs generate one token at a time:

  1. Process prompt → get probability for next token
  2. Sample a token
  3. Append to input
  4. Repeat

This is autoregressive - each token depends on all previous tokens.
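The loop itself is short. In this sketch toy_model is a hypothetical stand-in for a real LLM (it just favors "last token + 1"), and decoding is greedy to keep the result deterministic.

```python
def generate(next_token_logits, prompt_ids, max_new_tokens, eos_id=None):
    """Greedy autoregressive decoding: predict, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)      # one score per vocabulary entry
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        ids.append(next_id)                  # the output becomes part of the input
        if next_id == eos_id:                # stop early on an end-of-sequence token
            break
    return ids

def toy_model(ids, vocab_size=5):
    """Hypothetical stand-in for an LLM: favors (last token + 1) mod vocab_size."""
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

result = generate(toy_model, prompt_ids=[0, 1], max_new_tokens=4)
# result == [0, 1, 2, 3, 4, 0]
```

Every iteration calls the model on the full sequence so far, which is why generation cost grows with output length.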

The KV Cache

Here’s an optimization insight: when generating token 10, we need attention with tokens 1-9. But tokens 1-9 haven’t changed! Why recompute their Keys and Values?

The KV Cache stores K and V for all previous tokens. For each new token, we only compute Q, K, V for that token, then reuse cached values.

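Trade-off: less compute per token, but more memory - the cache grows linearly with sequence length and with the number of layers.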

This is why long-context models need massive VRAM - they’re caching KV for the entire context.
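A back-of-the-envelope calculation shows how fast this adds up. The model below is hypothetical (32 layers, 32 KV heads of size 128, FP16 values); plug in your own model's numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # 2 tensors (K and V) per layer, one vector per KV head per cached token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model: 32 layers, 32 KV heads of size 128 (d_model = 4096), FP16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full = kv_cache_bytes(32, 32, 128, seq_len=32_768)
print(f"{per_token / 2**20:.2f} MiB per token, {full / 2**30:.1f} GiB at 32k tokens")
# prints "0.50 MiB per token, 16.0 GiB at 32k tokens"
```

That is 16 GiB for a single 32k-token sequence, before counting the model weights themselves.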

6. Training: From Random to Intelligent

LLM training happens in phases.

Phase 1: Pre-training

Goal: Learn language patterns and world knowledge.

Data: Trillions of tokens - web scrapes, books, Wikipedia, GitHub.

Objective: Next-token prediction. Given tokens 1 to n, predict token n+1.
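The usual way to score next-token prediction is cross-entropy: the loss at each position is the negative log of the probability the model assigned to the token that actually came next. A minimal sketch, with made-up probabilities over a 3-token vocabulary:

```python
import math

def next_token_loss(probs, target_id):
    """Cross-entropy for one position: -log of the probability of the true token."""
    return -math.log(probs[target_id])

confident = next_token_loss([0.05, 0.90, 0.05], target_id=1)  # right and sure
wrong = next_token_loss([0.90, 0.05, 0.05], target_id=1)      # wrong and sure
```

The confident-and-right prediction costs about 0.11, the confident-and-wrong one about 3.0. Training averages this loss over every position in every sequence.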

Result: A “base model” good at completing text but not following instructions. Ask “What is 2+2?” and it might respond “What is 3+3?” - continuing the pattern rather than answering.

Scale: This is expensive. Pre-training at GPT-4 scale is estimated to cost tens of millions of dollars or more.

Phase 2: Supervised Fine-Tuning (SFT)

Goal: Teach instruction-following.

Data: Curated (prompt, response) pairs.

Process: Continue training on this data.

Result: An “Instruct” model that answers questions.

Phase 3: Alignment (RLHF / DPO)

Goal: Align with human preferences - helpful, harmless, honest.

RLHF:

  1. Generate multiple responses
  2. Humans rank them
  3. Train a reward model on rankings
  4. Use RL to optimize the LLM for high reward

DPO: Simpler alternative - optimize directly on preference pairs without a reward model.
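For the curious, the DPO loss for one preference pair is short enough to write out. This sketch follows the formula from the DPO paper (Rafailov et al., 2023); the inputs are total log-probabilities of the chosen and rejected responses under the model being trained (policy) and a frozen reference model, and β = 0.1 is just an example value.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))

no_change = dpo_loss(-10.0, -10.0, -10.0, -10.0)  # policy identical to reference
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)    # policy now favors the chosen reply
```

The loss drops as the policy raises the chosen response's probability relative to the rejected one, measured against the reference model.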

7. Scaling Laws

Performance scales predictably with compute, data, and parameters.

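Chinchilla Scaling Laws (2022): For a fixed compute budget, model size and training data should grow together - roughly 20 training tokens per parameter. By that measure, many earlier large models were undertrained for their size.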
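A quick worked example, using two widely quoted rules of thumb that I'm assuming here: about 20 training tokens per parameter (the Chinchilla result) and training compute C ≈ 6 · N · D FLOPs for N parameters and D tokens.

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens  # common approximation: C ≈ 6 · N · D

n = 70_000_000_000                  # a 70B-parameter model
d = chinchilla_optimal_tokens(n)    # 1.4 trillion tokens
c = training_flops(n, d)            # about 5.9e23 FLOPs
```

These are the numbers for Chinchilla itself: 70B parameters trained on 1.4T tokens.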

Emergent Abilities: Some capabilities seem to appear only at scale - chain-of-thought prompting, for example, was reported to help large models but not small ones. This makes capability prediction hard.

8. Practical Considerations

Quantization: Reduce numeric precision (FP32 → FP16 → INT8 → INT4) to shrink models and speed up inference. Quality loss is usually small at 8 bits and grows as precision drops further.
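To see what quantization does to the numbers, here is a sketch of the simplest scheme: symmetric per-tensor INT8, with one scale factor for the whole list. Production methods are more elaborate (per-channel or per-group scales, calibration), so treat this as the idea only.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 if all zeros
    return [round(v / scale) for v in values], scale

def dequantize(q_values, scale):
    return [q * scale for q in q_values]

weights = [0.12, -0.34, 0.98, -1.27, 0.0]  # illustrative weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs one byte instead of four, and the round-trip error is at most half the scale factor.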

Distillation: Train a small “student” to mimic a large “teacher”. The student learns from soft probability distributions.

Context Length: Attention is O(n²) in sequence length. Sparse attention and sliding windows cut the compute; grouped-query attention (GQA) shrinks the KV cache by sharing Keys and Values across heads.

Summary

| Concept | What It Is | Why It Matters |
| --- | --- | --- |
| Neural Network | Layers of matrix operations | The fundamental compute unit |
| Tokenization | Text → integer IDs | Determines cost and limits |
| Embeddings | IDs → Input Matrix | Meaning as geometry |
| Self-Attention | Q, K, V → weighted combination | Core reasoning mechanism |
| FFN | Two-layer network per position | Where knowledge is stored |
| Residual Connections | Add input to output | Makes deep networks trainable |
| KV Cache | Store K, V for past tokens | Makes generation fast |
| Pre-training → SFT → RLHF | Three-phase training | Random → smart → aligned |

This covers the core mechanics. There’s more to explore - mixture of experts, efficient attention variants, multimodal extensions - but if you understand what’s here, you can read the papers and follow along.

References & Further Reading

Videos

  1. 3Blue1Brown - “But what is a neural network?” - YouTube - The best visual introduction to neural networks.

  2. 3Blue1Brown - “Attention in transformers, visually explained” - YouTube - Beautiful visualization of how attention works.

  3. Andrej Karpathy - “Let’s build GPT: from scratch, in code” - YouTube - 2-hour walkthrough building a GPT from scratch. Highly recommended.

  4. Andrej Karpathy - “Intro to Large Language Models” - YouTube - 1-hour overview of what LLMs are and how they work.

  5. StatQuest - “Transformer Neural Networks Clearly Explained” - YouTube - Clear, step-by-step explanation with visuals.

Articles & Blog Posts

  1. Jay Alammar - “The Illustrated Transformer” - Blog - The go-to visual guide for understanding Transformers.

  2. Lilian Weng - “The Transformer Family” - Blog - Comprehensive overview of Transformer variants.

  3. Chip Huyen - “Building LLM applications for production” - Blog - Practical engineering considerations.

  4. Sebastian Raschka - “Understanding Large Language Models” - Blog - In-depth technical breakdown.

Papers (for the curious)

  1. “Attention Is All You Need” (2017) - arXiv - The original Transformer paper.

  2. “Language Models are Few-Shot Learners” (GPT-3, 2020) - arXiv - Emergent abilities at scale.

  3. “Training Compute-Optimal Large Language Models” (Chinchilla, 2022) - arXiv - Scaling laws.