---
title: A Guide to Large Language Models (LLMs)
---

# A Guide to Large Language Models (LLMs)

_Note: knowledge of attention and transformer architecture by the reader is assumed_

## Introduction to Large Language Models

These notes discuss some important early transformer models, scaling of language models, and various post-training approaches.

## Transformer Models (legacy ones, but these are the ones covered in lecture)

### BERT

BERT (Bidirectional Encoder Representations from Transformers) was introduced in 2018 as a transformer-based model developed by Google. BERT is just the encoder half of the transformer architecture.

**Note that the multi-head self-attention and feed-forward layers of the encoder are often repeated multiple times in practice, often with LayerNorm and residual connections. This stack is often abstracted into what is referred to as a transformer block-- the original BERT paper used 12.**

---

Details on the "Embedding" Stage:

The input words are tokenized and turned into 3 matrices: word embedding matrix, positional embedding matrix, segment/sentence embedding matrix (tokens in the same sentence get assigned the same segment embedding vector).

![image](https://hackmd.io/_uploads/Sk-Epe5ubl.png)

These 3 matrices are then added together via element-wise addition, so

$$ \text{Final Input Embedding} = \text{Word Embedding} + \text{Positional Embedding} + \text{Segment Embedding} $$

This Final Input Embeddings matrix is the output of the "Embedding" block in the above image.

---

The key innovation of BERT that set it apart from the traditional transformer encoder was its training task. Encoder-decoder transformer models of 2017 trained with the objective of next-word prediction, which, while suited for that type of task, was suboptimal for a holistic understanding of context. BERT proposed a new objective: *masked language modeling (MLM)*, which involves randomly masking out a subset of input words (typically ~15%) and asking the model to predict the original words based on the full, unmasked context. For example, the sentence _"I like [MASK] coffee"_ might appear in training-- the model is tasked with recovering the missing word ("black") by attending to the left ("I like") and right ("coffee") contexts.
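The MLM objective can be sketched in a few lines. This is a toy stand-in: the `mask_tokens` helper and the uniform `predicted_probs` below are simplified assumptions, not the full BERT recipe (which also sometimes keeps or randomly swaps the chosen tokens).

```python
import math
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace ~mask_prob of tokens with [MASK], returning the
    corrupted sequence and the masked positions (simplified sketch)."""
    rng = random.Random(seed)
    masked, positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            positions.append(i)
        else:
            masked.append(tok)
    return masked, positions

def mlm_loss(predicted_probs, tokens, positions):
    """Cross-entropy applied only at the masked positions: the negative log
    of the model's predicted probability of each original token."""
    return -sum(math.log(predicted_probs[i][tokens[i]]) for i in positions) / len(positions)

tokens = ["I", "like", "black", "coffee"]
masked, positions = mask_tokens(tokens, mask_prob=0.5, seed=1)
# Hypothetical model outputs: one distribution over a tiny vocabulary per position.
predicted_probs = [{t: 0.25 for t in tokens} for _ in tokens]
print(masked, positions, round(mlm_loss(predicted_probs, tokens, positions), 3))
```

Note that unmasked positions contribute nothing to the loss, even though the model produces a distribution at every position.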

Attending from both directions means that BERT benefits from fully bidirectional context and better word representations. The loss function of MLM is simply cross-entropy applied only at the masked positions. In addition to MLM, the original BERT paper also introduced a next sentence prediction (NSP) objective in which the model is given two segments of text and must predict whether the second segment follows the first in the original text data. While its use has since been debated (and often removed in later models), it was initially introduced to help BERT model high-level relationships between sentences.

---

### T5/BART

Google Research released T5 (Text-to-Text Transfer Transformer), an encoder-decoder model, in 2019.

Up until this point, transformers required task-specific architectures: if we wanted to do classification, we might add a softmax head over a [CLS] word embedding, or if we wanted to translate, we'd use a sequence-to-sequence decoder trained on bilingual pairs. T5 proposed a generalization: reframing everything (translation, classification, Q&A, summarization) as a *text input --> text output* problem. This required training the encoder and decoder together in a single forward/backward pass, and a carefully aligned training strategy. Instead of next-word prediction, T5 trained with a _span corruption_ objective, which was similar to MLM with some key differences. Instead of masking individual words, span corruption masks contiguous sequences of words and replaces them with "sentinel words," which the model is trained to generate in order. For instance,

- Input: `The <X> sat on the <Y>`
- Output: `<X> cat <Y> mat`

This objective encourages the model not just to guess masked tokens but to reconstruct coherent, semantically meaningful spans. In a sense, it's applying the intuition behind BERT in a generative setting.

---

Around the same time, Facebook AI released BART (Bidirectional and Auto-Regressive Transformers), which also adopted the encoder-decoder transformer architecture with a distinct approach to training. BART employed a more diverse set of corruption techniques that included masking, sentence permutation, and word deletion and insertion. What made T5 and BART so portable and streamlined was how efficiently they could be fine-tuned. Each of their pre-training objectives created models that already possessed strong general language understanding and generation capabilities, requiring only small adjustments to excel at downstream tasks.
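A minimal sketch of the span-corruption objective described above, assuming `<X0>`-style sentinel names (the real T5 sentinels are dedicated vocabulary tokens learned with the model):

```python
def span_corrupt(tokens, spans):
    """Replace each (start, length) span with a sentinel and build the target.

    The encoder input keeps the surrounding text with sentinels in place of
    the dropped spans; the decoder target lists each sentinel followed by
    the tokens it hides, in order.
    """
    inp, tgt, cursor = [], [], 0
    for k, (start, length) in enumerate(spans):
        sentinel = f"<X{k}>"
        inp.extend(tokens[cursor:start])
        inp.append(sentinel)
        tgt.append(sentinel)
        tgt.extend(tokens[start:start + length])
        cursor = start + length
    inp.extend(tokens[cursor:])
    return inp, tgt

tokens = ["The", "cat", "sat", "on", "the", "mat"]
inp, tgt = span_corrupt(tokens, [(1, 1), (5, 1)])
print(inp)  # ['The', '<X0>', 'sat', 'on', 'the', '<X1>']
print(tgt)  # ['<X0>', 'cat', '<X1>', 'mat']
```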

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Raffel et al., 2019

---

### GPT

The Generative Pre-trained Transformer (GPT) was introduced by OpenAI in 2018 as a decoder-only transformer architecture-- no encoder, and no cross-attention. In many ways, GPTs depart from the "transforming" of text that the vanilla transformer was made for (translation, most notably) and double down on text generation through _autoregressive next-word prediction_, which is simply the process of making predictions and sequentially feeding them back through the model at every iteration, yielding a sequence that grows from left to right. For example, given "I" -> predict "like", then given "I like" -> predict "black", then given "I like black" -> predict "coffee", and so on.
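This left-to-right process can be sanity-checked numerically with a toy lookup table standing in for the model's conditional distributions (the probabilities below are made up for illustration):

```python
import math

# Hypothetical conditional model P(next | context): a lookup table standing
# in for what a GPT's softmax output would provide at each step.
cond_prob = {
    ():                     {"I": 0.5},
    ("I",):                 {"like": 0.4},
    ("I", "like"):          {"black": 0.2},
    ("I", "like", "black"): {"coffee": 0.7},
}

def sequence_log_prob(words):
    """log P(x_1..x_n) = sum_i log P(x_i | x_1..x_{i-1}) by the chain rule."""
    return sum(math.log(cond_prob[tuple(words[:i])][w])
               for i, w in enumerate(words))

lp = sequence_log_prob(["I", "like", "black", "coffee"])
print(math.isclose(math.exp(lp), 0.5 * 0.4 * 0.2 * 0.7))  # True
```

Working in log space is what models do in practice, since the product of many small probabilities underflows quickly.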

Recall the definition of language modelling developed by Claude Shannon in his _Prediction and Entropy of Printed English_ (1951) paper. Given a sequence of words $ (x_1, x_2, \dots, x_n) $, we want to maximize $ P(x_1, x_2, \dots, x_n) $. We know by the chain rule of probabilities that

$$ P(x_1, x_2, \dots, x_n) = P(x_1)P(x_2|x_1)P(x_3|x_1, x_2)\dots P(x_n|x_1, \dots, x_{n-1}) = \prod_{i=1}^{n}P(x_i | x_1, x_2, \dots, x_{i-1}).$$

Where there are probabilities, there are probability distributions, and this distribution is exactly the one GPTs are attempting to approximate.

Why did GPTs ditch encoders? Recall what encoders are used for in the first place: capturing contextual relationships between all words bidirectionally (previous and future words). GPTs don't want this, and they explicitly block access to future words as they generate in real time using _masked (causal) self-attention_ in each transformer block. Think of the original task of the transformer of translating text from one language to another, where we first need to understand (encode) what the original language is saying and then produce an output (decode) in another language. With GPTs, we are just generating text, and all we need is the context sequence we already have.

At the heart of the GPT architecture is the aforementioned masked self-attention mechanism, which is just like the vanilla self-attention we have seen previously but without forward connections-- each word can only attend to itself and to words before it.
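A minimal sketch of masked (causal) self-attention, written out with plain Python lists for clarity and toy 2-dimensional queries, keys, and values (real implementations vectorize this over matrices):

```python
import math

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask: position i may
    only attend to positions j <= i; future positions get weight 0."""
    n, d_k = len(Q), len(Q[0])
    outputs, all_weights = [], []
    for i in range(n):
        # Scores only over the allowed (non-future) positions 0..i --
        # equivalent to adding -inf to masked entries before the softmax.
        scores = [sum(Q[i][m] * K[j][m] for m in range(d_k)) / math.sqrt(d_k)
                  for j in range(i + 1)]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps] + [0.0] * (n - i - 1)
        all_weights.append(weights)
        outputs.append([sum(weights[j] * V[j][m] for j in range(n))
                        for m in range(len(V[0]))])
    return outputs, all_weights

# Four positions, e.g. "I like black coffee", with toy 2-d vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
out, weights = causal_attention(Q, K, V)
print(weights[0])  # [1.0, 0.0, 0.0, 0.0] -- the first word attends only to itself
```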

Consider the input "I like black coffee"; "black" only attends to "I", "like", and "black", not "coffee". To implement this no-looking-ahead mechanism, recall the attention formula $ \text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} \right) V $. If we add a causal mask $ M $, defined as

$$ M = \begin{bmatrix} 0 & -\infty & \cdots & -\infty \\ 0 & 0 & \cdots & -\infty \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}, $$

to $ \frac{QK^\top}{\sqrt{d_k}} $, we can effectively remove any bidirectionality and enforce a backwards-only context (remember that softmaxes flatten large negative numbers to 0). Besides the omission of the encoder and cross-attention, nothing is new about GPT architecture: word and positional embeddings added together, passed through several decoder layers (masked self-attention and feed-forward networks), through a linear classifier that projects embeddings into word space, and finally through a softmax that outputs a categorical distribution over the vocabulary. As soon as the next word is sampled, it is concatenated to the original input and the process repeats.

## Scaling Up: GPT-2 and GPT-3

While the original 2018 GPT demonstrated that decoder-only transformers were feasible, it was **GPT-2** (2019) that truly captured the AI community's attention.
**What made GPT-2 remarkable:**

- Over 10× the size of GPT-1, at 1.5 billion parameters
- Scaling revealed **emergent capabilities**: the model began performing tasks it was never explicitly trained for, without any fine-tuning
- Among these: **zero-shot** and **few-shot learning**, collectively known as **in-context learning** — the ability to adapt behavior from patterns in the prompt alone, whether given no examples or just a few

> *Example of an emergent zero-shot behavior:
> Prompting "Translate from English to French: 'The book is on the table.'" often yielded fluent French, although the model was never tuned with translation supervision on a translation dataset.*

---

**Scaling Laws**

While GPT-2's emergent behaviors were exciting, what convinced people to invest millions and billions of dollars were the scaling laws. Kaplan et al. (2020) showed empirical trends that **increasing parameters, training data, and compute improves performance smoothly and predictably**, and scale emerged as a primary driver of capability. With scaling laws, people can reasonably train smaller models and see how performance scales before investing massive amounts of time and money into training big models. There are two key relationships:

**1. Compute cost:**

$$ C = C_0 N D $$

Compute grows linearly with model size ($N$) and dataset size ($D$) — doubling both requires ~4× the compute.

**2. Power Law for Test Loss:**

$$ L = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + L_0 $$

Loss decreases with scale following an inverse power law, with diminishing returns. $L_0$ captures the irreducible error, while the constants $A$, $B$ and exponents $\alpha, \beta \approx 0.07$–$0.1$ describe scaling efficiency.
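The power law can be sketched as a function. The constants below are hypothetical placeholders chosen only to illustrate the shape of the curve, not the fitted values reported by Kaplan et al.:

```python
def predicted_loss(N, D, A=8.0, B=6.0, alpha=0.076, beta=0.095, L0=1.7):
    """Kaplan-style power law L(N, D) = A/N^alpha + B/D^beta + L0.
    All constants are illustrative assumptions, not published fits."""
    return A / N ** alpha + B / D ** beta + L0

# Scaling parameters and data pushes the loss toward the irreducible
# floor L0, with diminishing returns at every step.
for N, D in [(1e8, 1e9), (1e10, 1e11), (1e12, 1e13)]:
    print(f"N={N:.0e}, D={D:.0e} -> predicted loss {predicted_loss(N, D):.2f}")
```

This is exactly the practical appeal described above: fit the constants on cheap small-scale runs, then extrapolate before committing to an expensive large one.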

Scaling Laws for Neural Language Models, Kaplan et al., 2020

OpenAI put everything they had behind these findings and cranked up scale wildly. GPT-3, released in 2020, was over one hundred times larger than GPT-2 at 175 billion parameters and was trained on over 300 billion tokens of data (more on tokens later). Emergent behavior began to appear with greater regularity: GPT-3 could perform multi-step arithmetic, solve analogy problems, write code, and complete SAT-style reading comprehension questions with minimal prompting. The ability to interpret and act on a task description with minimal instruction (zero- or one-shot) gave the model a special kind of flexibility that allowed it to adapt to new problems with nothing more than the right prompt phrasing.

The emergent behaviors of GPT-2 and GPT-3 suggested an exciting possibility: that **pre-training** on large, naturally occurring text corpora could allow a model to **implicitly learn** a wide range of language tasks, without relying on manually curated, task-specific datasets. GPT-3 had not mastered each task individually, but it had been so relentlessly trained to predict the next token correctly across a vast range of contexts that it implicitly had to learn the abstract patterns of logic and instruction-following needed to solve different types of problems.

What made this scaling so powerful was that it rested on a remarkably simple and unified training objective. Rather than teaching a model each task separately, **next-token prediction** subsumes translation, summarization, question answering, and more under a single goal — and crucially, this objective scales: as model size and data grow, so too does the breadth and quality of what the model learns to do. Practically, it also allows the model to process an entire sequence in parallel with teacher forcing while still preserving the left-to-right structure required for generation. Moreover, because the "correct" next word is just the actual next word, the model doesn't need human-provided labels!
This is what makes GPTs _self-supervised_-- they learn entirely from raw text with the data as its own supervisor. This is a crucial driver of implementing LLMs at scale whose importance cannot be overstated. Human-supervised learning would require assigning "correct" next words for every possible sequence in the English language, a task that would be not just impractical but virtually impossible. Other LLMs like BERT and T5 are perfectly capable of self-supervision too by giving the data the driver's seat.
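The self-supervised setup amounts to building targets from the data itself -- a sketch:

```python
def next_token_pairs(tokens):
    """Self-supervision from raw text: inputs are the sequence, targets are
    the same sequence shifted left by one -- no human labels required."""
    return tokens[:-1], tokens[1:]

tokens = ["I", "like", "black", "coffee"]
inputs, targets = next_token_pairs(tokens)
print(inputs)   # ['I', 'like', 'black']
print(targets)  # ['like', 'black', 'coffee']
```

With teacher forcing, the model predicts every target position in one parallel forward pass, each conditioned on the true prefix rather than its own earlier guesses.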

## Pre-Training + Post-Training: How to Build a Large Language Model

The original GPT was trained in two stages: first, a large-scale unsupervised language modeling phase, referred to as _pre-training_, and a second supervised _post-training_ phase on specific downstream tasks like question answering.

### Pre-Training

In pre-training, LLMs are trained on a massive corpus of unstructured text data using the causal language modeling objective, which simply minimizes cross-entropy, the negative log-likelihood of the actual words across a dataset, where $ x_t $ is the next word:

$$ \mathcal{L}_{\text{pre-train}} = -\sum_{t=1}^{n} \log P(x_t \mid x_1, x_2, \dots, x_{t-1}) $$

How is this computed in practice during training? $ n $ is a hyperparameter called the _context window_ or _context length_, and it specifies how many words the model can attend to in a forward pass. Every piece of text that the LLM sees is broken up into these $ n $-sized chunks, and $ n $ is what specifies the maximum size of input to the model. Recall, however, that we compute the joint distribution $ P(x_1, x_2, \dots, x_n) $ by explicitly expanding it via the chain rule of probabilities-- it's important to remember that we are indeed also computing $ P(x_2|x_1), P(x_3|x_1, x_2), $ and so on. While the context window is the maximum length of a sequence that the model sees, it still calculates many other subsequences as it moves from left to right thanks to masked self-attention. Therefore, even though our context window might be 1,024 or 8,000 words, we know perfectly well how to deal with short inputs like "I like black coffee".

So far, we've been referring to these $ x $ vectors as words, but that's actually not how it's done in practice. Instead of operating at the level of full words, LLMs process the $ x $ vectors in sequence as _tokens_, which are subword units often derived using Byte Pair Encoding (BPE). Words are split into smaller, reusable fragments (e.g.
"un", "break", "able") that collectively cover language morphology. Although perhaps unintuitive to humans, _tokenization_ is hugely advantageous in providing extra granularity, reducing vocabulary size, and even handling rare or unseen words. Consider Oxford's 2022 Word of the Year: "goblinmode." For a model trained on, say, Wikipedia articles up to the year 2018, it's plausible to say that such a word might have never once appeared in the text data. Nevertheless, what tokenization essentially allows us to break down an input like "goblinmode" into "goblin" and "mode" tokens, which, when attended together might tell us about what "goblinmode" means.

The pre-training phase is the workhorse of building a large language model. By learning to predict the next word over billions of tokens, LLMs acquire a broad understanding of grammar, syntax, word meanings and relationships, facts, and long-range dependencies. The result of a converged pretrained model is the _base model_. While they might have the full knowledge of the dataset they were trained on baked into their weights, it's crucial to understand that *base models are not assistants nor chatbots.*

If you download the weights of a base model like LLaMA-2-7B and provide a text input like, "What's 10 / 5?" you might be surprised to find that you won't get an answer like, "2" or "10 / 5 is 2." Instead, you're more likely to get something along the lines of, "We know the answer to this question is 2, this represents a basic mathematical truth. Mathematical truths are an example of objective knowledge that are necessarily true, and it is indeed impossible to think of the answer being anything other than 2. Kant argued that mathematical truths are synthetic a priori, and that these types of truths are..." and so on. Base models are nothing more than a compression of their datasets, and if you feed them input, they will just keep on sampling from learned probability distributions ad infinitum, certainly with no particular orientation towards helping you solve a problem.

Consider the following scenario, taken directly from OpenAI's user-testing on GPT-3:

```python
# Prompt:
# What is the purpose of the list C in the code below?

def binomial_coefficient(n, r):
    C = [0 for i in range(r + 1)]
    C[0] = 1
    for i in range(1, n + 1):
        j = min(i, r)
        while j > 0:
            C[j] += C[j - 1]
            j -= 1
    return C[r]
```

Of course, the answer that the prompter wants here is an explanation detailing that $ C $ is storing binomial coefficients as we iterate through $ n $ and $ r $. Nevertheless, GPT-3's output was rather underwhelming:

```
A. to store the value of C[0]
B. to store the value of C[1]
C. to store the value of C[i]
D. to store the value of C[i - 1].
```

What happened here is that GPT-3 says to itself, "Ah, I've seen this exact question on the 2014 AP Computer Science A exam; I know what comes next!" and it proceeds to spit out what indeed came next, but not what we want from it. Base models this big can memorize very well, and actions need to be taken to move them away from this behavior.

You might have been surprised earlier to learn that GPT-3 was released in 2020 given that it was the model behind ChatGPT, which really caught on in late 2022 and early 2023. But the original GPT-3, powerful as it was, was still a base model, trained purely on predicting the next token in a sequence with no explicit goal beyond minimizing language modeling loss. While the model showed that it didn't need task-specific fine-tuning, the gap between usability and capability brought the need for _post-training_, which modified the model's behavior, turning a powerful but clumsy base model into a genuinely useful tool.

### Post-Training

Post-training is the set of techniques applied after pre-training to make a model useful, aligned, and safe for human interaction. Zero-shot and few-shot learning had largely done away with task-specific fine-tuning, but as we saw with the coding question scenario, there were still rough edges to attend to. The first fix was instruction tuning, popularized by **InstructGPT** — a 2022 variant of GPT-3 fine-tuned on just 12,000 instruction-response pairs. Simple as it sounds, this went a long way: the model stopped producing unhelpful completions like literal A-B-C-D outputs or responses that ignored the user's intent, and instead learned to interpret natural prompts and respond helpfully. But it was only a first step. Even trained on good examples, the model could still produce subtly wrong answers, or responses that were verbose, condescending, or simply not what the user wanted.
The trouble is that language quality is inherently hard to quantify — there's no objective metric for whether an explanation is clear or patronizing. Yet humans can usually agree when something falls flat: "Now this might be a bit hard for you to understand, but..." is pretty obviously condescending. And so, humans were brought directly into the loop with **reinforcement learning from human feedback (RLHF).** Annotators ranked multiple completions for the same prompt, which trained a reward model that nudged the LLM toward outputs more consistent with human taste, helpfulness, and factuality — this is exactly what's happening when you're asked to choose between two of Claude's responses. We'll cover reinforcement learning and RLHF in more detail later in the course.
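A common way to turn such rankings into a training signal for the reward model is a pairwise loss. This sketch assumes the Bradley-Terry-style formulation used in the InstructGPT line of work, with scalar rewards standing in for the reward model's outputs:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for RLHF reward models:
    -log sigmoid(r_chosen - r_rejected). Minimizing it pushes the reward
    model to score the human-preferred completion higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Scoring the preferred completion higher yields a smaller loss.
print(preference_loss(2.0, 0.5) < preference_loss(0.5, 2.0))  # True
```

The trained reward model then serves as the optimization target for the RL step that nudges the LLM itself.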

InstructGPT, Ouyang et al., 2022

RLHF laid the groundwork, and post-training continued to improve from there. One surprising finding was the importance of **prompt engineering**: how you prompt the model makes a meaningful difference in reasoning quality. Simply adding intermediate steps — like saying "Let's think step by step" or providing an example question and step-by-step solution — caused models to dramatically improve on arithmetic, logic, and common-sense reasoning. Known as **chain-of-thought prompting**, this works because it lets the model unfold its reasoning across multiple tokens, breaking the problem into smaller subparts rather than jumping straight to a potentially flawed conclusion.

Arithmetic remained clunky and error-prone until researchers made what in retrospect seems an obvious move: just let the model use a calculator. Instead of predicting what follows "77 + 33 =", those tokens are handed off to an API call. This strategy, dubbed **Toolformer**, was extended to search engines, translation APIs, and beyond — rather than trying to memorize everything, the model learns to delegate.

Yet for all these advances, **hallucinations** remain a stubborn problem. LLMs sometimes produce text that is fluent and confident but factually wrong — invented citations, misquoted laws, incorrect dates. This isn't a bug that slipped through; it's a fundamental consequence of how these models work. A model trained to predict likely sequences has no built-in mechanism for verifying truth. Post-training can reduce hallucinations through better RLHF reward models, retrieval-augmented generation, and tool use, but cannot eliminate them entirely — and that remains one of the field's open challenges.

The story doesn't end there. By 2025, the frontier had shifted from simply scaling pretraining to scaling reinforcement learning itself. Researchers began asking the same question they'd asked of pretraining compute: if we keep throwing more RL at a model, can we predict how much better it gets?
A paper from Meta, ScaleRL (arXiv:2510.13786), proposed the first empirical answer — fitting curves over early RL training runs to predict final performance, much like the power-law scaling laws we saw for pretraining. The intuition is straightforward: RL training curves tend to spike quickly early on and then saturate slowly, and it turns out you can forecast that saturation from just the first quarter of training compute.

![image](https://hackmd.io/_uploads/HkD3OWq_-e.png)

_The Art of Scaling Reinforcement Learning Compute for LLMs_ (Khatri et al., 2025)

RL scaling laws serve a different purpose than pretraining ones. Pretraining scaling laws tell you how to configure your one big, expensive run from the start. RL scaling laws are more surgical — they help you squeeze out the last few percentage points of performance from an already-trained base model, and decide which RL algorithm and setup is worth committing to at scale. In that sense, scaling RL is still more art than science, but the tools to make it a science are starting to materialize.