AI
Large Language Models
Predicting the next token — at civilisation scale
Overview
A Large Language Model (LLM) is a neural network, almost always a Transformer, trained on hundreds of billions of tokens of text to predict the next token in a sequence. This deceptively simple objective produces models capable of reasoning, coding, translation, summarisation, and conversation. Scale is the defining factor: more parameters, more data, and more compute reliably improve performance, and some capabilities emerge only beyond certain scales.
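The next-token objective described above can be made concrete with a toy example: given the model's scores (logits) over a tiny hypothetical vocabulary, softmax turns them into a probability distribution, and training minimises the cross-entropy of the true next token. This is a minimal sketch, not any particular model's implementation; the vocabulary and logit values are invented for illustration.

```python
import numpy as np

# Hypothetical 4-token vocabulary and model scores for the next token.
vocab = ["the", "cat", "sat", "mat"]
logits = np.array([1.0, 3.0, 0.5, 0.2])

# Softmax turns raw scores into a probability distribution
# (subtracting the max first for numerical stability).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Training pushes down the cross-entropy loss of the actual next token.
next_token = vocab.index("cat")
loss = -np.log(probs[next_token])

print({v: round(float(p), 3) for v, p in zip(vocab, probs)})
print(f"loss = {loss:.3f}")
```

At real scale the same computation runs over a vocabulary of roughly 100 000 tokens, once per position in the sequence.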
Key Concepts
- Pre-training: the model predicts the next token across a massive web-scale corpus, learning grammar, facts, and reasoning patterns
- Tokenisation: text is split into subword units (BPE, SentencePiece); GPT-4 uses ~100 000-token vocabulary
- Autoregressive generation: the model generates output one token at a time, sampling from a probability distribution
- Temperature and top-p sampling: control randomness at decode time; low temperature makes output near-deterministic, high temperature makes it more varied
- Instruction tuning (SFT): fine-tuning on curated prompt–response pairs teaches the model to follow instructions
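The sampling concepts in the list above can be sketched in a few lines: temperature rescales the logits before softmax, and nucleus (top-p) sampling keeps only the smallest set of tokens whose cumulative probability reaches `top_p` before renormalising. The logit values here are invented, and real decoders apply these steps to a vocabulary of ~100 000 tokens rather than four.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(logits, temperature=1.0, top_p=1.0):
    # Temperature: lower values sharpen the distribution, higher flatten it.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest high-probability set of tokens
    # whose cumulative mass reaches top_p, then renormalise over it.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, top_p)) + 1]
    nucleus = np.zeros_like(probs)
    nucleus[keep] = probs[keep]
    nucleus /= nucleus.sum()
    return int(rng.choice(len(probs), p=nucleus))

toy_logits = [2.0, 1.0, 0.2, -1.0]
token = sample(toy_logits, temperature=0.7, top_p=0.9)
```

With a very low temperature and a small `top_p`, the sampler collapses to greedy decoding (always the highest-scoring token); at temperature 1.0 and `top_p` 1.0, it samples from the model's raw distribution.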
Key Facts
- Scaling laws (Hoffmann et al., "Chinchilla", 2022) show optimal training requires ~20 tokens per parameter
- GPT-4 has an estimated 1.8 trillion parameters across a mixture-of-experts architecture
- Emergent abilities—capabilities that appear suddenly at scale—include multi-step arithmetic, chain-of-thought reasoning, and code generation
- Llama 3 (Meta, 2024) demonstrated that open-weight models can match proprietary models on many benchmarks
- Energy cost: training GPT-3 consumed an estimated 1 287 MWh—roughly the annual electricity use of 120 US homes
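The Chinchilla rule of thumb in the facts above lends itself to a quick back-of-envelope calculation. The sketch below assumes the simple ~20 tokens-per-parameter ratio; the paper itself fits a more detailed loss model, so treat this as a rough guide.

```python
# Approximate compute-optimal ratio from Hoffmann et al. (2022).
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Rough compute-optimal training-token count for a given model size."""
    return TOKENS_PER_PARAM * n_params

# A 70-billion-parameter model would want roughly 1.4 trillion tokens:
tokens = chinchilla_optimal_tokens(70e9)
print(f"{tokens / 1e12:.1f} trillion tokens")  # 1.4 trillion tokens
```

By this estimate, many earlier large models (trained on a few hundred billion tokens) were substantially undertrained for their parameter count, which is the paper's central finding.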