AI
Large Language Models
Predicting the next token — at civilisation scale
Overview
A Large Language Model (LLM) is a neural network, almost always a Transformer, trained on hundreds of billions of tokens of text to predict the next token in a sequence. This deceptively simple objective produces models capable of reasoning, coding, translation, summarisation, and conversation. Scale is the defining factor: more parameters, more data, and more compute reliably improve performance, and some capabilities emerge only beyond certain scales.
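The next-token objective described above can be made concrete with a toy example: given the model's scores (logits) over a tiny hypothetical vocabulary, softmax turns them into a probability distribution, and training minimises the cross-entropy of the true next token. This is a minimal sketch, not any particular model's implementation; the vocabulary and logit values are invented for illustration.

```python
import numpy as np

# Hypothetical 4-token vocabulary and model scores for the next token.
vocab = ["the", "cat", "sat", "mat"]
logits = np.array([1.0, 3.0, 0.5, 0.2])

# Softmax turns raw scores into a probability distribution
# (subtracting the max first for numerical stability).
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Training pushes down the cross-entropy loss of the actual next token.
next_token = vocab.index("cat")
loss = -np.log(probs[next_token])

print({v: round(float(p), 3) for v, p in zip(vocab, probs)})
print(f"loss = {loss:.3f}")
```

At real scale the same computation runs over a vocabulary of roughly 100 000 tokens, once per position in the sequence.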
Key Concepts
- Pre-training: the model predicts the next token across a massive web-scale corpus, learning grammar, facts, and reasoning patterns
- Tokenisation: text is split into subword units (BPE, SentencePiece); GPT-4 uses ~100 000-token vocabulary
- Autoregressive generation: the model generates output one token at a time, sampling from a probability distribution
- Temperature and top-p sampling: control randomness at decode time; low temperature makes output near-deterministic, high temperature makes it more varied
- Instruction tuning (SFT): fine-tuning on curated prompt–response pairs teaches the model to follow instructions
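The sampling concepts in the list above can be sketched in a few lines: temperature rescales the logits before softmax, and nucleus (top-p) sampling keeps only the smallest set of tokens whose cumulative probability reaches `top_p` before renormalising. The logit values here are invented, and real decoders apply these steps to a vocabulary of ~100 000 tokens rather than four.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(logits, temperature=1.0, top_p=1.0):
    # Temperature: lower values sharpen the distribution, higher flatten it.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest high-probability set of tokens
    # whose cumulative mass reaches top_p, then renormalise over it.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, top_p)) + 1]
    nucleus = np.zeros_like(probs)
    nucleus[keep] = probs[keep]
    nucleus /= nucleus.sum()
    return int(rng.choice(len(probs), p=nucleus))

toy_logits = [2.0, 1.0, 0.2, -1.0]
token = sample(toy_logits, temperature=0.7, top_p=0.9)
```

With a very low temperature and a small `top_p`, the sampler collapses to greedy decoding (always the highest-scoring token); at temperature 1.0 and `top_p` 1.0, it samples from the model's raw distribution.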
Key Facts
- Scaling laws (Hoffmann et al., "Chinchilla", 2022) show optimal training requires ~20 tokens per parameter
- GPT-4 has an estimated 1.8 trillion parameters across a mixture-of-experts architecture
- Emergent abilities—capabilities that appear suddenly at scale—include multi-step arithmetic, chain-of-thought reasoning, and code generation
- Llama 3 (Meta, 2024) demonstrated that open-weight models can match proprietary models on many benchmarks
- Energy cost: training GPT-3 consumed an estimated 1 287 MWh—roughly the annual electricity use of 120 US homes
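The Chinchilla rule of thumb in the facts above lends itself to a quick back-of-envelope calculation. The sketch below assumes the simple ~20 tokens-per-parameter ratio; the paper itself fits a more detailed loss model, so treat this as a rough guide.

```python
# Approximate compute-optimal ratio from Hoffmann et al. (2022).
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Rough compute-optimal training-token count for a given model size."""
    return TOKENS_PER_PARAM * n_params

# A 70-billion-parameter model would want roughly 1.4 trillion tokens:
tokens = chinchilla_optimal_tokens(70e9)
print(f"{tokens / 1e12:.1f} trillion tokens")  # 1.4 trillion tokens
```

By this estimate, many earlier large models (trained on a few hundred billion tokens) were substantially undertrained for their parameter count, which is the paper's central finding.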