Context Window

The working memory of a language model

tokens · attention · memory

Overview

The context window is the total amount of text, measured in tokens, that a language model can see and reason over at one time. Everything outside it is invisible to the model. It functions as the model's working memory: conversation history, documents, tool outputs, and the current question must all fit within this limit. The original GPT-3 had a 2 048-token window; modern models such as Claude 3.7 support up to 200 000 tokens (~150 000 words).
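Because everything must fit in the window, a common first step is a rough budget check. A minimal sketch, using the ~4 characters per token heuristic for English prose described below; the limit constant and function names are illustrative, not any real tokeniser's API:

```python
# Rough token budgeting with the ~4 characters per token heuristic.
# CONTEXT_LIMIT is an illustrative assumption (a 200k-token window).

CONTEXT_LIMIT = 200_000

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def fits_in_context(text: str, limit: int = CONTEXT_LIMIT) -> bool:
    """True if the text likely fits in the model's context window."""
    return estimate_tokens(text) <= limit

prompt = "The context window is the model's working memory. " * 1_000
print(estimate_tokens(prompt), fits_in_context(prompt))
```

Real tokenisers vary by model and are far less efficient on code, numbers, and non-Latin scripts, so a heuristic like this should only ever be used as a coarse pre-check before calling the model's actual token counter.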

Key Concepts

  • Tokenisation: text is split into subword pieces (~0.75 words per token on average for English)
  • Positional encoding: each token receives information about its position in the sequence
  • Self-attention over the full context: every token attends to every other token—this is what makes larger windows expensive (O(n²) memory and compute)
  • Context stuffing: documents, retrieved chunks (RAG), tool results, and system prompts are concatenated into the context
  • Lost-in-the-middle problem: models perform worse on information placed in the middle of very long contexts than at the start or end
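The "every token attends to every other token" point above can be made concrete with a toy single-head self-attention in pure Python; the n×n score matrix is exactly where the O(n²) memory cost comes from. This is a pedagogical sketch, not a production implementation (real models use batched matrix kernels, multiple heads, and learned projections):

```python
# Minimal single-head scaled dot-product self-attention.
# The n x n `scores` matrix is why long contexts are expensive:
# its size grows quadratically with the number of tokens n.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Q, K, V: lists of n vectors of dimension d. Returns n output vectors."""
    n, d = len(Q), len(Q[0])
    scale = math.sqrt(d)
    # n x n score matrix: every token attends to every other token.
    scores = [[sum(q * k for q, k in zip(Q[i], K[j])) / scale
               for j in range(n)] for i in range(n)]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of the value vectors.
    return [[sum(weights[i][j] * V[j][t] for j in range(n)) for t in range(d)]
            for i in range(n)]

# Toy example: 3 tokens, dimension 2.
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(Q, K, V)
```

With n = 3 the score matrix has 9 entries; with a 200 000-token context it would have 40 billion, which is why long-context models rely on heavily optimised (and often approximate) attention kernels.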

Key Facts

  • One token ≈ 4 characters ≈ 0.75 words in English; code, numbers, and non-Latin scripts can be far less efficient
  • GPT-3 (2020): 2 048 tokens → GPT-4 Turbo (2023): 128 000 tokens → Claude 3.7 (2025): 200 000 tokens
  • Gemini 1.5 Pro demonstrated a 1-million-token context window, fitting entire codebases or feature films
  • Context caching (Anthropic, Google) stores processed context server-side for reuse, cutting costs on repeated prompts by up to 90%
  • The quadratic scaling of attention means doubling the context roughly quadruples the compute required
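The quadratic-scaling fact above is plain arithmetic: the attention score matrix has n² entries, so doubling n multiplies the count by four. A small back-of-the-envelope sketch (the token counts are taken from the facts above; the function name is illustrative):

```python
# Back-of-the-envelope attention cost: the score matrix has n * n entries,
# so doubling the context length quadruples memory and compute.

def attention_pairs(n_tokens: int) -> int:
    """Number of token-to-token score entries in full self-attention."""
    return n_tokens * n_tokens

for n in (2_048, 4_096, 128_000, 200_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):,} score entries")
```

Going from GPT-3's 2 048-token window to a 200 000-token window multiplies the number of score entries by roughly 9 500, which is why long-context serving depends on techniques like context caching rather than brute force alone.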