Context Window

The working memory of a language model

tokens · attention · memory

Overview

The context window is the total amount of text, measured in tokens, that a language model can see and reason over at one time. Everything outside it is invisible to the model. It functions as the model's working memory: conversation history, documents, tool outputs, and the current question must all fit within this limit. The original GPT-3 had a 2 048-token window; modern models such as Claude 3.7 support up to 200 000 tokens (~150 000 words).
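Because everything must fit in the window, a common first step is a rough budget check. A minimal sketch, using the ~4 characters per token heuristic for English prose described below; the limit constant and function names are illustrative, not any real tokeniser's API:

```python
# Rough token budgeting with the ~4 characters per token heuristic.
# CONTEXT_LIMIT is an illustrative assumption (a 200k-token window).

CONTEXT_LIMIT = 200_000

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token for English prose."""
    return max(1, len(text) // 4)

def fits_in_context(text: str, limit: int = CONTEXT_LIMIT) -> bool:
    """True if the text likely fits in the model's context window."""
    return estimate_tokens(text) <= limit

prompt = "The context window is the model's working memory. " * 1_000
print(estimate_tokens(prompt), fits_in_context(prompt))
```

Real tokenisers vary by model and are far less efficient on code, numbers, and non-Latin scripts, so a heuristic like this should only ever be used as a coarse pre-check before calling the model's actual token counter.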

Key Concepts

  • Tokenisation: text is split into subword pieces (~0.75 words per token on average for English)
  • Positional encoding: each token receives information about its position in the sequence
  • Self-attention over the full context: every token attends to every other token—this is what makes larger windows expensive (O(n²) memory and compute)
  • Context stuffing: documents, retrieved chunks (RAG), tool results, and system prompts are concatenated into the context
  • Lost-in-the-middle problem: models perform worse on information placed in the middle of very long contexts than at the start or end
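The "every token attends to every other token" point above can be made concrete with a toy single-head self-attention in pure Python; the n×n score matrix is exactly where the O(n²) memory cost comes from. This is a pedagogical sketch, not a production implementation (real models use batched matrix kernels, multiple heads, and learned projections):

```python
# Minimal single-head scaled dot-product self-attention.
# The n x n `scores` matrix is why long contexts are expensive:
# its size grows quadratically with the number of tokens n.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(Q, K, V):
    """Q, K, V: lists of n vectors of dimension d. Returns n output vectors."""
    n, d = len(Q), len(Q[0])
    scale = math.sqrt(d)
    # n x n score matrix: every token attends to every other token.
    scores = [[sum(q * k for q, k in zip(Q[i], K[j])) / scale
               for j in range(n)] for i in range(n)]
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of the value vectors.
    return [[sum(weights[i][j] * V[j][t] for j in range(n)) for t in range(d)]
            for i in range(n)]

# Toy example: 3 tokens, dimension 2.
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(Q, K, V)
```

With n = 3 the score matrix has 9 entries; with a 200 000-token context it would have 40 billion, which is why long-context models rely on heavily optimised (and often approximate) attention kernels.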

Key Facts

  • One token ≈ 4 characters ≈ 0.75 words in English; code, numbers, and non-Latin scripts can be far less efficient
  • GPT-3 (2020): 2 048 tokens → GPT-4 Turbo (2023): 128 000 tokens → Claude 3.7 (2025): 200 000 tokens
  • Gemini 1.5 Pro demonstrated a 1-million-token context window, fitting entire codebases or feature films
  • Context caching (Anthropic, Google) stores processed context server-side for reuse, cutting costs on repeated prompts by up to 90%
  • The quadratic scaling of attention means doubling the context roughly quadruples the compute required
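The quadratic-scaling fact above is plain arithmetic: the attention score matrix has n² entries, so doubling n multiplies the count by four. A small back-of-the-envelope sketch (the token counts are taken from the facts above; the function name is illustrative):

```python
# Back-of-the-envelope attention cost: the score matrix has n * n entries,
# so doubling the context length quadruples memory and compute.

def attention_pairs(n_tokens: int) -> int:
    """Number of token-to-token score entries in full self-attention."""
    return n_tokens * n_tokens

for n in (2_048, 4_096, 128_000, 200_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):,} score entries")
```

Going from GPT-3's 2 048-token window to a 200 000-token window multiplies the number of score entries by roughly 9 500, which is why long-context serving depends on techniques like context caching rather than brute force alone.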