Transformers
The architecture that changed AI
Overview
The Transformer architecture, introduced in "Attention is All You Need" (Vaswani et al., 2017), replaced recurrent networks with a fully attention-based model. Because self-attention computes relationships between all tokens simultaneously rather than step by step, training parallelises across the whole sequence, letting Transformers scale effectively with data and compute and enabling models like GPT, BERT, and Claude.
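The core operation can be seen in a few lines of numpy. This is a minimal single-head sketch, not the full architecture: the projection matrices here are random illustrative parameters, and real models learn them during training.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # (n, n) score matrix: every token attends to every other token at once
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # softmax over the key axis turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted mix of value vectors, one per token

rng = np.random.default_rng(0)
n, d = 4, 8                                   # toy sizes for illustration
X = rng.normal(size=(n, d))                   # stand-in token embeddings
W = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(X, *W)
print(out.shape)  # one attended vector per input token
```

Note that nothing in the loop-free computation depends on token order; that is what makes the parallelism possible, and also why positional information must be added separately.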
Key Concepts
- Self-attention: computes pairwise token relationships in parallel
- Positional encoding: injects sequence order into token embeddings
- Multi-head attention: attends to multiple information subspaces simultaneously
- Feed-forward layers: non-linear transformation of attended representations
- Layer normalisation and residual connections: stabilise training of deep stacks
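Positional encoding, the second concept above, can be illustrated with the sinusoidal scheme from the original paper: even embedding dimensions use sine, odd dimensions cosine, at geometrically spaced frequencies. A minimal numpy sketch (function name and toy sizes are this sketch's own, not from the paper):

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Sinusoidal positional encodings for n positions in d dimensions."""
    pos = np.arange(n)[:, None]           # (n, 1) token positions
    i = np.arange(d // 2)[None, :]        # (1, d/2) frequency indices
    angles = pos / (10000 ** (2 * i / d)) # wavelengths from 2*pi up to 10000*2*pi
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)          # even dimensions
    pe[:, 1::2] = np.cos(angles)          # odd dimensions
    return pe

pe = sinusoidal_positions(6, 8)
print(pe.shape)
```

These encodings are simply added to the token embeddings before the first attention layer, injecting order into an otherwise order-blind computation.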
Key Facts
- The original paper proposed the architecture for machine translation
- BERT, GPT, T5, and Claude all derive from the Transformer
- Self-attention's compute and memory cost scale quadratically with sequence length (O(n²))
- Vision Transformers (ViT) extended the architecture to image patches
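The quadratic-cost fact follows directly from the n × n score matrix: doubling the sequence length quadruples the number of pairwise scores. A trivial sketch (the function is illustrative, not a profiling tool):

```python
def attention_score_entries(n: int) -> int:
    """Number of pairwise scores self-attention computes for a length-n sequence."""
    return n * n

# Each doubling of n multiplies the score-matrix size by four.
for n in (128, 256, 512):
    print(n, attention_score_entries(n))
```

This scaling is why long-context variants (sparse, linear, or windowed attention) are an active research area.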