
Transformers

The architecture that changed AI

Tags: architecture, attention, llm

Overview

The Transformer architecture, introduced in "Attention is All You Need" (Vaswani et al., 2017), replaced recurrent networks with a fully attention-based model. Because self-attention computes relationships between all token pairs in parallel rather than sequentially, Transformers parallelise well on modern hardware and scale effectively with data and compute, enabling models such as GPT, BERT, and Claude.
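The core of self-attention is the scaled dot-product: each token is compared against every other token, and the resulting weights mix the value vectors. A minimal NumPy sketch, with illustrative dimensions and with queries, keys, and values taken directly from the embeddings (real Transformers project them through learned weight matrices first):

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over token embeddings X (n x d).

    Simplified: Q = K = V = X. A real layer would compute Q, K, V with
    learned projection matrices before this step.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # pairwise similarities (n x n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ X                               # each token: weighted mix of all tokens

X = np.random.default_rng(0).normal(size=(4, 8))     # 4 tokens, 8-dim embeddings
out = self_attention(X)
print(out.shape)  # (4, 8): same shape, but every token now carries context from all others
```

Because every row of scores depends only on X, all positions are computed at once, which is exactly the parallelism recurrent networks lack.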

Key Concepts

  • Self-attention: computes pairwise token relationships in parallel
  • Positional encoding: injects sequence order into token embeddings
  • Multi-head attention: attends to multiple information subspaces simultaneously
  • Feed-forward layers: non-linear transformation of attended representations
  • Layer normalisation and residual connections: stabilise training of deep stacks
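The positional-encoding bullet above can be illustrated with the sinusoidal scheme from the original paper, where even dimensions get sines and odd dimensions cosines at geometrically spaced frequencies (the sequence length and model width here are arbitrary, and d_model is assumed even):

```python
import numpy as np

def positional_encoding(n_tokens, d_model):
    """Sinusoidal positional encodings (Vaswani et al., 2017):
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Assumes d_model is even.
    """
    pos = np.arange(n_tokens)[:, None]               # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]            # (1, d/2) even dimension indices
    angle = pos / np.power(10000.0, i / d_model)     # (n, d/2)
    pe = np.zeros((n_tokens, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16); added element-wise to the token embeddings
```

Since attention itself is order-agnostic, these encodings are what let the model distinguish "dog bites man" from "man bites dog".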

Key Facts

  • The original paper proposed the architecture for machine translation
  • BERT, GPT, T5, and Claude all derive from the Transformer
  • Self-attention's compute and memory cost scales quadratically with sequence length (O(n²))
  • Vision Transformers (ViT) extended the architecture to image patches