Transformers
The architecture that changed AI
Overview
The Transformer architecture, introduced in "Attention is All You Need" (Vaswani et al., 2017), replaced recurrent networks with a fully attention-based model. Because self-attention computes relationships between all tokens simultaneously rather than step by step, training parallelises across the whole sequence, letting Transformers scale effectively with data and compute and enabling models like GPT, BERT, and Claude.
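The core operation can be seen in a few lines of numpy. This is a minimal single-head sketch, not the full architecture: the projection matrices here are random illustrative parameters, and real models learn them during training.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # (n, n) score matrix: every token attends to every other token at once
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # softmax over the key axis turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # weighted mix of value vectors, one per token

rng = np.random.default_rng(0)
n, d = 4, 8                                   # toy sizes for illustration
X = rng.normal(size=(n, d))                   # stand-in token embeddings
W = [rng.normal(size=(d, d)) for _ in range(3)]
out = self_attention(X, *W)
print(out.shape)  # one attended vector per input token
```

Note that nothing in the loop-free computation depends on token order; that is what makes the parallelism possible, and also why positional information must be added separately.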
Key Concepts
- Self-attention: computes pairwise token relationships in parallel
- Positional encoding: injects sequence order into token embeddings
- Multi-head attention: attends to multiple information subspaces simultaneously
- Feed-forward layers: non-linear transformation of attended representations
- Layer normalisation and residual connections: stabilise training of deep stacks
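Positional encoding, the second concept above, can be illustrated with the sinusoidal scheme from the original paper: even embedding dimensions use sine, odd dimensions cosine, at geometrically spaced frequencies. A minimal numpy sketch (function name and toy sizes are this sketch's own, not from the paper):

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Sinusoidal positional encodings for n positions in d dimensions."""
    pos = np.arange(n)[:, None]           # (n, 1) token positions
    i = np.arange(d // 2)[None, :]        # (1, d/2) frequency indices
    angles = pos / (10000 ** (2 * i / d)) # wavelengths from 2*pi up to 10000*2*pi
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)          # even dimensions
    pe[:, 1::2] = np.cos(angles)          # odd dimensions
    return pe

pe = sinusoidal_positions(6, 8)
print(pe.shape)
```

These encodings are simply added to the token embeddings before the first attention layer, injecting order into an otherwise order-blind computation.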
Key Facts
- The original paper proposed the architecture for machine translation
- BERT, GPT, T5, and Claude all derive from the Transformer
- Self-attention's compute and memory cost scale quadratically with sequence length (O(n²))
- Vision Transformers (ViT) extended the architecture to image patches
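The quadratic-cost fact follows directly from the n × n score matrix: doubling the sequence length quadruples the number of pairwise scores. A trivial sketch (the function is illustrative, not a profiling tool):

```python
def attention_score_entries(n: int) -> int:
    """Number of pairwise scores self-attention computes for a length-n sequence."""
    return n * n

# Each doubling of n multiplies the score-matrix size by four.
for n in (128, 256, 512):
    print(n, attention_score_entries(n))
```

This scaling is why long-context variants (sparse, linear, or windowed attention) are an active research area.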