RLHF

Aligning AI to human values through feedback

alignment · reinforcement-learning · reward-model

Overview

Reinforcement Learning from Human Feedback (RLHF) is the technique that transformed raw language models into the helpful, harmless assistants we use today. Human raters compare model outputs and select preferred responses; these preferences train a reward model that learns to score any given output. The language model is then optimised with reinforcement learning, typically Proximal Policy Optimisation (PPO), to maximise that reward, steering it toward responses humans find helpful, honest, and safe.
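The reward model in this pipeline is usually trained with a pairwise (Bradley–Terry style) loss over human preference comparisons. A minimal sketch, with an illustrative `preference_loss` function (the name and the toy scores are assumptions, not from a specific library):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss on one comparison: the loss shrinks as
    the reward model scores the human-chosen response above the rejected one.
    Equivalent to -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Only the margin between the two scores matters, not their absolute values:
# a wider gap in favour of the chosen response gives a smaller loss.
assert preference_loss(2.0, 0.0) < preference_loss(1.0, 0.0)
```

In a real implementation the scalar scores would come from a neural reward model evaluated on full responses, and the loss would be minimised by gradient descent over a large preference dataset.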

Key Concepts

  • Step 1 — SFT: fine-tune the base model on high-quality demonstrations of desired behaviour
  • Step 2 — Reward model training: human annotators rank multiple model outputs; a separate model learns to predict human preference scores
  • Step 3 — RL optimisation (PPO): the language model generates outputs, the reward model scores them, and PPO updates the LM weights to maximise reward
  • KL penalty: a regularisation term that prevents the RL-optimised model from drifting too far from the SFT model (mitigating reward hacking and degenerate outputs)
  • Constitutional AI (Anthropic): replaces human preference data with AI self-critique using a written set of principles ("constitution")
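The reward actually maximised in Step 3 combines the reward model's score with the KL penalty from the bullet above. A minimal sketch, assuming summed log-probabilities are available from both models (the function name and the default `beta` are illustrative):

```python
def penalised_reward(rm_score: float,
                     logp_policy: float,
                     logp_sft: float,
                     beta: float = 0.1) -> float:
    """Reward signal fed to PPO: the reward model's score minus a
    KL penalty keeping the policy close to the SFT model.
    logp_policy / logp_sft are the log-probabilities each model
    assigns to the sampled response; their difference is a
    sample-based estimate of the KL divergence."""
    kl_estimate = logp_policy - logp_sft
    return rm_score - beta * kl_estimate

# If the policy has not drifted from the SFT model, there is no penalty.
assert penalised_reward(1.0, -5.0, -5.0) == 1.0
```

The coefficient `beta` trades off reward maximisation against staying faithful to the SFT distribution: too small and the model games the reward model, too large and RL barely changes behaviour.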

Key Facts

  • RLHF was the key technique behind InstructGPT (OpenAI, 2022), which became the foundation of ChatGPT
  • Reward hacking is a failure mode: the model learns to game the reward model rather than genuinely improve
  • Direct Preference Optimisation (DPO, 2023) achieves RLHF-like results without a separate reward model or the unstable RL training loop
  • Human annotation is the bottleneck: labs contract large annotation workforces, and 2023 reporting documented OpenAI's use of outsourced workers in Kenya for labelling work
  • RLHF significantly reduces harmful outputs, but does not eliminate them—models can still be jailbroken
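The DPO result mentioned above works by treating the log-probability ratio between the policy and a frozen reference model as an implicit reward, so the preference loss can be applied to the policy directly. A minimal sketch of the per-pair loss (the function name, toy log-probabilities, and `beta` value are illustrative assumptions):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on one preference pair. pi_* and ref_* are the
    log-probabilities the policy and the frozen reference model assign
    to the chosen/rejected responses. The implicit reward of a response
    is beta * (policy log-prob - reference log-prob); the loss is the
    pairwise preference loss on that implicit reward, with no explicit
    reward model and no RL loop."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Raising the policy's probability of the chosen response (relative to
# the reference) lowers the loss.
assert dpo_loss(-0.5, -2.0, -1.0, -2.0) < dpo_loss(-1.0, -2.0, -1.0, -2.0)
```

Because the objective is a simple supervised loss over preference pairs, DPO sidesteps the sampling, reward-model inference, and PPO machinery of the full RLHF pipeline.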