RLHF

Aligning AI to human values through feedback

alignment · reinforcement-learning · reward-model

Overview

Reinforcement Learning from Human Feedback (RLHF) is the technique that transformed raw language models into the helpful, harmless assistants we use today. Human raters compare model outputs and select preferred responses; these preferences train a reward model that learns to score any given output. The language model is then optimised with reinforcement learning, typically Proximal Policy Optimisation (PPO), to maximise that reward, steering it toward responses humans find helpful, honest, and safe.
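The reward model in this pipeline is usually trained with a pairwise (Bradley–Terry style) loss over human preference comparisons. A minimal sketch, with an illustrative `preference_loss` function (the name and the toy scores are assumptions, not from a specific library):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss on one comparison: the loss shrinks as
    the reward model scores the human-chosen response above the rejected one.
    Equivalent to -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Only the margin between the two scores matters, not their absolute values:
# a wider gap in favour of the chosen response gives a smaller loss.
assert preference_loss(2.0, 0.0) < preference_loss(1.0, 0.0)
```

In a real implementation the scalar scores would come from a neural reward model evaluated on full responses, and the loss would be minimised by gradient descent over a large preference dataset.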

Key Concepts

  • Step 1 — SFT: fine-tune the base model on high-quality demonstrations of desired behaviour
  • Step 2 — Reward model training: human annotators rank multiple model outputs; a separate model learns to predict human preference scores
  • Step 3 — RL optimisation (PPO): the language model generates outputs, the reward model scores them, and PPO updates the LM weights to maximise reward
  • KL penalty: a regularisation term that prevents the RL-optimised model from drifting too far from the SFT model (mitigating reward hacking and degenerate outputs)
  • Constitutional AI (Anthropic): replaces human preference data with AI self-critique using a written set of principles ("constitution")
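The reward actually maximised in Step 3 combines the reward model's score with the KL penalty from the bullet above. A minimal sketch, assuming summed log-probabilities are available from both models (the function name and the default `beta` are illustrative):

```python
def penalised_reward(rm_score: float,
                     logp_policy: float,
                     logp_sft: float,
                     beta: float = 0.1) -> float:
    """Reward signal fed to PPO: the reward model's score minus a
    KL penalty keeping the policy close to the SFT model.
    logp_policy / logp_sft are the log-probabilities each model
    assigns to the sampled response; their difference is a
    sample-based estimate of the KL divergence."""
    kl_estimate = logp_policy - logp_sft
    return rm_score - beta * kl_estimate

# If the policy has not drifted from the SFT model, there is no penalty.
assert penalised_reward(1.0, -5.0, -5.0) == 1.0
```

The coefficient `beta` trades off reward maximisation against staying faithful to the SFT distribution: too small and the model games the reward model, too large and RL barely changes behaviour.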

Key Facts

  • RLHF was the key technique behind InstructGPT (OpenAI, 2022), which became the foundation of ChatGPT
  • Reward hacking is a failure mode: the model learns to game the reward model rather than genuinely improve
  • Direct Preference Optimisation (DPO, 2023) achieves RLHF-like results without a separate reward model or the unstable RL training loop
  • Human annotation is the bottleneck: labs contract large annotation workforces, and 2023 reporting documented OpenAI's use of outsourced workers in Kenya for labelling work
  • RLHF significantly reduces harmful outputs, but does not eliminate them—models can still be jailbroken
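The DPO result mentioned above works by treating the log-probability ratio between the policy and a frozen reference model as an implicit reward, so the preference loss can be applied to the policy directly. A minimal sketch of the per-pair loss (the function name, toy log-probabilities, and `beta` value are illustrative assumptions):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on one preference pair. pi_* and ref_* are the
    log-probabilities the policy and the frozen reference model assign
    to the chosen/rejected responses. The implicit reward of a response
    is beta * (policy log-prob - reference log-prob); the loss is the
    pairwise preference loss on that implicit reward, with no explicit
    reward model and no RL loop."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Raising the policy's probability of the chosen response (relative to
# the reference) lowers the loss.
assert dpo_loss(-0.5, -2.0, -1.0, -2.0) < dpo_loss(-1.0, -2.0, -1.0, -2.0)
```

Because the objective is a simple supervised loss over preference pairs, DPO sidesteps the sampling, reward-model inference, and PPO machinery of the full RLHF pipeline.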