Reinforcement Learning from Human Feedback (RLHF) — AI Glossary

A language model fresh off pre-training is a next-token predictor. It can write fluently, but it doesn’t know how to follow instructions, refuses little, and has no particular preference for being helpful or truthful over just sounding plausible. RLHF is the process that changes this — aligning a raw model to human values and preferences by optimising it against signals of what humans actually want.

Every major instruction-following model — GPT-4, Claude, Gemini, LLaMA chat variants — uses RLHF or a close descendant of it. It is, more than any other technique, why modern AI assistants feel like assistants.

The Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT)

The base pre-trained model is fine-tuned on a curated dataset of prompt-response pairs written by human annotators. These examples demonstrate the desired behaviour: following instructions, being helpful, acknowledging uncertainty, refusing harmful requests. This stage produces an “SFT model” that begins to behave assistant-like, but whose quality ceiling is limited by how many high-quality demonstrations can be practically written.

Stage 2: Reward Model Training

Human annotators are shown the same prompt with multiple model responses (typically 4–9 completions) and asked to rank them by quality — helpfulness, accuracy, harmlessness, and tone. This is easier and faster than writing ideal responses from scratch, and generates preference data at scale.

These human preference rankings are used to train a separate reward model (RM) — a neural network that takes a prompt and a response and outputs a scalar score predicting how much a human would prefer that response. The reward model learns to approximate human judgement so it can be queried cheaply and at scale during the next stage.
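One common way to turn rankings into a training signal is a Bradley-Terry style pairwise loss: each ranking is expanded into (preferred, rejected) pairs, and the reward model is penalised when it scores the rejected response higher. The sketch below is illustrative only. The function names are hypothetical, and the scores are hand-written stand-ins for what a real reward model would output.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first element of each pair
    # is always the one the annotator ranked higher.
    return list(combinations(ranked_responses, 2))

def pairwise_loss(score_preferred, score_rejected):
    """Bradley-Terry loss: -log(sigmoid(score_preferred - score_rejected))."""
    margin = score_preferred - score_rejected
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

pairs = ranking_to_pairs(["A", "B", "C", "D"])  # A ranked best, D worst
loss_agree = pairwise_loss(2.0, 0.5)     # reward model agrees with the human
loss_disagree = pairwise_loss(0.5, 2.0)  # reward model disagrees
```

A ranking of four responses yields six training pairs, which is part of why ranking is a data-efficient way to collect preferences. The loss is small when the reward model agrees with the annotator and grows as it disagrees.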

Stage 3: RL Optimisation (PPO)

The SFT model is now treated as a policy to be optimised. Using the Proximal Policy Optimization (PPO) algorithm, the model generates responses, the reward model scores them, and the policy is updated to produce responses that score higher. A KL-divergence penalty keeps the RL-optimised model from straying too far from the SFT model, which limits (but does not eliminate) reward hacking and degenerate outputs.

This loop runs for many iterations across a diverse set of prompts, gradually steering the model toward responses humans consistently prefer.
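The KL penalty in Stage 3 is often implemented by subtracting a scaled estimate of the divergence from the reward model's score. The following is a minimal sketch of that reward shaping only, not a full PPO loop. The function name, the `beta` value, and the log-probabilities are illustrative assumptions.

```python
def kl_shaped_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.1):
    """Reward model score minus beta times a sample-based KL estimate.

    policy_logprobs and sft_logprobs are the log-probabilities each model
    assigns to the tokens of the same sampled response.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    # Summed log-ratio over the sampled tokens: a single-sample estimate of
    # KL(policy || SFT) for this response.
    kl_estimate = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl_estimate

# Same reward model score, but the second policy has drifted from the SFT model.
reward_same = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.0, -2.0])
reward_drifted = kl_shaped_reward(1.0, [-0.1, -0.2], [-1.0, -2.0])
```

With identical reward model scores, the response produced by the drifted policy earns less total reward, which is the pressure that keeps the policy near the SFT model.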

Why RLHF Works

Pre-training optimises next-token prediction on internet text. Internet text mixes helpful and unhelpful content, with no signal about which is which. RLHF injects a direct signal: human preference. The model learns that certain patterns of response (clear, accurate, appropriately cautious, structured) earn higher scores than others (verbose, evasive, false, harmful). Over thousands of optimisation steps, these preferences shape the model’s behaviour at the distributional level.

Crucially, the reward model enables scalable supervision: once trained, it can evaluate millions of model outputs without requiring a human to read each one, making RL feasible at the scale of large language models.

Reward Hacking and Its Mitigations

A known failure mode of RLHF is reward hacking: the model discovers patterns that fool the reward model without actually being high-quality. Common examples include:

Producing very long responses (reward models trained on length-quality correlations can overweight length)
Using excessive hedging or sycophantic phrasing
Generating confident-sounding but incorrect answers

Mitigations include:

KL penalty — penalise large deviations from the SFT model
Diverse annotators — reduce reward model bias by using annotators with varied backgrounds and perspectives
Iterative reward model retraining — periodically retrain the RM on new outputs to close discovered exploits

Key Variants and Successors

RLAIF (Reinforcement Learning from AI Feedback)

Instead of human annotators ranking responses, a capable AI model (typically a larger or more capable version of the model being trained) generates the preference rankings. This dramatically reduces annotation cost and enables much larger datasets. Anthropic’s Constitutional AI (CAI) is a prominent example: the model critiques and revises its own outputs against a set of written principles, and AI-generated preference labels based on those principles are used to train a reward model.

DPO (Direct Preference Optimisation)

DPO eliminates the separate reward model. It directly optimises the policy model on preference pairs using a contrastive loss, in which preferred responses are reinforced and rejected responses are down-weighted, and it trains like ordinary supervised fine-tuning with no sampling loop. It still compares the policy against a frozen reference model, usually the SFT model. DPO is simpler to implement and more stable than PPO, with competitive quality. As of 2025, DPO and its variants (IPO, KTO, ORPO) have largely displaced PPO in many open-source fine-tuning pipelines.
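As a rough illustration, the DPO loss for a single preference pair can be computed from four sequence log-probabilities: the policy's and a frozen reference model's log-probability of the chosen and of the rejected response. This is a sketch under those assumptions. The function name, argument names, and `beta` value are hypothetical, and a real implementation would operate on batched tensors.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = pi_chosen - ref_chosen
    rejected_logratio = pi_rejected - ref_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)     # policy equals reference
loss_after_step = dpo_loss(-4.0, -8.0, -5.0, -7.0)  # chosen up, rejected down
```

When the policy equals the reference, the loss is log 2. It falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference.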

PPO vs. DPO in Practice

Aspect	PPO (classic RLHF)	DPO
Requires reward model	Yes	No
Training stability	Lower (more hyperparameters)	Higher
Implementation complexity	High	Low
Quality ceiling	Slightly higher	Slightly lower
Used by	OpenAI, DeepMind	Many open-source pipelines

RLHF’s Limitations

Annotator subjectivity: Human preferences are inconsistent, culturally biased, and vary by annotator. The reward model inherits these biases.
Misalignment between proxies and values: Human annotators tend to prefer confident, fluent, detailed responses — which correlates imperfectly with accuracy and honesty.
Sycophancy: RLHF models learn to tell users what they want to hear, because that tends to earn higher preference scores. This is an active research problem.
Scalable oversight: As models become more capable than the humans evaluating them, it becomes harder to provide reliable preference signal — humans can’t always tell which of two expert-level responses is correct.

Impact

RLHF is the technique that made large language models commercially viable as assistants. The 2022 InstructGPT paper (InstructGPT being the precursor to ChatGPT) demonstrated that a 1.3B parameter model fine-tuned with RLHF was preferred by human evaluators over a 175B GPT-3 model without it. That result suggested that post-training can matter more than raw scale for how useful a model is to people. It validated RLHF as a central investment in post-training and helped trigger the current generation of assistant-focused AI development.

2025–2026: The Post-Training Stack Has Diversified

Classic PPO-based RLHF is no longer the only or even dominant approach. The field has converged on a modular post-training stack where different techniques handle different alignment objectives:

RLVR (Reinforcement Learning with Verifiable Rewards) has become the primary method for training reasoning models. Rather than relying on human preference judgements, RLVR uses automatically verifiable reward signals: a math answer is either correct or not, and code either passes its tests or fails. DeepSeek-R1 was trained heavily with rewards of this kind, and OpenAI’s o-series is widely understood to rely on similar large-scale RL. For tasks where ground truth is computable, RLVR sidesteps annotator subjectivity.
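A verifiable reward can be as simple as a function that checks a final answer against a known reference. The sketch below is a toy example, assuming the model's final answer has already been extracted as a string. The function name is hypothetical, and real pipelines use more robust answer parsing.

```python
from fractions import Fraction

def math_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the answer equals the reference as an exact rational."""
    try:
        # Fraction parses integers, decimals, and "a/b" strings exactly,
        # so "0.5" and "1/2" compare equal.
        is_correct = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Unparseable answers earn no reward.
        return 0.0
    return 1.0 if is_correct else 0.0

reward_right = math_reward("0.5", "1/2")
reward_wrong = math_reward("0.6", "1/2")
reward_unparseable = math_reward("roughly a half", "1/2")
```

No human or learned model is involved in the scoring, so the reward cannot be swayed by fluent but wrong answers.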

GRPO (Group Relative Policy Optimisation) was introduced by DeepSeek and has gained rapid adoption. Instead of training a separate critic (value) network to estimate advantages, GRPO samples a group of responses for each prompt and scores each one relative to the rest of the group, so the model learns to prefer its own better responses. The rewards themselves can come from a learned reward model or, as in RLVR, from an automatic verifier. GRPO is simpler and cheaper than PPO because there is no critic network to train, while reportedly matching or exceeding PPO quality on reasoning tasks. DeepSeek-R1 and several open-source reasoning models use GRPO as their primary RL method.
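The group-relative step can be sketched as normalising each response's reward by the mean and standard deviation of its group. This is an illustrative sketch of the advantage computation only, not the full GRPO objective. The function name is hypothetical, and the use of the population standard deviation and the `eps` term are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each response relative to the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat_group = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct responses get positive advantages and incorrect ones negative. A group whose rewards are all equal produces all-zero advantages and therefore no learning signal.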

DAPO (Decoupled Clip and Dynamic Sampling Policy Optimisation) extends GRPO with changes aimed at training stability. It decouples the lower and upper clip ranges on the policy probability ratio, allowing a higher upper bound so that low-probability tokens can still gain probability, and it dynamically filters out prompts where all samples score identically and therefore provide no gradient signal. Released by ByteDance in early 2025, DAPO is reported to outperform GRPO on AIME math benchmarks.
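In the DAPO paper, the decoupled clip takes the form of separate lower and upper bounds on the policy probability ratio. The sketch below illustrates that idea and the dynamic sampling filter in isolation. The function names and the `eps_low` and `eps_high` defaults are illustrative assumptions, and the per-token ratio and advantage are passed in as plain numbers.

```python
def keep_prompt(group_rewards):
    """Dynamic sampling: keep a prompt only if its samples disagree in reward."""
    # If every sample has the same reward, all group-relative advantages
    # are zero and the prompt contributes no gradient.
    return len(set(group_rewards)) > 1

def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with separate lower and upper clip bounds."""
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    # Taking the minimum makes the objective pessimistic, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)

all_correct = keep_prompt([1.0, 1.0, 1.0, 1.0])  # filtered out
mixed = keep_prompt([1.0, 0.0, 1.0, 0.0])        # kept
upside = clipped_objective(1.5, 1.0)             # capped at 1 + eps_high
downside = clipped_objective(0.5, -1.0)          # capped at 1 - eps_low
```

Setting `eps_high` larger than `eps_low` leaves more room for the policy to raise the probability of tokens it currently considers unlikely.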

SimPO and KTO are preference optimisation variants that further simplify DPO — SimPO removes the reference model dependency; KTO uses a prospect theory-inspired loss that can train on unpaired feedback (individual responses rated good/bad rather than compared pairs), making data collection substantially cheaper.
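To make the "no reference model" point concrete, the SimPO loss for one preference pair uses only the policy's own length-normalised log-probabilities plus a target margin. This is a minimal sketch. The function name and the `beta` and `gamma` values are illustrative assumptions, and the inputs are summed sequence log-probabilities with token counts.

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for one pair; no reference model log-probs are needed."""
    # Implicit reward: average per-token log-probability, scaled by beta.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    # The chosen reward must beat the rejected one by at least gamma.
    logits = reward_chosen - reward_rejected - gamma
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

loss_at_margin = simpo_loss(-10.0, 10, -15.0, 10)    # reward gap equals gamma
loss_clear_win = simpo_loss(-10.0, 10, -30.0, 10)    # chosen far more likely
```

Normalising by length stops the loss from favouring a response simply because it is shorter, since a shorter sequence has a higher total log-probability.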

The current stack in practice (2026):

SFT — instruction following on high-quality demonstrations
Preference optimisation — DPO, SimPO, or KTO for helpfulness and harmlessness alignment
RL with verifiable or AI-generated rewards — GRPO/DAPO/RLVR for reasoning capability, or RLAIF for general quality at scale

Anthropic has publicly described using Constitutional AI, a form of RLAIF, when training its Claude models: a capable model generates preference rankings that would be impractical to collect from humans at the required volume.
