
Lesson 7 · 10 min

RLHF, DPO, and "alignment" — briefly

Why "instruct" models exist, and why you probably shouldn't do RLHF yourself.

Two phases of training a useful LLM

Phase 1 — Pretraining — predict the next token over a web-scale text corpus. This produces a model that's brilliant at continuing text but bad at following instructions: it looks like autocomplete, not an assistant.
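A minimal sketch of that pretraining objective, next-token prediction with cross-entropy loss. It uses Hugging Face transformers with gpt2 purely as a stand-in; any causal LM works the same way.

```python
# Minimal sketch of the pretraining objective: next-token prediction.
# "gpt2" is just a small stand-in model for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tok(["The cat sat on the"], return_tensors="pt")
# Passing labels=input_ids makes the model compute the shifted
# next-token cross-entropy loss internally.
out = model(**batch, labels=batch["input_ids"])
print(out.loss)  # average negative log-likelihood per predicted token
```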

Phase 2 — Alignment — teach the model to follow instructions, be helpful, and refuse harmful requests. Two main techniques:

  • SFT (Supervised Fine-Tuning) — show good (prompt, response) pairs. What we covered in ft-04.
  • RLHF (Reinforcement Learning from Human Feedback) — train a reward model on human preference labels ("A is better than B"), then use RL (typically PPO) to optimize the LLM to score well against that reward model (see the reward-model loss sketch after this list).
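The reward-model half of RLHF is simpler than it sounds. Here is a minimal sketch of the preference (Bradley-Terry) loss, assuming a hypothetical reward_model that maps a tokenized (prompt + response) sequence to one scalar score; all names are placeholders.

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    # Score each (prompt + response) sequence with a single scalar.
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected): small when the human-preferred
    # response gets the higher score, large when it doesn't.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The hard part of RLHF is not this loss; it's the PPO loop that follows (rollouts, a KL penalty against the SFT model, a value head), which is most of the reason you probably shouldn't run it yourself.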

DPO (Direct Preference Optimization) sidesteps the explicit reward model and the RL loop entirely: it optimizes the LLM directly on preference pairs, anchored to a frozen reference model (usually the SFT checkpoint). Simpler, more stable, and increasingly the default for open-source alignment.
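A minimal sketch of the DPO loss, assuming you already have per-sequence log-probabilities of the chosen and rejected responses under the policy being trained and under the frozen reference model; all names here are placeholders.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # How much more the policy prefers chosen over rejected,
    # measured relative to the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * margin): widens the gap between chosen and
    # rejected without a separate reward model or RL rollouts.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

In practice you rarely hand-roll this (libraries such as TRL ship a DPOTrainer), but the fact that the whole objective fits in a few lines is a big part of why DPO has become the open-source default.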