Lesson 6 · 11 min
Preference data and RLHF datasets
Fine-tuning with DPO or RLHF requires preference data — (prompt, chosen, rejected) triplets. Generating synthetic preference data at scale is different from generating instruction data.
What preference data looks like
Direct Preference Optimization (DPO) and RLHF both require datasets of the form:
{ prompt, chosen_response, rejected_response }

The model learns to prefer chosen_response over rejected_response given prompt. The difficulty: generating meaningful preference pairs requires producing responses at two distinct quality levels, where one is genuinely better than the other.
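For instance, a single record might look like the sketch below. The prompt and responses are invented purely for illustration; note that the rejected response is not gibberish, just measurably worse (here, a common misconception).

```python
preference_record = {
    "prompt": "Explain why the sky is blue in two sentences.",
    "chosen_response": (
        "Sunlight scatters off air molecules, and shorter blue wavelengths "
        "scatter much more strongly than longer red ones (Rayleigh scattering). "
        "That scattered blue light reaches your eyes from every direction, "
        "so the sky looks blue."
    ),
    "rejected_response": "The sky is blue because it reflects the ocean.",
}
```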
For DPO, the typical synthetic approach:
- Generate a good response with a capable model (e.g. Claude Sonnet) → chosen
- Generate a deliberately flawed response with a constrained prompt → rejected
- Verify the ordering is correct (the chosen response is actually better); see the sketch below
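A minimal sketch of that loop in Python, assuming a generic call_model(system_prompt, user_prompt) helper that wraps whatever inference API you use; the system prompts, judge wording, and function names are illustrative assumptions, not part of the lesson.

```python
from typing import Callable, Optional

# Placeholder: wire this to your provider's client (assumption, not a real library call).
ModelFn = Callable[[str, str], str]  # (system_prompt, user_prompt) -> response text

CHOSEN_SYSTEM = "Answer the question accurately, completely, and concisely."
REJECTED_SYSTEM = (
    "Answer the question, but be vague, skip key details, "
    "and do not structure your answer."  # constrained prompt that degrades quality
)
JUDGE_SYSTEM = (
    "You compare two answers to the same prompt. "
    "Reply with exactly 'A' or 'B' for whichever answer is better."
)


def make_preference_pair(prompt: str, call_model: ModelFn) -> Optional[dict]:
    """Generate one (prompt, chosen, rejected) triplet, or None if verification fails."""
    chosen = call_model(CHOSEN_SYSTEM, prompt)      # strong, unconstrained response
    rejected = call_model(REJECTED_SYSTEM, prompt)  # deliberately flawed response

    # Verify the intended ordering: ask a judge which answer is better.
    verdict = call_model(
        JUDGE_SYSTEM,
        f"Prompt:\n{prompt}\n\nAnswer A:\n{chosen}\n\nAnswer B:\n{rejected}",
    ).strip().upper()

    if not verdict.startswith("A"):
        return None  # chosen did not win the comparison; discard the pair
    return {
        "prompt": prompt,
        "chosen_response": chosen,
        "rejected_response": rejected,
    }
```

One caveat with a sketch like this: the chosen response always sits in position A, and judges tend to have position bias, so in practice you would randomize the A/B order (and ideally use a different model as judge) before trusting the verification step.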