Lesson 6 · 11 min

Preference data and RLHF datasets

Fine-tuning with DPO or RLHF requires preference data — (prompt, chosen, rejected) triplets. Generating synthetic preference data at scale is different from generating instruction data.

What preference data looks like

Direct Preference Optimization (DPO) and RLHF both require datasets of the form:

{ prompt, chosen_response, rejected_response }

The model learns to prefer chosen over rejected given prompt. The difficulty: generating meaningful preference pairs requires creating responses at different quality levels — one genuinely better than the other.
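Concretely, a single training example might look like the record below. This is a hypothetical example: the content is illustrative, and the exact field names depend on the training library (TRL's DPOTrainer, for instance, expects prompt, chosen, and rejected columns).

```python
# One hypothetical preference record (illustrative content only).
example = {
    "prompt": "Explain why the sky is blue in one paragraph.",
    "chosen": (
        "The sky looks blue because air molecules scatter shorter "
        "wavelengths of sunlight (blue) more strongly than longer ones "
        "(red), a process known as Rayleigh scattering..."
    ),
    "rejected": "Because of the atmosphere.",
}
```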

For DPO, the typical synthetic approach looks like this (a code sketch follows the list):

  1. Generate a good response with a capable model (e.g. Claude Sonnet) → chosen
  2. Generate a deliberately flawed response with a constrained prompt → rejected
  3. Verify the ordering is correct (the chosen response is actually better)
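Here is a minimal sketch of those three steps, assuming the Anthropic Python SDK and an API key in the environment. The model ID, the constrained system prompt used to produce the rejected response, and the LLM-judge verification step are all illustrative choices rather than the only way to build the pipeline.

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"  # placeholder model ID

def ask(system: str, prompt: str) -> str:
    """Single-turn completion with a given system prompt."""
    msg = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def make_preference_pair(prompt: str) -> dict | None:
    # Step 1: strong, unconstrained generation -> candidate "chosen"
    chosen = ask("You are a careful, thorough assistant.", prompt)

    # Step 2: deliberately constrained generation -> candidate "rejected"
    rejected = ask(
        "Answer in at most two short sentences. Do not explain your "
        "reasoning, give caveats, or provide examples.",
        prompt,
    )

    # Step 3: verify the ordering with an LLM judge; discard pairs where
    # the "rejected" response is not clearly worse.
    verdict = ask(
        "You are a strict judge. Reply with exactly A or B.",
        f"Question:\n{prompt}\n\nResponse A:\n{chosen}\n\n"
        f"Response B:\n{rejected}\n\nWhich response is better? Reply A or B.",
    )
    if not verdict.strip().startswith("A"):
        return None  # ordering failed verification; drop this pair

    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Dropping pairs that fail the verification step trades dataset size for label quality, which is usually the right trade: a preference pair with the wrong ordering actively teaches the model the opposite of what you want.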