Lesson 3 · 12 min
Building a training dataset
Bad data destroys good models. Good data is half the work — and most of where you should spend time.
The 90/10 rule
90% of fine-tune outcomes come from the dataset, not the hyperparameters. Beginners spend their time tuning learning rate; experts spend their time curating data.
A fine-tune dataset for supervised fine-tuning (SFT) is a list of (input, expected_output) pairs. Common formats:
{"messages": [
{"role": "system", "content": "You are a tax assistant."},
{"role": "user", "content": "Can I deduct my home office?"},
{"role": "assistant", "content": "Yes, if it is exclusively used for business..."}
]}