Lesson 6 · 11 min

Evaluating a fine-tuned model

Train loss going down means *something* is happening. Whether it's the right thing is a separate question.

Three layers of evaluation

1. Loss / perplexity (cheap, dumb)

Track both train and validation loss. Validation loss flattening or rising while train loss keeps falling means you're overfitting. This layer is necessary but not sufficient: low loss doesn't guarantee good outputs.
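As a concrete illustration, here's a minimal sketch of this layer. The per-epoch losses are made-up numbers standing in for whatever your training loop logs; perplexity is just the exponential of the cross-entropy loss.

```python
import math

# Hypothetical per-epoch losses logged by a training loop.
train_loss = [2.10, 1.62, 1.31, 1.08, 0.91]
val_loss   = [2.15, 1.74, 1.55, 1.52, 1.58]

for epoch, (tr, va) in enumerate(zip(train_loss, val_loss), start=1):
    gap = va - tr
    # Perplexity = exp(cross-entropy loss).
    print(f"epoch {epoch}: train ppl {math.exp(tr):.2f}, "
          f"val ppl {math.exp(va):.2f}, loss gap {gap:.2f}")

# Validation loss bottoming out (epoch 4 here) while train loss keeps
# falling is the classic overfitting signature: stop or regularize there.
```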

2. Task-specific automated metrics

For classification: accuracy, F1. For structured extraction: exact match on JSON keys and values. For format adherence: parse rate (the fraction of outputs that are valid JSON). For style: regex checks or an LLM-as-judge scored against a rubric.
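Two of these are cheap enough to sketch in a few lines. Here's one possible implementation of parse rate and exact match for a JSON-extraction task; the function names and sample data are illustrative, not a standard API.

```python
import json

def parse_rate(outputs):
    """Fraction of model outputs that are valid JSON."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

def key_exact_match(outputs, references):
    """Fraction of outputs that parse to a dict exactly matching the
    reference (same keys, same values). Unparseable outputs count as misses."""
    hits = 0
    for text, ref in zip(outputs, references):
        try:
            pred = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(pred, dict) and pred == ref:
            hits += 1
    return hits / len(references)

# Toy usage: the second output chats instead of emitting JSON.
outs = ['{"name": "Ada", "year": 1843}', 'Sure! Here is the JSON: ...']
refs = [{"name": "Ada", "year": 1843}, {"name": "Alan", "year": 1936}]
print(parse_rate(outs), key_exact_match(outs, refs))  # 0.5 0.5
```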

3. Side-by-side qualitative review

Generate outputs from the base model and the fine-tuned model on the same prompts. Read 50 of them side by side, by hand. No automated metric replaces eyeballs for catching subtle regressions in tone, helpfulness, or refusal patterns.
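One way to set this up, sketched with the Hugging Face transformers API: generate greedily from both checkpoints and dump a plain-text file to read. The model names, the eval_prompts.txt file, and the output path are all placeholders; substitute your own.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: swap in your actual base and fine-tuned models.
BASE = "meta-llama/Llama-3.1-8B"
TUNED = "./checkpoints/my-finetune"

def generate_all(model_name, prompts, max_new_tokens=256):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
    outputs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
        # Decode only the newly generated tokens, not the prompt.
        outputs.append(tok.decode(out[0][ids["input_ids"].shape[1]:],
                                  skip_special_tokens=True))
    return outputs

# Hypothetical prompt file: one eval prompt per line, capped at 50.
prompts = [line.strip() for line in open("eval_prompts.txt")][:50]
base_outs = generate_all(BASE, prompts)
tuned_outs = generate_all(TUNED, prompts)

# Write a side-by-side file to read with your own eyes.
with open("side_by_side.txt", "w") as f:
    for p, b, t in zip(prompts, base_outs, tuned_outs):
        f.write(f"PROMPT:\n{p}\n\nBASE:\n{b}\n\nFINE-TUNED:\n{t}\n{'=' * 60}\n")
```

Greedy decoding (do_sample=False) keeps the comparison deterministic, so differences you see come from the weights, not sampling noise.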