Lesson 3 · 10 min

Quality filtering: removing bad examples before they corrupt training

A training set is only as good as its worst examples. Three filtering passes — rule-based, model-based, and human spot-check — catch different categories of noise before they corrupt the fine-tuning run.
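
To make that structure concrete, here is a minimal sketch of how the three passes might chain. Nothing here is a prescribed API: `rule_check`, `model_check`, and `review_queue` are hypothetical placeholders, and the one real design point is ordering, cheapest checks first.

```python
import random

def filter_dataset(examples, rule_check, model_check, review_queue, rate=0.02):
    """Chain the three passes. All names here are illustrative stand-ins."""
    kept = [ex for ex in examples if rule_check(ex)]   # pass 1: cheap structural rules
    kept = [ex for ex in kept if model_check(ex)]      # pass 2: a model re-verifies labels
    # Pass 3: humans review a small random sample rather than every example;
    # what they find feeds back into tightening passes 1 and 2.
    if kept:
        review_queue.extend(random.sample(kept, k=max(1, int(len(kept) * rate))))
    return kept
```

Running the rule pass first is the usual choice because it is nearly free relative to an LLM call: anything it rejects never reaches the model-based check.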

Why filtering matters more for synthetic data

With real labeled data, a noisy label is usually a human error: random, uncorrelated, and largely averaged out by the rest of the set. With synthetic data, noisy labels can be systematic — the LLM makes the same mistake repeatedly across similar examples, so the errors reinforce each other instead of canceling out. One systematic error pattern repeated across 1,000 examples teaches the model the wrong behavior 1,000 times.

Three categories of synthetic data noise (the sketch after this list shows rule-based checks for the first and third):

Format failures — the output doesn't match the expected structure (wrong JSON schema, markdown instead of plain text, extra preamble)

Label errors — the LLM labeled the example incorrectly (a negative sentiment example labeled positive, a wrong category)

Vacuous responses — technically valid output that carries no training signal ("Great question! Here is my response:..." with no actual content)
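
As an illustration of what the rule-based pass can catch, here is a sketch covering the first and third categories. Label errors usually need the model-based pass, since a rule cannot tell a wrong label from a right one. The schema, label set, filler phrases, and length threshold below are all assumptions made for the example, not fixed values.

```python
import json

REQUIRED_KEYS = {"text", "label"}          # assumed output schema
VALID_LABELS = {"positive", "negative"}    # assumed label set
FILLER_OPENERS = ("great question", "here is my response", "sure, i can help")

def rule_based_filter(raw_output: str) -> dict | None:
    """Return the parsed example if it passes all rule checks, else None."""
    # Format failure: output must parse as JSON with exactly the expected keys.
    try:
        ex = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(ex, dict) or set(ex) != REQUIRED_KEYS:
        return None

    # Format failure: label must come from the known label set.
    if ex["label"] not in VALID_LABELS:
        return None

    # Vacuous response: reject non-string, near-empty, or pure-filler text.
    text = ex["text"]
    if not isinstance(text, str):
        return None
    text = text.strip()
    if len(text) < 20:  # assumed minimum length; tune per task
        return None
    if text.lower().startswith(FILLER_OPENERS):
        return None

    return ex
```

Because these checks are pure string and schema logic, they can run over the entire generated set in seconds, which is why the rule-based pass sits first in the pipeline.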