Lesson 7 · 9 min

Evaluating for bias and disparate quality

Aggregate accuracy can stay flat while a feature works 95% of the time for one user segment and 68% of the time for another. Here are the patterns to detect, and how to remediate them.

The disparate-quality failure mode

A support-ticket classifier reaches 90% accuracy on your eval set, and the ship date approaches. Then you notice the eval set is 80% English. The 20% of tickets in other languages (Spanish, Vietnamese, Arabic) sit at 68% accuracy, while English tickets are at 95%. Aggregate: 90% (0.8 × 95% + 0.2 × 68% = 89.6%). By segment: 95% English, 68% non-English. That is a real customer-experience disparity the aggregate metric hid.
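The arithmetic above is worth making mechanical: score every eval example, then group by segment before averaging. A minimal sketch, assuming each eval example is a (segment, correct) pair (a hypothetical shape; adapt to your own eval records):

```python
from collections import defaultdict

def accuracy_by_segment(examples):
    """Return (aggregate_accuracy, per_segment_accuracy).

    `examples` is an iterable of (segment_label, was_correct) pairs,
    e.g. ("english", True) -- an illustrative shape, not a standard API.
    """
    totals = defaultdict(int)   # examples seen per segment
    hits = defaultdict(int)     # correct predictions per segment
    for segment, correct in examples:
        totals[segment] += 1
        hits[segment] += int(correct)
    aggregate = sum(hits.values()) / sum(totals.values())
    by_segment = {seg: hits[seg] / totals[seg] for seg in totals}
    return aggregate, by_segment

# Reproduce the lesson's numbers: 800 English tickets at 95%,
# 200 non-English tickets at 68%.
examples = (
    [("english", True)] * 760 + [("english", False)] * 40
    + [("non_english", True)] * 136 + [("non_english", False)] * 64
)
aggregate, by_segment = accuracy_by_segment(examples)
# aggregate -> 0.896; by_segment -> {"english": 0.95, "non_english": 0.68}
```

The point of the grouped pass is that the 68% figure can never be averaged away before a human sees it.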

The same pattern appears across age groups, technical-vs-novice phrasing, dialect, and any other axis your data has.
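Because the axis varies, it helps to turn the check into a reusable gate over any per-segment score table: flag a segment when it falls below an absolute floor or lags too far behind the best segment. The thresholds below are illustrative assumptions, not standard values:

```python
def segment_quality_gate(by_segment, floor=0.80, max_gap=0.10):
    """Return the segments that fail the quality bar.

    A segment fails if its accuracy is below `floor`, or more than
    `max_gap` below the best-performing segment. Both thresholds are
    hypothetical defaults; tune them per feature.
    """
    best = max(by_segment.values())
    return {
        seg: acc
        for seg, acc in by_segment.items()
        if acc < floor or best - acc > max_gap
    }

# The lesson's language split fails on both criteria:
flagged = segment_quality_gate({"english": 0.95, "non_english": 0.68})
# flagged -> {"non_english": 0.68}
```

Running the same gate over language, age group, phrasing style, or dialect turns "check for disparate quality" from a reminder into a step the eval pipeline can fail on.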