
Lesson 4 · 10 min

Augmenting rare classes and edge cases

Real data is skewed — common cases dominate and rare failures are underrepresented. Targeted synthetic augmentation fills the gaps that matter most for production reliability.

The long tail problem

In most real-world classification or extraction tasks, the data distribution is long-tailed and often roughly follows a power law. As a rule of thumb:

  • About 80% of examples fall into about 20% of categories
  • The rare cases (the remaining 20% of examples, spread across 80% of categories) are exactly the ones that tend to fail in production
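
To check how closely a dataset matches this rule of thumb, one option is to measure what share of examples the most frequent categories account for. The sketch below is illustrative only: the function name `head_share`, the `head_fraction` parameter, and the sample labels are all assumptions, not part of the lesson.

```python
from collections import Counter
import math

def head_share(labels, head_fraction=0.2):
    """Share of examples that belong to the most frequent `head_fraction` of categories."""
    if not labels:
        raise ValueError("labels must not be empty")
    # Category sizes, largest first.
    counts = sorted(Counter(labels).values(), reverse=True)
    # Number of categories treated as the "head" (at least one).
    n_head = max(1, math.ceil(len(counts) * head_fraction))
    return sum(counts[:n_head]) / sum(counts)

# Hypothetical intent labels: one dominant class and four rare ones.
labels = ["billing"] * 80 + ["refund"] * 8 + ["login"] * 6 + ["fraud"] * 4 + ["legal"] * 2
share = head_share(labels)
print(f"Top 20% of categories cover {share:.0%} of examples")
```

A value near 0.8 suggests the 80/20 pattern described above; a value near `head_fraction` itself suggests a roughly balanced dataset.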

A model trained on this distribution learns the common cases well and fails on the rare but important ones. Synthetic augmentation lets you deliberately rebalance the distribution by adding examples where real data is thin.
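
One simple way to make "balancing" concrete is to pick a minimum number of examples per class and compute how many synthetic examples each class would need to reach it. This is a minimal sketch under that assumption; `synthetic_quota` and `target_per_class` are hypothetical names, and a uniform per-class floor is only one of several possible balancing strategies.

```python
from collections import Counter

def synthetic_quota(labels, target_per_class):
    """Map each under-represented class to the number of synthetic examples
    needed to reach `target_per_class`. Classes already at or above the
    target are omitted."""
    counts = Counter(labels)
    return {
        category: target_per_class - n
        for category, n in counts.items()
        if n < target_per_class
    }

# Hypothetical label distribution for illustration.
labels = ["billing"] * 80 + ["refund"] * 8 + ["login"] * 6 + ["fraud"] * 4 + ["legal"] * 2
quota = synthetic_quota(labels, target_per_class=10)
for category, needed in sorted(quota.items()):
    print(f"{category}: generate {needed} synthetic examples")
```

The quota only says how many examples to generate per class; how they are generated and validated is a separate decision.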

Common augmentation targets:

  • Rare categories — intent classes with < 50 real examples
  • Adversarial inputs — inputs designed to trigger failure modes
  • Edge cases — boundary conditions, ambiguous cases, multi-label examples
  • Domain shifts — slightly different register, terminology, or format than the training distribution
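
The first target in the list, rare categories, is the easiest to detect automatically. The sketch below flags categories with fewer than 50 real examples, the threshold mentioned above. The function name `rare_categories` and the `known_categories` argument (used to surface classes that have no real examples at all) are assumptions made for illustration.

```python
from collections import Counter

def rare_categories(labels, known_categories=(), min_examples=50):
    """Return (category, real_count) pairs with fewer than `min_examples`
    real examples, rarest first. Categories listed in `known_categories`
    but absent from `labels` are reported with a count of 0."""
    counts = Counter(labels)
    categories = set(counts) | set(known_categories)
    rare = [(c, counts[c]) for c in categories if counts[c] < min_examples]
    # Sort by count, then by name so the order is deterministic.
    return sorted(rare, key=lambda item: (item[1], item[0]))

# Hypothetical data: "legal" is a valid intent with no real examples yet.
real_labels = ["billing"] * 120 + ["refund"] * 49 + ["fraud"] * 3
for category, n in rare_categories(real_labels, known_categories=["legal"]):
    print(f"{category}: {n} real examples")
```

The other targets (adversarial inputs, edge cases, domain shifts) usually cannot be found from label counts alone and tend to need error analysis or domain knowledge to identify.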