Lesson 8 · 17 min

Capstone: build a legal document analysis training set

End-to-end: generate 2,000 synthetic (clause, label) pairs for a contract clause classifier, apply three-pass quality filtering, augment underrepresented classes, and validate against a held-out evaluation set of real clauses.

The task

Build a training set for a contract clause classifier that categorizes clauses into: indemnification, limitation_of_liability, termination, payment, intellectual_property, governing_law, confidentiality, dispute_resolution.
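
One way to pin this label set down in code is a small enum, so that a misspelled label fails loudly instead of silently creating a ninth class. This is a minimal sketch; the names `ClauseLabel` and `parse_label` are illustrative and not part of the lesson.

```python
from enum import Enum


class ClauseLabel(str, Enum):
    """The eight clause categories named in the task description."""

    INDEMNIFICATION = "indemnification"
    LIMITATION_OF_LIABILITY = "limitation_of_liability"
    TERMINATION = "termination"
    PAYMENT = "payment"
    INTELLECTUAL_PROPERTY = "intellectual_property"
    GOVERNING_LAW = "governing_law"
    CONFIDENTIALITY = "confidentiality"
    DISPUTE_RESOLUTION = "dispute_resolution"


def parse_label(raw: str) -> ClauseLabel:
    """Normalize a raw label string; raises ValueError for unknown labels."""
    return ClauseLabel(raw.strip().lower())
```

Running every generated label through `parse_label` is a cheap first check on synthetic output, since generators tend to drift toward variants such as "IP" or "liability".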

You have 80 real labeled clauses (10 per class). You need at least 200 per class for reliable fine-tuning.
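
The numbers above imply a generation budget. Because the real clauses are held back for evaluation, every one of the 200 training examples per class has to be synthetic. The sketch below works through that arithmetic; `candidates_needed` is a hypothetical helper, and splitting the budget evenly across classes is an assumption.

```python
import math

NUM_CLASSES = 8             # from the lesson: eight clause categories
TARGET_PER_CLASS = 200      # from the lesson: minimum per class
GENERATION_BUDGET = 2_000   # from the lesson summary

# Synthetic clauses that must survive filtering across all classes.
required_total = NUM_CLASSES * TARGET_PER_CLASS

# Lowest overall filter pass rate that still meets the target
# if exactly GENERATION_BUDGET clauses are generated.
min_pass_rate = required_total / GENERATION_BUDGET


def candidates_needed(target: int, pass_rate: float) -> int:
    """Clauses to generate for one class so that `target` survive filtering."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return math.ceil(target / pass_rate)
```

Here `required_total` is 1,600 and `min_pass_rate` is 0.8, so the 2,000-clause budget only works if the filters reject no more than one clause in five. If only half of the generated clauses survive, `candidates_needed(200, 0.5)` gives 400 per class, which exceeds the budget across eight classes.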

Pipeline:

Analyze the real distribution — find which classes need the most augmentation
Generate synthetic clauses per class using self-instruct style
Apply rule-based + model-based quality filtering
Augment with adversarial examples (clauses that span two categories)
Reserve the 80 real examples for evaluation only
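
The filtering and hold-out steps above can be sketched as follows. This is an illustration under stated assumptions, not the lesson's implementation: the word-count bounds and the specific rules in `passes_rules` are invented for the example, and `judge` stands in for whatever model-based check you use (for instance an LLM or a classifier that confirms the label).

```python
import re
from collections import Counter
from typing import Callable, Iterable

Pair = tuple[str, str]  # (clause text, label)

MIN_WORDS, MAX_WORDS = 8, 200  # assumed bounds; tune on your own data


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def passes_rules(clause: str) -> bool:
    """Cheap rule-based checks (assumed rules, chosen for illustration)."""
    if not MIN_WORDS <= len(clause.split()) <= MAX_WORDS:
        return False
    # Reject leftover template slots like "[PARTY]" and model chatter.
    if re.search(r"\[[^\]]*\]|as an ai", clause, flags=re.IGNORECASE):
        return False
    return clause.rstrip().endswith((".", ";"))


def filter_pairs(
    pairs: Iterable[Pair],
    eval_clauses: Iterable[str],
    judge: Callable[[str, str], bool],
) -> list[Pair]:
    """Drop duplicates, eval-set leaks, rule failures, then judge rejections."""
    held_out = {normalize(c) for c in eval_clauses}
    seen: set[str] = set()
    kept: list[Pair] = []
    for clause, label in pairs:
        key = normalize(clause)
        if key in seen or key in held_out:
            continue  # duplicate, or a copy of a held-out real clause
        seen.add(key)
        # `and` short-circuits, so the costly judge runs only after rules pass.
        if passes_rules(clause) and judge(clause, label):
            kept.append((clause, label))
    return kept


def deficits(kept: Iterable[Pair], labels: Iterable[str], target: int = 200) -> dict[str, int]:
    """How many more clauses each class needs to reach `target`."""
    counts = Counter(label for _, label in kept)
    return {label: max(0, target - counts[label]) for label in labels}
```

Calling `deficits` on the filtered pairs shows which classes to generate more of in the next round. Passing the 80 real clauses as `eval_clauses` keeps exact copies of them out of the training set, although it will not catch paraphrases.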