Lesson 8 · 17 min
Capstone: build a legal document analysis training set
End-to-end: generate 2,000 synthetic (clause, label) pairs for a contract clause classifier, apply three-pass quality filtering, augment underrepresented classes, and validate with a real held-out eval set.
The task
Build a training set for a contract clause classifier that categorizes clauses into: indemnification, limitation_of_liability, termination, payment, intellectual_property, governing_law, confidentiality, dispute_resolution.
You have 80 real labeled clauses (10 per class). You need at least 200 per class for reliable fine-tuning.
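The numbers above imply a generation budget, which is worth working out before writing any prompts. The following is a back-of-the-envelope sketch, assuming the 2,000 synthetic pairs are split evenly across the eight classes (the lesson does not state the split) and that the real clauses are held out, so they do not count toward the 200-per-class training target.

```python
# Label set and targets taken from the lesson text.
LABELS = (
    "indemnification", "limitation_of_liability", "termination", "payment",
    "intellectual_property", "governing_law", "confidentiality", "dispute_resolution",
)
TARGET_PER_CLASS = 200   # minimum training examples per class
SYNTHETIC_TOTAL = 2000   # synthetic pairs to generate

# Assumption: an even split across classes.
per_class_budget = SYNTHETIC_TOTAL // len(LABELS)

# Share of generated clauses per class that filtering can discard
# while still leaving TARGET_PER_CLASS examples.
max_discard_rate = 1 - TARGET_PER_CLASS / per_class_budget

print(f"{per_class_budget} generated per class, "
      f"up to {max_discard_rate:.0%} may be filtered out")
```

Under these assumptions each class gets 250 generated clauses, so quality filtering can reject about one in five before a class falls below the 200 target. If filtering turns out to be stricter than that, the class needs another round of generation.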
Pipeline:
- Analyze the real distribution — find which classes need the most augmentation
- Generate synthetic clauses per class using self-instruct style
- Apply rule-based + model-based quality filtering
- Augment with adversarial examples (clauses that span two categories)
- Reserve the 80 real examples for evaluation only
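The rule-based part of the filtering step, together with the evaluation-only rule for real clauses, can be sketched in plain Python. This is a minimal illustration and not the lesson's reference implementation: the helper names (`normalize`, `rule_filter`, `class_counts`), the word-count thresholds, and the sample clauses are all hypothetical. The model-based pass is omitted because it depends on the model you use.

```python
from collections import Counter

LABELS = {
    "indemnification", "limitation_of_liability", "termination", "payment",
    "intellectual_property", "governing_law", "confidentiality", "dispute_resolution",
}

def normalize(text):
    """Lowercase and collapse whitespace so near-identical clauses compare equal."""
    return " ".join(text.lower().split())

def rule_filter(pairs, eval_clauses, min_words=8, max_words=200):
    """Keep (clause, label) pairs that pass cheap checks.

    Thresholds are illustrative assumptions; tune them on your own data.
    """
    held_out = {normalize(c) for c in eval_clauses}
    seen, kept = set(), []
    for clause, label in pairs:
        key = normalize(clause)
        if label not in LABELS:
            continue  # label outside the eight classes
        if not (min_words <= len(key.split()) <= max_words):
            continue  # fragment or runaway generation
        if key in seen or key in held_out:
            continue  # duplicate, or a copy of a held-out real clause
        seen.add(key)
        kept.append((clause, label))
    return kept

def class_counts(pairs):
    """Per-class counts, including classes with zero examples."""
    counts = Counter(label for _, label in pairs)
    return {label: counts.get(label, 0) for label in sorted(LABELS)}

# Toy data, made up for illustration only.
eval_clauses = ["This Agreement shall be governed by the laws of the State of New York."]
synthetic = [
    ("Either party may terminate this Agreement upon thirty days written notice.", "termination"),
    ("Either party may terminate this  Agreement upon thirty days written notice.", "termination"),
    ("Payment due.", "payment"),
    ("This Agreement shall be governed by the laws of the State of New York.", "governing_law"),
    ("The Receiving Party shall keep all Confidential Information strictly secret.", "privacy"),
]
kept = rule_filter(synthetic, eval_clauses)
print(len(kept), class_counts(kept)["termination"])
```

In this toy run only the first clause survives. The second is a whitespace-only duplicate, the third is too short, the fourth copies a held-out real clause, and the fifth carries a label outside the class set. Running `class_counts` after filtering shows which classes have dropped below target and need more generation.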