Lesson 6 · 12 min

Automated red teaming and building a security eval set

Red teaming an LLM application means systematically trying to break it before attackers do. Automated red teaming with an adversarial LLM generates attack variants at scale — building the security eval set your CI pipeline needs.

Why manual red teaming is not enough

Manual red teaming (engineers or security researchers trying to break the system) finds many real vulnerabilities. But it has a fundamental limitation: human attackers converge on known patterns. The model was likely trained on those same patterns — so the most obvious attacks often fail while creative variants succeed.

Automated red teaming uses an adversarial LLM (the "attacker") to generate novel attack prompts against the target system (the "defender"). The attacker is prompted to be creative, persistent, and to try variations that human testers wouldn't think of.

This produces two valuable outputs:

Live vulnerabilities to fix before deployment
A security eval set — a catalog of attack prompts and expected defenses that runs in CI