The trap
Every "Duolingo for X" startup hits the same wall.
Duolingo's success comes from one mechanic: rapid binary feedback. Translate "Hola" → "Hello". Right or wrong. The streak, the XP, the heart system all hinge on that primitive.
Generative AI breaks the primitive.
"Write a prompt that summarizes this paper in 80 words with a TL;DR section" doesn't have a binary answer. There are good prompts, bad prompts, and ten variations that all work. A binary grader either lets bad prompts through or rejects good ones.
So you can't just ship the Duolingo loop on AI content. The validation layer has to be different.
What we built instead
Three validation primitives, used per beat type:
1. Multiple-choice with adversarial distractors
Used for concept checks. The wrong answers are plausibly correct — the kind of thing engineers actually argue about. "Why is a negative constraint often more effective than a positive one?" has four real-sounding answers, only one of which captures the mechanism.
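To make that concrete, here's a minimal sketch of what an adversarial-distractor beat could look like. The field names and grading helper are illustrative, not the app's actual schema:

```javascript
// Hypothetical beat shape for a concept check — field names are
// assumptions, not the production schema.
const mcqBeat = {
  type: "mcq",
  question: "Why is a negative constraint often more effective than a positive one?",
  options: [
    "It reduces the model's temperature, making output deterministic.",    // plausible, wrong
    "It rules out a failure mode the model would otherwise drift toward.", // correct
    "Negative tokens are weighted higher during decoding.",                // plausible, wrong
    "It shortens the prompt, leaving more room in the context window.",    // plausible, wrong
  ],
  correctIndex: 1,
};

// Grading stays binary: exactly one option captures the mechanism.
function gradeMcq(beat, chosenIndex) {
  return chosenIndex === beat.correctIndex;
}

console.log(gradeMcq(mcqBeat, 1)); // true
console.log(gradeMcq(mcqBeat, 0)); // false
```

The work lives in writing the distractors, not in the grader — the grader is the one place the Duolingo primitive survives intact.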
2. Rubric-graded prompt challenges
For prompt-engineering exercises, we ship a rubric: required keywords (role, constraints, schema), banned anti-patterns (apologies, weak instructions), word-count bounds, and an optional bonus regex for structure. The rubric scores from 0 to 100; passing is 70+.
This is not a grade. It's a signal. The bonus criteria reward sophistication; the must-nots catch sloppy prompts. The score gives the learner a reason to iterate.
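A sketch of how such a grader might fit together. The point weights and rubric fields here are illustrative assumptions, not the production scheme:

```javascript
// Illustrative rubric grader — the weighting is an assumption, not the
// app's actual scoring scheme.
function gradeRubric(promptText, rubric) {
  const words = promptText.trim().split(/\s+/).length;
  const lower = promptText.toLowerCase();
  let score = 0;

  // Required keywords (role, constraints, schema) split 60 points.
  const per = 60 / rubric.mustInclude.length;
  for (const kw of rubric.mustInclude) {
    if (lower.includes(kw.toLowerCase())) score += per;
  }

  // Staying inside the word-count bounds is worth 20 points.
  if (words >= rubric.minWords && words <= rubric.maxWords) score += 20;

  // Each anti-pattern (apologies, weak instructions) costs 15 points.
  for (const bad of rubric.mustNotInclude) {
    if (lower.includes(bad.toLowerCase())) score -= 15;
  }

  // Optional bonus regex for structure tops the score up toward 100.
  if (rubric.bonusPattern && rubric.bonusPattern.test(promptText)) score += 20;

  return Math.max(0, Math.min(100, Math.round(score)));
}

const rubric = {
  mustInclude: ["role", "constraints", "schema"],
  mustNotInclude: ["sorry", "maybe try"],
  minWords: 20,
  maxWords: 120,
  bonusPattern: /##\s*\w+/, // rewards an explicit section header
};

const attempt =
  "You take the role of a technical editor. Constraints: stay under 80 words, " +
  "follow the output schema below, and include a TL;DR. ## Output schema";

console.log(gradeRubric(attempt, rubric) >= 70); // passing is 70+
```

Because every criterion is a string or regex check, the score is deterministic and instant — which is exactly what lets it run on every keystroke without an LLM in the loop.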
3. Code execution, both languages
JavaScript runs in a sandboxed Function; Python runs through Pyodide-in-Worker. Both capture stdout. Lessons specify expected substrings — when output matches, the learner gets a "Matches expected" badge.
This is the closest analogue to Duolingo's binary feedback we can ship without misleading the learner. A Python expression that prints the right answer is right. One that prints the wrong answer is wrong. Syntax errors are catchable. Logical errors are visible.
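The JS side can be sketched in a few lines. This is a minimal illustration of the capture-and-match step, not the real sandbox, which is stricter:

```javascript
// Minimal sketch: run learner code in a Function with a shadowed console,
// then check the lesson's expected substrings against captured output.
function runAndCheck(source, expectedSubstrings) {
  const lines = [];
  const fakeConsole = { log: (...args) => lines.push(args.join(" ")) };
  try {
    // The "console" parameter shadows the global inside learner code,
    // so console.log writes into our buffer instead of real stdout.
    new Function("console", source)(fakeConsole);
  } catch (err) {
    return { ok: false, error: String(err), output: lines.join("\n") };
  }
  const output = lines.join("\n");
  const matches = expectedSubstrings.every((s) => output.includes(s));
  return { ok: matches, output }; // ok → "Matches expected" badge
}

const result = runAndCheck(
  "const xs = [1, 2, 3]; console.log(xs.reduce((a, b) => a + b, 0));",
  ["6"]
);
console.log(result.ok);     // true
console.log(result.output); // "6"
```

Substring matching (rather than exact-output equality) is what keeps this honest: it tolerates harmless extra logging while still failing on a wrong answer.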
What we abandoned
Things that sound educational but don't work:
- AI-grades-AI on every prompt. Calling an LLM as judge on every challenge is expensive, slow, and the judge has its own biases. We use rubrics for the structured cases and reserve LLM-judging for capstone reviews.
- Heart loss on wrong answers. Punishing failure on hard generative tasks creates anxiety, not retention. We use spaced repetition instead — wrong answers come back tomorrow, and right ones march out into the future.
- Pure-quiz pacing. Every lesson would be 80% reading and 20% quiz. Boring. We mix it up: text, callouts, diagrams (custom SVG), code, prompt challenges, ordering exercises. Every beat is a different cognitive mode.
What it feels like
A lesson is ~8-12 minutes. Five to nine beats. Maybe two are pure reading. The rest are interactive. The runner reveals one beat at a time — you can't skip ahead, but you also can't lose.
Keep a streak. Earn XP. Move into the spaced-repetition queue. Eventually pass a CertQuests practice exam and have something to show for it.
That's the loop. Not Duolingo's loop. But it works for what we're teaching.