Eval-Driven Development

AI writes the code. Deterministic code verifies the AI. No hallucinations, no hand-waving — just assertions that pass or fail. The development loop that makes everything else sustainable.


Here's a pattern that keeps showing up. You ask an AI agent to build something complex — an algorithm, a data pipeline, a parser, a workflow. The agent delivers code that looks right. It runs without errors. The output seems reasonable. You glance at it, nod, and move on.

Two weeks later, you discover the output was subtly wrong the entire time.

The instinct at this point is to use another AI to check the first one. And that's not wrong — running multiple agents to review each other's work catches real mistakes. I do it myself. But it's not sufficient. AI models share correlated failure modes. The same architectural blind spots that caused the first model to miss something can cause the second model to miss it too. Multi-agent review is a useful layer. It's not a ground truth.

The ground truth is deterministic code. And here's the key insight: you can absolutely use AI to write that code. The eval itself can be AI-generated. What matters is that once written, the eval is deterministic — it runs the same way every time, produces the same result, and cannot hallucinate. The authorship is irrelevant. The nature of the artifact is everything.

The Concept

Eval-driven development is a discipline: for every non-trivial piece of AI-generated output, you write a corresponding piece of traditional, deterministic code that programmatically verifies whether the output is correct.

Not "looks correct." Not "probably correct." Provably correct, by the standards of code that runs the same way every time, has no opinion, and cannot hallucinate.

This takes many forms:

  • Assertions — the output must satisfy specific mathematical properties, boundary conditions, or invariants
  • Schema validators — the structure must match an exact specification, every field present, every type correct
  • Diff checks — the output must produce identical results to a known-good reference implementation on the same inputs
  • Property-based tests — the output must hold true across thousands of randomly generated inputs, not just the three examples you thought of
  • Integration tests — the output must actually work when plugged into the real system, not just in isolation
  • Snapshot tests — the output must match a previously approved baseline, and any deviation gets flagged for human review
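To make one of these concrete, here is a minimal property-based eval in Python for an AI-generated sort function. `ai_sort` is a hypothetical stand-in for whatever the agent produced (here it just wraps `sorted` so the sketch runs); the eval itself uses only randomly generated inputs and two invariants:

```python
import random

def ai_sort(xs):
    # Hypothetical stand-in for the AI-generated code under test.
    return sorted(xs)

def eval_sort(fn, trials=10_000):
    """Property-based eval: fn must return its input's elements, in order."""
    for _ in range(trials):
        xs = [random.randint(-1000, 1000) for _ in range(random.randint(0, 50))]
        out = fn(list(xs))
        # Invariant 1: the output is sorted.
        assert all(a <= b for a, b in zip(out, out[1:])), f"not sorted: {out}"
        # Invariant 2: the output is a permutation of the input.
        assert sorted(xs) == sorted(out), f"not a permutation of {xs}"
    return True

assert eval_sort(ai_sort)
```

Note that the eval never spells out what the correct answer is for any particular input. It states properties every correct answer must have, which is why it scales to inputs nobody thought of.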

None of this is new. This is testing. This is what engineers have always done. The difference is that now, the thing being tested isn't code you wrote — it's code an AI wrote. And that changes the stakes, because your intuition about where bugs hide doesn't apply when you didn't write the code.

Why This Works

When an AI agent generates output, it's probabilistic. It produces the most likely next token, shaped by training data and context. Most of the time, this is remarkably good. But "most of the time" is not "all of the time," and the failures are not random — they're confidently wrong. The agent doesn't flag uncertainty. It doesn't say "I'm not sure about this edge case." It just produces output with the same fluency whether it's right or wrong.

Deterministic code has none of these problems.

  • An assertion either passes or fails. There's no "probably passes."
  • A schema validator doesn't care how confident the AI was. The field is there or it isn't.
  • A diff check doesn't negotiate. The output matches or it doesn't.
  • A property-based test doesn't get tired after 50 cases. It runs 10,000 and reports every failure.
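As a sketch of the diff-check case: given a known-good reference implementation, the eval is nothing more than running both functions over the same inputs and collecting mismatches. The function names here are illustrative, not from any real codebase:

```python
def reference_slug(s: str) -> str:
    # Known-good reference implementation.
    return "-".join(s.lower().split())

def ai_slug(s: str) -> str:
    # Hypothetical stand-in for the AI-generated rewrite under test.
    return "-".join(s.lower().split())

def diff_check(candidate, reference, inputs):
    """Diff eval: candidate must match the reference on every input."""
    return [(x, candidate(x), reference(x))
            for x in inputs if candidate(x) != reference(x)]

# An empty mismatch list means the two implementations agree.
mismatches = diff_check(ai_slug, reference_slug, ["Hello World", "  Eval  Driven ", "A"])
```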

Code doesn't hallucinate. That's the entire value proposition. In a world where your primary author is probabilistic, your primary verifier must be deterministic.

The New Development Loop

Traditional development: write code → write tests → run tests → fix code.

Eval-driven development flips the order: write the eval first. Define what correct looks like in code before the agent writes a single line. Then let the agent generate. Then run the eval. If it fails, feed the failure back to the agent and iterate.

The loop looks like this:

  1. Define the contract — what must be true about the output? What are the invariants, the edge cases, the boundary conditions?
  2. Write the eval — traditional code that checks every condition. No AI in this step. Pure, boring, deterministic logic.
  3. Let the agent generate — give it the problem, the constraints, and ideally the eval itself so it knows what it's being measured against.
  4. Run the eval — pass or fail. No ambiguity.
  5. Iterate on failure — feed the eval output back to the agent. "Your output failed this assertion because X. Fix it." This is where agents excel — targeted correction with specific feedback.
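The loop above can be sketched in a few lines of Python. `agent_generate` is a hypothetical placeholder for whatever model API you call; the eval is the deterministic part:

```python
def run_eval(code: str) -> tuple[bool, str]:
    # Step 4: deterministic check. Here the contract is that the generated
    # code must define add(a, b) satisfying a handful of invariants.
    ns: dict = {}
    try:
        exec(code, ns)
        add = ns["add"]
        assert add(2, 3) == 5, "add(2, 3) != 5"
        assert add(-1, 1) == 0, "add(-1, 1) != 0"
        assert add(0, 0) == 0, "add(0, 0) != 0"
    except Exception as e:
        return False, str(e)
    return True, "all assertions passed"

def agent_generate(prompt: str) -> str:
    # Hypothetical stand-in for a model call; returns candidate code.
    return "def add(a, b):\n    return a + b\n"

prompt = "Write add(a, b). It must pass this eval: ..."  # step 3: share the eval
for attempt in range(5):                                  # step 5: iterate on failure
    candidate = agent_generate(prompt)
    ok, feedback = run_eval(candidate)
    if ok:
        break
    prompt += f"\nYour output failed: {feedback}. Fix it."
```

The feedback string is the whole point of step 5: it names the exact assertion that broke, which is precisely the kind of targeted input agents correct well against.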

If this sounds like Test-Driven Development, that's because it is. Kent Beck pioneered TDD decades ago, and the core insight — define correctness before writing implementation — is more relevant now than when he first proposed it. The implementation author changed from human to AI. The need for upfront correctness criteria didn't.

When AI-as-Judge Is Fine (and When It's Not)

Let's be clear: using multiple agents to cross-check each other is a legitimate and useful practice. If Agent A writes code and Agent B reviews it, Agent B will catch some real bugs. I use this pattern regularly — it works. But it works the way a second opinion works in medicine: valuable, but not a lab result.

For genuinely subjective output — prose, marketing copy, design suggestions — AI-as-judge is often the best you can get. There's no assertion that captures "is this good?" The multi-agent review can sit on top. It just can't be the only layer.

But for anything with a verifiable answer, deterministic evals win:

  • Code generation — does it compile? Do the tests pass? Does it handle the edge cases?
  • Data transformations — does the output schema match? Are the row counts right? Do the aggregations sum correctly?
  • API integrations — does the request match the spec? Does the response parse correctly?
  • Algorithms — does it produce the correct output for known inputs? Does it satisfy the time complexity requirement?
  • Configuration — is the YAML/JSON valid? Do all references resolve? Are there no circular dependencies?
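The configuration case, for example, reduces to a few dozen lines of stdlib Python. This is a sketch, assuming a config format where each entry may list a `depends_on` key; it checks all three conditions from the bullet above — valid JSON, resolvable references, no cycles:

```python
import json
from graphlib import TopologicalSorter, CycleError

def eval_config(raw: str) -> list[str]:
    """Deterministic config eval. Returns a list of failure messages;
    an empty list means the config passed."""
    try:
        cfg = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    failures = []
    names = set(cfg)
    graph = {}
    for name, spec in cfg.items():
        deps = spec.get("depends_on", [])
        for dep in deps:
            if dep not in names:
                failures.append(f"{name}: unresolved reference {dep!r}")
        graph[name] = set(deps) & names
    try:
        # static_order raises CycleError if the dependency graph has a cycle.
        tuple(TopologicalSorter(graph).static_order())
    except CycleError as e:
        failures.append(f"circular dependency: {e.args[1]}")
    return failures

# A config with a dangling reference and a cycle fails on both counts:
bad = '{"a": {"depends_on": ["b"]}, "b": {"depends_on": ["a", "missing"]}}'
```

No model, however capable, can negotiate with `eval_config`. The config resolves or it doesn't.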

The rule is simple: if you can write code that checks it, write code that checks it. Only fall back to AI-as-judge when the output is genuinely subjective. Most engineering output isn't.

Closing the Verification Gap

I wrote previously about the verification gap — the growing distance between what AI agents can produce and what humans can verify. Agents generate code faster than anyone can review it. The output volume exceeds human attention bandwidth. The gap widens with every model improvement.

Evals are how you close that gap without slowing down. You don't review every line the agent wrote — you write the criteria for correctness once, and the eval runs in milliseconds every time. The agent can generate a thousand iterations and the eval catches every failure without you reading a single line of generated code.

This is the leverage that makes AI-assisted development sustainable. Without evals, you're either trusting blindly (dangerous) or reviewing everything manually (slow, defeats the purpose). With evals, you're trusting verified output — which is the only kind of trust that scales.

Making It Practical

If you're working with AI agents daily, here's how to start:

  • Start with the highest-risk output. What AI-generated code, if wrong, would cause the most damage? Write evals for that first.
  • Make evals part of your prompt. Tell the agent: "Here's the test suite your code must pass." Agents perform dramatically better when they know the success criteria upfront.
  • Use the agent to write the eval, then verify the eval yourself. The eval is typically simpler than the implementation. It's much easier to verify a 20-line test than a 200-line algorithm. Put your human attention where it has the most leverage.
  • Run evals in CI, not just locally. If the eval isn't automated, it won't survive the first deadline.
  • Treat eval failures as the agent's problem, not yours. Feed failures back to the agent with the specific assertion that broke. Let it iterate. This is what agents are good at — targeted fixes with clear feedback.
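A CI gate for this can be as small as a script that runs every eval and exits nonzero on any failure. The eval functions below are hypothetical placeholders; the shape — each eval returns `None` on pass or a failure message on fail — is one reasonable convention, not a standard:

```python
import sys

def eval_schema() -> "str | None":
    # Hypothetical placeholder: return None on pass, failure text on fail.
    return None

def eval_row_counts() -> "str | None":
    return None

def run_all(evals) -> int:
    """Run every eval; print each failure and return a CI exit code."""
    failures = [msg for ev in evals if (msg := ev()) is not None]
    for msg in failures:
        # The specific failure text is exactly what you feed back to the agent.
        print(f"EVAL FAILED: {msg}")
    return 1 if failures else 0

# In CI, the last line would be: sys.exit(run_all([eval_schema, eval_row_counts]))
exit_code = run_all([eval_schema, eval_row_counts])
```

Because the exit code is the interface, the same script works locally, in CI, and inside an agent's iteration loop without modification.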

The Boring Part Is the Important Part

There's nothing glamorous about writing assertions. Nobody posts on LinkedIn about their property-based test suite. Evals are the least exciting part of working with AI.

They're also the part that keeps everything from quietly falling apart.

AI agents are getting better every month. They write more code, faster, across more languages and domains. The output volume will only increase. If you don't have deterministic checks on that output, you're building on sand — and the building is getting taller every day.

Let AI write the code. Write the code that checks the code. That's eval-driven development. That's the discipline that makes everything else sustainable.

