Evaluation-Driven Development for LLM Apps in Production
LLM Quality Engineering
LLM apps fail differently than traditional software.
They pass unit tests and still behave badly in production because quality is fuzzy, inputs are messy, and small prompt/model changes can shift outputs.
That's why evaluation-driven development (EDD) matters. EDD treats evaluation like tests: you define what “good” means, you measure it on a golden set, and you gate releases when quality regresses.
What you'll learn
- The core pieces of an EDD workflow for LLM apps
- How to build a golden set and a scoring rubric
- How to gate releases and detect drift
- A copy/paste evaluation spec template you can adopt
TL;DR
Evaluation-driven development for LLM apps means treating quality like a test suite. Build a golden set of real examples, define a rubric and thresholds, and re-run evaluation on every prompt/model/retrieval change. Gate releases when quality regresses, and monitor drift in production with logging and sampling. EDD reduces “it feels worse” debates by turning quality into measurable evidence.
The EDD mindset: “quality is a contract”
If you can't describe “good output,” you can't ship reliably.
EDD starts by writing a quality contract:
- What does a correct answer look like?
- What is unacceptable?
- What is the fallback when uncertain?
- What is the threshold to ship?
Step 1: build the golden set (small, real, and curated)
Start smaller than you think:
- 30 to 100 examples per workflow is enough to start
- include edge cases and “failure cases”
- include the context you expect in production (docs, tickets, history)
The golden set should be owned like production code. Someone must maintain it.
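One practical way to own the golden set like code is to keep it as a JSONL file in the repo and load it with a small helper. This is a minimal sketch; the field names (input, expected, context, tags) are illustrative assumptions, not a standard.

```python
# Minimal golden-set loader. The JSONL format and field names are
# illustrative assumptions -- adapt them to your workflow.
import json
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    id: str
    input: str                                   # the user request, as seen in production
    expected: str                                # the behavior a reviewer would accept
    context: list = field(default_factory=list)  # docs, tickets, history
    tags: list = field(default_factory=list)     # e.g. ["edge-case", "smoke"]

def load_golden_set(path: str) -> list:
    """Read one example per line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [GoldenExample(**json.loads(line)) for line in f if line.strip()]
```

Keeping the set in the repo means changes to it go through review, which matters later when people are tempted to "fix the test."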
Step 2: define the rubric (how you score output)
Avoid “looks good.” Use a rubric with a few categories:
- factual accuracy (with citations if relevant)
- policy compliance (what must it not do?)
- format correctness (schema, structure)
- helpfulness (does it answer the user’s real question?)
You can score manually at first, then automate parts later.
A rubric that teams can actually use (example)
The fastest way to make rubrics unusable is to make them academic. Keep the categories few and the scoring simple.
One pragmatic pattern is a 0/1/2 scale per category:
- 2 (meets): correct and complete enough to ship
- 1 (partial): useful but missing something important
- 0 (fails): wrong, unsafe, or unusable
Example rubric for an internal knowledge assistant:
- Grounding (0/1/2): does it cite the right internal doc, or does it invent?
- Correctness (0/1/2): is the answer aligned with the source?
- Permissioning (0/1/2): does it refuse when access is missing?
- Format (0/1/2): does it follow the output schema consistently?
Then define “ship” as a threshold like “average score >= X” plus a hard gate like “permissioning cannot score 0.”
This turns “it feels worse” into a conversation about one category that regressed.
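The threshold-plus-hard-gate rule above fits in a few lines of code. This is a sketch: the category name, the 1.5 threshold, and the result format are assumptions you would replace with your own.

```python
# Ship decision from 0/1/2 rubric scores. The hard-gated category and
# the threshold value are illustrative assumptions, not fixed rules.
HARD_GATES = {"permissioning"}   # categories that must never score 0
SHIP_THRESHOLD = 1.5             # required average score across all categories

def ship_decision(scored_examples):
    """scored_examples: list of dicts mapping category name -> 0/1/2."""
    all_scores = [score for ex in scored_examples for score in ex.values()]
    average = sum(all_scores) / len(all_scores)
    hard_failures = [
        ex for ex in scored_examples
        if any(ex.get(cat) == 0 for cat in HARD_GATES)
    ]
    ship = average >= SHIP_THRESHOLD and not hard_failures
    return ship, average, hard_failures
```

A hard failure blocks the release even when the average looks fine, which is exactly the “permissioning cannot score 0” rule expressed as code.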
Offline vs online evaluation (you need both)
Offline eval is what you run on the golden set. It answers: did this change make outputs better on known cases?
Online eval is what you learn from production. It answers: did real user behavior change, and are we drifting?
In practice:
- Offline eval catches regressions you can reproduce (prompt/model/retrieval changes).
- Online monitoring catches reality (new ticket types, seasonal changes, new docs).
If you only do offline eval, you'll be surprised by production. If you only do online, you'll have no baseline and every debate becomes subjective.
The minimal EDD release gate (what to implement first)
You don’t need a perfect evaluation platform to start gating changes. A minimal gate is:
- Run the golden set on the current baseline (stored results).
- Run the golden set on the candidate change (prompt/model/retrieval).
- Compare scores and diff the failures.
- Block the change if it drops below threshold or introduces new “hard failures.”
Even if scoring is manual at first, the “diff the failures” step is what saves time. It forces the team to look at the same examples, not argue from anecdotes.
Once the loop works manually, you can automate parts of it. But don’t wait for automation to start.
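The “compare and diff the failures” step can be sketched like this, assuming evaluation results are stored as a simple mapping of example id to pass/fail (an assumed minimal format, not a prescribed one):

```python
# Diff baseline vs candidate eval results per example.
# "Results" here are simply example_id -> bool (passed).
def diff_failures(baseline: dict, candidate: dict):
    new_failures = sorted(k for k in candidate
                          if not candidate[k] and baseline.get(k, False))
    fixed = sorted(k for k in candidate
                   if candidate[k] and not baseline.get(k, True))
    return {"new_failures": new_failures, "fixed": fixed}

def gate(baseline: dict, candidate: dict, max_new_failures: int = 0) -> bool:
    # Block the change if it introduces more regressions than allowed.
    return len(diff_failures(baseline, candidate)["new_failures"]) <= max_new_failures
```

The per-example diff is the part that ends anecdote-driven debates: everyone is looking at the same regressed examples, not at vibes.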
When evaluation is expensive (keep the loop, shrink the set)
Some teams avoid EDD because running evaluation costs money or takes time. Don’t drop the practice. Shrink it.
Keep a small “smoke test” set (10 to 20 examples) that runs on every change, and run the full golden set on a schedule (nightly or weekly). You still get fast regression signals without blocking every small tweak.
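One way to keep both loops is to tag examples and filter at run time: the tagged smoke subset runs on every change, the full set on a schedule. The "smoke" tag is an assumed convention, not a standard field.

```python
# Select which examples to run: the small tagged "smoke" subset on
# every change, or the full golden set on a nightly/weekly schedule.
# The "smoke" tag convention is an illustrative assumption.
def select_examples(examples, smoke_only: bool):
    if smoke_only:
        return [ex for ex in examples if "smoke" in ex.get("tags", [])]
    return list(examples)
```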
Step 3: gate changes (prompt, model, retrieval)
If a change can alter output, it should trigger evaluation:
- prompt changes
- model/provider changes
- retrieval source changes (new docs, new chunking, new reranker)
- policy changes
The minimal release gate:
- run the eval suite,
- compare against the baseline,
- ship only if above threshold,
- roll back if a regression is detected.
How to keep the golden set healthy (so it doesn’t rot)
Golden sets fail when they stop representing reality. Two simple maintenance rules:
- Add failures, not just new examples. When the system fails in production, capture the input and the “correct” expected behavior (redacted if needed), then add it to the set.
- Version the set like code. Changes to the golden set should be reviewed. Otherwise people will “fix the test” to make a change pass.
If you do nothing else: schedule a short monthly review where you add the top 10 new failure cases from production. This keeps the eval suite honest without becoming a research project.
Step 4: monitor drift in production
Offline eval catches regressions you can reproduce. Production monitoring catches reality:
- sample outputs (with redaction)
- track latency and cost
- log refusals and fallback frequency
- review escalations and corrections
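The sampling-with-redaction idea can be sketched as below. The email-only redaction pattern and the 5% sample rate are illustrative assumptions; define your real data boundary before logging anything.

```python
# Sample a fraction of production outputs for review, redacting before
# anything is stored. Email-only redaction and the sample rate are
# illustrative assumptions -- your data policy defines the real boundary.
import random
import re

SAMPLE_RATE = 0.05  # review roughly 5% of outputs (assumed rate)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def redact(text: str) -> str:
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def maybe_log(record: dict, sink: list, rate: float = SAMPLE_RATE) -> None:
    """record: {'input': ..., 'output': ..., 'refused': bool, 'latency_ms': ...}"""
    if random.random() < rate:
        sink.append({
            "input": redact(record["input"]),
            "output": redact(record["output"]),
            "refused": record.get("refused", False),
            "latency_ms": record.get("latency_ms"),
        })
```

Tracking refusal and fallback frequency from these records is often the earliest drift signal: the model starts declining questions it used to answer.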
RAG-specific evaluation (don’t blame the model for bad retrieval)
If your workflow uses retrieval, you have two separate failure modes:
- Retrieval fetched the wrong context (or no context).
- Generation used the context poorly (or ignored it).
A practical way to debug is to log:
- which documents/chunks were retrieved,
- whether the answer cited them,
- and whether the answer was correct.
When teams say “the model got worse,” it’s often a retrieval change, a document change, or a permissions filter issue. EDD makes that visible.
A real-world failure mode: retrieval changed, quality tanked
This is a common story:
- A team changes chunking or adds a new document source.
- The assistant starts citing the wrong thing, or it answers confidently with irrelevant context.
- Stakeholders conclude “the model got worse.”
EDD lets you isolate the issue quickly because you can see:
- which examples regressed,
- whether retrieval returned different context,
- whether generation ignored the retrieved context,
- and whether a permission filter removed the correct source.
Without EDD, you’ll be tempted to “prompt harder” and you’ll lose days.
Copy/paste: evaluation spec template
Use this as an internal doc for each workflow.
Evaluation spec (LLM workflow)
Workflow:
Owner:
User intent:
Golden set:
- Source:
- Size:
- Update cadence:
Rubric categories:
- Accuracy:
- Policy compliance:
- Format:
- Helpfulness:
Ship threshold:
Regression threshold:
Fallback behavior:
Logging and retention notes:
Common failure modes
- No golden set, only opinions. Fix: build one, even small.
- Rubric too complex. Fix: start with 3 to 5 categories.
- No rollback path. Fix: treat prompts/models like releases.
- Logging sensitive content accidentally. Fix: define data boundary and redaction early.
Measure quality or argue about it forever
Evaluation is what turns an LLM app from “demo” into “product.” If you can measure quality, you can improve it. If you can't, you will argue about it forever. Need help building an evaluation framework? Let's talk.
Thinking about AI for your team?
We help companies move from prototype to production — with architecture that lasts and costs that make sense.