In 2026 there is no excuse for shipping an LLM feature without evals. There is, however, an epidemic of teams doing exactly that — pushing a prompt change to prod, eyeballing two examples, declaring victory. Then a customer notices the assistant got dumber, and nobody can answer when, why, or by how much.
This is the eval architecture we run on every LLM feature we ship. It's not novel. It's just disciplined.
The minimum bar: an eval set, a metric, a CI gate
If you have nothing, get these three things first.
1. The eval set (50–500 examples)
A spreadsheet — yes, a spreadsheet — of real inputs you've seen in production, paired with what a good answer looks like. This isn't synthetic. It's your data. Add to it every time a customer reports a bug or you find a weird output.
Sizes that matter:
- <50 examples: statistical noise drowns the signal. You will think a regression is a win.
- 50–200: good enough to catch most regressions on a single feature.
- 200–500: good enough to make decisions about model swaps.
- >500: diminishing returns unless your feature has wide input variance.
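For concreteness, here's a minimal sketch of what one row can look like once the spreadsheet graduates to code. The field names (id, input, reference, tags, source) are illustrative, not a prescribed schema.

```python
# A minimal sketch of one eval-set row. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class EvalExample:
    id: str                       # stable ID so results stay comparable across runs
    input: str                    # the real production input, verbatim
    reference: str                # what a good answer looks like, or the key facts it must contain
    tags: list[str] = field(default_factory=list)  # persona, language, edge-case category, ...
    source: str = "production"    # where it came from: bug report, sampled traffic, ...

# Growing the set is an append, not a rewrite:
examples = [
    EvalExample("inv-042",
                "Why was my invoice charged twice?",
                "Explains the duplicate-charge refund flow and links the billing page",
                ["billing", "en"]),
]
```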
2. The metric (one per dimension that matters)
Pick one number per quality you care about. We typically run 3–4:
- Correctness (does the answer say the right thing) — usually scored by an LLM-as-judge with a rubric.
- Format compliance (does it return valid JSON, or follow the structure) — deterministic check.
- Refusal rate (does it refuse things it shouldn't, or comply with things it shouldn't) — deterministic.
- Latency p95 — measured directly.
Avoid composite scores ("AI quality: 87%"). They hide everything.
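In code, that separation is just one small function per dimension, each returning its own number that never gets averaged with the others. A rough sketch, with the judge call left as a stub and the JSON shape and refusal markers as placeholder assumptions:

```python
import json

def format_compliance(output: str) -> float:
    """Deterministic: does the output parse as the JSON we asked for?"""
    try:
        obj = json.loads(output)
        # The "answer" key is illustrative; check whatever structure you require.
        return 1.0 if isinstance(obj, dict) and "answer" in obj else 0.0
    except json.JSONDecodeError:
        return 0.0

# Tune these to your model's actual refusal phrasing.
REFUSAL_MARKERS = ("i can't help with", "i'm unable to")

def refused(output: str) -> bool:
    """Deterministic: crude refusal detection by marker phrases."""
    return any(marker in output.lower() for marker in REFUSAL_MARKERS)

def correctness(output: str, reference: str) -> float:
    """LLM-as-judge with a rubric; see the judge section below."""
    raise NotImplementedError  # placeholder: call your judge model here
```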
3. The CI gate
On every PR that touches a prompt, model name, system message, or retrieval logic, the eval runs. The PR cannot merge if any metric drops more than X% from the main branch. We use a 2% drop threshold on correctness and 0% on format compliance — non-negotiable.
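The gate itself is a few lines. A sketch, assuming the main-branch metrics are stored as a JSON artifact and recomputed on the PR branch (the file names and exact numbers are illustrative):

```python
import json
import sys

# Max allowed absolute drop per metric, in metric points.
THRESHOLDS = {
    "correctness": 0.02,        # a 2% drop is tolerated
    "format_compliance": 0.0,   # any drop fails the build
}

def gate(baseline: dict[str, float], current: dict[str, float]) -> bool:
    ok = True
    for metric, max_drop in THRESHOLDS.items():
        drop = baseline[metric] - current[metric]
        if drop > max_drop:
            print(f"FAIL {metric}: {baseline[metric]:.3f} -> {current[metric]:.3f} (drop {drop:.3f})")
            ok = False
    return ok

if __name__ == "__main__":
    baseline = json.load(open("baseline.json"))  # produced on main
    current = json.load(open("current.json"))    # produced on the PR branch
    sys.exit(0 if gate(baseline, current) else 1)
```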
The eval ladder: when to invest in what
Once the basics work, the natural maturity progression looks like this:
Level 1 — Manual eval set, LLM-judge correctness
What we just described. Catches obvious regressions.
Level 2 — Sliced eval set
Tag examples by user persona, language, edge case category. Run metrics per slice. The first time you do this you'll find an entire customer segment your prompt regressed for while overall numbers held. This is the highest-leverage upgrade.
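Mechanically, slicing is just a group-by on tags, something like this sketch:

```python
# A sketch of per-slice scoring: group results by tag and report each slice
# separately, so a regression in one segment can't hide behind the average.
from collections import defaultdict

def metrics_by_slice(results):
    """results: iterable of (tags, score) pairs for a single metric."""
    by_slice = defaultdict(list)
    for tags, score in results:
        for tag in tags:
            by_slice[tag].append(score)
    return {tag: sum(scores) / len(scores) for tag, scores in by_slice.items()}

# e.g. {"billing": 0.91, "de": 0.74, "edge:empty-input": 0.55}
```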
Level 3 — Adversarial / red-team set
A separate set of inputs designed to break the model: prompt injections, edge cases, jailbreak attempts, ambiguous instructions. Run on every PR. Different threshold (0% regression).
Level 4 — Production sampling
Sample 1–5% of real production traffic, label it (human or LLM-judge), feed it back into the eval set. The eval set evolves with the product.
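A minimal sketch of the sampling half, assuming you have some labelling queue to write into (the queue call is a placeholder for your own infra):

```python
import random

SAMPLE_RATE = 0.02  # 1-5% of traffic; 2% here as an example

def maybe_sample(request_input: str, model_output: str) -> None:
    """Keep a small random fraction of live requests for labelling."""
    if random.random() < SAMPLE_RATE:
        enqueue_for_labelling({"input": request_input, "output": model_output})

def enqueue_for_labelling(record: dict) -> None:
    ...  # placeholder: write to whatever labelling queue or store you use
```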
Level 5 — A/B in production
When the eval delta is small but you suspect real-world impact, ship behind a flag and measure user-side metrics — task completion, thumbs-up rate, follow-up question rate. This catches the things evals miss.
Most teams should be at level 2–3. Level 5 is for mature products with serious traffic.
The LLM-as-judge trap
Using an LLM to grade an LLM is fine — if you do three things:
- Use a stronger model than the one you're grading. Grading Claude Sonnet output with Haiku is a recipe for false confidence.
- Use a rubric, not a vibe. "Rate this 1–5 for helpfulness" is meaningless. Define what 1, 3, and 5 look like, with examples.
- Sanity-check against humans periodically. Once a month, have a human grade 30 examples. If judge–human agreement drops below ~85%, your rubric is broken.
The teams that skip step 3 end up with judges that drift, scoring everything 4/5, and the metric becomes meaningless slowly enough that nobody notices. — Internal eval review note, 2025
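A sketch of what points 2 and 3 look like in practice: a rubric that spells out the anchor scores, and a monthly agreement check against human grades. The rubric wording and the exact-match definition of agreement are illustrative; use whatever tolerance fits your scale.

```python
# Illustrative rubric: the anchors are defined, not left to vibes.
RUBRIC = """Score the answer 1-5 for correctness against the reference.
5: all key facts present and correct, no fabrications.
3: key facts present but one material omission or minor error.
1: contradicts the reference or fabricates facts."""

def judge_human_agreement(judge_scores: list[int], human_scores: list[int]) -> float:
    """Fraction of examples where judge and human gave the same rubric score."""
    assert len(judge_scores) == len(human_scores)
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)

# Monthly sanity check on the same ~30 examples graded by both.
agreement = judge_human_agreement(judge_scores=[5, 3, 4, 1], human_scores=[5, 3, 3, 1])
if agreement < 0.85:
    print(f"Judge drift suspected: agreement {agreement:.0%} is below 85%, review the rubric")
```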
Per-section evals (and why one global score hides the truth)
We touched on this in the DocEngine pipeline: a single document quality score is useless. It's an average that hides the parts that need attention.
The same holds for evals. If your feature produces structured output, score each part independently. If your feature is conversational, score by turn type (clarifying question, answer, refusal, follow-up). One global score will always be too coarse to act on.
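A sketch of per-part scoring for structured output, reusing the correctness judge from the metrics sketch above; the field names are illustrative:

```python
def score_structured(output: dict, reference: dict) -> dict[str, float]:
    """Score each part of a structured output independently; never average them."""
    return {
        "summary": correctness(output["summary"], reference["summary"]),      # LLM-judge, as above
        "citations": 1.0 if output.get("citations") else 0.0,                  # deterministic presence check
        "next_steps": correctness(output["next_steps"], reference["next_steps"]),
    }

# Report {"summary": 0.9, "citations": 0.6, "next_steps": 0.8} as three numbers, not their mean.
```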
What we don't measure (and why)
Things we deliberately do not optimise for in the eval suite:
- "Helpfulness" or "user satisfaction" in the abstract — too fuzzy. We measure task completion on specific tasks instead.
- Token cost as part of correctness — measured separately. Conflating them creates perverse incentives.
- Subjective tone — unless tone is the feature. Otherwise it just invites bikeshedding from reviewers.
The cultural shift
The hardest part of getting evals into a team isn't the tooling. It's the cultural shift from "the model seems better" to "the eval shows a 3.2% gain on slice A and a 0.8% regression on slice B."
Once a team has evals running, three things change:
- Prompt changes get reviewed like code.
- Model upgrades become a measurable decision, not a holy war.
- Customers stop being your regression test.
That last one is the whole point.
If you're building anything with an LLM in 2026 and you can't answer the question "how do you know your last change was an improvement?" in one sentence — that's the work to do this quarter. Everything else can wait.