In 2026 there is no excuse for shipping an LLM feature without evals. There is, however, an epidemic of teams doing exactly that — pushing a prompt change to prod, eyeballing two examples, declaring victory. Then a customer notices the assistant got dumber, and nobody can answer when, why, or by how much.
This is the eval architecture we run on every LLM feature we ship. It's not novel. It's just disciplined.
The minimum bar: an eval set, a metric, a CI gate
If you have nothing, get these three things first.
1. The eval set (50–500 examples)
A spreadsheet — yes, a spreadsheet — of real inputs you've seen in production, paired with what a good answer looks like. This isn't synthetic. It's your data. Add to it every time a customer reports a bug or you find a weird output.
Sizes that matter:
- <50 examples: statistical noise drowns the signal. You will think a regression is a win.
- 50–200: good enough to catch most regressions on a single feature.
- 200–500: good enough to make decisions about model swaps.
- >500: diminishing returns unless your feature has wide input variance.
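For concreteness, here's a minimal sketch of what one row can look like once the spreadsheet graduates to code. The field names (id, input, reference, tags, source) are illustrative, not a prescribed schema.

```python
# A minimal sketch of one eval-set row. Field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class EvalExample:
    id: str                       # stable ID so results stay comparable across runs
    input: str                    # the real production input, verbatim
    reference: str                # what a good answer looks like, or the key facts it must contain
    tags: list[str] = field(default_factory=list)  # persona, language, edge-case category, ...
    source: str = "production"    # where it came from: bug report, sampled traffic, ...

# Growing the set is an append, not a rewrite:
examples = [
    EvalExample("inv-042",
                "Why was my invoice charged twice?",
                "Explains the duplicate-charge refund flow and links the billing page",
                ["billing", "en"]),
]
```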
2. The metric (one per dimension that matters)
Pick one number per quality you care about. We typically run 3–4:
- Correctness (does the answer say the right thing) — usually scored by an LLM-as-judge with a rubric.
- Format compliance (does it return valid JSON, or follow the structure) — deterministic check.
- Refusal rate (does it refuse things it shouldn't, or comply with things it shouldn't) — deterministic.
- Latency p95 — measured directly.
Avoid composite scores ("AI quality: 87%"). They hide everything.
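In code, that separation is just one small function per dimension, each returning its own number that never gets averaged with the others. A rough sketch, with the judge call left as a stub and the JSON shape and refusal markers as placeholder assumptions:

```python
import json

def format_compliance(output: str) -> float:
    """Deterministic: does the output parse as the JSON we asked for?"""
    try:
        obj = json.loads(output)
        # The "answer" key is illustrative; check whatever structure you require.
        return 1.0 if isinstance(obj, dict) and "answer" in obj else 0.0
    except json.JSONDecodeError:
        return 0.0

# Tune these to your model's actual refusal phrasing.
REFUSAL_MARKERS = ("i can't help with", "i'm unable to")

def refused(output: str) -> bool:
    """Deterministic: crude refusal detection by marker phrases."""
    return any(marker in output.lower() for marker in REFUSAL_MARKERS)

def correctness(output: str, reference: str) -> float:
    """LLM-as-judge with a rubric; see the judge section below."""
    raise NotImplementedError  # placeholder: call your judge model here
```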
3. The CI gate
On every PR that touches a prompt, model name, system message, or retrieval logic, the eval runs. The PR cannot merge if any metric drops more than X% from the main branch. We use a 2% drop threshold on correctness and 0% on format compliance — non-negotiable.
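The gate itself is a few lines. A sketch, assuming the main-branch metrics are stored as a JSON artifact and recomputed on the PR branch (the file names and exact numbers are illustrative):

```python
import json
import sys

# Max allowed absolute drop per metric, in metric points.
THRESHOLDS = {
    "correctness": 0.02,        # a 2% drop is tolerated
    "format_compliance": 0.0,   # any drop fails the build
}

def gate(baseline: dict[str, float], current: dict[str, float]) -> bool:
    ok = True
    for metric, max_drop in THRESHOLDS.items():
        drop = baseline[metric] - current[metric]
        if drop > max_drop:
            print(f"FAIL {metric}: {baseline[metric]:.3f} -> {current[metric]:.3f} (drop {drop:.3f})")
            ok = False
    return ok

if __name__ == "__main__":
    baseline = json.load(open("baseline.json"))  # produced on main
    current = json.load(open("current.json"))    # produced on the PR branch
    sys.exit(0 if gate(baseline, current) else 1)
```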
The eval ladder: when to invest in what
Once the basics work, the natural maturity progression looks like this:
Level 1 — Manual eval set, LLM-judge correctness
What we just described. Catches obvious regressions.
Level 2 — Sliced eval set
Tag examples by user persona, language, edge case category. Run metrics per slice. The first time you do this you'll find an entire customer segment your prompt regressed for while overall numbers held. This is the highest-leverage upgrade.
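Mechanically, slicing is just a group-by on tags, something like this sketch:

```python
# A sketch of per-slice scoring: group results by tag and report each slice
# separately, so a regression in one segment can't hide behind the average.
from collections import defaultdict

def metrics_by_slice(results):
    """results: iterable of (tags, score) pairs for a single metric."""
    by_slice = defaultdict(list)
    for tags, score in results:
        for tag in tags:
            by_slice[tag].append(score)
    return {tag: sum(scores) / len(scores) for tag, scores in by_slice.items()}

# e.g. {"billing": 0.91, "de": 0.74, "edge:empty-input": 0.55}
```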
Level 3 — Adversarial / red-team set
A separate set of inputs designed to break the model: prompt injections, edge cases, jailbreak attempts, ambiguous instructions. Run on every PR. Different threshold (0% regression).
Level 4 — Production sampling
Sample 1–5% of real production traffic, label it (human or LLM-judge), feed it back into the eval set. The eval set evolves with the product.
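A minimal sketch of the sampling half, assuming you have some labelling queue to write into (the queue call is a placeholder for your own infra):

```python
import random

SAMPLE_RATE = 0.02  # 1-5% of traffic; 2% here as an example

def maybe_sample(request_input: str, model_output: str) -> None:
    """Keep a small random fraction of live requests for labelling."""
    if random.random() < SAMPLE_RATE:
        enqueue_for_labelling({"input": request_input, "output": model_output})

def enqueue_for_labelling(record: dict) -> None:
    ...  # placeholder: write to whatever labelling queue or store you use
```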
Level 5 — A/B in production
When the eval delta is small but you suspect real-world impact, ship behind a flag and measure user-side metrics — task completion, thumbs-up rate, follow-up question rate. This catches the things evals miss.
Most teams should be at level 2–3. Level 5 is for mature products with serious traffic.
The LLM-as-judge trap
Using an LLM to grade an LLM is fine — if you do three things:
- Use a stronger model than the one you're grading. Grading Claude Sonnet output with Haiku is a recipe for false confidence.
- Use a rubric, not a vibe. "Rate this 1–5 for helpfulness" is meaningless. Define what 1, 3, and 5 look like, with examples.
- Sanity-check against humans periodically. Once a month, have a human grade 30 examples. If judge–human agreement drops below ~85%, your rubric is broken.
The teams that skip step 3 end up with judges that drift, scoring everything 4/5, and the metric becomes meaningless slowly enough that nobody notices. — Internal eval review note, 2025
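A sketch of what points 2 and 3 look like in practice: a rubric that spells out the anchor scores, and a monthly agreement check against human grades. The rubric wording and the exact-match definition of agreement are illustrative; use whatever tolerance fits your scale.

```python
# Illustrative rubric: the anchors are defined, not left to vibes.
RUBRIC = """Score the answer 1-5 for correctness against the reference.
5: all key facts present and correct, no fabrications.
3: key facts present but one material omission or minor error.
1: contradicts the reference or fabricates facts."""

def judge_human_agreement(judge_scores: list[int], human_scores: list[int]) -> float:
    """Fraction of examples where judge and human gave the same rubric score."""
    assert len(judge_scores) == len(human_scores)
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)

# Monthly sanity check on the same ~30 examples graded by both.
agreement = judge_human_agreement(judge_scores=[5, 3, 4, 1], human_scores=[5, 3, 3, 1])
if agreement < 0.85:
    print(f"Judge drift suspected: agreement {agreement:.0%} is below 85%, review the rubric")
```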
Per-section evals (and why one global score hides the truth)
We touched on this in the DocEngine pipeline: a single document quality score is useless. It's an average that hides the parts that need attention.
The same holds for evals. If your feature produces structured output, score each part independently. If your feature is conversational, score by turn type (clarifying question, answer, refusal, follow-up). One global score will always be too coarse to act on.
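A sketch of per-part scoring for structured output, reusing the correctness judge from the metrics sketch above; the field names are illustrative:

```python
def score_structured(output: dict, reference: dict) -> dict[str, float]:
    """Score each part of a structured output independently; never average them."""
    return {
        "summary": correctness(output["summary"], reference["summary"]),      # LLM-judge, as above
        "citations": 1.0 if output.get("citations") else 0.0,                  # deterministic presence check
        "next_steps": correctness(output["next_steps"], reference["next_steps"]),
    }

# Report {"summary": 0.9, "citations": 0.6, "next_steps": 0.8} as three numbers, not their mean.
```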
What we don't measure (and why)
Things we deliberately do not optimise for in the eval suite:
- "Helpfulness" or "user satisfaction" in the abstract — too fuzzy. We measure task completion on specific tasks instead.
- Token cost as part of correctness — measured separately. Conflating them creates perverse incentives.
- Subjective tone — unless tone is the feature. Otherwise it just invites bikeshedding from reviewers.
The cultural shift
The hardest part of getting evals into a team isn't the tooling. It's the cultural shift from "the model seems better" to "the eval shows a 3.2% gain on slice A and a 0.8% regression on slice B."
Once a team has evals running, three things change:
- Prompt changes get reviewed like code.
- Model upgrades become a measurable decision, not a holy war.
- Customers stop being your regression test.
That last one is the whole point.
If you're building anything with an LLM in 2026 and you can't answer the question "how do you know your last change was an improvement?" in one sentence — that's the work to do this quarter. Everything else can wait.