In Q4 2025, every enterprise on our books had an "agentic AI" line item in their 2026 plan. By Q1 2026, most of them quietly downgraded it to "AI assistant." This is what we learned watching, building, and occasionally rescuing those projects — and the patterns we'd actually ship into production today.

What "agentic" came to mean (and why it's slippery)

The word "agentic" stretched to cover at least four different things in 2025:

  1. A chatbot that can call one or two tools (calendar, search, email).
  2. A goal-driven loop — give it an objective, it plans, executes, observes, replans.
  3. A multi-agent system — specialist agents collaborating via a coordinator.
  4. A fully autonomous worker — does its job for a week without a human.

Type 1 ships every day. Type 4 doesn't exist anywhere production-grade in 2026, regardless of vendor demos. The interesting category, where most real enterprise value lives, is type 2 — but in a far more constrained shape than the frameworks advertise.

What's actually working in production

Pattern 1 — The narrow, deterministic agent

Single objective. 3–8 tools. Strict turn budget (usually 5–10). Hard fallback to human after that.

Examples shipping right now:

  • Sales pipeline triage agents that read inbound enquiries, enrich from CRM, classify, and draft a reply for human review.
  • Compliance pre-screen agents that check a contract against 20–30 policies and surface the failures with citations.
  • Internal helpdesk agents that resolve known-issue tickets end-to-end and escalate the rest.

These are unglamorous and they work. The trick is they're not really agents in the philosophical sense — they're a workflow with branching, where the LLM picks the branch.
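
Concretely, the shape is a dispatcher. A minimal sketch, assuming a hypothetical pick_branch() wrapper around your provider's structured-output call; the handler names and ticket fields are illustrative:

MAX_TURNS = 8  # strict turn budget

def pick_branch(ticket: dict, options: list[str]) -> str:
    raise NotImplementedError  # provider-specific structured-output call

def escalate(ticket: dict) -> dict:
    return {**ticket, "status": "needs_human"}  # hard fallback to a human

HANDLERS = {
    "known_issue": lambda t: {**t, "status": "resolved"},
    "needs_info": lambda t: {**t, "status": "awaiting_customer"},
    "escalate": escalate,
}

def triage(ticket: dict) -> dict:
    # Tickets arrive with status "open"; each turn the LLM only picks a branch,
    # and every branch is ordinary deterministic code.
    for _ in range(MAX_TURNS):
        label = pick_branch(ticket, options=list(HANDLERS))
        ticket = HANDLERS[label](ticket)
        if ticket["status"] != "open":
            return ticket
    return escalate(ticket)  # turn budget exhausted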

Pattern 2 — Plan-execute-verify

The model produces a plan; the plan is reviewed by a deterministic validator or a human; then a second model executes it one step at a time, with verification between steps. This trades latency for reliability.

The trick most teams miss: the planner and the executor should be different prompts and ideally different models. A reasoning-strong model plans; a faster model executes. We routinely run Claude or GPT-class for planning and a Haiku-class model for execution, cutting cost by ~60% with no measurable accuracy loss.
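
A minimal sketch of the split, assuming a hypothetical call_model() helper (any chat-completion API fits) and a caller-supplied verify() check; the model IDs are placeholders:

from dataclasses import dataclass

@dataclass
class Step:
    instruction: str

class StepFailed(Exception):
    pass

PLANNER = "reasoning-strong-model"  # placeholder IDs; swap in real models
EXECUTOR = "fast-cheap-model"

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # provider-specific chat-completion call

def plan_execute_verify(objective: str, verify) -> list[str]:
    # Plan once with the strong model.
    raw = call_model(PLANNER, f"Break this into numbered steps: {objective}")
    steps = [Step(line) for line in raw.splitlines() if line.strip()]
    # Execute step by step with the cheap model, verifying between steps.
    results = []
    for step in steps:
        out = call_model(EXECUTOR, step.instruction)
        if not verify(step, out):  # deterministic validator or a human gate
            raise StepFailed(step.instruction)  # stop: replan or escalate
        results.append(out)
    return results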

"We have not seen a multi-agent framework outperform a well-designed single-agent loop on any client problem in 2025. None." — Internal arch review, Q1 2026

What didn't ship (and probably won't, soon)

Multi-agent frameworks for general problems

The CrewAI / AutoGen / LangGraph "swarm of specialists" pattern looks great in demos. In production, the failure modes are brutal:

  • Agents argue. Two agents disagree on the next step. The coordinator picks one. The other agent is now operating on stale context.
  • Latency stacks. Each handoff is 1–3 LLM calls. A 5-step plan with 4 agents is 20+ calls. Users wait 30 seconds for an answer.
  • Debugging is impossible. When something goes wrong six hops in, "look at the logs" means reading 40 pages of model outputs across four agents. No one does it.
The error-compounding maths: a step that succeeds 95% of the time, run 50 times in sequence, completes the whole chain without an error just 0.95^50 ≈ 7.7% of the time. Cumulative drift is real, and it is not improved much by chain-of-thought, reflection, or any of the other dressing-up tricks.
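
That figure is just per-step reliability raised to the step count:

p, n = 0.95, 50          # per-step success rate, number of chained steps
print(f"{p ** n:.1%}")   # 7.7% chance the whole chain succeeds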

Long-running autonomous "employees"

The vendor pitch — "Agent X works overnight, you wake up to a finished feature" — runs into a wall called error compounding. The teams shipping "long-running" agents in production are quietly running 5-step loops with human verification at the end. Which is fine. It's just not what was advertised.

Two patterns that quietly outperform

1. The structured-output loop

Skip the agent framework. Use a model that supports strict structured output (every major provider does in 2026). Define a JSON schema for "next action." Loop. Validate. Execute. Re-prompt.

BUDGET = 5  # hard turn budget
done, steps = False, 0
while not done and steps < BUDGET:
    plan = llm.next_action(context, schema=ActionSchema)  # one validated action per turn
    result = execute(plan.tool, plan.args)                # deterministic tool dispatch
    context = update(context, plan, result)
    done = plan.tool == "finish"  # model signals completion via the schema
    steps += 1

That's the entire system. No graph framework. No agent class hierarchy. We've shipped this pattern in 5 production deployments over the last 12 months, and it outperforms LangGraph on every metric we care about (latency, cost, debuggability) for small-to-medium agent problems.
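
The ActionSchema can be as small as three fields. A sketch in pydantic; the tool names are illustrative, and the only real requirements are a closed tool set and an explicit finish action:

from typing import Any, Literal

from pydantic import BaseModel

class ActionSchema(BaseModel):
    tool: Literal["search_crm", "draft_reply", "finish"]  # closed tool set
    args: dict[str, Any]  # validated per-tool before execute() runs
    rationale: str        # one sentence, logged for debugging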

2. The spreadsheet-of-prompts

For repeatable, high-volume use cases — review every contract, classify every ticket, summarise every meeting — you don't need an agent. You need a prompt that's tested against 200 real examples in a CI eval (see LLM evals in CI) and a queue worker. That's it.
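
The runtime half is equally small. A sketch of the queue worker, assuming a hypothetical classify() wrapper around one LLM call; the prompt path is illustrative:

import queue

PROMPT = open("prompts/review_v12.txt").read()  # the CI-evaluated prompt

def classify(prompt: str, doc: str) -> str:
    raise NotImplementedError  # exactly one LLM call; any provider works

def worker(jobs: queue.Queue, results: list) -> None:
    while (doc := jobs.get()) is not None:  # None is the shutdown sentinel
        results.append(classify(PROMPT, doc))  # one call, no loop, no agent
        jobs.task_done()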

We migrated one client off a "multi-agent" contract review system to a prompt-plus-queue setup. They now run 4× the volume at 1/8th the cost.

What to put in your 2026 plan, honestly

If the question on the table is "do we invest in agentic AI in 2026," our advice on most engagements:

  1. Pick one workflow that has measurable business value, predictable inputs, and a human verifier already in the loop.
  2. Build the smallest possible structured-output loop — 3–5 tools, 5-step budget, hard fallback.
  3. Wrap it in evals so you know when it regresses (a minimal CI sketch follows this list).
  4. Ship that. Then look at what your users actually try to do with it. Then decide if you need a framework.
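
Step 3 can start as a single pytest file. A sketch, assuming a labelled-examples file and a hypothetical run_agent() entry point for your loop:

import json

import pytest

from triage import run_agent  # hypothetical: your loop's entry point

CASES = json.load(open("evals/cases.json"))  # labelled real examples

@pytest.mark.parametrize("case", CASES)
def test_no_regression(case):
    out = run_agent(case["input"])
    assert out["label"] == case["expected"]  # any drop fails the CI build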

Most teams that started with a framework spent 6 months in plumbing. The teams that started with a single workflow shipped value in 6 weeks and earned the right to expand.

The agentic future is real. It just looks much more boring than the keynotes.