Ask any delivery lead what swallows the most calendar time at the start of a project. It isn't the build. It isn't even the architecture. It's the paper trail — turning a fuzzy business intent into a Business Requirements Document, deriving a Software Requirements Document from it, then a Functional Requirements Specification, then epics, stories, and tasks. Three to five weeks, sometimes more, of senior people writing — and rewriting — the same ideas in different formats.
DocEngine collapses that into a guided AI pipeline. You give it a plain-text statement of intent; it generates a structured BRD and derives everything downstream, with per-section quality grades (A through F) so you know exactly where the AI is confident and where you need to step in. Below is a walkthrough of the pipeline, the architecture under it, and the design decisions we'd make again — and the ones we wouldn't.
What the pipeline actually does
Plain-text intent
│
▼
┌────────────┐
│ BRD │ ← AI generates structured BRD
└────────────┘
│
▼
┌────────────┐
│ SRD │ ← derived from BRD with traceability links
└────────────┘
│
▼
┌────────────┐
│ FRS │ ← derived from SRD
└────────────┘
│
▼
┌────────────┐
│ Epics │
└────────────┘
│
▼
┌────────────┐
│ Stories │
└────────────┘
│
▼
┌────────────┐
│ Tasks │
└────────────┘
Every node carries a quality grade assigned by an AI evaluator pass, scored on completeness, ambiguity, testability, and consistency with upstream nodes. A "C" or worse is a flag for human review; "A" sections you can usually ship.
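To make that concrete, here is a minimal sketch of what a graded node could look like. The field names, weighting, and letter cutoffs are illustrative assumptions, not DocEngine's actual schema:

```python
from dataclasses import dataclass

# Illustrative sketch only; not DocEngine's real schema or cutoffs.
@dataclass
class SectionGrade:
    completeness: float   # 0-1: are all expected elements present?
    ambiguity: float      # 0-1: higher means more ambiguous wording
    testability: float    # 0-1: can each statement be verified?
    consistency: float    # 0-1: agreement with upstream (parent) sections

    def letter(self) -> str:
        # Ambiguity counts against the grade; the other three count for it.
        score = (self.completeness + (1 - self.ambiguity)
                 + self.testability + self.consistency) / 4
        for cutoff, grade in ((0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D")):
            if score >= cutoff:
                return grade
        return "F"

@dataclass
class ArtefactSection:
    section_id: str
    doc_type: str          # "BRD", "SRD", "FRS", "Epic", "Story", "Task"
    body: str
    grade: SectionGrade
```

A reviewer's queue then becomes a filter on the C/D/F letters rather than a front-to-back read of the whole document.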
Outputs export to PDF, Word, Excel, or Markdown — because the consumer of an SRD is rarely the consumer of a backlog.
Why GraphRAG, not flat embeddings
The naive way to build something like this is: store everything in a vector database, retrieve top-k similar chunks, ask the model to write the next document.
That falls apart fast. Three reasons:
- Document derivations are structural, not semantic. A user story is traceable to a functional requirement, which is traceable to a system requirement, which is traceable to a business requirement. A flat vector store loses that hierarchy.
- You need to answer "where did this come from?" Auditors, regulators, and tech leads all eventually ask the same question. Embeddings give you "this is similar to" — not "this was derived from."
- Cross-document consistency requires graph traversal. When the AI updates an epic, every story under it might need re-evaluation. That's a graph walk, not a similarity search.
We built DocEngine on a knowledge graph where nodes are artefact sections and edges are typed: derives-from, implements, tests, conflicts-with, references. The LLM retrieves a subgraph — not a list of chunks — when generating the next document. This is the GraphRAG pattern, and it makes a measurable difference: in our internal evals, traceability accuracy is ~3.2× higher than a flat-RAG baseline on the same input.
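Here is a minimal sketch of that structure using networkx. The node IDs, edge walk, and helper names (provenance, needs_reeval) are illustrative assumptions, not DocEngine's internals:

```python
import networkx as nx

# Illustrative GraphRAG-style store: nodes are artefact sections, edges are typed.
g = nx.DiGraph()
g.add_node("BRD-3", doc="BRD", text="Customers can pay by card or invoice.")
g.add_node("SRD-12", doc="SRD", text="The system shall support card and invoice payment flows.")
g.add_node("FRS-41", doc="FRS", text="Checkout exposes a payment-method selector.")
g.add_node("STORY-107", doc="Story", text="As a buyer I choose invoice payment at checkout.")

# Edges point child -> parent.
g.add_edge("SRD-12", "BRD-3", type="derives_from")
g.add_edge("FRS-41", "SRD-12", type="derives_from")
g.add_edge("STORY-107", "FRS-41", type="implements")

def provenance(node: str) -> list[str]:
    """Answer 'where did this come from?' by walking upstream edges."""
    chain, current = [node], node
    while True:
        parents = [v for _, v, d in g.out_edges(current, data=True)
                   if d["type"] in ("derives_from", "implements")]
        if not parents:
            return chain
        current = parents[0]
        chain.append(current)

def needs_reeval(node: str) -> set[str]:
    """Everything derived (directly or transitively) from `node`.

    Edges point child -> parent, so downstream artefacts are the nodes
    that can reach `node`, i.e. its graph-theoretic ancestors.
    """
    return nx.ancestors(g, node)

print(provenance("STORY-107"))   # ['STORY-107', 'FRS-41', 'SRD-12', 'BRD-3']
print(needs_reeval("SRD-12"))    # {'FRS-41', 'STORY-107'}
```

Both questions, provenance and downstream impact, are answered by walking typed edges; a flat vector store has nothing to walk.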
Embeddings give you "this is similar to" — not "this was derived from." For a requirements pipeline, that distinction is the whole product. — Tarek, DocEngine architecture notes
Multi-provider AI: governance, not bragging
DocEngine ships with adapters for OpenAI, Anthropic Claude, Azure OpenAI, and OpenRouter. This isn't a "we have many models" feature. It's a governance feature.
Different customers have different non-negotiables:
- A UK financial-services client needs Azure OpenAI in a UK region with no vendor logging.
- A US health-tech company needs Anthropic Claude under a BAA.
- A GCC government department needs Azure OpenAI in a regional sovereign-cloud deployment.
- A scrappy startup just wants the cheapest tokens that work.
If your AI product is hard-wired to one provider, you can't sell into the regulated tier. We learned this the hard way with an earlier internal prototype, and rebuilt the model layer behind a thin adapter interface before shipping DocEngine externally.
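Concretely, the adapter layer can be a single narrow interface that the pipeline codes against. A minimal sketch, with class and method names that are illustrative rather than DocEngine's actual code, and the provider calls left as stubs:

```python
from typing import Protocol

class ChatModel(Protocol):
    """The only surface the document pipeline depends on."""
    def complete(self, system: str, prompt: str) -> str: ...

class AzureOpenAIAdapter:
    def __init__(self, endpoint: str, deployment: str, api_key: str):
        self.endpoint, self.deployment, self.api_key = endpoint, deployment, api_key

    def complete(self, system: str, prompt: str) -> str:
        # Call Azure OpenAI chat completions here; endpoint and deployment are
        # chosen per customer, which is how regional and sovereign-cloud
        # requirements are met.
        raise NotImplementedError

class AnthropicAdapter:
    def __init__(self, api_key: str, model: str):
        self.api_key, self.model = api_key, model

    def complete(self, system: str, prompt: str) -> str:
        # Call Anthropic's Messages API here (run under a BAA where required).
        raise NotImplementedError

def derive_srd_section(model: ChatModel, brd_context: str) -> str:
    # The pipeline only ever sees ChatModel, so moving a customer between
    # providers is configuration, not a code change.
    return model.complete(
        system="Derive an SRD section from the BRD context, preserving traceability.",
        prompt=brd_context,
    )
```

The pipeline code never imports a provider SDK directly; only the adapters do.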
Per-section quality grades — the feature we almost cut
Early DocEngine builds gave the user a single overall confidence number per document. Customers ignored it. Of course they did — a BRD is 40 pages of mixed-quality content, and one number tells you nothing about where the issues are.
Switching to per-section grades A–F changed the product. Reviewers now scan to the C/D/F sections first and ignore the A/B sections, which cut review time by ~70% in our pilot accounts. The grade is computed by a separate evaluator pass — not by the generator scoring its own work, which (predictably) is biased toward high marks.
If you're building anything with LLM output that humans must review, don't show one global score. Score the parts. We wrote about this approach to evaluation more broadly in LLM evals in CI: stop shipping AI features on vibes.
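Sketched concretely, and reusing the illustrative ChatModel and SectionGrade types from the earlier snippets (again an assumption about shape, not DocEngine's real evaluator): the generator writes a section, a second call with a rubric-only prompt scores it, and the reviewer queue is sorted by the resulting letter grade.

```python
import json

def evaluate_section(model: ChatModel, section_text: str, upstream_text: str) -> SectionGrade:
    """Second-pass evaluator: grades a section it did not write."""
    rubric = (
        "Score the SECTION from 0 to 1 on completeness, ambiguity, testability, "
        "and consistency with the UPSTREAM text (higher ambiguity means more "
        "ambiguous wording). Return JSON with exactly those four keys."
    )
    raw = model.complete(
        system=rubric,
        prompt=f"UPSTREAM:\n{upstream_text}\n\nSECTION:\n{section_text}",
    )
    return SectionGrade(**json.loads(raw))

def review_queue(sections: list[ArtefactSection]) -> list[ArtefactSection]:
    # Reviewers start from the worst-graded sections instead of reading front to back.
    order = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}
    return sorted(sections, key=lambda s: order[s.grade.letter()])
```

Keeping the evaluator prompt free of the generation context is what stops the grades from drifting back toward the generator's own optimism.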
What we'd do differently
Two regrets, for transparency:
- We over-templated the BRD format too early. The first version of DocEngine had a strict 14-section template. Customers, especially regulated ones, wanted to bring their own templates. We rebuilt the template layer to be customer-configurable in v2 — should have been v1.
- We waited too long to add the editing UX. The first releases were "generate and download." Customers wanted to refine in-place, section by section, with the AI as a co-editor. Adding inline editing took two months — it would have taken two weeks if we'd designed for it from day one.
What it costs you not to have this
Take a 12-person delivery team starting a new programme. The pre-build phase — BRD, SRD, FRS, backlog — typically eats 3–5 weeks of senior time before a single line of code is written. At GCC consulting rates that's a six-figure spend before any value is delivered.
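For a rough, purely illustrative calculation: four senior contributors spending four weeks largely on these documents, at a blended rate of a couple of hundred dollars an hour, is already north of $100k before the first sprint starts.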
DocEngine collapses the first draft of that into minutes. Not the final version — humans still own decisions, ambiguity, and trade-offs. But the blank-page phase, the part nobody enjoys, is gone.
That's the part we built for.
DocEngine is a Graffitecs build. See the case study at /disciplines/docengine.html, or sign up at app.docengine.dev.



