Engineering · RAG · June 12, 2026 · 9 min read
Production RAG: the architecture, the evals, and the gotchas
A field guide to shipping retrieval-augmented generation that's accurate, grounded and safe — with the architecture, an eval harness, and the traps that bite teams in production.
Most RAG demos look magical and most RAG deployments disappoint. The gap is rarely the model — it's retrieval quality, evaluation, and the unglamorous plumbing around them. Here's the architecture we ship, and the mistakes we've learned to avoid.
The pipeline at a glance
Naive LLM vs grounded RAG
| | Naive LLM | Grounded RAG |
|---|---|---|
| Source of truth | Model weights (the open web) | Your documents |
| Hallucination risk | High | Lower (answers constrained to retrieved context) |
| Citations | None | Every claim |
| Data control | Leaves your perimeter | Stays private |
| Freshness | Frozen at training | Live (re-index any time) |
Retrieval, in code
# embed the query, fetch the top-k passages, then constrain the prompt to that context
hits = vector_store.search(embed(query), top_k=6, filter={"tenant": tenant_id})
# number each passage so the model can cite it as [n]
context = "\n\n".join(f"[{i+1}] {h.text}" for i, h in enumerate(hits))
answer = llm.complete(
    system="Answer ONLY from the context. Cite sources as [n]. If unsure, say so.",
    user=f"Context:\n{context}\n\nQuestion: {query}",
)
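To make the retrieval step concrete, here is a minimal, self-contained sketch that can run without any external service. `InMemoryStore`, `Hit`, `cosine` and `build_context` are hypothetical stand-ins invented for illustration, and the two-dimensional vectors are hand-written rather than produced by a real embedding model. A production vector store would use an approximate index instead of this brute-force scan.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Hit:
    text: str
    score: float


@dataclass
class InMemoryStore:
    """Toy stand-in for a vector store: brute-force search plus a metadata filter."""
    rows: list = field(default_factory=list)  # each row is (vector, text, metadata)

    def add(self, vector, text, metadata):
        self.rows.append((vector, text, metadata))

    def search(self, query_vec, top_k, filter):
        # Apply the metadata filter BEFORE ranking, so other tenants' rows never compete.
        allowed = [r for r in self.rows
                   if all(r[2].get(k) == v for k, v in filter.items())]
        scored = [Hit(text, cosine(query_vec, vec)) for vec, text, _ in allowed]
        return sorted(scored, key=lambda h: h.score, reverse=True)[:top_k]


def build_context(hits):
    # Number each passage so the model can cite it as [n].
    return "\n\n".join(f"[{i + 1}] {h.text}" for i, h in enumerate(hits))


store = InMemoryStore()
store.add([1.0, 0.0], "Refunds are issued within 14 days.", {"tenant": "acme"})
store.add([0.9, 0.1], "Refunds require a receipt.", {"tenant": "acme"})
store.add([1.0, 0.0], "Globex refunds take 30 days.", {"tenant": "globex"})

hits = store.search([1.0, 0.0], top_k=2, filter={"tenant": "acme"})
context = build_context(hits)
print(context)
```

The Globex passage matches the query vector exactly, yet it never reaches the prompt because the tenant filter removes it before scoring.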
Measure it, or it'll regress
Ship an eval harness before you widen access, and keep it running in production:
| Metric | What it catches | Target |
|---|---|---|
| Groundedness | Claims not supported by context | > 95% |
| Answer accuracy | Wrong-but-confident answers | > 90% |
| Refusal rate | Answering when it should decline, or declining when it could answer | Calibrated against questions labelled unanswerable |
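One way these metrics might be computed from graded results is sketched below. The `EvalResult` fields and the `summarize` function are assumptions made for illustration, not a standard schema. The sketch also assumes a separate grader (human or model) has already counted the claims in each answer and how many of them the retrieved context supports.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    answerable: bool  # label: does the corpus actually contain the answer?
    refused: bool     # did the assistant decline to answer?
    correct: bool     # graded against a reference answer
    claims: int       # factual claims found in the answer
    supported: int    # claims the grader found supported by the retrieved context


def rate(numerator, denominator):
    """Safe ratio; returns None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def summarize(results):
    answered = [r for r in results if not r.refused]
    answerable = [r for r in results if r.answerable]
    unanswerable = [r for r in results if not r.answerable]
    return {
        # Share of claims in given answers that are backed by context.
        "groundedness": rate(sum(r.supported for r in answered),
                             sum(r.claims for r in answered)),
        # Share of answerable questions that were answered correctly.
        "accuracy": rate(sum(1 for r in answerable if r.correct and not r.refused),
                         len(answerable)),
        # Should be high: the assistant declines when the corpus has no answer.
        "refusal_on_unanswerable": rate(sum(1 for r in unanswerable if r.refused),
                                        len(unanswerable)),
        # Should be low: the assistant declines even though it could answer.
        "refusal_on_answerable": rate(sum(1 for r in answerable if r.refused),
                                      len(answerable)),
    }


results = [
    EvalResult(answerable=True, refused=False, correct=True, claims=4, supported=4),
    EvalResult(answerable=True, refused=False, correct=False, claims=2, supported=1),
    EvalResult(answerable=False, refused=True, correct=False, claims=0, supported=0),
    EvalResult(answerable=False, refused=False, correct=False, claims=2, supported=0),
]
metrics = summarize(results)
print(metrics)
```

Splitting refusal rate into two numbers is a design choice: a single aggregate can look healthy while the assistant both over-answers and over-refuses.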
Citations aren't decoration — they're the audit trail that makes GenAI trustworthy at work.
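If citations are the audit trail, it may be worth checking them mechanically before an answer is shown. The sketch below is a hypothetical `check_citations` helper that flags `[n]` markers pointing outside the retrieved sources and sentences that carry no citation at all. It assumes markers sit before the sentence's closing punctuation, and it only checks that a citation exists, not that the cited passage actually supports the claim.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def check_citations(answer, num_sources):
    """Return cited source numbers, out-of-range citations, and uncited sentences."""
    cited = {int(n) for n in CITATION.findall(answer)}
    # A citation is invalid if it does not map to one of the numbered passages.
    invalid = sorted(n for n in cited if not 1 <= n <= num_sources)
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not CITATION.search(s)]
    return {"cited": sorted(cited), "invalid": invalid, "uncited_sentences": uncited}


answer = ("Refunds are issued within 14 days [1]. "
          "A receipt is required [3]. "
          "Contact support for help.")
report = check_citations(answer, num_sources=2)
print(report)
```

In this example only two passages were retrieved, so `[3]` is reported as invalid and the final sentence is reported as uncited.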
The gotchas, in one list
- Stale index: trigger re-indexing from changes in your source of truth, not from a quarterly cron.
- No access control: retrieval must respect per-user permissions, or you leak documents to people who shouldn't see them.
- Eval drift: refresh your eval set as the corpus and the questions evolve.
- Cost creep: cache embeddings, right-size top_k, and monitor token spend.
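Two of the items above can be sketched in a few lines. The first helper drops retrieved hits the current user may not read; the second caches embeddings by content hash so unchanged text is never embedded twice. `filter_by_permission`, `CachedEmbedder` and the `allowed_groups` field are hypothetical names, and `fake_embed` is a placeholder standing in for a real embedding call.

```python
import hashlib


def filter_by_permission(hits, user_groups):
    """Keep only hits whose `allowed_groups` overlap the user's groups.

    Assumes each hit is a dict carrying an `allowed_groups` set copied from
    the source system's permissions at index time.
    """
    return [h for h in hits if h["allowed_groups"] & user_groups]


class CachedEmbedder:
    """Wraps an embedding function with an in-memory cache keyed by content hash."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.misses = 0  # number of real embedding calls made

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]


hits = [
    {"text": "Salary bands", "allowed_groups": {"hr"}},
    {"text": "Deploy runbook", "allowed_groups": {"eng", "hr"}},
    {"text": "Q3 forecast", "allowed_groups": {"finance"}},
]
visible = filter_by_permission(hits, user_groups={"eng"})


def fake_embed(text):
    # Placeholder only: a real system would call an embedding model here.
    return [float(len(text))]


embedder = CachedEmbedder(fake_embed)
first = embedder.embed("Deploy runbook")
second = embedder.embed("Deploy runbook")  # served from the cache
print([h["text"] for h in visible], embedder.misses)
```

Filtering after retrieval can leave fewer than top_k passages, so where the vector store supports it, pushing the permission check into the query filter is usually the safer design. The same content hash can also tell a re-indexing job which documents changed and need re-embedding.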
Get retrieval, evals and guardrails right and a RAG assistant stops being a demo — and starts being something you can put in front of customers.