Engineering · RAG · June 12, 2026 · 9 min read

Production RAG: the architecture, the evals, and the gotchas

A field guide to shipping retrieval-augmented generation that's accurate, grounded and safe — with the architecture, an eval harness, and the traps that bite teams in production.

Most RAG demos look magical and most RAG deployments disappoint. The gap is rarely the model — it's retrieval quality, evaluation, and the unglamorous plumbing around them. Here's the architecture we ship, and the mistakes we've learned to avoid.

The pipeline at a glance

A query is embedded, matched against your indexed documents, and the LLM answers only from what's retrieved — with citations.

Naive LLM vs grounded RAG

	Naive LLM	Grounded RAG
Source of truth	Model weights (training data, largely the open web)	Your documents
Hallucination risk	High	Low (constrained to context)
Citations	None	Every claim
Data control	Leaves your perimeter	Stays private
Freshness	Frozen at training	Live (re-index any time)

Gotcha #1: chunking dominates quality. Most "the model is dumb" complaints are actually retrieval misses. Chunk on semantic boundaries, keep 10–15% overlap between adjacent chunks, and store metadata (title, section) with each chunk so you can filter at query time.
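Here's a minimal sketch of that chunking advice, assuming blank-line paragraph breaks are a good-enough proxy for semantic boundaries. The Chunk type, the chunk_section helper and the max_chars=800 default are illustrative choices of ours, not part of any particular library; tune them against your own corpus.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str    # document title, kept for filtering and display
    section: str  # section heading, kept for filtering and display


def chunk_section(title, section, body, max_chars=800, overlap_ratio=0.12):
    """Split one section into chunks on paragraph boundaries, with overlap.

    Assumption: paragraphs (blank-line separated) stand in for semantic
    boundaries. A single paragraph longer than max_chars is kept whole.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(Chunk(current, title, section))
            # Carry the tail of the finished chunk forward as overlap
            # (character-based, so it may start mid-word).
            keep = max(1, int(len(current) * overlap_ratio))
            current = f"{current[-keep:]}\n\n{para}"
        else:
            current = candidate
    if current:
        chunks.append(Chunk(current, title, section))
    return chunks
```

The overlap here is measured in characters for simplicity; if your embedding model has a token limit, counting tokens instead is the safer choice.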

Retrieval, in code

# embed the query, fetch top-k, then constrain the prompt to that context
hits = vector_store.search(embed(query), top_k=6, filter={"tenant": tenant_id})
context = "\n\n".join(f"[{i+1}] {h.text}" for i, h in enumerate(hits))

answer = llm.complete(
    system="Answer ONLY from the context. Cite sources as [n]. If unsure, say so.",
    user=f"Context:\n{context}\n\nQuestion: {query}",
)

Measure it, or it'll regress

Ship an eval harness before you widen access, and keep it running in production:

Metric	What it catches	Target
Groundedness	Claims not supported by context	> 95%
Answer accuracy	Wrong-but-confident answers	> 90%
Refusal rate	Answering when it shouldn't, or refusing when it could	Calibrated (refuses when the context lacks the answer, answers otherwise)
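To make those metrics concrete, here's a rough sketch of the scoring half of a harness. It assumes each eval case has already been labelled, by a human reviewer or a judge model, with the four booleans below. EvalResult, summarize and passes are hypothetical names, and computing accuracy only over answered, answerable questions is one reasonable definition among several.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    answerable: bool  # eval-set label: does the corpus contain the answer?
    refused: bool     # did the assistant decline to answer?
    grounded: bool    # judge label: every claim supported by the context
    correct: bool     # judge label: answer matches the reference answer


def _rate(items, predicate):
    """Fraction of items matching predicate, or None if there are no items."""
    if not items:
        return None
    return sum(1 for r in items if predicate(r)) / len(items)


def summarize(results):
    answered = [r for r in results if not r.refused]
    answerable = [r for r in results if r.answerable]
    unanswerable = [r for r in results if not r.answerable]
    return {
        "groundedness": _rate(answered, lambda r: r.grounded),
        "accuracy": _rate([r for r in answered if r.answerable],
                          lambda r: r.correct),
        # Refusal is split in two so both failure directions stay visible.
        "false_refusal_rate": _rate(answerable, lambda r: r.refused),
        "missed_refusal_rate": _rate(unanswerable, lambda r: not r.refused),
    }


def passes(summary, min_groundedness=0.95, min_accuracy=0.90):
    """Gate a release on the targets from the table above."""
    g, a = summary["groundedness"], summary["accuracy"]
    return g is not None and a is not None and g > min_groundedness and a > min_accuracy
```

Running summarize on every build and failing the pipeline when passes returns False is one way to catch regressions before users do.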

Citations aren't decoration — they're the audit trail that makes GenAI trustworthy at work.
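An audit trail is only useful if someone checks it. As a sketch, assuming answers cite sources as [n] placed before the sentence's closing punctuation (as in the prompt above), a small post-check can flag citations that point at nothing and sentences that cite nothing. check_citations is a hypothetical helper, and the sentence splitting is deliberately naive.

```python
import re


def check_citations(answer, num_sources):
    """Flag out-of-range citations and sentences with no citation at all."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    # Valid markers are 1..num_sources, matching the numbered context blocks.
    invalid = sorted(n for n in cited if not 1 <= n <= num_sources)
    # Naive split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not re.search(r"\[\d+\]", s)]
    return {
        "cited": sorted(cited),
        "invalid": invalid,
        "uncited_sentences": uncited,
    }
```

This only verifies that a citation exists and is in range; whether the cited passage actually supports the claim is what the groundedness metric is for.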

The gotchas, in one list

Stale index — wire re-indexing to your source-of-truth, not a quarterly cron.
No access control — retrieval must respect per-user permissions, or you leak.
Eval drift — refresh your eval set as the corpus and questions evolve.
Cost creep — cache embeddings, right-size top_k, and monitor token spend.
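Two of these are cheap to sketch. The snippet below assumes each hit carries the set of groups allowed to read it, and that your embedding function is deterministic for a given text. Hit, filter_by_permission and EmbeddingCache are illustrative names, not a real library's API.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    text: str
    allowed_groups: frozenset  # groups permitted to read the source document


def filter_by_permission(hits, user_groups):
    """Drop any hit the user cannot read before it reaches the prompt."""
    groups = set(user_groups)
    return [h for h in hits if h.allowed_groups & groups]


class EmbeddingCache:
    """In-memory cache keyed by a hash of the text (a sketch, not persistent)."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # assumed: text -> vector, deterministic
        self._store = {}
        self.misses = 0            # number of times embed_fn was actually called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Filtering inside the vector store query is usually the better first line of defence, since it keeps restricted text out of the result set entirely; a post-filter like this is a backstop. If you change embedding models, include the model name in the cache key so stale vectors are not reused.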

Get retrieval, evals and guardrails right and a RAG assistant stops being a demo — and starts being something you can put in front of customers.
