
Engineering · RAG · June 12, 2026 · 9 min read

Production RAG: the architecture, the evals, and the gotchas

A field guide to shipping retrieval-augmented generation that's accurate, grounded and safe — with the architecture, an eval harness, and the traps that bite teams in production.

Most RAG demos look magical and most RAG deployments disappoint. The gap is rarely the model — it's retrieval quality, evaluation, and the unglamorous plumbing around them. Here's the architecture we ship, and the mistakes we've learned to avoid.

The pipeline at a glance

[Diagram: Query → Retriever (embed + search) → Vector store (your docs) → LLM + guardrails → Cited answer]
A query is embedded, matched against your indexed documents, and the LLM answers only from what's retrieved — with citations.

Naive LLM vs grounded RAG

Aspect               Naive LLM                      Grounded RAG
Source of truth      Model weights (the open web)   Your documents
Hallucination risk   High                           Low (constrained to context)
Citations            None                           Every claim
Data control         Leaves your perimeter          Stays private
Freshness            Frozen at training             Live (re-index any time)

Gotcha #1: chunking dominates quality. Most "the model is dumb" complaints are actually retrieval misses. Chunk on semantic boundaries, keep 10–15% overlap, and store metadata (title, section) for filtering.
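To make the chunking advice concrete, here is a minimal sketch. It assumes paragraphs (blank-line separated) are a good enough proxy for semantic boundaries, and the names `Chunk`, `chunk_section`, `max_chars` and `overlap_ratio` are ours for illustration, not part of any library. Treat it as a starting point, not a tuned splitter.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    title: str    # metadata stored with the chunk so retrieval can filter on it
    section: str


def chunk_section(title: str, section: str, body: str,
                  max_chars: int = 800, overlap_ratio: float = 0.12) -> list[Chunk]:
    """Pack whole paragraphs into chunks of roughly max_chars, with a small overlap.

    max_chars is a soft limit: a single paragraph longer than max_chars is
    kept whole rather than split mid-paragraph.
    """
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
    overlap = round(max_chars * overlap_ratio)
    chunks: list[Chunk] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(Chunk(current, title, section))
            # Carry the tail of the finished chunk into the next one.
            tail = current[-overlap:] if overlap > 0 else ""
            # Drop a leading partial word so the overlap starts on a word boundary.
            tail = tail[tail.find(" ") + 1:]
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(Chunk(current, title, section))
    return chunks
```

The default `overlap_ratio` of 0.12 sits inside the 10–15% range suggested above; character counts stand in for tokens here, which is an approximation.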

Retrieval, in code

# embed the query, fetch top-k, then constrain the prompt to that context
hits = vector_store.search(embed(query), top_k=6, filter={"tenant": tenant_id})
context = "\n\n".join(f"[{i+1}] {h.text}" for i, h in enumerate(hits))

answer = llm.complete(
    system="Answer ONLY from the context. Cite sources as [n]. If unsure, say so.",
    user=f"Context:\n{context}\n\nQuestion: {query}",
)

Measure it, or it'll regress

Ship an eval harness before you widen access, and keep it running in production:

Metric            What it catches                   Target
Groundedness      Claims not supported by context   > 95%
Answer accuracy   Wrong-but-confident answers       > 90%
Refusal rate      Answering when it shouldn't       Calibrated
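One way to turn those three metrics into a release gate is sketched below. It assumes each eval case has already been judged (by a human or an LLM judge) and only does the aggregation; the `EvalResult` fields and function names are hypothetical. The thresholds are the targets from the table. Refusals are reported as two rates rather than gated, since the table gives no numeric target for them.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    claims_total: int      # claims the judge found in the answer
    claims_supported: int  # of those, claims backed by the retrieved context
    correct: bool          # answer matches the reference answer
    refused: bool          # the assistant declined to answer
    should_refuse: bool    # the eval set marks the question as unanswerable


def _share(count: int, total: int, empty: float) -> float:
    """Ratio with an explicit value for the empty case."""
    return count / total if total else empty


def summarize(results: list[EvalResult]) -> dict[str, float]:
    answered = [r for r in results if not r.refused]
    answerable = [r for r in results if not r.should_refuse]
    unanswerable = [r for r in results if r.should_refuse]
    return {
        # Share of claims in answered cases that the context supports.
        "groundedness": _share(sum(r.claims_supported for r in answered),
                               sum(r.claims_total for r in answered), 1.0),
        # Share of answerable questions answered correctly.
        "accuracy": _share(sum(r.correct for r in answerable), len(answerable), 1.0),
        # Answered when it should have refused.
        "missed_refusal_rate": _share(sum(not r.refused for r in unanswerable),
                                      len(unanswerable), 0.0),
        # Refused when it could have answered.
        "over_refusal_rate": _share(sum(r.refused for r in answerable),
                                    len(answerable), 0.0),
    }


def meets_targets(summary: dict[str, float]) -> bool:
    """Gate on the numeric targets: groundedness > 95%, accuracy > 90%."""
    return summary["groundedness"] > 0.95 and summary["accuracy"] > 0.90
```

Running `meets_targets(summarize(results))` in CI and on a sample of production traffic is one way to keep the harness running after launch.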

Citations aren't decoration — they're the audit trail that makes GenAI trustworthy at work.
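An audit trail is only useful if the citations are checked. As a rough sketch, assuming answers cite sources as `[n]` (as in the system prompt earlier) and place the marker before the sentence's closing punctuation, a post-generation check could look like this. `check_citations` is a hypothetical helper, and the sentence splitter is deliberately naive (abbreviations such as "e.g." will confuse it).

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def check_citations(answer: str, num_sources: int) -> dict:
    """Flag citations that point at no source, and sentences with no citation."""
    cited = {int(n) for n in CITATION.findall(answer)}
    # Valid citation numbers are 1..num_sources, matching the numbered context.
    invalid = sorted(n for n in cited if not 1 <= n <= num_sources)
    # Naive split: break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not CITATION.search(s)]
    return {
        "invalid": invalid,
        "uncited_sentences": uncited,
        "ok": not invalid and not uncited,
    }
```

This only verifies that citations exist and resolve to a retrieved source. Whether the cited passage actually supports the claim is the groundedness question, and needs a judge.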

The gotchas, in one list

  • Stale index — wire re-indexing to your source-of-truth, not a quarterly cron.
  • No access control — retrieval must respect per-user permissions, or you leak.
  • Eval drift — refresh your eval set as the corpus and questions evolve.
  • Cost creep — cache embeddings, right-size top_k, and monitor token spend.
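For the cost-creep item, caching embeddings by content hash is a small change that tends to pay off quickly. A minimal sketch follows, assuming an `embed_fn` that maps a string to a vector. `CachedEmbedder` is a hypothetical wrapper, and the in-memory dict is a placeholder for whatever persistent store you would use in production.

```python
import hashlib
from typing import Callable


class CachedEmbedder:
    """Wrap an embedding function so identical text is only embedded once."""

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self.misses = 0  # number of calls that reached the underlying embed_fn

    def __call__(self, text: str) -> list[float]:
        # Key on a hash of the content, so the cache survives renames and moves.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]
```

The same content hash can help with the stale-index item: if a chunk's hash is unchanged since the last run, a re-index can skip it, which makes frequent re-indexing much cheaper.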

Get retrieval, evals and guardrails right and a RAG assistant stops being a demo — and starts being something you can put in front of customers.