
Why production RAG is harder than it looks

Chunking, latency, cost, and evaluation: what changes when you leave the notebook and enter corporate SLAs and messy data.


In demos, RAG looks trivial: index PDFs, query the vector store, inject context into the prompt, done. In production it becomes a distributed product with latency, cost, privacy, and quality requirements that are hard to reconcile. This article organizes what usually breaks real projects—and how I tackle each part in practice.

Introduction: the gap between PoC and production

Most RAG tutorials stop when your first PDF answers questions in a Jupyter notebook. The pain starts when you handle ten thousand documents updated weekly, five hundred requests per minute, teams that don’t understand why it “only works sometimes,” and a product VP asking what it will cost per month.

Over the last few months, building AI systems in Cogna’s MarTech context, I learned that production RAG is not a model problem—it’s a systems engineering problem. Here are the real challenges and practical decisions that matter.

1. Real-world data is messy

Legal docs, internal policies, and legacy bases rarely arrive as clean Markdown. You get duplicate headings, scanned tables, repeated attachments, and conflicting versions. If chunking ignores semantic structure, the retriever mixes paragraphs from different contexts and the model “answers nicely” with wrong information.

Chunking strategies that actually work

Chunking is not just “split every 500 tokens.” Documents have structure—sections, subsections, tables, lists—and ignoring that yields fragmented contexts. In practice I use hybrid strategies:

  • Semantic splits by section when the document has clear markup (headings, XML tags)
  • Fallback to fixed windows with 50–100 token overlap when structure is inconsistent
  • Variable chunk sizes: short sections can be merged, long ones subdivided
  • Hierarchical context: each chunk carries metadata for its parent section title
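The hybrid strategy above can be sketched in a few lines. This is a minimal illustration, not a production splitter: it assumes markdown-style headings, uses whitespace words as a crude token proxy, and carries each section title as chunk metadata.

```python
import re

def fixed_windows(text, max_tokens, overlap):
    """Fixed sliding windows over a crude whitespace token proxy;
    swap in a real tokenizer for accurate budgets."""
    words = text.split()
    if not words:
        return []
    step = max_tokens - overlap
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), step)]

def chunk_document(text, max_tokens=500, overlap=75):
    """Hybrid chunking: split on markdown headings when the document
    has them, fall back to fixed overlapping windows otherwise."""
    parts = re.split(r"(?m)^(#{1,3} .+)$", text)
    if len(parts) == 1:  # no headings found: pure fixed-window fallback
        return [{"section": None, "text": p}
                for p in fixed_windows(text, max_tokens, overlap)]
    # text before the first heading keeps section=None
    chunks = [{"section": None, "text": p}
              for p in fixed_windows(parts[0], max_tokens, overlap)]
    for heading, body in zip(parts[1::2], parts[2::2]):
        for piece in fixed_windows(body, max_tokens, overlap):
            # hierarchical context: each chunk carries its section title
            chunks.append({"section": heading.strip(), "text": piece})
    return chunks
```

Long sections get subdivided by the window size automatically; merging short adjacent sections is a straightforward extension of the same loop.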

Handling heterogeneous formats

Native PDFs and scanned PDFs (OCR) need different pipelines. PDFs with complex tables need specialized tools like Camelot or Tabula. Word documents may lose formatting on conversion. The fix: ingestion pipelines per document type, with quality validation before embedding.
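One way to wire this up is a simple router plus a quality gate before embedding. The pipeline names and the noise heuristic are illustrative assumptions; real pipelines might wrap pypdf for native PDFs, Tesseract for scans, Camelot/Tabula for tables, and python-docx for Word files.

```python
from pathlib import Path

# Placeholder pipeline names; each would wrap a format-specific parser.
PIPELINES = {".pdf": "pdf_pipeline", ".docx": "docx_pipeline",
             ".md": "markdown_pipeline", ".html": "html_pipeline"}

def route_ingestion(path):
    """Pick an ingestion pipeline by file extension; fail loudly on
    unknown formats instead of embedding garbage."""
    suffix = Path(path).suffix.lower()
    if suffix not in PIPELINES:
        raise ValueError(f"unsupported document type: {suffix}")
    return PIPELINES[suffix]

def passes_quality_gate(text, min_words=20, max_noise_ratio=0.2):
    """Reject chunks that are too short or mostly non-text characters
    (typical OCR noise) before they reach the embedding step."""
    words = text.split()
    if len(words) < min_words:
        return False
    noise = sum(1 for ch in text
                if not (ch.isalnum() or ch.isspace() or ch in ".,;:!?()'\"-"))
    return noise / len(text) <= max_noise_ratio
```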

Metadata as first-class citizens

  • Normalize metadata: source, creation date, version, author, owning team
  • Pre-search filters: shrink the vector search space with structured filters
  • Document versioning: keep history for audit and rollback
  • Quality tags: mark chunks with confidence scores from heuristics
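A pre-search filter can be as simple as the sketch below, applied before any similarity scoring. In a real vector store (Pinecone, Qdrant, pgvector) these would be native metadata filters; the field names here are illustrative.

```python
from datetime import date

def prefilter(chunks, team=None, min_version=None, not_before=None):
    """Shrink the candidate set with structured metadata filters before
    any vector similarity is computed."""
    kept = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if team is not None and meta.get("owning_team") != team:
            continue
        if min_version is not None and meta.get("version", 0) < min_version:
            continue
        if not_before is not None and date.fromisoformat(meta["created"]) < not_before:
            continue
        kept.append(chunk)
    return kept
```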

2. Latency and cost are part of the contract

Each query can trigger embedding, top-k search, re-ranking, and an LLM call. Multiplied by thousands of users, cost becomes predictable OPEX or an invoice surprise. Latency above a few seconds ruins UX for interactive flows.

Anatomy of a RAG query and its costs

A typical RAG query goes through costly steps: (1) query embedding (~$0.0001 with text-embedding-3-small), (2) vector search in Pinecone-like stores, (3) optional cross-encoder re-ranking (+10–50ms), (4) LLM call with retrieved context (~$0.002–0.02 depending on model and context). At 100k queries/day that can be hundreds to thousands of dollars per day in API spend alone.
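A back-of-envelope calculator makes those numbers concrete. The default per-call prices below are assumptions drawn from the ranges above, not quoted rates; plug in your own.

```python
def daily_cost_usd(queries_per_day, embed_cost=0.0001, llm_cost=0.01,
                   cache_hit_rate=0.0):
    """Rough daily API spend for a RAG pipeline: queries that miss the
    cache pay for one embedding call plus one LLM call each."""
    billed_queries = queries_per_day * (1 - cache_hit_rate)
    return billed_queries * (embed_cost + llm_cost)
```

At 100k queries/day with these assumptions, spend lands around $1,000/day; a 50% cache hit rate halves it, which is why caching comes first in the list below.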

Cost optimization strategies

  • Embedding cache for frequent queries (Redis/Memcached)
  • Response cache with smart TTL when context hasn’t changed
  • Smaller models for auxiliary steps (e.g. GPT-3.5 for routing, GPT-4 for final answer)
  • Summarize long chunks before the final model
  • Batch embedding jobs where possible
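The embedding cache from the first bullet can be sketched like this. The in-memory dict stands in for Redis/Memcached; keying on a hash of the normalized query means trivially different phrasings of the same text hit the same entry.

```python
import hashlib

class EmbeddingCache:
    """Embedding cache keyed by normalized query text. The dict here
    would typically be Redis or Memcached in production."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # e.g. a call to an embeddings API
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, query):
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.embed_fn(query)
        return self.store[key]
```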

Latency without blowing the budget

Latency has three main parts: embedding time, vector search, and LLM inference. To reduce it: smaller embeddings when possible, approximate search (HNSW, IVF), stream LLM tokens for perceived speed, and host models near users.

A well-architected RAG system can answer in under two seconds, cost under $0.005 per query, and still scale 10x.
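One practical way to hold that two-second line is to give each pipeline step an explicit budget and alert on violations. The split below is an assumption to tune against your own measurements.

```python
# Per-step budgets summing to the ~2s interactive target (assumed split).
LATENCY_BUDGET_MS = {"embedding": 50, "vector_search": 100,
                     "rerank": 50, "llm": 1800}

def over_budget_steps(measured_ms):
    """Return the pipeline steps that blew their latency budget;
    steps without a budget are ignored."""
    return [step for step, ms in measured_ms.items()
            if ms > LATENCY_BUDGET_MS.get(step, float("inf"))]
```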

3. Evaluation is not a leaderboard metric

MMLU and generic benchmarks don’t tell you if your assistant gets your contract clauses right. You need domain-labeled eval sets, regression tests after prompt or index changes, and—when possible—human feedback on samples.

Building evaluation datasets from scratch

Start with gold Q&A pairs. Aim for 50–100 pairs covering common cases (~80% of expected use), edge cases that already broke the system, adversarial or ambiguous questions, and “I don’t know” cases where that’s the correct answer.
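A gold Q&A set doesn't need tooling to start: JSONL, one case per line, is easy to diff in code review. The schema and example cases below are illustrative.

```python
import json

# Minimal gold Q&A schema; field names and examples are illustrative.
EVAL_CASES = [
    {"question": "What is the notice period for termination?",
     "expected_answer": "30 days", "category": "common",
     "relevant_doc_ids": ["contract-v3#sec-12"]},
    {"question": "What is our policy on quantum contracts?",
     "expected_answer": "I don't know", "category": "no_answer",
     "relevant_doc_ids": []},
]

def save_eval_set(path, cases):
    """Store the eval set as JSONL: one case per line."""
    with open(path, "w") as f:
        for case in cases:
            f.write(json.dumps(case) + "\n")

def load_eval_set(path):
    with open(path) as f:
        return [json.loads(line) for line in f]
```

Note the explicit `no_answer` category: "I don't know" cases only get tested if they exist in the dataset.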

Metrics that actually matter

  • Retrieval precision@k: of the k chunks returned, how many are relevant?
  • Retrieval recall: of all relevant chunks, how many were retrieved?
  • Answer accuracy: is the final answer factually correct? (human validation)
  • Citation rate: did the model cite sources correctly?
  • “I don’t know” rate: did it admit insufficient evidence?
  • Latency p95: what’s the 95th percentile response time?
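The two retrieval metrics at the top of the list are a few lines of code each, computed per eval case from retrieved chunk IDs against the labeled relevant IDs:

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Of the top-k chunks returned, what fraction is relevant?"""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in relevant_ids) / len(top)

def recall(retrieved_ids, relevant_ids):
    """Of all relevant chunks, what fraction was retrieved at all?"""
    if not relevant_ids:
        return 1.0  # nothing to find: trivially perfect
    return sum(1 for d in relevant_ids if d in retrieved_ids) / len(relevant_ids)
```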

Regression tests in CI/CD

Every prompt change, index update, or model swap should run against the eval set. I wire this into CI: run test queries, compare to expected answers (LLM-as-judge + structured checks), and block deploys if accuracy drops more than ~5%. That stops silent regressions.
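The gate itself is trivial; the value is wiring it into CI. The sketch below uses exact-match grading for brevity, with the answer function and baseline as assumed inputs; a real setup would add an LLM-as-judge for free-form answers.

```python
def run_eval(eval_cases, answer_fn):
    """Score the pipeline against gold Q&A pairs with exact match
    (case- and whitespace-insensitive)."""
    correct = sum(
        answer_fn(c["question"]).strip().lower()
        == c["expected_answer"].strip().lower()
        for c in eval_cases)
    return correct / len(eval_cases)

def gate_deploy(accuracy, baseline=0.90, max_drop=0.05):
    """Block the deploy when accuracy falls more than max_drop
    (absolute) below the recorded baseline."""
    return accuracy >= baseline - max_drop
```

In CI this runs as a test that fails the build when `gate_deploy` returns False, and the baseline gets bumped deliberately, never silently.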

4. Observability is mandatory

Without observability you fly blind. When someone says “the AI was wrong,” you must reconstruct the query, retrieved chunks, context sent to the LLM, response, and cost.

Observability stack for RAG

At minimum: structured logs with trace IDs across the pipeline, per-step latency metrics, per-query and per-user cost tracking, query/response storage for audit, and dashboards for volume, latency, cost, and errors.
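The core of that minimum is one structured log line per pipeline step, all sharing a trace ID so a full query can be reconstructed later. A minimal sketch, with field names as assumptions:

```python
import json
import time
import uuid

def log_step(trace_id, step, **fields):
    """Emit one structured JSON log line per pipeline step; in
    production this would ship to CloudWatch/Datadog instead of stdout."""
    record = {"trace_id": trace_id, "step": step, "ts": time.time(), **fields}
    print(json.dumps(record))
    return record

# One trace ID threads through every step of a single query.
trace_id = str(uuid.uuid4())
log_step(trace_id, "retrieval", latency_ms=42, top_k=5)
log_step(trace_id, "llm_call", latency_ms=950, cost_usd=0.012, model="gpt-4o")
```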

  • LangSmith: great traces for LangChain chains, prompts, and costs
  • Weights & Biases: experiments, embedding comparisons, fine-tuning
  • Prometheus + Grafana: infra metrics and production alerts
  • Custom JSON logs to CloudWatch/Datadog with standard fields

Production RAG is data engineering + prompt engineering + systems engineering. The model is just one component.

5. Security and privacy are not afterthoughts

Sensitive enterprise data must not leak across users or hit external models without controls. RAG adds attack surface: malicious document injection, query-based exfiltration, and exposure via bad citations.

Essential security controls

  • Data isolation: multi-tenant vector DB with per-user/org ACLs
  • Input sanitization for queries
  • Output filtering so citations don’t leak other users’ data
  • Audit logs for document access via RAG
  • Self-hosted or private endpoints (e.g. Azure OpenAI, local models) for regulated data (GDPR, HIPAA)
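The isolation and output-filtering bullets come down to one rule: ACL checks happen on retrieved chunks before they ever reach the prompt, never after generation. A sketch, with the `org`/`acl` metadata fields as illustrative assumptions:

```python
def enforce_acl(chunks, org_id, user_id):
    """Drop retrieved chunks the caller may not see, before prompt
    assembly, so neither the answer nor its citations can leak them."""
    visible = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if meta.get("org") != org_id:
            continue  # hard tenant boundary
        allowed_users = meta.get("acl", {}).get("users")
        if allowed_users is not None and user_id not in allowed_users:
            continue  # per-user restriction within the tenant
        visible.append(chunk)
    return visible
```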

6. What to ship next sprint

Start with the minimum auditable path: versioned ingestion, retrieval metrics (precision@k on real queries), and a per-request cost panel. Only then add re-ranking, multi-query, and agents—each layer adds failure modes and observability needs.

Production-ready RAG checklist

  • Versioned ingestion pipeline with rollback
  • Normalized metadata and pre-search filters
  • Eval dataset with at least ~100 test cases
  • Automated regression tests in CI/CD
  • Structured logs with trace IDs end-to-end
  • Per-query and per-user cost dashboard
  • Alerts on latency p95 and error rate
  • Access control and multi-tenant isolation
  • Runbooks for common incidents

Conclusion

Production RAG is not about the newest model or the fanciest embedding. It’s about building something reliable, observable, secure, and economically viable—turning an impressive demo into a product ops can run and finance can approve.

If you’re starting: don’t build everything at once. Ship the simplest pipeline, measure everything, evolve incrementally. If you already run RAG in production: review against the checklist above and prioritize gaps that hurt reliability and cost most.