RAG Systems in Production: What the Tutorials Skip

Every RAG tutorial follows the same pattern: chunk your documents, embed them, store the vectors in a vector database, retrieve the relevant chunks, and generate an answer. It works beautifully in the demo. Then you deploy it to production and the cracks show.

Chunking is an art, not a setting

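The default chunking strategy - split every 500 tokens with 50-token overlap - works for blog posts. It fails for technical documentation, legal contracts, and medical records, where meaning depends on structure: sections, clauses, tables. Each document type needs a chunking strategy that preserves semantic meaning. A chunk that splits a table in half is worse than no chunk at all, because the retriever can return rows without their headers and the model will answer from the fragment anyway.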
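One way to keep a table or clause intact is to chunk on block boundaries instead of a fixed token count. The sketch below is a minimal illustration, not a recommended implementation: the function name `chunk_blocks`, the `max_tokens` parameter, the assumption that blocks are separated by blank lines, and the use of whitespace-separated words as a stand-in for tokens are all hypothetical choices made for the example.

```python
def chunk_blocks(text: str, max_tokens: int = 500) -> list[str]:
    """Pack blank-line-separated blocks into chunks without splitting a block."""
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for block in blocks:
        # Crude token count (assumption): whitespace-separated words.
        n = len(block.split())
        # Close the current chunk if this block would push it over budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        # A block larger than the budget still goes in whole, as its own chunk.
        current.append(block)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "intro words here\n\n| a | b |\n| 1 | 2 |\n\nclosing text"
for chunk in chunk_blocks(doc, max_tokens=8):
    print(repr(chunk))
```

With a budget of 8 words the two-row table ends up alone in the second chunk instead of being cut between rows. A real pipeline would likely swap the word count for the embedding model's tokenizer and use a parser suited to each document type.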

Embedding drift is real

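Embedding models get updated. Vectors produced by different model versions live in different vector spaces, so a query embedded with the new model cannot be meaningfully compared against documents embedded with the old one. Switching models therefore means re-embedding the corpus. You need a strategy for versioned embeddings and incremental re-indexing that does not require downtime.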
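A common shape for this is to tag every vector with the model version that produced it, fill the new version in the background, and switch queries over only once the new version covers the whole corpus. The following is a toy in-memory sketch of that idea. The class `VersionedIndex` and its methods are hypothetical names, not the API of any vector database, and the dot-product scan stands in for a real nearest-neighbour search.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VersionedIndex:
    # model version -> {doc_id: vector}
    vectors: dict = field(default_factory=dict)
    # the version that currently serves queries
    active: Optional[str] = None

    def upsert(self, version: str, doc_id: str, vector: list) -> None:
        self.vectors.setdefault(version, {})[doc_id] = vector

    def coverage(self, version: str, corpus_ids: list) -> float:
        """Fraction of the corpus already embedded under this version."""
        done = self.vectors.get(version, {})
        return sum(1 for d in corpus_ids if d in done) / len(corpus_ids)

    def promote(self, version: str, corpus_ids: list) -> None:
        """Cut queries over to a version, but only once it is complete."""
        if self.coverage(version, corpus_ids) < 1.0:
            raise ValueError(f"{version} is not fully indexed yet")
        self.active = version

    def search(self, query_vector: list, query_version: str, k: int = 3) -> list:
        # Refuse to compare vectors that come from different models.
        if query_version != self.active:
            raise ValueError("query was embedded with a non-active model version")

        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))

        ranked = sorted(
            self.vectors[self.active].items(),
            key=lambda item: dot(query_vector, item[1]),
            reverse=True,
        )
        return [doc_id for doc_id, _ in ranked[:k]]
```

The old version keeps serving traffic while the new one is being filled, which is what avoids downtime. Once `promote` succeeds, the old version's vectors can be dropped.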

Latency budgets matter

Users expect answers in under 3 seconds. A typical RAG pipeline involves: embedding the query (100-200ms), vector search (50-100ms), re-ranking (200-400ms), and generation (1-3 seconds). That adds up to roughly 1.35-3.7 seconds before any network latency, so the slow end already blows the 3-second target. Retrieval alone can take up to 700ms, and every millisecond spent there comes straight out of the time left for generation.
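The arithmetic is worth writing down explicitly, since it shows how little room generation has once retrieval is paid for. This small sketch sums the per-stage figures quoted above. The stage names, the function names, and the 3000ms target passed in are illustrative assumptions.

```python
# Per-stage latency ranges in milliseconds (best case, worst case),
# taken from the figures quoted in the text.
STAGES_MS = {
    "embed_query": (100, 200),
    "vector_search": (50, 100),
    "rerank": (200, 400),
    "generation": (1000, 3000),
}


def total_range_ms(stages: dict) -> tuple:
    """End-to-end best and worst case, ignoring network latency."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst


def generation_budget_ms(target_ms: int, stages: dict) -> int:
    """Time left for generation if every retrieval stage hits its worst case."""
    retrieval_worst = sum(
        hi for name, (_, hi) in stages.items() if name != "generation"
    )
    return target_ms - retrieval_worst


print(total_range_ms(STAGES_MS))              # (1350, 3700)
print(generation_budget_ms(3000, STAGES_MS))  # 2300
```

Under these numbers a 3-second target leaves about 2.3 seconds for generation in the worst case, which is less than the 3 seconds generation itself can take.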

Evaluation is the missing piece

How do you know your RAG system is giving good answers? Many teams ship without an evaluation framework and find out from user complaints. At minimum, you need: retrieval precision (are the right chunks being found?), answer faithfulness (is the answer grounded in the retrieved context?), and answer relevance (does it actually answer the question?).
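As a rough starting point, two of these metrics can be approximated in a few lines. The sketch below computes precision at k against hand-labelled relevant chunk IDs, plus a deliberately crude word-overlap proxy for faithfulness. All function names, the shape of the golden cases, and the `retrieve` callable are assumptions for illustration. Word overlap is only a smoke test, not a substitute for a proper faithfulness check.

```python
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Share of the top-k retrieved chunk IDs that are labelled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def grounded_fraction(answer: str, context: str) -> float:
    """Crude faithfulness proxy: share of answer words found in the context."""
    context_words = set(context.lower().split())
    answer_words = answer.lower().split()
    if not answer_words:
        return 0.0
    return sum(1 for w in answer_words if w in context_words) / len(answer_words)


def mean_precision(retrieve, golden: list, k: int = 3) -> float:
    """Average precision@k over golden cases.

    `retrieve` is any callable mapping a question to a ranked list of chunk IDs.
    """
    scores = [
        precision_at_k(retrieve(case["question"]), case["relevant"], k)
        for case in golden
    ]
    return sum(scores) / len(scores)


# Hypothetical golden set and a canned retriever standing in for the real one.
golden = [
    {"question": "q1", "relevant": {"c1", "c3"}},
    {"question": "q2", "relevant": {"c9"}},
]
canned = {"q1": ["c1", "c7", "c3"], "q2": ["c9", "c2", "c4"]}
print(mean_precision(canned.get, golden, k=3))
```

Running this on every change to chunking, embeddings, or re-ranking turns "it seems better" into a number that can regress visibly.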

The production checklist

Before shipping a RAG system: define your chunking strategy per document type, implement embedding versioning, set latency budgets per pipeline stage, build an evaluation harness with golden test cases, and add observability for every stage of the pipeline. The tutorials give you the architecture. Production demands the engineering.
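For the observability item, even a small timing wrapper around each stage goes a long way. This is a minimal sketch using only the standard library. `StageTimer`, its methods, and the budget values are hypothetical. In a real system the measurements would be exported to whatever metrics or tracing backend is already in use.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Records wall-clock time per pipeline stage and flags budget overruns."""

    def __init__(self, budgets_ms: dict):
        self.budgets_ms = budgets_ms
        self.elapsed_ms = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.elapsed_ms[name] = (time.perf_counter() - start) * 1000

    def over_budget(self) -> list:
        """Names of stages whose measured time exceeded their budget."""
        return [
            name
            for name, ms in self.elapsed_ms.items()
            if ms > self.budgets_ms.get(name, float("inf"))
        ]


timer = StageTimer({"vector_search": 100, "rerank": 400})
with timer.stage("vector_search"):
    time.sleep(0.01)  # stand-in for the real search call
with timer.stage("rerank"):
    time.sleep(0.01)  # stand-in for the real re-ranker
print(timer.elapsed_ms, timer.over_budget())
```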