Production RAG: cost, latency, operations + project
What separates a RAG demo from a reliable, cheap production system.
A RAG demo impresses in ten minutes. A RAG system in production needs to work on the ten-thousandth request, cost what was promised, and be debuggable when something goes wrong. This lesson closes the course with what truly separates the two worlds: FinOps, latency, continuous operation, observability, and security — plus a guided project so you leave with a real architecture sketched out.
RAG FinOps: where the money goes and how to control it
The cost of a RAG pipeline lives in two completely different places: ingestion and query.
At ingestion, you pay for embeddings. You do this once (or a few times, on reindexing). Use the cheapest model that still delivers sufficient quality for your domain; Amazon Titan Text Embeddings V2 is a solid choice for Portuguese and English documents. Cost here is proportional to text volume, not to the number of users.
At query time, the dominant cost is generation (the LLM). Every token in context costs money. That's why top-k matters: retrieving 20 chunks and dumping them all into the prompt can be up to three times more expensive than retrieving 5 well-ranked chunks. Reranking pays for itself when it prevents unnecessary tokens from entering the context.
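To make the top-k arithmetic concrete, here is a rough per-query cost sketch. The token counts and the per-1k-token prices are illustrative assumptions, not published prices for any model; plug in your own numbers.

```python
def query_cost(top_k, tokens_per_chunk=500, overhead_tokens=300,
               output_tokens=300, input_price_per_1k=0.001,
               output_price_per_1k=0.005):
    """Estimate the generation cost of one RAG query.

    All defaults are hypothetical placeholders, not real model prices.
    """
    # Input = fixed prompt overhead (instructions + question) + retrieved chunks.
    input_tokens = overhead_tokens + top_k * tokens_per_chunk
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


cost_5 = query_cost(top_k=5)
cost_20 = query_cost(top_k=20)
print(f"top-k=5:  {cost_5:.4f}")
print(f"top-k=20: {cost_20:.4f}")
print(f"ratio:    {cost_20 / cost_5:.2f}x")
```

With these assumed numbers the 20-chunk query costs roughly 2.7 times the 5-chunk one. The ratio stays below 4x because output tokens and prompt overhead do not grow with top-k.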
Two mechanisms cut query cost meaningfully:
- Semantic cache: if today's question is semantically close to one asked yesterday, return the cached answer. Amazon ElastiCache (Redis) with vector search or a simple embedding-hash cache layer handles this.
- Retrieval cache: chunks retrieved for a frequent query can be cached separately from generation — useful when you want to regenerate the answer with a different prompt without paying for search again.
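A minimal sketch of the semantic cache idea, assuming an in-memory list in place of Redis and a similarity threshold of 0.95 (both are assumptions for illustration; the threshold in particular needs tuning on real traffic). The class and method names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory stand-in for a vector-search cache such as Redis."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # assumed value, not a recommendation
        self.entries = []           # list of (query_embedding, answer)

    def get(self, query_embedding):
        """Return the answer of the closest cached query, or None on a miss."""
        best_score, best_answer = 0.0, None
        for emb, answer in self.entries:
            score = cosine(query_embedding, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query_embedding, answer):
        self.entries.append((query_embedding, answer))


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.0], "cached answer")
hit = cache.get([0.99, 0.05, 0.0])   # near-duplicate question
miss = cache.get([0.0, 1.0, 0.0])    # unrelated question
print(hit, miss)
```

A linear scan is fine for a sketch; a production cache would delegate the nearest-neighbor lookup to the store's own vector search.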
Choose the LLM by the quality/cost ratio for your case: Claude Haiku for fast, cheap responses; Claude Sonnet when faithfulness and reasoning matter more.
Where latency lives in the RAG pipeline
In practice, the first thing I do in any RAG system going to end users is enable streaming on generation. Real latency doesn't change, but perceived latency drops dramatically: the user sees tokens arriving in under a second instead of waiting 5s for a complete response. Only after that do I look at the rest of the chain. Optimizing embed and search without streaming is polishing the engine of a car with flat tires.
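The difference between real and perceived latency can be measured directly. The sketch below uses a fake token generator as a stand-in for a streaming LLM response (on Bedrock the stream would come from a streaming call such as boto3's `converse_stream`); the function names and the delay are assumptions for illustration.

```python
import time


def fake_token_stream(tokens, delay_s=0.01):
    """Stand-in for a streaming LLM response; yields one token at a time."""
    for token in tokens:
        time.sleep(delay_s)
        yield token


def measure_stream(stream):
    """Consume a stream; return (text, time_to_first_token, total_time)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            # Perceived latency: when the user first sees output.
            first_token_at = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start  # real latency: full answer
    return "".join(parts), first_token_at, total


text, ttft, total = measure_stream(fake_token_stream(["Hello", ", ", "world"]))
print(f"first token after {ttft:.3f}s, full answer after {total:.3f}s")
```

Tracking time to first token as its own metric makes the effect of streaming visible, separate from total generation time.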
Continuous operation: reindexing, versioning, and data updates
Documents change. Policies get revised, products are discontinued, prices are updated. A stale vector index is worse than no index — it answers confidently with wrong information.
Update strategies:
- Incremental: for each new or modified document, delete old chunks by doc_id and insert the new ones. Works well when change rate is low and documents have stable IDs.
- Full reindex: rebuild the index from scratch in parallel (index B), redirect traffic via alias (OpenSearch supports aliases natively), delete index A. Zero downtime, higher cost.
- Embedding versioning: if you switch embedding models, all documents need reindexing. Store the model name as metadata on each chunk so you know what needs reprocessing.
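The incremental strategy and the embedding-model metadata from the list above can be sketched together. This uses a plain dictionary as a stand-in for the vector index; the class, the `doc_id#n` chunk ID scheme, and the model names are hypothetical.

```python
class ChunkIndex:
    """In-memory stand-in for a vector index, keyed by chunk_id."""

    def __init__(self):
        self.chunks = {}  # chunk_id -> record

    def upsert_document(self, doc_id, texts, embedding_model):
        """Replace every chunk of doc_id with a fresh set of chunks."""
        # 1. Delete the old chunks for this document.
        self.chunks = {cid: c for cid, c in self.chunks.items()
                       if c["doc_id"] != doc_id}
        # 2. Insert the new chunks, tagged with the embedding model name.
        for i, text in enumerate(texts):
            self.chunks[f"{doc_id}#{i}"] = {
                "doc_id": doc_id,
                "text": text,
                "embedding_model": embedding_model,
            }

    def docs_needing_reindex(self, current_model):
        """Return doc_ids whose chunks were embedded with another model."""
        return sorted({c["doc_id"] for c in self.chunks.values()
                       if c["embedding_model"] != current_model})


index = ChunkIndex()
index.upsert_document("policy-1", ["old a", "old b", "old c"], "model-v1")
index.upsert_document("manual-7", ["intro"], "model-v2")
index.upsert_document("policy-1", ["new a"], "model-v1")  # revised, now shorter
stale = index.docs_needing_reindex("model-v2")
print(stale)
```

Deleting before inserting matters when a revised document produces fewer chunks than before: overwriting by chunk ID alone would leave the extra old chunks in the index.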
In Bedrock Knowledge Bases, incremental sync is managed by the service: you point to an S3 bucket and trigger a StartIngestionJob. For custom pipelines, S3 event → Lambda → SQS queue → ingestion worker is a reliable pattern.
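For the custom pipeline, the Lambda step mostly translates S3 event records into ingestion messages. A minimal sketch of that translation follows; the message shape (`bucket`, `key`, `action`) and the bucket and key names are assumptions, and sending the messages to SQS is left out.

```python
from urllib.parse import unquote_plus


def s3_event_to_messages(event):
    """Turn an S3 event notification into one ingestion message per object."""
    messages = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        removed = record["eventName"].startswith("ObjectRemoved")
        messages.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "action": "delete" if removed else "upsert",
        })
    return messages


sample_event = {"Records": [
    {"eventName": "ObjectCreated:Put",
     "s3": {"bucket": {"name": "docs-bucket"},
            "object": {"key": "hr/vacation+policy.pdf"}}},
    {"eventName": "ObjectRemoved:Delete",
     "s3": {"bucket": {"name": "docs-bucket"},
            "object": {"key": "hr/old%2Bdraft.pdf"}}},
]}
messages = s3_event_to_messages(sample_event)
print(messages)
```

Carrying the action in the message lets the ingestion worker remove chunks for deleted documents instead of only handling new uploads.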
One detail that's expensive to ignore: chunk overlap and chunking strategy must be fixed per index version. If you change chunking strategy without reindexing everything, your old and new chunks will have different embedding distributions and search will degrade silently.
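One way to make this hard to get wrong is to derive the index name from the settings that must stay fixed, so a changed chunking strategy cannot silently write into an old index. The sketch below is an assumption-laden illustration: the `IndexVersion` class, the naming scheme, and the alias name are hypothetical; only the shape of the alias request body follows OpenSearch's `_aliases` API.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class IndexVersion:
    """Everything that must stay fixed for the lifetime of one index."""
    chunk_tokens: int
    overlap_pct: int
    embedding_model: str

    def index_name(self, prefix="docs"):
        # Any change to the settings yields a different index name.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return f"{prefix}-{hashlib.sha256(payload).hexdigest()[:8]}"


def alias_swap_body(alias, old_index, new_index):
    """Request body for the _aliases API: move the alias in a single call."""
    return {"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]}


v1 = IndexVersion(chunk_tokens=512, overlap_pct=10, embedding_model="model-v1")
v2 = IndexVersion(chunk_tokens=256, overlap_pct=10, embedding_model="model-v1")
body = alias_swap_body("docs-live", v1.index_name(), v2.index_name())
print(json.dumps(body, indent=2))
```

Queries always target the alias, so traffic moves to the new index only once it has been fully built.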
Observability: what to log, trace, and measure in production
RAG without observability is a black box you can't improve. You need three layers:
1. Structured logs per request
For each query, persist: the original query, retrieved chunks (with scores and doc_id), the final prompt sent to the LLM, and the generated response. Without this, you can't debug a bad answer or feed an offline evaluation pipeline.
2. Distributed traces
AWS X-Ray or OpenTelemetry with spans for each step (embed, search, rerank, generation) give per-component latency visibility. Expect generation to dominate, often around 80% of total time, but measure the remaining steps too: they can hold surprises.
3. Quality metrics in production
Faithfulness and relevance (covered in Lesson 08) aren't just for offline evaluation. With 5–10% request sampling, you can run an async LLM-as-judge and publish metrics to CloudWatch. If average faithfulness drops after a reindexing, you know something broke before users complain.
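Sampling a fixed share of requests for the async judge can be done deterministically by hashing the request ID, so the same request always gets the same decision across retries and workers. This is a sketch under assumptions: the function name, the ID format, and the 10% rate are illustrative.

```python
import hashlib


def should_evaluate(request_id, sample_rate=0.10):
    """Deterministically select roughly `sample_rate` of requests by ID hash."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


sampled = [rid for rid in (f"req-{i}" for i in range(10_000))
           if should_evaluate(rid)]
print(f"sampled {len(sampled)} of 10000 requests")
```

Requests that pass the check would be handed to the judge outside the request path, so evaluation never adds user-facing latency.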
A practical pattern: use Amazon CloudWatch for operational metrics (latency, errors, estimated cost per request), AWS X-Ray for traces, and S3 + Athena for quality log analysis. Bedrock Knowledge Bases already emits native metrics to CloudWatch — take advantage of it.
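The first two layers can share one record per request. The sketch below is a stdlib-only stand-in for real X-Ray or OpenTelemetry spans: it times each step and emits a single JSON line suitable for storing in S3. The `RequestTrace` class, the field names, and the sample values are hypothetical.

```python
import json
import time
from contextlib import contextmanager


class RequestTrace:
    """Collects per-step timings and payloads for one RAG request."""

    def __init__(self, query):
        self.record = {"query": query, "timings_ms": {}}

    @contextmanager
    def span(self, name):
        """Time the enclosed block and store the duration under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.record["timings_ms"][name] = round(elapsed_ms, 2)

    def to_json(self):
        return json.dumps(self.record, ensure_ascii=False)


trace = RequestTrace("How many vacation days do I get?")
with trace.span("search"):
    # Placeholder retrieval result; a real pipeline would query the index here.
    trace.record["retrieved_chunks"] = [
        {"doc_id": "hr-policy-12", "score": 0.83, "text": "Employees accrue..."}
    ]
with trace.span("generation"):
    # Placeholder prompt and response; a real pipeline would call the LLM here.
    trace.record["prompt"] = "Answer using only the context below..."
    trace.record["response"] = "(model output)"

log_line = trace.to_json()
print(log_line)
```

Keeping query, chunks, prompt, response, and timings in one line means a bad answer can be replayed offline without joining several log sources.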
Security and privacy: the non-negotiables
Filter by tenant_id on every query; never rely on the prompt alone.
Order the RAG path to production
From prototype to an operable system.
- Prototype: ingestion + search + generation working
- Optimize cost/latency (model, top-k, cache) and add guardrails
- Operate: observability, reindexing, and continuous improvement
- Evaluate retrieval and generation (faithfulness, relevance)
Production RAG architecture on AWS — reference view
Full flow: document ingestion (left), query pipeline (center), and operations layers (right). Numbers on edges indicate the sequence of a typical query.
- S3 · Source documents
- Lambda · Ingestion / chunking
- Bedrock · Titan Embeddings V2
- OpenSearch Serverless · Hybrid index (kNN + BM25)
- API Gateway · + orchestrator Lambda
- ElastiCache (Redis) · Semantic cache
- Bedrock Reranker · Cross-encoder
- Bedrock · Claude (generation)
- Bedrock Guardrails · PII / topics / jailbreak
- X-Ray · Per-step traces
- CloudWatch · Metrics + alerts
- S3 + Athena · Quality logs
Guided project: corporate document RAG assistant on AWS
Let's sketch the architecture of an assistant that answers questions about a company's internal documents (HR policies, technical manuals, contracts). This is the type of system you'll build after this course.
Design decisions:
| Dimension | Decision | Rationale |
|---|---|---|
| Chunking | Hierarchical (512 tokens, 10% overlap) | Long documents with clear structure |
| Embedding | Titan Embeddings V2 | Low cost, good quality for pt/en |
| Index | OpenSearch Serverless | No cluster management, auto-scaling |
| Search | Hybrid (kNN + BM25) | Exact technical terms + semantics |
| Reranking | Bedrock Reranker | Better precision without extra code |
| Generation | Claude Haiku (fast) / Sonnet (complex) | Routing by query type |
| Guardrails | Bedrock Guardrails | Blocks PII and out-of-scope topics |
| Isolation | Filter by dept_id on every query | Each department sees only its docs |
| Cache | Redis (ElastiCache) by embedding hash | Repeated queries without LLM cost |
| Observability | X-Ray + CloudWatch + S3/Athena | Traces, metrics, and quality analysis |
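The isolation row is the one most worth enforcing in code. A minimal sketch, assuming the index has a vector field named `embedding` and a keyword field named `dept_id` (both names are assumptions), builds the OpenSearch query body so that the filter cannot be omitted:

```python
def build_search_body(query_vector, dept_id, k=10):
    """Build an OpenSearch query body that always carries the dept_id filter."""
    if not dept_id:
        # Fail closed: refuse to search rather than search across departments.
        raise ValueError("dept_id is required for every query")
    return {
        "size": k,
        "query": {
            "bool": {
                "must": [
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}}
                ],
                # Applied by the search engine, independent of the prompt.
                "filter": [{"term": {"dept_id": dept_id}}],
            }
        },
    }


body = build_search_body([0.1, 0.2, 0.3], dept_id="hr", k=5)
print(body["query"]["bool"]["filter"])
```

The `dept_id` value should come from the caller's authenticated identity, not from anything the user typed. Because a bool filter on a k-NN query is applied after the nearest-neighbor search, a query can return fewer than `k` hits; the k-NN clause's own `filter` option is an alternative where the engine supports it.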
What to evaluate continuously: faithfulness (did the LLM hallucinate?) and context precision (were the retrieved chunks relevant?). Run async LLM-as-judge on 10% of requests and publish to CloudWatch as custom metric RAG/Faithfulness.
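Publishing the judge scores could look like the sketch below. It reads "RAG/Faithfulness" as namespace `RAG` plus metric name `Faithfulness`, which is an assumption about the intended naming. The keyword arguments match boto3's CloudWatch `put_metric_data` call; a fake client is used here so the example runs without AWS credentials.

```python
def faithfulness_metric(scores):
    """Build put_metric_data keyword arguments from a batch of judge scores."""
    if not scores:
        raise ValueError("no scores to publish")
    return {
        "Namespace": "RAG",  # assumed reading of "RAG/Faithfulness"
        "MetricData": [{
            "MetricName": "Faithfulness",
            "Value": sum(scores) / len(scores),
            "Unit": "None",
        }],
    }


def publish(cloudwatch_client, scores):
    """Send the averaged score; pass a boto3 CloudWatch client in real use."""
    cloudwatch_client.put_metric_data(**faithfulness_metric(scores))


class FakeCloudWatch:
    """Test double that records calls instead of calling AWS."""

    def __init__(self):
        self.calls = []

    def put_metric_data(self, **kwargs):
        self.calls.append(kwargs)


fake = FakeCloudWatch()
publish(fake, [1.0, 0.5, 1.0, 0.5])
print(fake.calls[0])
```

In a real deployment the client would be `boto3.client("cloudwatch")`, and an alarm on this metric is what turns a silent quality regression into an alert.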
This project synthesizes everything the course covered. Each row in the table is a lesson.
Frequently asked questions about production RAG
Should I use Bedrock Knowledge Bases or build my own pipeline?
Knowledge Bases to start fast and when the use case is standard. Custom pipeline when you need specialized chunking, complex routing logic, or integration with data sources KB doesn't natively support. Both are production-ready — the choice is about control vs. speed.
What is the ideal top-k?
There's no universal number. Start with top-k=10 at search and top-k=5 after reranking. Measure faithfulness and context precision. If faithfulness drops, you're bringing in irrelevant chunks — reduce. If relevance drops, you're cutting too much — increase. Let the data decide.
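The two-stage shape (wide search, narrow rerank) is easy to sketch. The scorer below is a toy keyword counter standing in for a real cross-encoder reranker, and the ten search hits are made-up examples; only the top-k=10 then top-k=5 structure comes from the answer above.

```python
def rerank(candidates, score_fn, final_k=5):
    """Re-score search candidates and keep the best final_k."""
    return sorted(candidates, key=score_fn, reverse=True)[:final_k]


# Ten hypothetical search hits (search top-k=10), in search order.
search_hits = [{"chunk_id": f"c{i}", "text": t} for i, t in enumerate([
    "vacation days accrue monthly", "office parking rules",
    "vacation carry-over limit", "expense report deadlines",
    "vacation request approval", "badge replacement",
    "sick leave policy", "vacation blackout dates",
    "travel booking tool", "holiday calendar",
])]


def toy_score(hit, query_terms=("vacation", "days")):
    """Toy relevance score: number of query terms present in the chunk."""
    return sum(term in hit["text"] for term in query_terms)


top = rerank(search_hits, toy_score, final_k=5)
print([h["chunk_id"] for h in top])
```

Keeping `final_k` as a parameter makes it straightforward to sweep values offline and compare faithfulness and context precision for each setting.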
How to handle frequently changing documents?
Use incremental ingestion by doc_id with S3 events. For critical documents (pricing, regulatory), consider a shorter TTL on the semantic cache or disable caching for those categories via metadata.
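A per-category TTL can be derived from chunk metadata. The category names and durations below are assumptions for illustration; the idea is that an answer inherits the strictest TTL among the chunks it was built from.

```python
DEFAULT_TTL_S = 24 * 3600          # assumed default: one day
CATEGORY_TTL_S = {
    "pricing": 0,                  # 0 means "do not cache"
    "regulatory": 15 * 60,         # assumed: fifteen minutes
}


def cache_ttl(chunk_metadata):
    """Return the shortest TTL required by any chunk used in the answer."""
    ttls = [CATEGORY_TTL_S.get(m.get("category"), DEFAULT_TTL_S)
            for m in chunk_metadata]
    return min(ttls, default=DEFAULT_TTL_S)


ttl_mixed = cache_ttl([{"category": "hr"}, {"category": "regulatory"}])
ttl_pricing = cache_ttl([{"category": "pricing"}])
ttl_plain = cache_ttl([{"category": "hr"}])
print(ttl_mixed, ttl_pricing, ttl_plain)
```

Taking the minimum means a single pricing chunk in the context is enough to keep the whole answer out of the cache.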
Does RAG completely solve hallucination?
No. RAG reduces hallucination by grounding the response in retrieved context, but the LLM can still extrapolate beyond what the chunks say. That's why faithfulness in production is non-negotiable — you need to measure it, not assume it.
Course recap
What you built throughout this course
You started by understanding why LLMs need RAG, went through embeddings, chunking, hybrid search, reranking, metadata, agentic RAG, evaluation, Knowledge Bases, vector stores, guardrails — and arrived here knowing how to operate all of it reliably and economically. This isn't theory: every decision you'd make in a real project was discussed with explicit trade-offs. The next step is the final exam. It tests whether you truly understood — not whether you memorized. Good luck. You're ready.