Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

Production RAG on AWS

Module 3 · Production on AWS · Lesson 12/12

Production RAG: cost, latency, operations + project

What separates a RAG demo from a reliable, cheap production system.

A RAG demo impresses in ten minutes. A RAG system in production needs to work on the ten-thousandth request, cost what was promised, and be debuggable when something goes wrong. This lesson closes the course with what truly separates the two worlds: FinOps, latency, continuous operation, observability, and security — plus a guided project so you leave with a real architecture sketched out.

RAG FinOps: where the money goes and how to control it

The cost of a RAG pipeline lives in two completely different places: ingestion and query.

At ingestion, you pay for embeddings. You do this once (or a few times, on reindexing). Use the cheapest model that still delivers sufficient quality for your domain; Amazon Titan Text Embeddings V2 is a solid choice for Portuguese and English documents. Cost here is proportional to text volume, not to the number of users.

At query time, the dominant cost is generation (the LLM). Every token in context costs money. That's why top-k matters: retrieving 20 chunks and dumping them all into the prompt can be up to three times more expensive than retrieving 5 well-ranked chunks. Reranking pays for itself when it prevents unnecessary tokens from entering the context.
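To make the top-k effect concrete, here is a minimal cost sketch. The per-token prices, chunk size, prompt overhead, and answer length are all assumed values for illustration, not real rates for any model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in USD per 1,000 tokens; replace with your model's real rates.
    input_per_1k: float
    output_per_1k: float


def query_cost(top_k, chunk_tokens, overhead_tokens, output_tokens, pricing):
    """Estimate the generation cost of one query in USD."""
    input_tokens = overhead_tokens + top_k * chunk_tokens
    return (
        (input_tokens / 1000) * pricing.input_per_1k
        + (output_tokens / 1000) * pricing.output_per_1k
    )


pricing = Pricing(input_per_1k=0.003, output_per_1k=0.015)  # assumed values
cost_20 = query_cost(20, 512, 300, 300, pricing)
cost_5 = query_cost(5, 512, 300, 300, pricing)
print(f"top-k=20: ${cost_20:.4f}  top-k=5: ${cost_5:.4f}  ratio: {cost_20 / cost_5:.2f}x")
```

With these assumed numbers the 20-chunk prompt costs roughly 2.8 times as much as the 5-chunk prompt. The ratio shrinks as answers get longer, because output tokens do not depend on top-k.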

Two mechanisms cut query cost meaningfully:

Semantic cache: if today's question is semantically close to one asked yesterday, return the cached answer. Amazon ElastiCache (Redis) with vector search handles this; a simple embedding-hash cache layer is cheaper to build but only matches repeated queries whose embeddings are identical.
Retrieval cache: chunks retrieved for a frequent query can be cached separately from generation — useful when you want to regenerate the answer with a different prompt without paying for search again.
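A semantic cache can be sketched in a few lines. This is an in-memory stand-in, not a Redis client: the `SemanticCache` class, the 0.95 threshold, and the toy three-dimensional embeddings are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory stand-in for a vector-capable cache."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # assumed value; tune it on real traffic
        self.entries = []  # list of (embedding, answer) pairs

    def get(self, embedding):
        """Return the cached answer of the closest entry, or None below the threshold."""
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))


cache = SemanticCache()
cache.put([1.0, 0.0, 0.2], "Refunds are processed in 5 business days.")
hit = cache.get([0.98, 0.05, 0.21])   # near-duplicate question
miss = cache.get([0.0, 1.0, 0.0])     # unrelated question
print(hit, miss)
```

The linear scan is only acceptable for a sketch. A production cache would delegate the nearest-neighbor lookup to the vector store.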

Choose the LLM by the quality/cost ratio for your case: Claude Haiku for fast, cheap responses; Claude Sonnet when faithfulness and reasoning matter more.
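One way to apply this choice per request is a small router. The sketch below uses deliberately naive heuristics (query length and a few keywords); the marker list, the word limit, and the tier labels are assumptions, and the labels would still need to be mapped to real model IDs in your configuration.

```python
# Assumed heuristics: words that hint at a reasoning-heavy question.
COMPLEX_MARKERS = ("compare", "why", "explain", "trade-off", "difference")


def route_model(query: str, max_simple_words: int = 20) -> str:
    """Return "sonnet" for queries that look reasoning-heavy, else "haiku"."""
    text = query.lower()
    looks_complex = (
        len(text.split()) > max_simple_words
        or any(marker in text for marker in COMPLEX_MARKERS)
    )
    return "sonnet" if looks_complex else "haiku"


simple = route_model("What is the vacation policy?")
complex_ = route_model("Compare the 2023 and 2024 travel policies and explain the differences.")
print(simple, complex_)
```

Keyword routing is easy to audit but brittle. Whether it is good enough should be judged with the same faithfulness and relevance metrics used elsewhere in the pipeline.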

Where latency lives in the RAG pipeline

| Latency | Stage | Notes |
|---|---|---|
| ~20 ms | Query embed + vector search (well-tuned OpenSearch) | Search is rarely the bottleneck. If it is, check index size and shard count. |
| 50–150 ms | Cross-encoder reranker (Bedrock or dedicated endpoint) | Fixed cost per call. Only worth it for top-k > 5 and when precision matters more than speed. |
| 1–8 s | Generation (LLM), dominates total latency | Streaming fixes user perception. Reducing context (lower top-k, smaller chunks) reduces TTFT (time to first token). |

In practice: streaming before any other optimization

Senior Solutions Architect

In practice, the first thing I do in any RAG going to end users is enable streaming on generation. Real latency doesn't change, but perceived latency drops dramatically — the user sees tokens arriving in under a second instead of waiting 5s for a complete response. Only after that do I look at the rest of the chain. Optimizing embed and search without streaming is polishing the engine of a car with flat tires.
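The gap between real and perceived latency is easy to measure. In this sketch `fake_llm_stream` is a hypothetical stand-in for a streaming model response; only the measurement logic in `consume_stream` is the point.

```python
import time


def fake_llm_stream(tokens, delay_s=0.01):
    """Hypothetical stand-in for a streaming LLM response: yields tokens with a delay."""
    for token in tokens:
        time.sleep(delay_s)
        yield token


def consume_stream(stream, on_token=lambda token: None):
    """Forward tokens as they arrive; return (text, time to first token, total time)."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        on_token(token)  # e.g. push the token to the client
        parts.append(token)
    total = time.perf_counter() - start
    return "".join(parts), first_token_at, total


text, ttft, total = consume_stream(fake_llm_stream(["RAG ", "needs ", "streaming", "."]))
print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```

Total time is unchanged by streaming; what the user perceives is the first value, which is why it deserves its own metric.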

Continuous operation: reindexing, versioning, and data updates

Documents change. Policies get revised, products are discontinued, prices are updated. A stale vector index is worse than no index — it answers confidently with wrong information.

Update strategies:

Incremental: for each new or modified document, delete old chunks by doc_id and insert the new ones. Works well when change rate is low and documents have stable IDs.
Full reindex: rebuild the index from scratch in parallel (index B), redirect traffic via alias (OpenSearch supports aliases natively), delete index A. Zero downtime, higher cost.
Embedding versioning: if you switch embedding models, all documents need reindexing. Store the model name as metadata on each chunk so you know what needs reprocessing.
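The incremental and embedding-versioning strategies above can be sketched with an in-memory index. `ChunkIndex`, the `doc_id#position` chunk ID format, and the model names are hypothetical; a real vector store would run the delete and insert as its own operations.

```python
class ChunkIndex:
    """In-memory stand-in for a vector index; keys are chunk IDs."""

    def __init__(self):
        self.chunks = {}

    def upsert_document(self, doc_id, texts, embedding_model):
        """Replace every chunk of doc_id; return how many old chunks were deleted."""
        old_ids = [cid for cid, chunk in self.chunks.items() if chunk["doc_id"] == doc_id]
        for cid in old_ids:
            del self.chunks[cid]
        for position, text in enumerate(texts):
            self.chunks[f"{doc_id}#{position}"] = {
                "doc_id": doc_id,
                "text": text,
                "embedding_model": embedding_model,  # stored so model switches are traceable
            }
        return len(old_ids)

    def docs_needing_reindex(self, current_model):
        """List doc IDs whose chunks were embedded with a different model."""
        return sorted(
            {c["doc_id"] for c in self.chunks.values() if c["embedding_model"] != current_model}
        )


index = ChunkIndex()
index.upsert_document("policy-42", ["v1 part a", "v1 part b", "v1 part c"], "embed-v1")
index.upsert_document("manual-7", ["intro"], "embed-v2")
deleted = index.upsert_document("policy-42", ["v2 part a", "v2 part b"], "embed-v1")
print(deleted, index.docs_needing_reindex("embed-v2"))
```

Deleting by `doc_id` before inserting matters when the new version has fewer chunks than the old one; otherwise the extra old chunks would stay searchable.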

In Bedrock Knowledge Bases, incremental sync is managed by the service — you point to an S3 bucket and trigger a StartIngestionJob. For custom pipelines, S3 event → Lambda → SQS queue → ingestion worker is the most reliable pattern.
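For the custom pipeline, the Lambda step mostly translates S3 event records into queue messages. The sketch below keeps that translation as a pure function; the message shape and the `send` hook are assumptions, and in a real function `send` would wrap an SQS client call.

```python
import json
from urllib.parse import unquote_plus


def s3_event_to_messages(event):
    """Turn an S3 notification event into JSON message bodies for the ingestion queue."""
    messages = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # S3 event names look like "ObjectCreated:Put" or "ObjectRemoved:Delete".
        action = "delete" if record["eventName"].startswith("ObjectRemoved") else "upsert"
        messages.append(json.dumps({
            "action": action,
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 notifications.
            "key": unquote_plus(s3["object"]["key"]),
        }))
    return messages


def handler(event, context, send=print):
    """Lambda-style entry point; `send` is a hypothetical hook for the queue client."""
    for body in s3_event_to_messages(event):
        send(body)


sample_event = {"Records": [{
    "eventName": "ObjectCreated:Put",
    "s3": {"bucket": {"name": "example-docs"}, "object": {"key": "policies/hr+handbook.pdf"}},
}]}
sent = []
handler(sample_event, None, send=sent.append)
print(sent)
```

Putting a queue between the event and the worker gives retries and a dead-letter path for documents that fail to parse, so one bad file does not block the rest.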

One detail that's expensive to ignore: chunk overlap and chunking strategy must be fixed per index version. If you change chunking strategy without reindexing everything, your old and new chunks will have different embedding distributions and search will degrade silently.
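One way to enforce this rule is to derive the index name from the chunking configuration, so a changed setting cannot silently write into an old index. The sketch assumes whitespace-style tokens and a hypothetical naming scheme; the small `size` is only there to keep the example readable.

```python
import hashlib
import json


def chunk_tokens(tokens, size=512, overlap_ratio=0.10):
    """Split a token list into windows of `size` that share `overlap_ratio` of their tokens."""
    overlap = int(size * overlap_ratio)
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap_ratio must leave a positive step")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def index_name(base, chunking_config):
    """Derive the index name from the chunking config, so a config change means a new index."""
    canonical = json.dumps(chunking_config, sort_keys=True)
    return f"{base}-{hashlib.sha256(canonical.encode()).hexdigest()[:8]}"


config = {"strategy": "fixed", "size": 10, "overlap_ratio": 0.2}
tokens = [f"t{i}" for i in range(25)]
chunks = chunk_tokens(tokens, config["size"], config["overlap_ratio"])
print(len(chunks), index_name("docs", config))
```

Because the name changes with the configuration, mixing old and new chunks requires a deliberate decision instead of happening by accident.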

Observability: what to log, trace, and measure in production

RAG without observability is a black box you can't improve. You need three layers:

1. Structured logs per request

For each query, persist: the original query, retrieved chunks (with scores and doc_id), the final prompt sent to the LLM, and the generated response. Without this, you can't debug a bad answer or feed an offline evaluation pipeline.
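A per-request log record could look like the sketch below. The field names and the `chunk_id` key are assumptions; the fields themselves follow the list above.

```python
import json
import time
import uuid


def build_request_log(query, chunks, prompt, response):
    """Serialize one RAG request as a single JSON line."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "retrieved": [
            {"doc_id": c["doc_id"], "chunk_id": c["chunk_id"], "score": c["score"]}
            for c in chunks
        ],
        "prompt": prompt,
        "response": response,
    }
    return json.dumps(record, ensure_ascii=False)


line = build_request_log(
    query="How many vacation days do I get?",
    chunks=[{"doc_id": "hr-policy", "chunk_id": "hr-policy#3", "score": 0.82}],
    prompt="Answer using only the context below...",
    response="You get 30 days per year.",
)
print(line)
```

One JSON object per line keeps the logs directly queryable once they land in S3.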

2. Distributed traces

AWS X-Ray or OpenTelemetry with spans for each step (embed, search, rerank, generation) give per-component latency visibility. You'll find that 80% of time is in generation — but the remaining 20% will surprise you.
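The idea of per-step spans can be illustrated without any tracing SDK. The `Trace` class below is a hypothetical stand-in for X-Ray or OpenTelemetry, and the `time.sleep` calls simulate pipeline steps.

```python
import time
from contextlib import contextmanager


class Trace:
    """Minimal stand-in for a tracer: records the duration of each named span."""

    def __init__(self):
        self.spans = {}

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.spans[name] = self.spans.get(name, 0.0) + elapsed

    def shares(self):
        """Return each span's fraction of the total traced time."""
        total = sum(self.spans.values())
        return {name: d / total for name, d in self.spans.items()} if total else {}


trace = Trace()
with trace.span("embed"):
    time.sleep(0.002)   # simulated work
with trace.span("search"):
    time.sleep(0.002)
with trace.span("generation"):
    time.sleep(0.03)
print({name: f"{share:.0%}" for name, share in trace.shares().items()})
```

A real tracer adds what this sketch lacks: parent and child relationships, propagation across services, and sampling.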

3. Quality metrics in production

Faithfulness and relevance (covered in Lesson 08) aren't just for offline evaluation. With 5–10% request sampling, you can run an async LLM-as-judge and publish metrics to CloudWatch. If average faithfulness drops after a reindexing, you know something broke before users complain.
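Sampling can be made deterministic by hashing the request ID, so a given request is always either in or out of the evaluation set. The function names and the `RAG` namespace are assumptions; the returned dictionary follows the parameter names of the CloudWatch PutMetricData API but is not sent anywhere here.

```python
import hashlib


def should_evaluate(request_id, sample_rate=0.10):
    """Deterministically select about `sample_rate` of requests for async evaluation."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


def faithfulness_metric(score):
    """Build PutMetricData parameters; namespace and metric name are assumed."""
    return {
        "Namespace": "RAG",
        "MetricData": [{"MetricName": "Faithfulness", "Value": score, "Unit": "None"}],
    }


ids = [f"req-{i}" for i in range(10000)]
sampled = [rid for rid in ids if should_evaluate(rid)]
print(len(sampled) / len(ids))
print(faithfulness_metric(0.91))
```

Hash-based sampling also makes a judged request reproducible: re-running the evaluation later selects the same IDs.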

A practical pattern: use Amazon CloudWatch for operational metrics (latency, errors, estimated cost per request), AWS X-Ray for traces, and S3 + Athena for quality log analysis. Bedrock Knowledge Bases already emits native metrics to CloudWatch, so take advantage of them.

Security and privacy: the non-negotiables

Tenant isolation: if the system serves multiple clients, each must see only their own documents. Use metadata filters by tenant_id on every query — never rely on the prompt alone.
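Enforcing the filter in code, not in the prompt, can be as simple as refusing to build a query without a tenant. The sketch below produces an OpenSearch k-NN query body with a `term` filter; the field names `embedding` and `tenant_id` are assumptions about the index mapping, and filter support inside `knn` depends on the engine configured for the index.

```python
def build_knn_query(query_vector, tenant_id, k=5, vector_field="embedding"):
    """Build a k-NN search body that always carries a tenant filter."""
    if not tenant_id:
        # Fail closed: a query without a tenant must never reach the index.
        raise ValueError("tenant_id is required")
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {
                    "vector": query_vector,
                    "k": k,
                    "filter": {"term": {"tenant_id": tenant_id}},
                }
            }
        },
    }


body = build_knn_query([0.1, 0.2, 0.3], tenant_id="acme")
print(body)
```

The `tenant_id` should come from the authenticated session, never from user-supplied text.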

Bedrock Guardrails: block forbidden topics, PII, and jailbreaks at the generation layer (Lesson 11). Guardrails are the last line of defense, not the only one.

IAM with least privilege: the application role needs access to S3 (read), OpenSearch (query), and Bedrock (InvokeModel). Nothing more. Audit with IAM Access Analyzer.
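As an illustration, the policy for such a role might look like the sketch below, written as a Python dictionary so it can be checked before deployment. Every ARN is a hypothetical placeholder, and the OpenSearch statement assumes OpenSearch Serverless, where fine-grained permissions also depend on a separate data access policy.

```python
import json

# All ARNs below are hypothetical placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadSourceDocuments",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-rag-docs/*",
        },
        {
            "Sid": "QueryVectorIndex",
            "Effect": "Allow",
            "Action": ["aoss:APIAccessAll"],  # data-plane access for OpenSearch Serverless
            "Resource": "arn:aws:aoss:us-east-1:123456789012:collection/example-collection-id",
        },
        {
            "Sid": "InvokeModels",
            "Effect": "Allow",
            # Streaming responses would need a separate action in addition to this one.
            "Action": ["bedrock:InvokeModel"],
            "Resource": "arn:aws:bedrock:us-east-1::foundation-model/*",
        },
    ],
}


def wildcard_actions(policy):
    """Return any actions that grant a whole service or everything."""
    return [
        action
        for statement in policy["Statement"]
        for action in statement["Action"]
        if action == "*" or action.endswith(":*")
    ]


print(json.dumps(policy, indent=2))
print(wildcard_actions(policy))
```

A check like `wildcard_actions` is a cheap guard in CI; it complements IAM Access Analyzer and does not replace it.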

Don't log sensitive data: request logs are valuable, but if documents contain PII, redact before persisting. Amazon Macie can audit the document S3 bucket.
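A minimal redaction pass before persisting logs might look like this. The two patterns (email addresses and formatted Brazilian CPF numbers) are assumptions chosen for illustration; regular expressions catch only well-formatted values, so treat this as a first filter, not a guarantee.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CPF = re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b")  # formatted CPF only, e.g. 123.456.789-09


def redact(text):
    """Replace emails and formatted CPF numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return CPF.sub("[CPF]", text)


clean = redact("Contact ana@example.com, CPF 123.456.789-09.")
print(clean)
```

Apply the same function to the query, the retrieved chunks, and the response, since any of the three can carry personal data into the logs.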

VPC and private endpoints: OpenSearch and Bedrock support VPC endpoints. In enterprise production, no data traffic should traverse the public internet.

Put in order

Order the RAG path to production

From prototype to an operable system.

1. Prototype: ingestion + search + generation working
2. Optimize cost/latency (model, top-k, cache) and add guardrails
3. Operate: observability, reindexing and continuous improvement
4. Evaluate retrieval and generation (faithfulness, relevance)

Production RAG architecture on AWS — reference view

Full flow: document ingestion (left), query pipeline (center), and operations layers (right). Numbers on edges indicate the sequence of a typical query.

📥 Ingestion

S3 · Source documents
Lambda · Ingestion / chunking
Bedrock · Titan Embeddings V2

🔍 Vector index

OpenSearch Serverless · Hybrid index (kNN + BM25)

🤖 Query pipeline

API Gateway · + Lambda orchestrator
ElastiCache (Redis) · Semantic cache
Bedrock Reranker · Cross-encoder
Bedrock · Claude (generation)
Bedrock Guardrails · PII / topics / jailbreak

📊 Operations

X-Ray · Traces per step
CloudWatch · Metrics + alerts
S3 + Athena · Quality logs

Guided project: corporate document RAG assistant on AWS

Let's sketch the architecture of an assistant that answers questions about a company's internal documents (HR policies, technical manuals, contracts). This is the type of system you'll build after this course.

Design decisions:

| Dimension | Decision | Rationale |
|---|---|---|
| Chunking | Hierarchical (512 tokens, 10% overlap) | Long documents with clear structure |
| Embedding | Titan Embeddings V2 | Low cost, good quality for pt/en |
| Index | OpenSearch Serverless | No cluster management, auto-scaling |
| Search | Hybrid (kNN + BM25) | Exact technical terms + semantics |
| Reranking | Bedrock Reranker | Better precision without extra code |
| Generation | Claude Haiku (fast) / Sonnet (complex) | Routing by query type |
| Guardrails | Bedrock Guardrails | Blocks PII and out-of-scope topics |
| Isolation | Filter by dept_id on every query | Each department sees only its docs |
| Cache | Redis (ElastiCache) by embedding hash | Repeated queries without LLM cost |
| Observability | X-Ray + CloudWatch + S3/Athena | Traces, metrics, and quality analysis |

What to evaluate continuously: faithfulness (did the LLM hallucinate?) and context precision (were the retrieved chunks relevant?). Run async LLM-as-judge on 10% of requests and publish to CloudWatch as custom metric RAG/Faithfulness.

This project synthesizes everything the course covered. Each row in the table is a lesson.

Frequently asked questions about production RAG

Should I use Bedrock Knowledge Bases or build my own pipeline?

Knowledge Bases to start fast and when the use case is standard. Custom pipeline when you need specialized chunking, complex routing logic, or integration with data sources KB doesn't natively support. Both are production-ready — the choice is about control vs. speed.

What is the ideal top-k?

There's no universal number. Start with top-k=10 at search and top-k=5 after reranking. Measure faithfulness and context precision. If faithfulness drops, you're bringing in irrelevant chunks — reduce. If relevance drops, you're cutting too much — increase. Let the data decide.
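The tuning rule in this answer can be written down as a small function. The thresholds and bounds below are hypothetical; the two scores are assumed to be averages from the sampled production evaluation.

```python
def adjust_top_k(top_k, faithfulness, relevance,
                 min_faithfulness=0.85, min_relevance=0.80, lowest=3, highest=10):
    """Suggest the next top-k after reranking, based on measured quality."""
    if faithfulness < min_faithfulness:
        return max(lowest, top_k - 1)   # irrelevant chunks are leaking in: reduce
    if relevance < min_relevance:
        return min(highest, top_k + 1)  # useful context is being cut: increase
    return top_k


print(adjust_top_k(5, faithfulness=0.78, relevance=0.90))
print(adjust_top_k(5, faithfulness=0.92, relevance=0.70))
print(adjust_top_k(5, faithfulness=0.92, relevance=0.90))
```

Changing one step at a time keeps each adjustment attributable: if a metric moves after the change, you know which knob moved it.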

How to handle frequently changing documents?

Use incremental ingestion by doc_id with S3 events. For critical documents (pricing, regulatory), consider a shorter TTL on the semantic cache or disable caching for those categories via metadata.
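A per-category cache policy can be driven by chunk metadata, as in this sketch. The category names, the TTL values, and the convention that 0 means "do not cache" are all assumptions.

```python
DEFAULT_TTL_SECONDS = 24 * 3600  # assumed default
# Hypothetical per-category overrides; 0 means "never cache".
CATEGORY_TTL_SECONDS = {"pricing": 0, "regulatory": 15 * 60}


def cache_ttl_seconds(chunk_metadata):
    """Return the shortest TTL among the categories of the retrieved chunks."""
    ttls = [
        CATEGORY_TTL_SECONDS.get(meta.get("category"), DEFAULT_TTL_SECONDS)
        for meta in chunk_metadata
    ]
    return min(ttls, default=DEFAULT_TTL_SECONDS)


print(cache_ttl_seconds([{"category": "hr"}]))
print(cache_ttl_seconds([{"category": "hr"}, {"category": "regulatory"}]))
print(cache_ttl_seconds([{"category": "pricing"}, {"category": "hr"}]))
```

Taking the minimum means one volatile chunk in the context is enough to shorten or disable caching for the whole answer, which errs on the side of freshness.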

Does RAG completely solve hallucination?

No. RAG reduces hallucination by grounding the response in retrieved context, but the LLM can still extrapolate beyond what the chunks say. That's why faithfulness in production is non-negotiable — you need to measure it, not assume it.

What you built throughout this course

Course complete: go to the final exam

You started by understanding why LLMs need RAG, went through embeddings, chunking, hybrid search, reranking, metadata, agentic RAG, evaluation, Knowledge Bases, vector stores, guardrails — and arrived here knowing how to operate all of it reliably and economically. This isn't theory: every decision you'd make in a real project was discussed with explicit trade-offs. The next step is the final exam. It tests whether you truly understood — not whether you memorized. Good luck. You're ready.

