Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

Production RAG on AWS

Module 1 · Fundamentals· Lesson 01/12

Why RAG: the LLM knowledge problem

What RAG solves, when to use it and when NOT to — versus fine-tuning and long context.

5 min read

Every LLM has a knowledge cutoff and has never seen your private data — and when it doesn't know the answer, it confidently makes one up. RAG (Retrieval-Augmented Generation) fixes exactly that: before generating, the model retrieves the right passages and uses them as grounding. This course shows you how to do that reliably, cheaply, and observably on AWS.

The problem: what the LLM doesn't know — and what it does when it doesn't

An LLM is trained on a snapshot of the world. After that, it freezes. It doesn't know what happened yesterday, it hasn't seen your internal documentation, it didn't read the contract you signed last week.

But the bigger problem isn't what it doesn't know. It's what it does when it doesn't know: it fills the gap with plausible text. This is called hallucination — and it's not a bug that will be fixed in the next version. It's a direct consequence of how language models work: they always produce the most probable next token given the context, regardless of whether it's true.

In production, this is critical. A support chatbot that invents refund policies. A legal assistant that cites non-existent case law. A code copilot that references an API that doesn't exist. The damage doesn't come from the model being 'dumb' — it comes from the model being convincing even when it's wrong.

The solution isn't more training. It's changing the architecture: instead of asking the model to remember, you deliver the relevant context to it at query time. That's RAG.

What RAG does: the core idea in three steps

RAG is simple at its core. When a user asks a question, the system doesn't go straight to the LLM. It first searches for the most relevant text passages in a knowledge base — documents, wikis, contracts, tickets, whatever you have. Then it injects those passages into the prompt, alongside the question. Only then does the LLM generate the answer.

The result is a grounded response: the model isn't remembering, it's reading. And since you know exactly which passages were used, the answer is traceable — you can show sources, audit the reasoning, detect when the model went beyond what was provided.

Three properties this guarantees:

Updatable without retraining: add a new document to the base and the system knows it on the next query.
Traceable: every answer has an evidence trail. That's gold in regulated contexts.
Controllable: you decide what goes into the base. The model can only use what you authorized.

RAG Flow: from question to grounded answer

Two distinct moments: ingestion (offline) and query (online). During ingestion, documents are chunked, turned into vectors, and stored. During query time, the user's question follows the same path and retrieves the most relevant passages before reaching the LLM.

📥 Ingestão — Offline

Documentos · S3, wikis, PDFs
Chunking · fragmentar texto
Embedding Model · texto → vetor
Vector Store · OpenSearch / FAISS

🔍 Consulta — Online

Usuário · pergunta
Embedding Model · pergunta → vetor
Retriever · busca top-k chunks
Prompt Builder · pergunta + chunks
LLM · Bedrock / Claude
Resposta · com fontes citadas

RAG vs fine-tuning vs long context: when to use each

This is the question I get most from architects. And the honest answer is: it depends on what you're trying to solve.

Fine-tuning teaches the model to behave differently — follow a style, adopt a tone, understand domain jargon. It's not good for injecting new facts. If you train the model on your internal documents, it will 'absorb' that knowledge in a diffuse way, with no guarantee of fidelity. And when the documents change, you retrain. Expensive and slow.

Long context (128k, 200k token windows) seems to solve everything: just throw the whole document into the prompt. It works well for single documents and one-off analysis tasks. But it doesn't scale: cost grows linearly with context size, latency increases, and models degrade in quality when the context gets too full — the well-known 'lost in the middle' effect.

RAG is the right choice when you have a large, dynamic, or private knowledge base and need traceable answers. You retrieve only what's relevant for that specific question — surgical context, not total context.

In practice, all three can coexist: fine-tuning for behavior, RAG for knowledge, long context for single-document analysis. The exercise below will help cement when each approach makes sense.

Match

RAG × fine-tuning × long context

Tap a concept, then its definition.

In practice: RAG doesn't replace fine-tuning, it complements it

Senior Solutions Architect

In practice, the biggest mistake I see is teams trying to use RAG to teach the model to write in the company's tone, or fine-tuning to make the model 'remember' technical documents. Neither works well outside its purpose. RAG is about retrievable, traceable knowledge. Fine-tuning is about behavior and style. When you confuse the two, you spend money and still get hallucinations. Always start with the question: 'what do I need to change — what the model knows or how it acts?'

Key takeaways from this lesson

LLMs hallucinate because they generate plausible text, not because they look up the truth — this doesn't change with bigger models.

RAG solves the private and current knowledge problem without retraining the model.

The flow has two moments: ingestion (offline) and query (online). Each has its own design decisions.

Fine-tuning changes behavior; RAG changes what the model can cite; long context serves for one-off analysis of single documents.

RAG answers are traceable: you know exactly which passages grounded each response.

Frequently asked questions

Does RAG completely eliminate hallucinations?

No. RAG reduces hallucinations by providing factual context, but the model can still extrapolate beyond the provided passages or combine information incorrectly. That's why the course dedicates a full lesson to evaluation and guardrails.

Do I need a GPU to run RAG in production?

Not necessarily. On AWS, you can use Amazon Bedrock for embeddings and generation via API, without managing GPU infrastructure. The cost is per token, not per instance. We'll cover this in detail in the Knowledge Bases and cost lessons.

What's the difference between RAG and semantic search?

Semantic search is the retrieval step inside RAG — it finds the relevant passages. RAG is the full pipeline: retrieve + inject into context + generate a natural language answer. Without the generation step, you have search. With it, you have RAG.

References

Lewis et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (paper original)Amazon Bedrock Knowledge Bases — documentação oficial AWS Blog — Building RAG-based applications with Amazon Bedrock Anthropic — Reducing hallucinations with citations (Claude docs)

Quiz

Quick check

1. When is RAG the best choice?

Next lesson