Why RAG: the LLM knowledge problem
What RAG solves, when to use it and when NOT to — versus fine-tuning and long context.
Every LLM has a knowledge cutoff and has never seen your private data. When it doesn't know the answer, it confidently makes one up. RAG (Retrieval-Augmented Generation) targets exactly that: before the model generates, the system retrieves relevant passages and hands them to the model as grounding. This course shows you how to do that reliably, cheaply, and observably on AWS.
The problem: what the LLM doesn't know — and what it does when it doesn't
An LLM is trained on a snapshot of the world. After that, it freezes. It doesn't know what happened yesterday, it hasn't seen your internal documentation, it didn't read the contract you signed last week.
But the bigger problem isn't what it doesn't know. It's what it does when it doesn't know: it fills the gap with plausible text. This is called hallucination, and it's not a bug that will simply be fixed in the next version. It's a direct consequence of how language models work: they generate each next token from a probability distribution conditioned on the context, and that distribution rewards plausibility, not truth.
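A toy illustration of that mechanism, not a real model: the scores below are invented, and `toy_logits` is a hypothetical name. The point is that the softmax step always yields a distribution over candidates, so some token is always emitted, whether or not the model has ever seen the real policy.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution that sums to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical scores for the next token after "Our refund window is ... days".
# Every candidate gets a score; none of them is checked against any source.
toy_logits = {"30": 2.1, "14": 1.7, "60": 0.9, "90": 0.2}

probs = softmax(toy_logits)

# Greedy decoding picks the single most probable token.
greedy_choice = max(probs, key=probs.get)

# Sampling draws from the distribution, so less likely tokens can also appear.
rng = random.Random(0)
sampled_choice = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(probs)
print("greedy:", greedy_choice, "| sampled:", sampled_choice)
```

Nothing in this loop has a notion of "I don't know": abstaining is only possible if it happens to be a likely continuation.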
In production, this is critical. A support chatbot that invents refund policies. A legal assistant that cites non-existent case law. A code copilot that references an API that doesn't exist. The damage doesn't come from the model being 'dumb' — it comes from the model being convincing even when it's wrong.
The solution isn't more training. It's changing the architecture: instead of asking the model to remember, you deliver the relevant context to it at query time. That's RAG.
What RAG does: the core idea in three steps
RAG is simple at its core. When a user asks a question, the system doesn't go straight to the LLM. It first searches for the most relevant text passages in a knowledge base — documents, wikis, contracts, tickets, whatever you have. Then it injects those passages into the prompt, alongside the question. Only then does the LLM generate the answer.
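The three steps can be sketched in a few lines. This is a minimal sketch under loud assumptions: the knowledge base is a hard-coded list, the retriever ranks by plain word overlap instead of embeddings, and `generate` is a placeholder rather than a real LLM call. All names (`KNOWLEDGE_BASE`, `retrieve`, `build_prompt`, `generate`) are hypothetical.

```python
import re

# Hypothetical knowledge base; in practice these would be your own documents.
KNOWLEDGE_BASE = [
    {"id": "refunds.md", "text": "Refunds are accepted within 30 days of purchase."},
    {"id": "shipping.md", "text": "Standard shipping takes 5 business days."},
    {"id": "support.md", "text": "Support is available on weekdays from 9 to 18."},
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, top_k=2):
    """Step 1: rank passages by word overlap with the question (toy retriever)."""
    q = tokenize(question)
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(q & tokenize(doc["text"])),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, passages):
    """Step 2: inject the retrieved passages into the prompt next to the question."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return (
        "Answer using only the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def generate(prompt):
    """Step 3: placeholder for the LLM call; a real system sends the prompt to a model."""
    return f"(model output for a prompt of {len(prompt)} characters)"

question = "How many days do I have to ask for refunds?"
passages = retrieve(question)
prompt = build_prompt(question, passages)
answer = generate(prompt)

print(prompt)
print(answer)
```

Keeping the three functions separate is a deliberate choice: each one can be swapped (a vector retriever, a different prompt template, a different model) without touching the others.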
The result is a grounded response: the model isn't remembering, it's reading. And since you know exactly which passages were used, the answer is traceable — you can show sources, audit the reasoning, detect when the model went beyond what was provided.
Three properties follow from this design:
- Updatable without retraining: add a new document to the base and, once it has been ingested, the system can use it on the next query.
- Traceable: every answer has an evidence trail. That's gold in regulated contexts.
- Controllable: you decide what goes into the base. Retrieval can only surface what you authorized.
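The traceability property can be made concrete with a small audit step. The sketch below assumes a convention that is not stated in the lesson: the prompt asks the model to cite sources as `[document-id]` markers. The function names and sample answers are hypothetical.

```python
import re

def cited_ids(answer):
    """Extract source markers such as [refunds.md] from an answer string."""
    return set(re.findall(r"\[([^\[\]]+)\]", answer))

def audit(answer, retrieved_ids):
    """Split citations into those backed by retrieved passages and those that are not."""
    cited = cited_ids(answer)
    allowed = set(retrieved_ids)
    return {
        "supported": sorted(cited & allowed),
        "unsupported": sorted(cited - allowed),
        "has_citation": bool(cited),
    }

# IDs of the passages that were actually injected into the prompt.
retrieved = ["refunds.md", "shipping.md"]

good = "Refunds are accepted within 30 days [refunds.md]."
bad = "Refunds are accepted within 90 days [legal-2019.pdf]."

good_report = audit(good, retrieved)
bad_report = audit(bad, retrieved)

print(good_report)
print(bad_report)
```

A check like this only verifies that cited sources were in the prompt. It does not verify that the cited passage supports the claim, which needs a separate evaluation step.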
RAG flow: from question to grounded answer
Two distinct moments: ingestion (offline) and query (online). During ingestion, documents are chunked, turned into vectors, and stored. At query time, the user's question is embedded with the same model, and the resulting vector is used to retrieve the most relevant chunks, which are added to the prompt before it reaches the LLM.
Ingestion (offline):
- Documents · S3, wikis, PDFs
- Chunking · split the text into chunks
- Embedding Model · text → vector
- Vector Store · OpenSearch / FAISS
Query (online):
- User · question
- Embedding Model · question → vector
- Retriever · searches for the top-k chunks
- Prompt Builder · question + chunks
- LLM · Bedrock / Claude
- Answer · with cited sources
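The ingestion and query paths above can be sketched end to end in memory. Everything here is a stand-in chosen for illustration: `embed` is a hashed bag of words, not a real embedding model, and `VectorStore` is a brute-force list, not OpenSearch or FAISS. The chunk size and vector dimension are arbitrary.

```python
import math
import re
import zlib

DIM = 64  # toy vector size; real embedding models use far more dimensions

def chunk(text, max_words=12):
    """Split text into fixed-size word windows (real pipelines often split on structure)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class VectorStore:
    """In-memory stand-in for a vector database: brute-force cosine search."""

    def __init__(self):
        self.items = []  # each entry is (vector, chunk_text, source)

    def add(self, source, text):
        for piece in chunk(text):
            self.items.append((embed(piece), piece, source))

    def search(self, query, top_k=2):
        q = embed(query)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(q, vec)), piece, source)
            for vec, piece, source in self.items
        ]
        return sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]

# Ingestion (offline): chunk, embed, store.
store = VectorStore()
store.add("refunds.md", "Refunds are accepted within 30 days of purchase. "
                        "Items must be unused and in the original packaging.")
store.add("shipping.md", "Standard shipping takes 5 business days.")

# Query (online): the question goes through the same embedding function.
hits = store.search("refunds within how many days", top_k=2)
for score, piece, source in hits:
    print(f"{score:.2f}  [{source}]  {piece}")
```

The one constraint worth noticing is that documents and questions must go through the same `embed` function; vectors from different models are not comparable.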
RAG vs fine-tuning vs long context: when to use each
This is the question I get most from architects. And the honest answer is: it depends on what you're trying to solve.
Fine-tuning teaches the model to behave differently — follow a style, adopt a tone, understand domain jargon. It's not good for injecting new facts. If you train the model on your internal documents, it will 'absorb' that knowledge in a diffuse way, with no guarantee of fidelity. And when the documents change, you retrain. Expensive and slow.
Long context (128k or 200k token windows) seems to solve everything: just throw the whole document into the prompt. It works well for single documents and one-off analysis tasks. But it doesn't scale: you pay for the full window on every query, so cost grows linearly with context size, latency increases, and answer quality can degrade when the context gets too full, the well-known 'lost in the middle' effect.
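A back-of-the-envelope comparison makes the scaling argument tangible. All figures are hypothetical (price, query volume, document size, chunk size), and the sketch counts input tokens only: it ignores output tokens, embedding calls, and vector store costs, which a real estimate would need to include.

```python
def input_cost(tokens_per_query, queries, price_per_million_tokens):
    """Input-token cost for a batch of queries under simple per-token pricing."""
    return tokens_per_query * queries * price_per_million_tokens / 1_000_000

PRICE = 3.00       # hypothetical USD per million input tokens
QUERIES = 10_000   # hypothetical monthly query volume
QUESTION = 50      # hypothetical tokens in the user's question

# Long context: the whole 150k-token document is sent with every question.
long_context = input_cost(150_000 + QUESTION, QUERIES, PRICE)

# RAG: only 5 retrieved chunks of about 400 tokens each are sent.
rag = input_cost(5 * 400 + QUESTION, QUERIES, PRICE)

print(f"long context: ${long_context:,.2f}")
print(f"RAG:          ${rag:,.2f}")
print(f"ratio:        {long_context / rag:.0f}x")
```

The absolute numbers matter less than the shape: the long-context bill scales with document size times query volume, while the RAG bill scales with the number of retrieved chunks times query volume.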
RAG is the right choice when you have a large, dynamic, or private knowledge base and need traceable answers. You retrieve only what's relevant for that specific question — surgical context, not total context.
In practice, all three can coexist: fine-tuning for behavior, RAG for knowledge, long context for single-document analysis.
In practice, the biggest mistake I see is teams trying to use RAG to teach the model to write in the company's tone, or fine-tuning to make the model 'remember' technical documents. Neither works well outside its purpose. RAG is about retrievable, traceable knowledge. Fine-tuning is about behavior and style. When you confuse the two, you spend money and still get hallucinations. Always start with the question: 'what do I need to change — what the model knows or how it acts?'
Frequently asked questions
Does RAG completely eliminate hallucinations?
No. RAG reduces hallucinations by providing factual context, but the model can still extrapolate beyond the provided passages or combine information incorrectly. That's why the course dedicates a full lesson to evaluation and guardrails.
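As one example of what a lightweight guardrail might look like, the sketch below flags numbers that appear in an answer but in none of the retrieved passages. It is a deliberately narrow heuristic with a hypothetical function name, not the evaluation method the course teaches; it catches invented figures and nothing else.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"  # integers and simple decimals

def unsupported_numbers(answer, passages):
    """Return numbers that appear in the answer but in none of the retrieved passages."""
    context_numbers = set(re.findall(NUMBER, " ".join(passages)))
    return [n for n in re.findall(NUMBER, answer) if n not in context_numbers]

passages = ["Refunds are accepted within 30 days of purchase."]

grounded = "You have 30 days to request a refund."
extrapolated = "You have 30 days, or 90 days for premium members."

print(unsupported_numbers(grounded, passages))
print(unsupported_numbers(extrapolated, passages))
```

An answer that passes this check can still be wrong, for instance by attaching a correct number to the wrong policy, which is why a check like this complements evaluation rather than replacing it.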
Do I need a GPU to run RAG in production?
Not necessarily. On AWS, you can use Amazon Bedrock for embeddings and generation via API, without managing GPU infrastructure. With on-demand pricing, model usage is billed per token, not per instance. We'll cover this in detail in the Knowledge Bases and cost lessons.
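To show what "via API" means in practice, here is a minimal sketch of an embedding call through the Bedrock runtime. Several things are assumptions to verify against the current AWS documentation: the model ID, the JSON request and response shape (the one used by Titan text embeddings; other model families differ), and that the model is enabled in your account and region. The function takes the client as a parameter, so nothing here creates AWS resources on its own.

```python
import json

# Assumed model ID for illustration; check which models are enabled in your account.
EMBED_MODEL_ID = "amazon.titan-embed-text-v2:0"

def embed_with_bedrock(client, text, model_id=EMBED_MODEL_ID):
    """Request an embedding vector for `text` through the Bedrock runtime API.

    `client` is expected to behave like boto3.client("bedrock-runtime").
    """
    response = client.invoke_model(
        modelId=model_id,
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text}),
    )
    # The response body is a stream; read it and parse the JSON payload.
    payload = json.loads(response["body"].read())
    return payload["embedding"]

# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   vector = embed_with_bedrock(client, "Refunds are accepted within 30 days.")
```

No instance is provisioned at any point: the call is a plain HTTPS request, and the only infrastructure you manage is whatever stores the resulting vectors.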
What's the difference between RAG and semantic search?
Semantic search is the retrieval step inside RAG — it finds the relevant passages. RAG is the full pipeline: retrieve + inject into context + generate a natural language answer. Without the generation step, you have search. With it, you have RAG.