Evaluating RAG: faithfulness and relevance
Without measurement, you can't improve. How to evaluate the retrieval and generation halves of a RAG pipeline.
You tuned the chunking, swapped the embedding model, rewrote the prompt — but how do you know if things got better? Without metrics, you're flying blind. Evaluation isn't a final step: it's the instrument that guides every decision in your RAG pipeline.
Two independent halves to evaluate
A RAG pipeline has two distinct jobs: retrieve the right chunks and generate a response that is faithful to them. These two jobs fail in different ways, so they need different metrics.
On the retrieval side, the question is: did the relevant chunks show up in the returned list? You measure this with hit rate (is at least one relevant chunk in the top-k?), precision@k (of the k chunks returned, how many are actually useful?), and recall@k (of all relevant chunks that exist, how many were captured?). A retriever with low precision floods the context with noise; one with low recall leaves critical information out.
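As a minimal sketch of the three retrieval metrics just described, the functions below assume each chunk has a unique string id and that you already know which ids are relevant for a query. The function names and the sample ids are illustrative, not part of any library.

```python
from typing import List, Sequence, Set, Tuple


def hit(retrieved: Sequence[str], relevant: Set[str], k: int) -> bool:
    """True if at least one relevant chunk id appears in the top-k."""
    return any(chunk_id in relevant for chunk_id in retrieved[:k])


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant chunks found in the top-k, divided by k."""
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant chunks found in the top-k, divided by all relevant chunks."""
    if not relevant:
        return 0.0  # assumption: queries with no relevant chunk score 0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_rate(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Share of queries with at least one relevant chunk in the top-k."""
    return sum(hit(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Illustrative data: five returned chunk ids, three known relevant ids.
retrieved_ids = ["c1", "c7", "c3", "c9", "c4"]
relevant_ids = {"c3", "c4", "c8"}

p = precision_at_k(retrieved_ids, relevant_ids, k=5)  # 2 of 5 returned are relevant
r = recall_at_k(retrieved_ids, relevant_ids, k=5)     # 2 of 3 relevant were captured
```

Here precision is low because three of the five chunks are noise, and recall is incomplete because `c8` never made it into the list.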
On the generation side, the question splits in two: is the answer faithful to the sources (faithfulness / groundedness) and does it answer the question (answer relevance)? Faithfulness detects hallucination — the model asserted something not present in any retrieved chunk. Answer relevance detects answers that may be true but don't actually address what was asked.
Misdiagnosing which half is broken is one of the most expensive mistakes in RAG. If the retriever brings the wrong chunks, improving the prompt won't fix it. If the retriever is fine but the model hallucinates, swapping the embedding model won't help. Measuring both halves separately is what lets you act in the right place.
RAG metrics: what each one measures
| Metric | Half | What it detects | How to compute |
|---|---|---|---|
| Hit Rate | Retrieval | At least 1 relevant chunk in top-k | % of queries with a hit |
| Precision@k | Retrieval | Noise in the context sent to the LLM | Relevant chunks / k returned |
| Recall@k | Retrieval | Relevant information missed | Relevant chunks captured / total available |
| Faithfulness | Generation | Hallucination: claims unsupported by chunks | LLM-as-judge or NLI per claim |
| Answer Relevance | Generation | Answer doesn't address the question | Semantic similarity answer ↔ query |
| Latency (p50/p95) | System | Experience degradation | End-to-end response time |
| Cost per query | System | Financial impact of changes | Input + output tokens × model price |
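For the latency row in the table above, one way to get p50 and p95 from a list of end-to-end response times is the standard library's `statistics.quantiles`. This is a sketch: the function name and the choice of the `inclusive` interpolation method are assumptions, and other percentile definitions give slightly different values on small samples.

```python
import statistics
from typing import Dict, List


def latency_percentiles(latencies_ms: List[float]) -> Dict[str, float]:
    """Return p50 and p95 from end-to-end response times in milliseconds."""
    # n=100 yields 99 cut points: index 49 is the 50th percentile,
    # index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


# Illustrative input: 101 measurements from 1 ms to 101 ms.
sample = [float(ms) for ms in range(1, 102)]
percentiles = latency_percentiles(sample)
```

Reporting p95 next to p50 matters because a change such as a larger k can leave the median almost untouched while making the slowest requests noticeably worse.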
In practice, human annotation is the gold standard but doesn't scale for rapid iteration. What works day-to-day is using an LLM (Claude 3 Sonnet or Haiku on Bedrock, for example) as a judge: you pass the query, the retrieved chunks, and the generated answer, and ask the model to evaluate faithfulness and relevance on a structured scale. It's not perfect — the judge makes mistakes too — but it's consistent enough to detect regressions between versions. The course site has eval studies running on Bedrock with real examples of judgment prompts. Use human annotation to calibrate the judge periodically, not for every experiment.
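A minimal sketch of that judge step, under stated assumptions: the prompt wording, the 1 to 5 scale, and the JSON reply format are illustrative choices, and `call_model` is a hypothetical callable standing in for whatever client you use to reach the judge model. The model call is kept outside the function so the parsing logic can be tested without network access.

```python
import json
from typing import Callable, Dict, List

# Illustrative prompt; the wording and the 1-5 scale are assumptions.
JUDGE_TEMPLATE = """You are grading the answer of a RAG system.
Question: {query}
Retrieved chunks:
{chunks}
Answer: {answer}
Rate faithfulness (is every claim supported by the chunks?) and
relevance (does the answer address the question?) from 1 to 5.
Reply with JSON only: {{"faithfulness": <int>, "relevance": <int>, "reason": "<short>"}}"""


def build_judge_prompt(query: str, chunks: List[str], answer: str) -> str:
    """Fill the template, numbering chunks so the judge can cite them."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, 1))
    return JUDGE_TEMPLATE.format(query=query, chunks=numbered, answer=answer)


def parse_verdict(raw: str) -> Dict:
    """Parse the judge's JSON reply and reject scores outside 1-5."""
    verdict = json.loads(raw)
    for key in ("faithfulness", "relevance"):
        score = verdict[key]
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {score!r}")
    return verdict


def judge(query: str, chunks: List[str], answer: str,
          call_model: Callable[[str], str]) -> Dict:
    """call_model is a hypothetical function: prompt text in, reply text out."""
    return parse_verdict(call_model(build_judge_prompt(query, chunks, answer)))
```

Validating the reply before recording it is a deliberate choice: a judge that returns malformed output should fail loudly rather than contribute a silent zero to your averages.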
How to build an evaluation dataset and detect hallucination
Every serious evaluation starts with a reference dataset: (question, expected answer) pairs built from your real documents. You can generate them synthetically — ask an LLM to read each chunk and produce plausible questions — then manually filter for the most representative cases and hard cases (ambiguous, multi-hop, unanswerable). Fifty to a hundred well-chosen pairs are worth more than a thousand generated without curation.
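One possible shape for a reference record is sketched below. The `relevant_chunk_ids` and `tags` fields go beyond the bare (question, expected answer) pair and are assumptions: the first is what the retrieval metrics need, the second marks hard cases. The sample questions and ids are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class EvalExample:
    question: str
    expected_answer: str
    # Assumed extra fields, not part of the basic pair:
    relevant_chunk_ids: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)  # e.g. "multi-hop", "unanswerable"


def to_jsonl(examples: List[EvalExample]) -> str:
    """Serialize one example per line so the dataset diffs cleanly in git."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)


def from_jsonl(text: str) -> List[EvalExample]:
    """Load examples back, skipping blank lines."""
    return [EvalExample(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Illustrative records, including an unanswerable case with no relevant chunk.
dataset = [
    EvalExample("What is the refund window?", "30 days from delivery.",
                relevant_chunk_ids=["policy-12"], tags=["simple"]),
    EvalExample("Who approved the 2019 budget?", "The documents do not say.",
                relevant_chunk_ids=[], tags=["unanswerable"]),
]
restored = from_jsonl(to_jsonl(dataset))
```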
To detect hallucination, the most practical technique is claim decomposition: break the generated answer into atomic sentences and, for each one, ask the LLM-judge whether it is supported by any of the retrieved chunks. An unsupported claim is a hallucination. This is more precise than evaluating the whole answer at once, because the model may be 90% correct and hallucinate on one critical detail.
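The claim-decomposition idea can be sketched as follows. Two simplifications are assumptions: claims are split with a naive sentence regex rather than by an LLM, and `is_supported` is a hypothetical callable that would wrap your LLM-judge or NLI model. The substring check at the bottom is only a stand-in to make the example runnable.

```python
import re
from typing import Callable, List, Tuple


def split_claims(answer: str) -> List[str]:
    """Naive decomposition: one claim per sentence. An LLM can do this better."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [part for part in parts if part]


def faithfulness(answer: str, chunks: List[str],
                 is_supported: Callable[[str, List[str]], bool]) -> Tuple[float, List[str]]:
    """Return (share of supported claims, list of unsupported claims)."""
    claims = split_claims(answer)
    if not claims:
        return 0.0, []  # assumption: an empty answer should not look perfectly faithful
    unsupported = [claim for claim in claims if not is_supported(claim, chunks)]
    return 1 - len(unsupported) / len(claims), unsupported


def substring_judge(claim: str, chunks: List[str]) -> bool:
    """Stand-in for the real judge: supported if the claim text appears in a chunk."""
    needle = claim.rstrip(".!?").lower()
    return any(needle in chunk.lower() for chunk in chunks)


score, hallucinated = faithfulness(
    "The API limit is 100 requests. The limit resets hourly.",
    ["The API limit is 100 requests per minute."],
    substring_judge,
)
```

Returning the unsupported claims alongside the score is what makes the metric actionable: you can see that the answer was half right and exactly which detail was invented.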
An important operational detail: measure cost and latency alongside quality. Increasing k from 5 to 10 may improve recall by a few percentage points, but it roughly doubles the retrieved-context tokens, which raises input cost and usually latency. That trade-off needs to appear on the same dashboard. Changes to chunking, embedding model, search strategy, and system prompt should be treated as experiments with recorded metrics, not informal tweaks.
RAG evaluation pipeline
Flow of how a RAG experiment is evaluated: reference dataset feeds both the RAG pipeline and the judgment process, producing retrieval and generation metrics separately.
- Reference dataset · (query, expected answer)
- Real documents · (indexed chunks)
- Retriever · (hybrid search + rerank)
- Generator LLM · (Bedrock)
- Generated answer · + chunks used
- Retrieval metrics · hit rate · precision · recall
- LLM-as-judge · (faithfulness · relevance)
- Human annotation · (periodic calibration)
- Metrics dashboard · quality · cost · latency
Continuous evaluation: every change is an experiment
Evaluation isn't something you do once before going to production. Every change to the pipeline — chunking strategy, embedding model, k value, system prompt, generator model — should trigger an evaluation run on the reference dataset. Without that, you accumulate changes without knowing which one helped and which introduced a regression.
The minimum structure I recommend: version each pipeline configuration (a hash of the relevant parameters is enough), run the reference dataset, and record the metrics in a comparison table. If each run changes only one parameter, a drop in a metric points directly at the change that caused it.
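A configuration hash can be as small as this sketch. The parameter names in the sample config are placeholders, and truncating the digest to 12 characters is an arbitrary choice made for readable table rows.

```python
import hashlib
import json
from typing import Dict


def config_id(config: Dict) -> str:
    """Stable short id for a pipeline configuration."""
    # sort_keys makes the id independent of the order keys were written in.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Placeholder parameter names and values, for illustration only.
baseline_config = {"chunk_size": 512, "embedding_model": "model-a", "k": 5, "prompt_version": 3}
candidate_config = {**baseline_config, "k": 10}

baseline_id = config_id(baseline_config)
candidate_id = config_id(candidate_config)
```

Storing this id with every metrics row lets you trace any number in the comparison table back to the exact parameters that produced it.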
In production, add a layer of online evaluation: sample real queries, run the LLM-judge asynchronously, and alert when faithfulness drops below a threshold. This catches problems that the synthetic dataset didn't anticipate — new documents arriving in the index, shifts in the distribution of user questions, model behavior drift after updates.
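A rough sketch of that online layer: sample a fraction of traffic, keep a rolling window of judged faithfulness scores, and signal when the window's mean falls below a threshold. The class name, the default threshold, window size, and sample rate are all assumptions to tune for your own traffic. The asynchronous judge call and the alerting channel are left out.

```python
import random
from collections import deque
from typing import Optional


class FaithfulnessMonitor:
    """Rolling-mean alert over sampled, judged production queries."""

    def __init__(self, threshold: float = 0.8, window: int = 50,
                 sample_rate: float = 0.1, rng: Optional[random.Random] = None):
        self.threshold = threshold
        self.sample_rate = sample_rate
        self.scores = deque(maxlen=window)  # oldest score drops out automatically
        self.rng = rng or random.Random()

    def should_sample(self) -> bool:
        """Decide whether this production query is sent to the judge."""
        return self.rng.random() < self.sample_rate

    def record(self, score: float) -> bool:
        """Store a judged score; return True when an alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # wait for a full window before alerting
        return sum(self.scores) / len(self.scores) < self.threshold
```

Alerting on a rolling mean rather than on single scores is a deliberate choice here, since the judge itself makes mistakes and one low score is weak evidence of a regression.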
Latency and cost belong here too. A more expensive generator model might improve faithfulness by a few points — but if cost per query triples, a better prompt with the current model might be the right call. Only the numbers tell you. The following lessons cover how Bedrock Knowledge Bases manages part of this pipeline, and how guardrails complement evaluation with real-time controls.
How to set up your RAG evaluation from scratch
1. Create the reference dataset. Generate (query, expected answer) pairs synthetically with an LLM over your real documents. Manually curate 50–100 cases, including hard and unanswerable cases.
2. Measure retrieval separately. For each query, compare returned chunks against known relevant chunks. Compute hit rate, precision@k, and recall@k before looking at generation.
3. Evaluate generation with LLM-as-judge. Decompose the answer into atomic claims. For each one, ask the judge if it's supported by the chunks. Also evaluate whether the answer addresses the original question.
4. Record cost and latency in the same run. Input and output tokens, p50/p95 response time. Quality without cost is an incomplete metric for architecture decisions.
5. Version and compare. Each pipeline configuration gets an identifier. No change goes to production without a row in the metrics comparison table.
6. Add online evaluation in production. Sample real queries, run the judge asynchronously, alert on regressions. Calibrate the judge with human annotation periodically.
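The comparison in step 5 can be automated with a small check like the sketch below. The metric names, the 2% relative tolerance, and every number in the two sample rows are invented for illustration. The only real logic is that quality metrics regress when they fall, while latency and cost regress when they rise.

```python
from typing import Dict

# Assumed metric names; for these, an increase is the bad direction.
LOWER_IS_BETTER = {"p95_latency_ms", "cost_per_query_usd"}


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                rel_tolerance: float = 0.02) -> Dict[str, float]:
    """Return metrics that got worse by more than rel_tolerance, with their relative change."""
    flagged = {}
    for name, base in baseline.items():
        change = (candidate[name] - base) / base if base else 0.0
        if name in LOWER_IS_BETTER:
            worse = change > rel_tolerance
        else:
            worse = change < -rel_tolerance
        if worse:
            flagged[name] = round(change, 4)
    return flagged


# Made-up rows from a comparison table: k raised from 5 to 10.
baseline_row = {"recall_at_k": 0.80, "faithfulness": 0.90,
                "p95_latency_ms": 1200.0, "cost_per_query_usd": 0.010}
candidate_row = {"recall_at_k": 0.84, "faithfulness": 0.85,
                 "p95_latency_ms": 1210.0, "cost_per_query_usd": 0.020}

flagged = regressions(baseline_row, candidate_row)
```

In this invented example recall improves, but the check still flags faithfulness and cost, which is the kind of trade-off a single headline metric would hide.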
Frequently asked questions about RAG evaluation
Can I use RAGAS or ready-made frameworks?
Yes, RAGAS and similar frameworks implement these metrics and work well as a starting point. The important thing is understanding what each metric measures to interpret results correctly — frameworks don't replace that understanding.
What's the minimum size for the reference dataset?
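There's no magic number, but 50 well-curated pairs are usually enough to detect large regressions. With fewer, each query carries too much weight: at 50 pairs a single flipped result already moves hit rate by 2 percentage points, so small differences between runs are mostly noise.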
Does high faithfulness guarantee the answer is correct?
No. Faithfulness measures whether the answer is supported by the retrieved chunks — not whether the chunks are true. If the source document contains wrong information, high faithfulness doesn't help. That's why knowledge base quality matters as much as the pipeline.
How do I measure cost per query on Bedrock?
Bedrock returns input and output token counts in each response. Multiply by the published model prices. Record this alongside quality metrics to have the complete picture of each experiment.
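The arithmetic is a one-liner, sketched below. It assumes the token counts arrive in a `usage` dictionary with `inputTokens` and `outputTokens` keys, which is the shape used by the Bedrock Converse API; other invocation paths may expose the counts differently. The prices are placeholders, so look up the current published price for your model and region.

```python
from typing import Dict


def query_cost_usd(usage: Dict[str, int],
                   input_price_per_1k: float,
                   output_price_per_1k: float) -> float:
    """Cost of one query from its token counts and per-1,000-token prices."""
    input_cost = usage["inputTokens"] / 1000 * input_price_per_1k
    output_cost = usage["outputTokens"] / 1000 * output_price_per_1k
    return input_cost + output_cost


# Placeholder numbers, not real prices.
example_usage = {"inputTokens": 2000, "outputTokens": 500}
cost = query_cost_usd(example_usage, input_price_per_1k=0.003, output_price_per_1k=0.015)
```

Because retrieved chunks land in the input side, this is also where a larger k shows up: input tokens grow while output tokens stay roughly the same.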
Module 2 Wrap-up
You've reached the end of Module 2 with what really matters: not just how to retrieve better (hybrid search, reranking, filters, routing), but how to know if you're actually retrieving better. Evaluation is what turns experiments into decisions. A RAG pipeline without metrics is a system you operate in the dark — and in production, dark is expensive. Module 3 starts with Amazon Bedrock Knowledge Bases: how AWS's managed service implements much of what we covered in this module, where it saves you work, and where you still need to make your own choices.