Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

Production RAG on AWS

Module 2 · Retrieval · Lesson 08/12

Evaluating RAG: faithfulness and relevance

Without measuring, you can't improve. How to evaluate a RAG's retrieval and generation.

You tuned the chunking, swapped the embedding model, rewrote the prompt — but how do you know if things got better? Without metrics, you're flying blind. Evaluation isn't a final step: it's the instrument that guides every decision in your RAG pipeline.

Two independent halves to evaluate

A RAG pipeline has two distinct jobs: retrieve the right chunks and generate a response that is faithful to them. These two jobs fail in different ways, so they need different metrics.

On the retrieval side, the question is: did the relevant chunks show up in the returned list? You measure this with hit rate (is at least one relevant chunk in the top-k?), precision@k (of the k chunks returned, how many are actually useful?), and recall@k (of all relevant chunks that exist, how many were captured?). A retriever with low precision floods the context with noise; one with low recall leaves critical information out.
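The three retrieval metrics above can be sketched in a few lines. This is a minimal illustration, assuming each query has a known set of relevant chunk IDs and that the retriever returns a ranked list of unique IDs; the function names and sample IDs are made up for the example.

```python
from typing import Sequence, Set


def hit(retrieved: Sequence[str], relevant: Set[str], k: int) -> bool:
    """True if at least one relevant chunk appears in the top-k."""
    return any(chunk_id in relevant for chunk_id in retrieved[:k])


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant chunks found in the top-k, divided by k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant chunks found in the top-k, divided by all relevant chunks."""
    if not relevant:
        return 0.0  # assumption: unanswerable queries are scored separately
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_rate(hits: Sequence[bool]) -> float:
    """Share of queries with at least one hit."""
    return sum(hits) / len(hits)


# Hypothetical query: 5 chunks returned, 3 chunks known to be relevant.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}

print(hit(retrieved, relevant, k=5))             # True
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 2 of 3 relevant chunks captured
```

Here precision is 2/5 because three of the five returned chunks are noise, and recall is 2/3 because chunk c8 never made it into the list.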

On the generation side, the question splits in two: is the answer faithful to the sources (faithfulness / groundedness) and does it answer the question (answer relevance)? Faithfulness detects hallucination — the model asserted something not present in any retrieved chunk. Answer relevance detects answers that may be true but don't actually address what was asked.

Misdiagnosing which half is broken is the most expensive mistake in RAG. If the retriever brings the wrong chunks, improving the prompt won't fix it. If the retriever is fine but the model hallucinates, swapping the embedding won't help. Measuring both halves separately is what lets you act in the right place.

RAG metrics: what each one measures

Metric	Half	What it detects	How to compute
Hit Rate	Retrieval	At least 1 relevant chunk in top-k	% of queries with a hit
Precision@k	Retrieval	Noise in the context sent to the LLM	Relevant chunks / k returned
Recall@k	Retrieval	Relevant information missed	Relevant chunks captured / total available
Faithfulness	Generation	Hallucination: claims unsupported by chunks	LLM-as-judge or NLI per claim
Answer Relevance	Generation	Answer doesn't address the question	Semantic similarity answer ↔ query
Latency (p50/p95)	System	Experience degradation	End-to-end response time
Cost per query	System	Financial impact of changes	(Input + output tokens) × model price

In practice: LLM-as-judge is the realistic starting point

Human annotation is the gold standard, but it doesn't scale for rapid iteration. What works day-to-day is using an LLM (Claude 3 Sonnet or Haiku on Bedrock, for example) as a judge: you pass the query, the retrieved chunks, and the generated answer, and ask the model to score faithfulness and relevance on a structured scale. It's not perfect, since the judge makes mistakes too, but it's consistent enough to detect regressions between versions. The course site has eval studies running on Bedrock with real examples of judgment prompts. Use human annotation to calibrate the judge periodically, not for every experiment.
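A judge call can be sketched as a prompt builder plus a strict parser. This is only an illustration: the prompt wording, the 1 to 5 scale, and the JSON shape are assumptions for the example, not the course's actual judgment prompts. The model call is passed in as `call_model`, a hypothetical callable that would wrap whatever Bedrock client you use.

```python
import json
from typing import Callable, Dict, List

# Assumed prompt and output format; adapt both to your own rubric.
JUDGE_TEMPLATE = """You are grading a RAG answer.

Question:
{query}

Retrieved chunks:
{chunks}

Answer:
{answer}

Score faithfulness (is every claim supported by the chunks?) and relevance
(does the answer address the question?) from 1 to 5.
Reply with JSON only: {{"faithfulness": <int>, "relevance": <int>, "reason": "<short>"}}"""


def build_judge_prompt(query: str, chunks: List[str], answer: str) -> str:
    """Number the chunks so the judge can refer to them in its reason."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return JUDGE_TEMPLATE.format(query=query, chunks=numbered, answer=answer)


def judge(query: str, chunks: List[str], answer: str,
          call_model: Callable[[str], str]) -> Dict[str, object]:
    """Ask the judge model for scores and reject malformed verdicts."""
    raw = call_model(build_judge_prompt(query, chunks, answer))
    verdict = json.loads(raw)  # raises if the judge did not return JSON
    for key in ("faithfulness", "relevance"):
        score = verdict.get(key)
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid {key} score: {score!r}")
    return verdict
```

Rejecting malformed verdicts rather than guessing keeps bad judge outputs out of the comparison table; counting how often that happens is itself a useful signal about the judge.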

How to build an evaluation dataset and detect hallucination

Every serious evaluation starts with a reference dataset: (question, expected answer) pairs built from your real documents. You can generate them synthetically — ask an LLM to read each chunk and produce plausible questions — then manually filter for the most representative cases and hard cases (ambiguous, multi-hop, unanswerable). Fifty to a hundred well-chosen pairs are worth more than a thousand generated without curation.

To detect hallucination, the most practical technique is claim decomposition: break the generated answer into atomic sentences and, for each one, ask the LLM-judge whether it is supported by any of the retrieved chunks. An unsupported claim is a hallucination. This is more precise than evaluating the whole answer at once, because the model may be 90% correct and hallucinate on one critical detail.
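Claim decomposition can be sketched as follows. The sentence splitter is deliberately naive, and `is_supported` is a hypothetical callable standing in for the LLM-judge (or an NLI model); in a real pipeline you would likely ask an LLM to do the decomposition as well.

```python
import re
from typing import Callable, List, Tuple


def split_claims(answer: str) -> List[str]:
    """Naive decomposition: treat each sentence as one atomic claim."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [part for part in parts if part]


def faithfulness(answer: str, chunks: List[str],
                 is_supported: Callable[[str, List[str]], bool]) -> Tuple[float, List[str]]:
    """Return the share of supported claims and the list of unsupported ones."""
    claims = split_claims(answer)
    if not claims:
        # No claims means nothing unsupported; answer relevance should catch empty answers.
        return 1.0, []
    unsupported = [claim for claim in claims if not is_supported(claim, chunks)]
    return 1 - len(unsupported) / len(claims), unsupported


def substring_judge(claim: str, chunks: List[str]) -> bool:
    """Toy stand-in for the judge: a claim counts as supported if a chunk contains it."""
    needle = claim.lower().rstrip(".!?")
    return any(needle in chunk.lower() for chunk in chunks)


score, unsupported = faithfulness(
    "The limit is 5000 BRL. Transfers settle in 10 seconds.",
    ["The limit is 5000 BRL per day."],
    substring_judge,
)
print(score)        # 0.5
print(unsupported)  # ['Transfers settle in 10 seconds.']
```

Returning the unsupported claims alongside the score matters in practice: the list tells you which detail was hallucinated, not just that something was.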

An important operational detail: measure cost and latency alongside quality. Increasing k from 5 to 10 may improve recall by a few percentage points, but it roughly doubles the retrieved-context tokens sent to the generator, and input cost and latency rise with them. That trade-off needs to appear on the same dashboard. Changes to chunking, embedding model, search strategy, and system prompt should be treated as experiments with recorded metrics, not informal tweaks.
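To make the trade-off concrete, here is a small sketch of the two numbers that sit next to quality on the dashboard: latency percentiles and context size as a function of k. The nearest-rank percentile is one of several common definitions, and the chunk size, prompt overhead, and latency samples are invented for illustration.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need at least one value and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def context_tokens(k: int, avg_chunk_tokens: int, prompt_overhead: int) -> int:
    """Approximate input tokens: k retrieved chunks plus the fixed prompt."""
    return k * avg_chunk_tokens + prompt_overhead


# Hypothetical end-to-end latencies in milliseconds for ten queries.
latencies_ms = [120, 135, 150, 160, 180, 200, 240, 300, 450, 900]
print(percentile(latencies_ms, 50))  # 180
print(percentile(latencies_ms, 95))  # 900

# Assumed 300 tokens per chunk and 200 tokens of system prompt plus question.
print(context_tokens(5, 300, 200))   # 1700
print(context_tokens(10, 300, 200))  # 3200
```

The p95 is dominated by the single slow query, which is exactly why it belongs next to p50: the median alone would hide it.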

RAG evaluation pipeline

Flow of how a RAG experiment is evaluated: reference dataset feeds both the RAG pipeline and the judgment process, producing retrieval and generation metrics separately.

📋 Evaluation data

Reference dataset · (query, expected answer)
Real documents · (indexed chunks)

🔍 RAG pipeline under test

Retriever · (hybrid search + rerank)
Generator LLM · (Bedrock)
Generated answer · + chunks used

⚖️ Judgment

Retrieval metrics · hit rate · precision · recall
LLM-as-judge · (faithfulness · relevance)
Human annotation · (periodic calibration)

📊 Results

Metrics dashboard · quality · cost · latency

Continuous evaluation: every change is an experiment

Evaluation isn't something you do once before going to production. Every change to the pipeline — chunking strategy, embedding model, k value, system prompt, generator model — should trigger an evaluation run on the reference dataset. Without that, you accumulate changes without knowing which one helped and which introduced a regression.

The minimum structure I recommend: version each pipeline configuration (this can be as simple as a hash of the relevant parameters), run the reference dataset, and record the metrics in a comparison table. If each run changes one parameter at a time, a drop in a metric points directly at the change that caused it.
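A configuration hash can be as small as this. The sketch assumes the configuration is a JSON-serializable dict; the parameter names and values are placeholders, not a prescribed schema.

```python
import hashlib
import json


def config_id(config: dict) -> str:
    """Short, stable identifier for a pipeline configuration.

    Keys are sorted before hashing so the same parameters always give the
    same ID, regardless of the order they were written in.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical configurations: only k differs between the two runs.
baseline = {"chunk_size": 512, "embedding_model": "model-a", "k": 5, "prompt_version": "v3"}
candidate = {**baseline, "k": 10}

print(config_id(baseline))
print(config_id(candidate))
```

Storing the ID next to every metrics row is what makes the comparison table traceable back to an exact set of parameters.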

In production, add a layer of online evaluation: sample real queries, run the LLM-judge asynchronously, and alert when faithfulness drops below a threshold. This catches problems that the synthetic dataset didn't anticipate — new documents arriving in the index, shifts in the distribution of user questions, model behavior drift after updates.
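One way to sketch the online layer is a sampler plus a rolling-average alert. The class name, sample rate, window size, and threshold below are all illustrative assumptions; the faithfulness score itself would come from the asynchronous judge run.

```python
import random
from collections import deque
from statistics import mean
from typing import Optional


class FaithfulnessMonitor:
    """Samples production queries and alerts when rolling faithfulness drops."""

    def __init__(self, sample_rate: float, threshold: float, window: int,
                 rng: Optional[random.Random] = None) -> None:
        self.sample_rate = sample_rate
        self.threshold = threshold
        self._scores = deque(maxlen=window)  # keeps only the last `window` scores
        self._rng = rng or random.Random()

    def should_sample(self) -> bool:
        """Decide whether this query is sent to the judge."""
        return self._rng.random() < self.sample_rate

    def record(self, score: float) -> bool:
        """Store a judged score; return True if an alert should fire."""
        self._scores.append(score)
        if len(self._scores) < self._scores.maxlen:
            return False  # not enough data yet to trust the average
        return mean(self._scores) < self.threshold


monitor = FaithfulnessMonitor(sample_rate=0.05, threshold=0.8, window=3)
for score in (0.9, 0.9, 0.5):
    alert = monitor.record(score)
print(alert)  # True: the rolling mean of the last 3 scores is below 0.8
```

Waiting for a full window before alerting trades a little detection delay for fewer false alarms from a single bad sample.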

Latency and cost belong here too. A more expensive generator model might improve faithfulness by a few points — but if cost per query triples, a better prompt with the current model might be the right call. Only the numbers tell you. The following lessons cover how Bedrock Knowledge Bases manages part of this pipeline, and how guardrails complement evaluation with real-time controls.

How to set up your RAG evaluation from scratch

1. Create the reference dataset. Generate (query, expected answer) pairs synthetically with an LLM over your real documents. Manually curate 50–100 cases, including hard and unanswerable cases.
2. Measure retrieval separately. For each query, compare returned chunks against known relevant chunks. Compute hit rate, precision@k, and recall@k before looking at generation.
3. Evaluate generation with LLM-as-judge. Decompose the answer into atomic claims. For each one, ask the judge if it's supported by the chunks. Also evaluate whether the answer addresses the original question.
4. Record cost and latency in the same run. Input and output tokens, p50/p95 response time. Quality without cost is an incomplete metric for architecture decisions.
5. Version and compare. Each pipeline configuration gets an identifier. No change goes to production without a row in the metrics comparison table.
6. Add online evaluation in production. Sample real queries, run the judge asynchronously, alert on regressions. Calibrate the judge with human annotation periodically.
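The "version and compare" step can be reduced to a small check between two rows of the comparison table. This is a sketch under assumptions: the metric names, the 2% relative tolerance, and the set of lower-is-better metrics are illustrative choices, not fixed rules.

```python
from typing import Dict, Iterable, List

# Assumed names for metrics where a higher value is worse.
LOWER_IS_BETTER = {"p95_latency_ms", "cost_per_query_usd"}


def find_regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                     tolerance: float = 0.02,
                     lower_is_better: Iterable[str] = LOWER_IS_BETTER) -> List[str]:
    """Return the metrics that got worse by more than `tolerance` (relative)."""
    lower = set(lower_is_better)
    flagged = []
    for name, base in baseline.items():
        new = candidate[name]
        if name in lower:
            worse = new > base * (1 + tolerance)
        else:
            worse = new < base * (1 - tolerance)
        if worse:
            flagged.append(name)
    return flagged


# Hypothetical rows from the comparison table.
baseline = {"recall@5": 0.80, "faithfulness": 0.90, "p95_latency_ms": 1200}
candidate = {"recall@5": 0.84, "faithfulness": 0.85, "p95_latency_ms": 1500}

print(find_regressions(baseline, candidate))  # ['faithfulness', 'p95_latency_ms']
```

In this example recall improved, but the change would still be held back because faithfulness and p95 latency both regressed beyond the tolerance.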

Frequently asked questions about RAG evaluation

Can I use RAGAS or ready-made frameworks?

Yes, RAGAS and similar frameworks implement these metrics and work well as a starting point. The important thing is understanding what each metric measures to interpret results correctly — frameworks don't replace that understanding.

What's the minimum size for the reference dataset?

There's no magic number, but 50 well-curated pairs are enough to detect significant regressions. Below that, statistical variance is too high to trust comparisons.
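A rough way to see why small datasets are noisy: if a metric is a pass/fail rate over n independent queries, its margin of error shrinks with the square root of n. The sketch below uses the normal approximation to the binomial, which is a simplifying assumption and only a guide for small n.

```python
import math


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n queries."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


for n in (20, 50, 200):
    print(n, round(margin_of_error(0.8, n), 3))
# 20 0.175
# 50 0.111
# 200 0.055
```

With 50 pairs and a pass rate around 80%, differences smaller than roughly 11 percentage points are hard to distinguish from noise, so 50 pairs catch large regressions but not subtle ones.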

Does high faithfulness guarantee the answer is correct?

No. Faithfulness measures whether the answer is supported by the retrieved chunks — not whether the chunks are true. If the source document contains wrong information, high faithfulness doesn't help. That's why knowledge base quality matters as much as the pipeline.

How to measure cost per query on Bedrock?

Bedrock returns input and output token counts in each response. Multiply by the published model prices. Record this alongside quality metrics to have the complete picture of each experiment.
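The arithmetic is simple enough to sketch directly. The prices below are placeholders, not current Bedrock prices, and the token counts are invented; substitute the counts your responses actually report and the published price for your model and region.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one query in the currency of the prices (typically USD)."""
    return (input_tokens / 1000) * input_price_per_1k \
        + (output_tokens / 1000) * output_price_per_1k


# Placeholder prices per 1,000 tokens; check the published pricing page.
INPUT_PRICE = 0.003
OUTPUT_PRICE = 0.015

# Hypothetical token counts for the same question at k=5 and k=10.
cost_k5 = cost_per_query(1700, 400, INPUT_PRICE, OUTPUT_PRICE)
cost_k10 = cost_per_query(3200, 400, INPUT_PRICE, OUTPUT_PRICE)

print(f"{cost_k5:.4f}")   # 0.0111
print(f"{cost_k10:.4f}")  # 0.0156
```

Because output tokens are priced separately and stay constant here, doubling the retrieved context raises the cost by about 40% rather than doubling it; the exact ratio depends on your own token mix and prices.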

Module 2 Wrap-up

Module 2 complete

You've reached the end of Module 2 with what really matters: not just how to retrieve better (hybrid search, reranking, filters, routing), but how to know if you're actually retrieving better. Evaluation is what turns experiments into decisions. A RAG pipeline without metrics is a system you operate in the dark — and in production, dark is expensive. Module 3 starts with Amazon Bedrock Knowledge Bases: how AWS's managed service implements much of what we covered in this module, where it saves you work, and where you still need to make your own choices.

References and further reading

RAGAS: Automated Evaluation of RAG Pipelines
Amazon Bedrock: Model evaluation
AWS Blog: Evaluate the reliability of RAG applications using Amazon Bedrock
ARES: An Automated Evaluation Framework for RAG Systems
Benchmarking Large Language Models in Complex Medical Answering (faithfulness methodology reference)
