Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

The AI Architect Track

Module 2 · From model to application · Lesson 09/22

Evaluation (evals) and hallucination: how to know it's good

Without evals you're in the dark. How to measure quality, cost and hallucination of an AI system.

You tested the system, the answers looked fine, and you shipped to production. Two weeks later, a user reports that the model invented a policy that doesn't exist. Without systematic evaluation, you have no way of knowing whether the system works — you only find out when something breaks in production. Evals aren't bureaucracy: they're the only instrument that turns 'looks good' into 'we know it's good'.

Why 'looks good' is not a metric

When you evaluate an AI system manually — read ten outputs, find them reasonable, move on — you're doing confirmation-bias sampling. You tend to notice what works and normalize what fails.

The problem compounds with LLMs because they are fluent by nature. A wrong answer written confidently and with good grammar looks more correct than a right answer written hesitantly. Fluency is not factuality.

On top of that, AI systems have non-deterministic failure surfaces: the same prompt can produce different outputs depending on temperature, model version, or context shifts. A one-off manual evaluation captures a moment, not a behavior.

What you need is a test dataset with ground truth — known inputs, expected outputs, clear acceptance criteria — and a process that runs those cases every time you change something: the prompt, the model, the RAG chunking, the tool order. Without that, every change is a leap in the dark. With it, every change has a score.
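A minimal sketch of that process, assuming a hypothetical `EvalCase` structure and a `system` callable that stands in for your application. The normalized exact-match criterion is also an assumption; swap in whatever acceptance criteria fit your task.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One ground-truth example: a known input and its expected output."""
    case_id: str
    question: str
    expected: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise does not fail a case."""
    return " ".join(text.lower().split())


def run_evals(cases: list[EvalCase], system: Callable[[str], str]) -> dict:
    """Run every case through the system; return the score and the failing case ids."""
    if not cases:
        raise ValueError("the eval dataset is empty")
    failed = [
        case.case_id
        for case in cases
        if normalize(system(case.question)) != normalize(case.expected)
    ]
    return {"score": (len(cases) - len(failed)) / len(cases), "failed": failed}


def fake_system(question: str) -> str:
    """Hypothetical stand-in for the real application (prompt + model + RAG)."""
    canned = {"What is the refund window?": "30 days"}
    return canned.get(question, "I don't know")


cases = [
    EvalCase("refund-window", "What is the refund window?", "30 Days"),
    EvalCase("card-limit", "What is the daily card limit?", "5,000 per day"),
]
report = run_evals(cases, fake_system)
print(report)  # {'score': 0.5, 'failed': ['card-limit']}
```

Run the same function after every prompt, model, or chunking change and keep the reports: the list of failing ids tells you which behavior moved, not just that the score did.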

Continuous evaluation cycle of an AI system

Evals are not a one-time pre-deploy step — they are a cycle that closes the loop between change and evidence.

🧪 Evaluation set

Reference dataset · inputs + ground truth
New cases · from production failures

⚙️ Eval pipeline

Eval runner · runs every case
LLM-as-judge · correctness / hallucination
Task metrics · F1, ROUGE, exact match

📊 Results

Scorecard · correctness, cost, latency
Regression detector · compares against baseline

🔁 Action

Decision · promote or block
Production · active observability

The four types of evaluation — and when to use each

Reference dataset with ground truth is the most reliable type. You have (question, expected answer) pairs created by humans, and you compare the model's output against that ground truth using deterministic metrics (exact match, F1, ROUGE). It's cheap to run, reproducible, and great for regression. The limitation: building the dataset costs time, and it goes stale if the domain changes.
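As an illustration of the deterministic metrics mentioned above, here is a small sketch of exact match and token-overlap F1, in the style commonly used for extractive question answering. The whitespace tokenizer is a simplifying assumption: it does not strip punctuation, so a real setup would normalize more carefully.

```python
from collections import Counter


def tokens(text: str) -> list[str]:
    """Naive tokenizer: lowercase and split on whitespace (an assumption of this sketch)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings contain the same tokens in the same order."""
    return tokens(prediction) == tokens(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    pred, ref = tokens(prediction), tokens(reference)
    if not pred or not ref:
        return float(pred == ref)
    # Counter intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("30 Days", "30 days"))                     # True
print(token_f1("the refund window is 30 days", "30 days"))   # 0.5
```

The second call shows why you usually want both numbers: the answer contains the right fact (recall 1.0) but is padded with extra tokens (precision 1/3), so exact match fails while F1 gives partial credit.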

LLM-as-judge uses a model (typically a more capable one, like Claude Opus or GPT-4o) to evaluate another model's output against criteria you define in an evaluation prompt: 'is the answer factually correct? is it grounded in the provided context? is it concise?'. It scales well and works for subjective dimensions — tone, completeness, absence of hallucination — that automatic metrics can't capture. The risk is judge bias: the evaluator model has its own preferences and can be fooled by fluent outputs.
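A sketch of what the evaluation prompt and the parsing around it might look like. The `call_judge` parameter is a hypothetical callable wrapping whichever model client you use; no specific provider API is assumed, and the JSON verdict format is an illustrative choice, not a standard.

```python
import json
from typing import Callable

JUDGE_PROMPT = """You are grading an answer produced by another AI system.

Context provided to the system:
{context}

Question:
{question}

Answer to grade:
{answer}

Criteria:
1. correct: is the answer factually correct?
2. grounded: is every claim supported by the context above?
3. concise: is the answer free of unnecessary content?

Reply with JSON only, for example:
{{"correct": true, "grounded": true, "concise": false, "justification": "..."}}"""

CRITERIA = ("correct", "grounded", "concise")


def judge_answer(question: str, context: str, answer: str,
                 call_judge: Callable[[str], str]) -> dict:
    """Build the evaluation prompt, call the judge, and validate its verdict."""
    prompt = JUDGE_PROMPT.format(context=context, question=question, answer=answer)
    raw = call_judge(prompt)
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid": False, "raw": raw}
    # Reject verdicts that are missing a criterion or use non-boolean scores.
    if not isinstance(verdict, dict) or not all(
        isinstance(verdict.get(name), bool) for name in CRITERIA
    ):
        return {"valid": False, "raw": raw}
    return {"valid": True, **verdict}


def stub_judge(prompt: str) -> str:
    """Hypothetical judge used only to make this sketch runnable offline."""
    return ('{"correct": false, "grounded": false, "concise": true, '
            '"justification": "The policy is not in the context."}')


verdict = judge_answer(
    question="Can I get a refund after 90 days?",
    context="Refunds are accepted within 30 days of purchase.",
    answer="Yes, our extended policy covers refunds up to 120 days.",
    call_judge=stub_judge,
)
print(verdict["grounded"], verdict["justification"])
```

Treating an unparseable verdict as invalid, rather than as a pass or a fail, keeps judge malfunctions from silently moving your score.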

Task metrics are specific to what the system does. An entity-extraction system has precision and recall. A classification system has accuracy and F1. A code-generation system has compilation rate and test-pass rate. Define these metrics before you build — they force clarity about what 'working' means.
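For the entity-extraction case, precision and recall can be computed from two sets. This is a minimal sketch that assumes entities are compared as exact strings; the example entities are made up.

```python
def precision_recall(predicted: set[str], expected: set[str]) -> tuple[float, float]:
    """Precision: share of predicted entities that are right.
    Recall: share of expected entities that were found."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


predicted = {"ACME Bank", "Lisbon", "2024"}          # what the system extracted
expected = {"ACME Bank", "Lisbon", "Maria Silva"}    # what a human annotated
precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```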

Human evaluation is the gold standard, but it doesn't scale. Use it to calibrate the other methods (for example, you might find that your LLM judge agrees with humans 87% of the time), for edge cases, and for go/no-go decisions on critical launches.
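Calibrating a judge against humans can start as a simple agreement rate over a labeled sample. A minimal sketch, assuming both the judge and the human reviewers give a pass/fail label per case; the labels below are invented for illustration.

```python
def agreement(judge_labels: list[bool], human_labels: list[bool]) -> tuple[float, list[int]]:
    """Return the agreement rate and the indices where judge and human disagree."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    disagreements = [
        index
        for index, (judge, human) in enumerate(zip(judge_labels, human_labels))
        if judge != human
    ]
    rate = 1 - len(disagreements) / len(human_labels)
    return rate, disagreements


judge_labels = [True, True, False, True, False, True, True, False]
human_labels = [True, True, False, True, True, True, True, False]
rate, disagreements = agreement(judge_labels, human_labels)
print(rate, disagreements)  # 0.875 [4]
```

The disagreement indices matter as much as the rate: reading those cases is how you find out whether the evaluation prompt or the human guideline needs fixing.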

Evaluation types: comparison

Type	Setup cost	Scale	Objectivity	Best for
Dataset + ground truth	High (human creation)	High	High	Regression, CI/CD
LLM-as-judge	Medium (eval prompt)	High	Medium (judge bias)	Subjective dimensions, hallucination
Task metrics	Low (automatic)	High	High	Structured tasks (extraction, classification)
Human evaluation	High (expert time)	Low	High (gold standard)	Calibration, critical go/no-go

In practice: start small, but start

In practice, the biggest mistake I see is waiting for a perfect dataset before starting to measure. Start with 30 representative cases covering the most critical scenarios and the edge cases you already know. An LLM-as-judge with a well-written evaluation prompt running on those 30 cases is already far better than no evaluation. You'll iterate and grow the dataset over time. On my site I have specific case studies on how to structure evals for systems built with Bedrock AgentCore, including how to use the platform's native observability features to close the loop between production and evaluation.

What to measure: beyond correctness

Correctness and factuality are the core — is the answer right? Is it grounded in verifiable sources? Did it invent data? — but a production system requires more dimensions.

Hallucination deserves special attention. There are two main types: factual hallucination (the model asserts something false about the world) and grounding hallucination (the model cites or paraphrases something not in the context it received). In RAG systems, the second type is more treacherous because the model may ignore the retrieved documents and generate from its own knowledge — or worse, mix the two. A well-calibrated LLM-judge can detect this by comparing the output against the provided context.
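Before involving a judge model, a crude lexical check can already surface obvious grounding problems. The sketch below is a swapped-in heuristic, not the LLM-judge approach described above: it flags answer sentences whose content words mostly do not appear in the retrieved context. The 0.6 threshold and the "words longer than three characters" rule are arbitrary assumptions, and paraphrases will produce false positives.

```python
import re


def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric words longer than three characters."""
    return {word for word in re.findall(r"[a-z0-9]+", text.lower()) if len(word) > 3}


def ungrounded_sentences(answer: str, context: str, min_support: float = 0.6) -> list[str]:
    """Return answer sentences with too little lexical support in the context."""
    context_vocab = content_words(context)
    flagged = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue
        support = len(words & context_vocab) / len(words)
        if support < min_support:
            flagged.append(sentence)
    return flagged


context = "Refunds are accepted within 30 days of purchase with a valid receipt."
answer = ("Refunds are accepted within 30 days of purchase. "
          "Premium customers also receive free shipping on returns.")
print(ungrounded_sentences(answer, context))
# ['Premium customers also receive free shipping on returns.']
```

A check like this is cheap enough to run on every output; the sentences it flags are good candidates to send to a judge model or a human reviewer.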

Token cost is a business metric disguised as a technical one. If your system prompt is 4,000 tokens and you make 1 million calls per day, the system prompt alone accounts for 4 billion input tokens per day. Measure input and output tokens per call, and project cost per use case. Prompt changes that seem innocent can double the cost.
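The projection is simple arithmetic, sketched below. The prices are placeholders (3.00 and 15.00 USD per million tokens), not real quotes for any model; plug in your provider's current pricing.

```python
def daily_cost_usd(input_tokens_per_call: int, output_tokens_per_call: int,
                   calls_per_day: int, input_price_per_mtok: float,
                   output_price_per_mtok: float) -> float:
    """Project daily spend from per-call token counts and per-million-token prices."""
    input_cost = input_tokens_per_call * calls_per_day / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens_per_call * calls_per_day / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Hypothetical prices, for illustration only.
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00

# The 4,000-token system prompt at 1 million calls per day, input side only.
system_prompt_only = daily_cost_usd(4_000, 0, 1_000_000, INPUT_PRICE, OUTPUT_PRICE)
# The same traffic after a prompt change that doubles the system prompt.
doubled_prompt = daily_cost_usd(8_000, 0, 1_000_000, INPUT_PRICE, OUTPUT_PRICE)
print(system_prompt_only, doubled_prompt)  # 12000.0 24000.0
```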

Latency directly affects UX. Measure p50 and p95 — the median hides the slow cases that frustrate users. Latency is also affected by model choice, context size, and whether you're using streaming.
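A small sketch of why the two percentiles tell different stories, using the nearest-rank method (one of several percentile definitions) on made-up latencies.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented latencies in milliseconds: nine normal calls and one slow outlier.
latencies_ms = [420, 450, 470, 480, 500, 510, 530, 560, 600, 3900]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(p50, p95)  # 500 3900
```

The median looks healthy at 500 ms while p95 exposes the 3.9-second call that a user actually waited for.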

Security as an eval dimension means: did the system correctly refuse malicious inputs? Did it leak system prompt data? Did it execute tool calls it shouldn't have? This connects directly to Lesson 10, where we cover guardrails.

What you need to know about evals

Fluency is not factuality — a beautiful output can be completely wrong.

Evals are a continuous asset: run them on every prompt, model, or pipeline change.

LLM-as-judge scales where humans can't — but calibrate the judge with human evaluation first.

In RAG, specifically evaluate whether the output is grounded in the retrieved documents.

Cost and latency are eval metrics — not just monitoring metrics.

Production failures become new test cases — close the loop between observability and evals.

Evals as an engineering asset — and the connection to production

A well-built evaluation dataset is an engineering asset as valuable as the system's code. It documents expected behavior, serves as an executable specification, and protects against regressions.

With every relevant change — new model, revised prompt, different RAG chunking, new tool added to the agent — you run the evals and compare against the previous baseline. If the correctness score dropped 3 points and cost went up 20%, you have evidence to decide whether it's worth it. Without evals, you're making that decision in the dark.
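The comparison against the baseline can be automated as a gate. This is a sketch under assumed thresholds (at most a 0.02 absolute drop in correctness and a 10% rise in cost per call); the scorecard keys are hypothetical names, and the right tolerances depend on your use case.

```python
def find_regressions(baseline: dict, candidate: dict,
                     max_correctness_drop: float = 0.02,
                     max_cost_increase: float = 0.10) -> list[str]:
    """Compare a candidate scorecard with the baseline and list threshold violations."""
    problems = []
    drop = baseline["correctness"] - candidate["correctness"]
    if drop > max_correctness_drop:
        problems.append(f"correctness dropped by {drop:.2f}")
    cost_increase = candidate["cost_per_call_usd"] / baseline["cost_per_call_usd"] - 1
    if cost_increase > max_cost_increase:
        problems.append(f"cost per call rose by {cost_increase:.0%}")
    return problems


baseline = {"correctness": 0.90, "cost_per_call_usd": 0.010}
candidate = {"correctness": 0.87, "cost_per_call_usd": 0.012}
problems = find_regressions(baseline, candidate)
print("block" if problems else "promote", problems)
```

With these numbers the gate reports both a 3-point correctness drop and a 20% cost increase, which matches the scenario above: the evidence is on the table, and the decision to promote or block is explicit.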

The full cycle closes when you connect evals with production observability. Production logs reveal cases you didn't anticipate in the dataset — users ask questions you didn't imagine, in languages you didn't test, with contexts that break your prompt's assumptions. Those cases become new items in the eval dataset. The system gets more robust with each iteration.
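One way to picture that loop: reviewed production failures are appended to the dataset, skipping questions it already covers. The record fields (`question`, `expected`, `corrected_answer`, `source`) are hypothetical names for this sketch, and it assumes a human has already supplied the corrected answer.

```python
def add_production_failures(dataset: list[dict], failures: list[dict]) -> list[dict]:
    """Return a new dataset that includes reviewed production failures, without duplicates."""
    merged = list(dataset)
    known = {case["question"].strip().lower() for case in merged}
    for failure in failures:
        key = failure["question"].strip().lower()
        if key in known:
            continue  # already covered by an existing case
        merged.append({
            "question": failure["question"],
            "expected": failure["corrected_answer"],  # supplied by a human reviewer
            "source": "production",
        })
        known.add(key)
    return merged


dataset = [{"question": "What is the refund window?", "expected": "30 days", "source": "seed"}]
failures = [
    {"question": "what is the refund window?", "corrected_answer": "30 days"},
    {"question": "Qual é o prazo de reembolso?", "corrected_answer": "30 dias"},
]
updated = add_production_failures(dataset, failures)
print(len(updated), updated[-1]["source"])  # 2 production
```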

This connection between offline evals and online observability is what separates a managed AI system from an AI system you're just hoping works. In Module 5 we'll go deeper on production instrumentation — traces, metrics, and alerts — but the foundation is this: you need to measure before you get there, and keep measuring after.

Frequently asked questions about evals

How many test cases do I need to start?

30 to 50 well-chosen cases are already enough to detect gross regressions and calibrate an LLM-judge. Prioritize coverage of critical scenarios and known edge cases, not volume. You'll grow the dataset over time.

Is LLM-as-judge reliable?

It's useful, not infallible. The judge has position bias (when comparing two answers, it tends to prefer the one shown first), verbosity bias (it tends to prefer longer answers), and can be fooled by fluent outputs. Always calibrate by comparing against human evaluation on a sample. Use evaluation prompts with explicit criteria and ask the judge to justify the score, which helps reduce bias.
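For pairwise comparisons, one common mitigation for position bias (not covered in this lesson, so treat it as an add-on) is to ask the judge twice with the order swapped and only accept a winner when both rounds agree. The `pick` callable is a hypothetical wrapper around your judge that returns "first" or "second".

```python
from typing import Callable


def compare_with_swap(answer_a: str, answer_b: str,
                      pick: Callable[[str, str], str]) -> str:
    """Judge both orderings; return 'a', 'b', or 'tie' when the rounds disagree."""
    a_wins_round_one = pick(answer_a, answer_b) == "first"
    a_wins_round_two = pick(answer_b, answer_a) == "second"
    if a_wins_round_one and a_wins_round_two:
        return "a"
    if not a_wins_round_one and not a_wins_round_two:
        return "b"
    return "tie"  # the verdict flipped with the order, so it cannot be trusted


def position_biased_judge(first: str, second: str) -> str:
    """Stub that always prefers whatever is shown first."""
    return "first"


def content_judge(first: str, second: str) -> str:
    """Stub that prefers the answer mentioning the 30-day policy."""
    return "first" if "30 days" in first else "second"


good, bad = "Refunds are accepted within 30 days.", "Refunds are accepted anytime."
print(compare_with_swap(good, bad, position_biased_judge))  # tie
print(compare_with_swap(good, bad, content_judge))          # a
```

A high share of ties on your sample is itself a signal: the judge is reacting to order more than to content.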

Should I use the same evals for different models?

Yes — that's exactly what they're for. A stable eval dataset lets you compare models under controlled conditions. If you change the dataset at the same time you change the model, you lose the ability to isolate what caused the quality change.

Is hallucination always detectable?

No. Subtle hallucinations, especially in domains where you don't have experts to review, can slip past any automated method. That's why evals don't replace periodic human review, especially in high-risk domains like healthcare, legal, and finance.
