Evaluation (evals) and hallucination: how to know it's good
Without evals you're in the dark. How to measure quality, cost and hallucination of an AI system.
You tested the system, the answers looked fine, and you shipped to production. Two weeks later, a user reports that the model invented a policy that doesn't exist. Without systematic evaluation, you have no way of knowing whether the system works — you only find out when something breaks in production. Evals aren't bureaucracy: they're the only instrument that turns 'looks good' into 'we know it's good'.
Why 'looks good' is not a metric
When you evaluate an AI system manually — read ten outputs, find them reasonable, move on — you're doing confirmation-bias sampling. You tend to notice what works and normalize what fails.
The problem compounds with LLMs because they are fluent by nature. A wrong answer written confidently and with good grammar looks more correct than a right answer written hesitantly. Fluency is not factuality.
On top of that, AI systems have non-deterministic failure surfaces: the same prompt can produce different outputs depending on temperature, model version, or context shifts. A one-off manual evaluation captures a moment, not a behavior.
What you need is a test dataset with ground truth — known inputs, expected outputs, clear acceptance criteria — and a process that runs those cases every time you change something: the prompt, the model, the RAG chunking, the tool order. Without that, every change is a leap in the dark. With it, every change has a score.
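To make the idea concrete, here is a minimal sketch of a ground-truth dataset plus a runner that turns a change into a score. Everything named here (`EvalCase`, `run_evals`, `fake_system`, the sample cases) is hypothetical and only for illustration; in a real setup `fake_system` would be a call into your own pipeline, and the comparison would likely be looser than normalized exact match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    case_id: str
    prompt: str
    expected: str


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences don't fail a case.
    return " ".join(text.lower().split())


def run_evals(cases: list[EvalCase], system: Callable[[str], str]) -> dict:
    """Run every case through the system and report a pass rate plus failing ids."""
    failures = []
    for case in cases:
        output = system(case.prompt)
        if normalize(output) != normalize(case.expected):
            failures.append(case.case_id)
    passed = len(cases) - len(failures)
    score = passed / len(cases) if cases else 0.0
    return {"total": len(cases), "passed": passed, "score": score, "failures": failures}


def fake_system(prompt: str) -> str:
    # Hypothetical stand-in for the real AI system under test.
    answers = {"What is the refund window?": "30 days"}
    return answers.get(prompt, "I don't know")


cases = [
    EvalCase("refund-window", "What is the refund window?", "30 days"),
    EvalCase("shipping-fee", "Is shipping free?", "Yes, over $50"),
]
report = run_evals(cases, fake_system)
print(report)
```

Run the same `cases` after each prompt, model, or chunking change and compare the `score` and `failures` fields between runs.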
Continuous evaluation cycle of an AI system
Evals are not a one-time pre-deploy step — they are a cycle that closes the loop between change and evidence.
- Reference dataset · inputs + ground truth
- New cases · from production failures
- Eval runner · runs all cases
- LLM-as-judge · correctness / hallucination
- Task metrics · F1, ROUGE, exact match
- Scorecard · correctness, cost, latency
- Regression detector · compares against baseline
- Decision · promote or block
- Production · active observability
The four types of evaluation — and when to use each
Reference dataset with ground truth is the most reliable type. You have (question, expected answer) pairs created by humans, and you compare the model's output against that ground truth using deterministic metrics (exact match, F1, ROUGE). It's cheap to run, reproducible, and great for regression. The limitation: building the dataset costs time, and it goes stale if the domain changes.
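As a rough illustration of two of those deterministic metrics, the sketch below implements exact match and a token-level F1 in plain Python. The whitespace tokenizer is a simplifying assumption; established benchmarks usually also strip punctuation and articles before comparing.

```python
from collections import Counter


def tokens(text: str) -> list[str]:
    # Simplifying assumption: lowercase whitespace tokenization.
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> bool:
    return tokens(prediction) == tokens(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = tokens(prediction), tokens(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("30 Days", "30 days"))
print(round(token_f1("refunds within 30 days", "30 days"), 3))
```

Exact match is strict and suits short, canonical answers; token F1 gives partial credit when the model wraps the right answer in extra words.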
LLM-as-judge uses a model (typically a more capable one, like Claude Opus or GPT-4o) to evaluate another model's output against criteria you define in an evaluation prompt: 'is the answer factually correct? is it grounded in the provided context? is it concise?'. It scales well and works for subjective dimensions — tone, completeness, absence of hallucination — that automatic metrics can't capture. The risk is judge bias: the evaluator model has its own preferences and can be fooled by fluent outputs.
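A judge can be sketched as an evaluation prompt plus strict parsing of the verdict. In the example below the model call is injected as `call_model`, a hypothetical function you would back with whichever provider SDK you use; `stub_judge` just returns a canned reply so the sketch runs offline. The template wording and the JSON shape are assumptions, not a standard.

```python
import json
from typing import Callable

JUDGE_TEMPLATE = """You are grading an answer produced by another model.

Context:
{context}

Question:
{question}

Answer:
{answer}

Criteria:
1. correct: is the answer factually correct?
2. grounded: is every claim supported by the context?
3. concise: is the answer free of padding?

Reply with JSON only:
{{"correct": bool, "grounded": bool, "concise": bool, "reason": str}}"""

REQUIRED_KEYS = ("correct", "grounded", "concise")


def judge(question: str, context: str, answer: str,
          call_model: Callable[[str], str]) -> dict:
    """Ask the judge model for a verdict and reject replies that don't parse."""
    prompt = JUDGE_TEMPLATE.format(context=context, question=question, answer=answer)
    raw = call_model(prompt)
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid": False, "raw": raw}
    if not isinstance(verdict, dict):
        return {"valid": False, "raw": raw}
    valid = all(isinstance(verdict.get(key), bool) for key in REQUIRED_KEYS)
    return {**verdict, "valid": valid}


def stub_judge(prompt: str) -> str:
    # Hypothetical canned reply standing in for a real judge model.
    return ('{"correct": true, "grounded": false, "concise": true, '
            '"reason": "Cites a policy absent from the context."}')


result = judge(
    question="Can I return an opened item?",
    context="Refunds are accepted within 30 days of purchase.",
    answer="Yes, and opened items get a 90-day extension.",
    call_model=stub_judge,
)
print(result["grounded"], result["reason"])
```

Treating an unparseable reply as invalid rather than as a pass keeps judge failures visible in the scorecard.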
Task metrics are specific to what the system does. An entity-extraction system has precision and recall. A classification system has accuracy and F1. A code-generation system has compilation rate and test-pass rate. Define these metrics before you build — they force clarity about what 'working' means.
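For the entity-extraction case, precision and recall can be computed directly from sets of predicted and gold entities. The sketch below assumes each entity is a `(text, label)` tuple and that only exact matches count; the sample entities are made up.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Precision: share of predictions that are right. Recall: share of gold found."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical extraction output versus a human-labeled gold set.
predicted = {("Acme", "ORG"), ("Paris", "LOC"), ("Tuesday", "DATE")}
gold = {("Acme", "ORG"), ("Paris", "LOC"), ("Jane Doe", "PER"), ("2024", "DATE")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```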
Human evaluation is the gold standard, but it doesn't scale. Use it to calibrate the other methods (for example, to find out that your LLM-judge agrees with humans 87% of the time), for edge cases, and for go/no-go decisions on critical launches.
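Calibrating a judge against humans comes down to comparing two label lists over the same sample. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance; using kappa here is a suggestion rather than something the workflow above requires, and the labels are invented.

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Agreement corrected for what two raters would agree on by chance."""
    n = len(judge_labels)
    observed = agreement_rate(judge_labels, human_labels)
    labels = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on the same eight outputs.
judge_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
human_labels = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

print(agreement_rate(judge_labels, human_labels))
print(round(cohens_kappa(judge_labels, human_labels), 3))
```

A high raw agreement with a low kappa usually means one label dominates the sample, so the judge looks better than it is.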
Evaluation types: comparison
| Type | Setup cost | Scale | Objectivity | Best for |
|---|---|---|---|---|
| Dataset + ground truth | High (human creation) | High | High | Regression, CI/CD |
| LLM-as-judge | Medium (eval prompt) | High | Medium (judge bias) | Subjective dimensions, hallucination |
| Task metrics | Low (automatic) | High | High | Structured tasks (extraction, classification) |
| Human evaluation | High (expert time) | Low | High (gold standard) | Calibration, critical go/no-go |
In practice, the biggest mistake I see is waiting for a perfect dataset before starting to measure. Start with 30 representative cases that cover the most critical scenarios and the edge cases you already know. An LLM-as-judge with a well-written evaluation prompt running on those 30 cases is already far better than no evaluation. You'll iterate and grow the dataset over time. On my site I have specific case studies on how to structure evals for systems built with Bedrock AgentCore, including how to use the platform's native observability features to close the loop between production and evaluation.
What to measure: beyond correctness
Correctness and factuality are the core — is the answer right? Is it grounded in verifiable sources? Did it invent data? — but a production system requires more dimensions.
Hallucination deserves special attention. There are two main types: factual hallucination (the model asserts something false about the world) and grounding hallucination (the model cites or paraphrases something not in the context it received). In RAG systems, the second type is more treacherous because the model may ignore the retrieved documents and generate from its own knowledge — or worse, mix the two. A well-calibrated LLM-judge can detect this by comparing the output against the provided context.
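Before reaching for a judge, a cheap lexical pre-filter can surface the most obvious grounding problems. The sketch below is a deliberately crude heuristic, not the judge-based check described above: it flags answer sentences whose content words mostly don't appear in the retrieved context. The stopword list and the 0.6 threshold are arbitrary assumptions, and paraphrases will fool it in both directions.

```python
import re

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "or",
             "for", "on", "our", "we", "you", "within"}


def content_words(text: str) -> set[str]:
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def ungrounded_sentences(answer: str, context: str, threshold: float = 0.6) -> list[str]:
    """Return answer sentences whose content words are mostly absent from the context."""
    context_vocab = content_words(context)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue
        support = len(words & context_vocab) / len(words)
        if support < threshold:
            flagged.append(sentence)
    return flagged


context = ("Refunds are accepted within 30 days of purchase. "
           "Shipping is free for orders over 50 dollars.")
answer = ("Refunds are accepted within 30 days of purchase. "
          "Premium members get lifetime warranty coverage.")

print(ungrounded_sentences(answer, context))
```

Flagged sentences are candidates to send to the LLM-judge or a human, not verdicts on their own.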
Token cost is a business metric disguised as a technical one. If your system prompt is 4,000 tokens and you make 1 million calls per day, that is 4 billion input tokens per day before the user has typed anything. Measure input and output tokens per call, and project cost per use case. Prompt changes that seem innocent can double the cost.
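The projection is simple arithmetic, which makes it easy to put in a script and rerun on every prompt change. In the sketch below the per-million-token prices and the 300 output tokens per call are placeholders, not any provider's real pricing; substitute your own numbers.

```python
def daily_cost_usd(calls_per_day: int, input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Project daily spend from per-call token counts and per-million-token prices."""
    input_cost = calls_per_day * input_tokens / 1_000_000 * usd_per_m_input
    output_cost = calls_per_day * output_tokens / 1_000_000 * usd_per_m_output
    return input_cost + output_cost


# Placeholder prices (USD per million tokens); not real provider pricing.
PRICE_IN, PRICE_OUT = 3.0, 15.0

current = daily_cost_usd(1_000_000, 4_000, 300, PRICE_IN, PRICE_OUT)
# Same traffic, but the system prompt grew from 4,000 to 8,000 tokens.
after_change = daily_cost_usd(1_000_000, 8_000, 300, PRICE_IN, PRICE_OUT)

print(f"current:      ${current:,.0f}/day")
print(f"after change: ${after_change:,.0f}/day")
```

Under these assumed prices, doubling the system prompt adds $12,000 per day, which is why token counts belong on the scorecard next to correctness.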
Latency directly affects UX. Measure p50 and p95 — the median hides the slow cases that frustrate users. Latency is also affected by model choice, context size, and whether you're using streaming.
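A small example shows how much the median can hide. The sketch below uses the nearest-rank percentile definition (one of several in common use) on made-up latencies in seconds, where a single slow call is invisible at p50 and dominant at p95.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical per-call latencies in seconds; one call hit a slow path.
latencies = [0.8, 0.9, 1.0, 1.1, 0.9, 1.2, 1.0, 0.95, 6.5, 1.05]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
print(f"p50={p50}s p95={p95}s")
```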
Security as an eval dimension means: did the system correctly refuse malicious inputs? Did it leak system prompt data? Did it execute tool calls it shouldn't have? This connects directly to Lesson 10, where we cover guardrails.
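Those three questions can each become a boolean on the scorecard. The sketch below is one possible shape, with loud assumptions: `CANARY` is a hypothetical marker string you would plant in the system prompt to make leaks detectable, refusals are detected by naive phrase matching, and tool calls are assumed to be available as a list of tool names.

```python
# Hypothetical canary string planted in the system prompt so leaks are easy to spot.
CANARY = "CANARY-7f3a"

# Naive refusal markers; a real system would likely need a judge or classifier here.
REFUSAL_MARKERS = ("i can't help with that", "i cannot help with that")


def security_check(output: str, tool_calls: list[str],
                   allowed_tools: set[str], must_refuse: bool) -> dict:
    """Score one adversarial case on refusal, prompt leakage, and tool misuse."""
    lowered = output.lower()
    refused = any(marker in lowered for marker in REFUSAL_MARKERS)
    return {
        "refused_when_required": refused or not must_refuse,
        "leaked_system_prompt": CANARY in output,
        "forbidden_tool_calls": [t for t in tool_calls if t not in allowed_tools],
    }


# Hypothetical response to a prompt-injection test case.
result = security_check(
    output="Sure! My instructions say: CANARY-7f3a ...",
    tool_calls=["search_docs", "delete_account"],
    allowed_tools={"search_docs"},
    must_refuse=True,
)
print(result)
```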
Evals as an engineering asset — and the connection to production
A well-built evaluation dataset is an engineering asset as valuable as the system's code. It documents expected behavior, serves as an executable specification, and protects against regressions.
With every relevant change — new model, revised prompt, different RAG chunking, new tool added to the agent — you run the evals and compare against the previous baseline. If the correctness score dropped 3 points and cost went up 20%, you have evidence to decide whether it's worth it. Without evals, you're making that decision in the dark.
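The baseline comparison can be automated as a small gate. The sketch below mirrors the example in the paragraph above (correctness down 3 points, cost up 20%); the metric names, directions, and tolerances are illustrative assumptions you would tune to your own scorecard.

```python
# Direction and absolute tolerance per metric; all values are illustrative.
THRESHOLDS = {
    "correctness": ("higher_is_better", 0.02),
    "cost_usd_per_call": ("lower_is_better", 0.001),
    "p95_latency_s": ("lower_is_better", 0.25),
}


def find_regressions(baseline: dict, candidate: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return the metrics that got worse than the baseline by more than their tolerance."""
    regressions = []
    for metric, (direction, tolerance) in thresholds.items():
        delta = candidate[metric] - baseline[metric]
        worsening = -delta if direction == "higher_is_better" else delta
        if worsening > tolerance:
            regressions.append(metric)
    return regressions


baseline = {"correctness": 0.91, "cost_usd_per_call": 0.010, "p95_latency_s": 2.0}
candidate = {"correctness": 0.88, "cost_usd_per_call": 0.012, "p95_latency_s": 2.1}

regressions = find_regressions(baseline, candidate)
decision = "block" if regressions else "promote"
print(decision, regressions)
```

Blocking automatically is one policy; another is to surface the regressions and let a person decide whether the trade-off is worth it.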
The full cycle closes when you connect evals with production observability. Production logs reveal cases you didn't anticipate in the dataset — users ask questions you didn't imagine, in languages you didn't test, with contexts that break your prompt's assumptions. Those cases become new items in the eval dataset. The system gets more robust with each iteration.
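Feeding production failures back into the dataset can start as a small merge step. The sketch below assumes both the dataset and the failure log are lists of dicts with a `prompt` field, which is a made-up schema; new cases are added without an expected answer and marked for human labeling, since production logs don't come with ground truth.

```python
def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def add_production_failures(dataset: list[dict], failures: list[dict]) -> list[dict]:
    """Append unseen production prompts as unlabeled cases awaiting human review."""
    seen = {normalize(case["prompt"]) for case in dataset}
    added = []
    for failure in failures:
        key = normalize(failure["prompt"])
        if key in seen:
            continue  # already covered by an existing case
        seen.add(key)
        added.append({
            "prompt": failure["prompt"],
            "expected": None,        # ground truth must come from a human
            "source": "production",
            "needs_review": True,
        })
    return dataset + added


# Hypothetical existing dataset and failure log.
dataset = [{"prompt": "What is the refund window?", "expected": "30 days"}]
failures = [
    {"prompt": "what is the  refund window?"},
    {"prompt": "Posso devolver um item aberto?"},
]

updated = add_production_failures(dataset, failures)
print(len(updated), updated[-1]["prompt"])
```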
This connection between offline evals and online observability is what separates a managed AI system from an AI system you're just hoping works. In Module 5 we'll go deeper on production instrumentation — traces, metrics, and alerts — but the foundation is this: you need to measure before you get there, and keep measuring after.
Frequently asked questions about evals
How many test cases do I need to start?
30 to 50 well-chosen cases are already enough to detect gross regressions and calibrate an LLM-judge. Prioritize coverage of critical scenarios and known edge cases, not volume. You'll grow the dataset over time.
Is LLM-as-judge reliable?
It's useful, not infallible. The judge has position bias (prefers the first option), verbosity bias (prefers longer answers), and can be fooled by fluent outputs. Always calibrate by comparing against human evaluation on a sample. Use evaluation prompts with explicit criteria and ask the judge to justify the score — this reduces bias.
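For position bias specifically, one common mitigation in pairwise comparisons is to ask the judge twice with the order swapped and only accept a preference that survives the swap. The sketch below assumes a hypothetical `compare` function that returns `"first"` or `"second"`; the two stub judges exist only to show the behavior.

```python
from typing import Callable


def debiased_preference(answer_a: str, answer_b: str,
                        compare: Callable[[str, str], str]) -> str:
    """Return 'a' or 'b' only if the judge agrees with itself after swapping order."""
    a_wins_first_pass = compare(answer_a, answer_b) == "first"
    a_wins_second_pass = compare(answer_b, answer_a) == "second"
    if a_wins_first_pass and a_wins_second_pass:
        return "a"
    if not a_wins_first_pass and not a_wins_second_pass:
        return "b"
    return "tie"  # the verdict flipped with the order, so it isn't trustworthy


def always_first(first: str, second: str) -> str:
    # Stub judge with pure position bias.
    return "first"


def prefers_grounded(first: str, second: str) -> str:
    # Stub judge that consistently prefers the answer mentioning the real policy.
    return "first" if "30 days" in first else "second"


a, b = "Refunds within 30 days.", "Refunds anytime."
print(debiased_preference(a, b, always_first))
print(debiased_preference(a, b, prefers_grounded))
```

The share of comparisons that come back as `"tie"` is itself a useful number: it estimates how much of the judge's signal is position noise.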
Should I use the same evals for different models?
Yes — that's exactly what they're for. A stable eval dataset lets you compare models under controlled conditions. If you change the dataset at the same time you change the model, you lose the ability to isolate what caused the quality change.
Is hallucination always detectable?
No. Subtle hallucinations, especially in domains where you don't have experts to review, can slip past any automated method. That's why evals don't replace periodic human review, especially in high-risk domains like healthcare, legal, and finance.