Design Doc: Continuous Evaluation Suite for Agents with Bedrock AgentCore
LLM agents in production silently degrade as models, tools, and prompts evolve — without a continuous evaluation discipline, regressions reach users before they are detected. This document proposes a complete offline and online evaluation architecture using Amazon Bedrock AgentCore, with versioned datasets, CI/CD quality gates, runtime signals, and systematic adversarial testing.
An agent that works well at launch can silently fail six weeks later — when the base model is updated, a tool changes its schema, or a new prompt flow is introduced. Without a continuous evaluation suite with real quality gates, you are flying blind. This RFC describes how to build that discipline on top of Bedrock AgentCore.
The Problem: Agents Are Composite Systems That Degrade in Non-Obvious Ways
LLM agents are not deterministic functions. They combine a language model (which can be updated by the provider without explicit notice), a set of tools with their own APIs and schemas, short- and long-term memory, and an orchestration layer that interprets intent and chains calls. Each of these components can change independently — and the interaction between changes is where the most dangerous regressions hide.
Consider a concrete scenario: a fintech's customer support agent uses a get_account_balance tool that returns values in cents. The backend team migrates the API to return values in the base currency unit without updating the tool documentation. The agent keeps working: it calls the tool, receives a number, and responds to the user. But it still interprets that number as cents, so it now reports balances 100x smaller than the real value. No runtime error. No latency alarm. The degradation is semantic, not technical.
This is the central problem a continuous evaluation suite needs to solve: detecting semantic and behavioral degradation before it reaches the user, in a system where parts change asynchronously and errors are rarely exceptions — they are plausible but incorrect responses.
The state of the art in agent evaluation is still consolidating. Most teams I encounter operate with a combination of: manual ad hoc testing before releases, some user satisfaction metrics collected reactively, and latency/error alarms that only capture technical failures. This is insufficient for systems that make decisions with real consequences. What we need is a discipline analogous to what we have in traditional software systems — unit tests, integration, regression, coverage — but adapted to the probabilistic and compositional nature of agents.
Goals and Non-Goals
Scenario Context
- System: Quality platform for LLM agents in production (composite scenario)
- Base infrastructure: Amazon Bedrock AgentCore + AWS
- Target scale: 10-50 distinct agents, each with 5-20 tools, multiple versions in parallel
- Offline evaluation frequency: On every PR / commit to agent config, prompt, or tool schema
- Online evaluation frequency: Continuous via sampling (estimate: 5-15% of conversations)
- Main stack: Bedrock AgentCore, S3, DynamoDB, Step Functions, Lambda, EventBridge, CloudWatch, Bedrock Evaluations
- Document type: RFC / Design Doc (proposal for implementation)
- Main risk: Silent semantic degradation in production without automated detection
Proposed Design: Three Evaluation Layers
The architecture I propose organizes evaluation into three complementary layers, each with distinct purpose, frequency, and cost. The useful metaphor here is defense in depth — no single layer is sufficient, but together they form a robust detection network.
Layer 1 — Offline Evaluation (Pre-Deploy Gate)
Before any agent version is promoted to production, it runs through an offline test battery against a versioned dataset stored in S3. The dataset is structured as a set of test cases, each containing: the user input (or turn sequence for multi-turn conversations), the expected set of tool calls (with expected arguments and tolerances), the expected response (or evaluation criteria when there is no deterministic response), and category metadata (task type, difficulty level, whether it is adversarial).
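The test case structure described above could be sketched as a pair of dataclasses. This is only an illustration: the class names (EvalCase, ExpectedToolCall), the field names, and the validate helper are assumptions, not a schema defined by Bedrock AgentCore or by this RFC.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class ExpectedToolCall:
    tool: str
    args: dict[str, Any]
    # Hypothetical: absolute tolerance per numeric argument name.
    tolerances: dict[str, float] = field(default_factory=dict)


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    turns: list[str]                          # one entry for single-turn cases
    expected_calls: list[ExpectedToolCall]
    expected_response: Optional[str] = None   # None when only criteria apply
    criteria: list[str] = field(default_factory=list)
    task_type: str = "general"
    difficulty: str = "medium"
    adversarial: bool = False
    critical: bool = False                    # feeds the stricter gate threshold


def validate(case: EvalCase) -> list[str]:
    """Return a list of schema problems; an empty list means the case is usable."""
    problems = []
    if not case.turns:
        problems.append("case has no user turns")
    if case.expected_response is None and not case.criteria:
        problems.append("needs an expected response or evaluation criteria")
    return problems
```

Running validate when a case is added to the dataset keeps malformed cases out of an immutable dataset version.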
Bedrock AgentCore runs the agent in evaluation mode — with the same tools configured, but in an isolated environment with mocked tools that return controlled responses. This is critical: you do not want offline tests making real calls to production APIs, but you also do not want mocks so simplified they do not test the agent's reasoning about tool responses.
Metrics collected in this layer include: tool-call accuracy (did the agent call the right tool?), argument fidelity (were arguments correct within tolerance?), call sequence correctness (for tasks requiring a specific call sequence), unnecessary tool calls (extra calls beyond what the task needs, which suggest confused reasoning), and final response quality evaluated by an LLM-as-judge configured via Bedrock Evaluations.
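The deterministic tool-use metrics can be computed without any LLM call. The sketch below is one possible scoring function, assuming each call is represented as a (tool name, arguments) tuple. The function names and the exact metric definitions (for example, treating "sequence correct" as "expected tools appear in order as a subsequence") are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Any

Call = tuple[str, dict[str, Any]]  # (tool name, arguments)


def args_match(expected: dict, actual: dict, tolerances: dict[str, float]) -> bool:
    """Exact match, except numeric arguments that have a declared tolerance."""
    if expected.keys() != actual.keys():
        return False
    for key, want in expected.items():
        got = actual[key]
        tol = tolerances.get(key)
        numeric = isinstance(want, (int, float)) and isinstance(got, (int, float))
        if tol is not None and numeric:
            if abs(want - got) > tol:
                return False
        elif want != got:
            return False
    return True


def is_subsequence(needle: list[str], haystack: list[str]) -> bool:
    it = iter(haystack)
    return all(item in it for item in needle)  # `in` consumes the iterator


def score_case(expected: list[Call], actual: list[Call], tolerances=None) -> dict:
    tolerances = tolerances or {}
    exp_tools = [tool for tool, _ in expected]
    act_tools = [tool for tool, _ in actual]
    matched = sum((Counter(exp_tools) & Counter(act_tools)).values())
    extra = sum((Counter(act_tools) - Counter(exp_tools)).values())
    # Pair each expected call with the first unused actual call to the same tool.
    unused = list(actual)
    args_ok = 0
    for tool, want in expected:
        for i, (act_tool, got) in enumerate(unused):
            if act_tool == tool:
                args_ok += args_match(want, got, tolerances)
                del unused[i]
                break
    n = len(expected)
    return {
        "tool_call_accuracy": matched / n if n else 1.0,
        "argument_fidelity": args_ok / n if n else 1.0,
        "sequence_correct": is_subsequence(exp_tools, act_tools),
        "unnecessary_calls": extra,
    }
```

Aggregating these per-case dictionaries over the dataset gives the per-metric numbers the quality gate compares against thresholds.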
The quality gate is defined as a set of thresholds per metric category. Initial proposal: tool-call accuracy ≥ 0.90 on the full set, ≥ 0.95 on the subset of critical cases (flagged in the dataset), and no regression > 5 percentage points relative to the previous version's baseline. That last criterion — relative regression — is as important as the absolute threshold: an agent that was already poor at something should not get even worse.
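As a rough sketch, the gate logic for tool-call accuracy reduces to three checks. The function name, parameters, and GateResult type are hypothetical; the default thresholds are the initial values proposed above.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def quality_gate(
    overall: float,
    critical: float,
    baseline_overall: float,
    *,
    min_overall: float = 0.90,
    min_critical: float = 0.95,
    max_regression_pp: float = 5.0,
) -> GateResult:
    """Check absolute thresholds and relative regression for tool-call accuracy."""
    reasons = []
    if overall < min_overall:
        reasons.append(f"overall accuracy {overall:.3f} < {min_overall:.2f}")
    if critical < min_critical:
        reasons.append(f"critical accuracy {critical:.3f} < {min_critical:.2f}")
    # Regression in percentage points; rounding avoids float noise at the boundary.
    regression_pp = round((baseline_overall - overall) * 100, 6)
    if regression_pp > max_regression_pp:
        reasons.append(f"regression of {regression_pp:.1f} pp vs. baseline")
    return GateResult(passed=not reasons, reasons=reasons)
```

Note that a version scoring 0.92 overall still fails if the previous baseline was 0.99, which is exactly the case the relative criterion exists to catch.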
Layer 2 — Adversarial Testing
Separate from the standard evaluation dataset, we maintain an adversarial dataset with specific categories: prompt injection attempts (attempts to make the agent ignore its system instructions), boundary probing (inputs that test policy limits — what does the agent do when it receives an ambiguous request that could be legitimate or malicious?), and tool-use edge cases (what happens when a tool returns an unexpected error? Does the agent recover gracefully or loop?).
These tests do not follow the same pass/fail regime as functional tests — they have expected behavior criteria. For prompt injection, the criterion is that the agent maintains its system instructions. For boundary cases, the criterion is that the agent escalates or refuses appropriately rather than hallucinating a response. For tool errors, the criterion is recovery in at most N attempts with graceful fallback.
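The tool-error criterion is the easiest of the three to check deterministically. The sketch below assumes a simplified trace format (a list of event dicts with hypothetical "type", "tool", "ok", and "kind" keys) and interprets "at most N attempts" as "no tool fails more than N times"; both are assumptions, not the AgentCore trace schema.

```python
def check_tool_error_recovery(events: list[dict], max_attempts: int = 3) -> bool:
    """True if the agent handled tool errors acceptably in this trace.

    Acceptable means: no tool failed more than `max_attempts` times, and if a
    tool was still failing at the end, the final turn is a fallback or an
    escalation rather than a normal answer.
    """
    failures: dict[str, int] = {}
    unresolved: set[str] = set()
    for event in events:
        if event["type"] != "tool_call":
            continue
        tool = event["tool"]
        if event["ok"]:
            unresolved.discard(tool)       # a later success counts as recovery
        else:
            failures[tool] = failures.get(tool, 0) + 1
            unresolved.add(tool)
            if failures[tool] > max_attempts:
                return False               # looping on a broken tool
    final = events[-1] if events else {}
    if final.get("type") != "final":
        return False                       # conversation never concluded
    if unresolved:
        return final.get("kind") in {"fallback", "escalation"}
    return True
```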
Layer 3 — Online Signals (Production Monitoring)
In production, we instrument an asynchronous evaluation pipeline that processes a sample of real conversations. Sampling is stratified — we ensure representation of different task types, not just volume. Selected conversations are sent to a Step Functions workflow that: extracts tool-use traces from AgentCore, runs LLM-as-judge evaluation on final turns, and persists metrics to CloudWatch with dimensions by agent version, task type, and tool.
Beyond sampling, we monitor proxy signals that do not require LLM evaluation: rate of conversations hitting the maximum turn limit (indicates agent stuck in a loop), rate of unrecovered tool errors, escalation-to-human rate (if applicable), and latency distribution per tool (latency changes can indicate call behavior changes).
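Most of these proxy signals are simple rates over conversation records. A minimal sketch, assuming each record is a dict with hypothetical "turns", "unrecovered_tool_error", and "escalated" fields (per-tool latency is omitted here because it would normally come from existing metrics):

```python
def proxy_signals(conversations: list[dict], max_turns: int) -> dict[str, float]:
    """Compute cheap proxy rates that need no LLM evaluation."""
    n = len(conversations)
    if n == 0:
        return {
            "max_turn_rate": 0.0,
            "unrecovered_tool_error_rate": 0.0,
            "escalation_rate": 0.0,
        }

    def rate(predicate) -> float:
        return sum(1 for c in conversations if predicate(c)) / n

    return {
        # Hitting the turn limit suggests the agent was stuck in a loop.
        "max_turn_rate": rate(lambda c: c["turns"] >= max_turns),
        "unrecovered_tool_error_rate": rate(lambda c: c["unrecovered_tool_error"]),
        "escalation_rate": rate(lambda c: c["escalated"]),
    }
```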
Continuous Evaluation Suite Architecture
Complete flow from committing a new agent version through quality signals in production, showing the three evaluation layers and the dataset curation feedback loop.
- Git Repo · Agent Config / Prompts
- CI/CD · (CodePipeline)
- Quality Gate · Lambda
- S3 · Versioned Datasets
- DynamoDB · Dataset Registry
- Curator UI · (Internal Tool)
- Step Functions · Eval Orchestrator
- Bedrock AgentCore · (Eval Mode)
- Mock Tool Layer · Lambda
- Bedrock Evaluations · LLM-as-Judge
- S3 · Eval Results
- S3 · Adversarial Dataset
- Adversarial Runner · Lambda
- Bedrock AgentCore · (Production)
- Real Tools · (APIs)
- AgentCore Traces · S3 / CloudWatch
- Sampling Lambda · Stratified 5-15%
- Step Functions · Online Eval
- Bedrock Evaluations · Online Judge
- CloudWatch · Metrics + Alarms
- EventBridge · Anomaly Events
Dataset Management: Versioning, Curation, and the Dataset Drift Problem
One of the most neglected aspects of agent evaluation systems is management of the evaluation dataset itself. Teams build an initial set of test cases, run them for months, and never revisit — while the agent evolves to handle entirely new use cases that are not represented. The dataset ages and loses discriminative power.
The proposal here is to treat the evaluation dataset with the same discipline we apply to code: explicit versioning, changelog, and a continuous curation process.
Versioning structure: Each dataset version is an immutable artifact in S3, identified by a content hash and a semantic version number (major.minor.patch). DynamoDB maintains the registry with metadata: creation date, author, changelog, coverage by task category, and the agent version it was validated against. The quality gate always runs against the dataset version specified in the agent manifest — not against "latest". This is deliberate: it guarantees reproducibility and prevents a new dataset from breaking an agent that was working well.
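One way to derive the content hash and enforce the "never latest" rule is sketched below. The canonicalization scheme, the S3 key layout, and the manifest field name eval_dataset_version are all assumptions made for illustration.

```python
import hashlib
import json
import re


def dataset_content_hash(cases: list[dict]) -> str:
    """SHA-256 over a canonical JSON form, independent of case and key order."""
    canonical = json.dumps(
        sorted(cases, key=lambda c: c["case_id"]),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dataset_key(name: str, semver: str, digest: str) -> str:
    """Hypothetical S3 key layout for an immutable dataset artifact."""
    return f"datasets/{name}/{semver}/{digest[:12]}.jsonl"


def resolve_pinned_version(manifest: dict) -> str:
    """Return the dataset version pinned in the agent manifest, rejecting 'latest'."""
    version = manifest.get("eval_dataset_version", "")
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError(f"manifest must pin an exact dataset version, got {version!r}")
    return version
```

Because the hash ignores ordering, re-exporting the same cases produces the same artifact identity, and any edit to a case produces a new one.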
Curation process: New test cases enter the dataset through three paths. The first is manual curation — engineers or domain experts write cases based on requirements. The second is production mining — real conversations that were flagged as problematic (via user feedback, escalation, or automatic detection) are reviewed and converted into test cases. This is the most valuable path: these are real failures, not hypothetical ones. The third is synthetic generation — using an LLM to generate variations of existing cases, especially to increase coverage of edge cases and adversarial scenarios.
The dataset drift problem: As the agent improves, cases that were once difficult become trivial. A dataset that is not updated loses discriminative power — the agent passes everything, but that does not mean it is working well on new cases. The solution is to monitor the distribution of dataset scores over time: if most cases score near 1.0, it is time to add harder cases. This is analogous to the benchmark saturation problem in ML — when a model reaches human performance on a benchmark, the benchmark needs to be replaced, not the model declared perfect.
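A saturation check along these lines can run after every offline evaluation. The cutoffs used here (a score of 0.95 counts as "near 1.0", and more than 80% of such cases counts as "most") are illustrative assumptions to be tuned per dataset.

```python
def saturation_ratio(scores: list[float], near_perfect: float = 0.95) -> float:
    """Fraction of cases whose score is at or above the near-perfect cutoff."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= near_perfect) / len(scores)


def needs_harder_cases(
    scores: list[float], near_perfect: float = 0.95, max_ratio: float = 0.8
) -> bool:
    """Flag the dataset as saturated when most cases score near 1.0."""
    return saturation_ratio(scores, near_perfect) > max_ratio
```

Tracking saturation_ratio per dataset version over time shows when discriminative power is eroding, before the gate becomes a formality.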
Tool-use coverage: For agents with multiple tools, the dataset must ensure adequate coverage of each tool and tool combinations. We maintain a coverage map showing, for each tool, how many test cases exercise it, in what contexts, and with what argument patterns. New tools added to the agent must have minimum coverage before the agent can be promoted — this is an additional quality gate rule.
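A basic version of the coverage map counts, per tool, how many test cases exercise it. The sketch assumes each case is a dict with an "expected_calls" list of {"tool": ...} entries; that shape and the function names are hypothetical. The default minimum of 10 cases per tool matches the target in the success metrics.

```python
from collections import Counter


def coverage_map(cases: list[dict]) -> Counter:
    """Count the number of test cases that exercise each tool at least once."""
    counts: Counter = Counter()
    for case in cases:
        tools_in_case = {call["tool"] for call in case["expected_calls"]}
        counts.update(tools_in_case)
    return counts


def undercovered_tools(
    agent_tools: list[str], cases: list[dict], min_cases: int = 10
) -> dict[str, int]:
    """Tools configured on the agent that fall below the minimum coverage."""
    counts = coverage_map(cases)
    return {tool: counts[tool] for tool in agent_tools if counts[tool] < min_cases}
```

The gate rule then becomes "block promotion if undercovered_tools is non-empty". Covering contexts and argument patterns would need richer case metadata than this sketch uses.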
Design Alternatives: Evaluation Approaches
LLM-as-Judge only (no structured test cases)
- Easy to scale — does not require manual test case curation
- Flexible for evaluating open-ended responses without reference answers
- Does not detect specific tool-use regressions — the judge evaluates the final result, not the process
- High cost to evaluate each case; inconsistency between runs of the same judge
- No deterministic baseline — hard to know if a score change is real or judge noise
Useful as a complementary layer, insufficient as the sole approach
Deterministic tests only (mock tools + exact expected outputs)
- Reproducible and cheap — no LLM calls for evaluation
- Detects tool-use regressions with high precision
- Brittle for natural language responses — any legitimate variation breaks the test
- Does not capture semantic quality of the final response
- High maintenance — each prompt change can invalidate hundreds of expected outputs
Excellent for tool-use assertions; inadequate for response evaluation
Hybrid approach (proposed in this RFC)
- Deterministic where possible (tool-use), probabilistic where necessary (final response)
- Offline + online layers cover different failure points
- Versioned dataset allows tracking regressions over time with reproducibility
- Higher operational complexity — multiple systems to maintain
- LLM-as-judge cost for online evaluation can be significant at scale
Recommended approach — complexity justified by discriminative power
Third-party evaluation platform (Braintrust, LangSmith, etc.)
- Faster time-to-value — ready UI, existing integrations
- Community and metric templates
- Evaluation data (including production conversations) leaves the AWS environment — problem for sensitive data
- Integration with Bedrock AgentCore traces requires custom work regardless
- Additional vendor lock-in; per-volume cost can be prohibitive
Valid for prototyping; not recommended for production in sensitive data contexts
Decision: Sampling Strategy for Online Evaluation
Evaluating 100% of production conversations with LLM-as-judge is prohibitively expensive and slow. We need a sampling strategy that maximizes detection power with controlled cost.
Decision: stratified sampling with a 10% base rate, increasing to 50% when proxy metrics (error rate, anomalous latency) exceed thresholds. Conversations with explicit negative user feedback are always evaluated (100%). Strata guarantee representation by task type and for the most critical tools.
- Controlled online evaluation cost, estimated at 10-15% of the cost of evaluating every conversation
- Anomaly detection may have latency of minutes to hours depending on traffic volume
- Requires implementation of stratified sampling logic — non-trivial to guarantee representativeness
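The sampling decision could be sketched as a batch sampler over a window of conversations. The record fields ("task_type", "negative_feedback"), the function names, and the "at least one conversation per stratum" rule are assumptions; stratifying by critical tool as well would need an additional stratum key.

```python
import math
import random
from collections import defaultdict


def current_rate(proxy_alert: bool, base: float = 0.10, elevated: float = 0.50) -> float:
    """Base sampling rate, raised while proxy metrics exceed their thresholds."""
    return elevated if proxy_alert else base


def stratified_sample(
    conversations: list[dict],
    rate: float,
    rng: random.Random,
    min_per_stratum: int = 1,
) -> list[dict]:
    """Sample per task type; negative-feedback conversations are always kept."""
    selected = []
    strata = defaultdict(list)
    for conv in conversations:
        if conv.get("negative_feedback"):
            selected.append(conv)                    # evaluated at 100%
        else:
            strata[conv["task_type"]].append(conv)
    for group in strata.values():
        # Low-volume task types still get at least `min_per_stratum` samples.
        k = min(len(group), max(min_per_stratum, math.ceil(rate * len(group))))
        selected.extend(rng.sample(group, k))
    return selected
```

Passing a seeded random.Random makes a sampling run reproducible, which helps when investigating why a given conversation was or was not evaluated.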
Rollout Plan
Phase 1 — Foundation (Weeks 1-3)
Create the dataset structure in S3 and DynamoDB. Define the test case schema. Build the first dataset with 50-100 manual cases covering the agent's most critical flows. Configure AgentCore in evaluation mode with mock tools for main flows. No quality gate yet — just collect baseline data.
Phase 2 — Offline Evaluation (Weeks 4-6)
Implement the offline evaluation Step Functions workflow. Integrate Bedrock Evaluations as LLM-as-judge for final responses. Define tool-use metrics and implement deterministic calculation. Run against the last 3 agent versions to calibrate quality gate thresholds. Integrate into CI/CD as a non-blocking check initially.
Phase 3 — Active Quality Gate (Weeks 7-8)
Activate the quality gate as blocking in CI/CD. Define thresholds based on collected baseline data. Create the initial adversarial dataset with 20-30 prompt injection and boundary testing cases. Train the team on how to interpret results and how to add new cases to the dataset.
Phase 4 — Online Monitoring (Weeks 9-12)
Implement the sampling and online evaluation pipeline. Configure CloudWatch dashboards with metrics by version and by tool. Create alarms for anomalies. Implement the production mining flow: problematic conversations detected online are queued for review and potential addition to the offline dataset.
Phase 5 — Maturity and Automation (Weeks 13-16)
Implement synthetic test case generation to increase coverage. Create the tool-use coverage map and add minimum coverage rules to the quality gate. Review and refine thresholds with 2+ months of data. Document the curation process and create runbooks for the team.
Critical Risks and Mitigations
Risk 1: Goodhart's Law in the quality gate. When a threshold becomes a target, it ceases to be a good measure. Teams may start writing test cases the agent already passes well, rather than cases that test real weaknesses. Mitigation: separate who writes test cases from who develops the agent; periodically include test cases from external adversaries.
Risk 2: Judge inconsistency. LLM-as-judge has non-trivial variance between runs. An agent may pass the quality gate in one run and fail in another with no real change. Mitigation: use multiple runs (3-5) and aggregate scores; define thresholds with a safety margin above the observed judge variance.
Risk 3: Mock tool drift. Tool mocks can diverge from real API behavior over time, making offline tests increasingly unrepresentative. Mitigation: version mocks alongside real tool schemas; include contract tests that verify mocks still correspond to real APIs.
Risk 4: Online evaluation cost at scale. At high volumes, even 10% sampling with LLM-as-judge can be expensive. Mitigation: use smaller models for the online judge (Claude Haiku vs. Sonnet), reserving more capable models for offline evaluation where precision is more critical.
Risk 5: False confidence. An evaluation suite that passes can give false confidence. The dataset never covers all possible cases. Mitigation: explicitly communicate what the dataset covers and what it does not; keep humans in the loop for review of high-risk cases.
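For the judge inconsistency risk, one simple way to apply a safety margin is to require the mean judge score minus a multiple of its observed standard deviation to clear the threshold. The function names and the margin factor k are assumptions; other aggregations (median, majority vote) would also fit the mitigation described.

```python
import statistics


def aggregate_judge_runs(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across repeated judge runs."""
    if not scores:
        raise ValueError("at least one judge run is required")
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread


def passes_with_margin(scores: list[float], threshold: float, k: float = 1.0) -> bool:
    """Pass only if the mean clears the threshold by k standard deviations."""
    mean, spread = aggregate_judge_runs(scores)
    return mean - k * spread >= threshold
```

With three runs scoring 0.90, 0.92, and 0.88, the mean is 0.90 and the spread is about 0.02, so the case passes a 0.85 threshold but not a 0.89 one.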
Success Metrics and Targets
- Tool-call accuracy (offline): ≥ 0.90 overall, ≥ 0.95 on critical cases
- Maximum allowed regression (offline): < 5 percentage points vs. previous version baseline
- Tool coverage in dataset: 100% of tools with ≥ 10 test cases each
- Offline evaluation pipeline latency: < 15 minutes for dataset of up to 200 cases (estimate)
- Online evaluation coverage: ≥ 10% of all conversations, 100% of those with negative feedback
- MTTD (Mean Time to Detect) production regression: < 4 hours at normal volume (estimate)
- Quality gate false positive rate: < 5% (gates that block deploys without real regression)
- Dataset update cadence: Mandatory monthly review; continuous addition via production mining
After working with high-criticality systems in finance, I have a strong conviction: the quality of an evaluation system is determined by the quality of its dataset, not the sophistication of its infrastructure. You can have the most elegant pipeline in the world running on Bedrock AgentCore with Step Functions and LLM-as-judge, but if the dataset has 30 cases all of the same type, you are measuring nothing.
What I would not compromise on: the production mining process. Test cases derived from real failures are orders of magnitude more valuable than synthetic cases. In every project I worked on involving system evaluation, the most important bugs were found by cases that came from real incidents, not from preventive brainstorming. I would invest disproportionately in the flow from "problematic conversation detected in production" → "human review" → "test case in offline dataset".
What I would do differently from what is described here: start smaller. This document describes a complete, mature system. In practice, I would start with a simple Python script that runs 50 test cases against the agent before each deploy, without Step Functions, without LLM-as-judge, just deterministic tool-use assertions. That already captures 70% of the value with 10% of the complexity. The additional sophistication is justified as you have evidence that you need it.
On Bedrock AgentCore specifically: the ability to run the agent in evaluation mode with mocked tools is genuinely useful, because it solves a real environment isolation problem that teams usually solve with workarounds. Dataset management via AgentCore is a recent addition that is still maturing; I would maintain a thin abstraction layer over it to avoid being held hostage to API changes.
The most important lesson I have learned about AI system evaluation: treat evaluation as a product, not as infrastructure. It needs clear ownership, someone who wakes up in the morning thinking about how to improve the dataset and thresholds. Without that, the most sophisticated system in the world becomes a CI check that nobody reads.
Verdict
The continuous evaluation suite described in this RFC is necessary for any LLM agent operating in production with real consequences. The three-layer architecture — offline evaluation with quality gate, systematic adversarial testing, and online monitoring with stratified sampling — covers the most important degradation vectors: model changes, tool schema drift, and regressions introduced by prompt updates. Bedrock AgentCore provides the necessary primitives — execution in evaluation mode, tool-use tracing, and integration with Bedrock Evaluations for LLM-as-judge — without requiring you to build this infrastructure from scratch. The real value, however, is not in the infrastructure: it is in the dataset curation process and the discipline of treating evaluation as a product with clear ownership. My recommendation: implement in phases, starting with what generates the most value with the least complexity (versioned dataset + deterministic tool-use quality gate), and add sophistication only when you have evidence that the simple system is letting regressions through. The risk of over-engineering an evaluation system is real — a complex system that nobody understands or maintains is worse than a simple script that the team actually uses.