Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

Design Doc / RFC · Agent quality platform (scenario) · AI / Quality

Design Doc: Continuous Evaluation Suite for Agents with Bedrock AgentCore

Jun 7, 2026 · 12 min · AI-assisted

LLM agents in production silently degrade as models, tools, and prompts evolve — without a continuous evaluation discipline, regressions reach users before they are detected. This document proposes a complete offline and online evaluation architecture using Amazon Bedrock AgentCore, with versioned datasets, CI/CD quality gates, runtime signals, and systematic adversarial testing.

An agent that works well at launch can silently fail six weeks later — when the base model is updated, a tool changes its schema, or a new prompt flow is introduced. Without a continuous evaluation suite with real quality gates, you are flying blind. This RFC describes how to build that discipline on top of Bedrock AgentCore.

The Problem: Agents Are Composite Systems That Degrade in Non-Obvious Ways

LLM agents are not deterministic functions. They combine a language model (which can be updated by the provider without explicit notice), a set of tools with their own APIs and schemas, short- and long-term memory, and an orchestration layer that interprets intent and chains calls. Each of these components can change independently — and the interaction between changes is where the most dangerous regressions hide.

Consider a concrete scenario: a fintech's customer support agent uses a get_account_balance tool that returns values in cents. The backend team migrates the API to return values in the base currency unit without updating the tool documentation. The agent keeps working: it calls the tool, receives a number, and responds to the user. But because it still treats that number as cents, it now reports balances 100x smaller than the real value. No runtime error. No latency alarm. The degradation is semantic, not technical.
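The factor-of-100 error is easy to reproduce with a few lines of arithmetic. This is a minimal sketch: `format_balance` is a hypothetical stand-in for whatever formatting the agent's prompt or tool wrapper performs, and it assumes the agent keeps dividing by 100 because the tool documentation still says cents.

```python
def format_balance(raw_value: int) -> str:
    """Format a tool result as currency, assuming the value is in cents."""
    return f"${raw_value / 100:,.2f}"


real_balance = 1250  # the customer really holds 1,250.00 in the base unit

# Before the migration the API returns cents, so the assumption holds.
before_migration = format_balance(real_balance * 100)

# After the migration the API returns base units, but nothing raises.
after_migration = format_balance(real_balance)

print(before_migration)  # $1,250.00
print(after_migration)   # $12.50
```

Both calls succeed and both outputs look like plausible balances, which is why only a test case with a known expected value would catch the change.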

This is the central problem a continuous evaluation suite needs to solve: detecting semantic and behavioral degradation before it reaches the user, in a system where parts change asynchronously and errors are rarely exceptions — they are plausible but incorrect responses.

The state of the art in agent evaluation is still consolidating. Most teams I encounter operate with a combination of: manual ad hoc testing before releases, some user satisfaction metrics collected reactively, and latency/error alarms that only capture technical failures. This is insufficient for systems that make decisions with real consequences. What we need is a discipline analogous to what we have in traditional software systems — unit tests, integration, regression, coverage — but adapted to the probabilistic and compositional nature of agents.

Goals and Non-Goals

✅ GOAL: Define an offline (pre-deploy) evaluation architecture with versioned datasets and quality gates that block promotion of regressive versions

✅ GOAL: Instrument online (post-deploy) evaluation signals that detect degradation in production without relying exclusively on human feedback

✅ GOAL: Create tool-use specific metrics: correct call rate, argument accuracy, call sequence correctness, and tool error recovery

✅ GOAL: Define a curation and versioning process for evaluation datasets that evolves alongside the agent

✅ GOAL: Include systematic adversarial scenarios: prompt injection, boundary testing, and edge cases that expose reasoning failures

❌ NON-GOAL: Define the internal architecture of the agent itself (routing, memory, RAG) — that is a separate document

Scenario Context

System: Quality platform for LLM agents in production (composite scenario)
Base infrastructure: Amazon Bedrock AgentCore + AWS
Target scale: 10-50 distinct agents, each with 5-20 tools, multiple versions in parallel
Offline evaluation frequency: On every PR / commit to agent config, prompt, or tool schema
Online evaluation frequency: Continuous via sampling (estimate: 5-15% of conversations)
Main stack: Bedrock AgentCore, S3, DynamoDB, Step Functions, Lambda, EventBridge, CloudWatch, Bedrock Evaluations
Document type: RFC / Design Doc — proposal for implementation
Main risk: Silent semantic degradation in production without automated detection

Proposed Design: Three Evaluation Layers

The architecture I propose organizes evaluation into three complementary layers, each with distinct purpose, frequency, and cost. The useful metaphor here is defense in depth — no single layer is sufficient, but together they form a robust detection network.

Layer 1 — Offline Evaluation (Pre-Deploy Gate)

Before any agent version is promoted to production, it runs through an offline test battery against a versioned dataset stored in S3. The dataset is structured as a set of test cases, each containing: the user input (or turn sequence for multi-turn conversations), the expected set of tool calls (with expected arguments and tolerances), the expected response (or evaluation criteria when there is no deterministic response), and category metadata (task type, difficulty level, whether it is adversarial).
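One possible shape for such a test case is sketched below. The class and field names (`EvalCase`, `ExpectedToolCall`, `tolerances`, `critical`) are illustrative assumptions, not an AgentCore schema; the point is only that every element listed above gets an explicit, serializable field.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ExpectedToolCall:
    tool: str
    arguments: dict
    # Absolute tolerance per numeric argument, e.g. {"amount": 0.01}.
    tolerances: dict = field(default_factory=dict)


@dataclass
class EvalCase:
    case_id: str
    turns: list                      # one user input, or several for multi-turn
    expected_tool_calls: list        # list of ExpectedToolCall
    expected_response: Optional[str] = None  # None when criteria are used instead
    evaluation_criteria: list = field(default_factory=list)
    task_type: str = "general"
    difficulty: str = "medium"
    adversarial: bool = False
    critical: bool = False


case = EvalCase(
    case_id="balance-001",
    turns=["What is my current balance?"],
    expected_tool_calls=[
        ExpectedToolCall("get_account_balance", {"account_id": "A-1"})
    ],
    evaluation_criteria=["States the balance in the account currency"],
    task_type="account_inquiry",
    critical=True,
)

# One JSON object per line keeps the dataset diff-friendly and easy to stream.
jsonl_line = json.dumps(asdict(case), sort_keys=True)
print(jsonl_line)
```

Storing cases as JSON Lines is a design choice of this sketch; any format works as long as the stored artifact is immutable once published.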

Bedrock AgentCore runs the agent in evaluation mode — with the same tools configured, but in an isolated environment with mocked tools that return controlled responses. This is critical: you do not want offline tests making real calls to production APIs, but you also do not want mocks so simplified they do not test the agent's reasoning about tool responses.
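A mock layer that is controlled but not trivial could look like the sketch below. `MockToolLayer` is a hypothetical helper, not part of AgentCore: it replays scripted responses per tool in order, including scripted failures, and records every call so that tool-use metrics can be computed afterwards.

```python
class MockToolLayer:
    """Replays scripted responses per tool and records every call."""

    def __init__(self, scripts):
        # scripts maps a tool name to the responses it returns, in order.
        self._scripts = {name: list(items) for name, items in scripts.items()}
        self.calls = []

    def invoke(self, tool, arguments):
        self.calls.append({"tool": tool, "arguments": arguments})
        queue = self._scripts.get(tool)
        if not queue:
            raise KeyError(f"no scripted response left for tool {tool!r}")
        response = queue.pop(0)
        # A scripted exception simulates a tool failure the agent must handle.
        if isinstance(response, Exception):
            raise response
        return response


mocks = MockToolLayer({
    "get_account_balance": [
        TimeoutError("upstream timeout"),              # first call fails
        {"balance_cents": 125000, "currency": "USD"},  # retry succeeds
    ],
})
```

Scripting a failure before the success forces the agent to reason about the tool response rather than assume a happy path.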

Metrics collected in this layer include: tool-call accuracy (did the agent call the right tool?), argument fidelity (were arguments correct within tolerance?), call sequence correctness (for tasks requiring a specific call sequence), unnecessary tool calls (did the agent call unnecessary tools, indicating confused reasoning?), and final response quality evaluated by an LLM-as-judge configured via Bedrock Evaluations.
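The deterministic part of these metrics can be computed from the expected and recorded tool calls alone. The sketch below is one possible scoring scheme under stated assumptions: calls are plain dicts with `tool` and `arguments` keys, numeric arguments may carry an absolute tolerance, and each expected call is matched to the first unused recorded call of the same tool.

```python
from collections import Counter


def arguments_match(expected, actual, tolerances=None):
    """Exact match, except numeric arguments that have an absolute tolerance."""
    tolerances = tolerances or {}
    if expected.keys() != actual.keys():
        return False
    for key, want in expected.items():
        got = actual[key]
        tol = tolerances.get(key)
        numeric = isinstance(want, (int, float)) and isinstance(got, (int, float))
        if tol is not None and numeric:
            if abs(want - got) > tol:
                return False
        elif want != got:
            return False
    return True


def score_case(expected_calls, actual_calls):
    expected_tools = [c["tool"] for c in expected_calls]
    actual_tools = [c["tool"] for c in actual_calls]
    missing = Counter(expected_tools) - Counter(actual_tools)
    extra = Counter(actual_tools) - Counter(expected_tools)

    unused = list(actual_calls)
    args_ok = 0
    for exp in expected_calls:
        for act in unused:
            if act["tool"] == exp["tool"]:
                unused.remove(act)
                if arguments_match(exp["arguments"], act["arguments"],
                                   exp.get("tolerances")):
                    args_ok += 1
                break

    return {
        "tool_calls_correct": not missing,
        "argument_fidelity": args_ok / len(expected_calls) if expected_calls else 1.0,
        "sequence_correct": actual_tools == expected_tools,
        "unnecessary_calls": sum(extra.values()),
    }


expected = [
    {"tool": "get_account_balance", "arguments": {"account_id": "A-1"}},
    {"tool": "convert_currency", "arguments": {"amount": 1250.0, "to": "EUR"},
     "tolerances": {"amount": 0.01}},
]
actual = [
    {"tool": "get_account_balance", "arguments": {"account_id": "A-1"}},
    {"tool": "get_exchange_rate", "arguments": {"pair": "USDEUR"}},
    {"tool": "convert_currency", "arguments": {"amount": 1250.004, "to": "EUR"}},
]
result = score_case(expected, actual)
print(result)
```

In this example the agent called both expected tools with acceptable arguments but inserted an extra call, so the sequence check fails and one unnecessary call is counted.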

The quality gate is defined as a set of thresholds per metric category. Initial proposal: tool-call accuracy ≥ 0.90 on the full set, ≥ 0.95 on the subset of critical cases (flagged in the dataset), and no regression > 5 percentage points relative to the previous version's baseline. That last criterion — relative regression — is as important as the absolute threshold: an agent that was already poor at something should not get even worse.
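The gate logic itself is small. The sketch below uses the thresholds proposed above as defaults; the function name and the shape of the `current` and `baseline` dicts are assumptions made for illustration.

```python
def evaluate_gate(current, baseline, min_overall=0.90, min_critical=0.95,
                  max_regression_pp=5.0):
    """Return a list of failure reasons; an empty list means the gate passes."""
    failures = []
    if current["overall"] < min_overall:
        failures.append(
            f"overall accuracy {current['overall']:.2f} < {min_overall:.2f}")
    if current["critical"] < min_critical:
        failures.append(
            f"critical accuracy {current['critical']:.2f} < {min_critical:.2f}")
    for subset in ("overall", "critical"):
        # Rounding avoids spurious failures from floating-point noise.
        drop_pp = round((baseline[subset] - current[subset]) * 100, 6)
        if drop_pp > max_regression_pp:
            failures.append(
                f"{subset} accuracy regressed {drop_pp:.1f} pp vs baseline")
    return failures


baseline = {"overall": 0.97, "critical": 0.98}
candidate = {"overall": 0.91, "critical": 0.96}
failures = evaluate_gate(candidate, baseline)
print(failures)  # ['overall accuracy regressed 6.0 pp vs baseline']
```

The candidate clears both absolute thresholds and is still blocked, which is the case the relative criterion exists to catch.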

Layer 2 — Adversarial Testing

Separate from the standard evaluation dataset, we maintain an adversarial dataset with specific categories: prompt injection attempts (attempts to make the agent ignore its system instructions), boundary probing (inputs that test policy limits — what does the agent do when it receives an ambiguous request that could be legitimate or malicious?), and tool-use edge cases (what happens when a tool returns an unexpected error? Does the agent recover gracefully or loop?).

These tests do not follow the same pass/fail regime as functional tests — they have expected behavior criteria. For prompt injection, the criterion is that the agent maintains its system instructions. For boundary cases, the criterion is that the agent escalates or refuses appropriately rather than hallucinating a response. For tool errors, the criterion is recovery in at most N attempts with graceful fallback.
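The tool-error criterion can be checked mechanically from a recorded trace. The sketch below assumes a simplified trace format (a list of `{"tool", "ok"}` dicts plus a final `outcome` label); those names and the outcome values are hypothetical, and N is passed as `max_attempts`.

```python
GRACEFUL_OUTCOMES = {"fallback", "escalated"}


def check_tool_error_recovery(tool_calls, outcome, max_attempts=3):
    """Return (passed, reason) for one adversarial tool-error case."""
    failure_streak = {}
    for call in tool_calls:
        tool = call["tool"]
        if call["ok"]:
            failure_streak[tool] = 0
            continue
        failure_streak[tool] = failure_streak.get(tool, 0) + 1
        if failure_streak[tool] > max_attempts:
            return False, f"{tool}: more than {max_attempts} consecutive failed attempts"

    # Tools whose last call failed were never recovered.
    unrecovered = sorted(t for t, n in failure_streak.items() if n > 0)
    if unrecovered and outcome not in GRACEFUL_OUTCOMES:
        return False, f"answered despite unrecovered errors in: {', '.join(unrecovered)}"
    return True, "ok"


fail = {"tool": "get_account_balance", "ok": False}
ok = {"tool": "get_account_balance", "ok": True}

print(check_tool_error_recovery([fail, ok], "answered"))            # recovered
print(check_tool_error_recovery([fail, fail, fail], "fallback"))    # graceful
print(check_tool_error_recovery([fail] * 5, "answered"))            # looping
print(check_tool_error_recovery([fail], "answered"))                # hallucinated
```

The last case is the one functional tests tend to miss: the agent stops retrying but still produces an answer it had no data for.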

Layer 3 — Online Signals (Production Monitoring)

In production, we instrument an asynchronous evaluation pipeline that processes a sample of real conversations. Sampling is stratified — we ensure representation of different task types, not just volume. Selected conversations are sent to a Step Functions workflow that: extracts tool-use traces from AgentCore, runs LLM-as-judge evaluation on final turns, and persists metrics to CloudWatch with dimensions by agent version, task type, and tool.

Beyond sampling, we monitor proxy signals that do not require LLM evaluation: rate of conversations hitting the maximum turn limit (indicates agent stuck in a loop), rate of unrecovered tool errors, escalation-to-human rate (if applicable), and latency distribution per tool (latency changes can indicate call behavior changes).
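These proxy signals are plain aggregations over conversation records. The sketch below assumes a hypothetical record shape (`turns`, `unrecovered_tool_errors`, `escalated`, `tool_latencies_ms`) and uses the median as a simple stand-in for a latency distribution; a real pipeline would emit these values as CloudWatch metrics rather than return a dict.

```python
import statistics
from collections import defaultdict


def proxy_signals(conversations, max_turns):
    """Aggregate cheap health signals that need no LLM judge."""
    total = len(conversations)
    if total == 0:
        return {}
    latencies = defaultdict(list)
    for conv in conversations:
        for tool, millis in conv["tool_latencies_ms"]:
            latencies[tool].append(millis)
    return {
        "max_turn_rate": sum(c["turns"] >= max_turns for c in conversations) / total,
        "unrecovered_tool_error_rate": sum(
            c["unrecovered_tool_errors"] > 0 for c in conversations) / total,
        "escalation_rate": sum(c["escalated"] for c in conversations) / total,
        "median_latency_ms": {t: statistics.median(v) for t, v in latencies.items()},
    }


conversations = [
    {"turns": 3, "unrecovered_tool_errors": 0, "escalated": False,
     "tool_latencies_ms": [("get_account_balance", 120)]},
    {"turns": 10, "unrecovered_tool_errors": 1, "escalated": True,
     "tool_latencies_ms": [("get_account_balance", 180)]},
    {"turns": 4, "unrecovered_tool_errors": 0, "escalated": False,
     "tool_latencies_ms": [("get_account_balance", 140)]},
    {"turns": 2, "unrecovered_tool_errors": 0, "escalated": False,
     "tool_latencies_ms": []},
]
signals = proxy_signals(conversations, max_turns=10)
print(signals)
```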

Continuous Evaluation Suite Architecture

Complete flow from committing a new agent version through quality signals in production, showing the three evaluation layers and the dataset curation feedback loop.

🔧 CI/CD Pipeline

Git Repo · Agent Config / Prompts
CI/CD · (CodePipeline)
Quality Gate · Lambda

📦 Dataset Management

S3 · Versioned Datasets
DynamoDB · Dataset Registry
Curator UI · (Internal Tool)

🧪 Offline Evaluation

Step Functions · Eval Orchestrator
Bedrock AgentCore · (Eval Mode)
Mock Tool Layer · Lambda
Bedrock Evaluations · LLM-as-Judge
S3 · Eval Results

⚔️ Adversarial Layer

S3 · Adversarial Dataset
Adversarial Runner · Lambda

🚀 Production

Bedrock AgentCore · (Production)
Real Tools · (APIs)
AgentCore Traces · S3 / CloudWatch

📊 Online Monitoring

Sampling Lambda · Stratified 5-15%
Step Functions · Online Eval
Bedrock Evaluations · Online Judge
CloudWatch · Metrics + Alarms
EventBridge · Anomaly Events

Dataset Management: Versioning, Curation, and the Dataset Drift Problem

One of the most neglected aspects of agent evaluation systems is management of the evaluation dataset itself. Teams build an initial set of test cases, run them for months, and never revisit — while the agent evolves to handle entirely new use cases that are not represented. The dataset ages and loses discriminative power.

The proposal here is to treat the evaluation dataset with the same discipline we apply to code: explicit versioning, changelog, and a continuous curation process.

Versioning structure: Each dataset version is an immutable artifact in S3, identified by a content hash and a semantic version number (major.minor.patch). DynamoDB maintains the registry with metadata: creation date, author, changelog, coverage by task category, and the agent version it was validated against. The quality gate always runs against the dataset version specified in the agent manifest — not against "latest". This is deliberate: it guarantees reproducibility and prevents a new dataset from breaking an agent that was working well.
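A minimal sketch of the pinning rule follows. The registry is a plain dict standing in for the DynamoDB table, and the key layout, manifest fields, and dataset name are assumptions made for illustration.

```python
import hashlib
import json


def content_hash(cases):
    """SHA-256 over a canonical serialization, independent of case order."""
    ordered = sorted(cases, key=lambda c: c["case_id"])
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dataset_key(name, version, cases):
    """Hypothetical S3 key layout: name, semantic version, short hash."""
    return f"datasets/{name}/{version}/{content_hash(cases)[:12]}.jsonl"


def resolve_pinned(manifest, registry):
    """Resolve the dataset pinned in an agent manifest; never follow 'latest'."""
    pin = manifest["eval_dataset"]
    if pin["version"] == "latest":
        raise ValueError("manifest must pin an explicit dataset version")
    entry = registry[(pin["name"], pin["version"])]
    if entry["sha256"] != pin["sha256"]:
        raise ValueError("dataset content does not match the pinned hash")
    return entry["s3_key"]


cases = [{"case_id": "b", "turns": ["x"]}, {"case_id": "a", "turns": ["y"]}]
digest = content_hash(cases)
registry = {
    ("support-agent-eval", "2.1.0"): {
        "sha256": digest,
        "s3_key": dataset_key("support-agent-eval", "2.1.0", cases),
    }
}
manifest = {"eval_dataset": {"name": "support-agent-eval",
                             "version": "2.1.0", "sha256": digest}}
print(resolve_pinned(manifest, registry))
```

Checking the hash as well as the version means a dataset that was silently overwritten under the same version number fails loudly instead of changing gate results.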

Curation process: New test cases enter the dataset through three paths. The first is manual curation — engineers or domain experts write cases based on requirements. The second is production mining — real conversations that were flagged as problematic (via user feedback, escalation, or automatic detection) are reviewed and converted into test cases. This is the most valuable path: these are real failures, not hypothetical ones. The third is synthetic generation — using an LLM to generate variations of existing cases, especially to increase coverage of edge cases and adversarial scenarios.

The dataset drift problem: As the agent improves, cases that were once difficult become trivial. A dataset that is not updated loses discriminative power — the agent passes everything, but that does not mean it is working well on new cases. The solution is to monitor the distribution of dataset scores over time: if most cases score near 1.0, it is time to add harder cases. This is analogous to the benchmark saturation problem in ML — when a model reaches human performance on a benchmark, the benchmark needs to be replaced, not the model declared perfect.
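A saturation check along these lines can run after every offline evaluation. The two thresholds (`near_perfect` and `saturation_share`) are illustrative assumptions and would need calibration against real score distributions.

```python
def saturation_report(case_scores, near_perfect=0.95, saturation_share=0.8):
    """Flag a dataset whose cases mostly score near 1.0."""
    if not case_scores:
        raise ValueError("no scores to analyse")
    trivial = sorted(cid for cid, score in case_scores.items()
                     if score >= near_perfect)
    share = len(trivial) / len(case_scores)
    return {
        "near_perfect_share": share,
        "saturated": share >= saturation_share,
        "trivial_cases": trivial,  # candidates to retire or make harder
    }


scores = {"c1": 1.0, "c2": 0.98, "c3": 1.0, "c4": 0.97, "c5": 0.99, "c6": 0.60}
report = saturation_report(scores)
print(report["saturated"], report["trivial_cases"])
```

Listing the trivial cases, rather than only reporting a flag, gives curators a concrete starting point for the next dataset revision.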

Tool-use coverage: For agents with multiple tools, the dataset must ensure adequate coverage of each tool and tool combinations. We maintain a coverage map showing, for each tool, how many test cases exercise it, in what contexts, and with what argument patterns. New tools added to the agent must have minimum coverage before the agent can be promoted — this is an additional quality gate rule.
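The coverage map can be derived directly from the dataset. This sketch assumes test cases are dicts with `task_type` and `expected_tool_calls` fields, and it treats the set of argument names as the "argument pattern"; both are simplifications.

```python
from collections import defaultdict


def build_coverage_map(cases):
    """For each tool: number of cases, task types, and argument patterns seen."""
    coverage = defaultdict(
        lambda: {"cases": 0, "task_types": set(), "argument_patterns": set()})
    for case in cases:
        counted = set()
        for call in case["expected_tool_calls"]:
            entry = coverage[call["tool"]]
            if call["tool"] not in counted:  # count each case once per tool
                entry["cases"] += 1
                counted.add(call["tool"])
            entry["task_types"].add(case["task_type"])
            entry["argument_patterns"].add(tuple(sorted(call["arguments"])))
    return dict(coverage)


def under_covered(agent_tools, coverage, min_cases=10):
    """Tools that block promotion because they lack minimum coverage."""
    return sorted(t for t in agent_tools
                  if coverage.get(t, {"cases": 0})["cases"] < min_cases)


cases = [
    {"task_type": "account_inquiry", "expected_tool_calls": [
        {"tool": "get_account_balance", "arguments": {"account_id": "A-1"}}]},
    {"task_type": "transfer", "expected_tool_calls": [
        {"tool": "get_account_balance", "arguments": {"account_id": "A-1"}},
        {"tool": "transfer_funds",
         "arguments": {"from_account": "A-1", "to_account": "B-2", "amount": 50}}]},
]
coverage = build_coverage_map(cases)
agent_tools = ["get_account_balance", "transfer_funds", "close_account"]
print(under_covered(agent_tools, coverage, min_cases=2))
```

A tool with no test cases at all, like the hypothetical `close_account` here, shows up in the same list as a tool with too few.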

Design Alternatives: Evaluation Approaches

LLM-as-Judge only (no structured test cases)

Pros

Easy to scale — does not require manual test case curation
Flexible for evaluating open-ended responses without reference answers

Cons

Does not detect specific tool-use regressions — the judge evaluates the final result, not the process
High cost to evaluate each case; inconsistency between runs of the same judge
No deterministic baseline — hard to know if a score change is real or judge noise

Useful as a complementary layer, insufficient as the sole approach

Deterministic tests only (mock tools + exact expected outputs)

Pros

Reproducible and cheap — no LLM calls for evaluation
Detects tool-use regressions with high precision

Cons

Brittle for natural language responses — any legitimate variation breaks the test
Does not capture semantic quality of the final response
High maintenance — each prompt change can invalidate hundreds of expected outputs

Excellent for tool-use assertions; inadequate for response evaluation

Hybrid approach (proposed in this RFC)

Pros

Deterministic where possible (tool-use), probabilistic where necessary (final response)
Offline + online layers cover different failure points
Versioned dataset allows tracking regressions over time with reproducibility

Cons

Higher operational complexity — multiple systems to maintain
LLM-as-judge cost for online evaluation can be significant at scale

Recommended approach — complexity justified by discriminative power

Third-party evaluation platform (Braintrust, LangSmith, etc.)

Pros

Faster time-to-value — ready UI, existing integrations
Community and metric templates

Cons

Evaluation data (including production conversations) leaves the AWS environment — problem for sensitive data
Integration with Bedrock AgentCore traces requires custom work regardless
Additional vendor lock-in; per-volume cost can be prohibitive

Valid for prototyping; not recommended for production in sensitive data contexts

Decision: Sampling Strategy for Online Evaluation

Proposed

Context

Evaluating 100% of production conversations with LLM-as-judge is prohibitively expensive and slow. We need a sampling strategy that maximizes detection power with controlled cost.

Decision

Stratified sampling with a 10% base rate, increasing to 50% when proxy metrics (error rate, anomalous latency) exceed thresholds. Conversations with explicit negative user feedback are always evaluated (100%). Guaranteed distribution by task type and most critical tool.
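The decision translates into a small amount of logic. The sketch below is one way to implement it, under stated assumptions: sampling is made deterministic by hashing the conversation ID, strata are `(task_type, critical_tool)` pairs, and the per-stratum floor (`min_per_stratum`) is a hypothetical parameter that the caller resets per evaluation window.

```python
import hashlib


def sampling_rate(proxy_alert, negative_feedback, base=0.10, elevated=0.50):
    """Pick the rate: 100% on negative feedback, elevated during alerts."""
    if negative_feedback:
        return 1.0
    return elevated if proxy_alert else base


def should_sample(conversation_id, rate):
    """Deterministic sampling: the same ID always gets the same decision."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def decide(conv, sampled_per_stratum, proxy_alert, min_per_stratum=20):
    """Sample if the stratum is under its floor, or if the hash falls in rate."""
    stratum = (conv["task_type"], conv["critical_tool"])
    rate = sampling_rate(proxy_alert, conv["negative_feedback"])
    under_floor = sampled_per_stratum.get(stratum, 0) < min_per_stratum
    take = under_floor or should_sample(conv["id"], rate)
    if take:
        sampled_per_stratum[stratum] = sampled_per_stratum.get(stratum, 0) + 1
    return take


counts = {}
conv = {"id": "conv-1", "task_type": "transfer",
        "critical_tool": "transfer_funds", "negative_feedback": False}
print(decide(conv, counts, proxy_alert=False, min_per_stratum=1), counts)
```

Hashing the ID instead of calling a random generator makes the sampling decision reproducible, which helps when investigating why a given conversation was or was not evaluated.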

Consequences

Controlled online evaluation cost: estimated at 10-15% of what evaluating every conversation would cost
Anomaly detection may have latency of minutes to hours depending on traffic volume
Requires implementation of stratified sampling logic — non-trivial to guarantee representativeness

Rollout Plan

Phase 1 — Foundation (Weeks 1-3)
Create the dataset structure in S3 and DynamoDB. Define the test case schema. Build the first dataset with 50-100 manual cases covering the agent's most critical flows. Configure AgentCore in evaluation mode with mock tools for main flows. No quality gate yet — just collect baseline data.
Phase 2 — Offline Evaluation (Weeks 4-6)
Implement the offline evaluation Step Functions workflow. Integrate Bedrock Evaluations as LLM-as-judge for final responses. Define tool-use metrics and implement deterministic calculation. Run against the last 3 agent versions to calibrate quality gate thresholds. Integrate into CI/CD as a non-blocking check initially.
Phase 3 — Active Quality Gate (Weeks 7-8)
Activate the quality gate as blocking in CI/CD. Define thresholds based on collected baseline data. Create the initial adversarial dataset with 20-30 prompt injection and boundary testing cases. Train the team on how to interpret results and how to add new cases to the dataset.
Phase 4 — Online Monitoring (Weeks 9-12)
Implement the sampling and online evaluation pipeline. Configure CloudWatch dashboards with metrics by version and by tool. Create alarms for anomalies. Implement the production mining flow: problematic conversations detected online are queued for review and potential addition to the offline dataset.
Phase 5 — Maturity and Automation (Weeks 13-16)
Implement synthetic test case generation to increase coverage. Create the tool-use coverage map and add minimum coverage rules to the quality gate. Review and refine thresholds with 2+ months of data. Document the curation process and create runbooks for the team.

Critical Risks and Mitigations

Risk 1 (Goodhart's Law in the quality gate): When a measure becomes a target, it ceases to be a good measure. Teams may start writing test cases the agent already passes well, rather than cases that test real weaknesses. Mitigation: separate who writes test cases from who develops the agent; periodically include test cases from external adversaries.

Risk 2 (Judge inconsistency): LLM-as-judge has non-trivial variance between runs. An agent may pass the quality gate in one run and fail in another with no real change. Mitigation: use multiple runs (3-5) and aggregate scores; define thresholds with a safety margin above the observed judge variance.

Risk 3 (Mock tool drift): Tool mocks can diverge from real API behavior over time, making offline tests increasingly unrepresentative. Mitigation: version mocks alongside real tool schemas; include contract tests that verify mocks still correspond to real APIs.

Risk 4 (Online evaluation cost at scale): At high volumes, even 10% sampling with LLM-as-judge can be expensive. Mitigation: use smaller models for the online judge (Claude Haiku vs. Sonnet), reserving more capable models for offline evaluation where precision is more critical.

Risk 5 (False confidence): An evaluation suite that passes can give false confidence. The dataset never covers all possible cases. Mitigation: explicitly communicate what the dataset covers and what it does not; keep humans in the loop for review of high-risk cases.
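The mitigation for judge inconsistency can be sketched as follows. Requiring the mean minus `k` standard deviations to clear the threshold is one possible reading of "safety margin", not the only one, and `k` is a hypothetical tuning parameter.

```python
import statistics


def aggregate_judge_runs(scores):
    """Summarize repeated judge runs for the same agent version."""
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": statistics.fmean(scores), "stdev": spread}


def passes_with_margin(scores, threshold, k=1.0):
    """Pass only if the mean clears the threshold by k standard deviations."""
    summary = aggregate_judge_runs(scores)
    return summary["mean"] - k * summary["stdev"] >= threshold


runs = [0.82, 0.86, 0.84, 0.85, 0.83]  # five judge runs, same agent version
summary = aggregate_judge_runs(runs)
print(round(summary["mean"], 3), round(summary["stdev"], 3))
print(passes_with_margin(runs, threshold=0.80))
print(passes_with_margin(runs, threshold=0.83))
```

With these five runs the mean is 0.84, yet the second check fails: the mean alone would clear 0.83, but not once the observed spread is subtracted.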

Success Metrics and Targets

Tool-call accuracy (offline): ≥ 0.90 overall, ≥ 0.95 on critical cases
Maximum allowed regression (offline): < 5 percentage points vs. previous version baseline
Tool coverage in dataset: 100% of tools with ≥ 10 test cases each
Offline evaluation pipeline latency: < 15 minutes for dataset of up to 200 cases (estimate)
Online evaluation coverage: ≥ 10% of all conversations, 100% of those with negative feedback
MTTD (Mean Time to Detect) production regression: < 4 hours at normal volume (estimate)
Quality gate false positive rate: < 5% — gates that block deploys without real regression
Dataset update cadence: Mandatory monthly review; continuous addition via production mining

My Perspective: What I Would Do Differently and What I Would Not Compromise On

Senior Solutions Architect

After working with high-criticality systems in finance, I have a strong conviction: the quality of an evaluation system is determined by the quality of its dataset, not the sophistication of its infrastructure. You can have the most elegant pipeline in the world running on Bedrock AgentCore with Step Functions and LLM-as-judge, but if the dataset has 30 cases all of the same type, you are measuring nothing.

What I would not compromise on: the production mining process. Test cases derived from real failures are orders of magnitude more valuable than synthetic cases. In every project I worked on involving system evaluation, the most important bugs were found by cases that came from real incidents, not from preventive brainstorming. I would invest disproportionately in the flow from "problematic conversation detected in production" → "human review" → "test case in offline dataset".

What I would do differently from what is described here: start smaller. This document describes a complete, mature system. In practice, I would start with a simple Python script that runs 50 test cases against the agent before each deploy, without Step Functions, without LLM-as-judge, just deterministic tool-use assertions. That already captures 70% of the value with 10% of the complexity. The additional sophistication is justified as you have evidence that you need it.

On Bedrock AgentCore specifically: the ability to run the agent in evaluation mode with mocked tools is genuinely useful: it solves a real environment isolation problem that teams usually solve with workarounds. Dataset management via AgentCore is a recent addition that is still maturing; I would maintain a thin abstraction layer over it to avoid being held hostage to API changes.

The most important lesson I have learned about AI system evaluation: treat evaluation as a product, not as infrastructure. It needs clear ownership, someone who wakes up in the morning thinking about how to improve the dataset and thresholds. Without that, the most sophisticated system in the world becomes a CI check that nobody reads.

Verdict

The continuous evaluation suite described in this RFC is necessary for any LLM agent operating in production with real consequences. The three-layer architecture — offline evaluation with quality gate, systematic adversarial testing, and online monitoring with stratified sampling — covers the most important degradation vectors: model changes, tool schema drift, and regressions introduced by prompt updates. Bedrock AgentCore provides the necessary primitives — execution in evaluation mode, tool-use tracing, and integration with Bedrock Evaluations for LLM-as-judge — without requiring you to build this infrastructure from scratch. The real value, however, is not in the infrastructure: it is in the dataset curation process and the discipline of treating evaluation as a product with clear ownership. My recommendation: implement in phases, starting with what generates the most value with the least complexity (versioned dataset + deterministic tool-use quality gate), and add sophistication only when you have evidence that the simple system is letting regressions through. The risk of over-engineering an evaluation system is real — a complex system that nobody understands or maintains is worse than a simple script that the team actually uses.

Case sources

AWS AI Blog — AgentCore dataset management Amazon Bedrock AgentCore

Written with AI assistance from the public case and my architect's reading.
