Agentic RAG on AWS: Architecture Bake-Off for Financial-Grade Platforms
Agentic RAG has moved from lab experiment to platform requirement in financial environments that demand auditability, cost control, and predictable latency. In this article I compare four concrete architectural approaches on AWS, with real trade-offs, plausible numbers, and an unambiguous recommendation.
When Google Research published its analysis on reliable Agentic RAG for enterprise platforms, the signal I captured was not about models — it was about orchestration, governance, and where control responsibility should live in the stack. In financial environments where every LLM call needs to be auditable, every retrieval failure needs to be traceable, and every inference dollar needs to be justifiable, the choice of agent orchestration architecture is not an implementation detail: it is a first-class architectural decision. I have four serious candidates on AWS and I am going to put them head to head.
The Real Problem: Why Naive RAG Falls Short in Finance
Classic RAG — embed, retrieve, generate — solves the static knowledge problem but breaks along three critical dimensions in financial environments. First, multi-hop reasoning: a query like "what is customer X's consolidated credit exposure considering derivatives and open positions as of T-2?" requires multiple retrieval steps with dependencies between them, not a single vector search. Second, tool use: computing VaR, querying a credit limit system, or triggering a real-time pricing API are actions, not documents — and naive RAG has no action mechanism. Third, regulatory traceability: under BACEN, CVM, or SEC, the reasoning chain that led to an automated recommendation or decision must be reconstructible, not just the final output.
Agentic RAG solves all three by introducing a plan-execute-observe loop (ReAct, MRKL, or variants) where the agent dynamically decides which tools to invoke, in what order, and when the answer is good enough to terminate. The cost of that capability is operational complexity: agent loops introduce non-determinism, variable latency, infinite-loop risk, and an expanded attack surface. The central architectural choice is where that loop lives and who controls it — and that is precisely where the four approaches diverge materially.
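The loop described above can be sketched in a few lines. This is a minimal illustration, not any framework's API: `Step`, `AgentResult`, `run_agent`, and `stub_plan` are hypothetical names, and `plan` stands in for the model call. The point it shows is that termination is enforced by the loop itself (`max_steps`), not left to the model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One planner decision: call a tool, or finish with an answer."""
    tool: Optional[str] = None
    tool_input: str = ""
    answer: Optional[str] = None


@dataclass
class AgentResult:
    answer: Optional[str]
    steps_used: int
    terminated_by: str  # "answer" or "step_limit"
    observations: List[str] = field(default_factory=list)


def run_agent(
    question: str,
    plan: Callable[[str, List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> AgentResult:
    """Plan, execute, observe; stop on an answer or when the ceiling is hit."""
    observations: List[str] = []
    for step_no in range(1, max_steps + 1):
        step = plan(question, observations)              # plan
        if step.answer is not None:
            return AgentResult(step.answer, step_no, "answer", observations)
        tool = tools.get(step.tool or "")
        if tool is None:
            # Unknown tools are reported back to the planner, never executed.
            observations.append(f"error: unknown tool {step.tool!r}")
        else:
            observations.append(tool(step.tool_input))   # execute + observe
    return AgentResult(None, max_steps, "step_limit", observations)


def stub_plan(question: str, observations: List[str]) -> Step:
    """Stand-in for the model: retrieve once, then answer from the observation."""
    if not observations:
        return Step(tool="retrieve", tool_input=question)
    return Step(answer=f"based on: {observations[-1]}")


result = run_agent("exposure for customer X?", stub_plan, {"retrieve": lambda q: "doc-1"})
print(result.terminated_by, result.steps_used)  # prints: answer 2
```

Where this function runs (inside Bedrock, in an EKS pod, or unrolled into workflow states) is the question the four options answer differently.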
The Four Candidates: Architectural Identity of Each
Option A — Native Bedrock Agents: The agent lives entirely inside Bedrock. AWS manages the ReAct loop, tool routing (Action Groups via Lambda), session memory, and Knowledge Base integration (OpenSearch Serverless as the vector backend). The operator defines the instruction prompt, tool definitions in OpenAPI schema, and guardrails. Orchestration latency runs around 800ms–1.5s per reasoning step for Claude 3 Sonnet, billed per input/output token plus orchestration overhead.
Option B — LangGraph + EKS: The agent loop runs in Python inside EKS pods, using LangGraph to define the state graph. The team has full control over every graph node, transitions, state checkpointing in DynamoDB, and integration with any retrieval backend — OpenSearch Service (provisioned), pgvector on Aurora, or Kendra. Orchestration latency is deterministic and controllable, but operational responsibility is total.
Option C — Step Functions-orchestrated: The agent loop is externalized to a Step Functions Express Workflow. Each reasoning step is a machine state — invoke model, evaluate, branch, retrieve, tool call. LLM non-determinism is contained within individual states; the orchestration itself is deterministic, auditable in X-Ray, and replayable. Lambda executes the tools; Bedrock or SageMaker serves the model.
Option D — Hybrid Bedrock + Step Functions: Bedrock Agents handles the inner ReAct loop while Step Functions orchestrates the outer business flow — input validation, context enrichment, agent invocation, post-processing, audit logging. The agent is a controlled black box, not the primary orchestrator.
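To make Option C concrete, here is a sketch of a bounded agent loop expressed in Amazon States Language and built as a Python dict. Several things are assumptions on my part: the state names, the placeholder Lambda ARNs, the convention that the model Lambda returns an `action` field, and the choice to wrap the model call in a Lambda rather than call Bedrock from the state directly. The execution input is assumed to be a JSON object.

```python
import json
from typing import Any, Dict, List


def build_agent_state_machine(model_fn_arn: str, tool_fn_arn: str, max_steps: int = 8) -> Dict[str, Any]:
    """Return an ASL definition for a reason/act loop with a hard step ceiling."""
    retry = [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
        "JitterStrategy": "FULL",
    }]

    def lambda_task(fn_arn: str, result_path: str, next_state: str) -> Dict[str, Any]:
        return {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": fn_arn, "Payload.$": "$"},
            "ResultSelector": {"body.$": "$.Payload"},
            "ResultPath": result_path,
            "TimeoutSeconds": 30,
            "Retry": retry,
            "Next": next_state,
        }

    return {
        "Comment": "Bounded agent loop (illustrative sketch)",
        "StartAt": "InitLoop",
        "States": {
            "InitLoop": {"Type": "Pass", "Result": {"step": 0},
                         "ResultPath": "$.loop", "Next": "InvokeModel"},
            "InvokeModel": lambda_task(model_fn_arn, "$.model", "CountStep"),
            "CountStep": {"Type": "Pass",
                          "Parameters": {"step.$": "States.MathAdd($.loop.step, 1)"},
                          "ResultPath": "$.loop", "Next": "Route"},
            "Route": {"Type": "Choice",
                      "Choices": [
                          {"Variable": "$.model.body.action", "StringEquals": "final", "Next": "Done"},
                          {"Variable": "$.loop.step", "NumericGreaterThanEquals": max_steps,
                           "Next": "StepLimitReached"},
                      ],
                      "Default": "CallTool"},
            "CallTool": lambda_task(tool_fn_arn, "$.tool", "InvokeModel"),
            "Done": {"Type": "Succeed"},
            "StepLimitReached": {"Type": "Fail", "Error": "StepLimitReached",
                                 "Cause": "Agent did not converge within the step ceiling"},
        },
    }


def dangling_targets(definition: Dict[str, Any]) -> List[str]:
    """List transition targets that do not exist as states (should be empty)."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        targets.update(t for t in (state.get("Next"), state.get("Default")) if t)
        targets.update(choice["Next"] for choice in state.get("Choices", []))
    return sorted(targets - set(states))


definition = build_agent_state_machine(
    "arn:aws:lambda:us-east-1:111122223333:function:model-step",   # placeholder ARN
    "arn:aws:lambda:us-east-1:111122223333:function:tool-step",    # placeholder ARN
)
asl_json = json.dumps(definition, indent=2)  # string to supply as the state machine definition
```

The model's decision only influences which branch `Route` takes; the step counter, retries, timeouts, and the failure state are all declared outside the model's reach.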
Comparative Map: Four Agentic RAG Architectures on AWS
Each group below lists the components of one architecture, starting with the component that hosts or drives the orchestration.

Option A (Native Bedrock Agents):
- Bedrock Agent · ReAct loop (managed)
- Knowledge Base · OpenSearch Serverless
- Action Groups · Lambda (OpenAPI)
- Guardrails · + Session Memory

Option B (LangGraph + EKS):
- LangGraph Pod · EKS (Fargate/EC2)
- OpenSearch Service · Provisioned + kNN
- DynamoDB · State checkpoint
- Bedrock / SageMaker · Model endpoint

Option C (Step Functions-orchestrated):
- Step Functions · Express Workflow
- Lambda · Tool executor
- Bedrock InvokeModel · per-step call
- X-Ray + CloudWatch · Full trace per step

Option D (Hybrid Bedrock + Step Functions):
- Step Functions · Outer orchestrator
- Bedrock Agent · Inner ReAct loop
- S3 + CloudWatch · Audit log sink
Critical Dimensions: Where Each Architecture Wins and Where It Bleeds
Regulatory auditability is where Step Functions (Option C) has a structural advantage. Every state transition can be captured with its input, output, duration, and errors: Express Workflows write their execution history to CloudWatch Logs when the log level is ALL and execution data is included, and X-Ray adds a per-state timing trace. In a BACEN audit, I can reconstruct exactly what the system decided at each reasoning step for any historical session, as long as those logs are retained. Bedrock Agents (Option A) provides CloudTrail for API calls and invocation logs in CloudWatch, but the internal reasoning loop is opaque by default: you see the input and final output, not the intermediate reasoning steps, unless you explicitly enable trace via enableTrace: true in the InvokeAgent call.
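As an illustration of the trace opt-in for Option A, the sketch below calls InvokeAgent through boto3 with `enableTrace=True` and separates the answer chunks from the rationale text in the orchestration trace. The function names and the default region are my own choices; the agent and alias IDs must come from your account. boto3 is imported lazily so the parser can be exercised without AWS access.

```python
from typing import Any, Dict, Iterable, List, Tuple


def collect_answer_and_rationales(events: Iterable[Dict[str, Any]]) -> Tuple[str, List[str]]:
    """Split an InvokeAgent event stream into the final answer and rationale texts."""
    answer_parts: List[str] = []
    rationales: List[str] = []
    for event in events:
        if "chunk" in event:
            answer_parts.append(event["chunk"]["bytes"].decode("utf-8"))
        elif "trace" in event:
            orchestration = event["trace"].get("trace", {}).get("orchestrationTrace", {})
            text = orchestration.get("rationale", {}).get("text")
            if text:
                rationales.append(text)
    return "".join(answer_parts), rationales


def invoke_with_trace(agent_id: str, agent_alias_id: str, session_id: str,
                      question: str, region: str = "us-east-1") -> Tuple[str, List[str]]:
    """Call the agent with tracing enabled and return (answer, rationales)."""
    import boto3  # lazy import: only needed when actually calling AWS

    client = boto3.client("bedrock-agent-runtime", region_name=region)
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        inputText=question,
        enableTrace=True,  # without this, no trace events are emitted
    )
    return collect_answer_and_rationales(response["completion"])
```

Rationale text can contain customer data, so whatever sink receives it needs the same masking and retention rules as the prompts themselves.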
Inference cost is the second differentiator. In Bedrock Agents, each reasoning step consumes tokens from the system prompt + conversation history + retrieved context + response. For Claude 3 Sonnet (us-east-1), that means roughly $3/1M input tokens and $15/1M output tokens. An agent with 5 reasoning steps and average context of 4k tokens per step costs ~$0.08 per session. In Step Functions Express, orchestration is billed at roughly $1 per 1M workflow requests plus a small duration charge, which is practically zero next to token spend, but you still pay for Bedrock tokens. The real difference is that Step Functions lets you terminate the loop early with deterministic evaluation logic, reducing unnecessary steps.
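The per-session figure can be reproduced with simple arithmetic. In the sketch below, the prices and the 5 steps of 4k input tokens come from the paragraph above; the 270 output tokens per step is my assumption, chosen only to show one way the ~$0.08 estimate can be reached.

```python
def session_cost_usd(
    steps: int,
    input_tokens_per_step: int,
    output_tokens_per_step: int,
    input_price_per_million: float = 3.0,    # USD per 1M input tokens
    output_price_per_million: float = 15.0,  # USD per 1M output tokens
) -> float:
    """Token cost of one agent session, ignoring orchestration charges."""
    input_cost = steps * input_tokens_per_step * input_price_per_million / 1_000_000
    output_cost = steps * output_tokens_per_step * output_price_per_million / 1_000_000
    return input_cost + output_cost


full_session = session_cost_usd(steps=5, input_tokens_per_step=4000, output_tokens_per_step=270)
early_exit = session_cost_usd(steps=3, input_tokens_per_step=4000, output_tokens_per_step=270)
print(f"5 steps: ${full_session:.3f}")  # prints: 5 steps: $0.080
print(f"3 steps: ${early_exit:.3f}")    # prints: 3 steps: $0.048
```

Input tokens dominate (about $0.06 of the $0.08), which is why cutting two steps through early termination saves roughly 40% per session under these assumptions.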
P99 latency is where LangGraph+EKS (Option B) can surprise negatively. Pod cold starts, state serialization overhead in DynamoDB, and network latency to provisioned OpenSearch Service can push P99 to 8–14 seconds under complex workloads. Bedrock Agents, being managed serverless, has a more consistent P99 of around 5–8 seconds for 5 steps, but offers no tuning control.
Technical Comparison: Four Agentic RAG Architectures
| Dimension | A — Bedrock Agents | B — LangGraph + EKS | C — Step Functions | D — Hybrid |
|---|---|---|---|---|
| Agent loop control | AWS-managed (grey box) | Full (Python code) | Full (declarative states) | Partial (outer controlled) |
| Regulatory auditability | Medium (trace opt-in) | High (custom logging) | Very high (native X-Ray) | High (SF trace + CloudTrail) |
| P50 latency (5 steps) | ~3–5s | ~2–4s (warm pods) | ~3–6s | ~4–7s |
| P99 latency (5 steps) | ~5–8s | ~8–14s (cold start) | ~6–10s | ~7–12s |
| Orchestration cost (excl. tokens) | Included in Bedrock | EKS node hours (~$0.10–0.20/h) | ~$1 per 1M Express requests + duration | SF + Bedrock overhead |
| Guardrails and security | Native (Bedrock Guardrails) | Custom (code + WAF) | Custom (Lambda + WAF) | Bedrock + SF validation |
| Operational complexity | Low | High | Medium | Medium-high |
| Portability (multi-cloud / on-prem) | Low (Bedrock lock-in) | High (framework-agnostic) | Low (AWS lock-in) | Low (AWS lock-in) |
| Hybrid retrieval support (sparse+dense) | Limited (managed KB) | Full (custom pipeline) | Full (via Lambda) | Partial (KB + Lambda) |
| Time-to-production (average team) | 2–4 weeks | 8–16 weeks | 4–8 weeks | 5–10 weeks |
Security and Governance: What the Comparison Table Doesn't Fully Capture
In regulated financial environments, the attack surface of a RAG agent is qualitatively different from a REST API. The agent can be induced via prompt injection to exfiltrate context data, invoke unauthorized tools, or bypass guardrails. Each architecture has a distinct risk profile.
In Bedrock Agents, native Guardrails offers configurable content filters with harm categories (HATE, INSULTS, SEXUAL, VIOLENCE, MISCONDUCT, PROMPT_ATTACK) and word filters with custom lists, configured through the CreateGuardrail API with contentPolicyConfig and wordPolicyConfig. The problem is that Guardrails evaluates the user input and the final output, not the intermediate reasoning. An injection that arrives through retrieved context or a tool result and manipulates the planning step can go undetected if the final output appears benign.
In LangGraph+EKS, security responsibility is entirely the team's. That means implementing: (1) input sanitization before any model call, (2) IAM roles with least privilege per graph node, so that a retrieval node has no permission to invoke payment APIs, (3) a customer managed KMS key for state encryption in DynamoDB (SSE type KMS), with a per-tenant key in multi-tenant environments, and (4) VPC endpoints for OpenSearch and Bedrock to eliminate public internet traffic.
In Step Functions, the state separation creates a natural opportunity to insert deterministic validation between steps — a ValidateToolOutput state that checks schema, range, and permissions before passing the result to the next reasoning step. This is difficult to do reliably in Bedrock Agents without extensive Action Group customization. For environments with LGPD/GDPR requirements, Step Functions also facilitates data residency controls via IAM condition aws:RequestedRegion on each service invocation.
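A deterministic check of this kind could look like the sketch below, which would run in the Lambda behind a ValidateToolOutput state. Everything here is illustrative: the `ToolSpec` structure, the `credit_limit_lookup` tool, its fields, its range, and the permission string are hypothetical and stand in for whatever contract your tools actually expose.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Set, Tuple


@dataclass(frozen=True)
class ToolSpec:
    required_fields: Dict[str, Any]            # field name -> type or tuple of types
    ranges: Dict[str, Tuple[float, float]]     # field name -> (min, max), inclusive
    required_permission: str


# Hypothetical registry: one entry per tool the agent is allowed to call.
TOOL_SPECS: Dict[str, ToolSpec] = {
    "credit_limit_lookup": ToolSpec(
        required_fields={"customer_id": str, "limit_brl": (int, float)},
        ranges={"limit_brl": (0.0, 1_000_000_000.0)},
        required_permission="credit:read",
    ),
}


def validate_tool_output(tool_name: str, output: Dict[str, Any],
                         caller_permissions: Set[str]) -> List[str]:
    """Return a list of violations; an empty list means the output may proceed."""
    spec = TOOL_SPECS.get(tool_name)
    if spec is None:
        return [f"unknown tool: {tool_name}"]

    errors: List[str] = []
    # Permission check: the session must be entitled to this tool's data.
    if spec.required_permission not in caller_permissions:
        errors.append(f"missing permission: {spec.required_permission}")
    # Schema check: required fields present with the expected types.
    for name, expected_type in spec.required_fields.items():
        if name not in output:
            errors.append(f"missing field: {name}")
        elif not isinstance(output[name], expected_type):
            errors.append(f"wrong type for {name}")
    # Range check: numeric values inside the allowed envelope.
    for name, (low, high) in spec.ranges.items():
        value = output.get(name)
        if isinstance(value, (int, float)) and not low <= value <= high:
            errors.append(f"out of range: {name}")
    return errors
```

In the workflow, a non-empty result would be raised as a task error so that a Catch clause routes the session to a rejection path before the next reasoning step ever sees the tool output.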
Decision Matrix: Which Architecture for Which Context
A — Native Bedrock Agents
- Lowest time-to-production (2–4 weeks)
- Managed guardrails and session memory
- Native integration with Knowledge Bases and OpenSearch Serverless
- No agent infrastructure operational overhead
- Opaque reasoning loop — limited auditability without explicit trace
- Hybrid retrieval (BM25 + dense) limited to what the managed KB exposes
- Strong Bedrock lock-in; costly migration
- Limited control over per-tool retry and backoff policy
Ideal for MVPs and teams without distributed orchestration expertise. Not recommended for Tier-1 regulatory audit environments.
B — LangGraph + EKS
- Full control over state graph — unit-testable
- Support for custom hybrid retrieval and re-ranking
- Portability: can run on any cloud or on-prem
- Persistent state checkpointing for long-running sessions
- High operational complexity: EKS, HPA, cold starts, dependency management
- Time-to-production 3–4x longer than Bedrock Agents
- Security and guardrails are entirely the team's responsibility
- P99 degraded by cold starts without configured warm pool
Right for AI platforms with mature ML engineering teams needing portability and granular control. Overkill for most financial use cases.
C — Step Functions-Orchestrated
- Maximum auditability: every step traced in X-Ray with full payload
- Deterministic orchestration with LLM non-determinism contained
- Virtually zero orchestration cost (about $1 per 1M Express requests, plus duration)
- Native retry with jitter, per-state timeout, and transactional compensation
- ASL verbosity for complex loops — non-trivial maintenance
- 256 KB limit per state payload (mitigable with S3 reference pattern)
- No native session memory — must be implemented externally
- Dynamic agent loop requires Map state or recursion — complex to model
Best choice for agent flows with known maximum step count and Tier-1 regulatory audit requirements. My primary recommendation for banks and brokerages.
D — Hybrid Bedrock + Step Functions
- Combines Bedrock development speed with SF flow control
- Business flow audit in SF; internal reasoning in Bedrock trace
- Easy to add deterministic pre/post-processing around the agent
- Two orchestration systems to operate and debug
- Bedrock Agent inner loop still partially opaque
- Cumulative cost: Bedrock overhead + SF transitions
Good compromise for teams already using Bedrock Agents that need to add governance without rewriting everything. Not the ideal greenfield choice.
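The S3 reference pattern listed as the mitigation for Option C's payload limit is a claim check: large payloads go to S3 and only a pointer travels through the workflow. The sketch below is one way to write it. The envelope shape (`inline` / `s3_ref`), the function names, and the injected `put_object` / `get_object` callables are my own conventions; `max_inline_bytes` is deliberately a required parameter and should be set safely below the payload quota that applies to your workflow.

```python
import json
from typing import Any, Callable, Dict


def offload_if_large(payload: Any, max_inline_bytes: int,
                     put_object: Callable[[str, str, bytes], None],
                     bucket: str, key: str) -> Dict[str, Any]:
    """Return the payload inline if small enough, otherwise store it and return a reference."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) <= max_inline_bytes:
        return {"inline": payload}
    put_object(bucket, key, body)
    return {"s3_ref": {"bucket": bucket, "key": key, "bytes": len(body)}}


def resolve(envelope: Dict[str, Any], get_object: Callable[[str, str], bytes]) -> Any:
    """Inverse of offload_if_large: fetch the payload when the envelope is a reference."""
    if "inline" in envelope:
        return envelope["inline"]
    ref = envelope["s3_ref"]
    return json.loads(get_object(ref["bucket"], ref["key"]).decode("utf-8"))


def s3_put(bucket: str, key: str, body: bytes) -> None:
    """boto3-backed writer; encrypts at rest with the bucket's KMS configuration."""
    import boto3  # lazy import: only needed when talking to S3

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body,
                                  ServerSideEncryption="aws:kms")


def s3_get(bucket: str, key: str) -> bytes:
    """boto3-backed reader matching s3_put."""
    import boto3

    return boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
```

A tool Lambda would call `offload_if_large(result, limit, s3_put, bucket, key)` before returning, and the next state's Lambda would call `resolve(envelope, s3_get)`. Because the reference still appears in the execution history, the audit trail stays intact as long as the objects are retained as long as the logs.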
Observability and SLOs: What to Monitor in Each Architecture
Agentic RAG breaks traditional API SLOs because latency is a function of the number of reasoning steps, which is non-deterministic. Defining a P99 < 5s SLO for an agent with up to 8 steps is mathematically impossible without explicit termination control.
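The arithmetic behind that claim is short. The sketch below uses the 0.8–1.5 s per-step orchestration latency quoted earlier for Bedrock Agents with Claude 3 Sonnet as its default; it counts reasoning steps only and ignores retrieval and tool time, which would only make the bounds worse. The function names are illustrative.

```python
import math
from typing import Tuple


def latency_bounds(steps: int, per_step_min_s: float = 0.8,
                   per_step_max_s: float = 1.5) -> Tuple[float, float]:
    """Best-case and worst-case total latency for a given number of reasoning steps."""
    return steps * per_step_min_s, steps * per_step_max_s


def max_steps_for_budget(budget_s: float, per_step_max_s: float = 1.5) -> int:
    """Largest step ceiling that still fits the budget when every step is slow."""
    return math.floor(budget_s / per_step_max_s)


best, worst = latency_bounds(8)
print(f"8 steps: {best:.1f}s to {worst:.1f}s")  # prints: 8 steps: 6.4s to 12.0s
print(max_steps_for_budget(5.0))                # prints: 3
```

Even the best case for 8 steps exceeds 5 s, so a 5 s target only becomes defensible with a ceiling of about 3 steps, or with a looser SLO expressed per step rather than per session.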
For Bedrock Agents, the primary observability signals are: InvokeAgent duration in CloudWatch (metric InvocationLatency), step count via trace parsing (field orchestrationTrace.rationale), and the ThrottlingException rate, which indicates pressure on the account TPS limit (default: 10 TPS per model per region, increasable via Service Quotas). A realistic SLO is P95 < 8s with a 0.5% error budget for throttling.
For Step Functions, the service emits native execution-level metrics: ExecutionTime, ExecutionsFailed, ExecutionsTimedOut. With OpenTelemetry, I can instrument each tool's Lambda with spans that propagate the Step Functions traceId, creating an end-to-end trace tree in X-Ray or Datadog. The most useful SLO here is steps per session: if P95 > 6 steps, that's a signal that the system prompt or retrieved documents are low quality, not that the system is slow.
For LangGraph+EKS, the pattern that works is the OpenTelemetry SDK with LangChain auto-instrumentation, exporting to ADOT Collector on EKS and then to CloudWatch EMF or Datadog. Critical metrics: llm.token.usage per graph node (for per-step cost control), retrieval.hit_rate (relevant documents / total retrieved), and tool.error_rate per tool name. A retrieval.hit_rate < 0.6 in production is an alert for vector index degradation — likely embedding drift or a stale index.
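The two quality signals discussed in this section, P95 steps per session and retrieval.hit_rate, are easy to compute once the raw counts are exported. The sketch below is a plain-Python illustration with hypothetical function names; the thresholds (6 steps, 0.6 hit rate) are the ones stated above, and the percentile uses the nearest-rank method, which is one of several valid definitions.

```python
import math
from typing import List, Sequence


def hit_rate(relevant: int, retrieved: int) -> float:
    """retrieval.hit_rate = relevant documents / total retrieved."""
    return relevant / retrieved if retrieved else 0.0


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_alerts(steps_per_session: Sequence[int], relevant: int, retrieved: int,
               max_p95_steps: int = 6, min_hit_rate: float = 0.6) -> List[str]:
    """Return the names of the quality signals that breached their threshold."""
    alerts: List[str] = []
    if percentile_nearest_rank(steps_per_session, 95) > max_p95_steps:
        alerts.append("p95_steps_per_session")
    if hit_rate(relevant, retrieved) < min_hit_rate:
        alerts.append("retrieval_hit_rate")
    return alerts


print(slo_alerts([3, 4, 4, 5, 5, 5, 6, 7, 8, 9], relevant=11, retrieved=20))
# prints: ['p95_steps_per_session', 'retrieval_hit_rate']
```

Judging relevance is the hard part and is outside this sketch: `relevant` has to come from labeled samples, user feedback, or an evaluation job.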
The Flexibility Paradox in Financial Agents
The architecture that gives the agent the most flexibility (LangGraph with an open graph) is precisely the one that most complicates regulatory audit. In finance, the goal is not to maximize agent autonomy — it is to maximize behavioral predictability within an envelope of authorized actions. This inverts the intuition of those coming from the AI research world: you want the most constrained agent that still solves the problem, not the most capable one. Step Functions enforces that constraint structurally.
Critical Anti-Patterns in Financial Agentic RAG
- No step limit: Agents without a configured maxIterations can enter infinite loops consuming tokens indefinitely. Always set a ceiling; 8 steps is reasonable for most financial use cases.
- Overpermissioned tools: Action Groups or Lambda tools with IAM policies using * on resources. Each tool should have a dedicated IAM role with minimum permissions and aws:ResourceTag conditions for per-tenant isolation.
- RAG context without metadata filtering: Retrieving documents without filtering by tenantId, classification_level, or effective_date in the OpenSearch query. In multi-tenant environments, this is a data leakage vector between clients.
- Logging prompts with PII: Enabling full trace in Bedrock or logging LangGraph payloads without masking CPF, bank account, and position data. This violates LGPD and creates audit risk.
- Static embeddings in production: Indexing documents once and never reindexing. Embedding drift when switching embedding models silently invalidates retrieval quality; monitor retrieval.hit_rate and reindex when switching models.
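For the metadata-filtering anti-pattern, the fix is to make the filter part of the query builder so an unfiltered search cannot be issued by accident. The sketch below builds an OpenSearch k-NN query body with a filter on the three fields named in the list above. The vector field name `embedding` and the function name are assumptions, and filtering inside the `knn` clause depends on the engine configured for the index, so treat this as a starting point to check against your cluster.

```python
from typing import Any, Dict, Sequence


def build_filtered_knn_query(
    vector: Sequence[float],
    tenant_id: str,
    allowed_levels: Sequence[str],
    as_of_date: str,                 # ISO date, e.g. "2024-06-28"
    k: int = 5,
    vector_field: str = "embedding", # assumed field name in the index mapping
) -> Dict[str, Any]:
    """Build a k-NN query that only considers one tenant's permitted, in-force documents."""
    if not tenant_id:
        raise ValueError("tenant_id is required; refusing to build an unfiltered query")
    if not allowed_levels:
        raise ValueError("at least one classification level is required")

    metadata_filter = {
        "bool": {
            "filter": [
                {"term": {"tenantId": tenant_id}},
                {"terms": {"classification_level": list(allowed_levels)}},
                {"range": {"effective_date": {"lte": as_of_date}}},
            ]
        }
    }
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {
                    "vector": list(vector),
                    "k": k,
                    "filter": metadata_filter,
                }
            }
        },
    }


query = build_filtered_knn_query([0.1, 0.2, 0.3], "tenant-42", ["public", "internal"], "2024-06-28")
```

The tenant ID should come from the authenticated session, never from the model's output or the user's prompt; otherwise the filter itself becomes an injection target.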
In every regulated financial environment I have architected, Option C — Step Functions with Bedrock InvokeModel per state — is the correct starting point, not because it is the most elegant, but because it is the most auditable and the easiest to explain to a compliance team that has never seen an AI agent. The hard-won lesson is that the AI adoption battle in finance is not technical — it is about institutional trust. A Step Functions flow with named states (EvaluateCreditQuery, RetrieveRegulatoryDocs, ValidateToolOutput) that an auditor can read in the AWS console is worth more than a perfectly optimized LangGraph loop that only the ML team understands. Once the business and governance are comfortable, then you evolve to the hybrid or to LangGraph — but start with what you can defend in a risk committee meeting.
Verdict: Step Functions-Orchestrated Is the Financially Responsible Choice
For financial platforms with serious regulatory requirements (BACEN, CVM, SEC, LGPD), Option C (Step Functions-orchestrated) is the primary recommendation. It delivers structural auditability that none of the other options offer natively, negligible orchestration cost, deterministic per-step retry and timeout, and a mental model that compliance and engineering teams can share. The 256 KB payload limit is the only real obstacle, and the S3 reference pattern solves it in less than a sprint. Native Bedrock Agents (Option A) is the right choice for MVPs, proofs of concept, and internal use cases where Tier-1 auditability is not a requirement. Do not dismiss it: it has the best time-to-production and the lowest operational overhead. LangGraph+EKS (Option B) only makes sense if you have an AI platform with a dedicated ML engineering team, need real portability (multi-cloud or on-prem), and are willing to invest 3–4x more engineering time. For most Brazilian banks and brokerages, that investment is not justified.