Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

The AI Architect Track

Module 4 · Architecture on AWS · Lesson 19/22

Current grounding and AI FinOps

Keeping the agent current with web search — and keeping the AI bill cheap and predictable.

7 min read


An agent that only knows what was in its training set is an agent living in the past — and it charges you full price for that. In this lesson we close Module 4 by joining two practical problems every architect faces when putting AI into production: how to keep the agent factually current without exploding the context window, and how to keep the AWS bill below a number your manager approves without blinking.

Grounding: why the model needs today's facts

LLMs have a training cutoff. Everything that happened after that — new AWS services, price changes, articles published yesterday — simply does not exist for the model. The practical result is staleness-driven hallucination: the model answers confidently about something that has changed.

Grounding is the practice of injecting verifiable, current facts into the context before generation. The most direct form is web search: the agent issues a query, receives real snippets from pages indexed today, and uses those snippets as the basis for the answer. This does not eliminate hallucination, but it changes the error source: instead of inventing from training memory, the model now works from cited snippets, so a wrong answer can be traced back to a bad source or a misread passage.

In Bedrock AgentCore, web search is a native tool the agent can invoke inside the ReAct loop (we saw that loop in Lesson 11). The flow is simple: the model decides it needs current information, calls the search tool, receives results as structured text, and incorporates them into its reasoning before responding. The additional cost is real — each search call adds latency and can increase the number of tokens processed — but the reliability gain usually justifies it, especially in fast-moving domains: prices, regulations, product releases.
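The flow described above can be sketched in plain Python. This is a minimal illustration of the control flow only, not the AgentCore API: `fake_search`, `fake_model`, and the `SEARCH:` reply convention are hypothetical stand-ins so the loop can run without AWS credentials.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    url: str
    text: str


def fake_search(query: str) -> list[Snippet]:
    """Hypothetical stand-in for a web search tool; returns one canned snippet."""
    return [Snippet(url="https://example.com/pricing", text=f"Current facts about {query}")]


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for the LLM.

    It asks for a search while the prompt holds no sources, and answers
    with a citation once snippets are present.
    """
    if "SOURCES:" not in prompt:
        return "SEARCH: bedrock pricing"
    return "Answer based on [1]"


def grounded_answer(question: str, max_steps: int = 3) -> str:
    """Tiny reason/act loop: the model may request a search before answering."""
    context = f"QUESTION: {question}"
    for _ in range(max_steps):
        reply = fake_model(context)
        if not reply.startswith("SEARCH:"):
            return reply  # the model produced a final answer
        query = reply.removeprefix("SEARCH:").strip()
        snippets = fake_search(query)
        # Inject numbered snippets so the final answer can cite them.
        sources = "\n".join(
            f"[{i}] {s.url}: {s.text}" for i, s in enumerate(snippets, 1)
        )
        context = f"{context}\nSOURCES:\n{sources}"
    return "No answer within the step limit"
```

Each pass through the loop is one extra model call plus the snippet tokens added to the context, which is where the latency and token cost mentioned above come from.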

I use this pattern on my own site: articles and case studies are generated with an agent that performs web search in Bedrock AgentCore before drafting any factual claim. The result is content grounded in real sources, not training memory.

Figure: Grounding flow with web search in Bedrock AgentCore

The agent decides when it needs current facts, calls the search tool, receives real snippets, and injects them into context before generating the final answer.

User / Client
User: factual question

AWS: Bedrock AgentCore
Orchestrator: ReAct loop
LLM (e.g., Claude): reasoning + generation
Web Search Tool: Bedrock AgentCore
Knowledge Base: internal data (RAG)
Prompt Cache: reuses prefix

Web / External
Web Index: current snippets

In practice: grounding on my site


In practice, I use web search as the first step of any technical article I publish. The agent performs the search, cites sources from the snippets, and only then drafts. The extra cost per search call is real, but marginal compared to the cost of publishing something factually wrong. My advice: enable grounding for any claim that changes over time — prices, API versions, benchmarks — and disable it for static content where internal RAG already covers it.
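The advice above amounts to a small policy that can be made explicit in code. The sketch below is illustrative: the `TIME_SENSITIVE` labels and the `should_ground` helper are hypothetical names, and the fallback for static content that internal RAG does not cover (search anyway) is an assumption, not something the lesson prescribes.

```python
# Kinds of claims assumed to change over time (prices, API versions, benchmarks).
TIME_SENSITIVE = {"price", "api_version", "benchmark"}


def should_ground(claim_kind: str, covered_by_internal_rag: bool) -> bool:
    """Return True when a web search is worth its latency and token cost."""
    if claim_kind in TIME_SENSITIVE:
        return True
    # Static content already covered by internal RAG skips the search.
    # Assumption: static content with no internal coverage still searches.
    return not covered_by_internal_rag
```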

AI FinOps: the levers that actually move the bill

Generative AI charges per token — input and output separately, and the price varies widely between models. This changes the cost logic compared to traditional workloads: there is no idle infrastructure cost in Bedrock's serverless model, but there is cost from unnecessary context, oversized models, and missing cache.

Model selection by task is the highest-impact lever. A large, expensive model (e.g., Claude Sonnet) makes sense for complex reasoning, synthesis, and creative generation. For classification, field extraction, or intent routing, a smaller, cheaper model (e.g., Claude Haiku, Titan Text Lite) usually delivers comparable results at a fraction of the cost. The practical rule: use the smallest model that passes your evals (Lesson 09).
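A routing rule like this can be as simple as a lookup. The sketch below uses hypothetical tier names and made-up per-1K-token prices purely to show the shape of the decision; real prices come from the Bedrock pricing page, and the task list should come from your own evals.

```python
# Hypothetical tiers and per-1K-input-token prices, for illustration only.
PRICE_PER_1K_INPUT = {"small": 0.001, "large": 0.01}

# Task types a small model is assumed to handle once it has passed evals.
SMALL_MODEL_TASKS = {"classification", "extraction", "routing"}


def pick_tier(task_type: str) -> str:
    """Send well-defined tasks to the small tier, everything else to the large one."""
    return "small" if task_type in SMALL_MODEL_TASKS else "large"


def input_cost(task_type: str, input_tokens: int) -> float:
    """Input cost of one call under the placeholder prices above."""
    tier = pick_tier(task_type)
    return input_tokens / 1000 * PRICE_PER_1K_INPUT[tier]
```

With these placeholder prices, a 2,000-token routing call costs one tenth of the same call sent to the large tier, which is the kind of ratio that makes per-task selection worth the eval effort.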

Prompt cache is the second lever. If your system prompt is 2,000 tokens and you make 10,000 calls per day, you are paying for 20 million input tokens that are identical. Bedrock supports prefix caching: cached tokens cost significantly less than normally processed tokens. Structure your prompts by placing the static part (instructions, fixed context) at the beginning — that is what gets cached.
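The arithmetic above is easy to check. In this sketch the price per 1K tokens and the 90% discount on cached reads are placeholders, not Bedrock's actual rates, and the calculation deliberately ignores cache-write charges and cache expiry, which a real estimate would need to include.

```python
def repeated_prefix_tokens(prefix_tokens: int, calls_per_day: int) -> int:
    """Identical input tokens sent per day for a static prefix."""
    return prefix_tokens * calls_per_day


def daily_prefix_cost(
    prefix_tokens: int,
    calls_per_day: int,
    price_per_1k: float,
    cached_read_discount: float = 0.0,
) -> float:
    """Daily cost of the static prefix.

    cached_read_discount is the fraction saved on cached tokens
    (0.0 means no caching). Cache writes and expiry are not modeled.
    """
    total = repeated_prefix_tokens(prefix_tokens, calls_per_day)
    return total / 1000 * price_per_1k * (1 - cached_read_discount)


# The lesson's example: a 2,000-token system prompt, 10,000 calls per day.
tokens = repeated_prefix_tokens(2_000, 10_000)            # 20 million tokens
uncached = daily_prefix_cost(2_000, 10_000, 0.003)        # placeholder price
cached = daily_prefix_cost(2_000, 10_000, 0.003, 0.9)     # placeholder discount
```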

Token limits and budgets close the loop. Set cost budgets with alert thresholds in AWS Budgets, and create CloudWatch alarms on Bedrock token metrics so that unusual throughput is noticed early. In Bedrock AgentCore, you can define iteration limits in the agent loop. An agent that loops due to a bug can generate hundreds of calls in seconds, so iteration limits are both a safety measure and a FinOps measure.
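Independent of where the limit is configured, the guard itself is simple. The sketch below is a framework-agnostic illustration with hypothetical names (`RunBudget`, `BudgetExceeded`); it is not an AgentCore setting, only the logic such a limit enforces.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run goes past its iteration or token limit."""


class RunBudget:
    """Tracks iterations and tokens for a single agent run."""

    def __init__(self, max_iterations: int, max_tokens: int) -> None:
        self.max_iterations = max_iterations
        self.max_tokens = max_tokens
        self.iterations = 0
        self.tokens = 0

    def charge(self, tokens_used: int) -> None:
        """Record one loop iteration; stop the run if a limit is crossed."""
        self.iterations += 1
        self.tokens += tokens_used
        if self.iterations > self.max_iterations:
            raise BudgetExceeded(f"iteration limit {self.max_iterations} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")
```

Calling `charge` once per model call inside the agent loop turns a runaway bug into a single caught exception instead of hundreds of billed calls.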

AI FinOps levers: impact and effort

Lever	Typical cost reduction	Implementation effort	When to apply
Smaller model per task	High (50–90% per call)	Medium (requires per-task evals)	Repetitive, well-defined tasks
Prompt cache (static prefix)	Medium-high (depends on volume)	Low (restructure the prompt)	Long system prompts + high volume
Limit context / window	Medium (direct input tokens)	Low (truncate history and RAG chunks)	Agents with long history
Serverless (no idle instance)	High for sporadic workloads	None (default in Bedrock)	Always (except reserved throughput)
Agent iteration limit	Protection against runaway cost	Low (configuration in AgentCore)	Every production agent
AWS Budgets + alerts	Does not reduce, but prevents surprises	Low (billing configuration)	Always, from day one

Keeping a real system under a few dollars a month

The question I receive most from developers starting with AI on AWS is: 'how much will it cost?' The honest answer is: it depends on volume, but a well-architected system for personal use or a small team can run for under five dollars a month without any obscure tricks.

The secret lies in three design decisions made early. First, use lightweight models for routing and heavy models only for final generation. An agent that classifies intent with Haiku and only calls Sonnet to draft the final response can have 80% of its calls at penny-level cost. Second, limit RAG chunk size. Large chunks improve retrieval precision marginally but increase input cost linearly — 500 tokens per chunk is a good starting point for most cases. Third, do not store infinite history in context. Implement a sliding window or periodic summarization (we covered memory strategies in Lesson 12).
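The third decision, a sliding window over history, can be sketched in a few lines. The token estimate here (about four characters per token) is a rough assumption for English text, not a real tokenizer, and `sliding_window` is a hypothetical helper name.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate assuming about 4 characters per token (not a tokenizer)."""
    return max(1, len(text) // 4)


def sliding_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages that fit in max_tokens, in original order."""
    kept: list[str] = []
    used = 0
    # Walk backwards from the newest message and stop at the first one
    # that would push the total over the budget.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return kept
```

The same budget idea applies to RAG: with a fixed number of retrieved chunks, input tokens grow in direct proportion to chunk size, so halving the chunk size roughly halves that part of the bill.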

For my site, the monthly AI cost stays below ten dollars processing dozens of articles, because the agent uses Haiku for search and triage, Sonnet only for drafting, and the instruction prompt is cached. It is not magic — it is conscious model choice per pipeline stage.

One final point on Bedrock reserved throughput (Provisioned Throughput, in AWS terminology): it makes financial sense only when you have predictable, high volume. For most early-stage projects, serverless on-demand mode is cheaper and simpler. Reserve throughput once your cost tracking shows that on-demand spend has passed the break-even point.
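The break-even check is a one-line comparison. This sketch is a simplification under stated assumptions: it uses a single blended on-demand price per 1K tokens and a flat monthly cost for dedicated capacity, both placeholders, and it ignores separate input/output pricing and capacity ceilings.

```python
def break_even_tokens_per_month(
    fixed_monthly_cost: float, on_demand_price_per_1k: float
) -> float:
    """Monthly token volume at which on-demand spend equals the fixed cost."""
    if on_demand_price_per_1k <= 0:
        raise ValueError("on_demand_price_per_1k must be positive")
    return fixed_monthly_cost / on_demand_price_per_1k * 1000


def cheaper_option(
    monthly_tokens: int, fixed_monthly_cost: float, on_demand_price_per_1k: float
) -> str:
    """Compare projected on-demand spend with the fixed capacity cost."""
    on_demand = monthly_tokens / 1000 * on_demand_price_per_1k
    return "provisioned" if on_demand > fixed_monthly_cost else "on-demand"
```

With a placeholder fixed cost of 15,000 per month and 0.003 per 1K tokens, the break-even sits around five billion tokens a month, which shows why early-stage projects rarely get there.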


What to remember from this lesson

Grounding via web search injects current facts into context, reducing staleness-driven hallucination — but has latency and token cost.

The highest-impact AI FinOps lever is choosing the smallest model that passes your evals for each task.

Prompt cache reduces the cost of repeated input tokens — structure your system prompt with the static part at the beginning.

Agent iteration limits are both a safety guardrail and cost protection — always configure them in production.

Serverless on-demand in Bedrock eliminates idle cost; reserved throughput only pays off with predictable, high volume.

AWS Budgets + CloudWatch alerts are mandatory from the first deploy — billing surprises are an architecture problem.

Frequently asked questions

Does web search in AgentCore always improve answer quality?

No. If the search query is poorly formulated or the returned snippets come from low-quality sources, the model may cite wrong information with more confidence than if it had not searched at all. Grounding improves quality when the search tool is well configured and the snippets are relevant. Evaluate with evals (Lesson 09) before enabling in production.

Does prompt cache work automatically in Bedrock?

It depends on the model and configuration. Some models in Bedrock support prefix caching, but you need to structure the prompt correctly and, in some cases, enable it explicitly. Check the documentation for the specific model — not all models available in Bedrock support caching in the same way.

What is the risk of using smaller models to save money?

Lower output quality for complex tasks. The real risk is using a small model on a task that requires multi-step reasoning and not noticing the degradation because you have no evals. The mitigation is simple: define evals before switching models, not after.

Is reserved throughput in Bedrock worth it for startups?

Rarely at the start. Reserved throughput means paying for dedicated capacity, usually with a time commitment, whether or not you use it. For most growth-stage startups, the on-demand model is more flexible and cheaper until volume is predictable and stable. Reassess when your monthly on-demand costs are consistently high.

Closing Module 4

Module 4 complete

Grounding and FinOps look like separate topics, but they share the same root: architecture decisions you make before writing the first line of code. Choosing when to fetch external facts, which model to use at each stage, what to cache and what to limit — all of this defines whether your system is reliable and sustainable or an expensive prototype that nobody ships to production. Module 4 covered the fundamental building blocks of AI architecture on AWS: Bedrock, AgentCore, Knowledge Bases, and now cost and currency. In the next module, we look at the most important decision of all: when to use AI and when not to.

