Current grounding and AI FinOps
Keeping the agent current with web search — and keeping the AI bill cheap and predictable.
An agent that only knows what was in its training set is an agent living in the past — and it charges you full price for that. In this lesson we close Module 4 by joining two practical problems every architect faces when putting AI into production: how to keep the agent factually current without exploding the context window, and how to keep the AWS bill below a number your manager approves without blinking.
Grounding: why the model needs today's facts
LLMs have a training cutoff. Everything that happened after that — new AWS services, price changes, articles published yesterday — simply does not exist for the model. The practical result is staleness-driven hallucination: the model answers confidently about something that has changed.
Grounding is the practice of injecting verifiable, current facts into the context before generation. The most direct form is web search: the agent issues a query, receives real snippets from pages indexed today, and uses those snippets as the basis for the answer. This does not eliminate hallucination, but it changes the error source: instead of inventing from memory, the model works from snippets it can cite, so a wrong answer can be traced back to a source.
In Bedrock AgentCore, web search is a native tool the agent can invoke inside the ReAct loop (we saw that loop in Lesson 11). The flow is simple: the model decides it needs current information, calls the search tool, receives results as structured text, and incorporates them into its reasoning before responding. The additional cost is real — each search call adds latency and can increase the number of tokens processed — but the reliability gain usually justifies it, especially in fast-moving domains: prices, regulations, product releases.
I use this pattern on my own site: articles and case studies are generated with an agent that performs web search in Bedrock AgentCore before drafting any factual claim. The result is content grounded in real sources, not training memory.
Grounding flow with web search in Bedrock AgentCore
The agent decides when it needs current facts, calls the search tool, receives real snippets, and injects them into context before generating the final answer.
- User · Factual question
- Orchestrator · ReAct loop
- LLM (e.g., Claude) · Reasoning + generation
- Web Search Tool · Bedrock AgentCore
- Knowledge Base · Internal data (RAG)
- Prompt Cache · Reuses the prefix
- Web Index · Current snippets
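The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the AgentCore API: `search` and `generate` are hypothetical callables standing in for the web search tool and the LLM call, and `Snippet`, `build_grounded_prompt`, and `answer_with_grounding` are names invented for this example.

```python
from typing import Callable, List, NamedTuple


class Snippet(NamedTuple):
    url: str
    text: str


def build_grounded_prompt(question: str, snippets: List[Snippet]) -> str:
    """Put numbered, current snippets ahead of the question so the model can cite them."""
    sources = "\n".join(
        f"[{i}] {s.url}\n{s.text}" for i, s in enumerate(snippets, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n].\n"
        "If the sources do not cover the question, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def answer_with_grounding(
    question: str,
    search: Callable[[str], List[Snippet]],  # hypothetical search tool
    generate: Callable[[str], str],          # hypothetical LLM call
    max_snippets: int = 3,
) -> str:
    # Cap the number of snippets so grounding does not inflate input tokens.
    snippets = search(question)[:max_snippets]
    return generate(build_grounded_prompt(question, snippets))
```

Capping `max_snippets` is the point where grounding and cost meet: each extra snippet is extra input on every call.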
In practice, I use web search as the first step of any technical article I publish. The agent performs the search, cites sources from the snippets, and only then drafts. The extra cost per search call is real, but marginal compared to the cost of publishing something factually wrong. My advice: enable grounding for any claim that changes over time — prices, API versions, benchmarks — and disable it for static content where internal RAG already covers it.
AI FinOps: the levers that actually move the bill
Generative AI charges per token — input and output separately, and the price varies widely between models. This changes the cost logic compared to traditional workloads: there is no idle infrastructure cost in Bedrock's serverless model, but there is cost from unnecessary context, oversized models, and missing cache.
Model selection by task is the highest-impact lever. A large, expensive model (e.g., Claude Sonnet) makes sense for complex reasoning, synthesis, and creative generation. For classification, field extraction, or intent routing, a smaller, cheaper model (e.g., Claude Haiku, Titan Text Lite) delivers the same result at a fraction of the cost. The practical rule: use the smallest model that passes your evals (Lesson 09).
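One way to encode "the smallest model that passes your evals" is a small routing table. The sketch below is an assumption-laden illustration: the tier names `small` and `large` are placeholders rather than real model IDs, and `passed_evals` is a hypothetical record of which tasks the small tier has already passed.

```python
# Tier names are placeholders; map them to the model IDs you have validated.
MODEL_BY_TASK = {
    "classification": "small",
    "field_extraction": "small",
    "intent_routing": "small",
    "complex_reasoning": "large",
    "synthesis": "large",
    "creative_generation": "large",
}


def pick_model(task: str, passed_evals: dict[str, set[str]]) -> str:
    """Return the cheapest tier that has passed evals for this task.

    passed_evals maps a tier name to the set of tasks it passed evals on.
    Unknown tasks, and tasks the small tier has not passed, go to "large".
    """
    preferred = MODEL_BY_TASK.get(task, "large")
    if preferred == "small" and task not in passed_evals.get("small", set()):
        return "large"
    return preferred
```

Defaulting to the large tier when evidence is missing keeps quality safe; the savings come only after an eval has justified the downgrade.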
Prompt cache is the second lever. If your system prompt is 2,000 tokens and you make 10,000 calls per day, you are paying for 20 million input tokens that are identical. Bedrock supports prefix caching: cached tokens cost significantly less than normally processed tokens. Structure your prompts by placing the static part (instructions, fixed context) at the beginning — that is what gets cached.
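The arithmetic is worth making explicit. The sketch below reproduces the 20-million-token figure and shows how a cache discount would change the daily cost. The price and the discount ratio are placeholder numbers chosen for illustration, not real Bedrock prices, and the function names are invented for this example.

```python
def daily_prefix_tokens(prefix_tokens: int, calls_per_day: int) -> int:
    """Input tokens spent per day on the repeated static prefix."""
    return prefix_tokens * calls_per_day


def daily_prefix_cost(
    prefix_tokens: int,
    calls_per_day: int,
    price_per_1k_input: float,
    cached_price_ratio: float = 1.0,
) -> float:
    """Daily cost of the prefix; cached_price_ratio=1.0 means no caching.

    Simplification: ignores cache writes and cache expiry, which a real
    estimate should include.
    """
    tokens = daily_prefix_tokens(prefix_tokens, calls_per_day)
    return tokens / 1000 * price_per_1k_input * cached_price_ratio


# The example from the text: a 2,000-token system prompt, 10,000 calls per day.
tokens = daily_prefix_tokens(2_000, 10_000)  # 20,000,000 tokens

# Placeholder price and discount, for illustration only.
uncached = daily_prefix_cost(2_000, 10_000, price_per_1k_input=0.001)
cached = daily_prefix_cost(
    2_000, 10_000, price_per_1k_input=0.001, cached_price_ratio=0.1
)
print(tokens, uncached, cached)
```

Plug in the current price of your model and its documented cache-read rate to get a number you can compare against your budget.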
Token limits and budgets close the loop. Set a cost budget with alerts in AWS Budgets, and add CloudWatch alarms on Bedrock's token usage metrics. In Bedrock AgentCore, you can define iteration limits in the agent loop: an agent that loops due to a bug can generate hundreds of calls in seconds. Iteration limits are both a safety measure and a FinOps measure.
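The same idea can be enforced in application code, independently of any platform setting. The sketch below is not AgentCore configuration; it is a generic guard where `step` is a hypothetical function that runs one reason/act cycle and reports how many tokens it used, and the default limits are arbitrary.

```python
from typing import Callable, Optional, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when the agent loop hits its iteration or token ceiling."""


def run_agent_loop(
    step: Callable[[int], Tuple[Optional[str], int]],
    max_iterations: int = 8,
    max_tokens: int = 50_000,
) -> str:
    """Call step() until it returns a final answer or a limit is hit.

    step(i) returns (final_answer_or_None, tokens_used_in_this_step).
    """
    tokens_spent = 0
    for i in range(max_iterations):
        answer, used = step(i)
        tokens_spent += used
        if answer is not None:
            return answer  # tokens are already spent, so return the result
        if tokens_spent > max_tokens:
            raise BudgetExceeded(f"token budget exceeded after {i + 1} steps")
    raise BudgetExceeded(f"no answer after {max_iterations} iterations")
```

Raising an exception rather than returning a partial answer makes a runaway loop visible in logs and alarms instead of hiding it behind a degraded response.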
AI FinOps levers: impact and effort
| Lever | Typical cost reduction | Implementation effort | When to apply |
|---|---|---|---|
| Smaller model per task | High (50–90% per call) | Medium (requires per-task evals) | Repetitive, well-defined tasks |
| Prompt cache (static prefix) | Medium-high (depends on volume) | Low (restructure the prompt) | Long system prompts + high volume |
| Limit context / window | Medium (direct input tokens) | Low (truncate history and RAG chunks) | Agents with long history |
| Serverless (no idle instance) | High for sporadic workloads | None (default in Bedrock) | Always (except reserved throughput) |
| Agent iteration limit | Protection against runaway cost | Low (configuration in AgentCore) | Every production agent |
| AWS Budgets + alerts | Does not reduce, but prevents surprises | Low (billing configuration) | Always, from day one |
Keeping a real system under a few dollars a month
The question I hear most from developers starting with AI on AWS is "how much will it cost?" The honest answer is that it depends on volume, but a well-architected system for personal use or a small team can run for under five dollars a month without any obscure tricks.
The secret lies in three design decisions made early. First, use lightweight models for routing and heavy models only for final generation. An agent that classifies intent with Haiku and only calls Sonnet to draft the final response can have 80% of its calls at penny-level cost. Second, limit RAG chunk size. Large chunks improve retrieval precision marginally but increase input cost linearly — 500 tokens per chunk is a good starting point for most cases. Third, do not store infinite history in context. Implement a sliding window or periodic summarization (we covered memory strategies in Lesson 12).
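The second and third decisions are easy to express in code. This is a minimal sketch with assumptions stated: the default token counter just splits on whitespace as a rough stand-in for a real tokenizer, and `sliding_window` and `rag_input_tokens` are names invented for this example.

```python
from typing import Callable, List


def sliding_window(
    messages: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda m: len(m.split()),
) -> List[str]:
    """Keep the most recent messages whose combined size fits in max_tokens.

    The default counter is a crude approximation; swap in the tokenizer
    of the model you actually call.
    """
    kept: List[str] = []
    total = 0
    for message in reversed(messages):  # walk from newest to oldest
        size = count_tokens(message)
        if total + size > max_tokens:
            break
        kept.append(message)
        total += size
    return list(reversed(kept))  # restore chronological order


def rag_input_tokens(chunks_retrieved: int, tokens_per_chunk: int) -> int:
    """Input tokens added by retrieval; grows linearly with chunk size."""
    return chunks_retrieved * tokens_per_chunk
```

With four retrieved chunks, moving from 500 to 1,000 tokens per chunk doubles the retrieval share of every request, which is the linear growth described above.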
For my site, the monthly AI cost stays below ten dollars processing dozens of articles, because the agent uses Haiku for search and triage, Sonnet only for drafting, and the instruction prompt is cached. It is not magic — it is conscious model choice per pipeline stage.
One final point on Bedrock reserved throughput (Provisioned Throughput, in Bedrock's own terminology): it makes financial sense only when you have predictable, high volume. For most early-stage projects, serverless on-demand mode is cheaper and simpler. Reserve throughput when your cost projections show the break-even has been reached.
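A rough break-even check can be written as a two-line calculation. The sketch below is a simplification under stated assumptions: it uses a single blended price per 1,000 tokens rather than separate input and output prices, and the function names and any numbers you pass in are illustrative, not real Bedrock pricing.

```python
def break_even_tokens_per_month(
    committed_monthly_cost: float,
    on_demand_price_per_1k_tokens: float,
) -> float:
    """Monthly token volume at which on-demand spend equals the commitment."""
    return committed_monthly_cost / on_demand_price_per_1k_tokens * 1000


def cheaper_mode(
    monthly_tokens: float,
    committed_monthly_cost: float,
    on_demand_price_per_1k_tokens: float,
) -> str:
    """Compare on-demand spend at this volume against the fixed commitment."""
    on_demand_cost = monthly_tokens / 1000 * on_demand_price_per_1k_tokens
    return "committed" if on_demand_cost > committed_monthly_cost else "on-demand"
```

If your measured monthly volume sits well below the break-even figure, on-demand remains the cheaper choice; revisit the calculation when volume becomes stable.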
Frequently asked questions
Does web search in AgentCore always improve answer quality?
No. If the search query is poorly formulated or the returned snippets come from low-quality sources, the model may cite wrong information with more confidence than if it had not searched at all. Grounding improves quality when the search tool is well configured and the snippets are relevant. Run your evals (Lesson 09) before enabling it in production.
Does prompt cache work automatically in Bedrock?
It depends on the model and configuration. Some models in Bedrock support prefix caching, but you need to structure the prompt correctly and, in some cases, enable it explicitly. Check the documentation for the specific model — not all models available in Bedrock support caching in the same way.
What is the risk of using smaller models to save money?
Lower output quality for complex tasks. The real risk is using a small model on a task that requires multi-step reasoning and not noticing the degradation because you have no evals. The mitigation is simple: define evals before switching models, not after.
Is reserved throughput in Bedrock worth it for startups?
Rarely at the start. Reserved throughput requires a volume commitment and upfront payment. For most early-stage startups, the on-demand model is more flexible and cheaper until volume is predictable and stable. Reassess when your monthly on-demand costs are consistently high.
Closing Module 4
Grounding and FinOps look like separate topics, but they share the same root: architecture decisions you make before writing the first line of code. Choosing when to fetch external facts, which model to use at each stage, what to cache and what to limit — all of this defines whether your system is reliable and sustainable or an expensive prototype that nobody ships to production. Module 4 covered the fundamental building blocks of AI architecture on AWS: Bedrock, AgentCore, Knowledge Bases, and now cost and currency. In the next module, we look at the most important decision of all: when to use AI and when not to.