Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

The AI Architect Track

Module 4 · Architecture on AWS· Lesson 16/22

Amazon Bedrock: managed models and model choice

The front door to AI on AWS — and how to pick a model by cost, latency and reasoning.

5 min read

Listen — Fernando's cloned voice

0:009:09

Speed

Download

If you want to run an LLM in production on AWS without managing GPUs, clusters, or individual provider contracts, Amazon Bedrock is your entry point. One unified API, serverless, pay-per-token — and the most important decision you'll make isn't 'use Bedrock or not', but rather 'which model to pick and why'.

What Amazon Bedrock Is

Bedrock is a managed service that exposes foundation models from multiple providers — Anthropic (Claude), Meta (Llama), Mistral, Amazon (Titan, Nova), and others — through a single API called Converse. You provision nothing: no EC2 instance, no container, no permanent endpoint to keep alive.

The analogy I use: Bedrock is to models what S3 is to storage. You call the API, pay for what you use, and AWS handles all the underlying infrastructure — scaling, availability, patches, tenant isolation.

The Converse API is the most architecturally important detail. It standardizes the call contract regardless of the model: you send a list of messages, you get a response. Switching from Claude to Llama is changing a modelId parameter, not rewriting the integration. That has real value when you need to compare models or migrate for cost reasons.

My own website uses Bedrock in production — the assistant that answers questions about my background and content runs on Claude via the Converse API, with no dedicated server. The cost is proportional to actual usage, which makes sense for a personal site with variable traffic.

How Bedrock Fits Into the Architecture

Your app never talks directly to Anthropic, Meta, or Mistral. It calls Bedrock's Converse API and the service routes to the chosen model. The app only needs IAM and a regional endpoint.

🖥️ Sua aplicação — App Layer

App / Lambda · Business logic
IAM Role · Least privilege

🟧 AWS — Amazon Bedrock

Converse API · modelId param
Model Router · Serverless dispatch

🤖 Modelos de Fundação — Foundation Models

Claude (Anthropic) · Raciocínio / Reasoning
Amazon Nova · Custo / Cost
Llama (Meta) · Open weights
Mistral · Latência / Latency

In Practice

Senior Solutions Architect

In practice, the biggest mistake I see is defaulting to the most powerful available model — usually Claude Opus or equivalent — and only noticing the cost when the bill arrives. My rule: start with the cheapest model that solves the problem. Move up a tier only when evals show the smaller model is failing. That's not frugality — that's engineering.

Why Serverless and Pay-Per-Token Fit FinOps

In Bedrock's serverless model, you pay per input token and per output token. There's no idle cost — if nobody uses the system at 3 a.m., you pay nothing. This completely changes the TCO calculation compared to hosting a model on EC2 or EKS, where the instance runs 24/7.

The practical impact: for workloads with irregular traffic — an internal assistant only used during business hours, a document processing pipeline that runs in batch — Bedrock's on-demand model is almost always cheaper than dedicated infrastructure.

When traffic is high and predictable, there's the provisioned throughput option: you reserve model capacity and pay per hour regardless of usage. The break-even depends on volume — but for most early use cases, on-demand is the right starting point.

The other FinOps angle is that cost per token varies enormously across models. A heavy reasoning model can cost 10–50× more per token than a lightweight model. If your task is classifying sentiment in short reviews, using the most expensive model is pure waste. Model selection is the most powerful cost lever you have.

How to Choose the Right Model in Bedrock

Claude Haiku / Nova Micro

Pros

Very low cost per token
Low latency — good for real-time streaming
Sufficient for classification, simple extraction, short summarization

Cons

Limited reasoning on complex tasks
Smaller context window in some models

Default starting point. Use until evals show failure.

Claude Sonnet / Nova Pro

Pros

Strong balance between cost and capability
Good reasoning, long context, robust tool calling
Covers most agent use cases

Cons

More expensive than lightweight models
Higher latency than Haiku on long responses

Production tier for agents and RAG with medium complexity.

Claude Opus / modelos de raciocínio

Pros

Maximum reasoning and complex instruction following
Best for deep analysis, complex code, critical decisions

Cons

Significantly higher cost per token
Higher latency — bad for interactive UX

Reserve for tasks where deep reasoning is demonstrably necessary.

Llama / Mistral (open weights)

Pros

Competitive cost, no proprietary vendor lock-in
Option for fine-tuning and customization

Cons

Instruction following generally below Anthropic/Amazon models at the same cost tier
Less mature tool calling ecosystem in some models

Valid for cases with extreme cost constraints or fine-tuning needs.

The Four Criteria That Dominate Model Selection

Every model choice in Bedrock revolves around four variables. Understanding the trade-offs between them is what separates an architectural decision from a guess.

Reasoning is the model's ability to follow complex instructions, chain logical steps, and use tools correctly. For agents with multiple tool calls and conditional logic, weak reasoning breaks the flow. For binary classification, advanced reasoning is overkill.

Cost is measured in dollars per million tokens — and the delta between models can be an order of magnitude. As we'll see in the FinOps lesson (lesson 19), token cost dominates the TCO of AI systems at scale. Choosing the right model is the most impactful cost decision you make.

Latency matters for UX. Streaming helps hide time-to-first-token latency, but larger models are still slower. For an interactive chatbot, perceived latency above 2–3 seconds degrades the experience. For a nightly batch pipeline, latency isn't a criterion.

Context window defines how much text the model can process in a single call. For RAG with long documents or agents with extensive history (lesson 12), a small window forces chunking and increases complexity. Models with 200k-token windows solve problems that 8k models simply cannot.

The decision matrix above summarizes where each model family fits. Use it as a starting point — and validate with real evals (lesson 9).

Key Takeaways from This Lesson

Bedrock is serverless, pay-per-token access to models from multiple providers via a single API (Converse).

Your app never talks directly to Anthropic or Meta — it calls Bedrock and the service routes to the model.

The Converse API standardizes the contract: switching models is changing a parameter, not rewriting code.

Model selection is the biggest cost lever — start with the cheapest model that solves the problem.

The four selection criteria: reasoning, cost per token, latency, and context window.

On-demand for variable traffic; provisioned throughput for high, predictable volume.

Frequently Asked Questions

Do I need a special account or approval to use models in Bedrock?

Some models require you to explicitly request access in the Bedrock console (Model Access). It's a simple process, but must be done before calling the API. Amazon models are generally available immediately; third-party models like Claude may require accepting provider terms.

Is the data I send to Bedrock used to train models?

No, by default. AWS guarantees that prompts and responses in on-demand inference are not used to improve base models. This is an important differentiator for use cases with sensitive data. Always check the documentation and BAA if you're in a regulated context (HIPAA, etc.).

What's the difference between Bedrock and SageMaker for running models?

Bedrock is for consuming ready-made foundation models without managing infrastructure. SageMaker is for training, fine-tuning, and hosting custom models with full infrastructure control. For most LLM application use cases, Bedrock is the right choice. SageMaker comes in when you need fine-tuning or models not available in Bedrock's catalog.

My Direct Take

✅ Recomendado como ponto de entrada para

Bedrock solves the right problem: it removes model infrastructure complexity from your path and lets you focus on what matters — application logic. The Converse API is well-designed, the serverless model is honest about costs, and the model catalog is broad enough to cover virtually any use case. The real risk isn't technical — it's architectural: choosing the wrong model for lack of criteria and paying 10× more than necessary. Use the decision matrix, run evals early, and treat model selection as a revisable engineering decision, not a permanent choice.

Quiz

Quick check

1. What most dominates the total cost (TCO) of an AI system?

References

Amazon Bedrock — Developer Guide Converse API — Amazon Bedrock Model IDs and pricing — Amazon Bedrock Provisioned Throughput for Amazon Bedrock Amazon Bedrock — Data Privacy FAQ

Previous Next lesson

What Amazon Bedrock Is

How Bedrock Fits Into the Architecture

Your app never talks directly to Anthropic, Meta, or Mistral. It calls Bedrock's Converse API and the service routes to the chosen model. The app only needs IAM and a regional endpoint.

🖥️ Sua aplicação — App Layer

App / Lambda · Business logic
IAM Role · Least privilege

🟧 AWS — Amazon Bedrock

Converse API · modelId param
Model Router · Serverless dispatch

🤖 Modelos de Fundação — Foundation Models

Claude (Anthropic) · Raciocínio / Reasoning
Amazon Nova · Custo / Cost
Llama (Meta) · Open weights
Mistral · Latência / Latency

Why Serverless and Pay-Per-Token Fit FinOps

How to Choose the Right Model in Bedrock

Claude Haiku / Nova Micro

Pros

Very low cost per token
Low latency — good for real-time streaming
Sufficient for classification, simple extraction, short summarization

Cons

Limited reasoning on complex tasks
Smaller context window in some models

Default starting point. Use until evals show failure.

Claude Sonnet / Nova Pro

Pros

Strong balance between cost and capability
Good reasoning, long context, robust tool calling
Covers most agent use cases

Cons

More expensive than lightweight models
Higher latency than Haiku on long responses

Production tier for agents and RAG with medium complexity.

Claude Opus / modelos de raciocínio

Pros

Maximum reasoning and complex instruction following
Best for deep analysis, complex code, critical decisions

Cons

Significantly higher cost per token
Higher latency — bad for interactive UX

Reserve for tasks where deep reasoning is demonstrably necessary.

Llama / Mistral (open weights)

Pros

Competitive cost, no proprietary vendor lock-in
Option for fine-tuning and customization

Cons

Instruction following generally below Anthropic/Amazon models at the same cost tier
Less mature tool calling ecosystem in some models

Valid for cases with extreme cost constraints or fine-tuning needs.

The Four Criteria That Dominate Model Selection

Every model choice in Bedrock revolves around four variables. Understanding the trade-offs between them is what separates an architectural decision from a guess.

The decision matrix above summarizes where each model family fits. Use it as a starting point — and validate with real evals (lesson 9).

Key Takeaways from This Lesson

Bedrock is serverless, pay-per-token access to models from multiple providers via a single API (Converse).

Your app never talks directly to Anthropic or Meta — it calls Bedrock and the service routes to the model.

The Converse API standardizes the contract: switching models is changing a parameter, not rewriting code.

Model selection is the biggest cost lever — start with the cheapest model that solves the problem.

The four selection criteria: reasoning, cost per token, latency, and context window.

On-demand for variable traffic; provisioned throughput for high, predictable volume.

Frequently Asked Questions

Do I need a special account or approval to use models in Bedrock?

Is the data I send to Bedrock used to train models?

What's the difference between Bedrock and SageMaker for running models?

My Direct Take

✅ Recomendado como ponto de entrada para