Amazon Bedrock: managed models and model choice
The front door to AI on AWS — and how to pick a model by cost, latency and reasoning.
5 min read
If you want to run an LLM in production on AWS without managing GPUs, clusters, or individual provider contracts, Amazon Bedrock is your entry point. One unified API, serverless, pay-per-token — and the most important decision you'll make isn't 'use Bedrock or not', but rather 'which model to pick and why'.
What Amazon Bedrock Is
Bedrock is a managed service that exposes foundation models from multiple providers — Anthropic (Claude), Meta (Llama), Mistral, Amazon (Titan, Nova), and others — through a single API called Converse. You provision nothing: no EC2 instance, no container, no permanent endpoint to keep alive.
The analogy I use: Bedrock is to models what S3 is to storage. You call the API, pay for what you use, and AWS handles all the underlying infrastructure — scaling, availability, patches, tenant isolation.
The Converse API is the most architecturally important detail. It standardizes the call contract regardless of the model: you send a list of messages, you get a response. Switching from Claude to Llama is changing a modelId parameter, not rewriting the integration. That has real value when you need to compare models or migrate for cost reasons.
My own website uses Bedrock in production — the assistant that answers questions about my background and content runs on Claude via the Converse API, with no dedicated server. The cost is proportional to actual usage, which makes sense for a personal site with variable traffic.
How Bedrock Fits Into the Architecture
Your app never talks directly to Anthropic, Meta, or Mistral. It calls Bedrock's Converse API and the service routes to the chosen model. The app only needs IAM and a regional endpoint.
- App / Lambda · Business logic
- IAM Role · Least privilege
- Converse API · modelId param
- Model Router · Serverless dispatch
- Claude (Anthropic) · Raciocínio / Reasoning
- Amazon Nova · Custo / Cost
- Llama (Meta) · Open weights
- Mistral · Latência / Latency
In practice, the biggest mistake I see is defaulting to the most powerful available model — usually Claude Opus or equivalent — and only noticing the cost when the bill arrives. My rule: start with the cheapest model that solves the problem. Move up a tier only when evals show the smaller model is failing. That's not frugality — that's engineering.
Why Serverless and Pay-Per-Token Fit FinOps
In Bedrock's serverless model, you pay per input token and per output token. There's no idle cost — if nobody uses the system at 3 a.m., you pay nothing. This completely changes the TCO calculation compared to hosting a model on EC2 or EKS, where the instance runs 24/7.
The practical impact: for workloads with irregular traffic — an internal assistant only used during business hours, a document processing pipeline that runs in batch — Bedrock's on-demand model is almost always cheaper than dedicated infrastructure.
When traffic is high and predictable, there's the provisioned throughput option: you reserve model capacity and pay per hour regardless of usage. The break-even depends on volume — but for most early use cases, on-demand is the right starting point.
The other FinOps angle is that cost per token varies enormously across models. A heavy reasoning model can cost 10–50× more per token than a lightweight model. If your task is classifying sentiment in short reviews, using the most expensive model is pure waste. Model selection is the most powerful cost lever you have.
How to Choose the Right Model in Bedrock
Claude Haiku / Nova Micro
- Very low cost per token
- Low latency — good for real-time streaming
- Sufficient for classification, simple extraction, short summarization
- Limited reasoning on complex tasks
- Smaller context window in some models
Default starting point. Use until evals show failure.
Claude Sonnet / Nova Pro
- Strong balance between cost and capability
- Good reasoning, long context, robust tool calling
- Covers most agent use cases
- More expensive than lightweight models
- Higher latency than Haiku on long responses
Production tier for agents and RAG with medium complexity.
Claude Opus / modelos de raciocínio
- Maximum reasoning and complex instruction following
- Best for deep analysis, complex code, critical decisions
- Significantly higher cost per token
- Higher latency — bad for interactive UX
Reserve for tasks where deep reasoning is demonstrably necessary.
Llama / Mistral (open weights)
- Competitive cost, no proprietary vendor lock-in
- Option for fine-tuning and customization
- Instruction following generally below Anthropic/Amazon models at the same cost tier
- Less mature tool calling ecosystem in some models
Valid for cases with extreme cost constraints or fine-tuning needs.
The Four Criteria That Dominate Model Selection
Every model choice in Bedrock revolves around four variables. Understanding the trade-offs between them is what separates an architectural decision from a guess.
Reasoning is the model's ability to follow complex instructions, chain logical steps, and use tools correctly. For agents with multiple tool calls and conditional logic, weak reasoning breaks the flow. For binary classification, advanced reasoning is overkill.
Cost is measured in dollars per million tokens — and the delta between models can be an order of magnitude. As we'll see in the FinOps lesson (lesson 19), token cost dominates the TCO of AI systems at scale. Choosing the right model is the most impactful cost decision you make.
Latency matters for UX. Streaming helps hide time-to-first-token latency, but larger models are still slower. For an interactive chatbot, perceived latency above 2–3 seconds degrades the experience. For a nightly batch pipeline, latency isn't a criterion.
Context window defines how much text the model can process in a single call. For RAG with long documents or agents with extensive history (lesson 12), a small window forces chunking and increases complexity. Models with 200k-token windows solve problems that 8k models simply cannot.
The decision matrix above summarizes where each model family fits. Use it as a starting point — and validate with real evals (lesson 9).
Key Takeaways from This Lesson
Frequently Asked Questions
Do I need a special account or approval to use models in Bedrock?
Some models require you to explicitly request access in the Bedrock console (Model Access). It's a simple process, but must be done before calling the API. Amazon models are generally available immediately; third-party models like Claude may require accepting provider terms.
Is the data I send to Bedrock used to train models?
No, by default. AWS guarantees that prompts and responses in on-demand inference are not used to improve base models. This is an important differentiator for use cases with sensitive data. Always check the documentation and BAA if you're in a regulated context (HIPAA, etc.).
What's the difference between Bedrock and SageMaker for running models?
Bedrock is for consuming ready-made foundation models without managing infrastructure. SageMaker is for training, fine-tuning, and hosting custom models with full infrastructure control. For most LLM application use cases, Bedrock is the right choice. SageMaker comes in when you need fine-tuning or models not available in Bedrock's catalog.
My Direct Take
Bedrock solves the right problem: it removes model infrastructure complexity from your path and lets you focus on what matters — application logic. The Converse API is well-designed, the serverless model is honest about costs, and the model catalog is broad enough to cover virtually any use case. The real risk isn't technical — it's architectural: choosing the wrong model for lack of criteria and paying 10× more than necessary. Use the decision matrix, run evals early, and treat model selection as a revisable engineering decision, not a permanent choice.
Quick check
1. What most dominates the total cost (TCO) of an AI system?