Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.


CloudWatch to OTel: Tearing Down the Observability Bridge Pattern

May 28, 2026 · 7 min · advanced · AI-assisted



The CloudWatch-to-OpenTelemetry bridge pattern solves a real observability fragmentation problem in multi-platform environments, but it carries operational costs and design pitfalls that rarely surface in tutorials. In this article I tear down the anatomy of this pattern, when it makes sense, and when it creates more problems than it solves.

In financial environments where platform teams need to consolidate signals from dozens of AWS workloads into a unified observability backend — whether Datadog, Grafana Cloud, Honeycomb, or an internal stack — the CloudWatch → OpenTelemetry bridge pattern appears as the obvious solution. But 'obvious' and 'correct' rarely coincide in architecture. This pattern has a specific anatomy, a narrow validity envelope, and failure modes that only surface in production under real load. I'm going to dissect every layer.

The Real Problem: Observability Fragmentation in AWS-Native Environments

Every organization that grows beyond two or three engineering teams faces the same tension: AWS services emit metrics natively to CloudWatch — Lambda, RDS, EKS, API Gateway, MSK — but the corporate observability backend speaks OTLP. The result is a split world: SREs need to open two consoles to correlate an incident, alerts live in different namespaces, and business dashboards become impossible to build without manual ETL.

The bridge pattern exists to solve exactly this. The core idea is simple: a Lambda function (or an OTel collector running on ECS/EKS) subscribes to CloudWatch metric streams via CloudWatch Metric Streams or polls the GetMetricData API, transforms the payload to OTLP format, and forwards it to a collector endpoint. In theory, you get a single observability control plane. In practice, the complexity hides in the details.

What makes this problem especially treacherous in financial environments is the combination of three factors: (1) metric volume: a mid-size AWS account with EKS, Multi-AZ RDS, and Lambda can easily emit 50,000+ metric series at one-minute resolution; (2) business latency: anomaly detection SLOs require freshness of 60 seconds or less; (3) API cost and quotas: GetMetricData is billed per metric requested, a single call accepts at most 500 metrics, and the default request rate is 50 requests per second per Region, so naive polling at scale breaks both the budget and AWS rate limits.

Anatomy of the CloudWatch → OTel Bridge Pattern

Complete flow from metric emission by AWS services to the external observability backend, through the two ingestion paths (Metric Streams and polling) and the security and cost guardrails.

🟧 AWS — Metric Sources

Lambda · Invocations/Errors/Duration
EKS / EC2 · CPU, Memory, Network
RDS / Aurora · DBConnections, Latency
API Gateway · 4xx/5xx, Latency

🟦 AWS — Ingestion Layer

CloudWatch · Namespaces
CW Metric Streams · ~2-3s latency, Firehose
Kinesis Firehose · JSON/OTel0.7 format

🟨 AWS — Transform & Forward

Bridge Lambda · OTLP transform + retry
SQS DLQ · failed batches
KMS CMK · payload encryption

🔵 External — Observability Backend

OTel Collector · OTLP/gRPC :4317
Datadog / Grafana / Honeycomb

Pattern Anatomy: Two Paths, One Fundamental Trade-off

The pattern has two ingestion flavors, and the choice between them defines everything that follows.

CloudWatch Metric Streams + Kinesis Firehose is the low-latency path. Streams deliver data in OpenTelemetry 0.7 (protobuf) or JSON format with 2–3 second latency. The cost is predictable: $0.003 per 1,000 metric updates. For an account with 100k active series updating once per minute, that's 100,000 × 43,200 minutes ≈ 4.32 billion updates, or roughly $13,000/month before Firehose costs, which is why stream filters matter. The critical architectural advantage is that you're not polling: data flows, and the Lambda at the Firehose destination receives batches, not individual calls.
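The per-update pricing quoted above can be turned into a quick estimate. This is a minimal sketch; the function name and the one-update-per-minute default are assumptions for illustration, and the price is the $0.003 per 1,000 updates figure from the text, not a value looked up at runtime.

```python
def monthly_stream_cost(active_series: int,
                        updates_per_minute: float = 1.0,
                        price_per_1k_updates: float = 0.003,
                        minutes_per_month: int = 43_200) -> float:
    """Estimate the monthly Metric Streams charge in USD (Firehose not included)."""
    # Every active series produces one billable update per emission.
    updates = active_series * updates_per_minute * minutes_per_month
    return round(updates / 1000 * price_per_1k_updates, 2)


if __name__ == "__main__":
    # Compare an unfiltered stream with one filtered down to 10% of the series.
    for series in (100_000, 10_000):
        print(f"{series:>7} series -> ${monthly_stream_cost(series):,.2f}/month")
```

Because the charge scales linearly with the number of series, every namespace removed by a stream filter reduces the bill proportionally.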

Polling via GetMetricData is the fine-grained control path. You choose exactly which metrics, at what resolution, and what lookback period. But a single request accepts at most 500 metrics, and the default quota is 50 requests per second per Region (adjustable through a quota increase). In a large account, hitting that limit is a matter of minutes if the polling code doesn't batch correctly and doesn't implement exponential backoff with jitter. I've seen financial production systems break critical alerts because the poller entered throttling at 09:00 on a Monday morning, exactly when the market opens and transaction volume explodes.
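The batching and backoff described above might look like the sketch below. It assumes the documented limit of 500 metric queries per GetMetricData call (adjust `size` if your limit differs). `Throttled`, `poll_all`, and the injected `fetch` callable are hypothetical names used to keep the example free of AWS dependencies.

```python
import random
import time
from typing import Callable, Iterator, Sequence

MAX_QUERIES_PER_CALL = 500  # assumed per-request limit for GetMetricData


class Throttled(Exception):
    """Hypothetical marker raised by `fetch` when the API reports throttling."""


def chunked(queries: Sequence[dict],
            size: int = MAX_QUERIES_PER_CALL) -> Iterator[Sequence[dict]]:
    """Split the query list into request-sized batches."""
    for start in range(0, len(queries), size):
        yield queries[start:start + size]


def backoff_delay(attempt: int, base: float = 0.2, cap: float = 10.0) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def poll_all(queries: Sequence[dict],
             fetch: Callable[[Sequence[dict]], list],
             max_attempts: int = 5,
             sleep: Callable[[float], None] = time.sleep) -> list:
    """Fetch every batch, retrying throttled batches with jittered backoff."""
    results: list = []
    for batch in chunked(queries):
        for attempt in range(max_attempts):
            try:
                results.extend(fetch(batch))
                break
            except Throttled:
                if attempt == max_attempts - 1:
                    raise  # give up and let the caller alarm on it
                sleep(backoff_delay(attempt))
    return results
```

With boto3, `fetch` would wrap `get_metric_data(MetricDataQueries=batch, StartTime=..., EndTime=...)` and translate a throttling `ClientError` into `Throttled`. Jitter matters because many pollers retrying on the same schedule would otherwise hit the quota again in lockstep.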

The transformation Lambda needs three non-negotiable capabilities: (1) idempotency: Firehose can re-deliver batches on failure, so the Lambda must detect duplicates via payload hash; (2) a circuit breaker for the external OTel endpoint: if the collector is down, the Lambda must stop retrying instead of looping and consuming concurrency; (3) a DLQ with an alarm: batches that fail after 3 attempts go to an SQS DLQ, and a CloudWatch alarm must fire within 5 minutes.
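The first two capabilities can be sketched with the standard library alone. The class names, TTL, and thresholds below are illustrative assumptions. The duplicate detector is in-memory, so it only covers re-deliveries that reach the same warm Lambda container; a shared store would be needed for anything stronger.

```python
import hashlib
import time


def batch_fingerprint(payload: bytes) -> str:
    """Stable hash of the raw batch, used as the deduplication key."""
    return hashlib.sha256(payload).hexdigest()


class SeenBatches:
    """Remembers batch fingerprints for a limited time (per container only)."""

    def __init__(self, ttl_seconds: float = 900.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: dict[str, float] = {}

    def is_duplicate(self, payload: bytes) -> bool:
        now = self._clock()
        # Forget fingerprints older than the TTL so memory stays bounded.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
        key = batch_fingerprint(payload)
        if key in self._seen:
            return True
        self._seen[key] = now
        return False


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True  # closed: exports proceed normally
        # Open: only allow a trial export once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()
```

In the handler, a batch would be skipped when `is_duplicate` returns true, and routed straight to the failure path (and eventually the DLQ) when `allow` returns false, instead of waiting on a dead endpoint.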

When This Pattern Makes Sense

You have an external observability backend (Datadog, Grafana Cloud, Honeycomb) that speaks OTLP and needs metrics from managed AWS services that have no installable agent.

Metric volume justifies Streams (>10k active series) — below that, the fixed stream cost rarely pays off versus selective polling.

Your metric freshness SLO is ≤60 seconds — Metric Streams delivers in 2–3s; 1-minute polling has effective latency of up to 90s.

The platform team wants to decouple the observability backend from AWS without rewriting application instrumentation — the bridge is transparent to product teams.

You need trace and metric correlation in a single backend: OTel allows enriching metrics with resource attributes (account ID, cluster, service) that native CloudWatch doesn't propagate.

Security and Guardrails: What the Tutorials Don't Tell You

In financial environments, the observability bridge is an underestimated data exfiltration vector. Business metrics — transaction volume, payment latency, authentication error rates — are sensitive information. A misconfigured OTel endpoint can leak this data out of the AWS account without any alarm.

The three control layers I implement in every deployment of this pattern:

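IAM with least privilege: The polling Lambda role should allow only cloudwatch:GetMetricData and cloudwatch:ListMetrics, never cloudwatch:*, and should carry conditions such as aws:ResourceTag/Environment: production on every action that supports them. For Metric Streams, the role assumed by the stream needs firehose:PutRecord and firehose:PutRecordBatch on the delivery stream, while cloudwatch:PutMetricStream belongs to the deployment role that creates the stream. The destination Lambda role must be separate and have only s3:GetObject on the buffer bucket (if using S3 as fallback); permission to invoke the function is granted to the Firehose role, not to the Lambda itself.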

KMS CMK for the buffered payload: Firehose must be configured with server-side encryption using a CMK managed by the security team. This protects the data at rest while it is buffered in the delivery stream, so even unauthorized access to the stream's storage doesn't expose readable data. Protection in transit to the collector comes from TLS on the OTLP endpoint, not from the CMK.

VPC Endpoint for the OTel collector: If the collector runs on ECS inside the VPC, the Lambda must be in the same VPC and use private DNS. If the collector is external (SaaS), traffic must exit through a NAT Gateway with a fixed IP whitelisted in the vendor's firewall — never through an Internet Gateway without egress control. Add a WAF rule on API Gateway if the external collector exposes an HTTP endpoint.
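A minimal sketch of the IAM layer as policy documents built in Python. The function names are hypothetical. The read actions use Resource "*" on the assumption that they do not accept resource-level ARNs (verify this against the Service Authorization Reference); the wildcard to avoid is the one in the action list.

```python
import json


def polling_role_policy() -> dict:
    """Policy for the polling Lambda: two read actions, nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "ReadMetricsOnly",
            "Effect": "Allow",
            # Explicit actions only; never "cloudwatch:*".
            "Action": ["cloudwatch:GetMetricData", "cloudwatch:ListMetrics"],
            # Assumption: these actions cannot be scoped to a resource ARN.
            "Resource": "*",
        }],
    }


def metric_stream_role_policy(delivery_stream_arn: str) -> dict:
    """Policy for the role the Metric Stream assumes to write into Firehose."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "WriteToOneDeliveryStream",
            "Effect": "Allow",
            "Action": ["firehose:PutRecord", "firehose:PutRecordBatch"],
            "Resource": delivery_stream_arn,
        }],
    }


if __name__ == "__main__":
    # Placeholder ARN for illustration only.
    arn = "arn:aws:firehose:us-east-1:123456789012:deliverystream/cw-bridge"
    print(json.dumps(metric_stream_role_policy(arn), indent=2))
```

Keeping the two roles in separate documents makes it easy to review that neither one grants the other's permissions.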

A detail that burned one of my clients: the bridge Lambda running with a 15-minute timeout (maximum) and no reserved concurrency can consume the entire account concurrency pool during an ingestion spike, bringing down critical business functions. Reserve explicit concurrency — typically 10–20 instances are sufficient for a Firehose with a 5MB batch.

Anti-Patterns: When This Bridge Will Explode in Production

Naive polling without backoff: Calling GetMetricData in a loop with a fixed 60s interval for hundreds of namespaces. In accounts with many services, you hit the rate limit in minutes and lose observability data exactly when you need it most — during incidents.
Lambda without DLQ and without error alarm: Silent transformation or export failures mean metrics disappear without any signal. In financial environments, this can mask service degradation for hours.
Forwarding all metrics from all namespaces: CloudWatch has over 200 namespaces. Forwarding everything to the external backend multiplies SaaS ingestion cost by 5–10x without proportional value. Filter by namespace and metric name in the Metric Stream itself.
No idempotency in the transformation Lambda: Firehose guarantees at-least-once delivery. Without payload hash deduplication, you inject duplicate series into the OTel backend, corrupting aggregations and sum/count-based alerts.
Using this pattern for traces and logs: The CloudWatch → OTel bridge is designed for metrics. Trying to forward CloudWatch Logs Insights or X-Ray traces through the same channel creates a fragile multi-purpose system. Use the OTel Collector directly in applications for traces and logs.
No reserved concurrency on the Lambda: During ingestion spikes (market open, nightly batch), the bridge Lambda can exhaust the account concurrency pool and throttle critical business functions.
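Two of the anti-patterns above (no DLQ alarm, no reserved concurrency) come down to a handful of API parameters. The sketch below only builds the argument dictionaries; the function names, alarm name, and one-minute period are assumptions, and the keys follow the CloudWatch `put_metric_alarm` and Lambda `put_function_concurrency` request shapes.

```python
def dlq_alarm_params(queue_name: str, sns_topic_arn: str) -> dict:
    """Alarm as soon as any message is visible in the bridge DLQ."""
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,                 # one-minute evaluation window
        "EvaluationPeriods": 1,       # fire on the first breaching period
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # an idle queue is healthy
        "AlarmActions": [sns_topic_arn],
    }


def reserved_concurrency_params(function_name: str, limit: int) -> dict:
    """Cap the bridge Lambda so it cannot drain the account concurrency pool."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return {
        "FunctionName": function_name,
        "ReservedConcurrentExecutions": limit,
    }
```

With boto3 these would be passed as `cloudwatch.put_metric_alarm(**dlq_alarm_params(...))` and `lambda_client.put_function_concurrency(**reserved_concurrency_params(...))`.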

Reference Design: What Actually Works in Financial Production

After implementing and debugging this pattern in three different financial environments, the design that works has these concrete characteristics:

Ingestion via Metric Streams with namespace filter: Configure the stream with include filters for only the relevant namespaces, such as AWS/Lambda, AWS/EKS, AWS/RDS, AWS/ApiGateway, and AWS/Kafka (the namespace MSK publishes to). With an include list, high-cost, low-value namespaces like AWS/Billing and AWS/CloudFront (unless you monitor CDN) are left out automatically; a single stream accepts either include filters or exclude filters, not both. The format should be opentelemetry0.7 (protobuf), not JSON, to reduce payload size by ~40%.
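As a rough illustration, the stream configuration above maps onto the CloudWatch `put_metric_stream` request like this. The helper name, stream name, and ARNs are placeholders; only the request keys and the `opentelemetry0.7` format value come from the API.

```python
from typing import Iterable


def metric_stream_params(name: str, firehose_arn: str, role_arn: str,
                         namespaces: Iterable[str]) -> dict:
    """Build put_metric_stream arguments with an include-only namespace filter."""
    include = [{"Namespace": ns} for ns in namespaces]
    if not include:
        # An empty include list would stream every namespace in the account.
        raise ValueError("at least one namespace is required")
    return {
        "Name": name,
        "FirehoseArn": firehose_arn,
        "RoleArn": role_arn,
        "OutputFormat": "opentelemetry0.7",
        "IncludeFilters": include,
    }


if __name__ == "__main__":
    params = metric_stream_params(
        name="bridge-stream",  # placeholder values for illustration
        firehose_arn="arn:aws:firehose:us-east-1:123456789012:deliverystream/cw-bridge",
        role_arn="arn:aws:iam::123456789012:role/metric-stream-to-firehose",
        namespaces=["AWS/Lambda", "AWS/RDS", "AWS/ApiGateway"],
    )
    print(params["IncludeFilters"])
```

Each filter entry can also carry a `MetricNames` list when only a few metrics from a namespace are worth forwarding.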

Firehose with 60s/5MB buffer and S3 as fallback: Configure BufferingHints with IntervalInSeconds: 60 and SizeInMBs: 5. The fallback S3 bucket should have a lifecycle policy to expire objects after 7 days — it exists only for manual replay in case of external collector failure, not as permanent storage.
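The two settings in this paragraph are small configuration fragments. A sketch of both, using the request shapes of Firehose `BufferingHints` and S3 `put_bucket_lifecycle_configuration`; the helper names and the rule ID are assumptions.

```python
def buffering_hints(interval_seconds: int = 60, size_mb: int = 5) -> dict:
    """Firehose flushes when either the interval or the size is reached first."""
    return {"IntervalInSeconds": interval_seconds, "SizeInMBs": size_mb}


def fallback_lifecycle(expire_after_days: int = 7) -> dict:
    """Lifecycle configuration that expires every object in the fallback bucket."""
    return {
        "Rules": [{
            "ID": "expire-bridge-fallback",   # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},        # empty prefix matches all objects
            "Expiration": {"Days": expire_after_days},
        }]
    }
```

With boto3, the second dictionary would be passed as `LifecycleConfiguration` to `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...)`. Note that the buffer interval adds directly to end-to-end freshness, so it belongs in the same budget as the freshness SLO.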

Transformation Lambda in Python/Rust with OTel SDK: Use opentelemetry-sdk to build the OTLP payload. Add fixed resource attributes at transformation time: cloud.account.id, cloud.region, deployment.environment. These attributes enable cross-account correlation in the backend. Timeout should be 3 minutes (not 15) — if transforming a batch takes more than 3 minutes, there's a volume problem that needs to be solved in the stream filter, not in the timeout.
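To make the resource-attribute step concrete, here is a dependency-free sketch that maps one JSON-format Metric Streams record to an OTLP/JSON payload. It assumes JSON stream output purely for readability; with opentelemetry0.7 output the same attributes would be attached to the decoded resource instead. The function names, the scope name, and the `namespace/metric_name` naming are illustrative choices, not the author's implementation.

```python
def _attr(key: str, value: str) -> dict:
    """OTLP/JSON key-value pair with a string value."""
    return {"key": key, "value": {"stringValue": str(value)}}


def to_otlp_json(record: dict, environment: str) -> dict:
    """Wrap one CloudWatch stream record in an OTLP resourceMetrics envelope."""
    value = record["value"]
    data_point = {
        # CloudWatch timestamps are epoch milliseconds; OTLP wants nanoseconds.
        "timeUnixNano": str(record["timestamp"] * 1_000_000),
        "count": str(int(value["count"])),  # 64-bit ints are strings in OTLP/JSON
        "sum": value["sum"],
        # Min and max are carried as the 0 and 1 quantiles of a summary point.
        "quantileValues": [
            {"quantile": 0.0, "value": value["min"]},
            {"quantile": 1.0, "value": value["max"]},
        ],
        "attributes": [_attr(k, v) for k, v in record.get("dimensions", {}).items()],
    }
    return {
        "resourceMetrics": [{
            "resource": {"attributes": [
                _attr("cloud.provider", "aws"),
                _attr("cloud.account.id", record["account_id"]),
                _attr("cloud.region", record["region"]),
                _attr("deployment.environment", environment),
            ]},
            "scopeMetrics": [{
                "scope": {"name": "cloudwatch-bridge"},  # hypothetical scope name
                "metrics": [{
                    "name": f'{record["namespace"]}/{record["metric_name"]}',
                    "unit": record.get("unit", ""),
                    "summary": {"dataPoints": [data_point]},
                }],
            }],
        }]
    }
```

Putting the account, region, and environment on the resource rather than on each data point keeps the per-point payload small and gives the backend one consistent place to join on.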

Observability of the bridge itself: Instrument the Lambda with custom metrics: bridge.metrics.transformed.count, bridge.export.latency.p99, bridge.export.errors.count. Create a CloudWatch dashboard for the bridge itself. It's ironic but necessary: you need to observe the observability system. Define an SLO on the export error ratio (bridge.export.errors.count divided by bridge.metrics.transformed.count) of less than 0.1%, with a 5-minute burn rate alarm.
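The burn rate behind that alarm is a simple ratio: the observed error ratio in the window divided by the error budget (0.1% here). A minimal sketch follows; the function names and the alarm threshold of 1.0 are assumptions to be tuned per team.

```python
def error_ratio(errors: int, total: int) -> float:
    """Share of failed exports in the window; zero traffic counts as healthy."""
    return errors / total if total else 0.0


def burn_rate(errors: int, total: int, error_budget: float = 0.001) -> float:
    """How many times faster than allowed the budget is being consumed."""
    return error_ratio(errors, total) / error_budget


def should_alarm(errors: int, total: int, threshold: float = 1.0,
                 error_budget: float = 0.001) -> bool:
    """True when the 5-minute window burns budget faster than the threshold."""
    return burn_rate(errors, total, error_budget) > threshold


if __name__ == "__main__":
    # 30 failed exports out of 10,000 in the last 5 minutes.
    print(round(burn_rate(30, 10_000), 2), should_alarm(30, 10_000))
```

A burn rate of 1 means the bridge is consuming its budget exactly at the sustainable pace; anything well above that in a short window is worth paging on.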

Real Production Numbers

2–3s

Metric Streams Latency

From AWS service emission to Firehose. 1-minute interval polling has effective latency of 60–90s.

~$13k

Cost per 100k series/month

$0.003/1k updates × ~100k series × ~1 update/min × 43,200 min/month ≈ $13k/month (about $0.30 per minute). Filter aggressively.

500

Metrics per request in GetMetricData API

Per-call limit. With the default quota of 50 requests/s per Region (adjustable), polling 25k metrics requires 50 requests, which uses the full quota for 1 second.

Filter at the Stream, Not in Lambda

CloudWatch Metric Streams supports filters by namespace and by metric name. Use them. Every metric you don't forward saves Firehose cost, Lambda cost, and ingestion cost in the external backend. The transformation Lambda should be dumb and fast: transform format, add resource attributes, and export. Filtering logic in Lambda is an anti-pattern: you've already paid for the data when it arrives at Firehose.

Well-Architected Lenses for This Pattern

Security

IAM with resource tag conditions, KMS CMK on Firehose, Lambda in private VPC, controlled egress via NAT with fixed IP. The Lambda role must never have cloudwatch:* — minimum scope per namespace.

Reliability

DLQ at Firehose destination, circuit breaker for the external OTel endpoint, burn rate alarm on the bridge SLO. External collector failure testing must be part of the DR runbook.

My Curation Note

Senior Solutions Architect

I implemented this pattern for the first time in 2022 in a payments environment and learned the hard way that the observability bridge needs its own observability — it sounds obvious, but in practice it's always the last item on the list. What concerns me most about this pattern is not the technical complexity, but the false sense of security it creates: teams assume that because metrics are 'flowing,' observability is working. It's not — not until you have freshness SLOs, error alarms on the bridge, and a runbook for external collector failure. In financial environments, I always require the team to demonstrate what happens when the external OTel endpoint is unavailable for 10 minutes before going to production. The answer to that question reveals whether the design has real guardrails or just good intentions.

Verdict: Use with Surgery, Not Enthusiasm

The CloudWatch → OTel Bridge pattern via Lambda and Metric Streams is technically sound and solves a real observability fragmentation problem. But it has a non-trivial operational cost and a narrow validity envelope. Use it when: (a) you have >10k active metric series from managed AWS services that need to be in an external OTLP backend; (b) your freshness SLO is ≤60 seconds; (c) the platform team has the capacity to operate and observe the bridge itself. Don't use it when: you're trying to consolidate traces and logs through the same channel; when volume doesn't justify the fixed Metric Stream cost; or when there isn't operational maturity to maintain a system that observes other systems. The highest risk is not technical — it's organizational: teams that deploy this pattern and then don't monitor it create an observability blind spot disguised as an observability solution.


