CloudWatch to OTel: Tearing Down the Observability Bridge Pattern
The CloudWatch-to-OpenTelemetry bridge pattern solves a real observability fragmentation problem in multi-platform environments, but it carries operational costs and design pitfalls that rarely surface in tutorials. In this article I tear down the anatomy of this pattern, when it makes sense, and when it creates more problems than it solves.
In financial environments where platform teams need to consolidate signals from dozens of AWS workloads into a unified observability backend — whether Datadog, Grafana Cloud, Honeycomb, or an internal stack — the CloudWatch → OpenTelemetry bridge pattern appears as the obvious solution. But 'obvious' and 'correct' rarely coincide in architecture. This pattern has a specific anatomy, a narrow validity envelope, and failure modes that only surface in production under real load. I'm going to dissect every layer.
The Real Problem: Observability Fragmentation in AWS-Native Environments
Every organization that grows beyond two or three engineering teams faces the same tension: AWS services emit metrics natively to CloudWatch — Lambda, RDS, EKS, API Gateway, MSK — but the corporate observability backend speaks OTLP. The result is a split world: SREs need to open two consoles to correlate an incident, alerts live in different namespaces, and business dashboards become impossible to build without manual ETL.
The bridge pattern exists to solve exactly this. The core idea is simple: a Lambda function (or an OTel collector running on ECS/EKS) either receives metrics pushed through CloudWatch Metric Streams or polls the GetMetricData API, transforms the payload to OTLP format, and forwards it to a collector endpoint. In theory, you get a single observability control plane. In practice, the complexity hides in the details.
What makes this problem especially treacherous in financial environments is the combination of three factors: (1) metric volume: a mid-size AWS account with EKS, Multi-AZ RDS, and Lambda can easily carry 50,000+ active metric series reporting every minute; (2) business latency: anomaly detection SLOs require freshness of 60 seconds or less; (3) API cost: GetMetricData is billed per metric requested, accepts at most 500 metrics per request, and is subject to a per-account, per-Region request-rate quota, meaning naive polling at scale strains both the budget and AWS rate limits.
Anatomy of the CloudWatch → OTel Bridge Pattern
Complete flow from metric emission by AWS services to the external observability backend, through the two ingestion paths (Metric Streams and polling) and the security and cost guardrails.
- Lambda · Invocations/Errors/Duration
- EKS / EC2 · CPU, Memory, Network
- RDS / Aurora · DBConnections, Latency
- API Gateway · 4xx/5xx, Latency
- CloudWatch · Namespaces
- CW Metric Streams · ~2-3s latency, Firehose
- Kinesis Firehose · JSON/OTel0.7 format
- Bridge Lambda · OTLP transform + retry
- SQS DLQ · failed batches
- KMS CMK · payload encryption
- OTel Collector · OTLP/gRPC :4317
- Datadog / Grafana / Honeycomb
Pattern Anatomy: Two Paths, One Fundamental Trade-off
The pattern has two ingestion flavors, and the choice between them defines everything that follows.
CloudWatch Metric Streams + Kinesis Firehose is the low-latency path. Streams deliver data in OpenTelemetry 0.7 (protobuf) or JSON format with 2–3 second latency. The cost is predictable: $0.003 per 1,000 metric updates. For an account with 100k active series that each update once per minute, that's about 4.3 billion updates in a 30-day month, or roughly $13,000/month before Firehose costs, which is why filtering at the stream matters so much. The critical architectural advantage is that you're not polling: data flows, and the Lambda at the Firehose destination receives batches, not individual calls.
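The monthly figure is very sensitive to how many series are streamed and how often each one updates, so it is worth recomputing for your own account. A minimal sketch of the arithmetic, assuming one update per series per minute and a 30-day month (both are assumptions, as is the function name):

```python
PRICE_PER_1000_UPDATES = 0.003  # USD, the Metric Streams rate quoted above


def monthly_stream_cost(active_series: int,
                        updates_per_minute: float = 1.0,
                        days: int = 30) -> float:
    """Estimated Metric Streams charge in USD, excluding Firehose and Lambda."""
    minutes = days * 24 * 60
    updates = active_series * updates_per_minute * minutes
    return updates / 1000 * PRICE_PER_1000_UPDATES


if __name__ == "__main__":
    for series in (2_000, 10_000, 100_000):
        print(f"{series:>7} series -> ${monthly_stream_cost(series):,.0f}/month")
```

Under these assumptions the cost scales linearly with the number of streamed series, so every namespace removed from the stream shows up directly in the bill.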
Polling via GetMetricData is the fine-grained control path. You choose exactly which metrics, at what resolution, and what lookback period. But the API accepts at most 500 metrics per request and has a default quota of 50 requests per second per account and Region (a soft limit). In a large account, hitting that limit is a matter of minutes if the polling code doesn't batch correctly and doesn't implement exponential backoff with jitter. I've seen financial production systems break critical alerts because the poller entered throttling at 09:00 on a Monday morning, exactly when the market opens and transaction volume explodes.
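The two safeguards named above, batching and exponential backoff with jitter, can be sketched without any AWS dependency. This is an illustration, not the poller from any real deployment: `Throttled` is a hypothetical stand-in for the SDK's throttling error, `fetch` is whatever function wraps the actual GetMetricData call, and the per-request limit is passed in rather than hard-coded.

```python
import random
import time
from typing import Callable, Iterator, List, Sequence


class Throttled(Exception):
    """Hypothetical stand-in for the SDK's throttling error."""


def batches(items: Sequence, size: int) -> Iterator[Sequence]:
    """Split the metric queries into chunks no larger than the per-request limit."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(call: Callable, max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep):
    """Retry a throttled call; max_attempts must be at least 1."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller alarm on it
            sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be at least 1")


def poll_all(queries: Sequence, fetch: Callable[[Sequence], List],
             per_request_limit: int,
             sleep: Callable[[float], None] = time.sleep) -> List:
    """Fetch every batch, backing off whenever the API throttles."""
    results: List = []
    for batch in batches(queries, per_request_limit):
        results.extend(call_with_backoff(lambda: fetch(batch), sleep=sleep))
    return results
```

The jitter matters as much as the exponent: without it, every poller that was throttled at the same moment retries at the same moment.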
The transformation Lambda needs three non-negotiable capabilities: (1) idempotency — Firehose can re-deliver batches on failure; the Lambda must detect duplicates via payload hash; (2) circuit breaker for the external OTel endpoint — if the collector is down, the Lambda cannot loop consuming concurrency; (3) DLQ with alarm — batches that fail after 3 attempts go to SQS DLQ and a CloudWatch alarm must fire within 5 minutes.
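The circuit breaker in point (2) can be as small as the sketch below. It is an assumption-laden illustration: the class and exception names are hypothetical, the thresholds are placeholders, and the state lives in memory, so in Lambda it only persists for as long as a given execution environment stays warm.

```python
import time


class CircuitOpen(Exception):
    """Raised when the export endpoint is currently marked unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of holding a concurrency slot on a dead endpoint.
                raise CircuitOpen("export endpoint marked unhealthy")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

When `CircuitOpen` is raised, the handler should let the batch fail quickly so it follows the DLQ path rather than retrying inside the invocation.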
When This Pattern Makes Sense
Security and Guardrails: What the Tutorials Don't Tell You
In financial environments, the observability bridge is an underestimated data exfiltration vector. Business metrics — transaction volume, payment latency, authentication error rates — are sensitive information. A misconfigured OTel endpoint can leak this data out of the AWS account without any alarm.
The three control layers I implement in every deployment of this pattern:
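IAM with least-privilege scoping: The poller Lambda role should be granted only cloudwatch:GetMetricData and cloudwatch:ListMetrics, never cloudwatch:*. Be aware that these two read actions do not support resource-level permissions, so a condition such as aws:ResourceTag/Environment: production cannot narrow them; reserve tag conditions for the actions that do support them. For Metric Streams, the role that CloudWatch assumes to write into Firehose needs firehose:PutRecord and firehose:PutRecordBatch on the one delivery stream, while cloudwatch:PutMetricStream belongs to the deployment role that creates the stream. The destination Lambda role must be separate and have only s3:GetObject on the buffer bucket (if using S3 as fallback) and invocation permission.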
KMS CMK for the buffered payload: Firehose must be configured with server-side encryption using a CMK managed by the security team, and any role that reads the buffered or fallback data needs kms:Decrypt on that same key. This ensures that even unauthorized access to the Firehose stream doesn't expose readable data. Protection in transit comes from TLS on the export connection, not from this key.
VPC Endpoint for the OTel collector: If the collector runs on ECS inside the VPC, the Lambda must be in the same VPC and use private DNS. If the collector is external (SaaS), traffic must exit through a NAT Gateway with a fixed IP whitelisted in the vendor's firewall — never through an Internet Gateway without egress control. Add a WAF rule on API Gateway if the external collector exposes an HTTP endpoint.
A detail that burned one of my clients: the bridge Lambda running with a 15-minute timeout (maximum) and no reserved concurrency can consume the entire account concurrency pool during an ingestion spike, bringing down critical business functions. Reserve explicit concurrency — typically 10–20 instances are sufficient for a Firehose with a 5MB batch.
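Reserving concurrency is a single API call. The sketch below wraps boto3's `put_function_concurrency`; the function name `cw-otel-bridge` and the default of 20 are placeholders taken from the range above, not values from a real deployment.

```python
def reserve_bridge_concurrency(lambda_client, function_name: str, limit: int = 20) -> int:
    """Cap the bridge function so it cannot drain the account concurrency pool.

    `lambda_client` is expected to be a boto3 Lambda client.
    """
    if limit < 1:
        raise ValueError("a reserved concurrency of 0 would disable the function")
    response = lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
    return response["ReservedConcurrentExecutions"]


# Usage (requires AWS credentials; the function name is hypothetical):
#   import boto3
#   reserve_bridge_concurrency(boto3.client("lambda"), "cw-otel-bridge", limit=20)
```

Reserved concurrency works in both directions: it caps the bridge, and it also guarantees the bridge those slots when other functions spike.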
Anti-Patterns: When This Bridge Will Explode in Production
- Naive polling without backoff: Calling GetMetricData in a loop with a fixed 60s interval for hundreds of namespaces. In accounts with many services, you hit the rate limit in minutes and lose observability data exactly when you need it most: during incidents.
- Lambda without DLQ and without error alarm: Silent transformation or export failures mean metrics disappear without any signal. In financial environments, this can mask service degradation for hours.
- Forwarding all metrics from all namespaces: CloudWatch has over 200 namespaces. Forwarding everything to the external backend multiplies SaaS ingestion cost by 5–10x without proportional value. Filter by namespace and metric name in the Metric Stream itself.
- No idempotency in the transformation Lambda: Firehose guarantees at-least-once delivery. Without payload hash deduplication, you inject duplicate series into the OTel backend, corrupting aggregations and sum/count-based alerts.
- Using this pattern for traces and logs: The CloudWatch → OTel bridge is designed for metrics. Trying to forward CloudWatch Logs Insights or X-Ray traces through the same channel creates a fragile multi-purpose system. Use the OTel Collector directly in applications for traces and logs.
- No reserved concurrency on the Lambda: During ingestion spikes (market open, nightly batch), the bridge Lambda can exhaust the account concurrency pool and throttle critical business functions.
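The idempotency anti-pattern in the list above is cheap to avoid. A minimal sketch of payload-hash deduplication, assuming records arrive as raw bytes; `SeenBatches` is a hypothetical in-memory stand-in, and a real bridge would need a shared store with expiry because separate Lambda instances do not share memory.

```python
import hashlib
from typing import Callable, Iterable


def payload_fingerprint(payload: bytes) -> str:
    """Stable identity for a batch: identical bytes give an identical fingerprint."""
    return hashlib.sha256(payload).hexdigest()


class SeenBatches:
    """In-memory record of exported fingerprints (illustration only)."""

    def __init__(self) -> None:
        self._seen = set()

    def __contains__(self, fingerprint: str) -> bool:
        return fingerprint in self._seen

    def add(self, fingerprint: str) -> None:
        self._seen.add(fingerprint)


def export_once(records: Iterable[bytes], seen: SeenBatches,
                export: Callable[[bytes], None]) -> int:
    """Export each distinct payload once; return how many were actually sent."""
    sent = 0
    for payload in records:
        fingerprint = payload_fingerprint(payload)
        if fingerprint in seen:
            continue  # re-delivered batch, already exported
        export(payload)
        seen.add(fingerprint)  # mark only after a successful export
        sent += 1
    return sent
```

Marking the fingerprint after the export, not before, keeps a failed export retryable instead of silently dropping it on the next delivery.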
Reference Design: What Actually Works in Financial Production
After implementing and debugging this pattern in three different financial environments, the design that works has these concrete characteristics:
Ingestion via Metric Streams with namespace filter: Configure the stream to include only relevant namespaces: AWS/Lambda, AWS/EKS, AWS/RDS, AWS/ApiGateway, and AWS/Kafka (the namespace MSK publishes to). A stream takes either an include list or an exclude list, not both, so the include list is what keeps high-cost, low-value namespaces like AWS/Billing and AWS/CloudFront out (unless you monitor CDN). The format should be opentelemetry0.7 (protobuf), not JSON, to reduce payload size by ~40%.
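As a sketch of what that stream definition looks like, the helper below builds the keyword arguments for boto3's CloudWatch `put_metric_stream` call. The helper name, the stream name, and the ARNs are hypothetical; the parameter names (`Name`, `FirehoseArn`, `RoleArn`, `OutputFormat`, `IncludeFilters`) are the ones the API defines.

```python
from typing import Dict, List


def metric_stream_request(name: str, firehose_arn: str, role_arn: str,
                          namespaces: List[str]) -> Dict:
    """Build put_metric_stream arguments with an explicit namespace include list."""
    if not namespaces:
        # With no filters at all, a stream forwards every namespace.
        raise ValueError("refusing to create an unfiltered metric stream")
    return {
        "Name": name,
        "FirehoseArn": firehose_arn,
        "RoleArn": role_arn,
        "OutputFormat": "opentelemetry0.7",
        "IncludeFilters": [{"Namespace": ns} for ns in namespaces],
    }


# Usage (requires AWS credentials; every value below is a placeholder):
#   import boto3
#   request = metric_stream_request(
#       "bridge-stream",
#       "arn:aws:firehose:us-east-1:123456789012:deliverystream/bridge",
#       "arn:aws:iam::123456789012:role/bridge-stream-role",
#       ["AWS/Lambda", "AWS/RDS", "AWS/ApiGateway"],
#   )
#   boto3.client("cloudwatch").put_metric_stream(**request)
```

Keeping the namespace list in code or infrastructure-as-code also makes every addition to the stream a reviewable change with a visible cost.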
Firehose with 60s/5MB buffer and S3 as fallback: Configure BufferingHints with IntervalInSeconds: 60 and SizeInMBs: 5. Keep in mind that a 60-second buffer interval can add up to 60 seconds of delay on its own, so count it against the freshness SLO. The fallback S3 bucket should have a lifecycle policy to expire objects after 7 days: it exists only for manual replay in case of external collector failure, not as permanent storage.
Transformation Lambda in Python/Rust with OTel SDK: Use opentelemetry-sdk to build the OTLP payload. Add fixed resource attributes at transformation time: cloud.account.id, cloud.region, deployment.environment. These attributes enable cross-account correlation in the backend. Timeout should be 3 minutes (not 15) — if transforming a batch takes more than 3 minutes, there's a volume problem that needs to be solved in the stream filter, not in the timeout.
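To make the shape of the output concrete, here is a standard-library sketch of an OTLP/JSON metrics payload carrying those three resource attributes. It is not the opentelemetry-sdk code path recommended above: the input datapoint shape, the scope name, and the choice to map every value to a gauge are simplifying assumptions.

```python
import json
from typing import Dict, List


def string_attr(key: str, value: str) -> Dict:
    """OTLP/JSON encoding of a string attribute."""
    return {"key": key, "value": {"stringValue": value}}


def to_otlp_json(datapoints: List[Dict], account_id: str, region: str,
                 environment: str) -> Dict:
    """Wrap simplified datapoints in one resourceMetrics entry.

    Each datapoint is assumed to look like:
    {"name": str, "unit": str, "timestamp_ms": int, "value": float}
    """
    metrics = [
        {
            "name": dp["name"],
            "unit": dp.get("unit", ""),
            "gauge": {
                "dataPoints": [{
                    # 64-bit integers are encoded as strings in OTLP/JSON.
                    "timeUnixNano": str(dp["timestamp_ms"] * 1_000_000),
                    "asDouble": float(dp["value"]),
                }]
            },
        }
        for dp in datapoints
    ]
    return {
        "resourceMetrics": [{
            "resource": {"attributes": [
                string_attr("cloud.account.id", account_id),
                string_attr("cloud.region", region),
                string_attr("deployment.environment", environment),
            ]},
            "scopeMetrics": [{
                "scope": {"name": "cw-otel-bridge"},  # hypothetical scope name
                "metrics": metrics,
            }],
        }]
    }


if __name__ == "__main__":
    sample = [{"name": "aws.lambda.errors", "unit": "1",
               "timestamp_ms": 1_700_000_000_000, "value": 2}]
    print(json.dumps(to_otlp_json(sample, "123456789012", "us-east-1", "production"), indent=2))
```

Attaching the attributes at the resource level, once per batch, keeps the payload small and gives the backend a single place to join on account and Region.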
Observability of the bridge itself: Instrument the Lambda with custom metrics: bridge.metrics.transformed.count, bridge.export.latency.p99, bridge.export.errors.count. Create a CloudWatch dashboard for the bridge itself. It's ironic but necessary: you need to observe the observability system. Define the SLO on the export error ratio, bridge.export.errors.count divided by bridge.metrics.transformed.count, at less than 0.1%, with a 5-minute burn rate alarm.
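A count cannot be compared with a percentage directly, so the alarm is usually computed from a ratio of the two counters. A minimal sketch of that calculation over one 5-minute window; the function names are hypothetical and the alarm threshold is left as a parameter because it depends on how fast you are willing to spend the error budget.

```python
def error_ratio(errors: int, transformed: int) -> float:
    """Share of exports that failed in the window; 0.0 when nothing was processed."""
    return 0.0 if transformed == 0 else errors / transformed


def burn_rate(errors: int, transformed: int, error_budget: float = 0.001) -> float:
    """How many times faster than allowed the 0.1% error budget is being spent."""
    return error_ratio(errors, transformed) / error_budget


def should_alarm(errors: int, transformed: int, threshold: float,
                 error_budget: float = 0.001) -> bool:
    """True when the window burns budget at or above the chosen threshold."""
    return burn_rate(errors, transformed, error_budget) >= threshold
```

For example, 30 failed exports out of 10,000 in a window is a 0.3% error ratio, about three times the budgeted rate. An empty window reports a burn rate of zero, so a bridge that has stopped processing entirely needs a separate freshness or volume alarm.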
Real Production Numbers
Filter at the Stream, Not in Lambda
CloudWatch Metric Streams supports filters by namespace and by metric name. Use them. Every metric you don't forward saves Firehose cost, Lambda cost, and ingestion cost in the external backend. The transformation Lambda should be dumb and fast: transform format, add resource attributes, and export. Filtering logic in Lambda is an anti-pattern: you've already paid for the data when it arrives at Firehose.
Well-Architected Lenses for This Pattern
Security
IAM with resource tag conditions, KMS CMK on Firehose, Lambda in private VPC, controlled egress via NAT with fixed IP. The Lambda role must never have cloudwatch:* — minimum scope per namespace.
Reliability
DLQ at Firehose destination, circuit breaker for the external OTel endpoint, burn rate alarm on the bridge SLO. External collector failure testing must be part of the DR runbook.
I implemented this pattern for the first time in 2022 in a payments environment and learned the hard way that the observability bridge needs its own observability — it sounds obvious, but in practice it's always the last item on the list. What concerns me most about this pattern is not the technical complexity, but the false sense of security it creates: teams assume that because metrics are 'flowing,' observability is working. It's not — not until you have freshness SLOs, error alarms on the bridge, and a runbook for external collector failure. In financial environments, I always require the team to demonstrate what happens when the external OTel endpoint is unavailable for 10 minutes before going to production. The answer to that question reveals whether the design has real guardrails or just good intentions.
Verdict: Use with Surgery, Not Enthusiasm
The CloudWatch → OTel Bridge pattern via Lambda and Metric Streams is technically sound and solves a real observability fragmentation problem. But it has a non-trivial operational cost and a narrow validity envelope. Use it when: (a) you have >10k active metric series from managed AWS services that need to be in an external OTLP backend; (b) your freshness SLO is ≤60 seconds; (c) the platform team has the capacity to operate and observe the bridge itself. Don't use it when: you're trying to consolidate traces and logs through the same channel; when volume doesn't justify the fixed Metric Stream cost; or when there isn't operational maturity to maintain a system that observes other systems. The highest risk is not technical — it's organizational: teams that deploy this pattern and then don't monitor it create an observability blind spot disguised as an observability solution.