ML Observability on EKS: Logs, Metrics and Tracing Head-to-Head
ML workloads on EKS generate telemetry volumes that expose the limits of any observability pipeline not designed for that profile. In this article I compare four collection and routing approaches for logs and metrics, focusing on real cost, diagnostic latency and fitness for regulated financial environments.
When a distributed training job on EKS starts diverging silently (gradients exploding, GPU workers idle, data throughput halved), the time between the first anomalous signal and a confirmed diagnosis depends largely on the quality of the observability pipeline you chose to build. In financial environments where a GPU instance-hour costs between roughly $3 and $32 (p3.2xlarge to p4d.24xlarge, on-demand), that diagnostic window is not an operational detail: it is a direct cost line. This article is an honest bake-off between four observability strategies for ML on EKS: native Fluent Bit, OpenTelemetry Collector (OTel), CloudWatch Container Insights with Enhanced Observability, and Datadog Agent with DogStatsD. Each carries a radically different cost, latency and operational complexity profile, and the wrong choice at scale adds a five-figure monthly bill on top of the workload you are trying to monitor.
Why ML workloads on EKS are a special observability case
A stateless inference pod generates predictable telemetry: a few hundred log lines per minute, CPU/memory metrics, request latency. A distributed training job with PyTorch DDP or Ray Train across 32 GPU nodes is an entirely different category.
First, log volume is non-linear with worker count. Each rank emits epoch progress, checkpointing, gradient norm and NCCL collective logs. With 32 workers and 10-minute epochs, it is common to see 50–200 MB/min of unstructured stdout arriving at the collection DaemonSet — before any business-application log.
Second, GPU metrics have high cardinality. DCGM Exporter exposes ~40 metrics per GPU (SM utilization, memory bandwidth, NVLink throughput, tensor core activity, ECC errors). On a p4d.24xlarge node with 8 A100s that is 320 time series per node, scraped every 15 s. Across 20 nodes you have 6,400 active series. That is still inside CloudWatch's first pricing tier (the first 10,000 custom metrics, of which only 10 are free), but every one of those series is billed, so the custom-metrics budget blows up quickly if you do not filter at the source.
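The cardinality arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the defaults (8 GPUs per node, ~40 metrics per GPU) are the figures quoted in the paragraph, not values read from a live DCGM Exporter.

```python
def dcgm_series(nodes: int, gpus_per_node: int = 8, metrics_per_gpu: int = 40) -> int:
    """Active time series if every DCGM metric is exported for every GPU."""
    return nodes * gpus_per_node * metrics_per_gpu


def samples_per_minute(series: int, scrape_interval_s: int) -> float:
    """Datapoints produced per minute at a fixed scrape interval."""
    return series * 60 / scrape_interval_s


if __name__ == "__main__":
    print(dcgm_series(nodes=1))           # series on one p4d.24xlarge node
    cluster = dcgm_series(nodes=20)       # series on the 20-node cluster
    print(cluster)
    print(samples_per_minute(cluster, 15))
    print(samples_per_minute(cluster, 30))
```

Doubling the scrape interval leaves the number of series unchanged but halves the datapoints per minute, which is why interval and metric filtering are separate levers.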
Third, distributed tracing in training differs from microservice tracing. There is no request trace; there is an execution graph of CUDA operations, collective communication and data I/O. OpenTelemetry still lacks native semantics for this — you must manually instrument PyTorch hooks or use MLflow Tracing, which is a separate layer.
These three characteristics — explosive log volume, high-cardinality GPU metrics and absent native tracing — define the evaluation criteria for this bake-off.
The four contenders: architecture and positioning
Fluent Bit (native EKS DaemonSet) is the default log collector in the aws-for-fluent-bit add-on. It tails /var/log/containers/*.log, parses JSON or regex, and routes to CloudWatch Logs, S3, Kinesis Data Streams or Firehose. Backpressure configuration via mem_buf_limit and storage.type filesystem is critical: without it, a 200 MB/min burst of training logs will OOM the DaemonSet on memory-constrained nodes. Its strength is lightness — ~50 MB memory in normal operation — and native IAM integration via IRSA. Its weakness is that it collects no metrics or traces; it is a pure log collector.
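To get a feel for why the buffer limit matters, here is a rough back-of-the-envelope sketch of how long an in-memory buffer survives a burst. The 256 MB limit and the 200 MB/min burst come from this article; the 120 MB/min flush rate is a made-up assumption, and the model ignores chunking and compression.

```python
def seconds_until_full(buffer_mb: float, ingest_mb_per_min: float,
                       flush_mb_per_min: float) -> float:
    """Seconds until the buffer fills; infinity if the output keeps up."""
    net_mb_per_min = ingest_mb_per_min - flush_mb_per_min
    if net_mb_per_min <= 0:
        return float("inf")
    return buffer_mb / net_mb_per_min * 60


if __name__ == "__main__":
    # Hypothetical flush rate of 120 MB/min towards the backend.
    print(seconds_until_full(256, 200, 120))
```

Under these assumptions the buffer is full in a little over three minutes, so a sustained burst has to be absorbed by filesystem buffering rather than by memory alone.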
OpenTelemetry Collector (OTel) is the most versatile contender. Deployed as a DaemonSet or Deployment via opentelemetry-operator, it unifies logs (via filelog receiver), metrics (via prometheusreceiver scraping DCGM) and traces (via otlp receiver) in a single pipeline. Operational cost is higher: the pipeline needs tuning of batch processor (send_batch_size, timeout), memory_limiter and parallel exporters. But the ability to route to multiple backends simultaneously — CloudWatch, S3 via OTLP/Parquet, Jaeger — without duplicating agents is the real architectural differentiator.
CloudWatch Container Insights with Enhanced Observability for EKS is the AWS managed option. Enabled via the amazon-cloudwatch-observability add-on, it automatically installs a pre-configured OTel Collector, DCGM Exporter and Fluent Bit. Operational cost is close to zero, but financial cost is high: with the rates used in this article, Enhanced Observability is billed per vCPU-hour ($0.009) and per GB-memory-hour for each monitored node. On a 20-node p4d.24xlarge cluster (96 vCPUs and 1,152 GB RAM each), that is ~$17,400/month in observability alone.
Datadog Agent with DogStatsD is the most complete enterprise option. The datadog/datadog Helm chart installs the Agent as a DaemonSet with log collection, metrics (including native DCGM integration), APM and NPM. The differentiator is ML Observability (formerly Weights & Biases integration) and automatic correlation between logs, metrics and traces. Cost is $15–23/host/month depending on tier, plus log ingestion at $0.10/GB after the free tier.
Figure: ML Observability Pipelines on EKS: Four Approaches in Parallel
Each column represents a collection strategy. GPU nodes and DCGM Exporter are shared. Arrows show telemetry flow to the analytics backend.
- Sources: Training Pod (PyTorch DDP / Ray); DCGM Exporter (/metrics :9400); stdout/stderr (/var/log/containers)
- Collectors: Fluent Bit (mem_buf_limit=256MB); OTel Collector (filelog+prom+otlp); CW Addon (amazon-cloudwatch-obs); Datadog Agent (DogStatsD :8125)
- Backends: CloudWatch Logs (/aws/eks/ml-cluster); CloudWatch Metrics (Custom NS: ML/GPU); S3 + Parquet (long-term archive); Kinesis Data Streams (hot path routing); Datadog SaaS (ML Observability); Jaeger / Tempo (distributed traces)
Technical Comparison: Four ML Observability Strategies on EKS
| Criterion | Native Fluent Bit | OTel Collector | CW Container Insights Enhanced | Datadog Agent |
|---|---|---|---|---|
| Signals covered | Logs only | Logs + Metrics + Traces | Logs + Metrics (GPU via DCGM) | Logs + Metrics + Traces + APM |
| Memory overhead per node | ~50 MB | 150–400 MB (pipeline size) | 200–350 MB (bundle) | 300–600 MB (full agent) |
| Monthly cost (20 p4d.24xlarge nodes) | ~$180 (CW Logs ingestion) | ~$400–800 (CW EMF + Logs) | ~$17,400 (Enhanced vCPU/mem fee) | ~$460 + log ingestion |
| Diagnostic latency (P99) | 30–90 s (CW Logs Insights query) | 10–30 s (backend dependent) | 15–45 s (CW dashboards) | 5–15 s (Live Tail + alerts) |
| GPU metrics support (DCGM) | Not native | Via prometheusreceiver | Native (add-on installs DCGM) | Native (DCGM integration) |
| Compliance / data sovereignty | High (data stays in AWS) | High (configurable per backend) | High (data stays in AWS) | Medium (data leaves to SaaS) |
| Operational complexity | Low | High (pipeline YAML, tuning) | Low (managed) | Medium (Helm + API key rotation) |
| Vendor lock-in | AWS (moderate) | Minimal (open standard) | AWS (high) | Datadog (high) |
The real problem with Enhanced Observability: hidden vCPU cost
The amazon-cloudwatch-observability add-on with Enhanced Observability is the simplest option to enable: one eksctl create addon call and you have DCGM, Fluent Bit and OTel Collector pre-configured. But the pricing model is a trap for ML clusters.
Enhanced Observability charges per vCPU-hour and GB-memory-hour per monitored node, regardless of how many metrics you actually use. A p4d.24xlarge has 96 vCPUs and 1,152 GB RAM. At $0.009/vCPU-hour, the vCPU dimension alone costs $0.864/hour per node. With 20 nodes running 24/7 that is $12,441/month — and the memory dimension adds another ~$4,976/month. Total: ~$17,417/month.
For comparison, the compute cost of those 20 nodes is ~$471,900/month (at $32.77/hour per node over a 720-hour month). So observability represents ~3.7% of compute cost, which might seem reasonable until you realise that a standalone OTel Collector with EMF exporter to CloudWatch Metrics covers roughly 90% of the same use cases for ~$800/month.
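The cost comparison can be checked with a short calculation. This sketch assumes a 720-hour month and uses the $0.009/vCPU-hour and $32.77/hour rates quoted in this article; the memory fee is entered as the monthly figure stated above rather than derived from a rate. Confirm current prices before relying on any of these numbers.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 h, the convention used in this article


def vcpu_fee_per_month(nodes: int, vcpus_per_node: int, rate_per_vcpu_hour: float) -> float:
    """Monthly fee for the per-vCPU pricing dimension."""
    return nodes * vcpus_per_node * rate_per_vcpu_hour * HOURS_PER_MONTH


def compute_cost_per_month(nodes: int, node_rate_per_hour: float) -> float:
    """Monthly on-demand compute cost for the node group."""
    return nodes * node_rate_per_hour * HOURS_PER_MONTH


if __name__ == "__main__":
    vcpu_fee = vcpu_fee_per_month(nodes=20, vcpus_per_node=96, rate_per_vcpu_hour=0.009)
    memory_fee = 4976.0  # monthly memory-dimension figure quoted in the article
    observability = vcpu_fee + memory_fee
    compute = compute_cost_per_month(nodes=20, node_rate_per_hour=32.77)
    print(f"vCPU fee:      ${vcpu_fee:,.0f}/month")
    print(f"observability: ${observability:,.0f}/month")
    print(f"compute:       ${compute:,.0f}/month")
    print(f"share:         {observability / compute:.1%}")
    print(f"vs ~$800 OTel: {observability / 800:.0f}x")
```

Swapping in smaller nodes (fewer vCPUs per node) shows how quickly the per-vCPU fee becomes negligible on dev/staging instance types.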
Practical recommendation: use Enhanced Observability only on dev/staging clusters with smaller nodes (m5, c5) where the per-vCPU cost is negligible. On production clusters with large GPU instances, build the OTel pipeline manually with memory_limiter set to 80% of the container limit, batch processor with send_batch_size: 1000 and timeout: 10s, and prometheusreceiver scraping DCGM every 30 s (not 15 s — you halve metric volume without meaningful diagnostic loss).
OTel Collector: the pipeline worth the operational cost
The OpenTelemetry Collector is the only one of the four contenders that solves the problem of multiple backends without agent duplication. In regulated financial environments this is often mandatory: you need to send logs to CloudWatch (7-year regulatory retention via S3 Glacier), metrics to an internal analytics backend (Prometheus + Thanos or Amazon Managed Prometheus) and traces to Jaeger or Tempo — all simultaneously, with different delivery guarantees.
The pipeline configuration I use in production for ML workloads has three critical stages:
1. Parallel receivers with memory isolation: filelog receiver with start_at: end rather than start_at: beginning (prevents re-reading historical logs on pod restart), prometheusreceiver with scrape_interval: 30s and target_allocator enabled to distribute scraping across multiple Collectors when the cluster has >50 nodes.
2. Chained processors with circuit breaker: memory_limiter as the first processor (limit at 80% of the container's resources.limits.memory), followed by resource processor to add k8s.cluster.name, ml.job.id and gpu.node.type attributes — essential for cross-signal correlation. The batch processor comes last, not first: placing batch before memory_limiter is the most common mistake I see in architecture reviews.
3. Exporters with retry and persistent queue: awscloudwatchlogs exporter with log_stream_name derived from k8s.pod.name (avoids per-stream throttling), awsemf exporter with explicit metric_declarations to avoid sending all 320 DCGM series to CloudWatch (select the 8–12 that actually matter: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, DCGM_FI_DEV_ECC_DBE_VOL_TOTAL). retry_on_failure with max_elapsed_time: 300s and sending_queue with storage: file_storage ensures a temporary CloudWatch throttle does not lose training data.
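The processor ordering rule from stage 2 is easy to enforce in a review script or CI check. The sketch below is a hypothetical helper, not part of any OTel tooling: it takes the processors list of one pipeline (as it would appear under service.pipelines in the Collector config) and reports ordering problems.

```python
def check_processor_order(processors: list) -> list:
    """Return a list of ordering problems for one pipeline's processors."""
    # Collector component IDs can carry a suffix, e.g. "batch/logs".
    kinds = [name.split("/")[0] for name in processors]
    problems = []
    if not kinds or kinds[0] != "memory_limiter":
        problems.append("memory_limiter should be the first processor")
    if "batch" in kinds and kinds[-1] != "batch":
        problems.append("batch should be the last processor")
    return problems


if __name__ == "__main__":
    good = ["memory_limiter", "resource", "batch"]
    bad = ["batch", "memory_limiter", "resource"]
    print(check_processor_order(good))
    print(check_processor_order(bad))
```

Running a check like this against every pipeline in the rendered config catches the batch-before-limiter mistake before it reaches a cluster.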
Decision Matrix: Which Strategy for Which Context
Native Fluent Bit
- Minimal overhead (~50 MB); ideal for nodes where memory is contested by GPU containers
- Native IRSA integration; no static credentials
- Routing to Kinesis for real-time alert hot-path
- Covers logs only; GPU metrics require a separate agent
- No native correlation between logs and metrics
Use as a complement to another collector, not as a standalone solution
OTel Collector
- Unifies logs, metrics and traces in a single DaemonSet
- Open standard; no lock-in; exports to any backend
- 21× lower cost than Enhanced Observability on large GPU clusters
- target_allocator scales scraping horizontally
- High configuration complexity; pipeline mistakes cause silent data loss
- Requires expertise in OTel pipeline YAML and memory tuning
Best choice for production on medium-to-large GPU clusters with multi-backend requirements
CW Container Insights Enhanced
- Zero operation; AWS-managed add-on
- Native DCGM without additional configuration
- Ideal for teams without OTel expertise
- Prohibitive cost on large GPU instances (~$17k/month for 20 p4d nodes)
- Strong CloudWatch lock-in; difficult to migrate historical data
Acceptable only on dev/staging clusters with smaller instances; never on large-scale GPU production
Datadog Agent
- Lowest diagnostic latency (<15 s with Live Tail)
- ML Observability with automatic run/experiment correlation
- Native APM for serving pipelines (Triton, TorchServe)
- Data leaves AWS; problematic in environments with financial data sovereignty restrictions
- Cost grows linearly with hosts; can exceed OTel+CW on large clusters
- API key rotation requires additional security process (Secrets Manager + External Secrets Operator)
Best for ML teams that need fast diagnosis and have no data sovereignty restrictions
Security and governance: what changes when ML logs contain sensitive data
In financial environments, training logs frequently contain information that should not leave the security perimeter: customer IDs in validation datasets, feature values derived from transactions, or simply the model name and version (which is sensitive competitive information). This adds a layer of requirements that most observability comparisons ignore.
CloudWatch Logs with KMS: the collectors that write to CloudWatch Logs support KMS-encrypted log groups, but the data is encrypted with your own key only if the log group has a KMS key associated with it. With Fluent Bit, set auto_create_group Off and pre-create the group with aws logs create-log-group --kms-key-id arn:aws:kms:.... OTel Collector requires the log group to exist before the first PutLogEvents, which is a common race condition in clusters that scale rapidly.
IAM with context conditions: the collector's IRSA policy must have Condition: StringEquals: aws:RequestedRegion: [region] to prevent cross-region exfiltration, and aws:SourceVpc if you use VPC Endpoints for CloudWatch. The Datadog Agent stores the API key in a Kubernetes Secret — use External Secrets Operator with AWS Secrets Manager and 90-day automatic rotation, not a manual kubectl create secret.
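As an illustration of the condition keys above, the following sketch builds a policy document in Python. The function name, account ID, region and VPC ID are placeholders, and the action list is a minimal assumption for a log-writing collector; adapt it to what your exporter actually calls.

```python
import json
from typing import Optional


def collector_logs_policy(region: str, log_group_arn: str,
                          source_vpc: Optional[str] = None) -> dict:
    """Build an IAM policy document that only allows log writes in one region."""
    condition = {"StringEquals": {"aws:RequestedRegion": region}}
    if source_vpc is not None:
        # Only meaningful when requests reach CloudWatch through a VPC endpoint.
        condition["StringEquals"]["aws:SourceVpc"] = source_vpc
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": f"{log_group_arn}:*",
            "Condition": condition,
        }],
    }


if __name__ == "__main__":
    # Hypothetical account ID, region and VPC ID, for illustration only.
    arn = "arn:aws:logs:sa-east-1:111122223333:log-group:/aws/eks/ml-cluster"
    print(json.dumps(collector_logs_policy("sa-east-1", arn, "vpc-0abc1234"), indent=2))
```

Generating the document from code keeps the region and VPC conditions consistent across the IRSA roles of every collector.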
Data masking in the pipeline: OTel Collector has the transform processor with OTTL (OpenTelemetry Transformation Language) that allows masking fields before sending: replace_pattern(body, "customer_id=\\d+", "customer_id=REDACTED"). Fluent Bit has the lua filter for the same purpose. Datadog has Sensitive Data Scanner on the backend side, but that means the data has already left AWS — for regulated environments, masking must happen in the collector, not the backend.
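Before deploying a masking rule, it helps to test the pattern against sample log lines. The snippet below is a Python stand-in for the replace_pattern call above, not the OTTL implementation itself; it assumes the same regular expression behaves identically in both engines, which holds for a simple pattern like this one. The sample log line is invented.

```python
import re

CUSTOMER_ID = re.compile(r"customer_id=\d+")


def mask_customer_ids(line: str) -> str:
    """Replace every customer_id=<digits> occurrence with a redacted marker."""
    return CUSTOMER_ID.sub("customer_id=REDACTED", line)


if __name__ == "__main__":
    sample = "rank=3 step=1200 customer_id=987654 loss=0.412"
    print(mask_customer_ids(sample))
    # -> rank=3 step=1200 customer_id=REDACTED loss=0.412
```

A handful of such test lines, including ones that should pass through unchanged, makes a cheap regression suite for masking rules.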
Retention and immutability: for compliance with financial regulations (BACEN, CVM, SOX), configure S3 Object Lock in COMPLIANCE mode on log archive buckets with a 7-year retention period. Fluent Bit with S3 output supports s3_key_format /%Y/%m/%d/%H/ for temporal partitioning that facilitates audit queries with Athena.
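To see what the temporal partitioning looks like, this small sketch expands the same format string with Python's strftime. It assumes Fluent Bit expands these four specifiers the same way; the helper name and timestamp are illustrative.

```python
from datetime import datetime, timezone


def archive_prefix(ts: datetime) -> str:
    """Hourly partition prefix in the /%Y/%m/%d/%H/ layout."""
    return ts.strftime("/%Y/%m/%d/%H/")


if __name__ == "__main__":
    print(archive_prefix(datetime(2024, 3, 5, 14, tzinfo=timezone.utc)))
    # -> /2024/03/05/14/
```

With this layout an audit query for a single day only needs to scan the objects under that day's prefix.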
The most expensive architecture mistake: DCGM scraping without cardinality filtering
I have seen teams send all ~40 DCGM Exporter metrics to CloudWatch Metrics without metric_declarations. On a 30-node p4d.24xlarge cluster that generates 9,600 custom time series. CloudWatch charges $0.30 per custom metric per month for the first 10,000 (about $2,880/month here), but the real cost is not that. It is the API cost: PutMetricData has a limit of 1,000 values per call and 150 TPS per account. With 15 s scraping and 9,600 series you need at least ~10 full calls per cycle across the cluster, and more in practice because each node's Collector sends its own smaller batches. Once you start seeing ThrottlingException, OTel Collector handles it with exponential retry, which increases diagnostic latency exactly when you need it most. Filter to 8–12 essential metrics at the source, not the destination.
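The effect of source-side filtering can be sketched as an allowlist plus a call-count estimate. The four metric names are the ones listed earlier in this article; the helper names are illustrative, and the 1,000-values-per-call limit is the figure quoted above, treated here as an input rather than a verified quota.

```python
import math

ESSENTIAL_METRICS = {
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_MEM_COPY_UTIL",
    "DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL",
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL",
}


def keep_metric(name: str) -> bool:
    """Allowlist check applied before anything leaves the node."""
    return name in ESSENTIAL_METRICS


def min_calls_per_cycle(series: int, values_per_call: int = 1000) -> int:
    """Lower bound on API calls per scrape cycle with perfect batching."""
    return math.ceil(series / values_per_call)


if __name__ == "__main__":
    nodes, gpus = 30, 8
    print(min_calls_per_cycle(nodes * gpus * 40))  # all ~40 metrics per GPU
    print(min_calls_per_cycle(nodes * gpus * 10))  # filtered to 10 per GPU
```

The series count, and with it both the per-metric bill and the call volume, scales directly with the number of metrics kept per GPU.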
Anti-patterns I have seen in ML/EKS architecture reviews
- Enabling Enhanced Observability on GPU production clusters without calculating the per-vCPU-hour cost upfront
- Placing batch processor before memory_limiter in the OTel pipeline: the batch accumulates data in memory before the limiter can act
- Using start_at: beginning in the filelog receiver without a storage extension: causes re-reading of all log history on every Collector restart
- Storing Datadog API keys in Kubernetes Secrets without automatic rotation via External Secrets Operator
- Not configuring mem_buf_limit in Fluent Bit on GPU nodes: a training log burst can OOM the DaemonSet and stop collection for all pods on the node
- Sending PII data from validation datasets to SaaS backends without masking in the collector
In production I build the pipeline as OTel Collector with prometheusreceiver for DCGM (scrape at 30 s, not 15 s) + awsemf exporter with explicit metric_declarations for the 10 GPU metrics that actually matter + Fluent Bit only for the critical-log hot-path via Kinesis. Datadog is reserved for environments where the ML team needs <15 s diagnosis and there is no data sovereignty restriction — which in Brazilian banks is rarely the case. The most expensive lesson I have learned: observability cost on GPU clusters is not marginal — on a large-scale training cluster it can easily exceed the cost of a senior engineer per month if you do not consciously size the pipeline from the first deploy.
Final Recommendation: Composition, Not a Single Choice
There is no single winner in this bake-off because the four contenders solve different problems. The architecture I recommend for production ML clusters in financial environments is a deliberate composition:
Collection layer: OTel Collector as the primary DaemonSet, with opentelemetry-operator managing the lifecycle. Configure filelog receiver for logs, prometheusreceiver for DCGM (30 s, filtered to 10 metrics), and otlp receiver for serving traces. Use memory_limiter as the first processor, resource for ML attribute enrichment, and batch last.
Dual routing: awsemf exporter for critical metrics in CloudWatch (operational alerts), awscloudwatchlogs for KMS-encrypted logs, and an S3 export path (for example the awss3 exporter, with conversion to Parquet downstream) for long-term archival and Athena analysis. For clusters with serving trace requirements, add the awsxray exporter for AWS X-Ray, or an otlp exporter pointing at Jaeger or Tempo.
Fluent Bit as hot-path: keep the Fluent Bit DaemonSet only for routing critical logs (errors, OOM, checkpoint failures) to Kinesis Data Streams → Lambda → PagerDuty. This guarantees <30 s alert latency without OTel Collector overhead on the critical path.
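The hot-path selection boils down to a cheap pattern match on each log line. The sketch below shows one way to express it; the patterns are assumptions about what your training logs look like (only "CUDA out of memory" is taken from PyTorch's actual error text), and the same expression would need to be ported to whatever filter syntax the collector uses.

```python
import re

# Hypothetical patterns for errors, OOM events and checkpoint failures.
CRITICAL = re.compile(
    r"\b(ERROR|CUDA out of memory|OOMKilled|checkpoint.*failed)\b",
    re.IGNORECASE,
)


def is_critical(line: str) -> bool:
    """True if the line should be routed to the alerting hot path."""
    return CRITICAL.search(line) is not None


if __name__ == "__main__":
    lines = [
        "epoch=3 step=1200 loss=0.412",
        "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB",
        "checkpoint save failed: disk full",
    ]
    for line in lines:
        print(is_critical(line), line)
```

Keeping the pattern list short matters here: every extra alternative is evaluated against the full log volume on the node.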