# ML Observability on EKS: Logs, Metrics and Tracing Head-to-Head

ML workloads on EKS generate telemetry volumes that expose the limits of any observability pipeline not designed for that profile. In this article I compare four collection and routing approaches for logs and metrics, focusing on real cost, diagnostic latency and fitness for regulated financial environments.

- URL: https://fernando.moretes.com/blog/observabilidade-para-ml-em-eks-com-logs-metricas-e-custo

- Markdown: https://fernando.moretes.com/blog/observabilidade-para-ml-em-eks-com-logs-metricas-e-custo/article.md?lang=en

- Published: 2026-05-31T00:00:00.000Z

- Category: Data Platforms

- Tags: eks, mlops, observability, opentelemetry, cloudwatch, fluent-bit, datadog, cost-optimization

- Reading time: 9 min

- Source: [EKS ML core logging](https://aws.amazon.com/blogs/architecture/)

---

When a distributed training job on EKS starts diverging silently (gradients exploding, GPU workers idle, data throughput halved), the time between the first anomalous signal and a confirmed diagnosis depends largely on the quality of the observability pipeline you chose to build. In financial environments where every GPU instance-hour costs between roughly $3 and $33 on demand (p3.2xlarge to p4d.24xlarge), that diagnostic window is not an operational detail: it is a direct cost line. This article is an honest bake-off between four observability strategies for ML on EKS: native Fluent Bit, OpenTelemetry Collector (OTel), CloudWatch Container Insights with Enhanced Observability, and Datadog Agent with DogStatsD. Each carries a radically different cost, latency and operational complexity profile, and the wrong choice at scale becomes a significant cost line of its own.

## Why ML workloads on EKS are a special observability case

A stateless inference pod generates predictable telemetry: a few hundred log lines per minute, CPU/memory metrics, request latency. A distributed training job with PyTorch DDP or Ray Train across 32 GPU nodes is an entirely different category.

First, **log volume is non-linear with worker count**. Each rank emits epoch progress, checkpointing, gradient norm and NCCL collective logs. With 32 workers and 10-minute epochs, it is common to see 50–200 MB/min of unstructured stdout arriving at the collection DaemonSet — before any business-application log.

Second, **GPU metrics have high cardinality**. DCGM Exporter exposes ~40 metrics per GPU (SM utilization, memory bandwidth, NVLink throughput, tensor core activity, ECC errors). On a `p4d.24xlarge` node with 8 A100s that is 320 time series per node, scraped every 15 s. Across 20 nodes you have 6,400 active series. CloudWatch bills each one as a custom metric (only 10 are covered by the free tier, and the list price is $0.30 per metric per month for the first 10,000), so exporting everything unfiltered costs about $1,920/month and blows the custom-metrics budget if you do not filter at the source.
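
The cardinality arithmetic above is easy to rerun for your own cluster. The following is a minimal Python sketch, not a pricing tool: the function names are hypothetical, and the $0.30 per-metric price is an assumption based on the public CloudWatch list price for the first 10,000 custom metrics, which should be checked against the current pricing page.

```python
# Rough DCGM time-series and CloudWatch custom-metric cost estimate.
# All inputs are assumptions to adjust for your own cluster.

METRICS_PER_GPU = 40          # approximate DCGM Exporter metrics per GPU (from the text)
GPUS_PER_NODE = 8             # p4d.24xlarge
PRICE_PER_METRIC_USD = 0.30   # assumed list price per custom metric per month


def dcgm_series(nodes: int, metrics_per_gpu: int = METRICS_PER_GPU,
                gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Number of active time series if every listed DCGM metric is exported."""
    return nodes * gpus_per_node * metrics_per_gpu


def monthly_metric_cost(series: int, price: float = PRICE_PER_METRIC_USD) -> float:
    """Monthly cost if each series is billed as one custom metric."""
    return series * price


unfiltered = dcgm_series(nodes=20)
filtered = dcgm_series(nodes=20, metrics_per_gpu=10)  # keep ~10 metrics per GPU

print(unfiltered, round(monthly_metric_cost(unfiltered), 2))
print(filtered, round(monthly_metric_cost(filtered), 2))
```

Changing `metrics_per_gpu` is the quickest way to see how much an allow-list at the source saves before any data leaves the node.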

Third, **distributed tracing in training differs from microservice tracing**. There is no request trace; there is an execution graph of CUDA operations, collective communication and data I/O. OpenTelemetry still lacks native semantics for this — you must manually instrument PyTorch hooks or use MLflow Tracing, which is a separate layer.

These three characteristics — explosive log volume, high-cardinality GPU metrics and absent native tracing — define the evaluation criteria for this bake-off.

## The four contenders: architecture and positioning

**Fluent Bit (native EKS DaemonSet)** is the default log collector in the `aws-for-fluent-bit` add-on. It tails `/var/log/containers/*.log`, parses JSON or regex, and routes to CloudWatch Logs, S3, Kinesis Data Streams or Firehose. Backpressure configuration via `mem_buf_limit` and `storage.type filesystem` is critical: without it, a 200 MB/min burst of training logs will OOM the DaemonSet on memory-constrained nodes. Its strength is lightness — ~50 MB memory in normal operation — and native IAM integration via IRSA. Its weakness is that it collects no metrics or traces; it is a pure log collector.

**OpenTelemetry Collector (OTel)** is the most versatile contender. Deployed as a DaemonSet or Deployment via `opentelemetry-operator`, it unifies logs (via filelog receiver), metrics (via prometheusreceiver scraping DCGM) and traces (via otlp receiver) in a single pipeline. Operational cost is higher: the pipeline needs tuning of `batch processor` (send_batch_size, timeout), `memory_limiter` and parallel exporters. But the ability to route to multiple backends simultaneously — CloudWatch, S3 via OTLP/Parquet, Jaeger — without duplicating agents is the real architectural differentiator.

**CloudWatch Container Insights with Enhanced Observability for EKS** is the AWS managed option. Enabled via the `amazon-cloudwatch-observability` add-on, it automatically installs a pre-configured OTel Collector, DCGM Exporter and Fluent Bit. Operational cost is close to zero, but financial cost is high: Enhanced Observability charges per vCPU-hour ($0.009) and per GB-memory-hour on each monitored node. On a 20-node `p4d.24xlarge` cluster (96 vCPUs each), that is ~$17,400/month in observability alone.

**Datadog Agent with DogStatsD** is the most complete enterprise option. The `datadog/datadog` Helm chart installs the Agent as a DaemonSet with log collection, metrics (including native DCGM integration), APM and Network Performance Monitoring (NPM). The differentiator is ML Observability and automatic correlation between logs, metrics and traces. Cost is $15–23/host/month depending on tier, plus log ingestion at $0.10/GB after the free tier.

## ML Observability Pipelines on EKS: Four Approaches in Parallel

Each column represents a collection strategy. GPU nodes and DCGM Exporter are shared. Arrows show telemetry flow to the analytics backend.

### 🖥️ EKS — Workload Layer

- Training Pod PyTorch DDP / Ray (compute)
- DCGM Exporter /metrics :9400 (compute)
- stdout/stderr /var/log/containers (data)

### 📦 EKS — Collection DaemonSets

- Fluent Bit mem_buf_limit=256MB (compute)
- OTel Collector filelog+prom+otlp (compute)
- CW Addon amazon-cloudwatch-obs (compute)
- Datadog Agent DogStatsD :8125 (compute)

### 🟧 AWS — Managed Backends

- CloudWatch Logs /aws/eks/ml-cluster (storage)
- CloudWatch Metrics Custom NS: ML/GPU (data)
- S3 + Parquet long-term archive (storage)
- Kinesis Data Streams hot path routing (messaging)

### 🔵 External — SaaS Backends

- Datadog SaaS ML Observability (external)
- Jaeger / Tempo distributed traces (external)

### Flows

- gpu_pod -> stdout: emits logs
- gpu_pod -> dcgm: GPU metrics
- stdout -> fluentbit: tail
- stdout -> otel: filelog receiver
- stdout -> cw_addon: managed tail
- stdout -> dd_agent: log collection
- dcgm -> otel: scrape /metrics
- dcgm -> cw_addon: scrape /metrics
- dcgm -> dd_agent: DCGM integration
- fluentbit -> cwlogs: PutLogEvents
- fluentbit -> kinesis: hot path
- otel -> cwlogs: awscloudwatchlogs
- otel -> cwmetrics: awsemf exporter
- otel -> s3_logs: OTLP/Parquet
- otel -> jaeger: traces
- cw_addon -> cwlogs: managed
- cw_addon -> cwmetrics: EMF
- dd_agent -> dd_backend: HTTPS/443

## Technical Comparison: Four ML Observability Strategies on EKS

| Criterion | Native Fluent Bit | OTel Collector | CW Container Insights Enhanced | Datadog Agent |
| --- | --- | --- | --- | --- |
| Signals covered | Logs only | Logs + Metrics + Traces | Logs + Metrics (GPU via DCGM) | Logs + Metrics + Traces + APM |
| Memory overhead per node | ~50 MB | 150–400 MB (pipeline size) | 200–350 MB (bundle) | 300–600 MB (full agent) |
| Monthly cost (20 p4d.24xlarge nodes) | ~$180 (CW Logs ingestion) | ~$400–800 (CW EMF + Logs) | ~$17,400 (Enhanced vCPU/mem fee) | ~$460 + log ingestion |
| Diagnostic latency (P99) | 30–90 s (CW Logs Insights query) | 10–30 s (backend dependent) | 15–45 s (CW dashboards) | 5–15 s (Live Tail + alerts) |
| GPU metrics support (DCGM) | Not native | Via prometheusreceiver | Native (add-on installs DCGM) | Native (DCGM integration) |
| Compliance / data sovereignty | High (data stays in AWS) | High (configurable per backend) | High (data stays in AWS) | Medium (data leaves to SaaS) |
| Operational complexity | Low | High (pipeline YAML, tuning) | Low (managed) | Medium (Helm + API key rotation) |
| Vendor lock-in | AWS (moderate) | Minimal (open standard) | AWS (high) | Datadog (high) |

## The real problem with Enhanced Observability: hidden vCPU cost

The `amazon-cloudwatch-observability` add-on with Enhanced Observability is the simplest option to enable: one `eksctl create addon` call and you have DCGM, Fluent Bit and OTel Collector pre-configured. But the pricing model is a trap for ML clusters.

Enhanced Observability charges per **vCPU-hour and GB-memory-hour per monitored node**, regardless of how many metrics you actually use. A `p4d.24xlarge` has 96 vCPUs and 1,152 GB RAM. At $0.009/vCPU-hour, the vCPU dimension alone costs **$0.864/hour per node**. With 20 nodes running 24/7 that is **$12,441/month** — and the memory dimension adds another ~$4,976/month. Total: ~$17,417/month.

For comparison, the compute cost of those 20 nodes is ~$471,900/month (at $32.77/hour per node over 720 hours). So observability represents ~3.7% of compute cost, which might seem reasonable until you realise that a standalone OTel Collector with EMF exporter to CloudWatch Metrics covers roughly 90% of the same use cases for ~$800/month.
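
The estimate can be reproduced with a few lines of Python. This is a sketch under stated assumptions: every constant is a figure quoted in this article rather than a value read from an AWS API, the memory-dimension amount is taken as the article's own estimate, and a 720-hour month is assumed.

```python
# Reproduces the Enhanced Observability estimate from the text.
# Every constant is a figure quoted in the article; verify against current pricing.

HOURS_PER_MONTH = 720            # assumed 30-day month, running 24/7
NODES = 20
VCPUS_PER_NODE = 96              # p4d.24xlarge
VCPU_RATE_USD = 0.009            # per vCPU-hour, as quoted in the article
MEMORY_ESTIMATE_USD = 4_976      # article's monthly memory-dimension estimate
NODE_RATE_USD = 32.77            # p4d.24xlarge per hour, as quoted in the article
OTEL_PIPELINE_USD = 820          # article's estimate for the manual OTel pipeline

vcpu_fee = VCPU_RATE_USD * VCPUS_PER_NODE * NODES * HOURS_PER_MONTH
enhanced_total = vcpu_fee + MEMORY_ESTIMATE_USD
compute_total = NODE_RATE_USD * NODES * HOURS_PER_MONTH

share_of_compute = enhanced_total / compute_total
ratio_vs_otel = enhanced_total / OTEL_PIPELINE_USD

print(f"vCPU fee:         ${vcpu_fee:,.0f}/month")
print(f"Enhanced total:   ${enhanced_total:,.0f}/month")
print(f"Share of compute: {share_of_compute:.1%}")
print(f"Ratio vs OTel:    {ratio_vs_otel:.0f}x")
```

Swapping in the vCPU count and hourly rate of a smaller instance family shows why the same add-on is cheap on dev/staging nodes.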

Practical recommendation: use Enhanced Observability **only on dev/staging clusters with smaller nodes** (m5, c5) where the per-vCPU cost is negligible. On production clusters with large GPU instances, build the OTel pipeline manually with `memory_limiter` set to 80% of the container limit, `batch processor` with `send_batch_size: 1000` and `timeout: 10s`, and `prometheusreceiver` scraping DCGM every 30 s (not 15 s — you halve metric volume without meaningful diagnostic loss).

## Numbers that matter in production ML clusters

- **24×**: Time-series multiplier per DCGM metric on a GPU node. Each of the ~40 metrics fans out across 8 GPUs × 3 dimensions (node, GPU, process) on a p4d.24xlarge: 320 series per node at GPU granularity, up to ~960 if all three dimensions are exported
- **21×**: Monthly cost ratio, Enhanced vs manual OTel (20 GPU nodes). ~$17,400 (Enhanced) vs ~$820 (OTel + CW EMF + Fluent Bit standalone)
- **<15 s**: Alert latency with Datadog Live Tail + monitor threshold, compared to 60–120 s with CloudWatch Metric Alarms at high cardinality

## OTel Collector: the pipeline worth the operational cost

The OpenTelemetry Collector is the only one of the four contenders that solves the problem of **multiple backends without agent duplication**. In regulated financial environments this is often mandatory: you need to send logs to CloudWatch (7-year regulatory retention via S3 Glacier), metrics to an internal analytics backend (Prometheus + Thanos or Amazon Managed Prometheus) and traces to Jaeger or Tempo — all simultaneously, with different delivery guarantees.

The pipeline configuration I use in production for ML workloads has three critical stages:

**1. Parallel receivers with memory isolation**: `filelog` receiver with `start_at` left at its default `end` (avoids re-reading historical logs when the Collector restarts), `prometheusreceiver` with `scrape_interval: 30s`, and `target_allocator` enabled to distribute scraping across multiple Collectors when the cluster has >50 nodes.

**2. Chained processors with circuit breaker**: `memory_limiter` as the first processor (limit at 80% of the container's `resources.limits.memory`), followed by `resource` processor to add `k8s.cluster.name`, `ml.job.id` and `gpu.node.type` attributes — essential for cross-signal correlation. The `batch processor` comes last, not first: placing batch before memory_limiter is the most common mistake I see in architecture reviews.

**3. Exporters with retry and persistent queue**: `awscloudwatchlogs` exporter with `log_stream_name` derived from `k8s.pod.name` (avoids per-stream throttling), `awsemf` exporter with explicit `metric_declarations` to avoid sending all 320 DCGM series to CloudWatch (select the 8–12 that actually matter: `DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_MEM_COPY_UTIL`, `DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL`, `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL`). `retry_on_failure` with `max_elapsed_time: 300s` and `sending_queue` with `storage: file_storage` ensures a temporary CloudWatch throttle does not lose training data.
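
The processor ordering described in stage 2 is simple to check automatically, for example in CI. Below is a minimal Python sketch: the dict mirrors the layout of a Collector YAML file, the values are illustrative, and `lint_processor_order` is a hypothetical helper, not part of any OpenTelemetry tooling.

```python
# Minimal lint for an OTel Collector pipeline described as a Python dict.
# The dict mirrors the Collector's YAML layout; all values are illustrative.

pipeline_config = {
    "processors": {
        "memory_limiter": {"check_interval": "1s", "limit_percentage": 80},
        "resource": {
            "attributes": [
                {"key": "k8s.cluster.name", "value": "ml-cluster", "action": "upsert"},
            ]
        },
        "batch": {"send_batch_size": 1000, "timeout": "10s"},
    },
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["prometheus"],
                "processors": ["memory_limiter", "resource", "batch"],
                "exporters": ["awsemf"],
            }
        }
    },
}


def lint_processor_order(config: dict) -> list[str]:
    """Return human-readable problems for each pipeline's processor order."""
    problems = []
    for name, pipeline in config["service"]["pipelines"].items():
        procs = pipeline.get("processors", [])
        if not procs or procs[0] != "memory_limiter":
            problems.append(f"{name}: memory_limiter should be the first processor")
        if "batch" in procs and procs[-1] != "batch":
            problems.append(f"{name}: batch should be the last processor")
    return problems


print(lint_processor_order(pipeline_config))  # an empty list means the order is fine
```

In practice the dict would come from parsing the real Collector YAML; the check itself stays the same.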

## Decision Matrix: Which Strategy for Which Context

### Native Fluent Bit

**Pros**
- Minimal overhead (~50 MB); ideal for nodes where memory is contested by GPU containers
- Native IRSA integration; no static credentials
- Routing to Kinesis for real-time alert hot-path

**Cons**
- Covers logs only; GPU metrics require a separate agent
- No native correlation between logs and metrics

**Verdict:** Use as a complement to another collector, not as a standalone solution

### OTel Collector

**Pros**
- Unifies logs, metrics and traces in a single DaemonSet
- Open standard; no lock-in; exports to any backend
- 21× lower cost than Enhanced Observability on large GPU clusters
- target_allocator scales scraping horizontally

**Cons**
- High configuration complexity; pipeline mistakes cause silent data loss
- Requires expertise in OTel pipeline YAML and memory tuning

**Verdict:** Best choice for production on medium-to-large GPU clusters with multi-backend requirements

### CW Container Insights Enhanced

**Pros**
- Zero operation; AWS-managed add-on
- Native DCGM without additional configuration
- Ideal for teams without OTel expertise

**Cons**
- Prohibitive cost on large GPU instances (~$17k/month for 20 p4d nodes)
- Strong CloudWatch lock-in; difficult to migrate historical data

**Verdict:** Acceptable only on dev/staging clusters with smaller instances; never on large-scale GPU production

### Datadog Agent

**Pros**
- Lowest diagnostic latency (<15 s with Live Tail)
- ML Observability with automatic run/experiment correlation
- Native APM for serving pipelines (Triton, TorchServe)

**Cons**
- Data leaves AWS; problematic in environments with financial data sovereignty restrictions
- Cost grows linearly with hosts; can exceed OTel+CW on large clusters
- API key rotation requires additional security process (Secrets Manager + External Secrets Operator)

**Verdict:** Best for ML teams that need fast diagnosis and have no data sovereignty restrictions

## Security and governance: what changes when ML logs contain sensitive data

In financial environments, training logs frequently contain information that should not leave the security perimeter: customer IDs in validation datasets, feature values derived from transactions, or simply the model name and version (which is sensitive competitive information). This adds a layer of requirements that most observability comparisons ignore.

**CloudWatch Logs with KMS**: every collector that ships to CloudWatch Logs can write to an encrypted log group, but the data is encrypted with your own key only if you configure `kms_key_id` on the log group. With Fluent Bit, set `auto_create_group Off` and pre-create the group with `aws logs create-log-group --kms-key-id arn:aws:kms:...`. With the OTel Collector, make sure the log group exists before the first `PutLogEvents`; this is a common race condition in clusters that scale rapidly.

**IAM with context conditions**: the collector's IRSA policy must have `Condition: StringEquals: aws:RequestedRegion: [region]` to prevent cross-region exfiltration, and `aws:SourceVpc` if you use VPC Endpoints for CloudWatch. The Datadog Agent stores the API key in a Kubernetes Secret — use External Secrets Operator with AWS Secrets Manager and 90-day automatic rotation, not a manual `kubectl create secret`.
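
As an illustration of those context conditions, the sketch below builds a policy document in Python and prints it as JSON. The region, account ID, VPC ID and statement ID are hypothetical placeholders, and the action list is an assumption about what a log-shipping collector needs; adjust both to your own setup.

```python
import json

# Hypothetical values: replace region, account ID, VPC ID and log group name.
REGION = "sa-east-1"
ACCOUNT_ID = "111122223333"
VPC_ID = "vpc-0123456789abcdef0"
LOG_GROUP = "/aws/eks/ml-cluster"

collector_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "WriteTrainingLogsFromVpcOnly",
            "Effect": "Allow",
            # Assumed minimal set of actions for shipping logs to an existing group.
            "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": f"arn:aws:logs:{REGION}:{ACCOUNT_ID}:log-group:{LOG_GROUP}:*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestedRegion": REGION,
                    # Only present when the call goes through a VPC endpoint.
                    "aws:SourceVpc": VPC_ID,
                }
            },
        }
    ],
}

print(json.dumps(collector_policy, indent=2))
```

Because `aws:SourceVpc` is only set on requests that arrive through a VPC endpoint, this statement also stops the collector from writing over the public CloudWatch endpoint.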

**Data masking in the pipeline**: OTel Collector has the `transform` processor with OTTL (OpenTelemetry Transformation Language) that allows masking fields before sending: `replace_pattern(body, "customer_id=\\d+", "customer_id=REDACTED")`. Fluent Bit has the `lua` filter for the same purpose. Datadog has Sensitive Data Scanner on the backend side, but that means the data has already left AWS — for regulated environments, masking must happen in the collector, not the backend.
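
Before putting a masking pattern into the collector, it helps to unit-test it in isolation. The sketch below applies the same pattern in Python; it is not OTTL, the sample log line is hypothetical, and Python's `\d` also matches non-ASCII digits, so treat it as an approximation of the collector's regex engine.

```python
import re

# Same pattern as the OTTL example in the text, expressed as a Python regex.
CUSTOMER_ID = re.compile(r"customer_id=\d+")


def mask_customer_ids(line: str) -> str:
    """Replace every customer_id=<digits> occurrence with a redacted marker."""
    return CUSTOMER_ID.sub("customer_id=REDACTED", line)


sample = "step=1200 loss=0.41 customer_id=884213 batch=17"  # hypothetical log line
print(mask_customer_ids(sample))
# step=1200 loss=0.41 customer_id=REDACTED batch=17
```

A small table of real (sanitised) log lines run through this function catches gaps in the pattern before they reach production.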

**Retention and immutability**: for compliance with financial regulations (BACEN, CVM, SOX), configure S3 Object Lock in COMPLIANCE mode on log archive buckets with a 7-year retention period. Fluent Bit with S3 output supports `s3_key_format /%Y/%m/%d/%H/` for temporal partitioning that facilitates audit queries with Athena.
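
To see what that key format produces, here is a minimal Python sketch that expands the same strftime pattern. The helper name is hypothetical, and the use of UTC is an assumption; confirm which timezone your Fluent Bit output is configured to use.

```python
from datetime import datetime, timezone

S3_KEY_FORMAT = "/%Y/%m/%d/%H/"  # same strftime pattern as in the text


def partition_prefix(ts: datetime) -> str:
    """Hourly partition prefix that a log object written at `ts` would fall under."""
    return ts.astimezone(timezone.utc).strftime(S3_KEY_FORMAT)


print(partition_prefix(datetime(2026, 5, 31, 14, 5, tzinfo=timezone.utc)))
# /2026/05/31/14/
```

An auditor asking for one hour of logs then maps to a single prefix, which keeps Athena scans narrow.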

> **The most expensive architecture mistake: DCGM scraping without cardinality filtering:** I have seen teams send all ~40 DCGM Exporter metrics to CloudWatch Metrics without `metric_declarations`. On a 30-node p4d.24xlarge cluster that generates 9,600 custom time series. CloudWatch charges $0.30 per custom metric per month for the first 10,000, so that is about $2,880/month, but the bill is not the only cost. There is also API pressure: `PutMetricData` has a limit of 1,000 values per call and 150 TPS per account. With 15 s scraping and 9,600 series you need at least ~10 calls per cycle cluster-wide, and more when every node's Collector publishes on its own. Once you start seeing `ThrottlingException`, the OTel Collector handles it with exponential retry, increasing diagnostic latency exactly when you need it most. Filter to 8–12 essential metrics at the source, not the destination.
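
The effect of an allow-list can be sketched in a few lines of Python. The allow-list below reuses the metric names mentioned in this article, the per-call limit is the figure quoted above, and both helper functions are hypothetical; the calculation is a lower bound for the whole cluster, not a model of how any specific exporter batches.

```python
from math import ceil

# Allow-list taken from the metric names mentioned in the text.
ALLOWED = {
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_MEM_COPY_UTIL",
    "DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL",
    "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL",
}
VALUES_PER_CALL = 1_000  # per-call limit as quoted in the text


def keep(metric_name: str) -> bool:
    """True if the metric should leave the node."""
    return metric_name in ALLOWED


def calls_per_cycle(nodes: int, gpus_per_node: int, metrics_per_gpu: int) -> int:
    """Minimum number of calls to publish one scrape cycle for the whole cluster."""
    return ceil(nodes * gpus_per_node * metrics_per_gpu / VALUES_PER_CALL)


scraped = ["DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_GPU_TEMP", "DCGM_FI_DEV_POWER_USAGE"]
print([m for m in scraped if keep(m)])
print(calls_per_cycle(30, 8, 40), calls_per_cycle(30, 8, len(ALLOWED)))
```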

## Anti-patterns I have seen in ML/EKS architecture reviews

- Enabling Enhanced Observability on GPU production clusters without calculating the per-vCPU-hour cost upfront
- Placing `batch processor` before `memory_limiter` in the OTel pipeline — the batch accumulates data in memory before the limiter can act
- Using `start_at: beginning` in the filelog receiver without a `storage` extension — causes re-reading of all log history on every Collector restart
- Storing Datadog API keys in Kubernetes Secrets without automatic rotation via External Secrets Operator
- Not configuring `mem_buf_limit` in Fluent Bit on GPU nodes — a training log burst can OOM the DaemonSet and stop collection for all pods on the node
- Sending PII data from validation datasets to SaaS backends without masking in the collector

> **My curation note:** In production I build the pipeline as OTel Collector with `prometheusreceiver` for DCGM (scrape at 30 s, not 15 s) + `awsemf` exporter with explicit `metric_declarations` for the 10 GPU metrics that actually matter + Fluent Bit only for the critical-log hot-path via Kinesis. Datadog is reserved for environments where the ML team needs <15 s diagnosis and there is no data sovereignty restriction — which in Brazilian banks is rarely the case. The most expensive lesson I have learned: observability cost on GPU clusters is not marginal — on a large-scale training cluster it can easily exceed the cost of a senior engineer per month if you do not consciously size the pipeline from the first deploy.

## Final Recommendation: Composition, Not a Single Choice

There is no single winner in this bake-off because the four contenders solve different problems. The architecture I recommend for production ML clusters in financial environments is a deliberate composition:

**Collection layer**: OTel Collector as the primary DaemonSet, with `opentelemetry-operator` managing the lifecycle. Configure `filelog receiver` for logs, `prometheusreceiver` for DCGM (30 s, filtered to 10 metrics), and `otlp receiver` for serving traces. Use `memory_limiter` as the first processor, `resource` for ML attribute enrichment, and `batch` last.

**Dual routing**: `awsemf` exporter for critical metrics in CloudWatch (operational alerts), `awscloudwatchlogs` for KMS-encrypted logs, and an S3 exporter (such as `awss3` from collector-contrib) for long-term archival in S3 as Parquet and Athena analysis. For clusters with serving trace requirements, add the `awsxray` exporter for AWS X-Ray, or an `otlp` exporter pointing to Jaeger or Tempo.

**Fluent Bit as hot-path**: keep the Fluent Bit DaemonSet only for routing critical logs (errors, OOM, checkpoint failures) to Kinesis Data Streams → Lambda → PagerDuty. This guarantees <30 s alert latency without OTel Collector overhead on the critical path.
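
The Lambda stage of that hot path can be sketched as follows. This is a minimal Python example under assumptions: the regex for "critical" lines is hypothetical, the paging call is deliberately left out, and the only external contract relied on is the standard Kinesis event shape, where each record carries base64-encoded data under `Records[].kinesis.data`.

```python
import base64
import re

# Hypothetical pattern for "critical" training log lines.
CRITICAL = re.compile(
    r"CUDA out of memory|OOMKilled|checkpoint.*failed|NCCL.*error",
    re.IGNORECASE,
)


def critical_lines(event: dict) -> list[str]:
    """Decode Kinesis records from a Lambda event and keep only critical lines."""
    hits = []
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        payload = raw.decode("utf-8", errors="replace")
        for line in payload.splitlines():
            if CRITICAL.search(line):
                hits.append(line)
    return hits


def handler(event, context):
    """Lambda entry point; the paging call would go where the comment is."""
    hits = critical_lines(event)
    # Forward `hits` to the paging system here (not shown).
    return {"critical": len(hits)}


# Local usage with a hand-built event.
fake = {"Records": [{"kinesis": {"data": base64.b64encode(
    b"epoch 3 ok\nRuntimeError: CUDA out of memory on rank 5").decode()}}]}
print(handler(fake, None))
```

Keeping the filter in a pure function makes it testable without deploying anything, and the same pattern can be mirrored in a Fluent Bit `grep` filter so that only matching lines reach the stream.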

**Rating:** OTel Collector + Fluent Bit hot-path

## Technical References

- [AWS: Amazon CloudWatch Observability EKS Add-on](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-EKS-otel.html)
- [AWS: CloudWatch Container Insights Enhanced Observability Pricing](https://aws.amazon.com/cloudwatch/pricing/)
- [AWS: Using AWS Distro for OpenTelemetry with EKS](https://aws-otel.github.io/docs/getting-started/adot-eks-add-on)
- [OpenTelemetry: Collector Configuration — Memory Limiter Processor](https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/memorylimiterprocessor)
- [NVIDIA: DCGM Exporter for Kubernetes](https://github.com/NVIDIA/dcgm-exporter)
- [AWS: Fluent Bit for EKS — aws-for-fluent-bit](https://github.com/aws/aws-for-fluent-bit)
- [AWS Architecture Blog: EKS ML Core Logging](https://aws.amazon.com/blogs/architecture/)
- [OpenTelemetry: Target Allocator for Prometheus Receiver](https://github.com/open-telemetry/opentelemetry-operator/tree/main/cmd/otel-allocator)
