# Design Doc: LLM Observability — from GPU Utilization to Response Quality

This document proposes an end-to-end observability architecture for LLM inference platforms running on Amazon SageMaker AI and Amazon Bedrock, covering everything from hardware metrics (GPU utilization, memory) to semantic response quality, behavioral drift, and per-tenant cost. The design integrates CloudWatch, Amazon Managed Grafana, prompt-level tracing, and automated regression alarms, with clear separation of concerns across collection, storage, evaluation, and alerting layers.

- URL: https://fernando.moretes.com/studies/design-doc-llm-observability-quality-cost-sagemaker-bedrock

- Markdown: https://fernando.moretes.com/studies/design-doc-llm-observability-quality-cost-sagemaker-bedrock/study.md?lang=en

- Type: Design Doc / RFC

- Company: LLM operations platform (scenario)

- Domain: Observability / AI

- Date: 2026-06-06

- Tags: llm-observability, sagemaker, bedrock, cloudwatch, grafana, mlops, gpu-metrics, eval-driven-ops

- Reading time: 13 min

---

Operating LLMs in production without structured observability means flying blind: you don't know if the model is degrading, if a tenant is monopolizing GPU, or if response quality silently dropped after a hot-patch. This document specifies the complete observability architecture for a multi-tenant LLM inference platform on SageMaker AI and Bedrock — from hardware telemetry to semantic eval scores, with per-prompt tracing, regression alarms, and attributable cost per customer.

## The Problem: Visibility Gaps in LLM Inference

LLM inference platforms introduce a class of observability problems that traditional APM tools were not designed to solve. In conventional systems, high latency and application errors are sufficient to diagnose most incidents. In LLMs, you can have acceptable latency, zero HTTP errors, and still be delivering semantically incorrect, truncated, or increasingly hallucinated responses — and no alarm will fire.

The scenario this document addresses is a multi-tenant platform where multiple customers consume inference endpoints provisioned on SageMaker AI (proprietary fine-tuned models) and Amazon Bedrock (foundation models via managed API). Each tenant has distinct SLAs, distinct token volumes, and distinct cost tolerances. The identified visibility gaps are:

**1. Hardware layer disconnected from application layer.** GPU metrics (utilization and GPU memory utilization) exist in CloudWatch under the `/aws/sagemaker/Endpoints` namespace, but they are not automatically correlated with inference latency or output quality. GPU temperature is not among the natively emitted endpoint metrics and would need a custom exporter. A GPU utilization reading of 98% may mean optimal throughput, or it may be the precursor of an OOM that brings down the endpoint.

**2. No per-prompt tracing.** LLM inference requests are neither idempotent nor uniform: a 4,000-token prompt has a radically different latency and cost profile than a 200-token one. Without a trace ID propagated from the client to the endpoint, it is impossible to correlate a quality complaint from a specific tenant with the actual request that caused it.

**3. Response quality is not a native metric.** No AWS service automatically emits a semantic quality metric. Eval scores (RAGAS, G-Eval, BERTScore, or LLM-as-judge evaluators) must be calculated asynchronously and injected into the metrics pipeline as custom metrics. Without this, quality degradation is only detected via support tickets — too late.

**4. Per-tenant cost is invisible in the default model.** Bedrock charges per token; SageMaker charges per instance-hour. In a multi-tenant environment, cost allocation per customer requires explicit instrumentation: input/output token counts per tenant, mapping to instance SKU, and temporal aggregation. Without this, the product cannot price correctly or identify tenants consuming disproportionate resources.

## Goals and Non-Goals

- ✅ GOAL: Collect and correlate GPU metrics (utilization, memory, temperature) with inference metrics (tokens/s, latency p50/p99, batch size) in real time.
- ✅ GOAL: Implement distributed per-prompt tracing with trace ID propagated from client to endpoint, stored in AWS X-Ray and correlated with inference logs.
- ✅ GOAL: Calculate and emit quality eval scores (LLM-as-judge + reference metrics) asynchronously for each sampled request, injecting them as custom metrics into CloudWatch.
- ✅ GOAL: Calculate per-tenant cost per period (daily/monthly) with input/output token granularity, separating SageMaker and Bedrock costs.
- ✅ GOAL: Detect model behavioral drift via monitoring of eval score distribution over time, with automatic regression alarms.
- ✅ GOAL: Provide operational dashboards in Amazon Managed Grafana with views by tenant, endpoint, and model.
- ❌ NON-GOAL: Close the feedback loop between observability and model improvement (feeding collected eval data into fine-tuning or RLHF pipelines).

## Scenario Fact Sheet

- **Platform:** LLM Operations Platform (composite scenario)
- **Inference services:** Amazon SageMaker AI (dedicated endpoints) + Amazon Bedrock (managed API)
- **Estimated scale:** ~50 active tenants, ~2M requests/day, ~8B tokens/day (estimate)
- **SageMaker instances:** ml.g5.12xlarge and ml.p4d.24xlarge with multi-model endpoints
- **Models:** Fine-tuned Llama 3 (SageMaker) + Claude 3.5 Sonnet / Titan (Bedrock)
- **Observability stack:** CloudWatch Metrics + Logs + X-Ray, Amazon Managed Grafana, Lambda (async eval), Kinesis Data Streams, DynamoDB (cost/tenant)
- **Target latency SLA:** p50 < 800ms, p99 < 4s (first token); throughput > 50 tokens/s per endpoint
- **Observability data retention:** Metrics: 15 months (CloudWatch); Logs: 90 days hot, 2 years S3 Glacier IR; Traces: 30 days (X-Ray)

## Proposed Design: Four-Layer Architecture

The observability architecture is organized into four layers with distinct responsibilities and well-defined interfaces between them. The central principle is to separate collection from storage, storage from evaluation, and evaluation from alerting — each layer can evolve independently.

### Layer 1 — Instrumentation and Collection

Every inference request passes through an **Inference Proxy** (Lambda or ECS Fargate), which is the single entry point for both backends (SageMaker and Bedrock). The proxy has three observability responsibilities:

- **Trace ID generation**: each request receives an `x-trace-id` UUID v4 that is propagated as an HTTP header to the downstream endpoint and as a segment attribute in X-Ray. The trace ID is also injected into the structured log payload (JSON) emitted to CloudWatch Logs.
- **Token counting**: before dispatching the request, the proxy tokenizes the prompt with the tokenizer corresponding to the model (via `tiktoken` or `transformers` library) and records `input_token_count` as a custom CloudWatch metric with dimensions `tenant_id`, `model_id`, and `endpoint_type`. Output token count is recorded on the response.
- **Latency capture**: the proxy measures `time_to_first_token` (TTFT) and `total_latency` with millisecond precision and emits them as CloudWatch metrics with 1-second resolution (high-resolution metrics).
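
The three proxy responsibilities above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the namespace `LLMPlatform/Inference`, the helper names, and the whitespace-based token count are assumptions made for the example, and a real proxy would call the tokenizer that matches the model.

```python
import json
import time
import uuid


def new_trace_id() -> str:
    """Return a UUID v4 string to send as the x-trace-id header."""
    return str(uuid.uuid4())


def count_tokens(text: str) -> int:
    """Placeholder count (whitespace split); swap in the model's tokenizer."""
    return len(text.split())


def metric_datum(name, value, unit, dims, high_res=False):
    """Build one MetricData entry in the shape PutMetricData expects."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": k, "Value": v} for k, v in dims.items()],
        "Value": float(value),
        "Unit": unit,
        # 1 = high-resolution (1-second) storage, 60 = standard resolution.
        "StorageResolution": 1 if high_res else 60,
    }


def instrument(prompt, tenant_id, model_id, endpoint_type, ttft_ms, total_ms):
    """Return (trace_id, JSON log line, PutMetricData keyword arguments)."""
    trace_id = new_trace_id()
    dims = {"tenant_id": tenant_id, "model_id": model_id, "endpoint_type": endpoint_type}
    input_tokens = count_tokens(prompt)
    log_line = json.dumps({
        "trace_id": trace_id,
        "timestamp": time.time(),
        "input_token_count": input_tokens,
        "time_to_first_token_ms": ttft_ms,
        "total_latency_ms": total_ms,
        **dims,
    })
    put_metric_kwargs = {
        "Namespace": "LLMPlatform/Inference",  # hypothetical namespace
        "MetricData": [
            metric_datum("input_token_count", input_tokens, "Count", dims),
            metric_datum("time_to_first_token", ttft_ms, "Milliseconds", dims, high_res=True),
            metric_datum("total_latency", total_ms, "Milliseconds", dims, high_res=True),
        ],
    }
    return trace_id, log_line, put_metric_kwargs


trace_id, log_line, put_metric_kwargs = instrument(
    "Summarize the attached incident report", "tenant-a", "llama3-ft", "sagemaker", 412.0, 1830.0
)
```

In a deployed proxy the returned dictionary would be passed to `boto3.client("cloudwatch").put_metric_data(**put_metric_kwargs)`, and `log_line` would be written to stdout so that it lands in CloudWatch Logs.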

For SageMaker endpoints, native GPU metrics (`GPUUtilization`, `GPUMemoryUtilization`, `DiskUtilization`) are already emitted automatically to the `/aws/sagemaker/Endpoints` namespace. For Bedrock, invocation metrics (`InvocationLatency`, `InputTokenCount`, `OutputTokenCount`) are emitted to the `AWS/Bedrock` namespace. Both sources are consumed directly by Grafana via the CloudWatch data source.

### Layer 2 — Asynchronous Evaluation Pipeline

Response quality cannot be evaluated on the critical path without impacting latency. The design uses sampling: **100% of requests have metadata captured** (prompt hash, tenant, model, tokens, latency), but only a configurable fraction (default: 5% per tenant, adjustable by SLA) goes through the full eval pipeline.
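
One way to make the sampling decision is to hash the trace ID into a uniform bucket, so the same request always gets the same answer even if it is retried. The sketch below assumes that approach; the document does not prescribe a sampling algorithm, and `should_evaluate` is a hypothetical helper name.

```python
import hashlib

DEFAULT_SAMPLE_RATE = 0.05  # 5% default from the design


def should_evaluate(trace_id: str, sample_rate: float = DEFAULT_SAMPLE_RATE) -> bool:
    """Decide deterministically whether a request enters the full eval pipeline."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Rough check of the sampled fraction over synthetic trace IDs.
sampled = sum(should_evaluate(f"trace-{i}") for i in range(20_000))
```

Because the decision depends only on the trace ID and the rate, raising a tenant's rate keeps every previously sampled request in the sample, which helps when comparing score series before and after the change.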

The proxy publishes an event to a **Kinesis Data Stream** (`llm-eval-stream`) containing: `trace_id`, `tenant_id`, `model_id`, `prompt` (truncated to 2,000 chars for cost control), `completion`, `timestamp`, and context metadata (e.g., RAG chunks if applicable). A consumer Lambda function (`eval-worker`) processes these events in micro-batches and executes:

1. **LLM-as-judge**: invokes an evaluator model on Bedrock (Claude 3 Haiku for cost) with a structured rubric prompt that returns 1-5 scores for relevance, coherence, factual fidelity, and absence of hallucination.
2. **Reference metrics** (when ground truth available): BERTScore F1 for semantic similarity.
3. **Custom metric emission**: scores are emitted to CloudWatch under the `LLMPlatform/EvalScores` namespace with dimensions `tenant_id`, `model_id`, and `eval_dimension`.
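
The steps above can be sketched as a small scoring helper inside `eval-worker`. The sketch assumes the judge is prompted to answer with a JSON object keyed by dimension; the key names, the metric name `EvalScore`, and the function names are illustrative assumptions, and the Bedrock call itself is left out.

```python
import json

# Dimension keys are assumptions; the rubric defines the real names.
EVAL_DIMENSIONS = ("relevance", "coherence", "factual_fidelity", "hallucination_absence")


def parse_judge_scores(raw: str) -> dict:
    """Validate the judge's JSON answer and return one 1-5 score per dimension."""
    data = json.loads(raw)
    scores = {}
    for dim in EVAL_DIMENSIONS:
        value = data[dim]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {dim}: {value!r}")
        scores[dim] = float(value)
    return scores


def eval_metric_kwargs(tenant_id: str, model_id: str, scores: dict) -> dict:
    """Build PutMetricData keyword arguments for the LLMPlatform/EvalScores namespace."""
    return {
        "Namespace": "LLMPlatform/EvalScores",
        "MetricData": [
            {
                "MetricName": "EvalScore",  # hypothetical metric name
                "Dimensions": [
                    {"Name": "tenant_id", "Value": tenant_id},
                    {"Name": "model_id", "Value": model_id},
                    {"Name": "eval_dimension", "Value": dim},
                ],
                "Value": score,
                "Unit": "None",
            }
            for dim, score in scores.items()
        ],
    }


raw_answer = '{"relevance": 4, "coherence": 5, "factual_fidelity": 3, "hallucination_absence": 4}'
scores = parse_judge_scores(raw_answer)
payload = eval_metric_kwargs("tenant-a", "llama3-ft", scores)
```

Rejecting malformed judge output before emission keeps a misbehaving evaluator from silently skewing the score series that the regression alarms depend on.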

Full eval results are stored in **DynamoDB** (table `eval-results`, 90-day TTL) for ad-hoc querying and in **S3** (Parquet format, partitioned by `date/tenant/model`) for historical analysis and evaluator training.

### Layer 3 — Per-Tenant Cost

A scheduled Lambda function (`cost-aggregator`, hourly execution) reads the custom token count metrics from CloudWatch via `GetMetricData` and applies the current pricing table. Bedrock cost is the price per token for each model; SageMaker cost is the instance-hour price for the period, apportioned across the tenants active in that period in proportion to each tenant's share of tokens. The result is written to DynamoDB (table `tenant-cost`, key `tenant_id#date`) and exposed via API Gateway to the billing system.
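
A minimal sketch of the attribution arithmetic, assuming per-1,000-token prices for Bedrock and a token-weighted split of SageMaker instance cost. All prices and token counts below are made-up values for illustration, and the function names are hypothetical.

```python
def bedrock_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one tenant's Bedrock usage from its token counts."""
    return (input_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


def split_sagemaker_cost(instance_hours, price_per_hour, tokens_by_tenant):
    """Apportion instance cost across tenants by their share of processed tokens."""
    total_cost = instance_hours * price_per_hour
    total_tokens = sum(tokens_by_tenant.values())
    if total_tokens == 0:
        # No traffic in the period: nothing to attribute to any tenant.
        return {tenant: 0.0 for tenant in tokens_by_tenant}
    return {
        tenant: total_cost * (tokens / total_tokens)
        for tenant, tokens in tokens_by_tenant.items()
    }


# Illustrative numbers only: 2 instance-hours at $10/hour, two tenants.
shares = split_sagemaker_cost(2.0, 10.0, {"tenant-a": 750, "tenant-b": 250})
api_cost = bedrock_cost(2000, 1000, price_in_per_1k=0.5, price_out_per_1k=1.5)
```

With a token-weighted split the per-tenant amounts always sum to the instance cost for the period, which is what makes the monthly reconciliation against the AWS invoice possible.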

### Layer 4 — Alerts and Dashboards

CloudWatch alarms cover three categories: (a) **infrastructure**: GPU utilization > 90% for 5 minutes, p99 latency > per-tenant threshold; (b) **quality**: average eval score per tenant dropping more than 15% relative to a 7-day rolling baseline; (c) **cost**: a tenant's token consumption exceeding 120% of configured daily budget. Quality regression alarms use **CloudWatch Anomaly Detection** with a 2 standard deviation band over the historical eval score series.
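
The three alarm conditions can be written as plain predicates, which is useful for unit-testing thresholds before encoding them as CloudWatch alarms. This is a sketch under assumptions: the function names are hypothetical, GPU samples are taken to be one datapoint per minute, and the baseline is a simple mean of the last 7 daily averages.

```python
from statistics import mean


def gpu_saturated(utilization_per_minute, threshold=90.0, minutes=5):
    """Infrastructure: every datapoint in the last `minutes` is above the threshold."""
    recent = utilization_per_minute[-minutes:]
    return len(recent) == minutes and all(u > threshold for u in recent)


def quality_regressed(baseline_daily_scores, current_score, max_drop=0.15):
    """Quality: current average is more than 15% below the rolling baseline."""
    baseline = mean(baseline_daily_scores)
    return current_score < baseline * (1 - max_drop)


def over_budget(tokens_used, daily_budget_tokens, limit=1.2):
    """Cost: consumption exceeds 120% of the configured daily budget."""
    return tokens_used > daily_budget_tokens * limit


baseline = [4.0, 4.1, 3.9, 4.0, 4.0, 4.1, 3.9]  # illustrative 7-day series
alerts = {
    "infra": gpu_saturated([72, 95, 96, 97, 98, 99]),
    "quality": quality_regressed(baseline, 3.3),
    "cost": over_budget(1_300_000, 1_000_000),
}
```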

Amazon Managed Grafana consumes CloudWatch as the primary data source and X-Ray for trace visualization. Dashboards are organized in three layers: executive view (cost and quality per tenant), operational view (GPU, latency, throughput per endpoint), and debug view (individual trace by `trace_id`).

## LLM Observability Architecture — Complete Flow

The diagram shows the observability data flow from client request to dashboards and alarms, covering all four layers: collection, async evaluation, cost, and alerting.

### 👤 Clients / Tenants

- **Tenant Client** (`tenant`): SDK / REST

### 🔀 Inference Proxy Layer

- **Inference Proxy** (`proxy`): Lambda / ECS Fargate
- **AWS X-Ray** (`xray`): trace ID propagation

### 🤖 Inference Backends

- **SageMaker AI** (`sagemaker`): ml.g5 / ml.p4d endpoints
- **Amazon Bedrock** (`bedrock`): Claude / Titan / Llama

### 📊 Metrics & Logs Collection

- **CloudWatch Metrics** (`cw_metrics`): GPU / latency / tokens, high-resolution (1s)
- **CloudWatch Logs** (`cw_logs`): structured JSON with `trace_id` + `tenant_id`

### ⚡ Async Eval Pipeline

- **Kinesis Data Stream** (`kinesis`): `llm-eval-stream`, 5% sample
- **eval-worker Lambda** (`eval_lambda`): LLM-as-judge, BERTScore
- **Bedrock Evaluator** (`bedrock_eval`): Claude 3 Haiku, rubric scoring
- **DynamoDB `eval-results`** (`eval_dynamo`): TTL 90d
- **S3 Parquet** (`eval_s3`): partitioned by `date/tenant/model`, historical

### 💰 Cost Attribution

- **cost-aggregator Lambda** (`cost_lambda`): hourly, `GetMetricData`
- **DynamoDB `tenant-cost`** (`cost_dynamo`): key `tenant_id#date`
- **API Gateway** (`billing_api`): billing endpoint

### 🔔 Alerting & Dashboards

- **CloudWatch Alarms** (`cw_alarms`): GPU / quality / cost, Anomaly Detection
- **SNS** (`sns`): PagerDuty / Slack routing by severity
- **Amazon Managed Grafana** (`grafana`): exec / ops / debug dashboards

### Flows

- tenant -> proxy: request + tenant header
- proxy -> xray: trace segment
- proxy -> sagemaker: dedicated inference
- proxy -> bedrock: managed inference
- proxy -> cw_metrics: tokens, TTFT, latency
- proxy -> cw_logs: structured JSON log
- proxy -> kinesis: 5% eval sample
- sagemaker -> cw_metrics: native GPU / mem
- bedrock -> cw_metrics: invocation metrics
- kinesis -> eval_lambda: micro-batch
- eval_lambda -> bedrock_eval: rubric prompt
- eval_lambda -> cw_metrics: custom eval scores
- eval_lambda -> eval_dynamo: full result
- eval_lambda -> eval_s3: historical Parquet
- cw_metrics -> cost_lambda: GetMetricData tokens
- cost_lambda -> cost_dynamo: aggregated cost
- cost_dynamo -> billing_api: billing read
- cw_metrics -> cw_alarms: threshold / anomaly
- cw_alarms -> sns: alarm state change
- cw_metrics -> grafana: data source
- xray -> grafana: trace visualization
- eval_dynamo -> grafana: eval scores query

## Evaluated Design Alternatives

### Option A (chosen): CloudWatch + Managed Grafana + Kinesis Eval

**Pros**
- Native integration with SageMaker and Bedrock — zero additional instrumentation for GPU and invocation metrics
- Amazon Managed Grafana eliminates operational overhead of self-hosted Grafana cluster
- Kinesis Data Streams offers replay and reprocessing of eval events without loss
- Native CloudWatch Anomaly Detection for regression alarms without external model

**Cons**
- CloudWatch `GetMetricData` calls are billed, so the hourly cost aggregation needs optimization to avoid billing surprises
- Async eval latency (minutes) means quality alarms are not real-time

**Verdict:** Best balance of native integration, operational cost, and maturity for the described scenario

### Option B: OpenTelemetry + Prometheus + self-hosted Grafana OSS

**Pros**
- Fully open-source stack, no vendor lock-in for the observability layer
- OpenTelemetry offers standardized, portable instrumentation across clouds

**Cons**
- Prometheus was not designed for high-cardinality metrics (tenant_id × model_id × eval_dimension) — cardinality explosion risk
- Significant operational overhead: HA Prometheus cluster, Thanos or Cortex for long retention, HA Grafana cluster
- Integration with native SageMaker/Bedrock metrics requires custom exporters

**Verdict:** Valid for multi-cloud environments or zero lock-in requirements, but high operational cost for small teams

### Option C: Datadog LLM Observability (dedicated product)

**Pros**
- Dedicated LLM observability product with integrated LLM-as-judge, prompt tracing, and pre-built dashboards
- Faster time-to-value for teams without experience building eval pipelines

**Cons**
- Per-host/span cost can be prohibitive at 2M requests/day scale — estimated 3-5x higher cost than Option A
- Vendor lock-in in external tool; prompt/completion data leaves AWS perimeter — compliance risk for regulated tenants
- Per-tenant cost is not native — requires customization similar to Option A

**Verdict:** Suitable for proof of concept or small teams; not recommended in production with sensitive data or high scale

## Critical Design Decisions and Trade-offs

### Eval Pipeline Sampling Rate

The decision to sample 5% of requests for full eval is the most important trade-off in this design. 100% eval with LLM-as-judge is technically feasible but economically unviable: at 2M requests/day using Claude 3 Haiku as evaluator (estimated 500 tokens per evaluation), the additional cost would be on the order of $1,500-2,000/day in eval calls alone, before counting the added latency if done synchronously. At 5%, the cost drops to ~$75-100/day. A tenant with 10,000 requests/day still yields about 500 evaluated samples per day, so statistical coverage is sufficient to detect quality regressions with confidence for volumes above 10,000 requests/day per tenant.
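
The arithmetic behind these figures can be checked with a short calculation. The per-evaluation cost below is not a published price; it is simply the value implied by the $1,500-2,000/day estimate at 100% sampling, and the function names are illustrative.

```python
REQUESTS_PER_DAY = 2_000_000


def daily_eval_cost(requests_per_day, sample_rate, cost_per_eval):
    """Eval spend per day: sampled requests times cost of one judge call."""
    return requests_per_day * sample_rate * cost_per_eval


def daily_samples(requests_per_day, sample_rate):
    """Number of requests that reach the full eval pipeline per day."""
    return round(requests_per_day * sample_rate)


# Per-evaluation cost implied by the document's 100% sampling estimate.
cost_per_eval_low = 1_500 / REQUESTS_PER_DAY
cost_per_eval_high = 2_000 / REQUESTS_PER_DAY

low_5pct = daily_eval_cost(REQUESTS_PER_DAY, 0.05, cost_per_eval_low)
high_5pct = daily_eval_cost(REQUESTS_PER_DAY, 0.05, cost_per_eval_high)
samples_small_tenant = daily_samples(10_000, 0.05)
```

Cost scales linearly with the sampling rate, so 5% sampling lands at roughly $75 to $100 per day, and a 10,000 requests/day tenant contributes about 500 scored samples per day.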

The sampling rate is configurable per tenant via a parameter in Systems Manager Parameter Store (`/llm-platform/{tenant_id}/eval_sample_rate`), allowing temporary increase to 100% during incident investigations or for tenants with premium quality SLAs.

### Metric Granularity vs. CloudWatch Cost

High-resolution metrics (1 second) in CloudWatch cost $0.30/metric/month, versus $0.10/metric/month for standard 1-minute resolution. For TTFT latency and token throughput, 1-second resolution is necessary to detect latency spikes that resolve within a minute. For cost and eval score metrics, 1-minute resolution is sufficient. The design uses high resolution selectively only for critical latency metrics, reducing metric cost by ~60% compared to instrumenting everything at high resolution.

### Prompt Data Isolation per Tenant

Prompts from regulated tenants (financial, health) must not be sent to the eval pipeline if that implies transmission to an external model or unencrypted storage. The design addresses this with two measures: (1) the eval pipeline uses exclusively Bedrock models in the same AWS account, with no data egress; (2) prompts are truncated to 2,000 characters before entering the Kinesis stream, and the DynamoDB `eval-results` table uses CMK (Customer Managed Key) via AWS KMS with a per-tenant key. Tenants can opt out of prompt storage entirely via a configuration flag — in that case, only the prompt hash is stored for correlation.
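
The truncation and opt-out rules can be sketched as a function that builds the record to be stored. The field names and the `store_prompt` flag are illustrative assumptions; per-tenant KMS encryption is a property of the table and is not shown here.

```python
import hashlib

MAX_PROMPT_CHARS = 2000  # truncation limit from the design


def build_eval_record(trace_id, tenant_id, model_id, prompt, completion, store_prompt=True):
    """Build the eval record, keeping only the prompt hash for opted-out tenants."""
    record = {
        "trace_id": trace_id,
        "tenant_id": tenant_id,
        "model_id": model_id,
        # The hash is always stored so complaints can be correlated later.
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "completion": completion,
    }
    if store_prompt:
        record["prompt"] = prompt[:MAX_PROMPT_CHARS]
    return record


long_prompt = "x" * 3000
stored = build_eval_record("t-1", "tenant-a", "llama3-ft", long_prompt, "ok")
opted_out = build_eval_record("t-2", "tenant-b", "llama3-ft", long_prompt, "ok", store_prompt=False)
```

The hash is computed over the full prompt rather than the truncated one, so two requests that differ only after character 2,000 remain distinguishable.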

### Drift Detection vs. Fixed Threshold Alarms

Fixed threshold alarms for eval score (e.g., "alert if score < 3.0") are brittle because score distribution varies by application domain and model. A summarization model may have an average score of 3.8; a code generation model may have an average score of 4.2. A single threshold would generate false positives on one and false negatives on the other. The solution is to use CloudWatch Anomaly Detection with an ML model trained on the historical eval score series per `(tenant_id, model_id)`. The alarm fires when the score falls below the lower edge of the expected band (band width of 2 standard deviations) for that specific context, which detects relative rather than absolute regression.
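
A sketch of how such an alarm could be declared, building the keyword arguments for boto3's `put_metric_alarm` with an `ANOMALY_DETECTION_BAND` expression. The alarm name pattern, the metric name `EvalScore`, the 5-minute period, and the 3 evaluation periods are assumptions chosen for the example.

```python
def anomaly_alarm_kwargs(tenant_id, model_id, eval_dimension, band_width=2, period_s=300):
    """Build put_metric_alarm arguments for a lower-band eval score alarm."""
    return {
        "AlarmName": f"eval-regression-{tenant_id}-{model_id}-{eval_dimension}",
        # Fire only when the score drops below the band, not when it rises.
        "ComparisonOperator": "LessThanLowerThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "band",
        "TreatMissingData": "notBreaching",
        "Metrics": [
            {
                "Id": "score",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "LLMPlatform/EvalScores",
                        "MetricName": "EvalScore",  # hypothetical metric name
                        "Dimensions": [
                            {"Name": "tenant_id", "Value": tenant_id},
                            {"Name": "model_id", "Value": model_id},
                            {"Name": "eval_dimension", "Value": eval_dimension},
                        ],
                    },
                    "Period": period_s,
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(score, {band_width})",
                "Label": "expected eval score",
                "ReturnData": True,
            },
        ],
    }


alarm = anomaly_alarm_kwargs("tenant-a", "llama3-ft", "relevance")
```

The dictionary would be passed as `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. Treating missing data as not breaching is a deliberate choice here, since sampled eval scores can be sparse for low-volume tenants.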

## Phased Rollout Plan

1. **Phase 1 — Weeks 1-2: Collection Foundation** — Deploy Inference Proxy (Lambda) with trace ID generation, token counting, and custom metric emission to CloudWatch. Validate that native SageMaker GPU metrics are arriving correctly. Configure X-Ray with 10% sampling rate for all requests. Create CloudWatch namespaces and basic latency/throughput dashboards in Grafana.

2. **Phase 2 — Weeks 3-4: Async Evaluation Pipeline** — Deploy Kinesis Data Stream `llm-eval-stream` and `eval-worker` function. Implement LLM-as-judge with Claude 3 Haiku on Bedrock. Configure 5% sampling by default. Set up DynamoDB `eval-results` tables with KMS CMK and S3 bucket for historical Parquet. Validate that eval scores are arriving as custom metrics in CloudWatch.

3. **Phase 3 — Weeks 5-6: Per-Tenant Cost and Alerts** — Deploy `cost-aggregator` function with pricing logic for SageMaker and Bedrock. Create `tenant-cost` table in DynamoDB and expose via API Gateway. Configure CloudWatch alarms for all three categories (infra, quality, cost). Initial training of Anomaly Detection models (requires at least 2 weeks of historical eval score data).

4. **Phase 4 — Weeks 7-8: Dashboards and Hardening** — Build complete Grafana dashboards (executive, operational, debug). Configure alarm routing via SNS to PagerDuty (critical severity) and Slack (warning severity). Security review: validate per-tenant data isolation, test KMS key rotation, review Inference Proxy IAM policies. Load test the eval pipeline at 10x expected volume to validate that the Kinesis stream does not become a bottleneck.

5. **Phase 5 — Weeks 9-10: Gradual Tenant Rollout** — Activate complete pipeline for pilot tenants (2-3 non-regulated, lower-volume tenants). Collect operational feedback and adjust alarm thresholds. Progressive activation for all tenants over 2 weeks, with intensive monitoring of the observability pipeline's own cost (target: observability overhead < 3% of total inference cost).

> **Critical Risks and Mitigations:** **1. CloudWatch metric cardinality explosion.** With 50 tenants × 10 models × 8 eval dimensions, the number of unique time series can reach 4,000. CloudWatch charges per unique metric: at $0.30/metric/month at high resolution, that's $1,200/month in eval metrics alone. Mitigation: publish to CloudWatch only the dimension combinations that alarms and dashboards actually use, because a concatenated `tenant_model_eval` dimension still produces one metric per distinct value and does not lower the count by itself; keep the full breakdown in DynamoDB and S3, and review metric cost monthly via Cost Explorer with the `observability-layer` tag.

**2. The LLM evaluator may become the bottleneck.** If `eval-worker` cannot process the Kinesis stream at the same rate events arrive, lag grows and quality alarms become stale. Mitigation: configure Enhanced Fan-Out on Kinesis, scale consumer concurrency (shard count and Lambda parallelization factor) when `IteratorAge` grows (alarm if > 5 minutes), and add a circuit breaker that automatically reduces the sampling rate if lag exceeds the threshold.

**3. Quality drift of the LLM evaluator itself.** If Anthropic updates Claude 3 Haiku (the evaluator), historical scores may become incomparable with new scores — invalidating the Anomaly Detection baseline. Mitigation: version the evaluator rubric prompt, store the evaluator model version alongside each score in DynamoDB, and recalculate the baseline when the evaluator model changes.

**4. Prompt data in CloudWatch logs.** If the Inference Proxy logs the full request/response payload for debugging, prompt data from regulated tenants ends up in CloudWatch Logs without the same level of control as the eval pipeline. Mitigation: configure log filtering in the proxy to mask `prompt` and `completion` fields by default, with explicit opt-in per tenant for full logging with reduced 7-day retention.

**5. Observability pipeline cost exceeding budget.** At high scale, the combined cost of Kinesis + Lambda + Bedrock eval + DynamoDB can exceed what was planned. Mitigation: implement an AWS Budgets alert specific to the `observability-layer` tag, with an alarm at 80% of the monthly budget and an automatic action to reduce the sampling rate.
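
Risks 2 and 5 both end in the same automatic action: lowering the sampling rate. A minimal sketch of that circuit breaker follows, assuming the rate is halved on each trigger and never drops below a 1% floor; the halving factor, the floor, and the function name are assumptions, while the 5-minute lag threshold and the 80% budget threshold come from the mitigations above.

```python
MAX_ITERATOR_AGE_MS = 5 * 60 * 1000  # 5-minute lag threshold
BUDGET_ALERT_RATIO = 0.80            # 80% of monthly budget


def next_sample_rate(current_rate, iterator_age_ms, spend, monthly_budget,
                     factor=0.5, floor=0.01):
    """Reduce the eval sampling rate when the stream lags or spend nears budget."""
    lagging = iterator_age_ms > MAX_ITERATOR_AGE_MS
    near_budget = spend >= monthly_budget * BUDGET_ALERT_RATIO
    if lagging or near_budget:
        # Back off multiplicatively, but keep a minimum of quality coverage.
        return max(current_rate * factor, floor)
    return current_rate


healthy = next_sample_rate(0.05, iterator_age_ms=20_000, spend=100.0, monthly_budget=1000.0)
lagging = next_sample_rate(0.05, iterator_age_ms=600_000, spend=100.0, monthly_budget=1000.0)
costly = next_sample_rate(0.05, iterator_age_ms=20_000, spend=850.0, monthly_budget=1000.0)
```

Keeping a non-zero floor matters because a breaker that drops sampling to zero would also silence the quality alarms it is meant to protect.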

## Success Metrics and Targets

- **Tracing coverage:** 100% of requests with propagated trace ID; X-Ray sampling ≥ 10% for latency analysis
- **Quality alarm latency:** Quality regression detected in < 15 minutes after degradation onset (with 5% sampling)
- **Per-tenant cost accuracy:** Attributed cost with < 2% error compared to actual AWS invoice (validated monthly)
- **Observability overhead:** Total observability pipeline cost < 3% of total inference cost
- **Proxy-added latency:** < 5ms latency overhead introduced by Inference Proxy (p99)
- **Eval pipeline availability:** ≥ 99.5% (graceful degradation: eval failure does not impact inference)
- **Dashboard coverage:** 100% of tenants with individual dashboard; GPU anomaly MTTD < 3 minutes

> **My Senior Take:** The most common mistake I see in production LLM platforms is treating observability as a second-phase feature — 'we'll instrument after it's working.' In LLMs, this is especially dangerous because quality degradation is silent: the system keeps returning HTTP 200 while delivering progressively worse responses. When the customer complains, you have no data to diagnose when it started, which tenant was affected first, or whether it was a model change, a load spike, or a specific prompt that triggered the problem.

The most important decision in this design is not technological — it's the separation between infrastructure metrics (GPU, latency) and semantic quality metrics. These are two feedback loops with different speeds: infra you monitor in seconds, quality you monitor in minutes with sampling. Trying to do both with the same tool and the same granularity leads to either exorbitant cost or insufficient coverage.

On LLM-as-judge: it's a pragmatic solution, but it needs to be treated as a component with its own lifecycle that can degrade. I version the rubric prompt as code, store the evaluator prompt hash alongside each score, and have a quarterly recalibration process where I compare automatic evaluator scores with human evaluation on a sample of 200-500 requests. Without this, you're measuring quality with a ruler that may be shrinking.
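
The versioning and recalibration practice described above can be sketched in a few lines. The record fields, the placeholder evaluator ID, and the choice of mean bias and mean absolute error as agreement measures are assumptions for illustration, not the author's exact process.

```python
import hashlib
from statistics import mean


def rubric_hash(rubric_prompt: str) -> str:
    """Fingerprint of the rubric prompt, stored next to every score."""
    return hashlib.sha256(rubric_prompt.encode("utf-8")).hexdigest()


def score_record(trace_id, score, evaluator_model_id, rubric_prompt):
    """Score plus the evaluator identity needed to compare it with later scores."""
    return {
        "trace_id": trace_id,
        "score": score,
        "evaluator_model_id": evaluator_model_id,
        "rubric_hash": rubric_hash(rubric_prompt),
    }


def comparable(a, b):
    """Two scores share a baseline only if evaluator and rubric are identical."""
    return (a["evaluator_model_id"], a["rubric_hash"]) == (b["evaluator_model_id"], b["rubric_hash"])


def judge_human_gap(judge_scores, human_scores):
    """Agreement between automatic and human scores on the calibration sample."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {"bias": mean(diffs), "mae": mean(abs(d) for d in diffs)}


old = score_record("t-1", 4, "evaluator-v1", "Rate relevance from 1 to 5.")
new = score_record("t-2", 4, "evaluator-v1", "Rate relevance from 1 to 5. Be strict.")
gap = judge_human_gap([4, 5, 3], [4, 4, 4])
```

A bias that moves away from zero between quarterly calibrations suggests that the evaluator has drifted, even when the model under test has not changed.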

Finally: the cost of the observability pipeline needs to be a first-class citizen in the product budget. At scale, Kinesis + Lambda + Bedrock eval + CloudWatch high-res metrics add up to non-trivial amounts. I always implement `observability-layer` as a separate billing tag and monitor that cost with the same discipline I apply to inference cost. The < 3% of inference cost target is reasonable and achievable with the described optimizations, but it requires active attention; it is not automatic.

## Verdict

This architecture solves a real and frequently neglected problem: the gap between knowing an LLM is responding and knowing it is responding well. The design is deliberately conservative in its technology choices — CloudWatch, Kinesis, Lambda, managed Grafana — because in inference platforms the operational risk is already high enough without adding unnecessary complexity in the observability layer.

The central trade-offs are clear: 5% sampling for eval is the balance point between statistical coverage and cost; CloudWatch Anomaly Detection is less flexible than a custom drift model but eliminates a significant operational dependency; the Inference Proxy as a single instrumentation point is a potential failure point that needs to be addressed with redundancy and circuit breakers.

What this design does not solve — and what would be the natural next step — is the feedback loop between observability and model improvement: using the collected eval data to feed fine-tuning or RLHF pipelines. That is explicitly out of scope here, but the architecture was designed so that the necessary data (historical Parquet in S3, per-prompt scores in DynamoDB) is available when that loop needs to be closed.

## References

- [AWS Machine Learning Blog — LLM Inference Observability](https://aws.amazon.com/blogs/machine-learning/)
- [Amazon Managed Grafana — Product Page](https://aws.amazon.com/grafana/)
- [Amazon CloudWatch — Product Page](https://aws.amazon.com/cloudwatch/)
- [Amazon SageMaker AI — Monitor Endpoints with CloudWatch](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html)
- [Amazon Bedrock — Model Invocation Logging](https://docs.aws.amazon.com/bedrock/latest/userguide/model-invocation-logging.html)
- [AWS X-Ray — Distributed Tracing](https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html)
- [CloudWatch Anomaly Detection](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html)
- [RAGAS — Retrieval Augmented Generation Assessment](https://github.com/explodinggradients/ragas)
