# Migrating to Stateful Cloud Native Platforms on AWS EKS

Migrating stateful workloads to cloud native platforms is not merely a containerization exercise — it is a sequence of isolation, data consistency, and operational automation decisions that determine whether the platform survives production. In this article, I walk through the journey of a financial-grade platform that moved from manually managed VMs to a multitenant EKS environment with Kubernetes operators, managed persistent storage, and end-to-end observability.

- URL: https://fernando.moretes.com/blog/servicos-stateful-multitenant-com-isolamento-e-automacao

- Markdown: https://fernando.moretes.com/blog/servicos-stateful-multitenant-com-isolamento-e-automacao/article.md?lang=en

- Published: 2026-05-26T12:00:00.000Z

- Category: Solution Architecture

- Tags: eks, stateful, kubernetes, multitenancy, migration, financial-grade, operators, observability

- Reading time: 11 min

- Source: [Stateful cloud native platforms](https://www.cncf.io/blog/)

---

Stateful multitenant cloud native platforms are the new minefield of infrastructure engineering. When state matters — and in financial systems it always does — Kubernetes' elasticity promise collides with the reality of consistency, cross-tenant data isolation, and partial failures that have no simple rollback. I have watched this migration be rushed, with teams treating stateful pods as if they were Lambda functions and paying the price in data corruption incidents and tenant isolation breaches. What I present here is the structured journey I would recommend — and have executed — for moving a financial platform from legacy VMs to a multitenant EKS environment with real guarantees.

## The Starting Point: VMs, Implicit State, and the Illusion of Control

The original platform ran on `r6i.4xlarge` EC2 instances with directly mounted gp2 EBS disks, a per-tenant PostgreSQL database on RDS Single-AZ, and a manually managed Redis cache layer bootstrapped via shell scripts. The model worked — until it didn't. Each tenant had its own instance set, which provided strong isolation but made operational cost prohibitive: 40 tenants meant 40 infrastructure stacks, 40 patch pipelines, 40 failover runbooks. The operations team spent more time managing infrastructure than delivering value.

State was implicit everywhere: user sessions in local memory, processing queues in unpartitioned PostgreSQL tables, and tenant configurations scattered across manually versioned `.env` files. There was no reliable inventory of which instance served which tenant at any given moment. When a node failed, recovery involved SSH, manual log inspection, and EBS snapshot restoration — an average MTTR of 47 minutes for data incidents.

Pressure to migrate came from two directions: cost (the EC2 bill grew linearly with tenant count) and compliance (a SOC 2 Type II audit identified the lack of auditable logical separation between tenants as a high-risk finding). The decision to move to EKS was not made out of trend-chasing — it was made because the Kubernetes operator model offered the lifecycle automation the team needed to scale without proportionally growing the operations headcount.

## The Migration Journey: Six Phases with Real Decisions

1. **Phase 1 — State Inventory and Workload Classification** — Before touching any infrastructure, we mapped every state surface: transactional data (PostgreSQL), session cache (Redis), job queues (PG tables), document blobs (already on S3), and tenant configuration (local files). Each category received a classification: stateless, stateful-ephemeral, or stateful-durable. Stateless workloads migrated first. Stateful-durable workloads migrated last — and required dedicated operators.

2. **Phase 2 — EKS Multitenancy Model Design** — We evaluated three models: cluster-per-tenant (strong isolation, high cost), namespace-per-tenant on a shared cluster (cost-efficient, isolation via RBAC and NetworkPolicy), and namespace-per-tenant with dedicated node pools per criticality tier. We chose the third: namespaces per tenant with separate node groups for Gold tenants (dedicated `r6i.2xlarge` instances with taints) and Silver/Bronze (shared node group with `m6i.xlarge`). This balanced cost and blast-radius isolation.

3. **Phase 3 — Persistence: EBS CSI, EFS, and Aurora Serverless v2** — For transactional data, we migrated from RDS Single-AZ to Aurora PostgreSQL Serverless v2 with read replicas per tenant-group. The EBS CSI driver with a gp3 `StorageClass` (3000 IOPS baseline, 125 MB/s throughput) replaced manually mounted gp2 volumes. For artifacts shared across pods within the same tenant (reports, uploads), we used EFS with per-namespace access points, enforcing filesystem isolation without accidental cross-tenant sharing.

4. **Phase 4 — Kubernetes Operators for Tenant Lifecycle** — We developed a custom operator (using controller-runtime) that reacted to a `TenantWorkspace` CRD. On `TenantWorkspace` creation, the operator provisioned: a namespace with tenant labels, ResourceQuota (CPU/memory/PVC), isolation NetworkPolicy, a ServiceAccount with IRSA binding to the tenant's KMS key, and a secret in AWS Secrets Manager referenced via External Secrets Operator. New tenant provisioning time dropped from 4 hours (manual) to 8 minutes (automated).

5. **Phase 5 — Data Migration with Controlled Cutover Window** — We used AWS DMS with continuous replication (CDC) to synchronize legacy RDS databases with the new Aurora clusters during a 72-hour shadow period. Cutover was executed tenant by tenant, starting with lowest transaction volume. Each cutover followed the pattern: (1) drain connections from the legacy app, (2) confirm replication lag < 5 seconds, (3) promote Aurora as primary, (4) update internal DNS via Route 53 weighted routing, (5) monitor for 30 minutes before decommissioning the legacy instance.

6. **Phase 6 — Observability and SLO Validation** — We instrumented all services with the OpenTelemetry SDK (traces and metrics), exporting to AWS X-Ray (traces) and CloudWatch Container Insights (infrastructure metrics). A Datadog dashboard consolidated per-tenant SLOs: 99.9% availability (30-day window), P99 latency < 800ms for write operations, and error rate < 0.1%. Burn rate alerts were configured to fire when more than 2% of the error budget was consumed in 1 hour (a 14.4x burn rate), well before the SLO itself was violated.
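
The per-tenant cutover pattern from Phase 5 is essentially a small state machine with gates. The sketch below is a minimal illustration in Python, not the tooling we ran: `CutoverState`, `next_cutover_step`, and the step names are hypothetical, while the 5-second lag threshold and the 30-minute monitoring period come from the steps above.

```python
from dataclasses import dataclass

MAX_LAG_SECONDS = 5.0   # gate from step (2): replication lag must be below 5 s
SOAK_MINUTES = 30       # gate from step (5): monitor before decommissioning


@dataclass
class CutoverState:
    """Observed state of one tenant's cutover (hypothetical structure)."""
    legacy_connections: int
    replication_lag_seconds: float
    aurora_promoted: bool = False
    dns_switched: bool = False
    minutes_monitored: int = 0


def next_cutover_step(state: CutoverState) -> str:
    """Return the next action for a tenant; each gate must pass before moving on."""
    if not state.aurora_promoted:
        if state.legacy_connections > 0:
            return "drain_connections"
        if state.replication_lag_seconds >= MAX_LAG_SECONDS:
            return "wait_for_replication"
        return "promote_aurora"
    if not state.dns_switched:
        return "switch_dns"
    if state.minutes_monitored < SOAK_MINUTES:
        return "monitor"
    return "decommission_legacy"


if __name__ == "__main__":
    print(next_cutover_step(CutoverState(legacy_connections=0, replication_lag_seconds=1.2)))
```

Keeping the decision a pure function of observed state makes the runbook easy to test and safe to re-run after an interruption, since every call re-evaluates the gates from scratch.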

## The Isolation Model: Beyond the Namespace

Namespace per tenant is necessary, but far from sufficient in a financial environment. The namespace is a naming and RBAC boundary — it does not prevent a misconfigured pod from accessing another tenant's network, nor does it prevent an encryption key from being inadvertently shared.

The isolation model we implemented has four layers:

**1. Network isolation:** A default `NetworkPolicy` denies all ingress and egress traffic within the cluster, except explicitly permitted routes. Each tenant can only communicate with its own namespace and the `platform-services` namespace (containing the egress proxy and authentication service). We used Calico as the CNI for `GlobalNetworkPolicy` support, which applies rules before namespace-level policies.

**2. Identity isolation:** Each tenant ServiceAccount has a dedicated IAM Role via IRSA (`eks.amazonaws.com/role-arn` annotation). The role's trust policy allows `sts:AssumeRoleWithWebIdentity` only under a `StringEquals` condition on the OIDC token's `sub` claim, which has the form `system:serviceaccount:<namespace>:<service-account-name>`, so only pods running under that ServiceAccount in the correct namespace can assume the role. This prevents privilege escalation even if a pod manages to create a ServiceAccount in another namespace.

**3. Encryption isolation:** Each tenant has its own KMS Customer Managed Key (CMK). EBS volumes, Secrets Manager secrets, and S3 data for each tenant are encrypted with the tenant's CMK. The CMK key policy uses `kms:ViaService` and `aws:SourceAccount` conditions to ensure only calls originating from the correct services can use the key.

**4. Resource isolation:** Per-namespace `ResourceQuota` and `LimitRange` prevent a tenant from monopolizing CPU or memory on the shared node group. For Gold tenants, taints on dedicated nodes ensure no pod from another tenant is scheduled there, even under scheduling pressure.
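
To make the first two layers concrete, here is a minimal Python sketch that builds the two documents as plain dictionaries: a namespace-scoped `NetworkPolicy` and an IRSA trust policy. The function names, the policy name `tenant-isolation`, and the example account and OIDC provider values are assumptions for illustration. The sketch also leaves out things a real policy needs, such as DNS egress to `kube-system` and the Calico `GlobalNetworkPolicy` layer.

```python
PLATFORM_NAMESPACE = "platform-services"  # shared namespace named in the article


def tenant_network_policy(namespace: str) -> dict:
    """Allow traffic only inside the tenant namespace and with platform-services."""
    allowed_peers = [
        {"podSelector": {}},  # any pod in the tenant's own namespace
        {
            "namespaceSelector": {
                # Label that Kubernetes sets automatically on every namespace.
                "matchLabels": {"kubernetes.io/metadata.name": PLATFORM_NAMESPACE}
            }
        },
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "tenant-isolation", "namespace": namespace},
        "spec": {
            "podSelector": {},  # select every pod in the namespace
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [{"from": allowed_peers}],
            "egress": [{"to": allowed_peers}],
        },
    }


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Trust policy that pins the role to one ServiceAccount in one namespace."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{oidc_provider}:sub":
                        f"system:serviceaccount:{namespace}:{service_account}",
                    f"{oidc_provider}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }


if __name__ == "__main__":
    import json
    # Example values are placeholders, not real identifiers.
    policy = irsa_trust_policy("111122223333",
                               "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
                               "tenant-acme", "app")
    print(json.dumps(policy, indent=2))
```

Generating both documents from the same tenant name is one way to keep the namespace in the `NetworkPolicy` and the namespace in the `sub` condition from drifting apart.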

## Stateful Multitenant Platform Architecture on EKS

Tenant provisioning and isolation flow: from TenantWorkspace CRD to persistent state, with security and observability layers.

### 🤖 EKS — Operator Plane

- TenantWorkspace Operator (controller-runtime) (compute)
- External Secrets Operator (security)

### 🔐 AWS — Security & Identity

- IRSA / OIDC per-tenant IAM Role (security)
- KMS CMK per-tenant key (security)
- Secrets Manager per-tenant secret (security)

### 📦 EKS — Tenant Namespace (Gold/Silver/Bronze)

- Namespace + NetworkPolicy + ResourceQuota (network)
- Stateful App Pods (gp3 EBS PVC) (compute)
- Redis StatefulSet ephemeral-stateful (data)

### 🗄️ AWS — Persistent Storage

- Aurora PostgreSQL Serverless v2 (per tenant-group) (data)
- EFS Access Point per-namespace (storage)
- S3 Bucket SSE-KMS per tenant (storage)

### 📊 Observability

- OTel Collector traces + metrics (compute)
- Datadog SLO dashboards (external)

### Flows

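- Admin -> TenantWorkspace Operator: applies CRD
- TenantWorkspace Operator -> Tenant Namespace: provisions namespace
- TenantWorkspace Operator -> IRSA: creates IAM binding
- TenantWorkspace Operator -> External Secrets Operator: triggers secret sync
- External Secrets Operator -> Secrets Manager: reads secret
- IRSA -> KMS CMK: authorizes CMK
- Tenant Namespace -> Stateful App Pods: contains pods
- Stateful App Pods -> Aurora PostgreSQL: transactional write
- Stateful App Pods -> Redis StatefulSet: session cache
- Stateful App Pods -> EFS Access Point: shared artifacts
- Stateful App Pods -> S3 Bucket: SSE-KMS blobs
- Stateful App Pods -> OTel Collector: traces + metrics
- OTel Collector -> Datadog: exports SLO data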

## Kubernetes Operators: The Difference Between Real Automation and Scripts with Kubectl

The decision to invest in a custom operator rather than Helm + kubectl scripts was the one we debated most internally. The argument against was legitimate: operators have high development cost, introduce a critical component into the control plane, and are difficult to debug when the reconciliation loop enters an inconsistent state.

The argument in favor, which prevailed, was based on three properties scripts do not have:

**Continuous reconciliation:** A script runs once. An operator continuously observes desired state versus actual state. When an engineer accidentally deleted a tenant's `NetworkPolicy` in production (it happened), the operator recreated it within 30 seconds — before any unauthorized traffic could occur. With scripts, that would have been a security incident.

**Full lifecycle management:** The operator handles creation, update, and deletion (with finalizers to ensure data is not deleted before backup). A provisioning script rarely includes safe deprovisioning logic — and when it does, it is tested with far less rigor.

**Native status and observability:** The `TenantWorkspace` CRD has a `.status.conditions` field with conditions like `StorageProvisioned`, `NetworkPolicyApplied`, `SecretsSync`. Any tool consuming the Kubernetes API can observe each tenant's state in real time. This was critical for the support team: instead of SSH-ing into nodes, they run `kubectl get tenantworkspace -n platform-ops` and see every tenant's state in seconds.
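
The core of continuous reconciliation is a diff between desired and observed state that runs on every loop iteration. The sketch below shows that idea in plain Python rather than controller-runtime (our operator was written against controller-runtime, which is Go). The `reconcile` function and the resource keys are hypothetical and much simpler than a real controller.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str]  # (verb, resource key)


def reconcile(desired: Dict[str, dict], actual: Dict[str, dict]) -> List[Action]:
    """Return the actions needed to converge actual state onto desired state."""
    actions: List[Action] = []
    for key in sorted(desired):
        if key not in actual:
            actions.append(("create", key))   # missing, e.g. deleted by hand
        elif actual[key] != desired[key]:
            actions.append(("update", key))   # drifted from the declared spec
    for key in sorted(actual):
        if key not in desired:
            actions.append(("delete", key))   # owned resource no longer declared
    return actions


if __name__ == "__main__":
    desired = {
        "NetworkPolicy/tenant-isolation": {"policyTypes": ["Ingress", "Egress"]},
        "ResourceQuota/tenant-quota": {"cpu": "8", "memory": "32Gi"},
    }
    # Someone deleted the NetworkPolicy and edited the quota manually.
    actual = {"ResourceQuota/tenant-quota": {"cpu": "16", "memory": "32Gi"}}
    print(reconcile(desired, actual))
```

Because the function is recomputed from scratch each time, it does not matter how the drift happened. A script only knows the steps it ran, whereas the loop only cares about the gap that exists now.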

The real cost of the operator was approximately 3 development sprints and 1 sprint of integration test hardening. Operational break-even was reached in month four, when the tenant count exceeded 60 and management complexity would have required hiring an additional operations engineer.

## Before and After: Operational Metrics from the Migration

- **47min → 4min** — Average MTTR for data incidents. Operator auto-reconciliation + burn rate alerts eliminated the manual diagnosis phase
- **4h → 8min** — New tenant provisioning time. TenantWorkspace CRD + operator replaced 4 hours of manual scripting and approvals
- **68% ↓** — Infrastructure cost reduction per tenant. Shared node groups for Silver/Bronze + Aurora Serverless v2 (scales to zero in idle hours) vs. dedicated EC2 24/7

## Stateful Persistence in Kubernetes: What Nobody Tells You About PVCs in Production

PersistentVolumeClaims backed by EBS have a fundamental limitation that frequently surprises migrating teams: an EBS volume can only be mounted on a single node at a time (`ReadWriteOnce`). This has direct implications for deployment strategies.

With a `Deployment` (not `StatefulSet`), a rolling update may attempt to create the new pod on a different node before the old pod releases the volume — resulting in the new pod stuck in `ContainerCreating` indefinitely, waiting for the volume to detach. In financial environments with short maintenance windows, this can be catastrophic.

The solution we adopted was threefold: (1) use `StatefulSet` for all EBS-backed PVC workloads, which guarantees the old pod is terminated before the new one is created; (2) configure `podManagementPolicy: Parallel` only for StatefulSets where startup order does not matter (Redis cache), keeping `OrderedReady` for those with initialization dependencies; (3) implement `PodDisruptionBudget` with `minAvailable: 1` to ensure node drains (during EKS upgrades) do not remove the sole pod of a critical tenant without a ready replacement.
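
One consequence of point (3) is worth spelling out with arithmetic. For an integer `minAvailable`, the number of voluntary disruptions a `PodDisruptionBudget` permits is the count of healthy pods minus `minAvailable`, floored at zero. The helper names below are hypothetical; the sketch only models that calculation and does not call the Kubernetes API.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions the budget permits right now (never negative)."""
    return max(0, healthy_pods - min_available)


def can_evict(healthy_pods: int, min_available: int) -> bool:
    """A node drain may evict one more pod only if at least one disruption is allowed."""
    return disruptions_allowed(healthy_pods, min_available) >= 1


if __name__ == "__main__":
    # Single-replica tenant: the drain is blocked until someone intervenes.
    print(can_evict(healthy_pods=1, min_available=1))
    # Two healthy replicas: one can be evicted while the other keeps serving.
    print(can_evict(healthy_pods=2, min_available=1))
```

With a single replica and `minAvailable: 1`, the budget protects the pod by refusing the eviction; it does not start a replacement on its own. The node upgrade procedure therefore needs an explicit step for single-replica tenants.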

For the use case of artifacts shared across multiple pods of the same tenant (asynchronously generated reports, for example), EFS with `ReadWriteMany` was the right choice — but with an important caveat: EFS has significantly higher write latency than EBS for synchronous operations (typically 1-5ms vs. 0.1-0.5ms for gp3). Any code path that wrote to EFS synchronously was refactored to asynchronous writes via SQS, with EFS as the final destination after processing.

> **Critical Risks That Almost Cost Us the Migration:**
>
> **1. Cross-tenant state leakage via shared cache:** In an early version, Redis was shared across Silver tenants using key prefixes as separators. A prefix bug in a code release exposed session data from one tenant to another for 12 minutes before detection. The fix was moving to Redis StatefulSets per tenant-group (not per individual tenant, which would have been cost-prohibitive) with ACL authentication per tenant.
>
> **2. EBS volume attachment storm during node group upgrades:** During a rolling upgrade of a node group with 30 tenants, all EBS volumes attempted to detach and reattach simultaneously. The 28-attachment limit per `r6i.2xlarge` instance (a Nitro limit that EBS volumes share with network interfaces) was hit, causing scheduling failures. The fix was setting `maxUnavailable: 1` on the node group and using PodDisruptionBudgets to enforce a controlled pod migration pace.
>
> **3. NetworkPolicy configuration drift:** With the reconciliation operator inactive for 40 minutes during a maintenance window, an emergency manual deploy created a pod without the correct tenant labels, which fell outside the NetworkPolicies and had unrestricted access to the cluster's internal network. This reinforced the rule: the operator must never be disabled, even during maintenance; it must be updated with zero downtime via rolling update.
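
For the attachment storm in risk 2, a rough capacity check helps before any node group upgrade. The sketch assumes the 28-attachment ceiling is the shared Nitro limit, which also counts network interfaces and the root volume. The function names and the interface count used in the example are illustrative inputs, not measurements from our cluster.

```python
import math

NITRO_SHARED_ATTACHMENT_LIMIT = 28  # shared by ENIs, EBS volumes, NVMe instance store


def pvc_slots_per_node(network_interfaces: int, root_volumes: int = 1,
                       limit: int = NITRO_SHARED_ATTACHMENT_LIMIT) -> int:
    """EBS attachments left for PVCs after ENIs and the root volume are counted."""
    return max(0, limit - network_interfaces - root_volumes)


def min_nodes_for_volumes(volume_count: int, slots_per_node: int) -> int:
    """Smallest number of nodes that can hold all volumes at the same time."""
    if slots_per_node <= 0:
        raise ValueError("node has no free attachment slots")
    return math.ceil(volume_count / slots_per_node)


if __name__ == "__main__":
    slots = pvc_slots_per_node(network_interfaces=4)  # assumed ENI count
    print(slots, min_nodes_for_volumes(30, slots))
```

If the volumes of all tenants being moved cannot fit on the nodes that remain schedulable during the upgrade, the upgrade pace has to drop, which is what `maxUnavailable: 1` enforces.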

## Observability in Multitenant Platforms: Tenant as a First-Class Dimension

The most common mistake I see in multitenant platforms is treating observability as an infrastructure concern — CPU, memory, and latency metrics aggregated at the cluster level. This is useless when a customer calls complaining of slowness: you cannot isolate the problem without tenant dimensions.

The most impactful architectural decision in the observability phase was defining `tenant_id` as a mandatory dimension on all platform metrics, traces, and logs. This was implemented in three layers:

**Application:** The OTel SDK was configured with a `Resource` that includes `tenant.id` and `tenant.tier` as attributes. All spans inherit these attributes automatically. On the metrics side, we used `exemplars` to link high-cardinality metrics to specific traces — essential for investigating P99 for a specific tenant.

**Infrastructure:** CloudWatch Container Insights was enabled with enhanced observability for EKS, collecting per-pod metrics with Kubernetes labels. A CloudWatch Logs metric filter extracts `tenant_id` from structured logs and publishes custom metrics with that dimension.

**Per-tenant SLOs:** In Datadog, each tenant has an independent SLO monitor. The burn rate alert uses a 1-hour window (fast alert) and 6-hour window (trend alert). When the 1-hour burn rate exceeds 14.4x (consuming 2% of the budget in 1 hour), a PagerDuty alert is created with `tenant_id` in the title — the on-call engineer immediately knows which tenant is impacted without investigating dashboards.
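
The 14.4x figure follows directly from the SLO window. A burn rate of 1x spends the whole budget in exactly 30 days, so spending 2% of it in 1 hour requires burning 0.02 × 720 = 14.4 times faster. The sketch below reproduces that arithmetic; the function names are hypothetical, and the 99.9% target is the availability SLO stated earlier.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that consumes budget_fraction of the budget within the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def error_rate_threshold(burn_rate: float, slo_target: float = 0.999) -> float:
    """Observed error rate that corresponds to a given burn rate."""
    return burn_rate * (1.0 - slo_target)


if __name__ == "__main__":
    fast = burn_rate_threshold(budget_fraction=0.02, alert_window_hours=1)
    print(f"fast-burn threshold: {fast:.1f}x")
    print(f"error rate at that burn: {error_rate_threshold(fast):.2%}")
```

For a 99.9% SLO this means the fast alert fires when roughly 1.44% of requests fail over the 1-hour window, which is a more intuitive number to sanity-check against a tenant's traffic.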

The ingestion cost of high-cardinality metrics was a real concern. The solution was to compute latency percentiles at query time with `histogram_quantile` over histogram buckets in Amazon Managed Service for Prometheus (AMP), rather than exporting a separate gauge per percentile. This reduced cardinality by 80% with equivalent precision for SLOs.

## Well-Architected Pillars Assessment

- **Security**: Four-layer isolation (network, identity, encryption, resources) with IRSA + KMS CMK per tenant. Default-deny NetworkPolicy with Calico GlobalNetworkPolicy. CloudTrail audit of all KMS calls by tenant_id.
- **Reliability**: StatefulSets with PodDisruptionBudgets ensure availability during upgrades. Aurora Serverless v2 with read replicas and automatic failover < 30s. Operator with continuous reconciliation prevents configuration drift.

> **My Curator's Note:** If I were starting this migration today, the one thing I would do differently is invest in the Kubernetes operator before migrating the first tenant, not after the fifth. The temptation to use Helm charts and bash scripts for the first few tenants is enormous, and the technical debt it creates is disproportionate. The hardest lesson I carry from this journey is that isolation in multitenant platforms is not a feature you add later: it is a property that must be demonstrated before the first production tenant, with namespace penetration tests and NetworkPolicy failure simulations. In financial environments, a cross-tenant data leak is not a severity-2 bug; it is a regulatory event.

## Verdict: Stateful Cloud Native Platforms Are Viable in Financial Production — With the Right Conditions

Migrating to a stateful cloud native multitenant platform on EKS is technically feasible and operationally superior to the per-tenant VM model, but only if you accept that the initial investment is larger than it appears. The Kubernetes operator, the four-layer isolation model, and tenant-dimensioned observability are not optional optimizations: they are prerequisites for operating safely at scale. The results are real: a 68% cost reduction per tenant, MTTR down from 47 minutes to 4 minutes, and provisioning down from 4 hours to 8 minutes. They were also preceded by 3 months of platform work that delivered no business features. If your organization does not have tolerance for that upfront investment, the per-tenant VM model, expensive as it is, is safer than a poorly isolated Kubernetes platform. Cloud native is not the right destination for everyone, but for those who do the work properly, the compounding operational advantage over time is hard to argue against.

## References

- [CNCF Blog: Stateful Cloud Native Platforms](https://www.cncf.io/blog/)
- [AWS EKS Best Practices Guide: Multi-tenancy](https://aws.github.io/aws-eks-best-practices/security/docs/multitenancy/)
- [AWS EKS EBS CSI Driver Documentation](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html)
- [AWS IRSA: IAM Roles for Service Accounts](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html)
- [Aurora Serverless v2: How it works](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-serverless-v2.html)
- [Kubernetes: Operator Pattern](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/)
- [External Secrets Operator](https://external-secrets.io/latest/)
- [OpenTelemetry: Resource Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/resource/)
