Migrating to Stateful Cloud Native Platforms on AWS EKS
Migrating stateful workloads to cloud native platforms is not merely a containerization exercise — it is a sequence of isolation, data consistency, and operational automation decisions that determine whether the platform survives production. In this article, I walk through the journey of a financial-grade platform that moved from manually managed VMs to a multitenant EKS environment with Kubernetes operators, managed persistent storage, and end-to-end observability.
Stateful multitenant cloud native platforms are the new minefield of infrastructure engineering. When state matters — and in financial systems it always does — Kubernetes' elasticity promise collides with the reality of consistency, cross-tenant data isolation, and partial failures that have no simple rollback. I have watched this migration be rushed, with teams treating stateful pods as if they were Lambda functions and paying the price in data corruption incidents and tenant isolation breaches. What I present here is the structured journey I would recommend — and have executed — for moving a financial platform from legacy VMs to a multitenant EKS environment with real guarantees.
The Starting Point: VMs, Implicit State, and the Illusion of Control
The original platform ran on r6i.4xlarge EC2 instances with directly mounted gp2 EBS disks, a per-tenant PostgreSQL database on RDS Single-AZ, and a manually managed Redis cache layer bootstrapped via shell scripts. The model worked — until it didn't. Each tenant had its own instance set, which provided strong isolation but made operational cost prohibitive: 40 tenants meant 40 infrastructure stacks, 40 patch pipelines, 40 failover runbooks. The operations team spent more time managing infrastructure than delivering value.
State was implicit everywhere: user sessions in local memory, processing queues in unpartitioned PostgreSQL tables, and tenant configurations scattered across manually versioned .env files. There was no reliable inventory of which instance served which tenant at any given moment. When a node failed, recovery involved SSH, manual log inspection, and EBS snapshot restoration — an average MTTR of 47 minutes for data incidents.
Pressure to migrate came from two directions: cost (the EC2 bill grew linearly with tenant count) and compliance (a SOC 2 Type II audit identified the lack of auditable logical separation between tenants as a high-risk finding). The decision to move to EKS was not made out of trend-chasing — it was made because the Kubernetes operator model offered the lifecycle automation the team needed to scale without proportionally growing the operations headcount.
The Migration Journey: Six Phases with Real Decisions
Phase 1: State Inventory and Workload Classification
Before touching any infrastructure, we mapped every state surface: transactional data (PostgreSQL), session cache (Redis), job queues (PG tables), document blobs (already on S3), and tenant configuration (local files). Each category received a classification: stateless, stateful-ephemeral, or stateful-durable. Stateless workloads migrated first. Stateful-durable workloads migrated last and required dedicated operators.
Phase 2: EKS Multitenancy Model Design
We evaluated three models: cluster-per-tenant (strong isolation, high cost), namespace-per-tenant on a shared cluster (cost-efficient, isolation via RBAC and NetworkPolicy), and namespace-per-tenant with dedicated node pools per criticality tier. We chose the third: namespaces per tenant with separate node groups for Gold tenants (dedicated r6i.2xlarge instances with taints) and Silver/Bronze tenants (a shared node group of m6i.xlarge instances). This balanced cost and blast-radius isolation.
Phase 3: Persistence with EBS CSI, EFS, and Aurora Serverless v2
For transactional data, we migrated from RDS Single-AZ to Aurora PostgreSQL Serverless v2 with read replicas per tenant-group. The EBS CSI driver with a gp3 StorageClass (3000 IOPS baseline, 125 MB/s throughput) replaced manually mounted gp2 volumes. For artifacts shared across pods within the same tenant (reports, uploads), we used EFS with per-namespace access points, enforcing filesystem isolation without accidental cross-tenant sharing.
Phase 4: Kubernetes Operators for Tenant Lifecycle
We developed a custom operator (using controller-runtime) that reacted to a TenantWorkspace CRD. On TenantWorkspace creation, the operator provisioned: a namespace with tenant labels, a ResourceQuota (CPU/memory/PVC), an isolation NetworkPolicy, a ServiceAccount with an IRSA binding that grants access to the tenant's KMS key, and a secret in AWS Secrets Manager referenced via External Secrets Operator. New tenant provisioning time dropped from 4 hours (manual) to 8 minutes (automated).
Phase 5: Data Migration with Controlled Cutover Window
We used AWS DMS with continuous replication (CDC) to synchronize legacy RDS databases with the new Aurora clusters during a 72-hour shadow period. Cutover was executed tenant by tenant, starting with the lowest transaction volume. Each cutover followed the pattern: (1) drain connections from the legacy app, (2) confirm replication lag < 5 seconds, (3) promote Aurora as primary, (4) update internal DNS via Route 53 weighted routing, (5) monitor for 30 minutes before decommissioning the legacy instance.
Phase 6: Observability and SLO Validation
We instrumented all services with the OpenTelemetry SDK (traces and metrics), exporting to AWS X-Ray (traces) and CloudWatch Container Insights (infrastructure metrics). A Datadog dashboard consolidated per-tenant SLOs: 99.9% availability (30-day window), P99 latency < 800ms for write operations, and error rate < 0.1%. Burn rate alerts were configured to fire when more than 5% of the error budget was consumed in 1 hour, well before the SLO itself would be violated.
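The Phase 5 cutover sequence can be read as a gated checklist in which each step must pass before the next one runs. Below is a minimal sketch of that decision logic, assuming hypothetical names (`CutoverState`, `next_cutover_step`) and reusing the thresholds from the text (lag under 5 seconds, 30 minutes of monitoring). It makes no AWS calls, and the rule that any error pauses decommissioning is my assumption, not something the original runbook states.

```python
from dataclasses import dataclass

MAX_LAG_SECONDS = 5.0   # from the text: replication lag must be < 5 s
MONITOR_MINUTES = 30    # from the text: observe before decommissioning


@dataclass
class CutoverState:
    """Hypothetical snapshot of one tenant's cutover progress."""
    legacy_connections: int     # open connections on the legacy app
    replication_lag_s: float    # current CDC replication lag in seconds
    aurora_promoted: bool
    dns_switched: bool
    minutes_since_switch: int
    errors_since_switch: int


def next_cutover_step(s: CutoverState) -> str:
    """Return the next action for a tenant, following the five-step order."""
    if s.legacy_connections > 0:
        return "drain-connections"
    if not s.aurora_promoted:
        # Promotion is gated on lag: never promote a target that is behind.
        if s.replication_lag_s >= MAX_LAG_SECONDS:
            return "wait-for-replication"
        return "promote-aurora"
    if not s.dns_switched:
        return "switch-dns"
    if s.errors_since_switch > 0:
        return "investigate"    # assumption: any error pauses decommissioning
    if s.minutes_since_switch < MONITOR_MINUTES:
        return "monitor"
    return "decommission-legacy"
```

Expressing the runbook as a pure function of observed state makes it safe to re-run after an interruption: the same state always yields the same next step.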
The Isolation Model: Beyond the Namespace
Namespace per tenant is necessary, but far from sufficient in a financial environment. The namespace is a naming and RBAC boundary — it does not prevent a misconfigured pod from accessing another tenant's network, nor does it prevent an encryption key from being inadvertently shared.
The isolation model we implemented has four layers:
1. Network isolation: A default NetworkPolicy denies all ingress and egress traffic within the cluster, except explicitly permitted routes. Each tenant can only communicate with its own namespace and the platform-services namespace (containing the egress proxy and authentication service). We used Calico as the CNI for its GlobalNetworkPolicy support, which lets cluster-wide rules be ordered ahead of namespace-level policies.
2. Identity isolation: Each tenant ServiceAccount has a dedicated IAM Role via IRSA (eks.amazonaws.com/role-arn annotation). The role's trust policy allows sts:AssumeRoleWithWebIdentity only under a StringEquals condition on the OIDC token's sub claim (system:serviceaccount:<namespace>:<service-account-name>), so only pods running under that ServiceAccount in the correct namespace can assume the role. This prevents privilege escalation even if a pod manages to create a ServiceAccount in another namespace.
3. Encryption isolation: Each tenant has its own KMS Customer Managed Key (CMK). EBS volumes, Secrets Manager secrets, and S3 data for each tenant are encrypted with the tenant's CMK. The CMK key policy uses kms:ViaService and aws:SourceAccount conditions to ensure only calls originating from the correct services can use the key.
4. Resource isolation: Per-namespace ResourceQuota and LimitRange prevent a tenant from monopolizing CPU or memory on the shared node group. For Gold tenants, taints on dedicated nodes ensure no pod from another tenant is scheduled there, even under scheduling pressure.
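To make the network and identity layers concrete, here is a sketch that builds the two documents involved as plain Python dictionaries: a namespace-scoped NetworkPolicy and an IRSA trust policy. The function names, the `tenant-isolation` policy name, and the example account and OIDC provider values are hypothetical. The sketch also omits the DNS egress rule that a real default-deny policy would need, and it does not cover the Calico GlobalNetworkPolicy layer.

```python
import json


def tenant_network_policy(namespace: str, shared_ns: str = "platform-services") -> dict:
    """NetworkPolicy limiting a tenant namespace to itself and the shared namespace."""
    # kubernetes.io/metadata.name is set automatically on namespaces by
    # recent Kubernetes versions, so it works as a stable selector.
    peers = [
        {"namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": ns}}}
        for ns in (namespace, shared_ns)
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "tenant-isolation", "namespace": namespace},
        "spec": {
            "podSelector": {},                     # every pod in the namespace
            "policyTypes": ["Ingress", "Egress"],  # traffic not listed below is denied
            "ingress": [{"from": peers}],
            "egress": [{"to": peers}],
        },
    }


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """IAM trust policy pinned to one ServiceAccount in one namespace."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # The sub claim encodes both namespace and ServiceAccount name.
                f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{oidc_provider}:aud": "sts.amazonaws.com",
            }},
        }],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    provider = "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
    print(json.dumps(irsa_trust_policy("111122223333", provider, "tenant-a", "app"), indent=2))
```

Because the sub condition uses StringEquals rather than a wildcard match, a ServiceAccount with the same name in a different namespace produces a different sub value and fails the condition.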
Stateful Multitenant Platform Architecture on EKS
Tenant provisioning and isolation flow: from TenantWorkspace CRD to persistent state, with security and observability layers.
- TenantWorkspace · Operator (controller-runtime)
- External Secrets · Operator
- IRSA / OIDC · per-tenant IAM Role
- KMS CMK · per-tenant key
- Secrets Manager · per-tenant secret
- Namespace · + NetworkPolicy + ResourceQuota
- Stateful App Pods · (gp3 EBS PVC)
- Redis StatefulSet · ephemeral-stateful
- Aurora PostgreSQL · Serverless v2 (per tenant-group)
- EFS Access Point · per-namespace
- S3 Bucket · SSE-KMS per tenant
- OTel Collector · traces + metrics
- Datadog · SLO dashboards
Kubernetes Operators: The Difference Between Real Automation and Scripts with Kubectl
The decision to invest in a custom operator rather than Helm + kubectl scripts was the one most debated internally. The argument against was legitimate: operators have high development cost, introduce a critical component into the control plane, and are difficult to debug when the reconciliation loop enters an inconsistent state.
The argument in favor, which prevailed, was based on three properties scripts do not have:
Continuous reconciliation: A script runs once. An operator continuously observes desired state versus actual state. When an engineer accidentally deleted a tenant's NetworkPolicy in production (it happened), the operator recreated it within 30 seconds — before any unauthorized traffic could occur. With scripts, that would have been a security incident.
Full lifecycle management: The operator handles creation, update, and deletion (with finalizers to ensure data is not deleted before backup). A provisioning script rarely includes safe deprovisioning logic — and when it does, it is tested with far less rigor.
Native status and observability: The TenantWorkspace CRD has a .status.conditions field with conditions like StorageProvisioned, NetworkPolicyApplied, SecretsSync. Any tool consuming the Kubernetes API can observe each tenant's state in real time. This was critical for the support team: instead of SSH-ing into nodes, they run kubectl get tenantworkspace -n platform-ops and see every tenant's state in seconds.
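The first and third properties can be illustrated with a toy reconciliation pass. The production operator was built on controller-runtime in Go; the sketch below is only an in-memory Python model of the idea, with a hypothetical `Cluster` stand-in for the Kubernetes API and an assumed mapping from owned resource kinds to the condition names mentioned above.

```python
from dataclasses import dataclass, field

# Resource kinds the hypothetical operator owns per tenant, mapped to the
# status condition that reports on each of them (mapping is an assumption).
OWNED = {
    "NetworkPolicy": "NetworkPolicyApplied",
    "PersistentVolumeClaim": "StorageProvisioned",
    "ExternalSecret": "SecretsSync",
}


@dataclass
class Cluster:
    """In-memory stand-in for the Kubernetes API, for illustration only."""
    objects: set = field(default_factory=set)   # {(namespace, kind)}


def reconcile(tenant_ns: str, cluster: Cluster):
    """One idempotent pass: recreate whatever is missing, then report conditions."""
    repaired = []
    for kind in OWNED:
        if (tenant_ns, kind) not in cluster.objects:
            cluster.objects.add((tenant_ns, kind))   # "create" the missing object
            repaired.append(kind)
    # Conditions describe the observed state after the pass.
    conditions = {cond: (tenant_ns, kind) in cluster.objects
                  for kind, cond in OWNED.items()}
    return repaired, conditions


if __name__ == "__main__":
    cluster = Cluster()
    reconcile("tenant-a", cluster)                        # initial provisioning
    cluster.objects.discard(("tenant-a", "NetworkPolicy"))  # accidental deletion
    repaired, conditions = reconcile("tenant-a", cluster)
    print(repaired, conditions)
```

A script corresponds to calling `reconcile` once. An operator calls it on every change event and on a periodic resync, which is what turned the deleted NetworkPolicy into a non-event.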
The real cost of the operator was approximately 3 development sprints and 1 sprint of integration test hardening. Operational break-even was reached in month four, when the tenant count exceeded 60 and management complexity would have required hiring an additional operations engineer.
Before and After: Operational Metrics from the Migration
Stateful Persistence in Kubernetes: What Nobody Tells You About PVCs in Production
PersistentVolumeClaims backed by EBS have a fundamental limitation that frequently surprises migrating teams: an EBS volume can only be mounted on a single node at a time (ReadWriteOnce). This has direct implications for deployment strategies.
With a Deployment (not StatefulSet), a rolling update may attempt to create the new pod on a different node before the old pod releases the volume — resulting in the new pod stuck in ContainerCreating indefinitely, waiting for the volume to detach. In financial environments with short maintenance windows, this can be catastrophic.
The solution we adopted was threefold: (1) use StatefulSet for all EBS-backed PVC workloads, which guarantees the old pod is terminated before the new one is created; (2) configure podManagementPolicy: Parallel only for StatefulSets where startup order does not matter (Redis cache), keeping OrderedReady for those with initialization dependencies; (3) implement PodDisruptionBudget with minAvailable: 1 to ensure node drains (during EKS upgrades) do not remove the sole pod of a critical tenant without a ready replacement.
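A sketch of what those three choices look like as manifests, written as Python dictionaries so they can be serialized to YAML or JSON. The resource names, image tag, volume size, and the `gp3` StorageClass name are assumptions for illustration. The small `disruptions_allowed` helper mirrors how a PodDisruptionBudget with minAvailable evaluates evictions.

```python
def redis_statefulset(namespace: str, replicas: int = 1) -> dict:
    """StatefulSet with one gp3-backed ReadWriteOnce volume per pod."""
    labels = {"app": "redis"}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": "redis", "namespace": namespace},
        "spec": {
            "serviceName": "redis",
            "replicas": replicas,
            # Parallel is acceptable here because a cache has no startup order.
            "podManagementPolicy": "Parallel",
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{
                    "name": "redis",
                    "image": "redis:7",   # assumed image tag
                    "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                }]},
            },
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": "gp3",   # assumed StorageClass name
                    "resources": {"requests": {"storage": "20Gi"}},
                },
            }],
        },
    }


def pod_disruption_budget(namespace: str, min_available: int = 1) -> dict:
    """PDB guarding the Redis pods against voluntary evictions such as drains."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": "redis-pdb", "namespace": namespace},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": "redis"}},
        },
    }


def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions the budget permits right now."""
    return max(0, healthy_pods - min_available)
```

One consequence worth knowing: with a single healthy pod and minAvailable: 1, the budget permits zero evictions, so a node drain waits instead of removing the pod. That is the protective behavior described above, but it also means the upgrade procedure has to account for how that pod eventually moves.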
For the use case of artifacts shared across multiple pods of the same tenant (asynchronously generated reports, for example), EFS with ReadWriteMany was the right choice — but with an important caveat: EFS has significantly higher write latency than EBS for synchronous operations (typically 1-5ms vs. 0.1-0.5ms for gp3). Any code path that wrote to EFS synchronously was refactored to asynchronous writes via SQS, with EFS as the final destination after processing.
Critical Risks That Almost Cost Us the Migration
1. Cross-tenant state leakage via shared cache: In an early version, Redis was shared across Silver tenants using key prefixes as separators. A prefix bug in a code release exposed session data from one tenant to another for 12 minutes before detection. The fix was moving to Redis StatefulSets per tenant-group (not per individual tenant — prohibitive cost) with ACL authentication per tenant.
2. EBS volume attachment storm during node group upgrades: During a rolling upgrade of a node group with 30 tenants, all EBS volumes attempted to detach and reattach simultaneously. The 28-volume EBS limit per r6i.2xlarge instance was hit, causing scheduling failures. The fix was setting maxUnavailable: 1 on the node group and using PodDisruptionBudgets to enforce a controlled pod migration pace.
3. NetworkPolicy configuration drift: With the reconciliation operator inactive for 40 minutes during a maintenance window, an emergency manual deploy created a pod without the correct tenant labels, which fell outside the NetworkPolicies and had unrestricted access to the cluster's internal network. This reinforced the rule: the operator must never be disabled, even during maintenance — it must be updated with zero downtime via rolling update.
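For the first risk, the point of per-tenant ACL authentication is that Redis itself enforces the key boundary instead of trusting application code to build prefixes correctly. A minimal sketch of generating the `ACL SETUSER` arguments for one tenant follows. The user naming scheme, the `tenant:<id>:` key prefix, and the choice of excluded command categories are assumptions, not the platform's actual configuration.

```python
def acl_setuser_args(tenant_id: str, password: str) -> list:
    """Arguments for Redis ACL SETUSER, restricting a tenant to its own keys."""
    if not tenant_id.isalnum():
        # Reject glob characters such as '*' that would widen the key pattern.
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return [
        "ACL", "SETUSER", f"tenant-{tenant_id}",
        "reset",                    # start from a clean rule set (idempotent)
        "on",                       # enable the user
        f">{password}",             # set its password
        f"~tenant:{tenant_id}:*",   # only keys under this tenant's prefix
        "+@all",                    # allow commands in general...
        "-@admin", "-@dangerous",   # ...except administrative and risky ones
    ]


if __name__ == "__main__":
    # The resulting list can be passed to a Redis client's generic
    # command-execution call, or joined for redis-cli.
    print(" ".join(acl_setuser_args("acme42", "example-password")))
```

With a rule like this in place, a prefix bug in the application results in a permission error from Redis rather than a read of another tenant's session data.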
Observability in Multitenant Platforms: Tenant as a First-Class Dimension
The most common mistake I see in multitenant platforms is treating observability as an infrastructure concern — CPU, memory, and latency metrics aggregated at the cluster level. This is useless when a customer calls complaining of slowness: you cannot isolate the problem without tenant dimensions.
The most impactful architectural decision in the observability phase was defining tenant_id as a mandatory dimension on all platform metrics, traces, and logs. This was implemented in three layers:
Application: The OTel SDK was configured with a Resource that includes tenant.id and tenant.tier as attributes. All spans inherit these attributes automatically. On the metrics side, we used exemplars to link high-cardinality metrics to specific traces — essential for investigating P99 for a specific tenant.
Infrastructure: CloudWatch Container Insights was configured with enhanced observability on EKS, collecting per-pod metrics with Kubernetes labels. A CloudWatch Logs metric filter extracts tenant_id from structured logs and publishes custom metrics with that dimension.
Per-tenant SLOs: In Datadog, each tenant has an independent SLO monitor. The burn rate alert uses a 1-hour window (fast alert) and 6-hour window (trend alert). When the 1-hour burn rate exceeds 14.4x (consuming 2% of the budget in 1 hour), a PagerDuty alert is created with tenant_id in the title — the on-call engineer immediately knows which tenant is impacted without investigating dashboards.
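The arithmetic behind the 14.4x threshold is short enough to write down. The sketch below assumes the 99.9% SLO over a 30-day window stated earlier; the function names are illustrative and unrelated to Datadog's API.

```python
SLO_PERIOD_HOURS = 30 * 24    # 30-day SLO window from the text
FAST_BURN_THRESHOLD = 14.4    # 1-hour paging threshold from the text


def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1.0 - slo)


def budget_consumed(burn: float, window_hours: float) -> float:
    """Fraction of the whole period's error budget spent during the window."""
    return burn * window_hours / SLO_PERIOD_HOURS


def should_page(error_rate_1h: float, slo: float = 0.999) -> bool:
    """Fast-burn check over the 1-hour window."""
    return burn_rate(error_rate_1h, slo) > FAST_BURN_THRESHOLD


if __name__ == "__main__":
    print(budget_consumed(FAST_BURN_THRESHOLD, 1))   # roughly 0.02 of the budget
    print(should_page(0.02), should_page(0.005))
```

At a 99.9% SLO the error budget is 0.1%, so a 14.4x burn corresponds to a sustained 1.44% error rate over the hour. Held for the full hour, that spends 14.4 / 720 = 2% of the 30-day budget.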
The ingestion cost of high-cardinality metrics was a real concern. The solution was to export latency as histograms and compute percentiles at query time with histogram_quantile in Amazon Managed Service for Prometheus (AMP), rather than emitting per-percentile gauges. This reduced cardinality by 80% with equivalent precision for SLOs.
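To show why one set of histogram buckets can stand in for many per-percentile series, here is a simplified Python re-implementation of the estimate that PromQL's histogram_quantile performs on classic cumulative buckets: find the bucket containing the target rank, then interpolate linearly inside it. It is a sketch of the technique, not Prometheus code, and the bucket boundaries in the example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative buckets.

    buckets: [(upper_bound, cumulative_count), ...] sorted by upper_bound,
    with the last entry being (math.inf, total_count).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0   # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper edge (or an empty bucket): fall back to the
                # highest bound known to be reached.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical write-latency buckets in seconds for a single tenant.
    latency = [(0.1, 50), (0.2, 80), (0.4, 95), (0.8, 99), (math.inf, 100)]
    for q in (0.5, 0.9, 0.99):
        print(q, histogram_quantile(q, latency))
```

The same five bucket series answer P50, P90, and P99 queries, so adding a percentile costs nothing at ingestion. The trade-off is that precision depends on bucket boundaries, which is why a boundary near the 800ms SLO threshold matters.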
Well-Architected Pillars Assessment
Security
Four-layer isolation (network, identity, encryption, resources) with IRSA + KMS CMK per tenant. Default-deny NetworkPolicy with Calico GlobalNetworkPolicy. CloudTrail audit of all KMS calls by tenant_id.
Reliability
StatefulSets with PodDisruptionBudgets ensure availability during upgrades. Aurora Serverless v2 with read replicas and automatic failover < 30s. Operator with continuous reconciliation prevents configuration drift.
If I were starting this migration today, the one thing I would do differently is invest in the Kubernetes operator before migrating the first tenant, not after the fifth. The temptation to use Helm charts and bash scripts for the first few tenants is enormous, and the technical debt it creates is disproportionate. The hardest lesson I carry from this journey is that isolation in multitenant platforms is not a feature you add later: it is a property that must be demonstrated before the first production tenant, with namespace penetration tests and NetworkPolicy failure simulations. In financial environments, a cross-tenant data leak is not a severity-2 bug; it is a regulatory event.
Verdict: Stateful Cloud Native Platforms Are Viable in Financial Production — With the Right Conditions
Migrating to a stateful cloud native multitenant platform on EKS is technically feasible and operationally superior to the per-tenant VM model — but only if you accept that the initial investment is larger than it appears. The Kubernetes operator, the four-layer isolation model, and tenant-dimensioned observability are not optional optimizations: they are prerequisites for operating safely at scale. The numbers speak for themselves: 68% cost reduction per tenant, MTTR from 47 minutes to 4 minutes, and provisioning from 4 hours to 8 minutes are real results — but they were preceded by 3 months of platform work that delivered no business features. If your organization does not have tolerance for that upfront investment, the per-tenant VM model, expensive as it is, is safer than a poorly isolated Kubernetes platform. Cloud native is not the right destination for everyone — but for those who do the work properly, the compounding operational advantage over time is hard to argue against.