Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

What technical topics does Fernando work with?

Fernando works with AWS, Kubernetes, Kafka, Data Mesh, Amazon Bedrock, RAG, DevSecOps, observability, financial systems and architecture communication using C4, ADRs and trade-off analysis.

Is Fernando available for professional conversations?

Fernando is currently building at Banco Itaú and is open to thoughtful conversations about architecture, cloud, AI, engineering leadership, community, podcasts and technical collaboration.

Solution Architecture · Migration Story

Migrating to Stateful Cloud Native Platforms on AWS EKS

May 26, 2026 · 11 min · expert · AI-assisted



Migrating stateful workloads to cloud native platforms is not merely a containerization exercise — it is a sequence of isolation, data consistency, and operational automation decisions that determine whether the platform survives production. In this article, I walk through the journey of a financial-grade platform that moved from manually managed VMs to a multitenant EKS environment with Kubernetes operators, managed persistent storage, and end-to-end observability.

Stateful multitenant cloud native platforms are the new minefield of infrastructure engineering. When state matters — and in financial systems it always does — Kubernetes' elasticity promise collides with the reality of consistency, cross-tenant data isolation, and partial failures that have no simple rollback. I have watched this migration be rushed, with teams treating stateful pods as if they were Lambda functions and paying the price in data corruption incidents and tenant isolation breaches. What I present here is the structured journey I would recommend — and have executed — for moving a financial platform from legacy VMs to a multitenant EKS environment with real guarantees.

The Starting Point: VMs, Implicit State, and the Illusion of Control

The original platform ran on r6i.4xlarge EC2 instances with directly mounted gp2 EBS disks, a per-tenant PostgreSQL database on RDS Single-AZ, and a manually managed Redis cache layer bootstrapped via shell scripts. The model worked — until it didn't. Each tenant had its own instance set, which provided strong isolation but made operational cost prohibitive: 40 tenants meant 40 infrastructure stacks, 40 patch pipelines, 40 failover runbooks. The operations team spent more time managing infrastructure than delivering value.

State was implicit everywhere: user sessions in local memory, processing queues in unpartitioned PostgreSQL tables, and tenant configurations scattered across manually versioned .env files. There was no reliable inventory of which instance served which tenant at any given moment. When a node failed, recovery involved SSH, manual log inspection, and EBS snapshot restoration — an average MTTR of 47 minutes for data incidents.

Pressure to migrate came from two directions: cost (the EC2 bill grew linearly with tenant count) and compliance (a SOC 2 Type II audit identified the lack of auditable logical separation between tenants as a high-risk finding). The decision to move to EKS was not made out of trend-chasing — it was made because the Kubernetes operator model offered the lifecycle automation the team needed to scale without proportionally growing the operations headcount.

The Migration Journey: Six Phases with Real Decisions

Phase 1 — State Inventory and Workload Classification
Before touching any infrastructure, we mapped every state surface: transactional data (PostgreSQL), session cache (Redis), job queues (PG tables), document blobs (already on S3), and tenant configuration (local files). Each category received a classification: stateless, stateful-ephemeral, or stateful-durable. Stateless workloads migrated first. Stateful-durable workloads migrated last — and required dedicated operators.
Phase 2 — EKS Multitenancy Model Design
We evaluated three models: cluster-per-tenant (strong isolation, high cost), namespace-per-tenant on a shared cluster (cost-efficient, isolation via RBAC and NetworkPolicy), and namespace-per-tenant with dedicated node pools per criticality tier. We chose the third: namespaces per tenant with separate node groups for Gold tenants (dedicated r6i.2xlarge instances with taints) and Silver/Bronze (shared node group with m6i.xlarge). This balanced cost and blast-radius isolation.
Phase 3 — Persistence: EBS CSI, EFS, and Aurora Serverless v2
For transactional data, we migrated from RDS Single-AZ to Aurora PostgreSQL Serverless v2 with read replicas per tenant-group. The EBS CSI driver with a gp3 StorageClass (3000 IOPS baseline, 125 MB/s throughput) replaced manually mounted gp2 volumes. For artifacts shared across pods within the same tenant (reports, uploads), we used EFS with per-namespace access points, enforcing filesystem isolation without accidental cross-tenant sharing.
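
As an illustration of the gp3 StorageClass mentioned above, here is a minimal sketch that builds the manifest as a Python dictionary and prints it as JSON, which kubectl accepts. The class name, the encryption flag, and the binding and expansion settings are assumptions for illustration, not details of the original setup; the IOPS and throughput values are the ones quoted in the text.

```python
import json


def gp3_storage_class(name: str = "gp3-tenant") -> dict:
    """Build a StorageClass manifest for the EBS CSI driver.

    The name and the binding/expansion settings are illustrative assumptions.
    """
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "ebs.csi.aws.com",
        # StorageClass parameters are always strings.
        "parameters": {
            "type": "gp3",
            "iops": "3000",        # gp3 baseline IOPS
            "throughput": "125",   # gp3 baseline throughput
            "encrypted": "true",   # assumption: volumes encrypted at rest
        },
        # Create the volume only once a pod is scheduled, so it lands in the
        # same Availability Zone as the node that will mount it.
        "volumeBindingMode": "WaitForFirstConsumer",
        "allowVolumeExpansion": True,
    }


if __name__ == "__main__":
    print(json.dumps(gp3_storage_class(), indent=2))
```

Delaying binding until first use matters for EBS because a volume lives in a single Availability Zone and can only be attached to nodes in that zone.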
Phase 4 — Kubernetes Operators for Tenant Lifecycle
We developed a custom operator (using controller-runtime) that reacted to a TenantWorkspace CRD. On TenantWorkspace creation, the operator provisioned: a namespace with tenant labels, ResourceQuota (CPU/memory/PVC), isolation NetworkPolicy, a ServiceAccount with IRSA binding to the tenant's KMS key, and a secret in AWS Secrets Manager referenced via External Secrets Operator. New tenant provisioning time dropped from 4 hours (manual) to 8 minutes (automated).
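
To make the provisioning step concrete, the sketch below derives the child objects for one workspace. It is written in Python for brevity, while the operator itself used controller-runtime. The TenantWorkspace fields, the naming scheme, and the label keys are hypothetical, and the Secrets Manager and External Secrets pieces are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantWorkspace:
    """Hypothetical, simplified view of a TenantWorkspace spec."""
    tenant_id: str
    tier: str           # "gold", "silver" or "bronze"
    cpu_limit: str      # e.g. "8"
    memory_limit: str   # e.g. "16Gi"
    max_pvcs: int


def desired_objects(tw: TenantWorkspace, role_arn: str) -> list:
    """Return the child manifests one workspace should own, in creation order."""
    ns = f"tenant-{tw.tenant_id}"  # assumed naming scheme
    labels = {"tenant-id": tw.tenant_id, "tenant-tier": tw.tier}
    return [
        {"apiVersion": "v1", "kind": "Namespace",
         "metadata": {"name": ns, "labels": labels}},
        {"apiVersion": "v1", "kind": "ResourceQuota",
         "metadata": {"name": "tenant-quota", "namespace": ns},
         "spec": {"hard": {
             "limits.cpu": tw.cpu_limit,
             "limits.memory": tw.memory_limit,
             "persistentvolumeclaims": str(tw.max_pvcs),
         }}},
        # An empty podSelector selects every pod in the namespace; listing both
        # policy types with no rules denies all ingress and egress by default.
        {"apiVersion": "networking.k8s.io/v1", "kind": "NetworkPolicy",
         "metadata": {"name": "default-deny", "namespace": ns},
         "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]}},
        # The annotation is what binds the ServiceAccount to its IAM Role (IRSA).
        {"apiVersion": "v1", "kind": "ServiceAccount",
         "metadata": {"name": "tenant-app", "namespace": ns,
                      "annotations": {"eks.amazonaws.com/role-arn": role_arn}}},
    ]
```

Keeping this derivation a pure function of the spec is a deliberate choice in the sketch: the same function can be called on every reconciliation pass and compared against what actually exists in the cluster.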
Phase 5 — Data Migration with Controlled Cutover Window
We used AWS DMS with continuous replication (CDC) to synchronize legacy RDS databases with the new Aurora clusters during a 72-hour shadow period. Cutover was executed tenant by tenant, starting with lowest transaction volume. Each cutover followed the pattern: (1) drain connections from the legacy app, (2) confirm replication lag < 5 seconds, (3) promote Aurora as primary, (4) update internal DNS via Route 53 weighted routing, (5) monitor for 30 minutes before decommissioning the legacy instance.
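
The first four cutover steps can be expressed as a gate that stops at the first unmet precondition. This is a minimal sketch, not the tooling we ran: the callables stand in for whatever checks connection counts, reads replication lag, promotes the target, and switches DNS, and all names are hypothetical. The 30-minute observation step is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

MAX_LAG_SECONDS = 5.0  # threshold from step (2)


@dataclass
class CutoverResult:
    tenant_id: str
    completed: List[str] = field(default_factory=list)
    aborted_reason: Optional[str] = None


def run_cutover(
    tenant_id: str,
    active_connections: Callable[[], int],
    replication_lag: Callable[[], float],
    promote: Callable[[], None],
    switch_dns: Callable[[], None],
) -> CutoverResult:
    """Run steps 1-4 in order; abort before any irreversible action if a check fails."""
    result = CutoverResult(tenant_id)

    if active_connections() > 0:
        result.aborted_reason = "legacy connections not drained"
        return result
    result.completed.append("drained")

    lag = replication_lag()
    if lag >= MAX_LAG_SECONDS:
        result.aborted_reason = f"replication lag {lag:.1f}s is not below {MAX_LAG_SECONDS:.0f}s"
        return result
    result.completed.append("lag-confirmed")

    promote()
    result.completed.append("promoted")
    switch_dns()
    result.completed.append("dns-switched")
    return result
```

Both checks run before the promotion, so an aborted cutover leaves the legacy database as the source of truth and can simply be retried later.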
Phase 6 — Observability and SLO Validation
We instrumented all services with the OpenTelemetry SDK (traces and metrics), exporting to AWS X-Ray (traces) and CloudWatch Container Insights (infrastructure metrics). A Datadog dashboard consolidated per-tenant SLOs: 99.9% availability (30-day window), P99 latency < 800ms for write operations, and error rate < 0.1%. Burn rate alerts were configured to fire when more than 2% of the error budget was consumed within 1 hour, well before the SLO itself was violated.

The Isolation Model: Beyond the Namespace

Namespace per tenant is necessary, but far from sufficient in a financial environment. The namespace is a naming and RBAC boundary — it does not prevent a misconfigured pod from accessing another tenant's network, nor does it prevent an encryption key from being inadvertently shared.

The isolation model we implemented has four layers:

1. Network isolation: A default NetworkPolicy denies all ingress and egress traffic within the cluster, except explicitly permitted routes. Each tenant can only communicate with its own namespace and the platform-services namespace (containing the egress proxy and authentication service). We used Calico as the CNI for its GlobalNetworkPolicy support: these cluster-scoped policies are given a lower order value, so they are evaluated before the namespace-level policies.

2. Identity isolation: Each tenant ServiceAccount has a dedicated IAM Role via IRSA (eks.amazonaws.com/role-arn annotation). The role's trust policy allows sts:AssumeRoleWithWebIdentity only under a StringEquals condition on the OIDC token's sub claim, which has the form system:serviceaccount:<namespace>:<service-account-name>, so only pods running under that ServiceAccount in the correct namespace can assume the role. This prevents privilege escalation even if a pod manages to create a ServiceAccount in another namespace.

3. Encryption isolation: Each tenant has its own KMS Customer Managed Key (CMK). EBS volumes, Secrets Manager secrets, and S3 data for each tenant are encrypted with the tenant's CMK. The CMK key policy uses kms:ViaService and aws:SourceAccount conditions to ensure only calls originating from the correct services can use the key.

4. Resource isolation: Per-namespace ResourceQuota and LimitRange prevent a tenant from monopolizing CPU or memory on the shared node group. For Gold tenants, taints on dedicated nodes ensure no pod from another tenant is scheduled there, even under scheduling pressure.
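
The identity layer is the easiest of the four to show in code. The sketch below builds the trust policy for one tenant role, pinning the OIDC sub claim to a single namespace and ServiceAccount. The function name and the example values are illustrative; the policy structure follows the standard IRSA pattern.

```python
import json


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Trust policy that lets exactly one ServiceAccount assume the role.

    oidc_provider is the cluster's OIDC issuer without the https:// prefix.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                # StringEquals (not StringLike) so no wildcard can widen the match.
                "StringEquals": {
                    f"{oidc_provider}:sub":
                        f"system:serviceaccount:{namespace}:{service_account}",
                    f"{oidc_provider}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }


if __name__ == "__main__":
    # All values below are placeholders.
    policy = irsa_trust_policy(
        "111122223333",
        "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
        "tenant-acme",
        "tenant-app",
    )
    print(json.dumps(policy, indent=2))
```

Because the namespace is part of the sub claim, a ServiceAccount with the same name created in a different namespace produces a different sub and fails the condition.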

Stateful Multitenant Platform Architecture on EKS

Tenant provisioning and isolation flow: from TenantWorkspace CRD to persistent state, with security and observability layers.

EKS Operator Plane: TenantWorkspace Operator (controller-runtime), External Secrets Operator
AWS Security & Identity: per-tenant IAM Role (IRSA / OIDC), per-tenant KMS CMK, per-tenant Secrets Manager secret
EKS Tenant Namespace (Gold/Silver/Bronze): Namespace with NetworkPolicy and ResourceQuota, stateful app pods (gp3 EBS PVC), Redis StatefulSet (ephemeral-stateful)
AWS Persistent Storage: Aurora PostgreSQL Serverless v2 (per tenant-group), per-namespace EFS access point, S3 bucket with per-tenant SSE-KMS
Observability: OTel Collector (traces + metrics), Datadog SLO dashboards

Kubernetes Operators: The Difference Between Real Automation and Scripts with Kubectl

The decision to invest in a custom operator rather than Helm + kubectl scripts was the one most debated internally. The argument against was legitimate: operators have a high development cost, introduce a critical component into the control plane, and are difficult to debug when the reconciliation loop enters an inconsistent state.

The argument in favor, which prevailed, was based on three properties scripts do not have:

Continuous reconciliation: A script runs once. An operator continuously observes desired state versus actual state. When an engineer accidentally deleted a tenant's NetworkPolicy in production (it happened), the operator recreated it within 30 seconds — before any unauthorized traffic could occur. With scripts, that would have been a security incident.

Full lifecycle management: The operator handles creation, update, and deletion (with finalizers to ensure data is not deleted before backup). A provisioning script rarely includes safe deprovisioning logic — and when it does, it is tested with far less rigor.

Native status and observability: The TenantWorkspace CRD has a .status.conditions field with conditions like StorageProvisioned, NetworkPolicyApplied, SecretsSync. Any tool consuming the Kubernetes API can observe each tenant's state in real time. This was critical for the support team: instead of SSH-ing into nodes, they run kubectl get tenantworkspace -n platform-ops and see every tenant's state in seconds.
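
The first and third properties can be sketched together: one reconciliation pass that repairs drift, plus status conditions derived from the same comparison. This is a Python illustration of the idea rather than controller-runtime code. The object keys and the mapping from objects to condition types are assumptions, and deletion of orphaned objects is left out.

```python
def reconcile(desired: dict, actual: dict) -> tuple:
    """One level-triggered pass: return (converged_state, actions_taken)."""
    converged = dict(actual)
    actions = []
    for key, manifest in desired.items():
        if key not in actual:
            converged[key] = manifest
            actions.append(f"create {key}")
        elif actual[key] != manifest:
            converged[key] = manifest
            actions.append(f"update {key}")
    return converged, actions


# Assumed mapping from owned objects to condition types.
CONDITION_FOR = {
    "NetworkPolicy/default-deny": "NetworkPolicyApplied",
    "PersistentVolumeClaim/data": "StorageProvisioned",
}


def status_conditions(desired: dict, actual: dict) -> list:
    """Report a condition as True only when the object exists and matches."""
    conditions = []
    for key, condition_type in CONDITION_FOR.items():
        in_sync = key in actual and actual[key] == desired.get(key)
        conditions.append({"type": condition_type,
                           "status": "True" if in_sync else "False"})
    return conditions
```

If the NetworkPolicy is deleted by hand, the next pass sees it missing from the actual state, recreates it, and a further pass produces no actions. That idempotence is what a one-shot script lacks.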

The real cost of the operator was approximately 3 development sprints and 1 sprint of integration test hardening. Operational break-even was reached in month four, when the tenant count exceeded 60 and management complexity would have required hiring an additional operations engineer.

Before and After: Operational Metrics from the Migration

Average MTTR for data incidents: 47min → 4min. Operator auto-reconciliation + burn rate alerts eliminated the manual diagnosis phase.

New tenant provisioning time: 4h → 8min. TenantWorkspace CRD + operator replaced 4 hours of manual scripting and approvals.

Infrastructure cost reduction per tenant: 68%. Shared node groups for Silver/Bronze + Aurora Serverless v2 (scales to zero in idle hours) vs. dedicated EC2 24/7.

Stateful Persistence in Kubernetes: What Nobody Tells You About PVCs in Production

PersistentVolumeClaims backed by EBS have a fundamental limitation that frequently surprises migrating teams: an EBS volume can only be mounted on a single node at a time (ReadWriteOnce). This has direct implications for deployment strategies.

With a Deployment (not StatefulSet), a rolling update may attempt to create the new pod on a different node before the old pod releases the volume — resulting in the new pod stuck in ContainerCreating indefinitely, waiting for the volume to detach. In financial environments with short maintenance windows, this can be catastrophic.

The solution we adopted was threefold: (1) use StatefulSet for all EBS-backed PVC workloads, which guarantees the old pod is terminated before the new one is created; (2) configure podManagementPolicy: Parallel only for StatefulSets where startup order does not matter (Redis cache), keeping OrderedReady for those with initialization dependencies; (3) implement PodDisruptionBudget with minAvailable: 1 so that node drains (during EKS upgrades) cannot evict the last available pod of a critical tenant. With a single replica this budget blocks the eviction outright rather than arranging a replacement, so workloads that must keep serving through a drain need at least two replicas.
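
The PodDisruptionBudget in point (3) deserves a quick sanity check. For an integer minAvailable, the number of voluntary disruptions Kubernetes allows is the count of healthy pods minus minAvailable, floored at zero. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions a PodDisruptionBudget permits right now."""
    if healthy_pods < 0 or min_available < 0:
        raise ValueError("counts must be non-negative")
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    for replicas in (1, 2, 3):
        allowed = disruptions_allowed(replicas, min_available=1)
        print(f"replicas={replicas} minAvailable=1 -> evictions allowed: {allowed}")
```

With one replica and minAvailable: 1 the result is 0, so a drain waits instead of evicting the pod. With two replicas, one pod can move at a time.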

For the use case of artifacts shared across multiple pods of the same tenant (asynchronously generated reports, for example), EFS with ReadWriteMany was the right choice — but with an important caveat: EFS has significantly higher write latency than EBS for synchronous operations (typically 1-5ms vs. 0.1-0.5ms for gp3). Any code path that wrote to EFS synchronously was refactored to asynchronous writes via SQS, with EFS as the final destination after processing.
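
The shape of that refactor is a write-behind pattern: the request path only enqueues, and a worker performs the slow filesystem write. The sketch below uses Python's in-process queue.Queue as a stand-in for SQS and a local directory as a stand-in for the EFS mount, so it runs anywhere. The message format and function names are assumptions.

```python
import json
import queue
from pathlib import Path


def enqueue_report(q: queue.Queue, tenant_id: str, name: str, body: str) -> None:
    """Request path: enqueue and return immediately, with no filesystem write."""
    q.put(json.dumps({"tenant_id": tenant_id, "name": name, "body": body}))


def drain_to_shared_fs(q: queue.Queue, mount: Path) -> int:
    """Worker path: persist queued messages under a per-tenant directory.

    Returns the number of files written. tenant_id and name are assumed to be
    validated upstream; a real worker must reject path separators in them.
    """
    written = 0
    while True:
        try:
            message = json.loads(q.get_nowait())
        except queue.Empty:
            return written
        target = mount / message["tenant_id"] / message["name"]
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(message["body"])
        written += 1
```

The caller's latency now depends on the enqueue rather than on the shared filesystem. The trade-off is that the artifact becomes visible only after the worker has processed the message.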

Critical Risks That Almost Cost Us the Migration

1. Cross-tenant state leakage via shared cache: In an early version, Redis was shared across Silver tenants using key prefixes as separators. A prefix bug in a code release exposed session data from one tenant to another for 12 minutes before detection. The fix was moving to Redis StatefulSets per tenant-group (one per individual tenant would have been cost-prohibitive) with ACL authentication per tenant.

2. EBS volume attachment storm during node group upgrades: During a rolling upgrade of a node group with 30 tenants, all EBS volumes attempted to detach and reattach simultaneously. The limit of 28 attachments per r6i.2xlarge instance, which EBS volumes share with network interfaces, was hit, causing scheduling failures. The fix was setting maxUnavailable: 1 on the node group and using PodDisruptionBudgets to enforce a controlled pod migration pace.

3. NetworkPolicy configuration drift: With the reconciliation operator inactive for 40 minutes during a maintenance window, an emergency manual deploy created a pod without the correct tenant labels, which fell outside the NetworkPolicies and had unrestricted access to the cluster's internal network. This reinforced the rule: the operator must never be disabled, even during maintenance. It must be updated with zero downtime via rolling update.

Observability in Multitenant Platforms: Tenant as a First-Class Dimension

The most common mistake I see in multitenant platforms is treating observability as an infrastructure concern — CPU, memory, and latency metrics aggregated at the cluster level. This is useless when a customer calls complaining of slowness: you cannot isolate the problem without tenant dimensions.

The most impactful architectural decision in the observability phase was defining tenant_id as a mandatory dimension on all platform metrics, traces, and logs. This was implemented in three layers:

Application: The OTel SDK was configured with a Resource that includes tenant.id and tenant.tier as attributes. All spans inherit these attributes automatically. On the metrics side, we used exemplars to link high-cardinality metrics to specific traces — essential for investigating P99 for a specific tenant.

Infrastructure: CloudWatch Container Insights was configured with enhanced observability on EKS, collecting per-pod metrics with Kubernetes labels. A CloudWatch Logs metric filter extracts tenant_id from structured logs and creates custom metrics with that dimension.

Per-tenant SLOs: In Datadog, each tenant has an independent SLO monitor. The burn rate alert uses a 1-hour window (fast alert) and 6-hour window (trend alert). When the 1-hour burn rate exceeds 14.4x (consuming 2% of the budget in 1 hour), a PagerDuty alert is created with tenant_id in the title — the on-call engineer immediately knows which tenant is impacted without investigating dashboards.
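
The 14.4x figure follows from simple arithmetic on a 30-day SLO window. A minimal sketch, with function names that are mine rather than Datadog's:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate: float, window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the error budget used if the burn rate holds for window_hours."""
    return rate * window_hours / (slo_window_days * 24)


def error_budget_minutes(slo_target: float, slo_window_days: int = 30) -> float:
    """Total allowed 'bad' minutes in the SLO window."""
    return (1.0 - slo_target) * slo_window_days * 24 * 60


if __name__ == "__main__":
    print(f"budget for 99.9% over 30 days: {error_budget_minutes(0.999):.1f} min")
    print(f"14.4x for 1 hour consumes {budget_consumed(14.4, 1):.1%} of the budget")
    print(f"1.44% errors against a 99.9% SLO is a {burn_rate(0.0144, 0.999):.1f}x burn rate")
```

A 30-day window has 720 hours, so a burn rate of 14.4 sustained for one hour uses 14.4 / 720 = 2% of the budget. For a 99.9% target, that corresponds to an error ratio of about 1.44% during the hour.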

The ingestion cost of high-cardinality metrics was a real concern. The solution was recording latency as histograms and computing percentiles at query time with histogram_quantile in Amazon Managed Service for Prometheus (AMP), rather than exporting per-percentile gauges. This reduced cardinality by 80% with precision sufficient for the SLOs.
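
To show why a bucketed histogram can stand in for precomputed percentiles, here is a simplified re-implementation of the interpolation idea behind histogram_quantile. It is a sketch of the technique for classic cumulative buckets, not the Prometheus source, and the example bucket boundaries are made up.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by upper_bound and
    ending with the +Inf bucket. The first bucket is assumed to start at 0.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # In the +Inf bucket (or an empty one) there is nothing to
            # interpolate, so report the highest known finite bound.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.95, latency_buckets))
```

The result is an estimate whose error is bounded by the width of the bucket it falls in, which is why bucket boundaries should sit close to the SLO threshold (800ms in our case).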

Well-Architected Pillars Assessment

Security

Four-layer isolation (network, identity, encryption, resources) with IRSA + KMS CMK per tenant. Default-deny NetworkPolicy with Calico GlobalNetworkPolicy. CloudTrail audit of all KMS calls by tenant_id.

Reliability

StatefulSets with PodDisruptionBudgets ensure availability during upgrades. Aurora Serverless v2 with read replicas and automatic failover < 30s. Operator with continuous reconciliation prevents configuration drift.

My Curator's Note

Senior Solutions Architect

If I were starting this migration today, the one thing I would do differently is invest in the Kubernetes operator before migrating the first tenant, not after the fifth. The temptation to use Helm charts and bash scripts for the first few tenants is enormous, and the technical debt it creates is disproportionate. The hardest lesson I carry from this journey is that isolation in multitenant platforms is not a feature you add later: it is a property that must be demonstrated before the first production tenant, with namespace penetration tests and NetworkPolicy failure simulations. In financial environments, a cross-tenant data leak is not a severity-2 bug; it is a regulatory event.

Verdict: Stateful Cloud Native Platforms Are Viable in Financial Production — With the Right Conditions

Migrating to a stateful cloud native multitenant platform on EKS is technically feasible and operationally superior to the per-tenant VM model — but only if you accept that the initial investment is larger than it appears. The Kubernetes operator, the four-layer isolation model, and tenant-dimensioned observability are not optional optimizations: they are prerequisites for operating safely at scale. The numbers speak for themselves: 68% cost reduction per tenant, MTTR from 47 minutes to 4 minutes, and provisioning from 4 hours to 8 minutes are real results — but they were preceded by 3 months of platform work that delivered no business features. If your organization does not have tolerance for that upfront investment, the per-tenant VM model, expensive as it is, is safer than a poorly isolated Kubernetes platform. Cloud native is not the right destination for everyone — but for those who do the work properly, the compounding operational advantage over time is hard to argue against.


#eks #stateful #kubernetes #multitenancy #migration #financial-grade #operators #observability


Analyzed source: Stateful cloud native platforms

