Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

AWS & Cloud · Trend Briefing

Oracle HA on AWS: FSx for ONTAP as a Lever for Gradual Modernization

Jun 6, 2026 · 10 min · expert · AI-assisted

The most relevant signal in this cycle for architects still carrying critical Oracle workloads is not to abandon them — it is to orchestrate them with cloud-native primitives that deliver real HA without forcing a rewrite. FSx for NetApp ONTAP changes the storage equation for Oracle on AWS in concrete ways. This briefing analyzes what changes in practice, the real trade-offs, and how to position this layer within a gradual modernization strategy.

There is a category of workload that no modernization roadmap manages to eliminate on schedule: Oracle in production, in a financial environment, with a 99.99% SLA and regulatory audit over every transaction. After 16 years moving between engineering and solution architecture, I have learned that the honest answer to that scenario is not 'migrate to Aurora' — it is 'reduce operational risk today and open the modernization window tomorrow'. The recent AWS Architecture Blog signal on Oracle HA with FSx for NetApp ONTAP is exactly that window: a combination of cloud-native primitives that delivers enterprise-grade HA without requiring you to abandon Oracle Data Guard, ASM, or the license contracts your CFO has already amortized. This briefing breaks down what this architecture actually delivers, where it fails if misconfigured, and how to position it as the first move in an incremental modernization strategy — not as a final destination.

Numbers that define the context

~0.2ms: I/O latency with FSx ONTAP Multi-AZ over NVMe-oF (iSCSI/NFS). Comparable to last-generation on-premises SAN; sufficient for Oracle OLTP with redo log on a dedicated volume.

RPO < 20s: RPO with synchronous SnapMirror between AZs + Oracle Data Guard. Synchronous SnapMirror guarantees zero data loss at the block level; Data Guard adds logical protection against corruption.

60-70%: Storage cost reduction vs. provisioned io2 Block Express for Oracle. FSx ONTAP with thin provisioning and inline deduplication eliminates the typical over-provisioning of reserved IOPS.

The Real Problem: Why Oracle HA on AWS Has Always Been Hard

Oracle in high availability has storage requirements that do not map well to the classic AWS model of ephemeral disks and EBS. ASM (Automatic Storage Management) requires direct access to block devices with low latency and predictable I/O behavior. Data Guard, in Maximum Availability mode, needs the redo to be acknowledged by the standby before the commit returns on the primary, which amplifies any latency variation between zones. The historical result was a choice between three evils: (1) use EBS io2 with over-provisioned IOPS to compensate for variability, inflating costs by 3-4x; (2) accept a degraded RPO by operating Data Guard in Maximum Performance mode; or (3) keep Oracle on-premises and only 'extend' to AWS via Direct Connect, losing elasticity benefits.

FSx for NetApp ONTAP changes this equation because it is a managed service that exposes enterprise SAN semantics — consistent snapshots, block-based replication with SnapMirror, thin provisioning, deduplication, and inline compression — over Multi-AZ AWS infrastructure. The critical difference from EBS is that ONTAP treats the volume as a first-class entity with file-system-level consistency guarantees, not just at the block level. For Oracle, this means an ONTAP snapshot captures a consistent state of the datafile without requiring BEGIN BACKUP mode — reducing the risk window during backups and simplifying the recovery runbook.

Oracle HA Architecture with FSx for ONTAP — Layered Gradual Modernization

Data and replication flow between primary, standby, and incremental modernization layers. SnapMirror replicates blocks between AZs; Data Guard protects against logical corruption; the modernization layer consumes ONTAP snapshots for analytics and eventual migration.

🟦 AZ-1 (Primary): EC2 R7i.8xlarge running the Oracle DB primary · FSx ONTAP data and redo volumes · ASM diskgroup on iSCSI LUN
🟩 AZ-2 (Standby): EC2 R7i.8xlarge running the Oracle DB standby · FSx ONTAP mirror volumes
🔒 Security & Networking: NLB as DB endpoint · KMS CMK for volume encryption · Security Groups for iSCSI port 3260
🚀 Modernization Layer: ONTAP snapshot export to S3 · AWS Glue for ETL and schema evolution · Aurora PostgreSQL as future target

Concrete Configuration: What Really Matters in the Storage Layer

The most common trap I see in Oracle/FSx implementations is treating FSx ONTAP as a generic NAS and mounting everything via NFS v3. For Oracle in financial production, the correct protocol is iSCSI with dedicated LUNs per ASM diskgroup — one for data (DATA), one for redo log (REDO), and one for FRA (Fast Recovery Area). The separation is not cosmetic: it allows distinct QoS policies per ONTAP volume, and the redo log in particular benefits from a volume with tiering-policy none (no movement to cold capacity) and snapshot-policy none (redo snapshots are useless and consume space).
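The per-diskgroup separation can be sketched as a small request builder. This is a minimal sketch, not a definitive implementation: the volume names, sizes, and SVM ID are hypothetical, and the dictionaries are only shaped like the `OntapConfiguration` block of the FSx `CreateVolume` API. Nothing here calls AWS; check the field names against the current API reference before use.

```python
from typing import Any

# Hypothetical sizes in MiB, one ONTAP volume per ASM diskgroup.
# These are placeholders, not sizing recommendations.
DISKGROUPS = {"DATA": 2_097_152, "REDO": 131_072, "FRA": 1_048_576}


def ontap_volume_request(diskgroup: str, size_mib: int, svm_id: str) -> dict[str, Any]:
    """Build keyword arguments shaped like an FSx CreateVolume request."""
    config: dict[str, Any] = {
        "StorageVirtualMachineId": svm_id,
        "SizeInMegabytes": size_mib,
    }
    if diskgroup == "REDO":
        # Redo stays on the SSD tier and is never snapshotted,
        # mirroring tiering-policy none and snapshot-policy none.
        config["TieringPolicy"] = {"Name": "NONE"}
        config["SnapshotPolicy"] = "none"
    return {
        "VolumeType": "ONTAP",
        "Name": f"ora_{diskgroup.lower()}",  # hypothetical naming scheme
        "OntapConfiguration": config,
    }


requests = {
    dg: ontap_volume_request(dg, size, "svm-0123456789abcdef0")  # hypothetical SVM ID
    for dg, size in DISKGROUPS.items()
}
```

Keeping the REDO rules inside the builder means a new environment cannot silently inherit a default snapshot or tiering policy for that volume.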

At the FSx level, the Multi-AZ deployment automatically creates an active/passive pair of file servers with synchronous replication via SnapMirror between AZs. The critical configuration point is the preferred-subnet and standby-subnet — they must be in distinct AZs with symmetric routes via Transit Gateway or VPC peering, never via Internet Gateway. Latency between AZs in the same AWS region is typically 1-2ms; with synchronous SnapMirror, this translates to ~2-4ms of additional overhead on the redo commit — acceptable for OLTP, but needs to be measured with AWR before going to production.
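The overhead figure above is simple round-trip arithmetic. The sketch below assumes the 1-2ms inter-AZ figure is a one-way latency and that a synchronous write costs one extra round trip; both are assumptions, and the function name is illustrative. Measured AWR numbers should replace this estimate before production.

```python
def sync_commit_overhead_ms(one_way_ms: float, round_trips: int = 1) -> float:
    """Extra commit latency when a write must be acknowledged by the other AZ.

    Assumes each synchronous acknowledgement costs one full round trip.
    """
    return 2 * one_way_ms * round_trips


# Range quoted in the text: 1-2 ms between AZs in the same region.
low = sync_commit_overhead_ms(1.0)
high = sync_commit_overhead_ms(2.0)
print(f"estimated extra redo commit latency: {low:.0f}-{high:.0f} ms")
```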

For encryption, the correct path is KMS with a customer-managed CMK, with kms:GenerateDataKeyWithoutPlaintext restricted by aws:SourceVpc condition — preventing the key from being used outside the database VPC. Combine this with aws:RequestedRegion to prevent accidental cross-region exfiltration. FSx ONTAP supports encryption at rest by default, but the CMK needs to be explicitly configured at filesystem creation — it is not retroactive.
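As an illustration of the two conditions combined, here is a minimal sketch that builds a key policy statement as plain JSON. The principal ARN, VPC ID, and region are hypothetical. One caveat to validate in a non-production account: `aws:SourceVpc` is only present on requests that arrive through a VPC endpoint, so calls made on your behalf from outside that path would not match this statement.

```python
import json


def data_key_statement(principal_arn: str, vpc_id: str, region: str) -> dict:
    """Allow data-key generation only from one VPC and one region.

    Keys listed under a single condition operator are ANDed,
    so both conditions must hold for the Allow to apply.
    """
    return {
        "Sid": "AllowDataKeyFromDbVpcOnly",
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": "kms:GenerateDataKeyWithoutPlaintext",
        "Resource": "*",
        "Condition": {
            "StringEquals": {
                "aws:SourceVpc": vpc_id,
                "aws:RequestedRegion": region,
            }
        },
    }


statement = data_key_statement(
    "arn:aws:iam::111122223333:role/oracle-db-host",  # hypothetical role
    "vpc-0abc1234",                                   # hypothetical VPC
    "sa-east-1",                                      # hypothetical region
)
policy = {"Version": "2012-10-17", "Statement": [statement]}
print(json.dumps(policy, indent=2))
```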

What Changes for Architects with This Signal

Oracle HA is no longer a provisioned IOPS problem: FSx ONTAP with adaptive QoS eliminates defensive io2 over-provisioning — storage cost drops 60-70% without degrading SLA.

ONTAP snapshots as a modernization primitive: An ONTAP snapshot clone can be mounted read-only by a Glue/EMR instance for incremental ETL without impacting the primary — this is the foundation of a zero-downtime migration strategy.

Data Guard + SnapMirror are not redundant: SnapMirror protects against infrastructure failure (hardware, AZ); Data Guard protects against logical corruption (application bug, accidental truncation). The two layers are complementary, not substitutable.

The failover runbook changes: With FSx Multi-AZ, storage failover is automatic and transparent to Oracle (the iSCSI endpoint does not change). Data Guard failover is still manual or requires Oracle Observer — combine both for RTO < 60s.

Oracle licensing on AWS has nuances: EC2 with BYOL Oracle SE2/EE counts vCPUs, not sockets. R7i.8xlarge has 32 vCPUs = 16 Oracle cores. Size to fit the licensing model before choosing the instance type — the mistake here costs more than all the infrastructure.

Observability needs to cover three layers: CloudWatch for FSx metrics (VolumeReadOps, StorageCapacityUtilization), AWR/ASH for internal Oracle, and OpenTelemetry to correlate application latency with storage events — without all three, incident diagnosis takes hours.

Strategic Positioning: FSx ONTAP as First Move, Not as Destination

The most dangerous framing mistake I hear in architecture committees is treating the Oracle/FSx migration as an isolated infrastructure project. It is not. It is the first move in a three-act sequence that, if well orchestrated, leads to a final state where Oracle is optional — not mandatory.

Act 1 (Stabilize): Migrate Oracle to EC2 + FSx ONTAP Multi-AZ with Data Guard. The goal is not to modernize; it is to eliminate the operational risk of on-premises hardware and gain ONTAP snapshots as a backup/clone primitive. In this act, you do not touch the schema, do not change application drivers, do not question the data model. KPIs: RTO < 60s, RPO < 20s, storage cost -60%.

Act 2 (Instrument): With Oracle stable on AWS, you have access to ONTAP snapshots that can be cloned in seconds and mounted on analysis instances. For the analytical feed, use a Glue pipeline that reads Oracle tables via JDBC rather than from a block snapshot, because a JDBC read goes through the database and sees transactionally consistent data, and write the result to S3 in Parquet format. This pipeline is the embryo of your data mesh: it separates operational data from analytical data without touching the primary.

Act 3 (Migrate Incrementally): With the Glue pipeline running, you have real visibility into the Oracle schema: dependencies, data types, procedures that need to be rewritten. Now the conversation with the development team about migrating to Aurora PostgreSQL or RDS Oracle has concrete data: which tables have high volume, which procedures are critical, what is the cost of rewriting each module. The decision to migrate stops being political and becomes engineering.

This sequencing matters because each act delivers independent value. If Act 3 never happens due to regulatory or political constraints, you still left on-premises, reduced cost, and gained real HA. The risk of a big-bang migration — which is the alternative — is that if it fails at Act 3, you have lost everything.

Licensing Trap: The Invisible Cost That Sinks the Business Case

Oracle licenses BYOL deployments on AWS by vCPU rather than by physical core: under Oracle's licensing policy for authorized cloud environments, 1 vCPU counts as 0.5 Oracle core on instances with Hyper-Threading enabled. An R7i.16xlarge has 64 vCPUs = 32 Oracle cores. If you migrated from an on-premises server with 2 sockets × 8 cores = 16 Oracle cores, you just doubled your license exposure without noticing. Always work out the Oracle core count before choosing the instance type. For financial environments with license audits, consider Dedicated Hosts for explicit control over the number of physical sockets; the cost of the dedicated host may be less than the audit penalty.
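The conversion can be checked with a few lines before any instance type is chosen. This is a sketch under stated assumptions: it applies only the 1 vCPU = 0.5 Oracle core rule quoted above for Hyper-Threading instances, takes the vCPU counts from this article, and uses a hypothetical function name. It is not a substitute for reading your actual license contract.

```python
import math

# vCPU counts as quoted in this article.
INSTANCE_VCPUS = {"r7i.8xlarge": 32, "r7i.16xlarge": 64}


def oracle_cores(vcpus: int, hyperthreading: bool = True) -> int:
    """Oracle cores to license for a given vCPU count.

    With Hyper-Threading, two vCPUs count as one core; without it,
    each vCPU counts as one core.
    """
    return math.ceil(vcpus / 2) if hyperthreading else vcpus


on_prem_cores = 2 * 8  # 2 sockets x 8 cores, the example in the text

for name, vcpus in INSTANCE_VCPUS.items():
    cores = oracle_cores(vcpus)
    print(f"{name}: {cores} Oracle cores ({cores / on_prem_cores:.1f}x the on-premises count)")
```

Running the same function across every candidate instance type makes the license delta visible in the business case instead of in the audit.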

Observability and Operations: What CloudWatch Does Not Tell You About Oracle

The most critical gap in Oracle/FSx operations that I find in Well-Architected reviews is the absence of correlation between storage metrics and internal Oracle events. CloudWatch delivers excellent FSx metrics — VolumeReadLatency, VolumeWriteLatency, StorageCapacityUtilization, NetworkThroughput — but they do not talk to Oracle's AWR (Automatic Workload Repository). When a DBA reports 'the database is slow', you need to know whether the bottleneck is storage I/O (FSx), internal latch contention (Oracle), or network latency between application and database.

The solution I implement is a three-layer pipeline: (1) CloudWatch with alarms on VolumeReadLatency > 1ms for 5 consecutive minutes — this triggers an SNS that notifies the operations team before the user notices; (2) a Python Lambda script that queries V$SYSMETRIC and V$ACTIVE_SESSION_HISTORY via JDBC every 60 seconds and publishes custom metrics to CloudWatch with WaitClass and Event dimensions — this maps Oracle wait events to CloudWatch metrics; (3) OpenTelemetry in the application middleware that propagates trace_id down to the JDBC driver, allowing correlation of a slow HTTP request with the specific Oracle wait event that caused it.
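The second layer, mapping wait events to custom metrics, can be sketched as a pure transformation. The input rows are assumed to come from a query that aggregates V$ACTIVE_SESSION_HISTORY by wait class and event; that query, the metric name, and the namespace are hypothetical. The output follows the `MetricData` shape accepted by CloudWatch `PutMetricData`, but the sketch deliberately stops short of calling AWS.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional

NAMESPACE = "Custom/Oracle"  # hypothetical namespace


def to_metric_data(
    rows: Iterable[tuple[str, str, float]],
    timestamp: Optional[datetime] = None,
) -> list[dict]:
    """Convert (wait_class, event, waited_ms) rows into MetricData entries."""
    ts = timestamp or datetime.now(timezone.utc)
    return [
        {
            "MetricName": "OracleWaitTime",  # hypothetical metric name
            "Dimensions": [
                {"Name": "WaitClass", "Value": wait_class},
                {"Name": "Event", "Value": event},
            ],
            "Timestamp": ts,
            "Value": float(waited_ms),
            "Unit": "Milliseconds",
        }
        for wait_class, event, waited_ms in rows
    ]


sample_rows = [
    ("User I/O", "db file sequential read", 12.5),
    ("Commit", "log file sync", 3),
]
metric_data = to_metric_data(sample_rows)
```

In a Lambda handler, `metric_data` would be passed to the CloudWatch client together with `NAMESPACE`; keeping the mapping separate from the client call makes it testable without a database or an AWS account.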

For FSx specifically, StorageCapacityUtilization above 80% is a critical signal: ONTAP starts degrading thin provisioning performance above that threshold. Configure a CloudWatch alarm that triggers automatic volume expansion via the AWS CLI (aws fsx update-volume --volume-id ... --ontap-configuration SizeInMegabytes=...). FSx ONTAP supports online expansion without downtime; use this as part of the capacity management runbook.
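The sizing step of that runbook is plain arithmetic. The sketch below keeps the 80% threshold from the text and assumes a post-expansion target of 70% utilization; the target and the function name are assumptions for illustration. The result is the kind of value an expansion call would take as the new size in MiB.

```python
import math


def target_size_mib(
    used_mib: int,
    current_mib: int,
    threshold: float = 0.80,
    target: float = 0.70,
) -> int:
    """Return the size that brings utilization back down to `target`.

    Leaves the size unchanged while utilization is at or below `threshold`.
    """
    if used_mib / current_mib <= threshold:
        return current_mib
    return max(current_mib, math.ceil(used_mib / target))


print(target_size_mib(used_mib=850, current_mib=1000))  # above threshold, grows
print(target_size_mib(used_mib=700, current_mib=1000))  # below threshold, unchanged
```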

Well-Architected Lenses for Oracle HA with FSx ONTAP

Security

KMS CMK with aws:SourceVpc + aws:RequestedRegion condition. Security Groups restricting iSCSI port 3260 only to Oracle instance ENIs. VPC Endpoints for FSx eliminating Internet traffic. CloudTrail with data events for the FSx filesystem.

Reliability

FSx Multi-AZ with synchronous SnapMirror + Data Guard Maximum Availability delivers RTO < 60s and RPO < 20s. Test failover monthly with AWS Fault Injection Simulator (FIS) simulating AZ failure — do not rely solely on the theoretical runbook.

Anti-Patterns That Destroy the Business Case

Mounting everything via NFS v3: NFS v3 does not support adequate locking for Oracle datafiles. Use iSCSI for DATA and REDO, NFS v4.1 only for backups and exports.
Using ONTAP snapshot as a Data Guard substitute: Snapshots protect against infrastructure failure, not logical corruption. An unqualified DELETE FROM orders (or one with WHERE 1=1) is replicated by SnapMirror before you notice.
Not calculating Oracle Core Factor before choosing instance type: Accidentally doubling Oracle core count in a migration can cost more in licensing than all AWS infrastructure per year.
Enabling automatic tiering on the redo log volume: The redo log needs sub-millisecond latency. Tiering to S3 introduces object retrieval latency that can cause commit failure. tiering-policy none is mandatory for REDO.
Treating Oracle/FSx migration as an isolated infrastructure project: Without the Glue/S3 pipeline from Act 2, you merely replicated the on-premises problem to the cloud with different cost. The real value lies in the incremental modernization that ONTAP snapshots enable.
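Several of these anti-patterns are mechanical enough to check before deployment. The following is a minimal sketch of such a check; the `VolumeSpec` type and its field values are hypothetical and would be populated from whatever inventory or IaC output your team already has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeSpec:
    diskgroup: str       # "DATA", "REDO", "FRA", "BACKUP", ...
    protocol: str        # "iscsi", "nfs3", or "nfs4.1"
    tiering_policy: str  # "NONE", "AUTO", "SNAPSHOT_ONLY", or "ALL"


def lint(specs: list[VolumeSpec]) -> list[str]:
    """Return one finding per storage anti-pattern from the list above."""
    findings = []
    for spec in specs:
        if spec.diskgroup in ("DATA", "REDO") and spec.protocol != "iscsi":
            findings.append(f"{spec.diskgroup}: use iSCSI, not {spec.protocol}")
        elif spec.protocol == "nfs3":
            findings.append(f"{spec.diskgroup}: use NFS v4.1 instead of NFS v3")
        if spec.diskgroup == "REDO" and spec.tiering_policy != "NONE":
            findings.append("REDO: tiering policy must be NONE")
    return findings


layout = [
    VolumeSpec("DATA", "nfs3", "AUTO"),
    VolumeSpec("REDO", "iscsi", "AUTO"),
    VolumeSpec("BACKUP", "nfs4.1", "AUTO"),
]
for finding in lint(layout):
    print(finding)
```

The licensing and project-framing anti-patterns are not covered here because they are decisions rather than configuration values.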

The Counter-Intuitive Insight: Two Replication Layers Are Cheaper Than One

The instinctive reaction to seeing SnapMirror + Data Guard together is 'we are paying twice for the same thing'. In practice, it is the opposite. Synchronous SnapMirror eliminates the need to provision extra IOPS on the standby to keep redo apply current, because the storage is already replicated. Data Guard in Maximum Availability with redo log on FSx ONTAP has lower network overhead than on EBS because ONTAP absorbs latency variability locally. The total cost of both layers is lower than the cost of a single over-provisioned EBS io2 volume trying to do the work of both.

The Horizon: When Oracle Becomes Optional

The question every financial CTO asks me after seeing this architecture is: 'when can we leave Oracle for good?' The honest answer is that it depends on three variables that are not technical: (1) the volume of PL/SQL stored procedures with embedded business logic — which need to be rewritten, not migrated; (2) the Oracle license contracts in force — which may have early exit penalty clauses; and (3) the regulatory tolerance for database changes in systems of record.

What the FSx ONTAP + Glue + S3 architecture does is transform these three variables from absolute blockers into manageable variables. The Glue pipeline from Act 2 delivers a live inventory of the Oracle schema — you know exactly how many procedures exist, which are called frequently, which can be replaced by application logic. This transforms the conversation from 'migrate the database' to 'migrate module by module', with each module having an independent business case.

In financial environments I have followed, the most common path is to keep Oracle for the transactional core (ledger, positions, settlement) and migrate to Aurora PostgreSQL the peripheral modules (regulatory reporting, onboarding, KYC). This reduces Oracle license exposure by 40-60% without touching the highest-risk systems. FSx ONTAP remains as the storage layer for residual Oracle, and the Glue pipeline feeds the data mesh with data from both databases — creating a unified analytical layer that does not depend on which relational database is underneath.

This is the realistic end state: not 'zero Oracle', but 'Oracle as a choice, not a prison'. And FSx ONTAP is the infrastructure that makes that horizon reachable without a big-bang that no financial regulator would approve.

Curator's Note

Senior Solutions Architect

In practice, what I would do first is implement the three-layer observability pipeline before any migration — CloudWatch, V$SYSMETRIC via Lambda, and OpenTelemetry in the middleware. Without this baseline, you have no way to prove that the migration improved (or did not degrade) performance, and in a regulated financial environment, that proof is as important as the SLA. The most expensive lesson I have learned in Oracle/cloud projects is that the problem is rarely the database — it is the absence of data to diagnose the database. FSx ONTAP elegantly solves the storage problem, but observability is what saves you at 3am during a production incident.

Verdict: Adopt as First Move, Not as Final Solution

FSx for NetApp ONTAP with Oracle Data Guard in Multi-AZ is the most mature Oracle HA architecture available on AWS today — and the AWS Architecture Blog signal confirms that AWS is investing in this direction. For architects in financial environments, the recommendation is clear: use this combination as Act 1 of a three-act modernization strategy. It delivers RTO < 60s, RPO < 20s, a 60-70% reduction in storage cost, and — most importantly — opens the window for incremental modernization via ONTAP snapshots and Glue pipelines without requiring a big-bang. The risk of not acting is continuing to pay the operational and licensing cost of an on-premises Oracle that increasingly hinders the adoption of modern architectures. The risk of acting is manageable if you correctly sequence the three acts and instrument observability before migrating. Start Act 1 this quarter.

References

AWS Architecture Blog: Oracle HA with FSx for NetApp ONTAP
FSx for NetApp ONTAP User Guide: iSCSI Configuration
Oracle Data Guard Concepts and Administration
AWS Well-Architected Framework: Reliability Pillar
Oracle Licensing on AWS: Processor Core Factor Table
FSx ONTAP: Monitoring with CloudWatch
AWS Fault Injection Simulator: Chaos Engineering for RDS/EC2
Data Mesh: Delivering Data-Driven Value at Scale, Zhamak Dehghani


Analyzed source: Oracle HA with FSx for NetApp ONTAP
