Oracle HA on AWS: FSx for ONTAP as a Lever for Gradual Modernization
The most relevant signal in this cycle for architects still carrying critical Oracle workloads is not to abandon them — it is to orchestrate them with cloud-native primitives that deliver real HA without forcing a rewrite. FSx for NetApp ONTAP changes the storage equation for Oracle on AWS in concrete ways. This briefing analyzes what changes in practice, the real trade-offs, and how to position this layer within a gradual modernization strategy.
There is a category of workload that no modernization roadmap manages to eliminate on schedule: Oracle in production, in a financial environment, with a 99.99% SLA and regulatory audit over every transaction. After 16 years moving between engineering and solution architecture, I have learned that the honest answer to that scenario is not 'migrate to Aurora' — it is 'reduce operational risk today and open the modernization window tomorrow'. The recent AWS Architecture Blog signal on Oracle HA with FSx for NetApp ONTAP is exactly that window: a combination of cloud-native primitives that delivers enterprise-grade HA without requiring you to abandon Oracle Data Guard, ASM, or the license contracts your CFO has already amortized. This briefing breaks down what this architecture actually delivers, where it fails if misconfigured, and how to position it as the first move in an incremental modernization strategy — not as a final destination.
The Real Problem: Why Oracle HA on AWS Has Always Been Hard
Oracle in high availability has storage requirements that do not map well to the classic AWS model of ephemeral instance storage and EBS volumes. ASM (Automatic Storage Management) requires direct access to block devices with low latency and predictable I/O behavior. Data Guard, in Maximum Availability mode, needs the redo to be acknowledged by the standby before the commit returns on the primary, which amplifies any latency variation between zones. The historical result was a choice between three evils: (1) use EBS io2 with over-provisioned IOPS to compensate for variability, inflating costs by 3-4x; (2) accept a degraded RPO by operating Data Guard in Maximum Performance mode; or (3) keep Oracle on-premises and only 'extend' to AWS via Direct Connect, losing elasticity benefits.
FSx for NetApp ONTAP changes this equation because it is a managed service that exposes enterprise SAN semantics — consistent snapshots, block-based replication with SnapMirror, thin provisioning, deduplication, and inline compression — over Multi-AZ AWS infrastructure. The critical difference from EBS is that ONTAP treats the volume as a first-class entity with file-system-level consistency guarantees, not just at the block level. For Oracle, this means an ONTAP snapshot captures a consistent state of the datafile without requiring BEGIN BACKUP mode — reducing the risk window during backups and simplifying the recovery runbook.
Oracle HA Architecture with FSx for ONTAP — Layered Gradual Modernization
Data and replication flow between primary, standby, and incremental modernization layers. SnapMirror replicates blocks between AZs; Data Guard protects against logical corruption; the modernization layer consumes ONTAP snapshots for analytics and eventual migration.
- EC2 R7i.8xlarge · Oracle DB Primary
- FSx ONTAP · Data + Redo Volumes
- ASM Diskgroup · iSCSI LUN
- EC2 R7i.8xlarge · Oracle DB Standby
- FSx ONTAP · Mirror Volumes
- NLB · DB Endpoint
- KMS CMK · Volume Encryption
- Security Groups · iSCSI Port 3260
- ONTAP Snapshot · Export to S3
- AWS Glue · ETL / Schema Evolve
- Aurora PostgreSQL · Target (future)
Concrete Configuration: What Really Matters in the Storage Layer
The most common trap I see in Oracle/FSx implementations is treating FSx ONTAP as a generic NAS and mounting everything via NFS v3. For Oracle in financial production, the correct protocol is iSCSI with dedicated LUNs per ASM diskgroup — one for data (DATA), one for redo log (REDO), and one for FRA (Fast Recovery Area). The separation is not cosmetic: it allows distinct QoS policies per ONTAP volume, and the redo log in particular benefits from a volume with tiering-policy none (no movement to cold capacity) and snapshot-policy none (redo snapshots are useless and consume space).
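The layout above can be written down as data before anything is provisioned. The following is a minimal sketch, assuming the request shape of the boto3 `fsx.create_volume` call; the sizes, the volume names, the SVM id, and the tiering choices for DATA and FRA are hypothetical values for illustration, and only the REDO settings (tiering none, snapshots none) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskgroupVolume:
    diskgroup: str        # ASM diskgroup backed by the LUN in this volume
    size_gib: int         # hypothetical size, for illustration only
    tiering_policy: str   # FSx tiering policy name: NONE, AUTO, SNAPSHOT_ONLY, ALL
    snapshot_policy: str  # ONTAP snapshot policy name, e.g. "default" or "none"


# One ONTAP volume (holding one iSCSI LUN) per ASM diskgroup.
# DATA and FRA policies are assumptions; REDO follows the text above.
LAYOUT = [
    DiskgroupVolume("DATA", 2048, "SNAPSHOT_ONLY", "default"),
    DiskgroupVolume("REDO", 128, "NONE", "none"),
    DiskgroupVolume("FRA", 1024, "AUTO", "default"),
]


def to_create_volume_request(vol: DiskgroupVolume, svm_id: str) -> dict:
    """Build keyword arguments for boto3's fsx.create_volume (no AWS call is made here)."""
    return {
        "VolumeType": "ONTAP",
        "Name": f"ora_{vol.diskgroup.lower()}",  # hypothetical naming scheme
        "OntapConfiguration": {
            "StorageVirtualMachineId": svm_id,
            "SizeInMegabytes": vol.size_gib * 1024,
            "TieringPolicy": {"Name": vol.tiering_policy},
            "SnapshotPolicy": vol.snapshot_policy,
        },
    }


if __name__ == "__main__":
    for volume in LAYOUT:
        print(to_create_volume_request(volume, "svm-0123456789abcdef0"))
```

Keeping the layout as data makes the per-diskgroup differences reviewable in a pull request. The LUNs themselves and the igroup mappings are created inside ONTAP after the volumes exist, which this sketch does not cover.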
At the FSx level, the Multi-AZ deployment automatically creates an active/passive pair of file servers with synchronous replication via SnapMirror between AZs. The critical configuration point is the pair of preferred-subnet and standby-subnet settings: they must be in distinct AZs with symmetric routes via Transit Gateway or VPC peering, never via Internet Gateway. Latency between AZs in the same AWS region is typically 1-2ms; with synchronous SnapMirror, this translates to ~2-4ms of additional overhead on the redo commit. That is acceptable for OLTP, but it needs to be measured with AWR before going to production.
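To make the latency arithmetic concrete, here is a rough sketch of the commit budget. It assumes the overhead is twice the inter-AZ latency (the ratio implied by the 1-2ms and 2-4ms figures above) and uses a hypothetical 1ms local commit time; neither replaces a real AWR measurement.

```python
def commit_latency_budget(inter_az_ms: float, local_commit_ms: float, factor: float = 2.0):
    """Estimate synchronous commit cost.

    factor=2.0 is an assumption derived from the article's figures
    (1-2 ms between AZs giving roughly 2-4 ms of overhead).
    Returns (overhead_ms, total_ms, max_serial_commits_per_second),
    where the last value is the ceiling for a single session that
    commits one transaction at a time.
    """
    overhead_ms = inter_az_ms * factor
    total_ms = local_commit_ms + overhead_ms
    return overhead_ms, total_ms, 1000.0 / total_ms


if __name__ == "__main__":
    for inter_az in (1.0, 2.0):
        overhead, total, rate = commit_latency_budget(inter_az, local_commit_ms=1.0)
        print(f"inter-AZ {inter_az} ms: overhead {overhead} ms, "
              f"commit {total} ms, {rate:.0f} serial commits/s per session")
```

The per-session ceiling is the number worth comparing against the application's commit pattern: batch jobs that commit row by row feel this overhead far more than concurrent OLTP sessions do.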
For encryption, the correct path is KMS with a customer-managed CMK, with kms:GenerateDataKeyWithoutPlaintext restricted by aws:SourceVpc condition — preventing the key from being used outside the database VPC. Combine this with aws:RequestedRegion to prevent accidental cross-region exfiltration. FSx ONTAP supports encryption at rest by default, but the CMK needs to be explicitly configured at filesystem creation — it is not retroactive.
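A key policy statement with the two conditions could look like the sketch below. The role ARN, VPC id, and region are placeholders, and the statement is only one fragment of a full key policy. Note that aws:SourceVpc is only present on requests that arrive through a VPC endpoint, so whether every caller in your setup carries that context is an assumption to verify before enforcing it.

```python
import json


def fsx_key_policy_statement(principal_arn: str, vpc_id: str, region: str) -> dict:
    """Build one KMS key policy statement restricting data key generation.

    All argument values are supplied by the caller; nothing here is
    specific to a real account.
    """
    return {
        "Sid": "AllowDataKeyOnlyFromDatabaseVpc",  # hypothetical statement id
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": "kms:GenerateDataKeyWithoutPlaintext",
        "Resource": "*",
        "Condition": {
            "StringEquals": {
                "aws:SourceVpc": vpc_id,
                "aws:RequestedRegion": region,
            }
        },
    }


if __name__ == "__main__":
    statement = fsx_key_policy_statement(
        "arn:aws:iam::111122223333:role/example-db-role",  # placeholder ARN
        "vpc-0abc1234",                                     # placeholder VPC id
        "sa-east-1",                                        # placeholder region
    )
    print(json.dumps(statement, indent=2))
```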
Strategic Positioning: FSx ONTAP as First Move, Not as Destination
The most dangerous framing mistake I hear in architecture committees is treating the Oracle/FSx migration as an isolated infrastructure project. It is not. It is the first move in a three-act sequence that, if well orchestrated, leads to a final state where Oracle is optional — not mandatory.
Act 1 — Stabilize: Migrate Oracle to EC2 + FSx ONTAP Multi-AZ with Data Guard. The goal is not to modernize; it is to eliminate the operational risk of on-premises hardware and gain ONTAP snapshots as a backup/clone primitive. In this act, you do not touch the schema, do not change application drivers, do not question the data model. KPIs: RTO < 60s, RPO < 20s, storage cost -60%.
Act 2 — Instrument: With Oracle stable on AWS, you have access to ONTAP snapshots that can be cloned in seconds and mounted on analysis instances. Use this to feed a Glue pipeline that reads Oracle tables via JDBC (not via block snapshot, because JDBC guarantees transactional consistency) and writes to S3 in Parquet format. This pipeline is the embryo of your data mesh: it separates operational data from analytical data without touching the primary.
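The JDBC read in Act 2 can be sketched as a set of Spark JDBC options, which is what a Glue Spark job would ultimately pass to the reader. The host, service name, table, credentials, and bucket below are hypothetical; the option keys are the standard Spark JDBC data source options, and port 1521 is simply the default Oracle listener port.

```python
def oracle_jdbc_read_options(host, service, table, user, password,
                             partition_column, lower, upper, partitions=8):
    """Build Spark JDBC options for a partitioned, read-only table scan."""
    return {
        "url": f"jdbc:oracle:thin:@//{host}:1521/{service}",
        "driver": "oracle.jdbc.OracleDriver",
        "dbtable": table,
        "user": user,
        "password": password,
        # Split the scan into parallel range queries on a numeric column.
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(partitions),
        "fetchsize": "10000",
    }


if __name__ == "__main__":
    opts = oracle_jdbc_read_options(
        "db.example.internal", "ORCLPDB", "APP.ORDERS",
        "glue_ro", "change-me", "ORDER_ID", 1, 1_000_000,
    )
    print(opts["url"])
    # Inside a Glue Spark job the options would be used roughly like this:
    #   df = spark.read.format("jdbc").options(**opts).load()
    #   df.write.mode("append").parquet("s3://example-bucket/orders/")
```

Partitioning the read keeps each query small and bounded, which matters when the source is a production database or a clone with limited I/O headroom.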
Act 3 — Migrate Incrementally: With the Glue pipeline running, you have real visibility into the Oracle schema — dependencies, data types, procedures that need to be rewritten. Now the conversation with the development team about migrating to Aurora PostgreSQL or RDS Oracle has concrete data: which tables have high volume, which procedures are critical, what is the cost of rewriting each module. The decision to migrate stops being political and becomes engineering.
This sequencing matters because each act delivers independent value. If Act 3 never happens due to regulatory or political constraints, you have still left on-premises, reduced cost, and gained real HA. The risk of a big-bang migration, which is the alternative, is that if it fails, you have lost everything.
Licensing Trap: The Invisible Cost That Sinks the Business Case
Oracle's cloud licensing policy for BYOL on AWS counts vCPUs rather than physical cores: on instances with Hyper-Threading enabled, 2 vCPUs count as 1 Oracle processor. An R7i.16xlarge has 64 vCPUs = 32 Oracle processors. If you migrated from an on-premises server with 2 sockets × 8 cores = 16 cores, you have at least doubled your license exposure without noticing, and the on-premises Processor Core Factor discount does not carry over to the cloud count. Always run this calculation before choosing the instance type. For financial environments with license audits, consider Dedicated Hosts for explicit control over the number of physical sockets; the cost of the dedicated host may be less than the audit penalty.
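The calculation is small enough to put in a pre-flight check. This sketch applies the 2 vCPUs to 1 processor conversion described above for Hyper-Threading instances and assumes 1 vCPU to 1 processor when Hyper-Threading is disabled; confirm both against your own Oracle contract before relying on the output.

```python
import math


def oracle_processor_count(vcpus: int, hyperthreading: bool = True) -> int:
    """Oracle processors to license for an EC2 instance under BYOL."""
    return math.ceil(vcpus / 2) if hyperthreading else vcpus


def exposure_ratio(vcpus: int, on_prem_cores: int, hyperthreading: bool = True) -> float:
    """How many times larger the cloud count is than the on-premises core count."""
    return oracle_processor_count(vcpus, hyperthreading) / on_prem_cores


if __name__ == "__main__":
    # The example from the text: R7i.16xlarge (64 vCPUs) replacing 2 sockets x 8 cores.
    print(oracle_processor_count(64))   # 32
    print(exposure_ratio(64, 16))       # 2.0
    # The R7i.8xlarge from the architecture diagram has 32 vCPUs.
    print(oracle_processor_count(32))   # 16
```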
Observability and Operations: What CloudWatch Does Not Tell You About Oracle
The most critical gap in Oracle/FSx operations that I find in Well-Architected reviews is the absence of correlation between storage metrics and internal Oracle events. CloudWatch delivers excellent FSx metrics — VolumeReadLatency, VolumeWriteLatency, StorageCapacityUtilization, NetworkThroughput — but they do not talk to Oracle's AWR (Automatic Workload Repository). When a DBA reports 'the database is slow', you need to know whether the bottleneck is storage I/O (FSx), internal latch contention (Oracle), or network latency between application and database.
The solution I implement is a three-layer pipeline: (1) CloudWatch with alarms on VolumeReadLatency > 1ms for 5 consecutive minutes, which publish to an SNS topic that notifies the operations team before the user notices; (2) a Python Lambda function that queries V$SYSMETRIC and V$ACTIVE_SESSION_HISTORY through an Oracle database driver every 60 seconds and publishes custom metrics to CloudWatch with WaitClass and Event dimensions, mapping Oracle wait events to CloudWatch metrics; (3) OpenTelemetry in the application middleware that propagates trace_id down to the JDBC driver, allowing correlation of a slow HTTP request with the specific Oracle wait event that caused it.
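Layer (2) is mostly a reshaping step: rows from the wait-event query become CloudWatch metric data with WaitClass and Event dimensions. The sketch below shows only that step, with no database or AWS call. The metric name, the namespace in the comment, the query text, and the batch size of 20 are assumptions for illustration; check the current PutMetricData limits before picking a batch size.

```python
# Hypothetical query: counts ASH samples per wait event over the last 60 seconds.
ASH_QUERY = """
SELECT wait_class, event, COUNT(*) AS samples
FROM v$active_session_history
WHERE sample_time > SYSTIMESTAMP - INTERVAL '60' SECOND
  AND wait_class IS NOT NULL
GROUP BY wait_class, event
"""


def to_metric_data(rows, metric_name="AshSamples"):
    """Convert (wait_class, event, samples) rows into PutMetricData entries."""
    return [
        {
            "MetricName": metric_name,
            "Dimensions": [
                {"Name": "WaitClass", "Value": wait_class},
                {"Name": "Event", "Value": event},
            ],
            "Value": float(samples),
            "Unit": "Count",
        }
        for wait_class, event, samples in rows
    ]


def batches(items, size=20):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    sample_rows = [
        ("User I/O", "db file sequential read", 12),
        ("Commit", "log file sync", 3),
    ]
    for batch in batches(to_metric_data(sample_rows)):
        # In the Lambda this would be:
        #   boto3.client("cloudwatch").put_metric_data(
        #       Namespace="Custom/Oracle", MetricData=batch)
        print(batch)
```

Keeping the transformation free of I/O makes it easy to unit test separately from the database connection and the CloudWatch client.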
For FSx specifically, StorageCapacityUtilization above 80% is a critical signal: ONTAP starts degrading thin provisioning performance above that threshold. Configure a CloudWatch alarm that triggers an automatic volume expansion, for example through the AWS CLI (aws fsx update-volume --volume-id ... --ontap-configuration SizeInMegabytes=...). FSx ONTAP supports online expansion without downtime; use this as part of the capacity management runbook.
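The expansion step needs a target size, not just a trigger. A minimal sketch of that calculation follows, assuming the goal is to bring utilization back to 70% and to grow in 1 GiB steps; both numbers are illustrative choices, and only the 80% alarm threshold comes from the text.

```python
import math


def target_size_mb(current_size_mb: int, used_mb: int,
                   target_utilization: float = 0.70, step_mb: int = 1024) -> int:
    """Smallest size, rounded up to step_mb, that puts utilization at or below target.

    Never returns less than the current size, so the result is safe to
    pass to an expand-only operation.
    """
    needed_mb = used_mb / target_utilization
    rounded_mb = math.ceil(needed_mb / step_mb) * step_mb
    return max(current_size_mb, rounded_mb)


if __name__ == "__main__":
    # A 100 GiB volume at 85% utilization.
    new_size = target_size_mb(current_size_mb=102400, used_mb=87040)
    print(new_size)
    # The value would then be applied with something like:
    #   boto3.client("fsx").update_volume(
    #       VolumeId="...", OntapConfiguration={"SizeInMegabytes": new_size})
```

Growing to a target utilization rather than by a fixed increment avoids a sequence of small expansions when a volume fills quickly.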
Well-Architected Lenses for Oracle HA with FSx ONTAP
Security
KMS CMK with aws:SourceVpc + aws:RequestedRegion condition. Security Groups restricting iSCSI port 3260 only to Oracle instance ENIs. VPC Endpoints for FSx eliminating Internet traffic. CloudTrail with data events for the FSx filesystem.
Reliability
FSx Multi-AZ with synchronous SnapMirror + Data Guard Maximum Availability delivers RTO < 60s and RPO < 20s. Test failover monthly with AWS Fault Injection Simulator (FIS) simulating AZ failure — do not rely solely on the theoretical runbook.
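A monthly drill only proves something if its result is compared against the targets. The sketch below evaluates one drill from three timestamps; the field names and how they are collected (for example from FIS experiment logs and an application heartbeat table) are assumptions, while the 60s and 20s targets come from the text.

```python
from dataclasses import dataclass

RTO_TARGET_S = 60.0
RPO_TARGET_S = 20.0


@dataclass(frozen=True)
class DrillResult:
    failure_injected_at: float      # epoch seconds when the AZ failure was injected
    last_commit_on_standby: float   # epoch seconds of the newest commit found on the standby
    service_restored_at: float      # epoch seconds when the endpoint accepted writes again


def evaluate(drill: DrillResult) -> dict:
    """Compute observed RTO and RPO for one drill and compare them to the targets."""
    rto_s = drill.service_restored_at - drill.failure_injected_at
    rpo_s = max(0.0, drill.failure_injected_at - drill.last_commit_on_standby)
    return {
        "rto_s": rto_s,
        "rpo_s": rpo_s,
        "rto_ok": rto_s < RTO_TARGET_S,
        "rpo_ok": rpo_s < RPO_TARGET_S,
    }


if __name__ == "__main__":
    print(evaluate(DrillResult(1000.0, 995.0, 1042.0)))
    print(evaluate(DrillResult(1000.0, 970.0, 1075.0)))
```

Storing the output of each drill gives the audit trail a regulated environment usually asks for: measured recovery numbers over time instead of a runbook claim.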
Anti-Patterns That Destroy the Business Case
- Mounting everything via NFS v3: NFS v3 does not support adequate locking for Oracle datafiles. Use iSCSI for DATA and REDO, NFS v4.1 only for backups and exports.
- Using ONTAP snapshot as a Data Guard substitute: Snapshots protect against infrastructure failure, not logical corruption. A DELETE FROM orders WHERE 1=1 (a predicate that matches every row) is replicated by SnapMirror before you notice.
- Not calculating Oracle Core Factor before choosing instance type: Accidentally doubling Oracle core count in a migration can cost more in licensing than all AWS infrastructure per year.
- Enabling automatic tiering on the redo log volume: The redo log needs sub-millisecond latency. Tiering to S3 introduces object retrieval latency that can cause commit failure. tiering-policy none is mandatory for REDO.
- Treating Oracle/FSx migration as an isolated infrastructure project: Without the Glue/S3 pipeline from Act 2, you merely replicated the on-premises problem to the cloud with different cost. The real value lies in the incremental modernization that ONTAP snapshots enable.
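Two of these anti-patterns (the protocol choice and tiering on REDO) are mechanical enough to check automatically before a change is applied. This is a minimal sketch over a hypothetical inventory format, a list of dicts with diskgroup, protocol, and tiering_policy keys; it is not tied to any FSx or ONTAP API.

```python
def find_antipatterns(volumes):
    """Return human-readable findings for a list of volume descriptions."""
    findings = []
    for vol in volumes:
        diskgroup = vol["diskgroup"].upper()
        # DATA and REDO should be served over iSCSI, not NFS.
        if diskgroup in {"DATA", "REDO"} and vol["protocol"].lower() != "iscsi":
            findings.append(f"{diskgroup}: use iSCSI, not {vol['protocol']}")
        # The redo volume must never tier to the capacity pool.
        if diskgroup == "REDO" and vol["tiering_policy"].upper() != "NONE":
            findings.append(
                f"REDO: tiering policy must be NONE, found {vol['tiering_policy']}"
            )
    return findings


if __name__ == "__main__":
    inventory = [
        {"diskgroup": "DATA", "protocol": "nfs3", "tiering_policy": "AUTO"},
        {"diskgroup": "REDO", "protocol": "iscsi", "tiering_policy": "AUTO"},
        {"diskgroup": "FRA", "protocol": "iscsi", "tiering_policy": "AUTO"},
    ]
    for finding in find_antipatterns(inventory):
        print(finding)
```

The remaining anti-patterns are design and process decisions, so they stay in the architecture review rather than in a script.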
The Counter-Intuitive Insight: Two Replication Layers Are Cheaper Than One
The instinctive reaction to seeing SnapMirror + Data Guard together is 'we are paying twice for the same thing'. In practice, it is the opposite. Synchronous SnapMirror eliminates the need to provision extra IOPS on the standby to keep redo apply current, because the storage is already replicated. Data Guard in Maximum Availability with redo log on FSx ONTAP has lower network overhead than on EBS because ONTAP absorbs latency variability locally. The total cost of both layers is lower than the cost of a single over-provisioned EBS io2 volume trying to do the work of both.
The Horizon: When Oracle Becomes Optional
The question every financial CTO asks me after seeing this architecture is: 'when can we leave Oracle for good?' The honest answer is that it depends on three variables that are not technical: (1) the volume of PL/SQL stored procedures with embedded business logic — which need to be rewritten, not migrated; (2) the Oracle license contracts in force — which may have early exit penalty clauses; and (3) the regulatory tolerance for database changes in systems of record.
What the FSx ONTAP + Glue + S3 architecture does is transform these three variables from absolute blockers into manageable variables. The Glue pipeline from Act 2 delivers a live inventory of the Oracle schema — you know exactly how many procedures exist, which are called frequently, which can be replaced by application logic. This transforms the conversation from 'migrate the database' to 'migrate module by module', with each module having an independent business case.
In financial environments I have followed, the most common path is to keep Oracle for the transactional core (ledger, positions, settlement) and migrate to Aurora PostgreSQL the peripheral modules (regulatory reporting, onboarding, KYC). This reduces Oracle license exposure by 40-60% without touching the highest-risk systems. FSx ONTAP remains as the storage layer for residual Oracle, and the Glue pipeline feeds the data mesh with data from both databases — creating a unified analytical layer that does not depend on which relational database is underneath.
This is the realistic end state: not 'zero Oracle', but 'Oracle as a choice, not a prison'. And FSx ONTAP is the infrastructure that makes that horizon reachable without a big-bang that no financial regulator would approve.
In practice, what I would do first is implement the three-layer observability pipeline before any migration — CloudWatch, V$SYSMETRIC via Lambda, and OpenTelemetry in the middleware. Without this baseline, you have no way to prove that the migration improved (or did not degrade) performance, and in a regulated financial environment, that proof is as important as the SLA. The most expensive lesson I have learned in Oracle/cloud projects is that the problem is rarely the database — it is the absence of data to diagnose the database. FSx ONTAP elegantly solves the storage problem, but observability is what saves you at 3am during a production incident.
Verdict: Adopt as First Move, Not as Final Solution
FSx for NetApp ONTAP with Oracle Data Guard in Multi-AZ is the most mature Oracle HA architecture available on AWS today — and the AWS Architecture Blog signal confirms that AWS is investing in this direction. For architects in financial environments, the recommendation is clear: use this combination as Act 1 of a three-act modernization strategy. It delivers RTO < 60s, RPO < 20s, a 60-70% reduction in storage cost, and — most importantly — opens the window for incremental modernization via ONTAP snapshots and Glue pipelines without requiring a big-bang. The risk of not acting is continuing to pay the operational and licensing cost of an on-premises Oracle that increasingly hinders the adoption of modern architectures. The risk of acting is manageable if you correctly sequence the three acts and instrument observability before migrating. Start Act 1 this quarter.