Pixels to Planning: Geospatial Data Platforms on AWS
Earth AI platforms are leaving research labs and entering real operational decisions — from climate credit to infrastructure asset management. Architecting this pipeline on AWS requires precise choices about raster/vector ingestion, geospatial partitioning, inference latency, and data lineage. This article documents the decisions I would make today, the anti-patterns I have seen in the field, and the checklist you can act on tomorrow.
When Google Research publishes on 'pixels to planning' — turning satellite imagery into sustainable planning decisions — the real technical signal is not in the computer vision model. It is in the data platform that must exist before any pixel is processed: petabyte-scale raster ingestion, geospatial partitioning that does not break under analytical load, lineage traceability that satisfies carbon credit auditors, and inference with predictable latency for operational decisions. I have designed data pipelines for financial-grade environments where a single wrong location attribute cost millions in incorrect hedging. The discipline I learned in those environments applies directly here — and that is what I will document.
The Real Problem: Geospatial Data Is Not Just Large Files
Most multispectral satellite image files arrive as Cloud-Optimized GeoTIFF (COG) or HDF5, ranging from 500 MB to 5 GB per scene depending on resolution and band count. Sentinel-2 alone produces roughly 1.6 TB per day globally. When you start ingesting multiple constellations — Sentinel, Landsat, Planet, SAR radar data — the volume grows to tens of terabytes daily before you even compute derived indices like NDVI, NDWI, or land surface temperature.
The mistake I see most often is treating this data like log files: dump everything into S3 with date-based prefixes and expect Athena to handle it. It does not. The problem is that geospatial queries have two conflicting partitioning axes: time (when the image was captured) and space (which tile/bounding box it covers). A typical analytical query — 'show me forest cover change in the Amazon between 2022 and 2024' — needs to cross hundreds of tiles over two years. With naive date-only partitioning, Athena scans time partitions without spatial filtering, generating S3 scan costs that can reach tens of dollars per query.
The correct solution is to use a table format with geospatial predicate pushdown — Apache Iceberg with GeoParquet extension, stored on S3, with the catalog managed via AWS Glue Data Catalog. GeoParquet encodes geometries as WKB binary columns with bounding box statistics per row group, allowing Athena (via Trino) to skip row groups that do not intersect the query polygon. In internal benchmarks I have run, this reduced scanned data volume by 60–80% for typical spatial window queries.
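To make the row-group skipping idea concrete, here is a minimal sketch in plain Python. It is not how Athena or Trino implement pruning internally; the BBox type, the row-group statistics, and the query window are all made-up values used only to show why bounding box statistics let an engine avoid reading data that cannot match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box in lon/lat degrees."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap unless one lies entirely to one side of the other.
        return not (
            self.xmax < other.xmin or other.xmax < self.xmin
            or self.ymax < other.ymin or other.ymax < self.ymin
        )


def row_groups_to_read(stats: dict, query: BBox) -> list:
    """Return ids of row groups whose bbox statistics overlap the query window."""
    return [rg for rg, bbox in sorted(stats.items()) if bbox.intersects(query)]


# Hypothetical per-row-group bounding boxes for one GeoParquet file.
stats = {
    0: BBox(-70.0, -10.0, -65.0, -5.0),  # western Amazon
    1: BBox(-60.0, -5.0, -55.0, 0.0),    # central Amazon
    2: BBox(10.0, 45.0, 15.0, 50.0),     # central Europe
}
query = BBox(-62.0, -4.0, -58.0, -1.0)

print(row_groups_to_read(stats, query))  # → [1]
```

Only one of the three row groups is read. The benefit depends on how spatially clustered the rows are when the file is written, which is why the partitioning and sort order matter as much as the format.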
Financial-Grade Geospatial Pipeline on AWS
Full flow from satellite image ingestion to sustainable planning decisions, with lineage traceability and governance.
- S3 Raw Zone · COG / HDF5
- SQS FIFO · S3 Event Notifications
- Lambda · Validation & Tagging
- Glue Spark Job · COG → GeoParquet/Iceberg
- S3 Curated Zone · Iceberg + GeoParquet
- Glue Data Catalog · Iceberg Metastore
- SageMaker Training · Segmentation / NDVI
- SageMaker Endpoint · Real-Time Inference
- SageMaker Batch · Transform at Scale
- Athena · Spatial Query
- Lake Formation · RBAC + Column-level
- SageMaker Lineage · + OpenLineage
- API Gateway · Planning API
Geospatial Partitioning: The Decision That Defines Cost and Latency
After resolving the storage format, the second critical decision is the partitioning strategy in S3 and Iceberg. For geospatial data, I use a three-level hierarchy: year/month as the first-level partition (for temporal pruning), H3 grid resolution 4 as the second-level partition (covering approximately 1,770 km² per cell on average, which works out to roughly 288,000 cells for global coverage), and sensor as the third level.
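The sizing of an H3 resolution can be checked with a few lines of arithmetic. The sketch below uses the H3 cell-count formula (2 + 120 × 7^resolution) and an approximate Earth surface area, then builds an S3 prefix for the three-level hierarchy. The prefix layout, the h3_r4 key name, and the example cell id are illustrative assumptions, not a fixed convention.

```python
EARTH_SURFACE_KM2 = 510_065_622  # approximate surface area of the Earth


def h3_cell_count(resolution: int) -> int:
    """Number of H3 cells at a given resolution: 2 + 120 * 7^resolution."""
    if not 0 <= resolution <= 15:
        raise ValueError("H3 resolutions run from 0 to 15")
    return 2 + 120 * 7 ** resolution


def avg_cell_area_km2(resolution: int) -> float:
    """Average cell area, obtained by dividing the globe by the cell count."""
    return EARTH_SURFACE_KM2 / h3_cell_count(resolution)


def partition_prefix(year: int, month: int, h3_cell: str, sensor: str) -> str:
    """Hypothetical S3 prefix for the year/month -> H3 cell -> sensor hierarchy."""
    return f"year={year}/month={month:02d}/h3_r4={h3_cell}/sensor={sensor}/"


print(h3_cell_count(4))             # → 288122
print(round(avg_cell_area_km2(4)))  # → 1770
# The cell id below is a placeholder string, not a verified H3 index.
print(partition_prefix(2024, 3, "8428309ffffffff", "sentinel2"))
```

Running the same functions for resolutions 3 and 5 is a quick way to see how the partition count trades off against per-partition file size before committing to a resolution.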
H3 (Uber's hierarchical hexagonal grid library) has a critical property for geospatial data: cells at the same resolution level have approximately equal area, meaning partition file sizes are predictable, something rectangular bounding box partitioning does not guarantee. By indexing each scene's geometries with h3.polyfill() in the Glue Job (the function is named polygon_to_cells() in h3-py 4.x), you can join tables from different sensors using the H3 index as a key, without expensive geometric intersection operations at query time.
An important operational detail: the Glue Job converting COG to GeoParquet should be configured with --conf spark.sql.shuffle.partitions=200 and G.2X workers (8 vCPU, 32 GB RAM) to process multispectral bands without disk spill. For Sentinel-2 10m resolution scenes (13 bands, ~800 MB per scene), processing a full day of global ingestion takes approximately 45 minutes with 20 G.2X workers — a cost of roughly USD 12 per execution. That number matters when you are planning a continuous operations budget.
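The cost figure can be reproduced with simple DPU arithmetic. This is a back-of-the-envelope sketch: a G.2X worker counts as 2 DPUs, and the price of USD 0.44 per DPU-hour is an assumption that varies by region and over time, so check current pricing before budgeting.

```python
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2}
PRICE_PER_DPU_HOUR_USD = 0.44  # assumption: verify against current regional pricing


def glue_run_cost(worker_type: str, workers: int, minutes: float,
                  price: float = PRICE_PER_DPU_HOUR_USD) -> float:
    """Estimated cost in USD of one Glue job run."""
    dpu_hours = DPU_PER_WORKER[worker_type] * workers * minutes / 60
    return round(dpu_hours * price, 2)


per_run = glue_run_cost("G.2X", workers=20, minutes=45)
print(per_run)                 # → 13.2
print(round(per_run * 30, 2))  # → 396.0 for 30 daily runs
```

Under the assumed price the result lands in the same range as the figure quoted above; the exact number depends on region and on how long the job actually runs.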
The Iceberg table should be configured with write.target-file-size-bytes=268435456 (256 MB) and compaction scheduled via Glue Workflow every 6 hours to avoid the small files problem that degrades Athena performance.
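As a sketch of what these two settings look like in practice, the helpers below build the Spark SQL statements a Glue job could pass to spark.sql(). The catalog name glue_catalog and the table geo.scenes are hypothetical; rewrite_data_files is Iceberg's Spark compaction procedure.

```python
TARGET_FILE_SIZE_BYTES = 256 * 1024 * 1024  # 268435456


def set_target_file_size_sql(table: str) -> str:
    """ALTER TABLE statement that sets the Iceberg write target file size."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'write.target-file-size-bytes'='{TARGET_FILE_SIZE_BYTES}')"
    )


def compaction_sql(catalog: str, table: str) -> str:
    """CALL statement for Iceberg's rewrite_data_files procedure."""
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"options => map('target-file-size-bytes', '{TARGET_FILE_SIZE_BYTES}'))"
    )


# Hypothetical catalog and table names.
print(set_target_file_size_sql("glue_catalog.geo.scenes"))
print(compaction_sql("glue_catalog", "geo.scenes"))
```

In a scheduled job, each string would be executed with spark.sql(...) against a session configured with the Iceberg catalog.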
Playbook: Building the Geospatial Pipeline on AWS
1. Define the Landing Zone with Lifecycle Policy
Create a dedicated S3 bucket for raw data with S3 Intelligent-Tiering enabled from day one. Configure Object Lock in COMPLIANCE mode for reference data (baseline images) that require immutability for carbon credit auditing. Route object-created events to an SQS FIFO queue with a deduplication ID based on the object ETag, which prevents duplicate reprocessing when the same file is re-uploaded by a data provider. S3 Event Notifications cannot target FIFO queues directly, so send the events through EventBridge or a small forwarding Lambda that sets the deduplication ID.
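A minimal sketch of deriving the deduplication ID from an S3 event record follows. The record layout (s3.bucket.name, s3.object.key, s3.object.eTag) matches the S3 event message structure; the bucket name, key, grouping choice, and message body are illustrative assumptions. Keep in mind that SQS FIFO deduplication only applies within a five-minute window, so longer-term idempotency still has to come from the downstream write.

```python
import hashlib
import json


def fifo_message_params(record: dict) -> dict:
    """Build SQS FIFO send parameters from one S3 event record."""
    bucket = record["s3"]["bucket"]["name"]
    obj = record["s3"]["object"]
    # Same bucket, key, and ETag are treated as the same upload.
    fingerprint = f"{bucket}/{obj['key']}/{obj['eTag']}"
    return {
        "MessageGroupId": bucket,  # assumption: one ordering group per bucket
        "MessageDeduplicationId": hashlib.sha256(fingerprint.encode()).hexdigest(),
        "MessageBody": json.dumps(
            {"bucket": bucket, "key": obj["key"], "etag": obj["eTag"]}
        ),
    }


# Hypothetical event record for a re-uploaded scene.
record = {
    "s3": {
        "bucket": {"name": "raw-zone"},
        "object": {
            "key": "sentinel2/scene-0001.tif",
            "eTag": "9b2cf535f27731c974343645a3985328",
        },
    }
}
params = fifo_message_params(record)
print(params["MessageDeduplicationId"][:16])
```

The returned dictionary has the parameter names that the SQS SendMessage call expects, alongside the queue URL.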
2. Build the COG → GeoParquet/Iceberg Glue Transformation Job
Use Glue 4.0 with native Iceberg support. Add the geoparquet, h3-py, and rasterio dependencies with the --additional-python-modules job parameter; JVM-side libraries go in as additional JARs. The job must: (a) read the COG with rasterio using windowed reads to avoid OOM, (b) compute the H3 index for each tile, (c) write to partitioned GeoParquet, (d) execute MERGE INTO on the Iceberg table to support idempotent re-ingestion using the composite key (scene_id, acquisition_date, sensor).
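Step (d) can be sketched as a small helper that builds the MERGE INTO statement from the composite key. The table name and the staging view name are hypothetical; the UPDATE SET * and INSERT * shorthand is Spark SQL syntax supported for Iceberg tables and assumes the staging view has the same columns as the target.

```python
KEY_COLUMNS = ("scene_id", "acquisition_date", "sensor")


def merge_into_sql(target: str, source: str, keys=KEY_COLUMNS) -> str:
    """Build an idempotent upsert: matched rows are updated, new rows inserted."""
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    return (
        f"MERGE INTO {target} t\n"
        f"USING {source} s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical target table and staging view names.
print(merge_into_sql("glue_catalog.geo.scenes", "staged_scenes"))
```

Because a re-ingested scene matches on all three key columns, it overwrites its earlier row instead of creating a second one.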
3. Configure Fine-Grained Access Control with Lake Formation
Register the S3 location in Lake Formation and use Cell-Level Security to restrict access by geometry. For example, climate credit analysts for Brazil can only query H3 cells within the national territory polygon. This is implemented via Row Filter Expressions in Lake Formation with the condition h3_index IN (SELECT h3_index FROM geo_reference WHERE country = 'BRA'). Combine with Column-Level Security to protect sensitive land ownership metadata.
4. Train and Register Models with Full Traceability
Use SageMaker Experiments to track each training run with the Iceberg dataset version metadata (snapshot ID). Register the model in SageMaker Model Registry with manual approval for models that feed financial decisions (natural asset valuation, climate credit scoring). Configure SageMaker Lineage Tracking and export to OpenLineage/Marquez for integration with external data governance tools. The Iceberg snapshot ID as a training parameter is what allows exact reproduction of the dataset used in any audited model.
5. Expose Inference with Controlled Latency via API Gateway
For real-time planning queries (e.g., 'what is the deforestation risk in this polygon over the next 12 months?'), use SageMaker Real-Time Endpoint with ml.g4dn.xlarge instances (1 T4 GPU) and configure Autoscaling with a 70% GPU utilization target. Place REST API Gateway in front with Usage Plans and API Keys per tenant. Configure the API Gateway timeout to 29 seconds (the default maximum integration timeout) and implement a circuit breaker in the integration Lambda to fall back to cached results in DynamoDB when the endpoint is under pressure.
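The circuit breaker with cache fallback can be sketched in a few lines. This is a simplified, single-process illustration: the thresholds are arbitrary, the endpoint call is a stand-in function, and a plain dictionary plays the role of the DynamoDB cache. A Lambda deployment would need to keep the breaker state somewhere shared, since each execution environment has its own memory.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after repeated failures, then allow a trial call after a cool-down."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Half-open: let one trial call through; one more failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return False
        return True

    def call(self, primary: Callable[[], dict], fallback: Callable[[], dict]) -> dict:
        if self.is_open():
            return fallback()  # skip the endpoint entirely while open
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the circuit again
        return result


# Stand-ins for the real integrations (hypothetical values).
cache = {"polygon-123": {"risk": 0.42, "source": "cache"}}


def invoke_endpoint() -> dict:
    raise TimeoutError("endpoint under pressure")


breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0)
for _ in range(3):
    print(breaker.call(invoke_endpoint, lambda: cache["polygon-123"]))
```

All three calls return the cached result; the third one does not touch the endpoint at all because the circuit is open by then.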
6. Instrument Observability with OpenTelemetry
Instrument the full pipeline with OpenTelemetry: trace spans from SQS to the inference endpoint, ingestion throughput metrics (scenes/hour), Athena query latency by spatial query type, and model drift (prediction distribution vs. baseline). Send to CloudWatch with EMF (Embedded Metric Format) for custom metrics and configure alarms on P99 inference latency > 8 seconds and Glue Job error rate > 1%. These two SLOs are the minimum to operate with confidence.
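For the custom metrics, a log line in Embedded Metric Format is just a JSON object with an _aws metadata block. The sketch below builds one by hand to show the structure; the namespace, metric name, and dimension are illustrative assumptions, and in practice an EMF client library or the OpenTelemetry exporter can produce the same shape.

```python
import json
import time
from typing import Optional


def emf_record(namespace: str, metric: str, value: float, unit: str,
               dimensions: dict, timestamp_ms: Optional[int] = None) -> str:
    """Serialize one metric as a CloudWatch Embedded Metric Format log line."""
    ts = int(time.time() * 1000) if timestamp_ms is None else timestamp_ms
    record = {
        "_aws": {
            "Timestamp": ts,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions)],  # one dimension set
                "Metrics": [{"Name": metric, "Unit": unit}],
            }],
        },
        **dimensions,   # dimension values live at the top level
        metric: value,  # so does the metric value itself
    }
    return json.dumps(record)


# Hypothetical namespace and metric name.
print(emf_record("EarthAI/Pipeline", "InferenceLatency", 8421,
                 "Milliseconds", {"Stage": "inference"}))
```

Printing such a line from a Lambda function sends it to CloudWatch Logs, where the metric is extracted and becomes available for percentile alarms.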
GeoParquet + Iceberg: The Game-Changing Combo
If you can only make one architectural change today: migrate your geospatial data from raw GeoTIFF/Shapefile in S3 to GeoParquet on Iceberg tables with a Glue catalog. The migration cost is 1–2 engineering sprints. The return is a 60–80% reduction in Athena scan costs and elimination of geospatial joins at query time. This is not premature optimization — it is a prerequisite for any geospatial analysis at scale.
Data Governance for Financial Decisions: Lineage Is Not Optional
When Earth AI platform outputs feed financial decisions — carbon credit valuation, climate risk scoring for credit portfolios, parametric insurance pricing based on vegetation indices — lineage traceability stops being a best practice and becomes a regulatory requirement. In Brazil, BACEN already requires financial institutions to demonstrate the full methodology behind climate risk models (CMN Resolution 4.945/2021). In Europe, the EU AI Act classifies environmental risk scoring systems as 'high-risk', requiring training data documentation.
In practice, this means each model prediction must be traceable to: (1) the exact Iceberg dataset snapshot used in training, (2) the version of the Glue Job code that processed the raw data, (3) the model version in SageMaker Model Registry, and (4) the feature engineering repository commit. This is what I call quadruple traceability — and most implementations I see cover only (3).
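One way to enforce the four anchors is to make them a single immutable record that cannot be constructed with a missing field. The sketch below is an assumption about how such a record could look; the field names and all example values are hypothetical.

```python
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class PredictionLineage:
    """The four anchors every prediction should be traceable to."""
    iceberg_snapshot_id: int   # (1) dataset snapshot used in training
    glue_job_version: str      # (2) version of the processing job code
    model_package_arn: str     # (3) model version in the Model Registry
    feature_repo_commit: str   # (4) feature engineering repository commit

    def __post_init__(self):
        missing = [f.name for f in fields(self) if not getattr(self, f.name)]
        if missing:
            raise ValueError(f"incomplete lineage, missing: {missing}")


# Hypothetical values for illustration only.
lineage = PredictionLineage(
    iceberg_snapshot_id=5781947118336215154,
    glue_job_version="cog-to-geoparquet@1.4.2",
    model_package_arn="arn:aws:sagemaker:us-east-1:111122223333:model-package/landcover/7",
    feature_repo_commit="3f9c2ab",
)
print(asdict(lineage))
```

Attaching this record to every inference response, or storing it next to each batch output, makes gaps visible at write time rather than during an audit.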
The technical implementation uses SageMaker Lineage Tracking to capture the model → dataset → processing chain, with the Iceberg snapshot ID as the immutable anchor. For integration with external governance tools (Collibra, Alation, DataHub), export lineage events via EventBridge to a Lambda that converts them to OpenLineage format and sends them to the Marquez API. This pattern allows external auditors to query the full lineage without needing direct access to the AWS account.
A critical detail: Lake Formation must have audit logging enabled for all data access operations, with logs sent to a separate S3 bucket with Object Lock. In carbon credit audits I have participated in, the absence of data access audit logs was the primary blocker for certification.
Defense in Depth: Location Data Is Sensitive Data
There is a dangerous tendency to treat geospatial data as 'just map data' — public, non-sensitive, requiring no rigorous controls. This is a serious mistake. High-resolution location data combined with satellite image time series can reveal: troop movements, industrial facility production capacity, crop conditions before public reports (material non-public information for insider trading purposes), and land occupation patterns with legal implications.
The security architecture I implement for these systems follows Zero Trust with four layers: network (VPC with private endpoints for S3, SageMaker, and Glue — no traffic leaving to the public internet), identity (IAM roles with aws:RequestedRegion and aws:SourceVpc conditions to ensure only services within the VPC access the data), data (dedicated KMS CMK per data classification — one CMK for raw data, another for processed data, another for model outputs — with key policies requiring kms:ViaService for access only via specific AWS services), and application (Lake Formation with Row/Column-level security).
A specific pattern I use to protect highly sensitive data: the processed data S3 bucket has a bucket policy with aws:PrincipalOrgPaths that restricts access only to specific OUs within the AWS Organization. This ensures that even if an IAM role is compromised in another account in the organization, it will not have access to the sensitive geospatial data.
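As an illustration of the shape such a bucket policy could take, the sketch below builds a statement that grants read access only to principals whose organization path matches one of the approved OUs. The bucket name and the organization, root, and OU identifiers are placeholders, and a production policy would typically combine this with explicit deny statements and further conditions.

```python
import json


def org_path_bucket_policy(bucket: str, org_paths: list) -> str:
    """Bucket policy granting read access only to principals under given OU paths."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowReadFromApprovedOUs",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
            "Condition": {
                # aws:PrincipalOrgPaths is multivalued, hence the set operator.
                "ForAnyValue:StringLike": {"aws:PrincipalOrgPaths": org_paths}
            },
        }],
    }
    return json.dumps(policy, indent=2)


# Placeholder organization, root, and OU identifiers.
print(org_path_bucket_policy(
    "curated-geo-zone",
    ["o-a1b2c3d4e5/r-ab12/ou-ab12-11111111/*"],
))
```

The trailing wildcard in the path also matches accounts in child OUs of the approved OU; dropping it restricts the grant to accounts attached directly to that OU.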
For LGPD and GDPR compliance with data that may contain personal information (urban area imagery at sub-meter resolution), implement a face/license plate detection and anonymization step as part of the Glue transformation Job, before any data reaches the curated zone.
Anti-Patterns I Have Seen in the Field
- Storing raw GeoTIFF as the analytical source of truth: GeoTIFF has no predicate pushdown. Every query scans the entire file. For analysis, convert to GeoParquet/Iceberg. Keep the original GeoTIFF only for reproducibility and reprocessing.
- Using SageMaker Real-Time Endpoint for batch scoring of millions of polygons: The cost and latency of individual calls make this prohibitive. Use SageMaker Batch Transform with S3 as source/sink — processes 1M polygons in minutes at a fraction of the cost.
- Ignoring the small files problem: Iceberg with many incremental writes generates thousands of small files. Without scheduled compaction, Athena pays the overhead of listing and opening each file. Configure OPTIMIZE via Glue Workflow every 6 hours.
- Training models without pinning the dataset snapshot ID: Without the Iceberg snapshot ID as a training parameter, you cannot reproduce the exact dataset of any model in production. This is unacceptable in regulated environments.
- Exposing geospatial inference endpoints without per-tenant rate limiting: Large bounding box queries can consume disproportionate resources. Implement Usage Plans in API Gateway with limits per tier (e.g., 100 req/min for basic tier, 1,000 req/min for enterprise).
- Assuming public satellite data (Sentinel, Landsat) requires no access control: The raw data may be public, but derived indices, trained models, and calculated risk scores are proprietary assets. Treat them as such from day one.
MLOps for Geospatial Models: Drift Is Not Just Statistical
Computer vision models for geospatial data have a type of drift that does not appear in standard statistical monitors: sensor drift. When a satellite provider updates radiometric calibration processing, or when you add a new constellation to the pipeline, the pixel value distribution changes even though the physical world has not. A model trained on Sentinel-2 L2A can silently degrade when it starts receiving data from a new sensor without retraining.
The solution is to implement drift monitoring at two levels: (1) feature drift using SageMaker Model Monitor with a baseline calculated separately by sensor and by season (summer vs. winter vegetation has completely different distributions), and (2) concept drift by monitoring the output prediction distribution against ground truth collected periodically via crowdsourcing or field validation.
For retraining, I use a continuous training with human approval pattern: a Step Functions workflow triggered when the drift score exceeds a threshold (e.g., PSI > 0.2 for any input feature), executes retraining with the last 90 days of Iceberg data, registers the new model in Model Registry with 'PendingApproval' status, and notifies the data science team via SNS. Manual approval is mandatory before production deployment — not for bureaucratic reasons, but because in financial decisions, a silently degrading model can cause systemic damage before detection.
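The PSI trigger can be sketched as follows. The Population Stability Index compares the binned baseline distribution of a feature with the current one; the bin proportions, feature names, and epsilon floor below are illustrative assumptions, and the 0.2 threshold is the example value from the text.

```python
import math


def psi(expected: list, actual: list, eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    Both inputs are per-bin proportions that sum to 1 over the same bins.
    """
    if len(expected) != len(actual):
        raise ValueError("distributions must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


def should_retrain(feature_psi: dict, threshold: float = 0.2) -> bool:
    """Trigger retraining when any input feature drifts past the threshold."""
    return any(score > threshold for score in feature_psi.values())


baseline = [0.25, 0.25, 0.25, 0.25]
current = [0.10, 0.20, 0.30, 0.40]  # hypothetical shifted band distribution

scores = {
    "red_band": psi(baseline, current),
    "nir_band": psi(baseline, baseline),
}
print({name: round(score, 3) for name, score in scores.items()})
print(should_retrain(scores))  # → True
```

In the workflow described above, should_retrain would be the condition that starts the Step Functions execution, with baselines kept per sensor and per season.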
A concrete capacity number: for land cover segmentation models (U-Net with ResNet-50 backbone, ~25M parameters), full retraining on 90 days of Sentinel-2 data for Brazil (~180K scenes) takes approximately 8 hours on an ml.p3.8xlarge instance (4 V100 GPUs) — a cost of ~USD 90. This is acceptable for monthly or drift-triggered retraining.
Frequently Asked Questions in Geospatial Platform Architecture
Should I use Amazon Location Service or build my own geospatial stack?
Amazon Location Service is excellent for geocoding, routing, and real-time asset tracking use cases. For satellite image analysis, spectral indices, and geospatial ML, you need the full stack (S3 + Glue + GeoParquet + SageMaker). The two are not competitors — use Location Service for the operational location layer and the analytical stack for image processing.
What is the estimated monthly cost for a national-scale Earth AI platform (Brazil)?
For national coverage of Brazil with Sentinel-2 (10m resolution, 5-day revisit): ~USD 800/month in S3 (2 years of processed data storage ~50 TB), ~USD 360/month in Glue Jobs (30 runs/month), ~USD 200/month in Athena (analytical queries), ~USD 400/month in SageMaker Endpoint (ml.g4dn.xlarge, 24/7). Total: ~USD 1,760/month before retraining and auxiliary services. Scales linearly with additional sensors.
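The estimate is easy to keep honest as a small script, so that changing one assumption updates the total. The figures are the ones quoted above; the dictionary keys are just labels.

```python
monthly_usd = {
    "s3_storage": 800,          # ~50 TB of processed data
    "glue_jobs": 30 * 12,       # 30 runs at roughly USD 12 each
    "athena": 200,              # analytical queries
    "sagemaker_endpoint": 400,  # one ml.g4dn.xlarge, 24/7
}

total = sum(monthly_usd.values())
shares = {name: round(100 * cost / total, 1) for name, cost in monthly_usd.items()}

print(total)  # → 1760
print(shares)
```

Storage is the largest line item in this breakdown, which is where lifecycle policies and compaction have the most leverage.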
How to handle clouds and missing data in satellite image time series?
Store the cloud mask (SCL band in Sentinel-2) as a separate column in GeoParquet. For time series analysis, query a STAC (SpatioTemporal Asset Catalog) index to identify the most recent cloud-free image for each pixel, and use temporal interpolation to fill the gaps that remain. ML models should be trained with synthetic cloud augmentation for robustness. For critical decisions, always include cloud cover percentage as a confidence metadata field in the inference output.
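A minimal sketch of the per-pixel selection logic, using the Sentinel-2 SCL class codes for cloud shadow (3), medium and high probability cloud (8, 9), and thin cirrus (10). The Observation type and the sample values are assumptions for illustration; a real pipeline would apply the same rule over arrays rather than Python lists.

```python
from datetime import date
from typing import NamedTuple, Optional

CLOUDY_SCL = {3, 8, 9, 10}  # cloud shadow, cloud medium/high probability, thin cirrus
INVALID_SCL = {0, 1}        # no data, saturated or defective


class Observation(NamedTuple):
    acquired: date
    scl: int      # scene classification value for this pixel
    ndvi: float


def latest_clear(observations: list) -> Optional[Observation]:
    """Most recent observation of a pixel that is neither cloudy nor invalid."""
    clear = [o for o in observations if o.scl not in CLOUDY_SCL | INVALID_SCL]
    return max(clear, key=lambda o: o.acquired, default=None)


def cloud_cover_pct(observations: list) -> float:
    """Share of observations flagged as cloud or cloud shadow, for confidence metadata."""
    if not observations:
        return 0.0
    cloudy = sum(1 for o in observations if o.scl in CLOUDY_SCL)
    return round(100 * cloudy / len(observations), 1)


# Hypothetical time series for one pixel.
series = [
    Observation(date(2024, 3, 1), 4, 0.81),   # vegetation, clear
    Observation(date(2024, 3, 6), 9, 0.12),   # high probability cloud
    Observation(date(2024, 3, 11), 3, 0.30),  # cloud shadow
]
print(latest_clear(series).acquired)  # → 2024-03-01
print(cloud_cover_pct(series))        # → 66.7
```

Returning the cloud cover percentage alongside the selected value gives downstream consumers the confidence field mentioned above.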
How to ensure the pipeline is resilient to satellite data ingestion failures?
Use SQS FIFO with a Dead Letter Queue (DLQ) configured for 3 attempts before moving to DLQ. Configure a monitoring Lambda that reads the DLQ every hour and sends alerts via SNS with the scene_id and error. For Glue Job failures, use the native retry mechanism (max 3 retries with exponential backoff) and configure EventBridge notification for persistent failures. Iceberg MERGE INTO guarantees idempotency — reprocessing an already-processed scene does not create duplicates.
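The redrive configuration for three attempts can be sketched as the attribute map passed when creating the queue. The ARN is a placeholder; the attribute names (FifoQueue, RedrivePolicy, deadLetterTargetArn, maxReceiveCount) are the ones SQS uses, and a FIFO source queue requires a FIFO dead-letter queue.

```python
import json


def fifo_queue_attributes(dlq_arn: str, max_receive_count: int = 3) -> dict:
    """Attributes for a FIFO ingestion queue that redrives to a FIFO DLQ."""
    if not dlq_arn.endswith(".fifo"):
        raise ValueError("a FIFO queue needs a FIFO dead-letter queue")
    return {
        "FifoQueue": "true",
        # RedrivePolicy is passed as a JSON string, not a nested object.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        }),
    }


# Placeholder account id and queue name.
attrs = fifo_queue_attributes("arn:aws:sqs:us-east-1:111122223333:scene-ingest-dlq.fifo")
print(attrs)
```

With maxReceiveCount set to 3, a message that fails processing three times is moved to the dead-letter queue on the next receive instead of being retried indefinitely.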
AWS Well-Architected Lenses for Earth AI Platforms
Security
KMS CMK per data classification, Lake Formation with Cell-Level Security by geometry, private VPC endpoints for all data services, Object Lock in COMPLIANCE mode for auditable reference data, and full audit logging with 7-year retention for regulatory compliance.
Reliability
SQS FIFO with DLQ for resilient ingestion, Iceberg MERGE INTO for reprocessing idempotency, Multi-AZ for critical SageMaker Endpoints, and Step Functions with compensation for retraining workflows.
Performance efficiency
GeoParquet with H3 for geospatial predicate pushdown, Glue G.2X workers for multispectral band processing without spill, Iceberg compaction every 6 hours to avoid small files, and SageMaker Batch Transform for at-scale scoring.
What strikes me about the Google Research signal is not the model — it is the implication that sustainable planning decisions now depend on data pipelines that most organizations still do not know how to build. If I were starting this project today, I would invest the first two weeks exclusively in getting geospatial partitioning and data lineage right — not in the model. In my experience, the model is the easy part; the platform that feeds it with traceable, correctly partitioned, and governed data is where projects fail. The most expensive lesson I have learned: location data without access auditing is a regulatory liability waiting to be triggered — and in financial environments, that cost is always greater than the cost of doing it right from the start.
Verdict: Earth AI Is a Data Platform, Not an ML Project
The transition from 'pixels to planning' — from raw satellite imagery to sustainable operational decisions — is technically feasible on AWS today, at an accessible operational cost for mid-sized organizations. But success depends on architectural decisions that must be made before the first model is trained: storage format (GeoParquet/Iceberg, not raw GeoTIFF), partitioning strategy (hierarchical H3, not just by date), lineage governance (quadruple traceability), and defense in depth (KMS + Lake Formation + private VPC). For organizations operating in regulated environments — financial, insurance, carbon credit — traceability and audit logging are not optional. Start with the data platform. The model comes after.