Document Automation with Bedrock: A Modernization Journey
Legacy document extraction pipelines in financial environments accumulate silent technical debt: brittle OCR, manual rules, and absent traceability. In this article, I narrate the modernization journey to Bedrock Data Automation, covering architecture decisions, managed risks, and what genuinely changes in operations. The analysis is grounded in real patterns from critical financial systems, not lab demos.
Every financial institution has a drawer full of PDFs nobody wants to touch. Credit contracts, income statements, technical reports, powers of attorney — documents arriving in inconsistent formats, passing through expensive manual review, and still feeding downstream systems with wrong data. When I evaluated migrating a legacy document extraction pipeline to Bedrock Data Automation, the question was not 'can the AI read this?'. The real question was: 'how do we guarantee traceability, confidence control, and regulatory compliance when a language model makes decisions about critical financial data?' This article documents that journey — the decisions, the risks, and what held up after go-live.
The Starting Point: Invisible Technical Debt
The legacy system was a three-layer composition that grew organically over eight years. At the base, a self-managed OCR engine (Tesseract with custom post-processing) running on EC2 c5.2xlarge instances with manual autoscaling. In the middle, a Python rule set (over 4,200 lines of regular expressions and conditional logic) attempting to normalize fields like tax IDs, dates in multiple formats, and monetary values with regional separators. At the top, an SQS queue feeding an RDS PostgreSQL database, with a human review layer triggered by a fixed confidence threshold of 70%.
The problem was not that the system did not work. It was that it failed in ways nobody could measure. Extraction accuracy for out-of-pattern documents dropped to 34% without any alarm firing, because the system had no extraction quality observability — only throughput metrics. The operational cost of human review consumed 61% of total pipeline cost, but that number was buried in an 'operations' cost center line, not attributed to the system.
When we ran a full diagnostic, we found three structural failures: no per-document traceability (no way to know which rule version extracted which field on which date), tight coupling between extraction schema and rule code (any new document type required an engineering sprint), and a complete absence of audit trail for regulatory purposes. For an environment under Central Bank supervision, this was classifiable operational risk.
The Modernization Journey: Six Sequential Decisions
Phase 0 — Document Inventory and Classification
Before touching any AWS service, we catalogued the 23 document types in production by volume, regulatory criticality, and extraction complexity. We used a three-axis matrix (volume × complexity × risk) to prioritize which types would migrate first. High-volume, low-complexity documents (standardized pay stubs) were the pilot. High-complexity, high-risk documents (contracts with variable clauses) were deferred to the final phase, after validating the confidence model.
Phase 1 — Quality Baseline with Shadow Mode
We implemented Bedrock Data Automation in shadow mode: the legacy pipeline remained the source of truth, but each processed document was simultaneously sent to Bedrock via S3 event notification → Lambda → Bedrock Data Automation API. Results were compared field by field and stored in S3 with divergence metadata. This gave us four weeks of real data to calibrate confidence thresholds per document type, with zero operational risk. The critical finding in this phase: Bedrock performed 23% better on low-quality scanned documents, exactly where Tesseract failed most.
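The field-by-field comparison can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the `FieldDivergence` record, and the normalization rule (collapse whitespace, ignore case) are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldDivergence:
    """One field where the legacy and shadow extractions disagree (hypothetical record)."""
    field_name: str
    legacy_value: Optional[str]
    candidate_value: Optional[str]


def normalize(value):
    """Collapse whitespace and case so cosmetic differences are not counted."""
    if value is None:
        return None
    return " ".join(str(value).split()).casefold()


def compare_extractions(legacy: dict, candidate: dict) -> list:
    """Return one FieldDivergence per field whose normalized values differ."""
    divergences = []
    for name in sorted(set(legacy) | set(candidate)):
        old, new = legacy.get(name), candidate.get(name)
        if normalize(old) != normalize(new):
            divergences.append(FieldDivergence(name, old, new))
    return divergences


def divergence_rate(legacy: dict, candidate: dict) -> float:
    """Share of fields (union of both extractions) that diverge."""
    total = len(set(legacy) | set(candidate))
    if total == 0:
        return 0.0
    return len(compare_extractions(legacy, candidate)) / total
```

Storing the list of divergences next to each result, rather than only the rate, is what makes per-field threshold calibration possible later.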
Phase 2 — Human-in-the-Loop Architecture with Amazon A2I
The legacy system's fixed 70% threshold was arbitrary and did not differentiate by field or document type. We replaced it with a three-tier confidence logic via Step Functions: (1) fields with confidence ≥ 0.92 pass through directly; (2) fields with confidence from 0.72 up to, but not including, 0.92 go to A2I review with field context and document excerpt; (3) fields below 0.72, or documents with more than 3 fields in review, are escalated to a senior analyst with a 4-hour SLA. A2I was configured with custom task templates showing the document excerpt alongside the extracted value, reducing average review time from 4.2 minutes to 1.8 minutes per field.
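As a sketch, the three tiers reduce to a small routing function. The thresholds come from the text above; the `Route` enum, the function names, and the rule that a document with more than 3 fields in review sends all of its pending fields to the senior analyst are assumptions for illustration.

```python
from enum import Enum

AUTO_APPROVE_MIN = 0.92
REVIEW_MIN = 0.72
MAX_FIELDS_IN_REVIEW = 3


class Route(Enum):
    AUTO_APPROVE = "auto_approve"
    A2I_REVIEW = "a2i_review"
    SENIOR_ESCALATION = "senior_escalation"


def route_field(confidence: float) -> Route:
    """Map one field's confidence score to a tier, with no gap between tiers."""
    if confidence >= AUTO_APPROVE_MIN:
        return Route.AUTO_APPROVE
    if confidence >= REVIEW_MIN:
        return Route.A2I_REVIEW
    return Route.SENIOR_ESCALATION


def route_document(confidences: dict) -> dict:
    """Route every field, escalating the document if too many fields need review."""
    routes = {name: route_field(score) for name, score in confidences.items()}
    in_review = sum(1 for r in routes.values() if r is Route.A2I_REVIEW)
    if in_review > MAX_FIELDS_IN_REVIEW:
        # Assumption: auto-approved fields stay approved, the rest are escalated.
        routes = {
            name: r if r is Route.AUTO_APPROVE else Route.SENIOR_ESCALATION
            for name, r in routes.items()
        }
    return routes
```

Comparing with `>=` on both boundaries means a score such as 0.915 always lands in exactly one tier.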
Phase 3 — Immutable Audit Trail and Traceability
For regulatory compliance, every pipeline event is written to an S3 bucket with Object Lock in COMPLIANCE mode (7-year retention, per CMN Resolution 4.658). The event schema includes: document_id, pipeline_version, model_id (Bedrock model ARN), extraction_timestamp, field_name, raw_value, confidence_score, review_action (if applicable), reviewer_id (anonymized hash), final_value, and downstream_system_id. The KMS key policy restricts decrypt to specific audit roles, with condition aws:PrincipalTag/role: auditor. CloudTrail with data events enabled on the bucket guarantees a log of every object access, including denied attempts.
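Because events in a COMPLIANCE-mode bucket cannot be corrected after the fact, it helps to validate them before writing. The sketch below models the event schema listed above as a dataclass; the field names come from the text, while the validation rules, the `AuditEvent` class, and `serialize_for_audit` are illustrative assumptions.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from datetime import datetime
from typing import Optional

REQUIRED_TEXT_FIELDS = (
    "document_id", "pipeline_version", "model_id", "field_name", "downstream_system_id",
)


@dataclass(frozen=True)
class AuditEvent:
    document_id: str
    pipeline_version: str
    model_id: str
    extraction_timestamp: str  # ISO 8601 with an explicit offset, e.g. +00:00
    field_name: str
    raw_value: str
    confidence_score: float
    final_value: str
    downstream_system_id: str
    review_action: Optional[str] = None
    reviewer_id: Optional[str] = None  # anonymized hash, never a raw user id

    def validate(self) -> list:
        """Return a list of human-readable problems; empty means the event is valid."""
        errors = []
        for name in REQUIRED_TEXT_FIELDS:
            if not getattr(self, name):
                errors.append(f"{name} is required")
        if not 0.0 <= self.confidence_score <= 1.0:
            errors.append("confidence_score must be between 0 and 1")
        try:
            datetime.fromisoformat(self.extraction_timestamp)
        except ValueError:
            errors.append("extraction_timestamp is not ISO 8601")
        if self.review_action and not self.reviewer_id:
            errors.append("reviewer_id is required when review_action is set")
        return errors


def serialize_for_audit(event: AuditEvent) -> str:
    """Serialize only valid events; the caller routes failures to a dead letter queue."""
    errors = event.validate()
    if errors:
        raise ValueError("; ".join(errors))
    return json.dumps(asdict(event), sort_keys=True)
```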
Phase 4 — Gradual Migration with Feature Flags
Traffic cutover was controlled by feature flags stored in AWS AppConfig, with rollout by document type and by volume percentage. We started with 5% of pay stub volume, monitoring extraction divergence via CloudWatch custom metrics (namespace DocumentAutomation/ExtractionQuality). The advancement criterion was: divergence < 2% for 72 consecutive hours for the document type in question. This allowed us to identify and fix an edge case in pay stubs from companies with parent/subsidiary tax IDs before reaching 100% volume, with no production impact.
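The advancement criterion is simple enough to express as a single check. In this sketch the input is assumed to be a mapping from hour-aligned timestamps to the divergence rate measured in that hour, and a missing hour is treated as a failure; both choices are assumptions, as is the function name.

```python
from datetime import datetime, timedelta


def ready_to_advance(samples: dict, now: datetime,
                     threshold: float = 0.02, window_hours: int = 72) -> bool:
    """True only if every one of the last `window_hours` full hours is below threshold.

    `samples` maps an hour-aligned datetime to that hour's divergence rate.
    An hour with no sample counts as a failure, so a metrics outage cannot
    silently promote a rollout.
    """
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    for offset in range(1, window_hours + 1):
        rate = samples.get(current_hour - timedelta(hours=offset))
        if rate is None or rate >= threshold:
            return False
    return True
```

Requiring consecutive hours, rather than an average over 72 hours, keeps a short spike from being averaged away.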
Phase 5 — Decommissioning and Steady-State Observability
Legacy pipeline decommissioning was done document type by document type, not as a big bang. OCR EC2 instances were kept on standby for 30 days after each type migrated, with an alarm that automatically reactivated them if the Bedrock error rate exceeded 5% for 15 minutes. The steady-state dashboard includes: auto-approval rate by document type (SLO: ≥ 85%), full pipeline p99 latency (SLO: ≤ 8s for documents ≤ 10 pages), cost per processed document (alert if > $0.04), and human review backlog (alert if > 200 items pending for > 30 minutes).
Document Automation Pipeline with Bedrock — Target Architecture
Full flow from document ingestion to downstream system, with confidence branches, human review, and immutable audit trail.
- S3 Bucket · Raw Documents
- Lambda · Event Router
- Bedrock · Data Automation
- AppConfig · Feature Flags
- Step Functions · Confidence Router
- Amazon A2I · Human Review
- S3 + Object Lock · Audit Trail (7 years)
- KMS · Audit Key
- CloudTrail · Data Events
- Downstream · System
- CloudWatch · SLO Dashboard
Bedrock Data Automation: What Actually Changes in Configuration
Bedrock Data Automation is not an OCR wrapper with an LLM on top. The important operational distinction is that it operates on an extraction blueprint — a JSON schema defining expected fields, types, validations, and extraction instructions in natural language. This fundamentally changes the maintenance model: instead of debugging Python regex, you iterate on the blueprint and version it in S3.
In practice, we configured separate blueprints per document family, not per exact type. An 'income documents' blueprint covers pay stubs, tax returns, and bank statements with conditional instructions. This reduced the number of blueprints from 23 to 7, with equivalent coverage. Each blueprint is versioned with an immutable ARN: when we update, we create a new version, and the pipeline continues using the previous version until the new one passes shadow mode validation.
The most relevant technical attention point is handling multi-page documents with distributed fields. Bedrock Data Automation processes the document as a unit, but fields like 'total value' in a 40-page contract may be on the last page while 'contracting parties' are on the first. We configured page_range_hints in the blueprint for document types where we know the field distribution, which reduced average contract processing latency from 11.2s to 6.8s without accuracy loss — the model does not need to 'search' for the field across the entire document.
For documents with complex tables (financial statements, for example), Bedrock Data Automation's structured output includes bounding box coordinates per cell. We store these coordinates in the audit trail, allowing a human auditor to see exactly where in the physical document each value was extracted from — something impossible in the legacy system.
Before and After: Operational Indicators
Step Functions as Orchestration Backbone: Design Decisions
The choice of Step Functions Express Workflows for orchestration was deliberate and not obvious. Express Workflows have a maximum duration of 5 minutes and do not persist state between executions — this seemed like a problem for documents entering human review, which can take hours. The solution was to split into two workflows: an Express Workflow for the happy path (extraction + validation + delivery, p99 at 12s), and a Standard Workflow for the human review path, which can last up to 24h with native wait state for the A2I callback.
The callback pattern is implemented with the Step Functions SendTaskSuccess / SendTaskFailure API actions: when the reviewer completes the task in the A2I interface, a Lambda is triggered that calls sfn:SendTaskSuccess with the task token stored in DynamoDB. This eliminates polling and keeps Standard Workflow cost low, because you pay per state transition, not per wait time.
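A minimal sketch of the Lambda-side callback logic follows. The client is passed in so the function can be exercised without AWS; in a real handler it would be `boto3.client("stepfunctions")`, whose `send_task_success` and `send_task_failure` methods take the keyword arguments used here. The shape of `review_result` and the error name `HumanReviewFailed` are assumptions, and the DynamoDB lookup of the task token is omitted.

```python
import json


def complete_review(sfn_client, task_token: str, review_result: dict) -> str:
    """Resume the paused Standard Workflow with the reviewer's outcome.

    `review_result` is a hypothetical shape:
    {"status": "completed", "fields": {...}} or {"status": ..., "reason": ...}.
    """
    if review_result.get("status") == "completed":
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"reviewed_fields": review_result["fields"]}),
        )
        return "success"
    # Anything other than a completed review fails the waiting task explicitly,
    # so the workflow does not sit idle until its timeout.
    sfn_client.send_task_failure(
        taskToken=task_token,
        error="HumanReviewFailed",
        cause=str(review_result.get("reason", "unknown")),
    )
    return "failure"
```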
An idempotency detail that cost a sprint to get right: the initial routing Lambda can be invoked more than once for the same document, because S3 event notifications are delivered at least once. We implemented deduplication via DynamoDB with a 24h TTL: before invoking Bedrock, the Lambda checks whether document_id + s3_etag already exists in the table. If so, it returns the cached result. The s3_etag is critical here: document_id alone is not sufficient, because the same document can be resubmitted with corrections.
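One way to make that check atomic is a conditional write instead of a read followed by a write. This is a variation on the approach described above, not necessarily what the pipeline does. The sketch assumes a boto3 DynamoDB `Table` resource (or any object with a compatible `put_item`), a hypothetical partition key named `dedup_key`, and a TTL attribute named `expires_at`.

```python
def claim_document(table, document_id: str, s3_etag: str,
                   now_epoch: int, ttl_seconds: int = 24 * 3600) -> bool:
    """Return True if this invocation is the first to claim (document_id, s3_etag).

    The write succeeds only if no live record exists. Expired records are
    treated as absent because DynamoDB TTL deletion is not immediate.
    """
    dedup_key = f"{document_id}#{s3_etag}"
    try:
        table.put_item(
            Item={"dedup_key": dedup_key, "expires_at": now_epoch + ttl_seconds},
            ConditionExpression="attribute_not_exists(dedup_key) OR expires_at < :now",
            ExpressionAttributeValues={":now": now_epoch},
        )
        return True
    except Exception as exc:
        # botocore's ClientError carries the error code in exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise
```

When the function returns False, the caller would fetch and return the cached result rather than invoking Bedrock again.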
For observability, each Step Functions execution emits events to EventBridge, which feeds a Kinesis Data Firehose → S3 for historical analysis and a Lambda that publishes custom metrics to CloudWatch. X-Ray is enabled on all Lambdas and Step Functions, allowing latency tracing for each step individually — we identified that 34% of total latency was in the routing Lambda cold start, resolved with Provisioned Concurrency of 5 instances during peak hours.
Real Risks That Almost Broke the Migration
1. Model drift without notification. Bedrock Data Automation can update the underlying model without explicit notice if you do not pin the model ID with a version. In a regulatory environment, this is unacceptable — a silent model change can alter extraction behavior and invalidate audit trail traceability. Always use versioned model ARNs and configure a CloudWatch alarm to detect model_id changes in audit logs.
2. Bedrock throughput limits. Bedrock Data Automation has TPS quotas per account and per region. During batch processing peaks (end of month, for example), we hit the 10 TPS limit in us-east-1 and needed to implement exponential backoff with jitter in the invocation Lambda. Request quota increases in advance — the process takes 3 to 10 business days and approval is not guaranteed.
3. A2I cost at unexpected volume. If the confidence threshold is calibrated too conservatively, human review volume explodes. In a test with threshold ≥ 0.95, 43% of documents went to review — operationally unviable. A2I cost is per review task, not per document, so multiple fields in review on the same document multiply the cost. Monitor the review/auto-approval ratio daily in the first weeks.
4. Object Lock and error recovery. With S3 Object Lock in COMPLIANCE mode, you cannot delete or overwrite audit events — not even as root. If an incorrect event is written due to a bug, it stays there for the retention period. Implement rigorous event schema validation before writing to the audit bucket, with a Dead Letter Queue for malformed events.
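For the throughput risk (item 2), exponential backoff with full jitter can be sketched as below. `ThrottledError` is a stand-in for whatever throttling exception the SDK raises, and the default delays and attempt count are illustrative assumptions rather than tuned values.

```python
import random
import time


class ThrottledError(Exception):
    """Stand-in for the SDK's throttling exception (hypothetical name)."""


def call_with_backoff(operation, max_attempts: int = 6, base_delay: float = 0.5,
                      max_delay: float = 20.0, sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying throttled calls with full-jitter backoff.

    The wait before retry k is a random value in [0, min(max_delay, base_delay * 2**k)),
    which spreads retries from concurrent Lambdas instead of synchronizing them.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller or the DLQ handle it
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters exist only to make the function testable without real waiting.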
AI Governance in a Regulatory Environment: Beyond Checkbox Compliance
The hardest question in this migration was not technical — it was governance. When an AI model extracts an income value that will feed a credit decision, who is responsible for the error? The regulatory answer requires the institution to demonstrate it has sufficient controls to detect, correct, and trace errors, regardless of origin (human or algorithmic).
We implemented three governance layers that go beyond what most reference architectures suggest. First, versioned model card: for each blueprint version and Bedrock model in use, we maintain a structured document in S3 with: production entry date, measured accuracy rate by document type, known limitations, and risk committee approval. This document is referenced in the audit trail of each extraction.
Second, confidence distribution monitoring: beyond auto-approval rate SLOs, we monitor the statistical distribution of confidence scores by document type week over week. A shift in distribution (even without SLO violation) is an early signal of model drift or a change in the pattern of incoming documents. We implemented this with CloudWatch Metric Math calculating the 25th percentile of the confidence score — if it drops more than 8 percentage points in 7 days, it automatically opens an investigation ticket.
Third, operationalized right to explanation: for each credit decision that used data extracted by the pipeline, the downstream system can query the audit API (Lambda + API Gateway with IAM authorizer) and receive the full audit trail for that document, including excerpts from the original document with bounding boxes highlighting the extracted fields. This is not just compliance — it is the ability to respond to a customer dispute in minutes, not days.
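The second layer, confidence distribution monitoring, comes down to comparing the 25th percentile of confidence scores week over week. The sketch below does that offline with the standard library; it assumes scores on a 0 to 1 scale, so the 8 percentage point drop becomes 0.08, and the function names and return shape are illustrative.

```python
import statistics


def p25(scores):
    """25th percentile using linear interpolation over the observed range."""
    return statistics.quantiles(scores, n=4, method="inclusive")[0]


def confidence_drift(previous_week, current_week, max_drop: float = 0.08) -> dict:
    """Compare weekly p25 confidence and flag a drop larger than `max_drop`.

    Each input needs at least two scores for the quantile to be defined.
    """
    previous, current = p25(previous_week), p25(current_week)
    drop = previous - current
    return {
        "previous_p25": previous,
        "current_p25": current,
        "drop": drop,
        "open_ticket": drop > max_drop,
    }
```

Watching a low percentile rather than the mean surfaces degradation in the hardest documents first, which is where drift usually shows up before the auto-approval SLO moves.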
Total Cost of Ownership: The Real Math
Migrations to managed AI services frequently underestimate real cost because they compare legacy compute cost with the new service's API cost, ignoring adjacent costs. I will be specific about what we measured.
In the legacy system, monthly cost for 180,000 processed documents was: EC2 (c5.2xlarge × 4 instances, 24/7): $1,104; RDS PostgreSQL (db.r5.large Multi-AZ): $420; human review cost (analysts, 61% of time in review): $8,200; rule maintenance (0.3 engineering FTE): $2,100. Total: ~$11,824/month.
In the new system, for the same volume: Bedrock Data Automation (estimate based on public pricing, ~$0.015/page, average 3 pages/document): $8,100; Lambda + Step Functions + A2I: $340; S3 (including audit bucket with Object Lock): $180; human review cost (13% of volume, reduced time): $1,640; blueprint maintenance (0.05 FTE): $350. Total: ~$10,610/month.
The direct cost reduction is modest (~10%). The real gain is in three places that do not appear in the bill: (1) elimination of regulatory risk from absent audit trail — a Central Bank fine for lack of traceability can cost orders of magnitude more; (2) speed of onboarding new document types — from a 3-week sprint to 2 days of blueprint iteration; (3) scalability without fixed cost — the new pipeline has no idle EC2 instances on weekends, representing an additional $280/month savings during low-volume periods.
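The arithmetic behind these totals can be checked with a few lines. All figures are the ones quoted above; only the dictionary keys and function name are invented for the example.

```python
DOCUMENTS_PER_MONTH = 180_000

# Monthly legacy costs in USD, as itemized above.
LEGACY_MONTHLY_USD = {
    "ec2": 1_104,
    "rds": 420,
    "human_review": 8_200,
    "rule_maintenance": 2_100,
}


def bedrock_monthly_usd(documents: int, pages_per_document: int = 3,
                        usd_per_page: float = 0.015) -> float:
    """Per-page pricing estimate, rounded to cents to avoid float noise."""
    return round(documents * pages_per_document * usd_per_page, 2)


# Monthly costs of the new pipeline in USD, as itemized above.
NEW_MONTHLY_USD = {
    "bedrock_data_automation": bedrock_monthly_usd(DOCUMENTS_PER_MONTH),
    "lambda_step_functions_a2i": 340,
    "s3": 180,
    "human_review": 1_640,
    "blueprint_maintenance": 350,
}

legacy_total = sum(LEGACY_MONTHLY_USD.values())
new_total = sum(NEW_MONTHLY_USD.values())
reduction = (legacy_total - new_total) / legacy_total

print(f"legacy: ${legacy_total:,.0f}  new: ${new_total:,.0f}  reduction: {reduction:.1%}")
```

This reproduces the totals of $11,824 and $10,610 and a reduction of roughly 10.3%. It also shows that Bedrock is the only line item that scales directly with page volume.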
The attention point: if volume grows to 500,000 documents/month, Bedrock Data Automation cost grows linearly while EC2 cost would grow in steps. Above ~350,000 documents/month, it is worth re-evaluating whether a proprietary fine-tuned model or a hybrid solution (Bedrock for complex cases, lightweight model for simple cases) would be more cost-effective.
Well-Architected Pillars Assessment
Security
KMS with restrictive key policy (PrincipalTag condition), S3 Object Lock COMPLIANCE for audit trail, CloudTrail data events, IAM with least privilege per function (extraction, review, audit separated), VPC endpoints for Bedrock and S3 eliminating public internet traffic.
Reliability
Dead Letter Queue on all Lambdas, retry with exponential backoff and jitter for Bedrock calls, deduplication via DynamoDB with TTL, automatic fallback to legacy pipeline via feature flag if Bedrock error rate > 5% for 15 minutes.
Performance efficiency
page_range_hints for latency reduction on multi-page documents, Provisioned Concurrency on routing Lambda, Express Workflows for happy path (p99 12s), Standard Workflow isolation only for human review path.
Cost optimization
No idle EC2 instances, cost per processed document monitored with alert at $0.04, hybrid model re-evaluation planned for volume > 350k documents/month, S3 Intelligent-Tiering on raw documents bucket after 30 days.
If I could redo this migration with what I know today, I would have invested more time in the document inventory phase before touching any service — classification by regulatory risk, not by volume, should have been the primary prioritization criterion. The most expensive mistake I have seen in similar projects is treating the confidence threshold as a technical parameter when it is, in practice, a business risk decision that needs risk committee approval, not the engineering team's. The lesson I carry: in regulated financial environments, AI architecture does not start with the services diagram — it starts with the risk matrix and the accountability model. Everything else is implementation.
Verdict: The Migration Is Worth It, But Not the Way Most Teams Do It
Bedrock Data Automation is a real paradigm shift for document extraction pipelines in financial environments — not for accuracy gain in isolation, but for the combination of accuracy + traceability + maintainability that the blueprint model offers. The migration is worth it when the cost of maintaining manual rules and the regulatory risk of absent audit trail are honestly accounted for. What I do not recommend is big bang migration, threshold calibration without shadow mode, and delegating the confidence decision to the engineering team without risk involvement. Do the document inventory first, validate in shadow mode for at least four weeks, pin model ARNs with explicit versions, and treat the audit trail as a non-negotiable requirement from day zero — not as a compliance add-on.