OIDC Session Metadata and Zero Trust: An Architecture Decision Record
Session metadata support in Sign in with Google opens a genuine window for continuous, signal-driven adaptive access — not just at login time. In this ADR, I analyze the architectural forces, options considered, and the decision I would make in a high-criticality financial system integrated with AWS.
For years, we have treated the authenticated session as a binary state: the user is logged in or not. Session metadata support in Sign in with Google changes that equation by exposing continuous identity signals — account risk, verification state, recent credential changes — directly in the OIDC flow. For financial systems operating under Zero Trust, this is not a convenience feature; it is a threat model shift. The real architectural question is not 'how do I wire this into Cognito', but rather 'where in my authorization chain should these signals be evaluated, at what latency, and what happens when the identity provider does not deliver them?'
Context and Forces: Why This Matters Now
In regulated financial environments (PCI-DSS, SOC 2 Type II, Brazil's CMN Resolution 4,658 and its successor, Resolution 4,893), federated authentication with external providers like Google has always been a double-edged sword. You gain UX and reduce credential management surface, but you lose visibility into the user's session lifecycle on the IdP side. A valid JWT issued at 09:00 is still accepted at 17:00 even if the user changed their password at 14:00, Google detected suspicious account access at 15:00, or the device was flagged as compromised at 16:00.
The traditional mitigation model was simple and blunt: short-lived tokens (15-30 minutes) with aggressive refresh. This works, but at a cost: re-authentication latency, friction in long-running flows (reports, exports, user-initiated batch operations), and pressure on Cognito's request-rate quotas. The default quota for token operations is on the order of 120 requests per second, a limit medium-scale systems hit easily at peak hours; confirm the current value and its scope for your account in Service Quotas.
OIDC session metadata shifts the defense vector. Instead of shortening token lifetime, you enrich the authorization decision with real-time IdP signals. The token can be longer-lived, but each sensitive request passes through an evaluation that considers the current state of the session at Google — not just the state at issuance time. This is, conceptually, what NIST SP 800-207 calls 'continuous verification' within a Zero Trust model.
The Forces in Tension
Before reaching the options, it is necessary to name the forces that make this decision non-trivial:
Latency vs. Signal Freshness. Querying Google session metadata on every request adds an external network call to the critical path. In financial APIs with a p99 SLO < 300ms, this is unacceptable without caching. But caching introduces a staleness window — exactly the problem we are trying to solve.
IdP Availability vs. Service Continuity. If Google's metadata endpoint is unavailable, what do we do? Fail-open (accept the request) or fail-closed (deny)? In financial systems, fail-closed is the correct regulatory default, but this means a Google degradation can take down your service.
Policy Granularity vs. Operational Complexity. The more granular the adaptive access policy — 'this endpoint requires risk score < 20 AND recent device verification' — the harder it is to debug, audit, and explain to compliance teams.
Claims Coverage vs. Vendor Dependency. The metadata Google exposes is proprietary. If tomorrow you need to support Microsoft Entra or Okta as an alternative IdP, your policy logic needs to abstract over different claims schemas. This is a lock-in risk that few architecture teams model explicitly.
These four tensions define the design space. The options below are different responses to the same set of forces.
Options Considered for Session Metadata Evaluation
Option A: Short-Lived Token (Status Quo)
Advantages:
- No external endpoint dependency on the critical path
- Simple to implement and audit
- Predictable behavior on IdP failure
Drawbacks:
- Risk window between IdP revocation and token expiration
- Pressure on Cognito rate limits at scale
- Friction in long-running flows
Assessment: Adequate for low-risk systems; insufficient for critical financial operations.
Option B: Inline Evaluation in Lambda Authorizer
Advantages:
- Freshest possible signal per request
- Centralized and auditable policy in the authorizer
- Natural integration with API Gateway and ALB
Drawbacks:
- Added latency: Google call p99 ~80-150ms without cache
- Service availability coupled to Google
- Lambda cold start amplifies latency at peaks
Assessment: Viable with aggressive caching and circuit breaker; requires revised latency SLO.
Option C: Async Evaluation with Reactive Step-Up
Advantages:
- Zero latency added to the critical path
- Graceful degradation when Google is unavailable
- Allows differentiated policies by operation sensitivity level
Drawbacks:
- Risk window between async detection and session revocation
- Higher complexity: EventBridge + Step Functions + DynamoDB
- Requires step-up logic on the client (redirect to re-auth)
Assessment: Best balance for high-scale financial systems; this is the option I would choose.
Option D: Abstraction via IDSA/SPIFFE with Normalized Claims
Advantages:
- Eliminates lock-in on Google's proprietary schema
- Portability across IdPs (Entra, Okta, Google)
- Aligns with workload identity standards (SPIFFE/SPIRE)
Drawbacks:
- Significant implementation overhead; requires claims translation layer
- High operational maturity required from the team
- Out of scope for most projects in 2025
Assessment: Ideal as long-term vision; premature as first implementation.
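To make the claims translation layer from Option D concrete, here is a minimal sketch in Python. It assumes the Google field names used in this article (account_risk_level, device_verified, credential_age); the second provider, its key "example-idp", and its field names are purely hypothetical placeholders, not the schema of any real IdP.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping, Optional


@dataclass(frozen=True)
class NormalizedSignals:
    """IdP-neutral view of the session signals the policy needs."""
    risk_level: str                       # "LOW", "MEDIUM", "HIGH" or "UNKNOWN"
    device_verified: bool
    credential_age_seconds: Optional[int]


def from_google(claims: Mapping[str, Any]) -> NormalizedSignals:
    # Field names follow the ones quoted in this article.
    return NormalizedSignals(
        risk_level=str(claims.get("account_risk_level", "UNKNOWN")).upper(),
        device_verified=bool(claims.get("device_verified", False)),
        credential_age_seconds=claims.get("credential_age"),
    )


def from_example_idp(claims: Mapping[str, Any]) -> NormalizedSignals:
    # Hypothetical schema, used only to show a second translator.
    return NormalizedSignals(
        risk_level=str(claims.get("riskState", "UNKNOWN")).upper(),
        device_verified=bool(claims.get("deviceTrusted", False)),
        credential_age_seconds=claims.get("passwordAgeSec"),
    )


TRANSLATORS: Dict[str, Callable[[Mapping[str, Any]], NormalizedSignals]] = {
    "google": from_google,
    "example-idp": from_example_idp,
}


def normalize(idp: str, claims: Mapping[str, Any]) -> NormalizedSignals:
    """Translate provider-specific claims; unknown providers are rejected."""
    try:
        translator = TRANSLATORS[idp]
    except KeyError:
        raise ValueError(f"no claims translator registered for {idp!r}")
    return translator(claims)
```

Policy code would then depend only on NormalizedSignals, so adding a provider means adding one translator rather than touching authorization logic.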
The Decision: Async Evaluation with Selective Synchronous Step-Up
After modeling the four options against the identified forces, the decision I would make is a variant of Option C with a surgical synchronous element: asynchronous session metadata evaluation for most requests, with mandatory synchronous step-up for high-financial-impact operations.
The logic is as follows: not all requests have the same risk profile. A balance inquiry or statement read has tolerance for a 5-minute staleness window on the risk signal. A transfer above R$ 10,000, a personal data change, or a credit approval does not. Forcing the same evaluation model on both is over-engineering in the first case and under-engineering in the second.
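The tiering described above can be sketched as a small decision function. This is an illustration under assumptions, not the production policy: the operation names, the last_step_up attribute, the 120-second STEP_UP_VALIDITY window, and the choice to answer a stale signal with step-up are all hypothetical; only the 5-minute staleness tolerance comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

LOW_IMPACT_MAX_SIGNAL_AGE = 300   # seconds; the 5-minute tolerance from the text
STEP_UP_VALIDITY = 120            # seconds; hypothetical validity of a completed step-up
HIGH_IMPACT_OPERATIONS = {        # hypothetical operation identifiers
    "transfer_above_limit",
    "personal_data_change",
    "credit_approval",
}


@dataclass
class SessionState:
    step_up_required: bool
    last_metadata_refresh: int           # epoch seconds
    last_step_up: Optional[int] = None   # epoch seconds (hypothetical attribute)


def decide(operation: str, state: Optional[SessionState], now: int) -> str:
    """Return 'allow', 'deny' or 'step-up' for one request."""
    if state is None:
        return "deny"                    # no cached state: fail closed (assumption)
    if state.step_up_required:
        return "step-up"                 # async pipeline flagged the session
    if operation in HIGH_IMPACT_OPERATIONS:
        # High-impact operations always need a recent synchronous step-up.
        recent = (state.last_step_up is not None
                  and now - state.last_step_up <= STEP_UP_VALIDITY)
        return "allow" if recent else "step-up"
    if now - state.last_metadata_refresh > LOW_IMPACT_MAX_SIGNAL_AGE:
        return "step-up"                 # signal too old even for low impact (assumption)
    return "allow"
```

The point of the sketch is that the sensitivity tier is an explicit input to the decision, so the staleness tolerance can differ per operation without a second authorizer.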
The concrete implementation uses three AWS components in coordination:
- DynamoDB as session state cache: partition key userId#sessionId, TTL of 5 minutes, with attributes riskScore, verificationState, lastMetadataRefresh, and stepUpRequired. Provisioned capacity with auto-scaling: 100 RCU/WCU base, scaling to 2,000 at peak. Encrypted at rest with a KMS CMK.
- Lambda Authorizer with synchronous DynamoDB read: the authorizer reads the cached session state (p99 latency < 5ms with optional DAX) and makes the authorization decision locally. There is no Google call on the critical path. If stepUpRequired = true, it returns 401 with WWW-Authenticate: step-up and the client initiates the re-authentication flow.
- EventBridge Scheduler + async refresh Lambda: every 4 minutes, a job queries Google's metadata endpoint for active sessions and updates DynamoDB. If the risk signal changes, the stepUpRequired attribute is set atomically with a DynamoDB condition (ConditionExpression: attribute_not_exists(stepUpRequired) OR stepUpRequired = :false), avoiding race conditions.
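As a sketch of the session item and the conditional write described in the list above, the helpers below build the request parameters in DynamoDB's low-level attribute format. The attribute names pk and expiresAt are assumptions (the text names the key format and the TTL, not the attribute names), and the table name session-state is taken from the KMS condition mentioned later in the article.

```python
import time
from typing import Any, Dict, Optional

TABLE_NAME = "session-state"
SESSION_TTL_SECONDS = 300  # 5-minute TTL from the text


def session_key(user_id: str, session_id: str) -> Dict[str, Any]:
    # "pk" is an assumed name for the partition key attribute.
    return {"pk": {"S": f"{user_id}#{session_id}"}}


def build_session_item(user_id: str, session_id: str, risk_score: int,
                       verification_state: str,
                       now: Optional[int] = None) -> Dict[str, Any]:
    """Item written by the refresh job after a successful metadata fetch."""
    now = int(time.time()) if now is None else now
    return {
        **session_key(user_id, session_id),
        "riskScore": {"N": str(risk_score)},
        "verificationState": {"S": verification_state},
        "lastMetadataRefresh": {"N": str(now)},
        "stepUpRequired": {"BOOL": False},
        # DynamoDB TTL expects a Number holding epoch seconds.
        "expiresAt": {"N": str(now + SESSION_TTL_SECONDS)},
    }


def build_step_up_update(user_id: str, session_id: str) -> Dict[str, Any]:
    """Parameters for an UpdateItem call that flags a session exactly once."""
    return {
        "TableName": TABLE_NAME,
        "Key": session_key(user_id, session_id),
        "UpdateExpression": "SET stepUpRequired = :true",
        "ConditionExpression":
            "attribute_not_exists(stepUpRequired) OR stepUpRequired = :false",
        "ExpressionAttributeValues": {
            ":true": {"BOOL": True},
            ":false": {"BOOL": False},
        },
    }
```

With boto3, the second dictionary could be passed as client.update_item(**params). When another writer has already set the flag, the call raises ConditionalCheckFailedException, which the refresh job can treat as "already flagged" rather than as an error.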
Figure: OIDC Session Metadata Evaluation Flow with Adaptive Step-Up. Critical path (synchronous) on the left; async refresh pipeline on the right. The Lambda Authorizer never calls Google directly; it reads only from the DynamoDB cache. Components shown:
- API Gateway · REST / HTTP API
- Lambda Authorizer · step-up policy eval
- DynamoDB · userId#sessionId / TTL 5m
- KMS CMK · encrypt at rest
- EventBridge Scheduler · every 4 min
- Refresh Lambda · fetch + diff + write
- Step Functions · step-up orchestration
- CloudWatch · stepUpRate / riskSignalAge
- Sign in with Google · OIDC + session metadata
- Financial API · Lambda / EKS
Implementation Details That Matter in Production
IAM and Least Privilege in the Lambda Authorizer. The authorizer needs only dynamodb:GetItem on the session table, with condition aws:ResourceTag/Classification = SessionCache. No write permissions. The refresh Lambda role has dynamodb:UpdateItem with a dynamodb:LeadingKeys condition restricted to the userId# prefix, which prevents a code bug from writing to arbitrary partitions. Both roles use kms:Decrypt with the condition kms:EncryptionContext:aws:dynamodb:tableName = session-state, which is the encryption context key DynamoDB sets on its KMS calls.
Circuit Breaker in the Refresh Lambda. The call to Google's metadata endpoint must have a 2-second timeout and at most 2 retries with exponential backoff. If the error rate exceeds 50% in a 60-second window, the circuit breaker opens and the refresh Lambda stops trying — but does not automatically set stepUpRequired = true. The correct policy here is to preserve the last known state and emit a CloudWatch alarm (MetricName: GoogleMetadataUnavailable, threshold > 3 minutes). The fail-open vs. fail-closed decision during Google unavailability must be explicit in the policy and documented in the ADR — not implicit in the code.
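A minimal sketch of the breaker described above, using a sliding 60-second window and the 50% threshold from the text. The min_calls guard and the backoff base delay are assumptions added for illustration, and timestamps are passed in explicitly so the logic stays testable. In a real Lambda the window would need to live in shared storage, since each execution environment has its own memory.

```python
from collections import deque
from typing import Deque, List, Tuple


class RefreshCircuitBreaker:
    """Opens when the error rate in a sliding window exceeds a threshold."""

    def __init__(self, window_seconds: float = 60.0,
                 error_threshold: float = 0.5, min_calls: int = 5) -> None:
        self.window_seconds = window_seconds
        self.error_threshold = error_threshold
        self.min_calls = min_calls          # hypothetical guard against tiny samples
        self._calls: Deque[Tuple[float, bool]] = deque()

    def _prune(self, now: float) -> None:
        # Drop results that fell out of the window.
        cutoff = now - self.window_seconds
        while self._calls and self._calls[0][0] <= cutoff:
            self._calls.popleft()

    def record(self, now: float, ok: bool) -> None:
        self._prune(now)
        self._calls.append((now, ok))

    def is_open(self, now: float) -> bool:
        """True means: skip the Google call and keep the last known state."""
        self._prune(now)
        if len(self._calls) < self.min_calls:
            return False
        failures = sum(1 for _, ok in self._calls if not ok)
        return failures / len(self._calls) > self.error_threshold


def backoff_delays(base_seconds: float = 0.2, retries: int = 2) -> List[float]:
    """Exponential backoff delays for the 'at most 2 retries' rule."""
    return [base_seconds * (2 ** attempt) for attempt in range(retries)]
```

Because old results age out of the window, the breaker closes again on its own after a quiet minute, which gives the refresh job a natural retry point. Note that is_open only gates the outbound call: it never touches stepUpRequired, matching the "preserve the last known state" policy.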
Risk Signal Observability. Google's metadata includes fields like credential_age, account_risk_level, and device_verified. These values should be emitted as custom metric dimensions in CloudWatch — not just logged. This allows creating alarms like 'percentage of active sessions with account_risk_level = HIGH > 5%', which is a signal of an ongoing credential stuffing attack. The SLO I would define: riskSignalAge (time since last successful refresh) < 6 minutes for 99% of active sessions.
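The two checks in this paragraph reduce to simple ratios. The sketch below assumes the inputs (last refresh timestamps and risk levels per active session) have already been collected; how they are gathered and published to CloudWatch is out of scope here, and the function names are illustrative.

```python
from typing import Iterable, List


def signal_age_slo_met(last_refresh_times: Iterable[int], now: int,
                       max_age_seconds: int = 360,
                       target: float = 0.99) -> bool:
    """True when at least `target` of sessions have riskSignalAge < 6 minutes."""
    times: List[int] = list(last_refresh_times)
    if not times:
        return True  # no active sessions, nothing to violate
    fresh = sum(1 for t in times if now - t < max_age_seconds)
    return fresh / len(times) >= target


def high_risk_share(risk_levels: Iterable[str]) -> float:
    """Fraction of active sessions whose account_risk_level is HIGH."""
    levels: List[str] = list(risk_levels)
    if not levels:
        return 0.0
    return sum(1 for level in levels if level == "HIGH") / len(levels)
```

An alarm such as the 5% example would then compare high_risk_share(...) against 0.05 on each evaluation period.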
Cognito User Pools and the Opaque Token Problem. Cognito does not automatically forward custom claims from Google to the access token. You need a Pre Token Generation Lambda trigger that reads the session state from DynamoDB and injects the relevant claims. Note: this trigger adds latency to the token refresh flow. Cognito waits at most 5 seconds for a trigger response before it times out and retries, so treat 100ms as the latency budget for the trigger, not as Cognito's limit.
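A hedged sketch of such a trigger follows. It assumes the trigger is configured for the V2 event version, which is the one that allows access token customization through claimsAndScopeOverrideDetails. The lookup function is injected so the example runs without AWS access; the claim names and the lookup by user name are assumptions, since the article keys session state by userId#sessionId.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical lookup: returns the cached session state for a user, or None.
SessionLookup = Callable[[str], Optional[Dict[str, Any]]]


def make_handler(lookup: SessionLookup) -> Callable[[Dict[str, Any], Any], Dict[str, Any]]:
    """Build a Pre Token Generation handler around a session-state lookup."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        state = lookup(event["userName"]) or {}
        claims = {
            # Claim names are illustrative; values are emitted as strings.
            "risk_score": str(state.get("riskScore", "unknown")),
            # Missing state is treated as "step-up required" (fail closed).
            "step_up_required": str(state.get("stepUpRequired", True)).lower(),
        }
        event["response"] = {
            "claimsAndScopeOverrideDetails": {
                "accessTokenGeneration": {"claimsToAddOrOverride": claims}
            }
        }
        return event

    return handler
```

In a deployed function the lookup would wrap a DynamoDB GetItem call, and the module-level handler would be created once, outside the invocation path, to keep the trigger inside its latency budget.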
Consequences and Risks of the Decision
Residual risk window of up to 4 minutes. The async architecture explicitly accepts that a user with a compromised session can make up to ~4 minutes of low-impact requests before being blocked. For high-impact operations, the synchronous step-up eliminates this window — but the compliance team needs to formally accept and document this trade-off. Do not try to hide this in the design.
Dependency on Google availability for the refresh pipeline. If Google has a 30-minute degradation, riskSignalAge will exceed the SLO. You need an explicit runbook: what the system does automatically (preserves last known state), what the operator does manually (can force global step-up via feature flag in DynamoDB), and when to escalate to the security team.
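DynamoDB cost at scale. With 100,000 active sessions and refresh every 4 minutes, the async pipeline generates ~25,000 writes/minute, or ~1.08 billion writes in a 30-day month. In on-demand mode, assuming $1.25 per million write request units and items under 1 KB (one write request unit each), that is ~$0.00000125 per write = ~$0.03/minute = ~$1,350/month just for the refresh pipeline; check current pricing for your Region. Not prohibitive, but it needs to be in the cost model, especially if the session count scales 10x.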
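The cost arithmetic is easy to get wrong by a factor of ten, so it helps to keep it as a small function in the cost model. The price is a parameter because on-demand pricing varies by Region and over time; the function name and the 30-day month are assumptions of this sketch, which also assumes each refresh is a single write of an item under 1 KB.

```python
def monthly_refresh_write_cost(active_sessions: int,
                               refresh_interval_minutes: float,
                               price_per_million_writes: float,
                               days_per_month: int = 30) -> float:
    """Estimated monthly on-demand write cost of the refresh pipeline, in USD."""
    # One write per session per refresh interval.
    writes_per_minute = active_sessions / refresh_interval_minutes
    writes_per_month = writes_per_minute * 60 * 24 * days_per_month
    return writes_per_month / 1_000_000 * price_per_million_writes


# Example with an assumed price of $1.25 per million write request units.
baseline = monthly_refresh_write_cost(100_000, 4, 1.25)
ten_x = monthly_refresh_write_cost(1_000_000, 4, 1.25)
```

Cost scales linearly with session count and inversely with the refresh interval, so the 10x scenario and a tighter refresh cadence can both be evaluated by changing one argument.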
Step-up amplification risk. If Google returns account_risk_level = HIGH for a large segment of users simultaneously (mass false positive), you may trigger step-up for thousands of users at the same time, generating an avalanche of re-authentications that overwhelms Cognito. Implement rate limiting on the number of simultaneous step-ups and an emergency override mechanism.
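One way to bound the avalanche is a token bucket in front of the code that sets stepUpRequired. The sketch below is a single-process illustration with hypothetical names; a fleet of refresh Lambdas would need the bucket state in a shared store. The suppress_all flag stands in for the emergency override.

```python
from typing import Optional


class StepUpRateLimiter:
    """Token bucket limiting how many step-ups are triggered per second."""

    def __init__(self, rate_per_second: float, burst: int) -> None:
        self.rate = rate_per_second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated_at: Optional[float] = None
        self.suppress_all = False   # emergency override: trigger no step-ups

    def try_acquire(self, now: float) -> bool:
        """Return True if a step-up may be triggered at time `now`."""
        if self.suppress_all:
            return False
        if self.updated_at is not None:
            # Refill in proportion to elapsed time, capped at the burst size.
            elapsed = max(0.0, now - self.updated_at)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A session that is refused a token keeps its previous state until a later refresh cycle, so the limit trades a longer exposure for some sessions against protecting Cognito. That trade belongs in the documented risk acceptance, not only in the code.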
Governance, Audit, and the Path to Mature Zero Trust
An identity architecture decision does not end at the technical implementation. In regulated financial systems, the traceability of each authorization decision is as important as the decision itself. Each Lambda Authorizer invocation should emit a structured event to CloudWatch Logs with: userId, sessionId, decision (allow/deny/step-up), riskScore at decision time, signalAge in seconds, and the API Gateway requestId. This log is the audit evidence to demonstrate to the regulator that the system was operating with continuous verification.
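A minimal sketch of that structured event, with exactly the fields listed above. The function name is illustrative; the only assumption about the platform is that a line written to stdout by a Lambda function ends up in CloudWatch Logs.

```python
import json
from typing import Optional

ALLOWED_DECISIONS = {"allow", "deny", "step-up"}


def build_audit_event(user_id: str, session_id: str, decision: str,
                      risk_score: Optional[int], signal_age_seconds: int,
                      request_id: str) -> str:
    """Serialize one authorization decision as a single JSON log line."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    event = {
        "userId": user_id,
        "sessionId": session_id,
        "decision": decision,
        "riskScore": risk_score,          # value at decision time, may be None
        "signalAge": signal_age_seconds,  # seconds since last successful refresh
        "requestId": request_id,          # API Gateway requestId
    }
    # Compact, key-sorted output keeps lines stable and easy to query.
    return json.dumps(event, separators=(",", ":"), sort_keys=True)


if __name__ == "__main__":
    print(build_audit_event("u1", "s1", "step-up", 42, 180, "req-123"))
```

Rejecting unknown decision values at write time keeps the audit trail queryable with a fixed vocabulary.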
In the AWS Well-Architected context, this architecture directly touches the Security pillar, specifically the design principles 'apply security at all layers' and 'maintain traceability'. The use of KMS CMK with annual rotation, granular IAM conditions, and CloudTrail enabled for all DynamoDB and KMS operations closes the audit loop.
The path to mature Zero Trust from here has two natural steps. The first is adding device context signals — not just what Google knows about the account, but what you know about the device: fingerprint, geolocation, behavioral pattern. This can be implemented via Amazon Fraud Detector or a custom SageMaker model fed by the audit logs. The second is moving the authorization policy to an external engine — Open Policy Agent (OPA) running on EKS, or AWS Verified Permissions with Cedar — decoupling policy logic from the authorizer code and allowing the compliance team to edit policies without code deployments.
This evolution should be on the roadmap, but not in the MVP. The classic trap is trying to build the complete Zero Trust system in the first iteration and delivering nothing.
Well-Architected Pillars Assessment
Security
Continuous session verification via OIDC metadata, KMS CMK, IAM least-privilege with granular conditions, CloudTrail for complete audit trail. Synchronous step-up for high-impact operations eliminates the risk window on critical operations.
Reliability
Circuit breaker in the refresh Lambda protects against Google unavailability. Session state cached in DynamoDB ensures the authorizer functions even without IdP connectivity. Explicit runbook for signal degradation.
Performance efficiency
Lambda Authorizer reads only from DynamoDB (p99 < 5ms with DAX). No Google call on the critical path. 5-minute cache TTL balances freshness and latency.
Anti-Patterns to Avoid
- Calling Google's metadata endpoint directly in the Lambda Authorizer without caching — you will violate your latency SLO and couple availability to Google.
- Treating account_risk_level as binary (high/low) without considering the historical distribution: a fixed threshold will generate false positives on legitimate Google security events (e.g., voluntary password change).
- Storing the full OIDC token in DynamoDB as session state: store only the claims relevant to the policy, never the raw token.
- Not explicitly documenting the residual risk window in the ADR — the compliance team will discover this in an audit and it will be worse.
- Using the same Lambda Authorizer for all endpoints without differentiating sensitivity — apply synchronous step-up only where the risk justifies the additional latency.
In practice, the hardest decision here is not technical — it is political: convincing the compliance team to formally accept a 4-minute risk window in exchange for a system that is auditable, observable, and does not break when Google has a degradation. I have seen teams build 'perfect' synchronous evaluation systems that went down for 45 minutes during a Google incident and nobody had documented the fallback. The circuit breaker and the explicit runbook are not implementation details — they are the core of the architectural decision. And the ADR needs to live in the repository, not in a wiki that nobody reads after go-live.
Verdict: Adopt OIDC Session Metadata with Async Evaluation and Selective Step-Up
Session metadata support in Sign in with Google is a real advancement for systems that need continuous identity verification. The correct decision for financial environments is not synchronous evaluation on every request — it is asynchronous evaluation with DynamoDB cache, explicit circuit breaker, and surgical synchronous step-up for high-impact operations. This architecture delivers practical Zero Trust: continuous verification without sacrificing latency, with graceful degradation and complete auditability. The 4-minute residual risk window is an acceptable and documentable trade-off — not a design flaw. Implement it, document it in the ADR, and evolve to AWS Verified Permissions with Cedar when team maturity allows.