Who is Fernando F. Azevedo?

Fernando F. Azevedo is a Senior Solutions Architect at Banco Itaú with 16+ years of experience across AWS, event-driven architecture, DevSecOps, Data Mesh, AI and financial systems.

Security & Resilience · Decision Record

OIDC Session Metadata and Zero Trust: An Architecture Decision Record

Jun 12, 2026 · 9 min · expert · AI-assisted

fernando.moretes.com

Session metadata support in Sign in with Google opens a genuine window for continuous, signal-driven adaptive access — not just at login time. In this ADR, I analyze the architectural forces, options considered, and the decision I would make in a high-criticality financial system integrated with AWS.

For years, we have treated the authenticated session as a binary state: the user is logged in or not. Session metadata support in Sign in with Google changes that equation by exposing continuous identity signals — account risk, verification state, recent credential changes — directly in the OIDC flow. For financial systems operating under Zero Trust, this is not a convenience feature; it is a threat model shift. The real architectural question is not 'how do I wire this into Cognito', but rather 'where in my authorization chain should these signals be evaluated, at what latency, and what happens when the identity provider does not deliver them?'

Context and Forces: Why This Matters Now

In regulated financial environments (PCI DSS, SOC 2 Type II, Brazil's CMN Resolution 4.658, published by BACEN and since superseded by CMN Resolution 4.893), federated authentication with external providers like Google has always been a double-edged sword. You gain UX and reduce credential management surface, but you lose visibility into the user's session lifecycle on the IdP side. A valid JWT issued at 09:00 is still accepted at 17:00 even if the user changed their password at 14:00, Google detected suspicious account access at 15:00, or the device was flagged as compromised at 16:00.

The traditional mitigation model was simple and blunt: short-lived tokens (15-30 minutes) with aggressive refresh. This works, but at a cost: re-authentication latency, friction in long-running flows (reports, exports, user-initiated batch operations), and pressure on Cognito request rate quotas. The default quota for token operations is on the order of 120 requests per second (check Service Quotas for the current value and the scope it applies to), a limit medium-scale systems hit easily at peak hours.

OIDC session metadata shifts the defense vector. Instead of shortening token lifetime, you enrich the authorization decision with real-time IdP signals. The token can be longer-lived, but each sensitive request passes through an evaluation that considers the current state of the session at Google — not just the state at issuance time. This is, conceptually, what NIST SP 800-207 calls 'continuous verification' within a Zero Trust model.

The Forces in Tension

Before reaching the options, it is necessary to name the forces that make this decision non-trivial:

Latency vs. Signal Freshness. Querying Google session metadata on every request adds an external network call to the critical path. In financial APIs with a p99 SLO < 300ms, this is unacceptable without caching. But caching introduces a staleness window — exactly the problem we are trying to solve.

IdP Availability vs. Service Continuity. If Google's metadata endpoint is unavailable, what do we do? Fail-open (accept the request) or fail-closed (deny)? In financial systems, fail-closed is the correct regulatory default, but this means a Google degradation can take down your service.

Policy Granularity vs. Operational Complexity. The more granular the adaptive access policy — 'this endpoint requires risk score < 20 AND recent device verification' — the harder it is to debug, audit, and explain to compliance teams.

Claims Coverage vs. Vendor Dependency. The metadata Google exposes is proprietary. If tomorrow you need to support Microsoft Entra or Okta as an alternative IdP, your policy logic needs to abstract over different claims schemas. This is a lock-in risk that few architecture teams model explicitly.

These four tensions define the design space. The options below are different responses to the same set of forces.

Options Considered for Session Metadata Evaluation

Option A: Short-Lived Token (Status Quo)

Pros

No external endpoint dependency on the critical path
Simple to implement and audit
Predictable behavior on IdP failure

Cons

Risk window between IdP revocation and token expiration
Pressure on Cognito rate limits at scale
Friction in long-running flows

Adequate for low-risk systems; insufficient for critical financial operations

Option B: Inline Evaluation in Lambda Authorizer

Pros

Freshest possible signal per request
Centralized and auditable policy in the authorizer
Natural integration with API Gateway and ALB

Cons

Added latency: Google call p99 ~80-150ms without cache
Service availability coupled to Google
Lambda cold start amplifies latency at peaks

Viable with aggressive caching and circuit breaker; requires revised latency SLO

Option C: Async Evaluation with Reactive Step-Up

Pros

Zero latency added to the critical path
Graceful degradation when Google is unavailable
Allows differentiated policies by operation sensitivity level

Cons

Risk window between async detection and session revocation
Higher complexity: EventBridge + Step Functions + DynamoDB
Requires step-up logic on the client (redirect to re-auth)

Best balance for high-scale financial systems; this is the option I would choose

Option D: Abstraction via IDSA/SPIFFE with Normalized Claims

Pros

Eliminates lock-in on Google's proprietary schema
Portability across IdPs (Entra, Okta, Google)
Aligns with workload identity standards (SPIFFE/SPIRE)

Cons

Significant implementation overhead; requires claims translation layer
High operational maturity required from the team
Out of scope for most projects in 2025

Ideal as long-term vision; premature as first implementation

The Decision: Async Evaluation with Selective Synchronous Step-Up

After modeling the four options against the identified forces, the decision I would make is a variant of Option C with a surgical synchronous element: asynchronous session metadata evaluation for most requests, with mandatory synchronous step-up for high-financial-impact operations.

The logic is as follows: not all requests have the same risk profile. A balance inquiry or statement read has tolerance for a 5-minute staleness window on the risk signal. A transfer above R$ 10,000, a personal data change, or a credit approval does not. Forcing the same evaluation model on both is over-engineering in the first case and under-engineering in the second.
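The split between the two risk profiles can be sketched as a small classification function. This is a minimal illustration, not the production policy: the operation names, the `Operation` type, and the `evaluation_mode` function are hypothetical, and only the R$ 10,000 threshold and the operation categories come from the paragraph above.

```python
from dataclasses import dataclass
from enum import Enum


class Evaluation(Enum):
    CACHED = "cached"          # rely on the cached risk signal (staleness tolerated)
    SYNC_STEP_UP = "step_up"   # require synchronous step-up before proceeding


@dataclass(frozen=True)
class Operation:
    name: str                  # hypothetical operation identifier
    amount_brl: float = 0.0


# Values taken from the examples in the text; the names are illustrative.
HIGH_VALUE_TRANSFER_BRL = 10_000
ALWAYS_STEP_UP = {"personal_data_change", "credit_approval"}


def evaluation_mode(op: Operation) -> Evaluation:
    """Pick the evaluation model from the operation's risk profile."""
    if op.name in ALWAYS_STEP_UP:
        return Evaluation.SYNC_STEP_UP
    if op.name == "transfer" and op.amount_brl > HIGH_VALUE_TRANSFER_BRL:
        return Evaluation.SYNC_STEP_UP
    return Evaluation.CACHED
```

Keeping the mapping in one function (or one table) makes it straightforward to show compliance which operations fall into which tier.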

The concrete implementation uses three AWS components in coordination:

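DynamoDB as session state cache: partition key userId#sessionId, TTL of 5 minutes, with attributes riskScore, verificationState, lastMetadataRefresh, and stepUpRequired. Because DynamoDB TTL deletes expired items in the background rather than at the exact expiry time, the authorizer must also compare lastMetadataRefresh against the 5-minute window instead of relying on the item being gone. Provisioned capacity with auto-scaling: 100 RCU/WCU base, scaling to 2,000 at peak. Encryption at rest with a customer managed KMS key (CMK).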

Lambda Authorizer with synchronous DynamoDB read: the authorizer reads the cached session state (p99 latency < 5ms with optional DAX) and makes the authorization decision locally. There is no Google call on the critical path. If stepUpRequired = true, the request is rejected with 401 and a WWW-Authenticate challenge that signals step-up (for example, the Bearer error="insufficient_user_authentication" challenge defined in RFC 9470), and the client initiates the re-authentication flow.

EventBridge Scheduler + async refresh Lambda: every 4 minutes, a job queries Google's metadata endpoint for active sessions and updates DynamoDB. If the risk signal changes, the stepUpRequired attribute is set atomically with a DynamoDB condition (ConditionExpression: attribute_not_exists(stepUpRequired) OR stepUpRequired = :false), avoiding race conditions.
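The conditional write in the refresh Lambda could look roughly like the sketch below. It only builds the arguments for the DynamoDB UpdateItem API in the low-level attribute format; the key attribute name `pk`, the function name, and the ISO timestamp format are assumptions, while the attribute names and the condition expression come from the text.

```python
from datetime import datetime, timezone


def build_step_up_update(table: str, user_id: str, session_id: str,
                         risk_score: int, now: datetime) -> dict:
    """Return keyword arguments for a conditional DynamoDB UpdateItem call."""
    return {
        "TableName": table,
        # "pk" is a hypothetical name for the partition key attribute.
        "Key": {"pk": {"S": f"{user_id}#{session_id}"}},
        "UpdateExpression": (
            "SET stepUpRequired = :true, riskScore = :score, "
            "lastMetadataRefresh = :now"
        ),
        # Only the first writer flips the flag; later writers fail the condition.
        "ConditionExpression": (
            "attribute_not_exists(stepUpRequired) OR stepUpRequired = :false"
        ),
        "ExpressionAttributeValues": {
            ":true": {"BOOL": True},
            ":false": {"BOOL": False},
            ":score": {"N": str(risk_score)},
            ":now": {"S": now.astimezone(timezone.utc).isoformat()},
        },
    }
```

In a Lambda, these arguments would be passed to `boto3.client("dynamodb").update_item(**kwargs)`. A `ConditionalCheckFailedException` then means another invocation already set the flag, which the caller can treat as a no-op rather than an error.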

Figure: OIDC Session Metadata Evaluation Flow with Adaptive Step-Up

Critical path (synchronous) on the left; async refresh pipeline on the right. The Lambda Authorizer never calls Google directly; it reads only from the DynamoDB cache.

Components shown in the figure:

AWS, API Layer: API Gateway (REST / HTTP API); Lambda Authorizer (step-up policy evaluation)
AWS, Session State: DynamoDB (userId#sessionId, TTL 5 min); KMS CMK (encryption at rest)
AWS, Async Refresh Pipeline: EventBridge Scheduler (every 4 min); Refresh Lambda (fetch + diff + write); Step Functions (step-up orchestration)
AWS, Observability: CloudWatch (stepUpRate, riskSignalAge)
Google, Identity
AWS, Backend: Financial API (Lambda / EKS)
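The critical-path decision in the Lambda Authorizer can be sketched as a pure function over the cached session item. This is an illustration under stated assumptions: `lastMetadataRefresh` is taken to be stored as epoch seconds, and a missing item is denied while a stale item triggers step-up. Both choices are policy decisions the ADR would need to state explicitly, not something the source prescribes.

```python
from typing import Optional

MAX_SIGNAL_AGE_SECONDS = 5 * 60  # the 5-minute cache window from the text


def decide(session: Optional[dict], now: float) -> str:
    """Return 'allow', 'deny', or 'step-up' using cached session state only."""
    if session is None:
        # No cached state at all: assumed fail-closed behavior.
        return "deny"
    age = now - session["lastMetadataRefresh"]
    if age > MAX_SIGNAL_AGE_SECONDS:
        # Expired items may still be readable, so staleness is checked here.
        return "step-up"
    if session.get("stepUpRequired", False):
        return "step-up"
    return "allow"
```

Because the function takes the item and the current time as arguments, it can be unit-tested without DynamoDB and reused unchanged if the cache is later fronted by DAX.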

Implementation Details That Matter in Production

IAM and Least Privilege in the Lambda Authorizer. The authorizer needs only dynamodb:GetItem on the session table, with condition aws:ResourceTag/Classification = SessionCache. No write permissions. The refresh Lambda role has dynamodb:UpdateItem with dynamodb:LeadingKeys condition restricted to the userId# prefix, which prevents a code bug from writing to arbitrary partitions. Both roles use kms:Decrypt with a condition on the encryption context that DynamoDB sets for the table: kms:EncryptionContext:aws:dynamodb:tableName = session-state.
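A possible shape for the authorizer role's policy is shown below as a Python dictionary that serializes to the IAM JSON document. The account ID, Region, and key ID are placeholders, and the sketch assumes tag-based authorization is enabled for DynamoDB in the account. The KMS condition uses the encryption context key that DynamoDB itself sets (aws:dynamodb:tableName) plus kms:ViaService, which is an addition not mentioned in the text.

```python
import json

# Placeholder identifiers; replace with real values for your account.
TABLE_ARN = "arn:aws:dynamodb:sa-east-1:111122223333:table/session-state"
KEY_ARN = "arn:aws:kms:sa-east-1:111122223333:key/EXAMPLE-KEY-ID"

authorizer_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadSessionCacheOnly",
            "Effect": "Allow",
            "Action": "dynamodb:GetItem",
            "Resource": TABLE_ARN,
            "Condition": {
                "StringEquals": {"aws:ResourceTag/Classification": "SessionCache"}
            },
        },
        {
            "Sid": "DecryptSessionTableOnly",
            "Effect": "Allow",
            "Action": "kms:Decrypt",
            "Resource": KEY_ARN,
            "Condition": {
                "StringEquals": {
                    # Restrict key use to calls made through DynamoDB.
                    "kms:ViaService": "dynamodb.sa-east-1.amazonaws.com",
                    "kms:EncryptionContext:aws:dynamodb:tableName": "session-state",
                }
            },
        },
    ],
}

print(json.dumps(authorizer_policy, indent=2))
```

The refresh Lambda role would follow the same pattern with dynamodb:UpdateItem and the dynamodb:LeadingKeys condition described above.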

Circuit Breaker in the Refresh Lambda. The call to Google's metadata endpoint must have a 2-second timeout and at most 2 retries with exponential backoff. If the error rate exceeds 50% in a 60-second window, the circuit breaker opens and the refresh Lambda stops trying — but does not automatically set stepUpRequired = true. The correct policy here is to preserve the last known state and emit a CloudWatch alarm (MetricName: GoogleMetadataUnavailable, threshold > 3 minutes). The fail-open vs. fail-closed decision during Google unavailability must be explicit in the policy and documented in the ADR — not implicit in the code.
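The breaker behavior described above can be sketched with a sliding window over recent call outcomes. The class and function names, the `min_calls` guard, and the injected `clock` are assumptions for illustration; the 60-second window, the 50% threshold, and the "preserve last known state" rule come from the text. Timeouts, retries, and the CloudWatch alarm are omitted for brevity.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class SlidingWindowBreaker:
    """Open while the error rate over the last `window` seconds exceeds `threshold`."""

    def __init__(self, window: float = 60.0, threshold: float = 0.5,
                 min_calls: int = 4,
                 clock: Callable[[], float] = time.monotonic):
        self.window = window
        self.threshold = threshold
        self.min_calls = min_calls  # hypothetical guard against tiny samples
        self.clock = clock
        self.calls: Deque[Tuple[float, bool]] = deque()  # (timestamp, succeeded)

    def record(self, succeeded: bool) -> None:
        self.calls.append((self.clock(), succeeded))

    def is_open(self) -> bool:
        cutoff = self.clock() - self.window
        while self.calls and self.calls[0][0] < cutoff:
            self.calls.popleft()  # drop outcomes older than the window
        if len(self.calls) < self.min_calls:
            return False
        failures = sum(1 for _, ok in self.calls if not ok)
        return failures / len(self.calls) > self.threshold


def refresh_session(breaker: SlidingWindowBreaker,
                    fetch: Callable[[], dict], last_known: dict) -> dict:
    """Return fresh metadata, or the last known state while Google is unhealthy."""
    if breaker.is_open():
        return last_known  # preserve state; stepUpRequired is not forced
    try:
        fresh = fetch()
    except Exception:
        breaker.record(False)
        return last_known
    breaker.record(True)
    return fresh
```

In this simplified form the breaker closes on its own once failures age out of the window; a production version would more likely add an explicit half-open probe.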

Risk Signal Observability. Google's metadata includes fields like credential_age, account_risk_level, and device_verified. These values should be emitted as custom metric dimensions in CloudWatch — not just logged. This allows creating alarms like 'percentage of active sessions with account_risk_level = HIGH > 5%', which is a signal of an ongoing credential stuffing attack. The SLO I would define: riskSignalAge (time since last successful refresh) < 6 minutes for 99% of active sessions.
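One way to emit the risk level as a metric dimension from a Lambda is the CloudWatch Embedded Metric Format (EMF), where a structured log line is turned into a metric. The sketch below is illustrative: the namespace, metric name, and function names are hypothetical, and the dimension is deliberately low-cardinality (risk level, never userId).

```python
import json


def risk_level_metric(level: str, session_count: int, now_ms: int) -> str:
    """Build one EMF log line counting active sessions at a given risk level."""
    return json.dumps({
        "_aws": {
            "Timestamp": now_ms,  # epoch milliseconds
            "CloudWatchMetrics": [{
                "Namespace": "SessionRisk",  # hypothetical namespace
                "Dimensions": [["AccountRiskLevel"]],
                "Metrics": [{"Name": "ActiveSessions", "Unit": "Count"}],
            }],
        },
        "AccountRiskLevel": level,
        "ActiveSessions": session_count,
    })


def high_risk_share(counts: dict) -> float:
    """Fraction of active sessions whose risk level is HIGH."""
    total = sum(counts.values())
    return counts.get("HIGH", 0) / total if total else 0.0
```

Printing the returned line from a Lambda function is enough for CloudWatch to extract the metric; the alarm on the HIGH share can then be built with metric math over the per-level counts.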

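Cognito User Pools and the Opaque Token Problem. Cognito does not automatically forward custom claims from Google to the access token. You need a Pre Token Generation Lambda trigger that reads the session state from DynamoDB and injects the relevant claims; customizing the access token (rather than only the ID token) requires the V2 trigger event, which depends on the user pool's feature plan. Note: this trigger adds latency to the token refresh flow. Cognito waits up to 5 seconds for a trigger response before it times out and retries, so treat 100ms as a latency budget for the trigger rather than a platform limit.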
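A Pre Token Generation handler for this purpose might look like the sketch below, which assumes the V2 trigger event shape for access token customization. The claim names and the `load_session` lookup are hypothetical. Note that the trigger receives the user, not the sessionId used in the table key, so how the session item is located is an open design point. Missing state is treated as step-up required, which is an assumed fail-closed choice.

```python
from typing import Callable, Optional


def make_handler(load_session: Callable[[str], Optional[dict]]):
    """Build a Pre Token Generation handler around an injected session lookup."""

    def handler(event: dict, context: object) -> dict:
        user_id = event["request"]["userAttributes"]["sub"]
        session = load_session(user_id)
        needs_step_up = session is None or bool(session.get("stepUpRequired"))
        claims = {
            # Hypothetical claim names; values are kept as strings.
            "risk_score": str(session["riskScore"]) if session else "unknown",
            "step_up_required": "true" if needs_step_up else "false",
        }
        event["response"] = {
            "claimsAndScopeOverrideDetails": {
                "accessTokenGeneration": {"claimsToAddOrOverride": claims},
            }
        }
        return event

    return handler
```

Injecting the lookup keeps the handler testable and makes it easy to put a strict timeout around the DynamoDB read so the trigger stays inside its latency budget.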

Consequences and Risks of the Decision

Residual risk window of up to 4 minutes. The async architecture explicitly accepts that a user with a compromised session can make up to ~4 minutes of low-impact requests before being blocked. For high-impact operations, the synchronous step-up eliminates this window, but the compliance team needs to formally accept and document this trade-off. Do not try to hide this in the design.

Dependency on Google availability for the refresh pipeline. If Google has a 30-minute degradation, riskSignalAge will exceed the SLO. You need an explicit runbook: what the system does automatically (preserves last known state), what the operator does manually (can force global step-up via feature flag in DynamoDB), and when to escalate to the security team.

DynamoDB cost at scale. With 100,000 active sessions and refresh every 4 minutes, the async pipeline generates ~25,000 writes/minute, or ~1.08 billion writes per 30-day month. In on-demand mode, assuming a list price of $1.25 per million write request units and items under 1 KB (check current pricing for your Region), that is ~$0.00000125 per write = ~$0.03/minute = ~$1,350/month just for the refresh pipeline. Not prohibitive, but it needs to be in the cost model, especially if the session count scales 10x.

Step-up amplification risk. If Google returns account_risk_level = HIGH for a large segment of users simultaneously (mass false positive), you may trigger step-up for thousands of users at the same time, generating an avalanche of re-authentications that overwhelms Cognito. Implement rate limiting on the number of simultaneous step-ups and an emergency override mechanism.
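The write-cost estimate for the refresh pipeline is easy to recompute when the inputs change. The sketch below is a back-of-the-envelope calculation only: the function name is hypothetical, the price of $1.25 per million write request units is an assumed list price that varies by Region and over time, and each refresh is assumed to be a single write of an item under 1 KB.

```python
def monthly_write_cost(active_sessions: int, refresh_minutes: float,
                       usd_per_million_writes: float,
                       minutes_per_month: int = 30 * 24 * 60) -> dict:
    """Estimate on-demand write volume and cost for the refresh pipeline."""
    writes_per_minute = active_sessions / refresh_minutes
    writes_per_month = writes_per_minute * minutes_per_month
    return {
        "writes_per_minute": writes_per_minute,
        "writes_per_month": writes_per_month,
        "usd_per_month": writes_per_month / 1_000_000 * usd_per_million_writes,
    }


# Assumed inputs: 100,000 sessions, 4-minute refresh, $1.25 per million writes.
estimate = monthly_write_cost(100_000, 4, usd_per_million_writes=1.25)
print(estimate)  # usd_per_month is 1350.0 under these assumptions
```

Running the same function with 1,000,000 sessions shows the 10x scenario directly, and swapping the price parameter covers Regional differences or a move to provisioned capacity.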

Governance, Audit, and the Path to Mature Zero Trust

An identity architecture decision does not end at the technical implementation. In regulated financial systems, the traceability of each authorization decision is as important as the decision itself. Each Lambda Authorizer invocation should emit a structured event to CloudWatch Logs with: userId, sessionId, decision (allow/deny/step-up), riskScore at decision time, signalAge in seconds, and the API Gateway requestId. This log is the audit evidence to demonstrate to the regulator that the system was operating with continuous verification.
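The audit record described above could be produced by a small helper like the following. The field list comes from the text; the function name, the JSON key spelling, and the added timestamp field are assumptions.

```python
import json
from datetime import datetime, timezone

ALLOWED_DECISIONS = {"allow", "deny", "step-up"}


def audit_event(user_id: str, session_id: str, decision: str,
                risk_score: int, signal_age_seconds: float,
                request_id: str, now: datetime) -> str:
    """Serialize one authorization decision as a structured log line."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    return json.dumps({
        "timestamp": now.astimezone(timezone.utc).isoformat(),  # assumed extra field
        "userId": user_id,
        "sessionId": session_id,
        "decision": decision,
        "riskScore": risk_score,
        "signalAge": signal_age_seconds,
        "requestId": request_id,  # API Gateway requestId
    }, sort_keys=True)
```

In a Lambda function, printing the returned string sends it to CloudWatch Logs, where the fixed key set makes it queryable with Logs Insights during an audit.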

In the AWS Well-Architected context, this architecture directly touches the Security pillar — specifically the principles of 'apply security at all layers' and 'enable traceability'. The use of KMS CMK with annual rotation, granular IAM conditions, and CloudTrail enabled for all DynamoDB and KMS operations closes the audit loop.

The path to mature Zero Trust from here has two natural steps. The first is adding device context signals: not just what Google knows about the account, but what you know about the device (fingerprint, geolocation, behavioral pattern). This can be implemented via Amazon Fraud Detector or a custom SageMaker model fed by the audit logs. The second is moving the authorization policy to an external engine, either Open Policy Agent (OPA) running on EKS or Amazon Verified Permissions with Cedar, which decouples policy logic from the authorizer code and allows the compliance team to edit policies without code deployments.

This evolution should be on the roadmap, but not in the MVP. The classic trap is trying to build the complete Zero Trust system in the first iteration and delivering nothing.

Well-Architected Pillars Assessment

Security

Continuous session verification via OIDC metadata, KMS CMK, IAM least-privilege with granular conditions, CloudTrail for complete audit trail. Synchronous step-up for high-impact operations eliminates the risk window on critical operations.

Reliability

Circuit breaker in the refresh Lambda protects against Google unavailability. Session state cached in DynamoDB ensures the authorizer functions even without IdP connectivity. Explicit runbook for signal degradation.

Performance efficiency

Lambda Authorizer reads only from DynamoDB (p99 < 5ms with DAX). No Google call on the critical path. 5-minute cache TTL balances freshness and latency.

Anti-Patterns to Avoid

Calling Google's metadata endpoint directly in the Lambda Authorizer without caching — you will violate your latency SLO and couple availability to Google.
Treating account_risk_level as binary (high/low) without considering the historical distribution — a fixed threshold will generate false positives on legitimate Google security events (e.g., voluntary password change).
Storing the full OIDC token in DynamoDB as session state — store only the claims relevant to the policy, never the raw token.
Not explicitly documenting the residual risk window in the ADR — the compliance team will discover this in an audit and it will be worse.
Using the same Lambda Authorizer for all endpoints without differentiating sensitivity — apply synchronous step-up only where the risk justifies the additional latency.

Architect's Note

Senior Solutions Architect

In practice, the hardest decision here is not technical — it is political: convincing the compliance team to formally accept a 4-minute risk window in exchange for a system that is auditable, observable, and does not break when Google has a degradation. I have seen teams build 'perfect' synchronous evaluation systems that went down for 45 minutes during a Google incident and nobody had documented the fallback. The circuit breaker and the explicit runbook are not implementation details — they are the core of the architectural decision. And the ADR needs to live in the repository, not in a wiki that nobody reads after go-live.

Verdict: Adopt OIDC Session Metadata with Async Evaluation and Selective Step-Up

Strongly Recommended with documented tra

Session metadata support in Sign in with Google is a real advancement for systems that need continuous identity verification. The correct decision for financial environments is not synchronous evaluation on every request; it is asynchronous evaluation with DynamoDB cache, explicit circuit breaker, and surgical synchronous step-up for high-impact operations. This architecture delivers practical Zero Trust: continuous verification without sacrificing latency, with graceful degradation and complete auditability. The 4-minute residual risk window is an acceptable and documentable trade-off, not a design flaw. Implement it, document it in the ADR, and evolve to Amazon Verified Permissions with Cedar when team maturity allows.

References

NIST SP 800-207: Zero Trust Architecture
AWS Docs: Amazon Cognito, Pre Token Generation Lambda Trigger
AWS Docs: API Gateway Lambda Authorizers
AWS Docs: Amazon Verified Permissions and Cedar
AWS Blog: Implementing Zero Trust with AWS
Google Developers: Sign in with Google, Session Metadata
OpenID Connect Core 1.0 Specification
AWS Docs: DynamoDB Condition Expressions

#oidc#zero-trust#cognito#identity#step-up-auth#session-metadata#lambda-authorizer#financial-grade


Analyzed source: Session metadata for Sign in with Google
