ADR: OpenSearch Serverless vs Dedicated Vector Database for Agentic RAG
This ADR evaluates vector search infrastructure options for a multi-tenant agentic RAG platform on AWS, comparing OpenSearch Serverless, dedicated vector databases (Pinecone, pgvector), and a self-managed hybrid search layer. The decision weighs cost, p99 latency, permission-based filtering, incremental ingestion, and native Bedrock Knowledge Bases integration.
When you start building a multi-tenant agentic RAG platform, the choice of vector search engine is not an infrastructure detail — it is an architectural decision that defines operational cost, cross-tenant data security, retrieval latency, and the speed at which the system learns from new documents. This ADR documents the reasoning that led me to choose OpenSearch Serverless with native Bedrock Knowledge Bases integration, and the trade-offs I consciously accepted in doing so.
Fact Sheet
- System: Agentic RAG Platform (reference scenario)
- Domain: AI / Data (retrieval-augmented generation with autonomous agents)
- Estimated scale: 50–200 tenants, 10M–500M vectors, 5k–50k queries/day (project estimate)
- Primary region: us-east-1 (Bedrock GA, OpenSearch Serverless available)
- AI stack: Amazon Bedrock (Claude 3, Titan Embeddings v2), Bedrock Knowledge Bases, Bedrock Agents
- Data stack: S3 (source), OpenSearch Serverless (vector index), DynamoDB (tenant metadata), Lambda (ingestion)
- Decision status: Accepted; implementation in progress
- Date: 2025-Q1
Context and Forces at Play
An agentic RAG platform differs from a simple RAG chatbot in two critical dimensions: autonomy and dynamic scope. Agents don't just retrieve documents — they decide when to retrieve, which collections to query, and how to combine multiple sources to build multi-hop reasoning. This imposes requirements that a static vector index alone cannot satisfy.
The first set of forces comes from the multi-tenant model. Each tenant has its own knowledge base — legal documents, technical manuals, internal policies — and cannot, under any circumstances, see data from another tenant. This is not just logical isolation via metadata filter; it is a security requirement that must be auditable and demonstrable. The naive approach of putting all vectors in a single index and filtering by tenant_id creates residual risk: a malformed query or a filter bug exposes cross-tenant data. Separate indexes per tenant eliminate this risk, but explode operational costs in provisioned vector databases.
The second set of forces is economic. Dedicated vector databases like Pinecone, in the pod-based model evaluated here, are provisioned. For 50 tenants with heterogeneous volumes (some with 100k vectors, others with 50M), over-provisioning is inevitable. The cost of keeping p1 or p2 pods idle during low-demand hours is real and recurring. OpenSearch Serverless, by contrast, bills per OCU (OpenSearch Compute Unit) consumed and scales capacity down automatically when traffic drops. It does not scale to zero, since a minimum OCU floor stays billed, but for workloads with a strong daytime pattern the difference is still a structural saving.
The third set of forces is Bedrock ecosystem integration. AWS announced native support for OpenSearch Serverless as a Bedrock Knowledge Bases backend, meaning the ingestion pipeline (chunking, embedding with Titan Embeddings v2, indexing) is managed by the service itself. This eliminates an entire layer of orchestration code I would otherwise have to write and maintain if I chose an external vector database.
The Real Problem: Permission Filters at Query Time
The most underestimated aspect in choosing a vector engine for enterprise RAG is the permission filter at query time — what the literature calls pre-filtering vs post-filtering. The distinction matters far more than it appears.
Post-filtering is simple to implement: you retrieve the top-K most similar vectors and then discard those the user doesn't have permission to see. The problem is that if 60% of the most semantically relevant documents are outside the user's permission scope, the final result can have catastrophically low recall — you return 2 documents when you should return 10, and the 2 you returned are not the best ones.
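The recall problem is easy to reproduce on a toy corpus. The sketch below is a minimal illustration in plain Python, assuming brute-force cosine similarity instead of an ANN index; the documents, groups, and vectors are made up for the example.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def top_k(query, docs, k):
    """Return the k documents most similar to the query, best first."""
    return sorted(docs, key=lambda d: cosine(query, d["vec"]), reverse=True)[:k]


def post_filter(query, docs, k, allowed):
    # Search the whole corpus first, then drop what the user may not see.
    return [d for d in top_k(query, docs, k) if d["group"] in allowed]


def pre_filter(query, docs, k, allowed):
    # Restrict the search space first, then rank inside it.
    visible = [d for d in docs if d["group"] in allowed]
    return top_k(query, visible, k)


# Hypothetical corpus: the documents closest to the query belong to "finance",
# but the user can only read "legal".
DOCS = [
    {"id": "a1", "group": "finance", "vec": [1.0, 0.0]},
    {"id": "a2", "group": "finance", "vec": [0.98, 0.2]},
    {"id": "a3", "group": "finance", "vec": [0.95, 0.3]},
    {"id": "b1", "group": "legal", "vec": [0.9, 0.45]},
    {"id": "b2", "group": "legal", "vec": [0.7, 0.7]},
    {"id": "b3", "group": "legal", "vec": [0.2, 0.98]},
]
QUERY = [1.0, 0.05]

post = post_filter(QUERY, DOCS, k=4, allowed={"legal"})
pre = pre_filter(QUERY, DOCS, k=4, allowed={"legal"})
print("post-filter:", [d["id"] for d in post])
print("pre-filter: ", [d["id"] for d in pre])
```

With k=4, post-filtering keeps a single permitted document because three of the four slots were taken by documents the user cannot read, while pre-filtering returns every permitted document in relevance order.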
Pre-filtering solves this by applying the filter before the vector search, restricting the search space to only the documents the user can see. OpenSearch Serverless supports this through a filter clause placed inside the knn query itself and sent to the _search API, so the permission filter and the nearest-neighbor search are evaluated together. This is critical: it means the agent never sees a document the user shouldn't see, and semantic relevance is calculated within the correct permission space.
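As a rough sketch of what such a request body could look like, the helper below builds a knn query with an embedded filter. The field names `embedding`, `tenant_id`, and `acl_groups` are hypothetical and depend on the index mapping, and filtering inside the knn clause assumes the index uses a k-NN engine that supports it.

```python
import json


def build_prefiltered_knn_query(query_vector, tenant_id, allowed_groups, k=10):
    """Build an OpenSearch _search body with the permission filter inside knn.

    Field names are illustrative assumptions, not taken from a real mapping.
    """
    groups = list(allowed_groups)
    if not groups:
        # Fail closed: no permission scope means no search at all.
        raise ValueError("allowed_groups must not be empty")
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": list(query_vector),
                    "k": k,
                    # Applied as part of the nearest-neighbor search,
                    # not on the top-k results afterwards.
                    "filter": {
                        "bool": {
                            "must": [
                                {"term": {"tenant_id": tenant_id}},
                                {"terms": {"acl_groups": groups}},
                            ]
                        }
                    },
                }
            }
        },
    }


body = build_prefiltered_knn_query([0.12, -0.03, 0.88], "tenant-a", ["engineering"], k=5)
print(json.dumps(body, indent=2))
```

Raising on an empty permission list is a deliberate choice in this sketch: a query with no filter is exactly the failure mode the pre-filter is meant to prevent.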
Additionally, the platform needs to support incremental ingestion — new documents continuously arrive via S3 events, and the index needs to reflect this in minutes, not hours. OpenSearch Serverless handles this natively: Bedrock Knowledge Bases has incremental sync support via S3 data source, where only modified or added chunks are re-embedded and re-indexed. This contrasts with solutions like pgvector on RDS, where high-frequency incremental ingestion creates write IOPS pressure and can degrade read performance without careful tuning of vacuum and HNSW indexes.
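One way to wire the S3 event to the sync is a small Lambda-style handler that starts a Knowledge Bases ingestion job. This is a sketch under assumptions: the client is injected so the function can be exercised without AWS access (in Lambda it would be `boto3.client("bedrock-agent")`), and the knowledge base and data source IDs are placeholders supplied by the caller.

```python
def handle_s3_event(event, bedrock_agent, knowledge_base_id, data_source_id):
    """Start an incremental sync when an S3 notification reports changed objects.

    `bedrock_agent` is expected to expose `start_ingestion_job`, as the boto3
    "bedrock-agent" client does. Returns the ingestion job ID, or None when
    the event contains no S3 records.
    """
    keys = [
        record["s3"]["object"]["key"]
        for record in event.get("Records", [])
        if "s3" in record
    ]
    if not keys:
        return None

    # The job itself works out which objects changed since the last sync;
    # the keys are only used here for a human-readable description.
    response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
        description=f"S3 change notification: {len(keys)} object(s)",
    )
    return response["ingestionJob"]["ingestionJobId"]
```

A production handler would likely also need to batch bursts of events and handle the case where a sync for the same data source is already running; both are left out of this sketch.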
A point many architects overlook: hybrid search (vector + BM25 lexical) is not optional for quality RAG in technical domains. Documents with serial numbers, product codes, technical acronyms, or specific proper nouns are poorly retrieved by purely semantic search. OpenSearch supports hybrid search natively with hybrid query type, combining kNN and BM25 scores via min-max normalization or RRF (Reciprocal Rank Fusion). Pinecone requires you to maintain a separate text index (typically Elasticsearch or OpenSearch) and do the fusion at the application layer — adding latency, operational complexity, and an extra point of failure.
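To make the fusion step concrete, here is Reciprocal Rank Fusion written out in plain Python. It illustrates the formula only (each document scores the sum of 1 / (k + rank) across result lists) and is not a description of how OpenSearch implements it internally; the constant k=60 is the value commonly used in the literature, and the document IDs are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists: semantic (kNN) hits and lexical (BM25) hits.
knn_hits = ["doc-7", "doc-2", "doc-9"]
bm25_hits = ["doc-2", "doc-4", "doc-7"]

fused = reciprocal_rank_fusion([knn_hits, bm25_hits])
print(fused)
```

Documents that appear in both lists (doc-2 and doc-7) rise above documents found by only one retriever, which is the behavior that helps with serial numbers and product codes that the embedding alone ranks poorly.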
Decision Matrix: Options Evaluated
OpenSearch Serverless (AOSS) + Bedrock Knowledge Bases
- Native integration with Bedrock Knowledge Bases — managed ingestion pipeline (chunking, embedding, indexing)
- True autoscaling per OCU — no over-provisioning for low-volume tenants
- Native hybrid search (kNN + BM25) with RRF without extra layer
- Metadata pre-filtering natively supported in kNN query
- Per-tenant collection isolation — no cross-tenant leakage risk
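- Minimum floor of 2 OCUs per collection when collections cannot share capacity, for example when each tenant uses its own KMS key (roughly $350/month at about $0.24 per OCU-hour in us-east-1); penalizes very small tenants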
- Less control over HNSW parameters (ef_construction, m) compared to provisioned OpenSearch
- Cold start latency for collections without recent traffic
- Moderate vendor lock-in in the AWS/Bedrock ecosystem
Primary choice — best balance between zero-ops, multi-tenant security, and Bedrock integration
Pinecone (dedicated SaaS)
- Consistently low p99 latency at high-scale workloads (>100M vectors)
- Namespaces for logical tenant isolation without separate collection cost
- Mature SDK, excellent documentation, rich metadata support
- No native Bedrock Knowledge Bases integration — 100% custom ingestion pipeline
- Hybrid search requires external lexical index (additional cost and complexity)
- Provisioned pod model — inevitable over-provisioning for heterogeneous tenants
- Data outside AWS VPC — compliance and network latency implications
- High cost at scale: p2 pod ~$1.4k/month per pod, no scale to zero
Rejected — high structural cost, no native Bedrock integration, data outside VPC
pgvector on Aurora PostgreSQL Serverless v2
- Operational familiarity — teams already know PostgreSQL
- Data inside VPC, integration with RDS IAM auth
- Serverless v2 with ACU scaling — better than provisioned RDS
- Rich metadata support via SQL columns and complex filters
- HNSW in pgvector still immature for >10M vectors — IVFFlat index requires periodic rebuild
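- No native hybrid search: PostgreSQL has no built-in BM25, so lexical ranking falls back to full-text search (tsvector with ts_rank) or pg_trgm similarity, with score fusion written by hand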
- No Bedrock Knowledge Bases integration — custom ingestion pipeline
- Aurora Serverless v2 scale-to-zero still has 15-30s cold start latency
- Tenant isolation per schema or separate database — operational complexity grows with tenant count
Rejected for primary use — suitable as fallback for very low-volume tenants (<100k vectors)
Self-managed Hybrid Search (provisioned OpenSearch + Lambda)
- Full control over index parameters, sharding, relevance tuning
- Potentially lower cost per vector at very high scale (>1B vectors)
- Massive operational overhead — patches, rebalancing, blue/green upgrades
- Engineering team diverted from product features to infrastructure
- No Bedrock Knowledge Bases integration — completely custom pipeline
- Risk of under-provisioning during ingestion or query spikes
Rejected — engineering cost not justified at current phase; revisit if scale >500M vectors/tenant
Detailed Technical Comparison
| Criterion | AOSS + Bedrock KB | Pinecone | pgvector Aurora |
|---|---|---|---|
| Bedrock KB Integration | ✅ Native | ❌ Custom pipeline | ❌ Custom pipeline |
| Hybrid Search (kNN + BM25) | ✅ Native with RRF | ⚠️ Requires external index | ⚠️ Partial FTS, no native RRF |
| Permission pre-filtering | ✅ Native kNN + filter | ✅ Metadata filter | ✅ SQL WHERE clause |
| Multi-tenant isolation | ✅ Collection per tenant | ⚠️ Namespace (logical) | ⚠️ Separate schema/DB |
| Autoscaling / idle cost | ✅ OCU per use (2 OCU minimum floor) | ❌ Fixed provisioned pod | ⚠️ Serverless ACU, cold start |
| Incremental ingestion | ✅ S3 sync via Bedrock KB | ⚠️ Custom upsert pipeline | ⚠️ Custom upsert + reindex |
| p99 latency (estimate) | ~80-150ms (warm) | ~20-50ms (warm) | ~50-120ms (warm) |
| Data inside AWS VPC | ✅ VPC endpoint available | ❌ External SaaS | ✅ Native VPC |
Architectural Decision
The agentic RAG platform needs to serve 50-200 tenants with heterogeneous document volumes, guarantee data isolation between tenants, support hybrid search for technical domains, integrate with Bedrock Agents for multi-hop reasoning, and scale cost proportionally to actual usage — not to the provisioned worst case.
Adopt **Amazon OpenSearch Serverless (AOSS)** as the vector search backend, integrated with **Amazon Bedrock Knowledge Bases** for ingestion pipeline management. Each tenant receives a **dedicated AOSS collection** with isolated data access policy. Search uses **hybrid query (kNN + BM25) with RRF** for all technical domains. Permission filters are applied via **pre-filter on the kNN query**. Incremental ingestion is managed by **Bedrock Knowledge Bases S3 data source sync**.
- ✅ Managed ingestion pipeline — zero chunking, embedding, and indexing code to maintain
- ✅ Physical isolation per collection — auditable, demonstrable, no cross-tenant leakage risk
- ✅ Cost proportional to usage — inactive tenants generate no significant compute cost
- ✅ Native hybrid search eliminates the need for a separate lexical index
- ⚠️ Minimum cost per active collection (~2 OCUs) penalizes very low-volume tenants — mitigated by shared collection strategy for free-tier tenants
- ⚠️ p99 latency of 80-150ms is acceptable for agentic RAG (LLM dominates total latency), but requires monitoring on complex queries with many filters
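On the query side, the decision translates into a Knowledge Bases retrieval call that asks for hybrid search and carries the permission filter. The helper below is a sketch that only builds the request arguments; the metadata key `acl_group` is a hypothetical attribute that the ingestion side would have to attach to each document, and the result would be passed to `retrieve(**request)` on the boto3 `bedrock-agent-runtime` client.

```python
def build_retrieve_request(knowledge_base_id, question, allowed_groups, top_k=10):
    """Build keyword arguments for a Bedrock Knowledge Bases Retrieve call.

    Requests hybrid (semantic + text) search and restricts results to
    documents whose `acl_group` metadata is in the caller's permission scope.
    """
    groups = list(allowed_groups)
    if not groups:
        # Fail closed rather than issue an unfiltered retrieval.
        raise ValueError("allowed_groups must not be empty")
    if not 1 <= top_k <= 100:
        raise ValueError("top_k must be between 1 and 100")
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": question},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                "overrideSearchType": "HYBRID",
                "filter": {"in": {"key": "acl_group", "value": groups}},
            }
        },
    }


request = build_retrieve_request(
    knowledge_base_id="KB_ID_PLACEHOLDER",
    question="What is the torque spec for part 7731-B?",
    allowed_groups=["maintenance", "engineering"],
    top_k=8,
)
```

Because each paying tenant has a dedicated collection, the tenant boundary in this sketch comes from which knowledge base is queried, and the filter only narrows results to the user's groups within that tenant.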
Multi-Tenancy Strategy: Collections vs Namespaces vs Shared Indexes
The decision to use one AOSS collection per tenant deserves detailed justification, because it goes against cost-optimization intuition at first glance.
The obvious alternative would be a single shared index with a tenant_id field as a filter. This works in low-risk environments, but fails in three real scenarios: (1) compliance auditing — when an auditor asks "how do you guarantee Tenant A doesn't see Tenant B's data?", the answer "we have a filter in the query" is not satisfactory; the answer "each tenant has its own collection with isolated IAM data access policy" is; (2) maintenance operation blast radius — a reindex or schema migration on a shared index affects all tenants simultaneously; (3) performance isolation — a tenant with massive ingestion on a shared index can degrade query latency for other tenants via indexing resource contention.
Per-tenant collection resolves all three. The minimum cost of ~2 OCUs per active collection is real, but for paying tenants on B2B platforms, this cost is absorbed in the plan pricing. For free-tier or very low-volume tenants (less than 50k vectors, less than 100 queries/day), the strategy is different: these tenants are grouped into a shared free-tier collection, where isolation is done by metadata (tenant_id as a mandatory filter field in all queries). The trade-off of isolation guarantee for cost is explicit and documented in the free-tier service level agreement.
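The two-tier policy can be expressed as a small routing function. This is a sketch under assumptions: the collection naming scheme and the `Tenant` record are hypothetical, the thresholds are the free-tier limits quoted above, and the `outgrew_free_tier` check is an addition of this example rather than something the ADR specifies.

```python
from dataclasses import dataclass
from typing import Optional

FREE_TIER_MAX_VECTORS = 50_000
FREE_TIER_MAX_QUERIES_PER_DAY = 100
SHARED_COLLECTION = "free-tier-shared"  # hypothetical collection name


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    paying: bool
    vector_count: int = 0
    queries_per_day: int = 0


@dataclass(frozen=True)
class SearchTarget:
    collection: str
    # Filter that MUST be attached to every query, or None when the
    # collection boundary already isolates the tenant.
    mandatory_filter: Optional[dict]


def resolve_search_target(tenant):
    """Pick the collection and the mandatory filter for a tenant."""
    if tenant.paying:
        return SearchTarget(f"tenant-{tenant.tenant_id}", None)
    return SearchTarget(SHARED_COLLECTION, {"term": {"tenant_id": tenant.tenant_id}})


def outgrew_free_tier(tenant):
    """Flag free-tier tenants whose usage no longer fits the shared collection."""
    return (not tenant.paying) and (
        tenant.vector_count >= FREE_TIER_MAX_VECTORS
        or tenant.queries_per_day >= FREE_TIER_MAX_QUERIES_PER_DAY
    )
```

Keeping the mandatory filter in the routing result, rather than trusting each caller to add it, puts the weaker isolation guarantee of the shared collection in one auditable place.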
This two-layer strategy — dedicated collections for paying tenants, shared collection for free-tier — is a pattern I consistently use in multi-tenant platforms: you don't need a single solution that serves all cases, you need a clear policy that defines which solution applies to which segment.
Architecture: Agentic RAG Platform with AOSS and Bedrock
Full flow: incremental document ingestion via S3 → Bedrock Knowledge Bases → AOSS, and agentic query flow via Bedrock Agents → AOSS hybrid search → LLM-generated response. Per-tenant collection isolation highlighted.
Diagram components:
- End User (Tenant A/B/N)
- API Gateway + Cognito JWT
- Bedrock Agent (orchestrator)
- Claude 3 (Bedrock)
- Bedrock Knowledge Bases
- AOSS Collection: Tenant A (dedicated)
- AOSS Collection: Tenant B (dedicated)
- AOSS Collection: Free-tier (shared)
- S3 Bucket: Documents (per-tenant prefix)
- S3 Event Notification
- Bedrock KB Sync Job (incremental)
- Titan Embeddings v2 (managed by KB)
- IAM Data Access Policy (per collection)
- DynamoDB: Tenant Metadata + ACL
- Lambda: Permission Resolver
Well-Architected Assessment
Security
Physical isolation per AOSS collection with per-tenant IAM data access policies. kNN query pre-filtering ensures the model never sees documents outside the permission scope. Data flows inside the VPC via AOSS VPC endpoint. Credentials managed by IAM roles — no static keys.
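A per-tenant data access policy could be generated along these lines. The sketch builds the policy document as JSON; the collection name and role ARN are placeholders, and the resulting string would be passed as the `policy` argument of `create_access_policy` (with `type="data"`) on the boto3 `opensearchserverless` client.

```python
import json


def build_data_access_policy(collection_name, principal_arns, read_only=True):
    """Return an AOSS data access policy document scoped to one collection."""
    index_permissions = ["aoss:DescribeIndex", "aoss:ReadDocument"]
    if not read_only:
        # Ingestion roles additionally need to create and write indexes.
        index_permissions += [
            "aoss:CreateIndex",
            "aoss:UpdateIndex",
            "aoss:WriteDocument",
        ]
    policy = [
        {
            "Description": f"Data access for collection {collection_name}",
            "Rules": [
                {
                    "ResourceType": "collection",
                    "Resource": [f"collection/{collection_name}"],
                    "Permission": ["aoss:DescribeCollectionItems"],
                },
                {
                    "ResourceType": "index",
                    "Resource": [f"index/{collection_name}/*"],
                    "Permission": index_permissions,
                },
            ],
            "Principal": list(principal_arns),
        }
    ]
    return json.dumps(policy)


query_policy = build_data_access_policy(
    "tenant-acme",
    ["arn:aws:iam::111122223333:role/tenant-acme-query"],  # placeholder ARN
)
```

Splitting query roles (read-only) from ingestion roles (write) per collection keeps the answer to the auditor's question short: no principal outside the tenant's own roles appears in the tenant's policy.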
Reliability
AOSS is a managed service with 99.9% SLA. Bedrock Knowledge Bases manages ingestion retry automatically. No SPOF in the search layer: the service is multi-AZ by design. Risk: inactive collection cold start; mitigated by periodic warm-up queries scheduled with Amazon EventBridge (formerly CloudWatch Events) for critical tenants.
Performance efficiency
Hybrid kNN+BM25 search with RRF improves recall in technical domains. Estimated p99 latency of 80-150ms (warm) is dominated by the LLM (~1-3s), making search not the bottleneck. OCU autoscaling prevents degradation during query spikes.
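The claim that search is not the bottleneck is simple arithmetic. The sketch below assumes one search call and one LLM call per request, with a 2-second LLM latency picked from the middle of the 1-3s range; a multi-hop agent that searches several times per answer would shift the numbers.

```python
def search_share(search_ms, llm_ms):
    """Fraction of end-to-end latency spent in vector search."""
    return search_ms / (search_ms + llm_ms)


def end_to_end_gain(search_before_ms, search_after_ms, llm_ms):
    """Fraction of total latency removed by a faster search layer."""
    saved = search_before_ms - search_after_ms
    return saved / (search_before_ms + llm_ms)


share = search_share(search_ms=120, llm_ms=2000)
gain = end_to_end_gain(120, 40, llm_ms=2000)
print(f"search share of total latency: {share:.1%}")
print(f"gain from cutting search 120 ms -> 40 ms: {gain:.1%}")
```

Under these assumptions search accounts for under 6% of the request, and cutting it from 120 ms to 40 ms removes less than 4% of end-to-end latency.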
Cost optimization
OCU-per-use model eliminates over-provisioning. Two-tier strategy (dedicated collection for paying, shared for free) optimizes cost per segment. Per-collection OCU monitoring via CloudWatch is essential to detect tenants with anomalous growth.
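A back-of-the-envelope model of the compute floor helps when sizing the two tiers. This sketch treats the OCU-hour price and the per-collection floor as inputs rather than facts, and it models the worst case in which no dedicated collection shares capacity with another.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_ocu_cost(ocus, price_per_ocu_hour, hours=HOURS_PER_MONTH):
    """Monthly compute cost for a steady number of OCUs."""
    return ocus * price_per_ocu_hour * hours


def fleet_floor_cost(dedicated_collections, floor_ocus, price_per_ocu_hour):
    """Worst-case monthly floor when every dedicated collection carries
    its own minimum OCUs and nothing is shared."""
    return dedicated_collections * monthly_ocu_cost(floor_ocus, price_per_ocu_hour)


def break_even_tenants(shared_floor_cost, dedicated_floor_cost_per_tenant):
    """How many dedicated collections cost as much as one shared collection."""
    return shared_floor_cost / dedicated_floor_cost_per_tenant
```

Plug in the current OCU-hour price from the AWS pricing page for the target region; comparing `fleet_floor_cost` at 50 and at 200 tenants shows how quickly the floor, rather than actual traffic, comes to dominate the bill.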
Sustainability
Scaling capacity down toward the minimum OCU floor for inactive collections reduces energy consumption during idle periods, which is relevant for platforms with strong daytime patterns and tenants in different time zones.
If I were starting this platform today, I would make exactly this choice, but with eyes open to two risks that most architects underestimate.

The first is OCU cost at collection scale. AOSS bills a minimum OCU floor (2 OCUs with redundancy enabled) that every collection unable to share capacity with others has to carry. This is fine for medium and large tenants, but if you reach 200 paying tenants with active collections 24/7, the minimum compute cost can surprise you. My recommendation: implement a collection hibernation policy for tenants with no activity for more than 30 days. AOSS has no native pause or hibernate operation yet, so you need a job that monitors activity and takes idle collections offline through the API, which in practice means deleting the collection and rebuilding it from the S3 source when the tenant returns. This can reduce compute costs by 30-40% on platforms with tenant churn.

The second risk is dependency on the Bedrock Knowledge Bases ingestion pipeline. It's convenient, but it's largely a black box: chunking is limited to the strategies the service exposes (fixed-size with configurable size and overlap, hierarchical, semantic, or none, plus an optional Lambda transformation), and the embedding model must be one the service supports (such as Titan Embeddings v2 or Cohere Embed). For most cases this is acceptable, but if you have documents with very specific structure (financial tables, source code, technical diagrams), the built-in chunking strategies can destroy the semantics. In that case, I would keep Bedrock KB for the standard flow and have a custom ingestion pipeline via Lambda for special document types, writing directly to the AOSS collection via API. The coexistence of both pipelines is supported: Bedrock KB is not the only one that can write to an AOSS collection.

On p99 latency: 80-150ms for vector search seems high compared to Pinecone (~20-50ms), but in the context of agentic RAG, the LLM consumes 1-3 seconds per call. Search is not the bottleneck. Spending engineering effort to reduce search latency from 120ms to 40ms at the expense of operational complexity is a classic premature optimization. Measure first, optimize later.
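The detection half of the hibernation policy mentioned above is straightforward to sketch. The function below only selects candidates; the `last_activity` mapping is a hypothetical input (for example, derived from per-collection metrics or the tenant metadata table), and what happens to a selected collection is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def hibernation_candidates(last_activity, now=None, idle_days=30):
    """Return tenant IDs with no activity for more than `idle_days` days.

    `last_activity` maps tenant ID to a timezone-aware datetime of the most
    recent query or ingestion event.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=idle_days)
    return sorted(
        tenant_id for tenant_id, seen in last_activity.items() if seen < cutoff
    )


# Hypothetical activity snapshot.
NOW = datetime(2025, 3, 1, tzinfo=timezone.utc)
ACTIVITY = {
    "tenant-a": datetime(2025, 2, 27, tzinfo=timezone.utc),
    "tenant-b": datetime(2025, 1, 10, tzinfo=timezone.utc),
    "tenant-c": datetime(2024, 12, 1, tzinfo=timezone.utc),
}
idle = hibernation_candidates(ACTIVITY, now=NOW)
print(idle)
```

Running the selection as a scheduled job and requiring a tenant to appear on the list for consecutive runs before acting is one way to avoid hibernating a tenant that is merely on holiday.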
Verdict
OpenSearch Serverless with Bedrock Knowledge Bases is the right choice for a multi-tenant agentic RAG platform on AWS in 2025, not because it's perfect but because it solves the right problems with the lowest operational cost. Physical isolation per collection is the only auditable answer for multi-tenancy with sensitive data. Native hybrid search eliminates a layer of complexity that any other option would require. OCU autoscaling transforms infrastructure cost from fixed to variable, aligning it with the platform's business model.

The trade-offs are real: minimum cost per active collection, higher p99 latency than Pinecone, and moderate lock-in in the Bedrock ecosystem. None of them is a blocker for the current stage. The minimum cost is manageable with a hibernation policy. The latency is irrelevant against LLM inference time. The lock-in is mitigated by the portability of raw data in S3.

The decision would be different in two scenarios: (1) if the platform needed sub-20ms search latency for real-time use cases (real-time recommendation, autocomplete), where Pinecone or a provisioned OpenSearch with HNSW tuning would be necessary; (2) if scale exceeded 500M vectors per tenant with highly unpredictable access patterns, where fine control over sharding and index parameters would justify the operational overhead of a provisioned cluster.

For the described scenario (50-200 tenants, technical domains, Bedrock integration, product-focused engineering team), this is the right decision.