Metadata, filters and routing
Use metadata to filter, isolate tenants and route the search to the right source.
Vector search without filters is like opening an HR file to any employee in the company: it technically works, but it's a security and quality disaster. Metadata is the mechanism that turns a generic index into a precise, secure, multi-tenant retrieval system, and ignoring it is the most common mistake I see in RAG that 'works in the notebook but fails in production'.
Why metadata exists in the RAG pipeline
When you index a chunk, you are storing two types of information: the semantic content (captured by the embedding) and the document context (captured by metadata). The embedding answers what the text means. Metadata answers where it came from, when, for whom, and with what permission.
Practical examples of useful fields:
| Field | Example | Use |
|---|---|---|
| tenant_id | client-acme | Isolate data per client |
| doc_type | contract, faq, manual | Route to the right collection |
| created_at | 2024-11-01 | Filter by time window |
| author | legal@company.com | Audit and provenance |
| permission_level | public, internal, confidential | Access control |
| language | en-US | Avoid language mixing |
| source_uri | s3://bucket/doc.pdf | Citations and traceability |
These fields are attached at ingestion time, alongside the vector, and stored in the index (OpenSearch, Pinecone, pgvector). At search time, they become filter clauses that narrow the candidate set. When the engine applies them before or during similarity ranking (pre-filtering), the search returns fewer irrelevant chunks and has fewer vectors to score, so precision and latency usually improve together.
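To make the "filter first, rank second" idea concrete, here is a minimal in-memory sketch. It is not how OpenSearch, Pinecone, or pgvector implement filtering internally; the `INDEX` list, the two-dimensional vectors, and the `search` helper are all hypothetical and exist only for illustration.

```python
import math

# Hypothetical in-memory "index": each entry pairs a vector with its metadata.
INDEX = [
    {"id": "c1", "vector": [1.0, 0.0],
     "metadata": {"tenant_id": "client-acme", "doc_type": "contract"}},
    {"id": "c2", "vector": [0.9, 0.1],
     "metadata": {"tenant_id": "client-beta", "doc_type": "contract"}},
    {"id": "c3", "vector": [0.0, 1.0],
     "metadata": {"tenant_id": "client-acme", "doc_type": "faq"}},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, filters, k=2):
    """Keep only entries matching every metadata filter, then rank by similarity."""
    # Step 1: metadata filter narrows the candidate set.
    candidates = [
        entry for entry in INDEX
        if all(entry["metadata"].get(field) == value for field, value in filters.items())
    ]
    # Step 2: similarity ranking runs only on the surviving candidates.
    ranked = sorted(candidates, key=lambda e: cosine(query_vector, e["vector"]), reverse=True)
    return [entry["id"] for entry in ranked[:k]]


results = search([1.0, 0.0], {"tenant_id": "client-acme"})
```

In this toy data, chunk `c2` belongs to another tenant and is never scored, even though its vector is very close to the query.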
In practice, every RAG project that reaches me without structured metadata has the same problem: the model responds with documents from the wrong clients, outdated versions, or content the user shouldn't see. Metadata is not an advanced feature — it's the minimum for any RAG going to production with more than one client, more than one document type, or more than one permission level. Define the metadata schema before you start indexing, because retrofitting it is expensive.
Flow: metadata controlling retrieval and routing
From user query to retrieved chunk: how tenant, permission, and document type filter and route the search.
- Auth / JWT · tenant_id + roles
- Query Router · doc_type → collection
- Filter Builder · tenant + permission + date
- Vector search · + metadata filter
- Index: Contracts · tenant_id=acme
- Index: FAQ · tenant_id=acme
- Index: Other tenant · tenant_id=beta
- LLM (Bedrock) · filtered chunks
Multi-tenant and access control: where security meets retrieval
Multi-tenant in RAG has two main patterns, and the choice affects security, cost, and operational complexity:
Shared index with tenant_id filter: all clients in the same OpenSearch index, but every query mandatorily includes filter: { term: { tenant_id: "acme" } }. Cheaper and simpler to operate, but requires absolute discipline — a query without the filter leaks data across tenants. Use when per-tenant volume is small and you trust the application layer.
Separate index per tenant: each client has its own index (or namespace, in Pinecone). Isolation is structural: a missing or buggy filter cannot leak data across tenants, although the application must still resolve the correct index for each request. Higher cost, but the security model is much more robust. Use when data is sensitive or regulated (healthcare, financial, legal).
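The two patterns can be sketched side by side. This is only an illustration under assumptions: the vector field name `embedding`, the index prefix `rag-chunks-`, and both helper functions are hypothetical, and the query body follows the general shape of an OpenSearch k-NN query with a filter, which you should check against the engine and version you run.

```python
import re


def shared_index_query(query_vector, tenant_id, k=5):
    """Pattern 1: shared index. The tenant filter is built in and cannot be skipped."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every query")
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {  # assumed name of the vector field
                    "vector": query_vector,
                    "k": k,
                    "filter": {"bool": {"filter": [{"term": {"tenant_id": tenant_id}}]}},
                }
            }
        },
    }


def index_for_tenant(tenant_id):
    """Pattern 2: separate index. The tenant decides which index is queried at all."""
    # Reject anything that could resolve to an unexpected index name.
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]*", tenant_id or ""):
        raise ValueError(f"invalid tenant_id: {tenant_id!r}")
    return f"rag-chunks-{tenant_id}"  # hypothetical naming convention


query = shared_index_query([0.1, 0.2, 0.3], "acme")
index_name = index_for_tenant("acme")
```

Centralizing query construction in one function like `shared_index_query` is one way to enforce the discipline the shared-index pattern depends on: no other code path builds a search body.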
Beyond tenant_id, access control by permission_level follows the same logic: the user's JWT carries their roles, the application layer translates that into metadata filters, and the index never returns chunks the user cannot see. This does not replace IAM and S3 bucket policies — it is an additional layer. In Lesson 11 (guardrails) and Module 3 we will go deeper into the full security model.
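As a rough sketch of that translation step, the function below turns already-verified JWT claims into filter clauses. The claim names (`tenant_id`, `roles`), the role names, and the `ROLE_TO_LEVELS` mapping are assumptions for illustration; signature verification is assumed to have happened earlier in the request.

```python
# Hypothetical mapping from application roles to the permission levels they may read.
ROLE_TO_LEVELS = {
    "employee": {"public", "internal"},
    "legal": {"public", "internal", "confidential"},
}


def filters_from_claims(claims):
    """Build mandatory filter clauses from verified JWT claims."""
    tenant_id = claims.get("tenant_id")
    if not tenant_id:
        raise PermissionError("token carries no tenant_id; refusing to search")

    # Assumption: every authenticated user may read "public" content.
    levels = {"public"}
    for role in claims.get("roles", []):
        # Unknown roles grant nothing extra.
        levels |= ROLE_TO_LEVELS.get(role, set())

    return [
        {"term": {"tenant_id": tenant_id}},
        {"terms": {"permission_level": sorted(levels)}},
    ]


employee_filters = filters_from_claims({"tenant_id": "acme", "roles": ["employee"]})
```

The function fails closed: a token without `tenant_id` raises instead of producing an unfiltered search.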
A classic mistake: trusting that the LLM will 'ignore' context it shouldn't see. It won't. If the chunk reached the context, the model can use it. The filter must be applied at retrieval time, before any chunk reaches the prompt.
Query routing: sending the question to the right index
Not every question should go to the same index. A system with technical documents, support FAQs, and legal contracts benefits from a router that decides, before vector search, which collection to query.
The router can be as simple as a keyword-based classifier or as sophisticated as a small LLM that classifies query intent. In practice, I start with a lightweight classifier (regex or a text classification model) and only move to an LLM if precision is insufficient.
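A lightweight starting point could look like the sketch below. The keyword lists are invented for illustration and would need tuning against real queries; the `doc_type` values reuse the ones from the metadata table.

```python
import re

# Hypothetical keyword patterns per doc_type; the first match wins.
ROUTES = [
    ("contract", re.compile(r"\b(contract|clause|termination|liability)\b", re.IGNORECASE)),
    ("faq", re.compile(r"\b(how do i|reset|password|refund)\b", re.IGNORECASE)),
    ("manual", re.compile(r"\b(install|configure|setup|error code)\b", re.IGNORECASE)),
]


def route(query):
    """Return the doc_type for a query, or None when no pattern matches."""
    for doc_type, pattern in ROUTES:
        if pattern.search(query):
            return doc_type
    return None  # caller decides the fallback, e.g. search every collection


contract_route = route("What is the termination clause in our agreement?")
faq_route = route("How do I reset my password?")
```

Returning `None` instead of a default collection keeps the "I don't know" case explicit, so the caller can fall back to a broader search.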
The router's output is a set of metadata filters: { doc_type: "contract", tenant_id: "acme", created_at: { gte: "2023-01-01" } }. These filters are passed directly to the OpenSearch query or to the Knowledge Bases filter.
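One possible way to turn that router output into OpenSearch filter clauses is sketched here. The helper name `to_filter_clauses` is hypothetical, and the mapping rule (dict means range, list means terms, scalar means term) is a simplifying assumption rather than a general-purpose translator.

```python
def to_filter_clauses(router_output):
    """Translate a flat router output dict into an OpenSearch bool filter."""
    clauses = []
    for field, value in router_output.items():
        if isinstance(value, dict):
            # e.g. {"gte": "2023-01-01"} becomes a range clause
            clauses.append({"range": {field: value}})
        elif isinstance(value, (list, tuple, set)):
            # several allowed values become a terms clause
            clauses.append({"terms": {field: sorted(value)}})
        else:
            # a single exact value becomes a term clause
            clauses.append({"term": {field: value}})
    return {"bool": {"filter": clauses}}


router_output = {
    "doc_type": "contract",
    "tenant_id": "acme",
    "created_at": {"gte": "2023-01-01"},
}
opensearch_filter = to_filter_clauses(router_output)
```

In a real pipeline the `tenant_id` value would come from the authenticated user's context, not from the classifier; it appears in the router output here only to mirror the example above.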
A useful pattern is confidence-based routing: if the classifier has high confidence, go directly to a specific index; if confidence is low, search across multiple indexes and use reranking (Lesson 05) to consolidate. This avoids empty responses when the query is ambiguous.
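The decision itself is small enough to sketch. The threshold of 0.8, the index names, and the shape of the classifier scores are assumptions; the reranking step is only flagged here, not implemented.

```python
def plan_search(scores, threshold=0.8):
    """Pick target indexes from classifier scores (index name -> confidence)."""
    if not scores:
        raise ValueError("classifier returned no scores")
    best_index, best_score = max(scores.items(), key=lambda item: item[1])
    if best_score >= threshold:
        # Confident: query a single index, no consolidation needed.
        return {"indexes": [best_index], "rerank": False}
    # Ambiguous: fan out to every index and let the reranker consolidate.
    return {"indexes": sorted(scores), "rerank": True}


confident_plan = plan_search({"contracts": 0.93, "faq": 0.04, "manuals": 0.03})
ambiguous_plan = plan_search({"contracts": 0.45, "faq": 0.40, "manuals": 0.15})
```

The fan-out path costs more per query, so the threshold is effectively a trade-off between latency and the risk of an empty or off-target answer.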
Important detail: routing does not replace hybrid search — it happens before it. You route to the right collection, then execute hybrid search + reranking within that collection.
Key points from this lesson
How to implement metadata and filters in practice
1. Define the metadata schema at the start of ingestion
   List all the fields you will need for filtering, routing, and auditing. Document each field's type and allowed values. Validate at ingestion: chunks without tenant_id do not enter the index.
2. Extract metadata at the parsing stage
   Use the S3 object key, object tags, or a document header parser to populate the fields. For documents without explicit metadata, use a small LLM to classify doc_type automatically.
3. Store metadata alongside the vector in the index
   In OpenSearch, define a mapping with metadata fields as keyword (for exact filters) or date (for ranges). In Knowledge Bases, supply the fields through the .metadata.json file stored in S3 alongside each document.
4. Build filters from the authenticated user's context
   Extract tenant_id and roles from the JWT. Build the filter object programmatically; never let the user pass filters directly. Always include tenant_id as a mandatory filter.
5. Implement the router before vector search
   Classify query intent to determine doc_type and target collection. Combine router output with security filters before executing the search.
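The ingestion side of these steps (schema, extraction, mapping) can be sketched as follows. Everything specific here is an assumption for illustration: the S3 key layout `<tenant_id>/<doc_type>/<filename>`, the list of required fields, and the helper names. The mapping shows only the metadata fields; the vector field is omitted because its settings depend on the embedding model and engine.

```python
from datetime import date

REQUIRED_FIELDS = ("tenant_id", "doc_type", "created_at", "permission_level")
ALLOWED_VALUES = {
    "doc_type": {"contract", "faq", "manual"},
    "permission_level": {"public", "internal", "confidential"},
}

# Metadata part of an OpenSearch mapping: keyword for exact filters, date for ranges.
METADATA_MAPPING = {
    "properties": {
        "tenant_id": {"type": "keyword"},
        "doc_type": {"type": "keyword"},
        "permission_level": {"type": "keyword"},
        "created_at": {"type": "date"},
    }
}


def metadata_from_key(key):
    """Derive tenant_id and doc_type from an assumed key layout."""
    parts = key.split("/")
    if len(parts) < 3:
        raise ValueError(f"unexpected key layout: {key!r}")
    return {"tenant_id": parts[0], "doc_type": parts[1]}


def validate(metadata):
    """Reject chunks whose metadata is incomplete or outside the schema."""
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    if missing:
        raise ValueError(f"missing required metadata: {missing}")
    for field, allowed in ALLOWED_VALUES.items():
        if metadata[field] not in allowed:
            raise ValueError(f"{field}={metadata[field]!r} is not an allowed value")
    date.fromisoformat(metadata["created_at"])  # raises ValueError if malformed
    return metadata


metadata = metadata_from_key("acme/contract/msa-2024.pdf")
metadata.update({"created_at": "2024-11-01", "permission_level": "confidential"})
validated = validate(metadata)
```

Running `validate` before every index write is what makes "chunks without tenant_id do not enter the index" an enforced rule rather than a convention.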
Frequently asked questions
Can I use metadata to filter by date and ensure responses are always up to date?
Yes, and it's a very useful pattern. Add a created_at >= now() - 12 months filter for queries that need recent information. But be careful: valid historical documents may be excluded. Consider combining with an is_current: true field for documents that should always appear, regardless of date.
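A sketch of that combined filter, expressed as an OpenSearch bool clause: a chunk passes if it is recent or explicitly marked current. The helper name is hypothetical, and "12 months" is approximated as 365 days for simplicity.

```python
from datetime import date, timedelta


def recency_filter(today, max_age_days=365):
    """Match chunks that are recent OR flagged as still current."""
    cutoff = (today - timedelta(days=max_age_days)).isoformat()
    return {
        "bool": {
            "should": [
                {"range": {"created_at": {"gte": cutoff}}},
                {"term": {"is_current": True}},
            ],
            # At least one of the two conditions must hold.
            "minimum_should_match": 1,
        }
    }


clause = recency_filter(date(2023, 6, 1))
```

Passing `today` explicitly, instead of calling `date.today()` inside the function, keeps the filter reproducible in tests and in replayed queries.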
Does Bedrock Knowledge Bases support metadata filters?
Yes. Knowledge Bases allows passing a retrievalConfiguration object with metadata filters in the retrieve API. Metadata is defined in S3 via .metadata.json files alongside each document. Lesson 09 covers this in detail.
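As a hedged sketch of both pieces, the code below builds the sidecar file content and the arguments for a retrieve call without contacting AWS. The knowledge base ID, the question, and the helper names are placeholders; the request shape follows my understanding of the `bedrock-agent-runtime` Retrieve API, so confirm field names against the current API reference before relying on it.

```python
import json


def metadata_sidecar(attributes):
    """Content for <document>.metadata.json, stored next to the document in S3."""
    return json.dumps({"metadataAttributes": attributes}, indent=2)


def retrieve_request(kb_id, question, tenant_id, doc_type, k=5):
    """Keyword arguments for a Knowledge Bases retrieve call with metadata filters."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": question},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": k,
                "filter": {
                    # andAll combines two or more conditions.
                    "andAll": [
                        {"equals": {"key": "tenant_id", "value": tenant_id}},
                        {"equals": {"key": "doc_type", "value": doc_type}},
                    ]
                },
            }
        },
    }


sidecar = metadata_sidecar({"tenant_id": "acme", "doc_type": "contract"})
request = retrieve_request("KB123EXAMPLE", "What is the notice period?", "acme", "contract")

# With boto3 installed and AWS credentials configured, the call would be:
#   boto3.client("bedrock-agent-runtime").retrieve(**request)
```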
Does metadata filtering replace encryption and IAM policies?
No. Metadata filtering is an application layer — it protects against improper retrieval within RAG, but does not protect data at rest. You still need S3 bucket policies, KMS, IAM roles, and VPC endpoints. The two mechanisms are complementary.
Conclusion: metadata is infrastructure, not a detail
Well-designed metadata transforms a fragile RAG into a reliable system. It is what separates 'works in the demo' from 'works in production with ten clients and sensitive data'. Invest time in the schema before indexing, treat security filters as mandatory (not optional), and implement the router as the first decision in the retrieval pipeline. In the next lesson, we go beyond passive retrieval and enter agentic RAG — where the system dynamically decides which sources to query and in what order.