Terminologies
Re-ranking
What is Re-ranking?
In a standard RAG setup, when a user asks a question, a vector database performs a semantic search using a mathematical function such as Cosine Similarity. This first pass is fast and returns the top N documents (e.g., top 20).
Re-ranking is the second step. It takes those top 20 documents and passes them, along with the user's original query, through a more powerful, specialized AI model called a Re-ranker. This model re-evaluates the list and re-orders them, moving the absolute best answers to the very top.
Why is it Useful?
Standard vector databases prioritize speed over absolute precision. They excel at narrowing down millions of documents to a few dozen in milliseconds, but they often get the exact order wrong. Re-ranking fixes this by solving three major problems:
- Overcoming Bi-Encoder Limitations: Vector databases use "Bi-Encoders" (where queries and documents are embedded separately). They catch overall semantic concepts but miss fine-grained details, keyword matching, or complex logic. Re-rankers use "Cross-Encoders," which analyze the query and the document together, catching deep contextual nuances.
- Fixing "Lost in the Middle": LLMs suffer from a known limitation where they pay heavy attention to the very beginning and very end of the prompt context, often ignoring information in the middle. Re-ranking ensures the golden nuggets are at the absolute top, right where the LLM is paying attention.
- Reducing Noise and Token Costs: Instead of feeding 20 messy documents to your LLM (which wastes tokens and confuses the model), a re-ranker allows you to confidently trim that list down to the top 3 or 5 highly precise chunks.
The Workflow: Two-Stage Retrieval
Think of it like hiring for a job. Your vector database is the HR software filtering 1,000 resumes down to 20 based on keywords and skills. The re-ranker is the hiring manager conducting deep interviews with those 20 to pick the perfect top 3.
- Stage 1 (Retrieval): User Query → Vector Database → Returns top 25 chunks (High recall, lower precision).
- Stage 2 (Re-ranking): User Query + 25 Chunks → Re-ranker Model → Returns top 5 precisely re-ordered chunks (High precision).
While re-rankers dramatically improve RAG accuracy, they come with a trade-off: Latency.
Because a Cross-Encoder must run a full model pass over every query-document pair, and cannot precompute document scores ahead of time, it is computationally heavy and far slower than a vector lookup. This is exactly why we use the two-stage approach. Never run a re-ranker across your entire database; use it only on the small subset returned by Stage 1.
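The two stages can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the vectors are made-up numbers, and `toy_pair_score` is a hypothetical stand-in for a real Cross-Encoder (it only counts shared words). What matters is the shape of the flow: a cheap similarity pass over everything, then a more careful pairwise score over the survivors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, top_n):
    """Stage 1: rank every stored chunk by vector similarity, keep top_n."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item["vector"]),
        reverse=True,
    )
    return ranked[:top_n]


def rerank(query, candidates, score_pair, top_k):
    """Stage 2: score each (query, chunk text) pair jointly, keep top_k."""
    ranked = sorted(
        candidates,
        key=lambda item: score_pair(query, item["text"]),
        reverse=True,
    )
    return ranked[:top_k]


def toy_pair_score(query, text):
    """Hypothetical stand-in for a Cross-Encoder: share of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / max(len(query_words), 1)


# Made-up two-dimensional "embeddings" purely for illustration.
index = [
    {"text": "reset the router by holding the pinhole button", "vector": [0.9, 0.1]},
    {"text": "router warranty and return policy", "vector": [0.95, 0.05]},
    {"text": "cafeteria lunch menu", "vector": [0.0, 1.0]},
]
query = "how to reset the router"
query_vec = [1.0, 0.0]

candidates = retrieve(query_vec, index, top_n=2)
best = rerank(query, candidates, toy_pair_score, top_k=1)
```

In this toy data the warranty chunk wins Stage 1 on raw vector similarity, while the reset instructions win Stage 2 once query and text are compared together.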
Chunking
What is Chunking?
Large language models and embedding models cannot ingest infinite text at once due to context window limits. More importantly, embedding an entire 50-page document into a single vector flattens out all the nuance; the specific details get averaged out and lost.
Chunking cuts the text into digestible blocks (e.g., paragraphs, sentences, or a fixed number of tokens) so that each individual block can be converted into its own highly specific vector embedding.
Why is it Useful?
Chunking directly dictates the relevance of your RAG system. Getting it right provides three massive benefits:
- Preserves Semantic Precision: If a user asks about a specific clause on page 42 of a contract, you want your vector search to find the exact paragraph containing that clause. If you embedded the whole contract as one file, the mathematical vector would represent the "general topic of the contract," completely missing the specific clause.
- Saves LLM Cost and Context: LLMs charge you by the token. Instead of feeding an entire document into the prompt when a user asks a simple question, chunking allows you to inject only the relevant 200-word paragraph, saving money and reducing API latency.
- Improves Search Accuracy: Smaller, focused chunks create sharper, more distinct vector embeddings, making it significantly easier for mathematical similarity algorithms (like Cosine Similarity) to find exact matches.
Core Chunking Strategies
There is no "one-size-fits-all" chunking strategy. You will need to choose a method based on the structure of your source data:
1. Fixed-Size Chunking
The simplest method. You decide on a strict limit, such as exactly 200 tokens or characters per chunk.
- The Catch: It completely ignores human grammar. A chunk might cut off right in the middle of a critical sentence, destroying the meaning.
- The Fix (Overlap): To avoid losing context at the boundaries, developers use a sliding window with overlap (e.g., a chunk size of 200 tokens with a 50-token overlap). The end of Chunk 1 is repeated at the beginning of Chunk 2, so a sentence cut at one boundary is likely to appear whole in the neighboring chunk.
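A minimal sketch of fixed-size chunking with overlap follows. The function name `chunk_fixed` and the whitespace "tokenizer" are assumptions made for illustration; a real pipeline would count tokens with the tokenizer of its embedding model.

```python
def chunk_fixed(tokens, size, overlap):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window slides each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Whitespace splitting is only a stand-in for a real tokenizer.
tokens = "a b c d e f g h i j".split()
chunks = chunk_fixed(tokens, size=4, overlap=1)
```

With `size=4` and `overlap=1`, each chunk starts with the last token of the previous one, which is the sliding-window behavior described above.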
2. Recursive Character / Markdown Chunking
A smarter, hierarchical approach (and the default in frameworks like LangChain). It attempts to split text by natural boundaries in order of importance: first by paragraphs (\n\n), then by line breaks (\n), and finally by spaces if necessary, until the chunk fits the target size. This keeps paragraphs and related ideas together.
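The idea can be sketched as a short recursive function. This is a simplified illustration written for this document, not LangChain's actual implementation: `recursive_split` is a hypothetical name, lengths are measured in characters, and overlap is left out to keep the logic visible.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first, recursing on pieces that are still too long."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every max_len characters.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging neighbors while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_len:
            current = piece
        else:
            # This piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = "Intro paragraph.\n\nSecond paragraph line one.\nSecond paragraph line two."
chunks = recursive_split(text, max_len=30)
```

Here the first paragraph fits as-is, while the second is too long for one chunk and is therefore split again at its line break rather than mid-sentence.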
3. Semantic Chunking
The most advanced method. Instead of counting characters or looking for punctuation, an embedding model embeds the text sentence by sentence. The splitter calculates the semantic distance between consecutive sentences and draws a boundary (creates a new chunk) only when the meaning or topic shifts significantly.
Whichever strategy you choose, chunk size is a balancing act that you will need to tune for your own requirements. Watch for two failure modes:
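A rough sketch of the boundary rule is shown here. The `toy_embed` function is a hypothetical stand-in for a real embedding model (it just counts words), and the `threshold` value is an arbitrary assumption; real systems typically tune it or derive it from the distribution of distances in the document.

```python
import math
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(sentence.lower().replace(".", "").split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def semantic_chunks(sentences, embed, threshold):
    """Start a new chunk whenever consecutive sentences are farther apart than threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine_distance(embed(previous), embed(current)) > threshold:
            chunks.append([current])  # topic shift: open a new chunk
        else:
            chunks[-1].append(current)
    return chunks


sentences = [
    "The router has a reset button.",
    "Hold the reset button on the router.",
    "Lunch is served at noon.",
]
chunks = semantic_chunks(sentences, toy_embed, threshold=0.8)
```

The two router sentences share most of their vocabulary and stay together, while the lunch sentence shares none and starts a new chunk.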
- Chunks too small: You lose context. The system finds the phrase "Apply 5 drops," but loses the context of which medicine it applies to because that was in the previous chunk.
- Chunks too large: You introduce noise. The specific answer is buried inside a mountain of irrelevant text, confusing the LLM and inflating your bill.
Tensor
What is a Tensor?
Mathematically, a tensor is a multi-dimensional array of numbers. The easiest way to understand a tensor is through its dimensions (also called axes or rank):
- Rank 0 Tensor: A single number (a Scalar). Example: 5
- Rank 1 Tensor: A list of numbers (a Vector). Example: [1.2, -0.5, 3.1] (A text embedding is a Rank 1 Tensor).
- Rank 2 Tensor: A grid of numbers with rows and columns (a Matrix). Example: A spreadsheet or a batch of text embeddings.
- Rank 3+ Tensor: A cube or hyper-cube of numbers. Example: A color image has 3 dimensions (Height, Width, and Color Channels for Red, Green, and Blue).
In short, a tensor is just a generalization of vectors and matrices to an arbitrary number of dimensions.
Why is it Useful?
Tensors aren't just a conceptual way to organize numbers; they are engineered for massive computational performance. They are critical for two reasons:
- Hardware Optimization (GPUs/TPUs): Graphics processing units (GPUs) and Tensor Processing Units (TPUs) are physically designed to perform math on entire multi-dimensional grids simultaneously, rather than calculating numbers one by one. Tensors allow frameworks like PyTorch or TensorFlow to offload massive math operations to specialized hardware blocks (like NVIDIA's Tensor Cores).
- Unified Representation: AI models don't understand words, pixels, or audio frequencies. They only understand geometric coordinates and vector spaces. Tensors provide a single, unified format. A video, an audio clip, and a chunk of text from a RAG pipeline are all converted into tensors, allowing the same deep learning architectures to process them.
1. Embeddings are Tensors
When a text chunk goes through an embedding model (like text-embedding-3-small), the model outputs a sequence of numbers (e.g., 1536 floating-point numbers). This single vector is technically a 1D Tensor.
2. Batch Processing (2D Tensors)
When you build a RAG pipeline, you rarely embed one sentence at a time because it is highly inefficient. Instead, you send a batch of sentences to the embedding model.
- If you send a batch of 32 sentences, and each sentence yields a 1536-dimensional vector, the embedding model processes and returns a 2D Tensor with a shape of (32, 1536).
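To make the shapes concrete, here is a small sketch using plain nested lists. The helper `shape_of` is hypothetical and exists only so the example runs without extra libraries; NumPy arrays and PyTorch tensors expose the same information through their `.shape` attribute. The zeros are placeholders, not real model output.

```python
def shape_of(tensor):
    """Return the shape of a regular nested list, e.g. [[1, 2], [3, 4]] gives (2, 2)."""
    if not isinstance(tensor, list):
        return ()  # a bare number is a rank 0 tensor
    if not tensor:
        return (0,)
    inner_shapes = {shape_of(item) for item in tensor}
    if len(inner_shapes) != 1:
        raise ValueError("ragged nesting: rows have different shapes")
    return (len(tensor),) + inner_shapes.pop()


EMBEDDING_DIM = 1536  # dimensionality quoted in the text above
BATCH_SIZE = 32

# Placeholder values standing in for real embeddings.
single_embedding = [0.0] * EMBEDDING_DIM
batch = [[0.0] * EMBEDDING_DIM for _ in range(BATCH_SIZE)]

scalar_shape = shape_of(5)
vector_shape = shape_of(single_embedding)
batch_shape = shape_of(batch)
```

A single embedding has one axis of length 1536, and a batch of 32 of them has two axes, 32 by 1536. The `ValueError` branch is the toy equivalent of a shape mismatch: a batch whose rows do not all have the same length.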
Tensor Hierarchy
| Scalar (0D) | Vector (1D) | Matrix (2D) |
|---|---|---|
| [ 5 ] | [ 1.2, 3.4, -0.1 ] | [ [1, 2], [3, 4] ] |
| A single value | An Embedding Vector | A Batch of Embeddings |
Note
In PyTorch, TensorFlow, and NumPy, a very common source of bugs is shape mismatches—where tensor or array dimensions don’t align with the expected mathematical operations. Checking .shape (and sometimes .ndim) is one of the most important debugging techniques.
Metadata
What is Metadata?
Metadata is "data about data." When you chunk a document, you get raw text strings. Metadata is the structured dictionary (usually key-value pairs) paired with each chunk.
{
  "chunk_id": "chunk_942",
  "text": "To reset the corporate router, hold the pinhole button for 10 seconds...",
  "metadata": {
    "source_file": "it_manual_2026.pdf",
    "page_number": 42,
    "department": "IT Support",
    "security_clearance": "internal",
    "last_updated": "2026-01-15"
  }
}
Why is it Useful?
Pure semantic search has major blind spots. For example, if a user asks, "What were our Q4 revenues in the 2025 financial report?", a vector database might return the 2023 or 2024 reports because the semantic meaning of "Q4 revenues" is nearly identical across all of them.
Metadata solves this by combining semantic search with strict, structured filtering:
- Hard Filtering: It allows you to weed out irrelevant data before or during the vector search (e.g., "Only search documents where department == 'IT Support'").
- Time Awareness: Vectors don't understand chronology well. Metadata lets you sort results by last_updated so the LLM gets the freshest information.
- Access Control & Security: You can filter chunks based on the user's permissions (security_clearance), ensuring an employee can't retrieve executive-level data via the RAG chat.
- Source Citations: It allows the LLM application to say, "Here is your answer (Source: it_manual_2026.pdf, Page 42)", building user trust.
How to Handle Metadata in a RAG Pipeline
Managing metadata happens at two critical stages: Ingestion (Embedding) and Retrieval (Querying). There are a few key things to take care of.
1. During Ingestion & Embedding
When parsing and chunking documents, you must extract and inject metadata before saving to the vector database.
- Inherit Global Attributes: Every chunk generated from a specific file should automatically copy that file's global properties (e.g., author, url, created_date).
- Capture Local Context: Use your chunking script to track position-aware metadata, like page_number, heading_level_1 (to know what chapter the chunk belongs to), or preceding section headers.
- The "Metadata Enrichment" Trick: Many advanced pipelines append critical metadata directly into the text string before generating the embedding vector.
Example: "Document: IT Manual | Section: Router Reset | Text: To reset the corporate router..."
This ensures the embedding model explicitly bakes the document context right into the mathematical vector.
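The three ingestion practices above can be sketched together as follows. The function `build_records`, the `(heading, page_number, text)` tuple format, and the `text_to_embed` key are all hypothetical choices made for this illustration; the actual shape depends on your parser and vector database.

```python
def build_records(file_metadata, sections):
    """Attach inherited and position-aware metadata to each chunk of one file.

    `sections` is a list of (heading, page_number, text) tuples, a hypothetical
    output format for whatever parser and chunker the pipeline uses.
    """
    records = []
    for position, (heading, page_number, text) in enumerate(sections):
        # Inherit global attributes: copy so chunks do not share one dict.
        metadata = dict(file_metadata)
        # Capture local context for this specific chunk.
        metadata.update(
            {"section": heading, "page_number": page_number, "chunk_index": position}
        )
        # Metadata enrichment: this string, not the bare text, gets embedded.
        text_to_embed = (
            f"Document: {file_metadata['title']} | Section: {heading} | Text: {text}"
        )
        records.append(
            {"text": text, "text_to_embed": text_to_embed, "metadata": metadata}
        )
    return records


file_metadata = {
    "title": "IT Manual",
    "source_file": "it_manual_2026.pdf",
    "department": "IT Support",
}
sections = [
    ("Router Reset", 42, "To reset the corporate router, hold the pinhole button for 10 seconds..."),
]
records = build_records(file_metadata, sections)
```

Keeping the original `text` alongside `text_to_embed` is a deliberate choice in this sketch: the enriched string is used only for embedding, while the clean text is what you would later show to the LLM or the user.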
2. During Retrieval & Querying
When a user submits a query, you use metadata to narrow down the search space. There are two primary architectural patterns for this:
Pattern A: Pre-Filtering (Recommended)
You apply a strict metadata filter at the same time or immediately before the vector similarity calculation.
How it works: You tell the vector database: "Filter the database to only include rows where year == 2025, and then perform a vector search for 'Q4 revenues' among those rows." Most modern vector stores (Milvus, Qdrant, Pinecone, pgvector) support filtered search natively, though how efficiently the filter is applied varies by engine and index type.
Pattern B: Post-Filtering
You run a broad vector search first, fetch the top 100 results, and then write code (e.g., in Python) to loop through those 100 results and discard chunks that don't match your metadata criteria.
The Risk: If the matching documents did not make it into the top 100, post-filtering leaves you with too few results or none at all. Avoid this pattern for strict criteria like security permissions, where non-matching chunks should never be fetched in the first place.
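The difference between the two patterns is easy to see in a toy in-memory example. Everything here is an assumption made for illustration: the rows, the two-dimensional vectors, and the brute-force `search` helper stand in for a real vector database and its index.

```python
def dot(a, b):
    """Dot product, used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, rows, top_k):
    """Brute-force similarity search over the given rows."""
    return sorted(rows, key=lambda row: dot(query_vec, row["vector"]), reverse=True)[:top_k]


def pre_filter_search(query_vec, rows, predicate, top_k):
    """Pattern A: restrict the candidate set first, then rank by similarity."""
    allowed = [row for row in rows if predicate(row["metadata"])]
    return search(query_vec, allowed, top_k)


def post_filter_search(query_vec, rows, predicate, top_k):
    """Pattern B: rank everything, keep top_k, then discard non-matching rows."""
    return [row for row in search(query_vec, rows, top_k) if predicate(row["metadata"])]


def is_2025(metadata):
    return metadata["year"] == 2025


# Made-up vectors: the older reports happen to sit closest to the query.
rows = [
    {"id": "q4_2023", "vector": [0.99, 0.10], "metadata": {"year": 2023}},
    {"id": "q4_2024", "vector": [0.98, 0.12], "metadata": {"year": 2024}},
    {"id": "q4_2025", "vector": [0.90, 0.30], "metadata": {"year": 2025}},
]
query_vec = [1.0, 0.0]

pre = pre_filter_search(query_vec, rows, is_2025, top_k=2)
post = post_filter_search(query_vec, rows, is_2025, top_k=2)
```

With `top_k=2`, pre-filtering returns the 2025 report, while post-filtering returns nothing because both top slots were taken by the 2023 and 2024 reports before the filter ran.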
Modern Advanced Pattern: Auto-Retrieval / Self-Querying
A major trend in RAG design is using an LLM to dynamically generate metadata filters from natural language.
If a user types: "Show me the security protocols updated after February 1, 2026."
An LLM is placed in front of the vector database to parse the sentence into a structured query payload:
{
  "query": "security protocols",
  "filter": {
    "last_updated": {
      "$gt": "2026-02-01"
    }
  }
}
The vector database executes this payload directly, so the date constraint is applied exactly instead of being approximated by similarity, and the user never has to fill out a complex search form.
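Because an LLM can misread a request or emit an unexpected field, it is usually worth checking the generated payload before it reaches the database. The sketch below assumes a tiny allow-list (`ALLOWED_FIELDS`, `ALLOWED_OPERATORS`) and ISO-formatted date strings; both are hypothetical choices for this example, and `matches` only imitates what the database's own filter engine would do.

```python
import json
from datetime import date

# Hypothetical schema for this sketch: the only field and operators we accept.
ALLOWED_FIELDS = {"last_updated"}
ALLOWED_OPERATORS = {"$gt", "$lt"}


def parse_filter(raw_payload):
    """Parse the LLM's JSON output and reject anything outside the allow-list."""
    payload = json.loads(raw_payload)
    for field, condition in payload.get("filter", {}).items():
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unexpected filter field: {field}")
        for operator, value in condition.items():
            if operator not in ALLOWED_OPERATORS:
                raise ValueError(f"unexpected operator: {operator}")
            date.fromisoformat(value)  # raises ValueError on a malformed date
    return payload


def matches(metadata, filters):
    """Evaluate the validated filter against one chunk's metadata."""
    for field, condition in filters.items():
        actual = date.fromisoformat(metadata[field])
        for operator, value in condition.items():
            bound = date.fromisoformat(value)
            if operator == "$gt" and not actual > bound:
                return False
            if operator == "$lt" and not actual < bound:
                return False
    return True


llm_output = '{"query": "security protocols", "filter": {"last_updated": {"$gt": "2026-02-01"}}}'
payload = parse_filter(llm_output)
```

Comparing parsed `date` objects rather than raw strings keeps the check correct even if a later schema allows other date formats to be normalized first.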
Multi-Tenancy
What is Multi-Tenancy in RAG?
If you are building a SaaS product—for instance, an AI-powered legal assistant used by 500 different law firms—you have two extreme choices:
- Spin up 500 individual servers and 500 separate vector databases (Highly secure, but incredibly expensive and hard to maintain).
- Host one central app and database infrastructure that all 500 firms share seamlessly (Inexpensive, but requires flawless code-level architecture to prevent data leaks).
Opting for the second approach means building a Multi-Tenant RAG pipeline. It ensures that when an employee from Firm A asks a question, the vector database limits its mathematical search exclusively to Firm A’s specific chunks.
Why is it Useful?
Multi-tenancy is a non-negotiable requirement for commercial SaaS applications and large enterprise systems for several key reasons:
- Massive Cost Optimization: Running separate vector database instances for hundreds of small clients results in tremendous resource waste (idle RAM, CPU, and cloud compute fees). Sharing the infrastructure pools resources efficiently.
- Simplified Operations & CI/CD: Updating a single multi-tenant pipeline code infrastructure instantly upgrades the experience for all users, rather than forcing you to manage and maintain hundreds of isolated container deployments.
- Enterprise Compliance and Security: Many industries (healthcare, finance, legal) legally require explicit data separation. Multi-tenant architecture allows you to achieve compliance without breaking the bank.
How to Handle Multi-Tenancy in a RAG Pipeline
Implementing multi-tenancy inside a vector database is fundamentally different from a standard SQL database. Because vector indices (like HNSW graphs) are built globally across a data pool, standard queries can accidentally traverse neighbor nodes belonging to other tenants.
There are three primary architectural patterns to handle this, ranging from soft logical separation to hard physical isolation.
Pattern 1: Metadata Filter-Based Isolation (Logical Separation)
In this approach, you store all text chunks from all tenants inside a single, giant vector index. You assign a specific tenant_id to the metadata dictionary of every chunk.
- During Ingestion:
{
  "text": "...internal text...",
  "metadata": {
    "tenant_id": "tenant_123"
  }
}
- During Retrieval:
When a user queries the system, your application layer automatically appends a strict metadata filter behind the scenes:
{
  "vector": [0.12, -0.43, "..."],
  "filter": {
    "tenant_id": "tenant_123"
  }
}
- The Catch: If the vector database performs "post-filtering" (searching the whole database first, then discarding other tenants), it can result in terrible performance or empty results. Ensure your database natively supports Pre-Filtering or Single-Stage Filtering to lock down the search path during the graph traversal.
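On the application side, the key safeguard is that the tenant filter is added by your code and can never be supplied or overridden by the caller. A minimal sketch, assuming a hypothetical `build_tenant_query` helper and a `tenant_id` taken from the authenticated session:

```python
def build_tenant_query(query_vector, tenant_id, user_filter=None):
    """Build a query payload whose tenant_id always comes from the authenticated session."""
    if not tenant_id:
        raise ValueError("refusing to query without a tenant_id")
    combined = dict(user_filter or {})
    # Applied last so a caller-supplied "tenant_id" can never override the real one.
    combined["tenant_id"] = tenant_id
    return {"vector": list(query_vector), "filter": combined}


# A caller tries to smuggle in another tenant's id through the optional filter.
payload = build_tenant_query(
    [0.12, -0.43],
    "tenant_123",
    {"tenant_id": "tenant_999", "department": "Legal"},
)
```

Failing loudly when `tenant_id` is missing is deliberate: an unfiltered query against a shared index is exactly the bug this pattern is exposed to.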
Pattern 2: Namespace / Partition Isolation (Virtual Separation)
Many dedicated vector databases (like Pinecone, Qdrant, or Milvus) offer namespaces, partitions, or partition keys. These let you split a single index into virtual, isolated compartments.
- How it works: Chunks are completely partitioned at the storage layer inside the same index instance. When a user queries a specific namespace, the search algorithm physically cannot step outside that boundary.
- Pros: Highly performant, eliminates cross-tenant data leaks at the database query level, and allows you to wipe out an entire client's data instantly by deleting their namespace.
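The behavior can be imitated with a toy in-memory store. The `NamespacedStore` class below is purely illustrative and does not mirror any specific database's API; it only shows why a namespaced search cannot return another tenant's rows and why offboarding a tenant is a single operation.

```python
class NamespacedStore:
    """Toy in-memory store with one isolated list of rows per namespace."""

    def __init__(self):
        self._namespaces = {}

    def upsert(self, namespace, row):
        self._namespaces.setdefault(namespace, []).append(row)

    def search(self, namespace, query_vec, top_k):
        # Only the requested namespace is ever scanned.
        rows = self._namespaces.get(namespace, [])

        def score(row):
            return sum(q * v for q, v in zip(query_vec, row["vector"]))

        return sorted(rows, key=score, reverse=True)[:top_k]

    def delete_namespace(self, namespace):
        """Offboard a tenant by dropping its whole partition."""
        self._namespaces.pop(namespace, None)


store = NamespacedStore()
store.upsert("firm_a", {"id": "a1", "vector": [1.0, 0.0]})
store.upsert("firm_b", {"id": "b1", "vector": [1.0, 0.0]})

hits = store.search("firm_a", [1.0, 0.0], top_k=5)
```

Even though both firms store an identical vector, a search in `firm_a` only ever sees `a1`.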
Pattern 3: Database-Level Separation (Physical Isolation)
For high-value, high-security enterprise clients, you use a dedicated infrastructure silo. Each tenant receives a completely unique database instance, cluster, or isolated collection.
- Pros: Complete isolation. Zero chance of data bleed. It also solves the "Noisy Neighbor" problem (where one tenant making millions of heavy API calls slows down the database for everyone else).
- Cons: Extremely expensive to scale and complex to orchestrate.
Architectural Summary
| Isolation Strategy | Cost Efficiency | Security/Isolation | Best For |
|---|---|---|---|
| Metadata Filtering | High | Medium (Risky if code bugs occur) | B2C apps or large pools of small users |
| Namespaces / Partitions | Balanced | High | Standard B2B SaaS products |
| Separate Databases | Low (Expensive) | Maximum | Enterprise / Gov clients with strict SLAs |