Vector DB

Types of Vectors

In the context of machine learning and vector databases, vectors can be categorized in several ways depending on the type of data they represent, their mathematical density, their role within a search system, and their compression state.

These are the different types of vectors you will encounter in these systems:

1. Types by Density (Dense vs. Sparse)

Modern search systems often utilize a combination of two distinct vector types to maximize retrieval accuracy, an approach known as hybrid search.

Dense Vectors: These are standard, high-dimensional feature vectors generated by embedding models (such as OpenAI's embedding models or other neural networks). They are considered "dense" because most of their dimensions contain non-zero floating-point numbers. Dense vectors excel at capturing abstract semantic meaning and intent, allowing systems to find relevant results even if exact keywords are not used.
Sparse Vectors: These vectors represent traditional keyword-based search techniques (such as TF-IDF or BM25 term weighting). They are "sparse" because they might have tens of thousands of dimensions (one per word in the vocabulary), but only a tiny fraction contain non-zero values (the specific words present in a document). Sparse vectors are excellent for exact keyword matching, and their scores are often combined with dense-vector scores to rank results.
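
As a rough illustration of how the two vector types can be scored together, here is a minimal sketch in plain Python. The sparse vector is stored as a dictionary of non-zero dimensions only. The function names, the sample values, and the weighting parameter alpha are all assumptions made for this example; real systems typically normalize the two scores or fuse ranked lists rather than adding raw scores like this.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {dimension_index: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dictionary
    return sum(weight * b[i] for i, weight in a.items() if i in b)

def hybrid_score(dense_q, dense_d, sparse_q, sparse_d, alpha=0.7):
    """Weighted sum of dense and sparse scores (alpha is a hypothetical knob)."""
    dense_part = cosine_similarity(dense_q, dense_d)
    sparse_part = sparse_dot(sparse_q, sparse_d)
    return alpha * dense_part + (1 - alpha) * sparse_part

# Dense vectors: every dimension holds a value.
dense_query = [0.1, 0.8, -0.3, 0.5]
dense_doc = [0.2, 0.7, -0.1, 0.4]

# Sparse vectors: only the non-zero dimensions (vocabulary indices) are stored.
sparse_query = {17: 1.0, 4021: 1.0}
sparse_doc = {17: 0.5, 998: 0.3, 4021: 0.2}

score = hybrid_score(dense_query, dense_doc, sparse_query, sparse_doc)
print(f"hybrid score: {score:.3f}")
```

Storing only the non-zero entries is what keeps a vocabulary-sized sparse vector cheap to hold and compare.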

2. Types by Data Modality (Domain-Specific Embeddings)

Embedding models can be trained to generate feature vectors for highly specific types of unstructured data. Common modalities include:

Text/Word Embeddings: Used in Natural Language Processing (NLP) to represent words, sentences, or entire documents (e.g., Word2Vec).
Image Embeddings: Used in computer vision systems. Images are converted into feature vectors, either by neural networks or by classic hand-crafted descriptors such as SIFT or GIST, to enable reverse image search and pattern recognition.
Audio/Music Embeddings: Feature vectors that capture audio characteristics, allowing systems to group similar tracks together for music recommendation.
Knowledge Graph Embeddings: Used to map entities and their relationships, often applied in recommendation systems or anomaly detection.

3. Types by System Role

During the embedding and search workflow, systems manipulate vectors differently depending on what they are trying to accomplish.

Data Vectors (or Database Vectors): The primary feature vectors representing the actual entities (documents, images, etc.) stored within the database.
Query Vectors: The vector generated from a user's search input. The system compares this query vector against the stored data vectors to find the nearest neighbors.
Residual Vectors: When databases use clustering techniques (like Inverted File Indexing, or IVF), they group vectors around central points called "centroids." A residual vector is the element-wise difference between a specific data vector and its cluster's centroid (the vector minus the centroid). Tracking residuals is crucial for advanced optimizations like Product Quantization combined with IVF (IVFADC) and redundant assignment strategies.
Sub-vectors: To compress massive datasets, systems using Product Quantization split high-dimensional data vectors into smaller segments, or sub-vectors. Each sub-vector covers a contiguous slice of the dimensions and is quantized independently, which radically shrinks the database's memory footprint.
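
The residual and sub-vector ideas can be sketched as follows. This is an illustrative example only: the function names, the sample vector, the centroid, and the choice of four segments are assumptions, and a real Product Quantization implementation would go on to replace each sub-vector with the ID of its nearest codebook entry.

```python
def residual(vector, centroid):
    """Element-wise difference between a data vector and its centroid."""
    return [v - c for v, c in zip(vector, centroid)]

def split_into_subvectors(vector, m):
    """Split a vector into m contiguous sub-vectors of equal length."""
    if len(vector) % m != 0:
        raise ValueError("vector length must be divisible by m")
    size = len(vector) // m
    return [vector[i:i + size] for i in range(0, len(vector), size)]

# Hypothetical 8-dimensional data vector and the centroid of its cluster.
data_vector = [1.5, 0.25, 0.5, 0.75, 0.0, 1.0, 0.5, 0.25]
centroid = [1.0, 0.0, 0.5, 0.5, 0.0, 1.0, 0.5, 0.5]

res = residual(data_vector, centroid)
sub_vectors = split_into_subvectors(res, m=4)

print(res)          # offsets from the centroid, mostly small values
print(sub_vectors)  # four 2-dimensional segments, each quantized separately in PQ
```

Residuals tend to cluster near zero, which is why quantizing them usually loses less information than quantizing the original vectors.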

4. Types by Compression State (Quantized Vectors)

Because raw, high-dimensional vectors consume massive amounts of memory, databases frequently compress them into specialized formats known as quantized vectors.

Float32 Vectors (Raw): The default output of most embedding models, where each dimension is a 32-bit floating-point number. A single 1536-dimensional vector in this format requires about 6 KB of memory.
Scalar Quantized Vectors (Int8): The float32 dimensions are mapped into a smaller range of 8-bit integers (values from -128 to 127). This type of vector uses 75% less memory while maintaining high accuracy, and distance calculations are computationally cheaper.
Binary Quantized Vectors: This is the most extreme compression, where each vector dimension is reduced to a single bit. Values greater than zero become 1, and values less than or equal to zero become 0. A 1536-dimensional binary vector requires only 192 bytes (a 32x memory reduction compared to float32) and can be searched up to 40 times faster using highly optimized CPU instructions, though binary quantization is generally effective only for embeddings with 1024 dimensions or more.
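
The memory figures above, and the two quantization schemes, can be checked with a short sketch. The function names and the fixed value range passed to scalar_quantize are assumptions for illustration; production systems usually derive the range from the data (for example from observed minimum and maximum values or quantiles) and pack the bits into bytes so that distances can be computed with XOR and popcount.

```python
def float32_bytes(dims):
    return dims * 4            # 32 bits = 4 bytes per dimension

def int8_bytes(dims):
    return dims                # 1 byte per dimension

def binary_bytes(dims):
    return (dims + 7) // 8     # 1 bit per dimension, rounded up to whole bytes

def scalar_quantize(vector, lo, hi):
    """Linearly map floats in [lo, hi] onto int8 values in [-128, 127]."""
    scale = 255 / (hi - lo)
    quantized = []
    for x in vector:
        q = round((x - lo) * scale) - 128
        quantized.append(max(-128, min(127, q)))  # clamp out-of-range inputs
    return quantized

def binary_quantize(vector):
    """Values greater than zero become 1; everything else becomes 0."""
    return [1 if x > 0 else 0 for x in vector]

def hamming_distance(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

dims = 1536
print(float32_bytes(dims))                       # 6144 bytes, about 6 KB
print(int8_bytes(dims))                          # 1536 bytes, 75% smaller
print(binary_bytes(dims))                        # 192 bytes, 32x smaller
print(scalar_quantize([-1.0, 0.0, 1.0], -1.0, 1.0))
print(binary_quantize([0.3, -0.2, 0.0, 1.5]))
```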

Search Algorithms

To find similar vectors, a system could compare a query to every single stored vector (Exact Search), but this becomes prohibitively slow with millions of high-dimensional records. Instead, modern systems use Approximate Nearest Neighbor (ANN) search algorithms, which organize the data to drastically narrow down the search space, trading a small amount of accuracy (some true nearest neighbors may be missed) for massive speed gains.
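
For reference, exact search is simple to write down, which makes its cost easy to see: every query touches every stored vector. The sketch below uses Euclidean distance and hypothetical sample data; the function names are chosen for this example only.

```python
import math

def euclidean(a, b):
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_search(query, data_vectors, k):
    """Return the indices of the k stored vectors closest to the query."""
    scored = [(euclidean(query, v), i) for i, v in enumerate(data_vectors)]
    scored.sort()  # sorts by distance first, then by index on ties
    return [i for _, i in scored[:k]]

data_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
query_vector = [1.0, 1.0]

print(exact_search(query_vector, data_vectors, k=2))
```

The work grows linearly with both the number of stored vectors and their dimensionality, which is the cost that ANN indexes are designed to avoid.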

These are some primary algorithms used to index and search vectors:

Graph-based (HNSW): Hierarchical Navigable Small World (HNSW) is one of the most widely used ANN index algorithms. It organizes vectors into a multi-layered graph. The search starts at the sparse top layer, using long "highway" links for a fast, broad overview, and progressively drops to lower, denser layers to finely navigate to the closest matches.
Inverted File (IVF) / Clustering: This method groups vectors into distinct clusters around central points (centroids) using algorithms like k-means. When a query comes in, the system determines which centroid is closest and only searches the vectors within that cluster. Many implementations also search a few neighboring clusters to reduce the chance of missing a match that sits near a cluster boundary.
Hashing-based (LSH): Locality-Sensitive Hashing uses specialized hash functions designed so that similar vectors are highly likely to be assigned the same hash value. This maps them into the same discrete "buckets" for near-instant retrieval.
Tree-based: Algorithms like k-d trees, and libraries like Annoy, recursively split the vector data into branches, like a flowchart. While efficient for smaller datasets, they can struggle to scale in very high-dimensional spaces due to the "curse of dimensionality," where distance metrics become less reliable.
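
To make the clustering approach more concrete, here is a minimal IVF-style sketch. It assumes the centroids have already been computed (for example by k-means, which is not shown), and it uses a hypothetical nprobe parameter, a name borrowed from libraries such as Faiss, to control how many clusters are searched. The data and function names are made up for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_centroids(vector, centroids, n):
    """Indices of the n centroids closest to the vector."""
    order = sorted(range(len(centroids)),
                   key=lambda c: squared_distance(vector, centroids[c]))
    return order[:n]

def build_ivf_index(data_vectors, centroids):
    """Assign each vector to its nearest centroid's inverted list."""
    inverted_lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(data_vectors):
        closest = nearest_centroids(v, centroids, 1)[0]
        inverted_lists[closest].append(i)
    return inverted_lists

def ivf_search(query, data_vectors, centroids, inverted_lists, k, nprobe=1):
    """Search only the vectors stored in the nprobe closest clusters."""
    candidates = []
    for c in nearest_centroids(query, centroids, nprobe):
        candidates.extend(inverted_lists[c])
    candidates.sort(key=lambda i: squared_distance(query, data_vectors[i]))
    return candidates[:k]

# Assumed to come from a prior k-means run.
centroids = [[0.0, 0.0], [10.0, 10.0]]
data_vectors = [[0.5, 0.2], [1.0, 1.5], [9.0, 9.5], [10.5, 10.0], [4.9, 4.9]]
index = build_ivf_index(data_vectors, centroids)

query = [5.2, 5.2]
print(ivf_search(query, data_vectors, centroids, index, k=1, nprobe=1))
print(ivf_search(query, data_vectors, centroids, index, k=1, nprobe=2))
```

With this data the two calls return different results: probing one cluster misses the true nearest neighbor, which sits just across the cluster boundary, while probing both clusters finds it. That is the accuracy-for-speed trade in miniature.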

Types of Vector Databases

Vector databases generally fall into three main architectural categories: Native, Extended, and Embedded.

1. Native Vector Databases

These are built from the ground up specifically to manage, search, and scale vector data.

Managed Cloud (SaaS): Databases like Pinecone keep operational overhead to a minimum because they are fully managed, serverless, and scale automatically.
Dedicated Open-Source: Systems like Qdrant (built in Rust for high performance), Milvus (designed for enterprise-scale workloads with billions of vectors), and Weaviate (which includes built-in embedding generation modules) give you full control to self-host or use managed cloud versions.

2. Extended Databases

These are traditional databases that have added vector search capabilities. They allow you to store embeddings right alongside your regular application data, meaning you don't have to manage a separate standalone vector service.

Relational (SQL): The most popular is pgvector, a PostgreSQL extension that lets you use standard SQL to query and filter vectors transactionally. Other examples include SingleStore and ClickHouse.
NoSQL: Systems like MongoDB, Cassandra, and Redis have added vector indexing and search to their existing platforms.

3. Embedded Databases

These run directly inside your application's process rather than requiring a separate running server. They are ideal for local development, edge computing, and rapid prototyping.

Examples: Chroma offers a very simple API for Python and JavaScript prototyping, while LanceDB provides zero-copy, columnar storage for fast local workloads.

References / Resources

https://www.cloudflare.com/en-gb/learning/ai/what-is-vector-database/
https://www.youtube.com/watch?v=gl1r1XV0SLw
List of popular vector DBs: https://cookbook.openai.com/examples/vector_databases/readme