Vector similarity search is the retrieval operation at the heart of modern AI pipelines. When an embedding model converts text, images, audio, or structured data into a numerical vector, semantic meaning is encoded as geometry — similar concepts land near each other in high-dimensional space. Vector similarity search answers the question: given a query vector, which stored vectors are closest to it?
This is the engine behind RAG (retrieving relevant document chunks for an LLM), semantic search (finding conceptually related results rather than keyword matches), recommendation systems (finding items similar to what a user engaged with), and multimodal search (querying images with text).
What Is a Vector?
An embedding model outputs a vector: a list of numbers, typically with 384 to 4,096 dimensions, where each dimension captures some learned feature of the input. Two pieces of text with similar meaning will produce vectors that point in similar directions in this high-dimensional space. Similarity search is the process of finding the stored vectors that lie closest to a query vector under a chosen metric.
Distance and Similarity Metrics
The choice of metric determines what “close” means. Different metrics suit different embedding models and use cases.
Cosine Similarity
Measures the angle between two vectors, ignoring their magnitudes entirely.
cosine_similarity(A, B) = (A · B) / (|A| × |B|)
Range: −1 (opposite) to 1 (identical direction). A score of 0 means orthogonal (unrelated).
When to use: Text embeddings, NLP, document similarity. Cosine similarity is the default for most sentence transformer models because text vectors vary in magnitude based on sentence length — you want to compare direction (meaning), not scale. It is the most widely used metric in semantic search.
When not to use: When magnitude carries meaningful information (e.g., frequency-weighted vectors where a vector twice as large means twice as important).
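As a minimal sketch of the formula above, the function below computes cosine similarity in plain Python. The example vectors are made up for illustration; real embeddings would come from a model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Illustrative vectors (not real embeddings).
short = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]     # same direction, ten times the magnitude
opposite = [-1.0, -2.0, -3.0]   # exactly the opposite direction
orthogonal = [3.0, 0.0, -1.0]   # dot product with `short` is zero

print(cosine_similarity(short, scaled))      # close to 1.0
print(cosine_similarity(short, opposite))    # close to -1.0
print(cosine_similarity(short, orthogonal))  # 0.0
```

Scaling a vector leaves the score unchanged, which is the property that makes cosine a good fit when magnitude is noise rather than signal.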
Dot Product (Inner Product)
The raw sum of element-wise products of two vectors:
dot_product(A, B) = Σ (Aᵢ × Bᵢ)
Relationship to cosine: For unit-normalised vectors (magnitude = 1), dot product equals cosine similarity exactly. Most modern embedding models normalise their outputs, making dot product and cosine interchangeable — and dot product faster to compute since it skips the normalisation step.
When to use: Normalised embeddings, maximum inner product search (MIPS), and recommendation systems where you want to reward both similarity of direction and magnitude. OpenAI’s text-embedding-3 models return unit-length vectors, so dot product gives the same ranking as cosine similarity while skipping the normalisation step.
When not to use: Unnormalised vectors where magnitude differences would dominate the score.
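The relationship between dot product and cosine can be checked numerically. The sketch below uses small hand-picked vectors (an assumption for illustration) to show that the two scores coincide after normalisation, and that on raw vectors a long, poorly aligned vector can outscore a perfectly aligned one.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot_product(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot_product(a, b) / math.sqrt(dot_product(a, a) * dot_product(b, b))

a = [3.0, 4.0]
b = [4.0, 3.0]
raw = dot_product(a, b)                            # 24.0
cos = cosine_similarity(a, b)                      # 24 / 25
unit = dot_product(normalise(a), normalise(b))     # matches cos up to rounding

# On unnormalised vectors, magnitude can dominate the score.
query = [1.0, 0.0]
aligned = [1.0, 0.0]    # same direction as the query
long_off = [5.0, 5.0]   # 45 degrees off, but much longer
print(dot_product(query, aligned), dot_product(query, long_off))  # 1.0 5.0
```

Here `long_off` wins under raw dot product even though its cosine with the query is only about 0.71.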
Euclidean Distance (L2)
The straight-line distance between two points in space:
euclidean(A, B) = √( Σ (Aᵢ − Bᵢ)² )
Unlike cosine similarity, Euclidean distance is sensitive to vector magnitude — two vectors pointing in the same direction but with different magnitudes are not considered close.
When to use: Image embeddings, clustering tasks, any domain where absolute position in the embedding space (not just direction) carries meaning. Also the correct metric when your embedding model was trained with an L2 objective.
When not to use: Text similarity with variable-length inputs, where magnitude varies with length rather than meaning.
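A short sketch of the magnitude sensitivity described above, using toy two-dimensional vectors chosen for illustration. It also checks the identity that holds for unit-length vectors, where squared Euclidean distance equals 2 − 2 × cosine.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 0.0]
same_direction = [10.0, 0.0]   # cosine with `a` is 1, but far away in L2
orthogonal = [0.0, 1.0]        # cosine with `a` is 0, but nearby in L2

print(euclidean(a, same_direction))  # 9.0
print(euclidean(a, orthogonal))      # about 1.414

# For unit vectors, squared L2 distance equals 2 - 2 * cosine.
u = [0.6, 0.8]
v = [1.0, 0.0]
cosine_uv = 0.6                # dot product of two unit vectors
print(euclidean(u, v) ** 2, 2 - 2 * cosine_uv)  # both about 0.8
```

Under L2 the orthogonal vector counts as the nearer neighbour of `a`, which is the opposite of what cosine would report.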
Manhattan Distance (L1)
The sum of absolute differences across each dimension — the distance you’d travel if you could only move along grid lines:
manhattan(A, B) = Σ |Aᵢ − Bᵢ|
When to use: Very high-dimensional spaces where Euclidean distance suffers from the “curse of dimensionality” (distances between points concentrate, so near and far neighbours become hard to tell apart). Manhattan distance tends to degrade more gracefully in extreme dimensions and is slightly cheaper to compute since it avoids squaring. It is also more robust to outlier dimensions: a single very different feature does not dominate the total distance as dramatically as it would in L2.
When not to use: When geometric accuracy matters more than speed or robustness; Euclidean distance is the more natural measure of “how far apart” two points actually are.
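The outlier behaviour can be seen with a small, contrived example. In the sketch below (vectors invented for illustration), one candidate differs from the query a little in every dimension and the other differs a lot in a single dimension; L1 and L2 disagree about which is closer.

```python
import math

def manhattan(a, b):
    """Sum of absolute per-dimension differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.0, 0.0, 0.0, 0.0]
spread = [2.0, 2.0, 2.0, 2.0]   # small difference in every dimension
spike = [0.0, 0.0, 0.0, 5.0]    # one large difference, otherwise identical

print(manhattan(query, spread), manhattan(query, spike))  # 8.0 5.0
print(euclidean(query, spread), euclidean(query, spike))  # 4.0 5.0
```

Squaring makes the single large difference in `spike` count for more under L2, so L2 ranks `spread` first while L1 ranks `spike` first.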
Hamming Distance
Counts the number of positions where two binary vectors differ. Primarily used for binary embeddings — compressed representations where each dimension is a single bit.
When to use: Large-scale retrieval with binary quantized embeddings, perceptual hashing, deduplication.
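A minimal sketch of Hamming distance over packed bits. The sign-based `binarise` helper is an assumption for illustration (one simple way to turn a float vector into bits), not a scheme prescribed by any particular model.

```python
def pack_bits(bits):
    """Pack an iterable of booleans into a single Python int."""
    value = 0
    for bit in bits:
        value = (value << 1) | (1 if bit else 0)
    return value

def binarise(vector):
    """Hypothetical sign-based quantisation: one bit per dimension."""
    return pack_bits(x > 0 for x in vector)

def hamming(a, b):
    """Number of bit positions where two packed codes differ."""
    return bin(a ^ b).count("1")

code_a = binarise([0.3, -1.2, 0.8, 0.1])   # bits 1011
code_b = binarise([0.2, 0.5, 0.9, -0.4])   # bits 1110

print(hamming(code_a, code_b))  # 2
```

Because the comparison reduces to an XOR and a bit count, it is cheap enough to run over very large collections.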
Metric Selection Rule
Match the metric to the one used during embedding model training. Each model’s training objective shapes the geometry of its output space, and using the wrong metric can noticeably degrade retrieval quality. For example, a model trained with a cosine similarity objective can produce vectors of varying length; on those raw vectors, dot product and L2 may rank results differently from cosine. Once vectors are normalised to unit length, cosine, dot product, and L2 all yield the same ranking.
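To illustrate how much the metric choice can matter, the sketch below ranks three made-up document vectors against a query under cosine, dot product, and L2, first on the raw vectors and then after unit normalisation. The data is contrived so that the raw rankings disagree.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def cosine(a, b):
    return dot(unit(a), unit(b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank(query, docs, score, higher_is_better):
    """Document indices ordered from best to worst match."""
    return sorted(range(len(docs)),
                  key=lambda i: score(query, docs[i]),
                  reverse=higher_is_better)

query = [1.0, 0.0]
docs = [[1.0, 0.1], [5.0, 5.0], [0.2, 1.0]]   # illustrative, unnormalised

by_cosine = rank(query, docs, cosine, True)    # [0, 1, 2]
by_dot_raw = rank(query, docs, dot, True)      # [1, 0, 2]
by_l2_raw = rank(query, docs, l2, False)       # [0, 2, 1]

unit_docs = [unit(d) for d in docs]
by_dot_unit = rank(unit(query), unit_docs, dot, True)
by_l2_unit = rank(unit(query), unit_docs, l2, False)
print(by_cosine, by_dot_raw, by_l2_raw, by_dot_unit, by_l2_unit)
```

On the raw vectors each metric produces a different order; after normalisation all three agree with the cosine ranking.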
Exact vs. Approximate Search
Exact nearest neighbour search (brute-force scan of every vector) guarantees finding the true closest match but scales as O(n × d) — linear in the number of vectors and their dimensionality. This is only practical for small collections (under ~100K vectors).
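A brute-force scan is only a few lines. The sketch below assumes Euclidean distance and a tiny in-memory list of vectors; the function name `exact_top_k` is chosen here for illustration.

```python
import heapq
import math

def exact_top_k(query, vectors, k):
    """Scan every vector and return the k nearest as (distance, index) pairs.

    Cost is O(n * d): one distance computation per stored vector.
    """
    scored = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
results = exact_top_k([1.0, 1.0], vectors, k=2)
print([index for _, index in results])  # [1, 3]
```

Every query touches every vector, which is why this approach stops being practical as the collection grows.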
For larger collections, Approximate Nearest Neighbour (ANN) algorithms trade a small, controlled amount of recall for orders-of-magnitude faster queries.
HNSW (Hierarchical Navigable Small World)
The de facto standard in production vector databases (Pinecone, Weaviate, Qdrant, Chroma). HNSW builds a multi-layer probabilistic graph: upper layers hold sparse, long-range links (for fast global navigation) and lower layers hold dense, local links (for precise local search). Query time grows roughly logarithmically with collection size, about O(log n). It supports incremental updates without rebuilding the full index and delivers strong recall-latency trade-offs across most workloads.
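The sketch below is not a full HNSW implementation. It shows only the core idea on a single layer: a best-first walk over a neighbour graph that keeps the `ef` closest candidates seen so far. The graph is built by brute force for simplicity, and the names `build_knn_graph`, `graph_search`, `m`, and `ef` are assumptions chosen for this example; real HNSW adds the layer hierarchy and an incremental insertion procedure.

```python
import heapq
import math
import random

def build_knn_graph(vectors, m):
    """Link every vector to its m nearest neighbours (links are bidirectional)."""
    graph = {i: set() for i in range(len(vectors))}
    for i, v in enumerate(vectors):
        nearest = heapq.nsmallest(
            m + 1, range(len(vectors)), key=lambda j: math.dist(v, vectors[j])
        )
        for j in nearest:
            if j != i:
                graph[i].add(j)
                graph[j].add(i)
    return graph

def graph_search(query, vectors, graph, entry, k, ef):
    """Best-first search from `entry`; returns up to k indices, nearest first."""
    d0 = math.dist(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    best = [(-d0, entry)]        # max-heap via negation: the ef best so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                # nothing left can improve the result set
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(query, vectors[nb])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)   # drop the current worst
    ordered = sorted((-neg_d, i) for neg_d, i in best)
    return [i for _, i in ordered[:k]]

rng = random.Random(0)
vectors = [[rng.random(), rng.random()] for _ in range(200)]
graph = build_knn_graph(vectors, m=8)
query = [0.5, 0.5]
print("graph search:", graph_search(query, vectors, graph, entry=0, k=5, ef=20))
print("exact scan:  ", heapq.nsmallest(
    5, range(len(vectors)), key=lambda i: math.dist(query, vectors[i])))
```

The search only computes distances for nodes it actually visits, which is where the speed-up over a full scan comes from. Raising `ef` widens the search and generally improves recall at the cost of more distance computations.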
IVF (Inverted File Index) + PQ (Product Quantization)
Used in FAISS for billion-scale collections. IVF clusters vectors into Voronoi cells, and a query scans only the cells nearest to it rather than the full dataset. Product Quantization further compresses stored vectors into compact codes, typically reducing memory by 8–32×. The IVF-PQ combination is a common backbone of web-scale vector search, including at Meta, where FAISS was developed.
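The IVF half of this design can be sketched in plain Python. The example below trains cell centroids with a few rounds of k-means, assigns each vector to its nearest cell, and at query time scans only the `nprobe` closest cells. Product Quantization is left out, and the function names and parameters (`n_cells`, `nprobe`) are assumptions for this sketch rather than FAISS's API.

```python
import math
import random

def nearest_centroid(v, centroids):
    return min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))

def train_centroids(vectors, n_cells, iterations=10, seed=0):
    """Plain k-means (Lloyd's algorithm) to place one centroid per cell."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_cells)]
    for _ in range(iterations):
        members = [[] for _ in range(n_cells)]
        for v in vectors:
            members[nearest_centroid(v, centroids)].append(v)
        for c, group in enumerate(members):
            if group:  # keep the old centroid if a cell ends up empty
                centroids[c] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def build_ivf(vectors, centroids):
    """Inverted lists: for each cell, the indices of the vectors assigned to it."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k]

rng = random.Random(1)
vectors = [[rng.random(), rng.random()] for _ in range(300)]
centroids = train_centroids(vectors, n_cells=8)
lists = build_ivf(vectors, centroids)
query = [0.25, 0.75]
print(ivf_search(query, vectors, centroids, lists, k=5, nprobe=2))
```

Setting `nprobe` to the number of cells degenerates into an exact scan; smaller values trade recall for speed, since a true neighbour sitting just across a cell boundary may be missed.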
LSH (Locality-Sensitive Hashing)
Uses hash functions designed so that similar vectors land in the same bucket with high probability. It has lower memory overhead than HNSW but generally lower recall, and most production systems have migrated toward HNSW or IVF-PQ.
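One common LSH family for cosine similarity is random-hyperplane hashing, sketched below as an assumption about which variant to illustrate. Each random hyperplane contributes one bit, depending on which side of it the vector falls; vectors with a small angle between them tend to share most bits.

```python
import random
from collections import defaultdict

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_hash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its positive side."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )

def build_buckets(vectors, hyperplanes):
    """Group vector indices by hash code."""
    buckets = defaultdict(list)
    for i, v in enumerate(vectors):
        buckets[lsh_hash(v, hyperplanes)].append(i)
    return buckets

planes = make_hyperplanes(n_bits=8, dim=4)
rng = random.Random(2)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(100)]
buckets = build_buckets(vectors, planes)

query = [0.4, -1.3, 2.2, 0.7]
candidates = buckets.get(lsh_hash(query, planes), [])
print(len(buckets), "buckets;", len(candidates), "candidates for the query")
```

Only the vectors in the query's bucket are compared exactly, so a near neighbour that falls on the other side of even one hyperplane is missed. Production LSH schemes usually counter this with several independent hash tables.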
Practical Guide by Scale
| Collection size | Recommended approach |
|---|---|
| < 100K vectors | Exact search (brute force) — simple, perfectly accurate |
| 100K – 10M | HNSW — best recall/latency, supports live updates |
| 10M – 1B+ | IVF-PQ (FAISS) — memory-efficient, scales to web scale |
| Binary embeddings at any scale | Hamming distance + LSH or binary HNSW |
In Practice: RAG Pipelines
In a typical RAG pipeline, document chunks are embedded and stored in a vector database indexed with HNSW. At query time, the user’s question is embedded with the same model, and HNSW retrieves the k most similar chunks — measured by cosine or dot product — in milliseconds. Those chunks are then passed to the LLM as context.
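The retrieval step can be sketched end to end with stand-ins for the heavy components. In the example below, `toy_embed` is a hypothetical bag-of-words substitute for a real embedding model, the index is a plain list scanned by brute force rather than HNSW, and `build_prompt` is an illustrative helper, not any library's API.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: unit-length word counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}

def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

def retrieve(question, chunks, index, k):
    """Embed the question with the same model and return the k best chunks."""
    q = toy_embed(question)
    scored = sorted(((dot(q, vec), i) for i, vec in enumerate(index)),
                    reverse=True)
    return [chunks[i] for _, i in scored[:k]]

def build_prompt(question, context_chunks):
    context = "\n".join("- " + chunk for chunk in context_chunks)
    return "Context:\n" + context + "\n\nQuestion: " + question

chunks = [
    "hnsw builds a layered graph index",
    "cosine similarity compares vector direction",
    "product quantization compresses stored vectors",
]
index = [toy_embed(chunk) for chunk in chunks]   # embed once, at ingest time

question = "how does cosine similarity work"
top = retrieve(question, chunks, index, k=1)
print(build_prompt(question, top))
```

Because the toy vectors are unit length, the dot product here is also the cosine similarity. Swapping in a real model and a vector database changes the components but not the shape of the flow: embed, store, embed the query, retrieve, assemble context.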
The quality of retrieval depends on three factors: the embedding model’s quality, the correctness of the metric choice, and the ANN index’s recall setting. A weakness in any one of them silently degrades end-to-end answer quality.
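Index recall is the easiest of these factors to measure directly: run the same queries through an exact scan and through the ANN index, then compare the result sets. A minimal sketch, with hard-coded ID lists standing in for real search results:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search also returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

def mean_recall(approx_results, exact_results):
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(a, e) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)

# Illustrative IDs only; in practice these come from the two search paths.
exact = [4, 9, 1, 7, 3]
approx = [4, 1, 7, 8, 2]
print(recall_at_k(approx, exact))                           # 0.6
print(mean_recall([approx, exact], [exact, exact]))         # 0.8
```

Tracking this number on a held-out query set makes a drop in index recall visible before it shows up as worse answers.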