Embeddings — AI Glossary

A computer can compare the strings “dog” and “canine” and conclude they share no characters. But a human immediately understands they refer to the same animal. The gap between character-level comparison and semantic understanding is exactly what embeddings bridge.

An embedding is a point in a high-dimensional vector space — a list of hundreds or thousands of decimal numbers — that encodes what a piece of content means. Two pieces of content with similar meanings end up at similar coordinates. Distance in the embedding space corresponds to semantic distance in human understanding.

This transformation — from arbitrary text or images to a measurable position in meaning-space — is the foundational operation behind semantic search, RAG, recommendation systems, classification, clustering, and similarity detection.

How Embeddings Are Created

Embeddings are produced by embedding models — neural networks trained to map inputs to vectors. The training objective determines what “similar” means:

Word2Vec (2013): Words that appear in similar contexts in a large text corpus are pushed together in vector space. “King” and “queen” end up nearby; “king” and “carburetor” end up far apart. The famous result: vector(king) - vector(man) + vector(woman) ≈ vector(queen) — arithmetic on meaning.
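The analogy arithmetic can be sketched with a few hand-made vectors. The three-dimensional values below are invented for illustration (real Word2Vec vectors are learned and have hundreds of dimensions), and the helper names `cosine` and `analogy` are hypothetical.

```python
import math

# Toy 3-d vectors invented for illustration. The axes loosely stand for
# (royalty, maleness, machinery); real Word2Vec axes are not interpretable.
toy_vectors = {
    "king": [0.9, 0.8, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "man": [0.1, 0.8, 0.0],
    "woman": [0.1, 0.1, 0.0],
    "carburetor": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c).

    The three input words are excluded from the candidates.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

With these toy values the arithmetic lands exactly on the "queen" vector; with learned vectors the match is only approximate.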

BERT-based sentence embeddings (2019+): Sentence-BERT and its successors fine-tune BERT on sentence pairs so that similar sentences are pulled together and dissimilar ones pushed apart, typically with siamese or contrastive objectives. The result is a sentence-level vector rather than a word-level one, capturing the meaning of an entire passage, not just individual tokens.

Contrastive pre-training at scale (2022–2025): OpenAI’s text-embedding-3 series, Cohere’s embed-v3, and Voyage AI’s voyage-3 models train on massive corpora with contrastive objectives, producing general-purpose embeddings that work across diverse domains without task-specific fine-tuning.

Natively multimodal embeddings (2026): The frontier has shifted from text-only to unified multimodal embedding spaces. Google’s Gemini Embedding 2 (March 2026) embeds text, images, video, audio, and PDFs into a single 3072-dimensional vector space — enabling cross-modal search (find images by text, or audio clips by document) without separate modality-specific models. CLIP (OpenAI, 2021) was the early proof-of-concept for this approach, but current models extend it across many more modalities at production scale.

The Geometry of Meaning

Once content is embedded, several geometric operations become semantically meaningful:

Cosine similarity measures the angle between two vectors — the most common way to compare embeddings. A cosine similarity of 1.0 means identical direction (same meaning); 0 means orthogonal (unrelated); -1 means opposite. This lets you rank all items in a database by semantic closeness to a query.

Nearest neighbour search — given a query vector, find the K database vectors with the highest cosine similarity. This is the retrieval operation underlying RAG, semantic search, and recommendation. See Vector Similarity Search for the algorithms that make this fast at scale.
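A brute-force version of this retrieval step might look like the sketch below. It scores every database vector against the query, which is fine for small collections but is exactly what approximate indexes avoid at scale. The vectors are toy values and `top_k` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k):
    """Return (index, score) pairs for the k vectors most similar to query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-d "embeddings"; real ones come from an embedding model.
database = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.3, 0.1],
    [0.0, 0.1, 0.9],
]
query = [1.0, 0.2, 0.0]

neighbours = top_k(query, database, k=2)
for index, score in neighbours:
    print(index, round(score, 3))
```

Here the two vectors that point in nearly the same direction as the query (indices 0 and 2) are returned first.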

Clustering: group documents, customer segments, or support tickets by topic without predefined categories. Running K-means or HDBSCAN directly on embedding vectors produces semantically coherent groups.
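To make the clustering step concrete, here is a minimal K-means sketch in plain Python. It assumes fixed starting centroids so the result is reproducible (library implementations usually pick them randomly or with k-means++), uses toy 2-d vectors, and does not attempt HDBSCAN. In practice a library implementation would normally be used instead.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(vectors, initial_centroids, iterations=10):
    """Return (labels, centroids) after a fixed number of K-means iterations."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_vector(members)
    return labels, centroids

# Toy vectors forming two obvious groups.
points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[2]])
print(labels)  # [0, 0, 1, 1]
```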

Outlier detection: items far from all cluster centres in embedding space are likely to be anomalous or novel. This is used for fraud detection, content moderation, and quality filtering.
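One simple way to apply this idea is to flag any item whose distance to its nearest cluster centre exceeds a threshold. The sketch below assumes the centres are already known (for example from a clustering run), and the threshold of 0.3 is an arbitrary illustrative value that would need tuning on real data.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_outliers(vectors, centres, threshold):
    """Return indices of vectors farther than `threshold` from every centre."""
    outliers = []
    for i, v in enumerate(vectors):
        nearest = min(euclidean(v, c) for c in centres)
        if nearest > threshold:
            outliers.append(i)
    return outliers

# Toy values: two centres and four items, one of which sits between the clusters.
centres = [[0.85, 0.15], [0.15, 0.85]]
items = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.9]]

outliers = find_outliers(items, centres, threshold=0.3)
print(outliers)  # [2]
```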

Embedding Dimensions

Embedding models produce vectors of fixed dimensionality — common sizes range from 384 (lightweight) to 3072 (Gemini Embedding 2) or 4096 (high-capacity research models). Higher dimensionality generally captures finer semantic distinctions but requires more memory and compute for similarity search.
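The memory side of this trade-off is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats (4 bytes per value) and ignores any index overhead; `index_size_gb` is a hypothetical helper.

```python
def index_size_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage in gigabytes for dense vectors, ignoring index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1e9

# Ten million vectors at two of the sizes mentioned above.
small = index_size_gb(10_000_000, 384)
large = index_size_gb(10_000_000, 3072)
print(f"384 dims:  {small:.2f} GB")
print(f"3072 dims: {large:.2f} GB")
```

Under these assumptions, ten million vectors take roughly 15 GB at 384 dimensions and roughly 123 GB at 3072, an eightfold difference that carries over to similarity-search compute.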

Modern embedding APIs let you specify output dimensions (via Matryoshka Representation Learning, or MRL) — truncating a 3072-dim vector to 768 dimensions produces a smaller vector with minimal quality loss, useful when storage or search latency is a constraint. Gemini Embedding 2 and OpenAI’s text-embedding-3 models both support MRL-based dimension reduction natively.
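When an API does not shorten the vector for you, MRL-style reduction amounts to keeping the leading values and rescaling to unit length. The sketch below assumes the vector came from a model trained with MRL (truncating other embeddings this way can lose much more quality), and re-normalizing is shown as a common precaution rather than a documented requirement of any particular API.

```python
import math

def truncate_and_normalize(vector, dims):
    """Keep the first `dims` values, then rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Toy 4-d unit vector standing in for a full-size embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_and_normalize(full, 2)

print(len(short))
print(math.sqrt(sum(x * x for x in short)))  # length is 1 again, up to rounding
```

Keeping the shortened vectors at unit length means cosine similarity and dot product continue to give the same ranking.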

From Tokens to Sentences to Documents

Token embeddings are the internal representations a transformer model assigns to each token as input. These flow through the model’s layers, accumulating context, and are not typically used for downstream similarity search.

Sentence/passage embeddings collapse a full sequence into a single vector — either by averaging token embeddings or by using a [CLS] token. These are what embedding APIs return and what vector databases store.
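The two pooling strategies can be sketched on a toy sequence. The sketch assumes the model's per-token output vectors are already available as plain lists, that the first position holds the [CLS] token, and that an attention mask marks padding with 0. All values are invented.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1, ignoring padding."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(column) / len(kept) for column in zip(*kept)]

def cls_pool(token_vectors):
    """Use the vector at the first position ([CLS]) as the sequence embedding."""
    return token_vectors[0]

# Four positions: [CLS], two word tokens, and one padding position.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

mean_embedding = mean_pool(token_vectors, attention_mask)
cls_embedding = cls_pool(token_vectors)
print(mean_embedding)
print(cls_embedding)
```

Leaving the padding position out of the average matters: including it here would drag the pooled vector toward the meaningless `[9.0, 9.0]` entry.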

Document embeddings represent longer documents. Since most embedding models have a context limit (commonly 512 to 8192 tokens, though some accept far more), long documents are usually chunked, each chunk is embedded separately, and queries are matched against chunks rather than whole documents.
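A minimal chunking routine might look like the following. Two assumptions are worth flagging: whitespace-separated words stand in for model tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the overlap between neighbouring chunks is a common practice added here, not something described above.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk would then be embedded separately and stored alongside a pointer back to its source document.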

Embeddings in Production Systems

Semantic search: A user types “contract renewal process” and the system returns documents semantically about that topic — even if they never use those exact words. The query is embedded, and the nearest document chunks are retrieved.

Retrieval-Augmented Generation (RAG): Documents are pre-embedded and stored in a vector database. At query time, the question is embedded, the nearest chunks are retrieved, and those chunks are injected into the LLM’s context as grounding. Embeddings are the retrieval backbone of most RAG systems.
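The flow described above (embed at indexing time, embed the query, retrieve, assemble the prompt) can be sketched end to end. Everything here is a stand-in: `toy_embed` counts words from a tiny vocabulary and is not semantic, the in-memory list replaces a vector database, and the prompt wording is invented. A real system would call an embedding model and send the prompt to an LLM.

```python
import math

VOCAB = ["contract", "renewal", "invoice", "payment", "holiday", "leave"]

def toy_embed(text):
    """Stand-in for an embedding model: word counts over a tiny vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Indexing time: embed every chunk once and keep (text, vector) pairs.
chunks = [
    "contract renewal requires written notice 30 days before expiry",
    "invoice payment is due within 14 days",
    "holiday leave must be approved by a manager",
]
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(question, index, k=1):
    """Embed the question and return the k most similar chunk texts."""
    query_vector = toy_embed(question)
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

def build_prompt(question, context_chunks):
    """Inject the retrieved chunks into the prompt as grounding."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "how does contract renewal work"
top_chunks = retrieve(question, index, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```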

Recommendation systems: User history and content items are embedded in the same space. Recommend items whose embedding is closest to the user’s embedding (or closest to the embedding of content they’ve engaged with).

Duplicate and near-duplicate detection: Two support tickets or legal clauses with very high cosine similarity are likely to be semantically equivalent even if worded differently. Useful for deduplication, routing, and plagiarism detection.

Classification without labelled data: Zero-shot classification embeds the input and a set of candidate labels, returning the label whose embedding is closest to the input. No training data required.
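In code, zero-shot classification reduces to a nearest-label lookup. The sketch below uses hand-made 3-d vectors in place of real embeddings of the label names and the input text, and `zero_shot_classify` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(input_vector, label_vectors):
    """Return the label whose embedding is closest to the input embedding."""
    return max(
        label_vectors,
        key=lambda label: cosine_similarity(input_vector, label_vectors[label]),
    )

# Toy vectors standing in for embeddings of the candidate label names.
label_vectors = {
    "billing": [0.9, 0.1, 0.0],
    "technical issue": [0.1, 0.9, 0.1],
    "account access": [0.0, 0.2, 0.9],
}
# Toy vector standing in for the embedding of a ticket such as "the app keeps crashing".
ticket_vector = [0.2, 0.8, 0.2]

predicted = zero_shot_classify(ticket_vector, label_vectors)
print(predicted)  # technical issue
```

Results with real models tend to be sensitive to how the labels are phrased, so short descriptive label texts are worth experimenting with.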

Choosing an Embedding Model

The embedding model landscape has consolidated around a handful of strong options in 2025–2026. The MTEB leaderboard (Massive Text Embedding Benchmark, maintained by Hugging Face) is the standard reference for comparing models across retrieval, classification, clustering, and reranking tasks. MTEB v2 launched in 2026 with harder benchmarks — scores are not directly comparable across v1 and v2.

Current Leading Models (2026)

Google Gemini Embedding 2 — MTEB score 68.32 (v2 English). The current benchmark leader and the first production-grade natively multimodal embedding model. 3072-dim vectors, 8192-token context, supports text, image, video, audio, and PDF in one unified space. Supports MRL truncation to 768 or 1536 dims. Available via the Gemini API.

Cohere embed-v4 — MTEB score 65.2. The strongest commercial model for multilingual use, with native support for over 100 languages and a multimodal variant. Particularly well-suited for enterprise RAG with mixed-language corpora.

OpenAI text-embedding-3-large — MTEB score 64.6. Industry workhorse, widely integrated, supports MRL dimension reduction (from 3072 down to 256). Hasn’t been updated since January 2024 but remains a reliable default for teams already in the OpenAI ecosystem.

Voyage AI voyage-3-large — Leading commercial option for domain-specific retrieval. Outperforms OpenAI text-embedding-3-large by 10%+ on code, legal, finance, and long-context benchmarks. 128k-token context window — the longest of any production embedding model. Best-in-class for code RAG.

Microsoft Harrier-OSS-v1 (27B) — MTEB v2 score 74.3. Top open-weight model for teams with inference infrastructure. Achieves frontier-level quality without API dependency.

Qwen3-Embedding — Top of the MTEB multilingual leaderboard (score 70.58 for the 8B model). Open-weight, instruction-aware, available in 0.6B, 4B, and 8B sizes. Strong choice for multilingual RAG with self-hosted infrastructure.

BAAI bge-m3 — The practical open-source default. Supports dense, sparse, and multi-vector (ColBERT-style) retrieval in a single model, across 100+ languages, with an 8192-token context window. Enables hybrid search without running separate models.

Selection Guide

Need	Recommended model
Best overall quality, API-based	Gemini Embedding 2
Multilingual enterprise RAG	Cohere embed-v4
Code or long-document retrieval	Voyage AI voyage-3-large
OpenAI ecosystem, general use	text-embedding-3-large
Self-hosted, multilingual	Qwen3-Embedding-8B
Self-hosted, hybrid search	BAAI bge-m3
Self-hosted, maximum quality	Microsoft Harrier-OSS-v1

Key variables beyond benchmark score: context length (voyage-3-large’s 128k is unmatched for long documents), modality support (Gemini Embedding 2 for multimodal), deployment model (API vs. self-hosted), and cost per million tokens.

How to Use — Generate embeddings with Gemini Embedding 2

python

import numpy as np
from google import genai

# The client picks up the API key from the environment.
client = genai.Client()

# Model ID as given in this entry; confirm the current ID in the Gemini API docs.
MODEL_ID = "gemini-embedding-exp-03-07"

# Embed a single text passage
response = client.models.embed_content(
    model=MODEL_ID,
    contents="Retrieval-augmented generation grounds LLM outputs in real documents.",
)
vector = response.embeddings[0].values
print(f"Dimensions: {len(vector)}")  # 3072

# Embed a batch of texts in one request (more efficient)
texts = [
    "What is a transformer architecture?",
    "How does attention mechanism work?",
    "Explain positional encoding.",
]
batch_response = client.models.embed_content(
    model=MODEL_ID,
    contents=texts,
)
vectors = [e.values for e in batch_response.embeddings]

# Cosine similarity between two vectors
def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare the first two questions from the batch
score = cosine_similarity(vectors[0], vectors[1])
print(f"Similarity: {score:.4f}")
