Embeddings Deep — Models · Dimensions · Benchmarks · Cache
An embedding maps text (or images · code) to a fixed-dimensional real-valued vector. On the assumption that semantically close items sit close together in the vector space, search · clustering · classification · recommendation are built on top.
1. The intuition of embeddings
If you embed two sentences with the same model and measure cosine similarity, pairs with similar meaning tend to score higher.
"How is the weather today?" ↔ "Tell me the current temperature" → similarity 0.85
"How is the weather today?" ↔ "How to make french fries" → similarity 0.20
Absolute values differ by model even for the same sentence pair, so thresholds must be tuned per model.
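A minimal sketch with the sentence-transformers library, assuming the small all-MiniLM-L6-v2 model (384-dim); actual scores will differ from the illustrative numbers above:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small 384-dim English model

sentences = [
    "How is the weather today?",
    "Tell me the current temperature",
    "How to make french fries",
]
embeddings = model.encode(sentences)  # shape: (3, 384)

# Cosine similarity of the first sentence against the other two.
print(util.cos_sim(embeddings[0], embeddings[1]))  # similar meaning -> higher score
print(util.cos_sim(embeddings[0], embeddings[2]))  # unrelated -> lower score
```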
2. The meaning of dimension
Larger embedding dimensions give greater expressive power, but storage · search costs grow right along with them (a worked storage estimate follows the table).
| Dim | Example models |
|---|---|
| 384 / 512 | Small Sentence-Transformers · BGE-small |
| 768 | nomic-embed-text · BGE-base |
| 1024 | BGE-large · Cohere embed v3 |
| 1536 | OpenAI text-embedding-3-small |
| 3072 | OpenAI text-embedding-3-large |
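A back-of-the-envelope storage estimate for float32 vectors (index structures like HNSW add overhead on top):

1,000,000 chunks × 1536 dims × 4 bytes ≈ 6.1 GB
1,000,000 chunks × 3072 dims × 4 bytes ≈ 12.3 GB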
3. Matryoshka representation learning
A training method formalized by Kusupati et al. (2022). The model is trained so that information is reasonably preserved even when only a front slice of the embedding is used. The `dimensions` option of the OpenAI text-embedding-3 series, which returns reduced-dimension embeddings in the response, is reported to be based on this idea.
Response dimension: 3072 (full) → 1536 → 768 → 256 — successive front slices of a single model's output.
The balance between storage cost · search speed and accuracy can thus be chosen after the fact.
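A minimal sketch of the slicing itself, assuming a Matryoshka-trained model with a 3072-dim full output; the slice must be re-normalized before cosine · inner-product search:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    sliced = vec[:dim]
    return sliced / np.linalg.norm(sliced)

full = np.random.rand(3072).astype(np.float32)  # stand-in for a real 3072-dim embedding
small = truncate_embedding(full, 256)           # 256-dim slice of the same vector
```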
4. Normalization and distance functions
Most text embeddings are delivered L2-normalized, or normalization is recommended. Between L2-normalized vectors, cosine similarity and inner product are identical. Follow the distance function recommended in the model card.
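A quick check of that identity, using nothing beyond numpy: once both vectors are unit length, the two scores coincide.

```python
import numpy as np

a = np.random.rand(768); a /= np.linalg.norm(a)  # L2-normalize
b = np.random.rand(768); b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
inner = np.dot(a, b)
assert np.isclose(cosine, inner)  # identical for unit-length vectors
```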
5. Closed API models
| Model | Provider | Dim | Note |
|---|---|---|---|
| OpenAI text-embedding-3-small | OpenAI API | 1536 (reducible) | Cheap · multilingual. |
| OpenAI text-embedding-3-large | OpenAI API | 3072 (reducible) | High quality. |
| Cohere embed-multilingual-v3.0 | Cohere API | 1024 | Strong in multilingual. |
| Voyage voyage-3 | Voyage AI | Various | Domain-specialized. |
| Google text-embedding-004 | Vertex / AI Studio | 768 | Same entry point as Gemini. |
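A minimal call sketch against the OpenAI API; the `dimensions` parameter is the Matryoshka-style reduction from section 3, and other providers expose similar options under their own names:

```python
# pip install openai  (requires OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How is the weather today?"],
    dimensions=512,  # ask for a reduced-dimension vector instead of the full 1536
)
vec = resp.data[0].embedding  # list of 512 floats
```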
6. Open weight models
| Model | Origin | Note |
|---|---|---|
| BGE (BAAI/bge-*) | BAAI | English · Chinese · multilingual variants. small / base / large. |
| nomic-embed-text | Nomic AI | 768-dim. Relatively permissive license. |
| Jina embeddings | Jina AI | Multilingual variants (jina-embeddings-v3) · long input support. |
| multilingual-e5 | Microsoft Research | Trained on ~100 languages. |
| Sentence-Transformers | UKP Lab | A collection of various small models. |
| ColBERT family | Stanford | For late-interaction search. |
7. Per-task guidance
Most model cards ask you to distinguish two modes:
- passage / document — Embedding the index target.
- query — Embedding the search query.
With the same model, different prefixes ("query: ..." · "passage: ...") produce different embeddings (the e5 family, for example); a sketch follows.
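A sketch of the prefix convention, assuming the intfloat/multilingual-e5-base checkpoint; the exact prefix strings come from each model's card:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")

# e5 models expect role prefixes; omitting them degrades retrieval quality.
query = model.encode("query: how is the weather today?", normalize_embeddings=True)
passage = model.encode("passage: Today is sunny with a high of 24°C.", normalize_embeddings=True)

print(util.cos_sim(query, passage))
```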
8. The MTEB benchmark
Massive Text Embedding Benchmark — a composite benchmark released by Hugging Face, with a public leaderboard, that ranks models on diverse tasks like classification · clustering · retrieval · reranking · STS (semantic textual similarity).
Caveats:
- Many evaluations are English-centric. Korean scores appear under separate categories · multilingual variants.
- Benchmark scores and your own domain performance can differ. Your own evaluation set is more trustworthy.
- New models are added often.
9. Korean and multilingual
- multilingual-e5 · bge-m3 · Cohere embed v3 multilingual, etc., support multilingual including Korean.
- It's common to see quality drop when feeding Korean to English-only models.
- Evaluation materials measured on Korean data exist (efforts like Ko-MTEB), and new models are registered often.
10. Cache · recomputation policy
Common operational shapes:
- Content-hash-based cache — Store embedding results keyed by a hash of the chunk text, avoiding recomputation for identical text (a sketch follows the table definition below).
- Version pinning — Record the embedding model name · version in metadata.
- Model change = re-embed — Vector spaces of different models are not compatible.
- Batch jobs — Call in batches of a fixed size, retry failures, and alert on cost limits.
```sql
CREATE TABLE chunks (
  id            BIGINT PRIMARY KEY,
  content       TEXT,
  content_hash  TEXT,
  embedding     vector(1536),  -- requires the pgvector extension
  embed_model   TEXT,          -- 'text-embedding-3-small'
  embed_version TEXT,          -- '2024-01-25'
  embed_at      TIMESTAMPTZ
);
```
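A minimal content-hash cache sketch in Python, with an in-memory dict standing in for the table above and a hypothetical `embed_fn` wrapping whatever API you call. Hashing the model name together with the text makes a model swap miss the cache, which enforces the "model change = re-embed" rule:

```python
import hashlib

EMBED_MODEL = "text-embedding-3-small"  # pinned; changing it invalidates every key

_cache: dict[str, list[float]] = {}  # content_hash -> embedding

def content_hash(text: str) -> str:
    # Include the model name so embeddings from different models never collide.
    return hashlib.sha256(f"{EMBED_MODEL}\x00{text}".encode("utf-8")).hexdigest()

def embed_cached(text: str, embed_fn) -> list[float]:
    key = content_hash(text)
    if key not in _cache:
        _cache[key] = embed_fn(text)  # only pay for the API call on a cache miss
    return _cache[key]
```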
11. Dimension reduction · cost
API embeddings bill per token, so initial indexing of a large corpus can incur a large one-time cost. Switching to self-hosting (BGE · nomic · e5 + GPU or CPU) removes the per-call cost in exchange for added operational burden.
Dimension reduction — Slicing dimensions of a Matryoshka model, or post-hoc reduction such as PCA (a sketch follows). The impact on search accuracy varies by model · domain, so measure before committing.
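A post-hoc PCA sketch with scikit-learn, assuming a matrix of already-stored embeddings. Unlike Matryoshka slicing, the fitted projection itself must be persisted and applied to every future query:

```python
# pip install scikit-learn
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(10_000, 1536).astype(np.float32)  # stand-in for stored embeddings

pca = PCA(n_components=256)
X_small = pca.fit_transform(X)  # (10000, 256) reduced corpus

# Queries must pass through the SAME fitted projection, then be re-normalized.
q = np.random.rand(1536).astype(np.float32)
q_small = pca.transform(q.reshape(1, -1))[0]
q_small /= np.linalg.norm(q_small)
```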
12. Spots where you often get stuck
- Re-embedding is mandatory after a model change — Old vectors and a new query live in different spaces. Not compatible.
- Dimension mismatch — A vector(1536) column can't hold a 1024-dim vector. Use separate columns · tables per model.
- Missing prefix — Models like e5 · BGE change results based on the query/passage prefix.
- Normalization assumption — Assuming cosine distance on a non-normalized vector can give misleading results. Verify whether the model output is already normalized.
- Language mixing — When the same content mixes Korean · English, the two languages may sit close or far depending on the model. Measure in your own domain.
- Too-short text — Inputs under ~10 tokens carry little semantic signal and degrade search quality. Set a minimum chunk-length policy.
- Too-long text — Exceeding the model's input limit (512 · 8192 tokens) truncates or errors. Workarounds like chunking + weighted averaging are lossy (a sketch follows this list).
- Benchmark over-trust — MTEB rank #1 may not be #1 in your domain.
- Unbounded embedding cache growth — Without TTL · usage statistics, storage cost piles up.
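A sketch of the chunk-and-average workaround for over-long inputs, using a hypothetical `embed_fn` and length-weighted mean pooling; as noted above, pooling loses information compared to true long-context models:

```python
import numpy as np

def embed_long_text(text: str, embed_fn, chunk_chars: int = 2000) -> np.ndarray:
    """Split an over-long input into chunks, embed each, pool, re-normalize."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    vecs = np.array([embed_fn(c) for c in chunks])

    # Weight each chunk by its length so a short trailing fragment counts less.
    weights = np.array([len(c) for c in chunks], dtype=np.float32)
    pooled = (vecs * weights[:, None]).sum(axis=0) / weights.sum()
    return pooled / np.linalg.norm(pooled)  # unit length for cosine search
```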
Closing thoughts
For 90% of embedding work, an appropriate multilingual model + a content-hash cache + model-version metadata is enough. Swapping models drags re-embedding costs along, so the initial pin matters; and your own domain evaluation set is more reliable than MTEB scores.
Next
- agents-overview
- llm-landscape
References: OpenAI Embeddings · Cohere Embed · BAAI BGE · nomic-embed-text · Sentence-Transformers · MTEB Leaderboard · Matryoshka Representation Learning (Kusupati et al., 2022) · BEIR · pgvector.