Embeddings Deep — Models · Dimensions · Benchmarks · Cache
An embedding maps text (or images · code) to a fixed-dimensional real-valued vector. On the assumption that semantically close items sit close together in the vector space, search · clustering · classification · recommendation are built on top.
1. The intuition of embeddings
If you embed two sentences with the same model and measure cosine similarity, pairs with similar meaning tend to score higher.
"How is the weather today?" ↔ "Tell me the current temperature" → similarity 0.85
"How is the weather today?" ↔ "How to make french fries" → similarity 0.20
Absolute values differ by model even for the same sentence pair, so thresholds must be tuned per model.
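A minimal sketch with the sentence-transformers library, assuming the small all-MiniLM-L6-v2 model (384-dim); actual scores will differ from the illustrative numbers above:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small 384-dim English model

sentences = [
    "How is the weather today?",
    "Tell me the current temperature",
    "How to make french fries",
]
embeddings = model.encode(sentences)  # shape: (3, 384)

# Cosine similarity of the first sentence against the other two.
print(util.cos_sim(embeddings[0], embeddings[1]))  # similar meaning -> higher score
print(util.cos_sim(embeddings[0], embeddings[2]))  # unrelated -> lower score
```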
2. The meaning of dimension
Larger embedding dimensions give greater expressive power, but storage · search costs grow right along with them (a worked storage estimate follows the table).
| Dim | Example models |
|---|---|
| 384 / 512 | Small Sentence-Transformers · BGE-small |
| 768 | nomic-embed-text · BGE-base |
| 1024 | BGE-large · Cohere embed v3 |
| 1536 | OpenAI text-embedding-3-small |
| 3072 | OpenAI text-embedding-3-large |
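A back-of-the-envelope storage estimate for float32 vectors (index structures like HNSW add overhead on top):

1,000,000 chunks × 1536 dims × 4 bytes ≈ 6.1 GB
1,000,000 chunks × 3072 dims × 4 bytes ≈ 12.3 GB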
3. Matryoshka representation learning
A training method formalized by Kusupati et al. (2022). The model is trained so that information is reasonably preserved even when only a front slice of the embedding is used. The `dimensions` option of the OpenAI text-embedding-3 series, which returns reduced-dimension embeddings in the response, is reported to be based on this idea.
Response dimension: 3072 (full) → 1536 → 768 → 256 — successive front slices of a single model's output.
The balance between storage cost · search speed and accuracy can thus be chosen after the fact.
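A minimal sketch of the slicing itself, assuming a Matryoshka-trained model with a 3072-dim full output; the slice must be re-normalized before cosine · inner-product search:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    sliced = vec[:dim]
    return sliced / np.linalg.norm(sliced)

full = np.random.rand(3072).astype(np.float32)  # stand-in for a real 3072-dim embedding
small = truncate_embedding(full, 256)           # 256-dim slice of the same vector
```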
4. Normalization and distance functions
Most text embeddings are delivered L2-normalized, or normalization is recommended. Between L2-normalized vectors, cosine similarity and inner product are identical. Follow the distance function recommended in the model card.
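A quick check of that identity, using nothing beyond numpy: once both vectors are unit length, the two scores coincide.

```python
import numpy as np

a = np.random.rand(768); a /= np.linalg.norm(a)  # L2-normalize
b = np.random.rand(768); b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
inner = np.dot(a, b)
assert np.isclose(cosine, inner)  # identical for unit-length vectors
```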
5. Closed API models
| Model | Provider | Dim | Note |
|---|---|---|---|
| OpenAI text-embedding-3-small | OpenAI API | 1536 (reducible) | Cheap · multilingual. |
| OpenAI text-embedding-3-large | OpenAI API | 3072 (reducible) | High quality. |
| Cohere embed-multilingual-v3.0 | Cohere API | 1024 | Strong in multilingual. |
| Voyage voyage-3 | Voyage AI | Various | Domain-specialized. |
| Google text-embedding-004 | Vertex / AI Studio | 768 | Same entry point as Gemini. |
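A minimal call sketch against the OpenAI API; the `dimensions` parameter is the Matryoshka-style reduction from section 3, and other providers expose similar options under their own names:

```python
# pip install openai  (requires OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How is the weather today?"],
    dimensions=512,  # ask for a reduced-dimension vector instead of the full 1536
)
vec = resp.data[0].embedding  # list of 512 floats
```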
6. Open weight models
| Model | Origin | Note |
|---|---|---|
| BGE (BAAI/bge-*) | BAAI | English · Chinese · multilingual variants. small / base / large. |
| nomic-embed-text | Nomic AI | 768-dim. Relatively permissive license. |
| Jina embeddings | Jina AI | Multilingual variants (jina-embeddings-v3) · long input support. |
| multilingual-e5 | Microsoft Research | Trained on ~100 languages. |
| Sentence-Transformers | UKP Lab | A collection of various small models. |
| ColBERT family | Stanford | For late-interaction search. |
7. Per-task guidance
Most model cards ask you to distinguish two modes:
- passage / document — Embedding the index target.
- query — Embedding the search query.
With the same model, different prefixes ("query: ..." · "passage: ...") produce different embeddings (the e5 family, for example); a sketch follows.
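A sketch of the prefix convention, assuming the intfloat/multilingual-e5-base checkpoint; the exact prefix strings come from each model's card:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")

# e5 models expect role prefixes; omitting them degrades retrieval quality.
query = model.encode("query: how is the weather today?", normalize_embeddings=True)
passage = model.encode("passage: Today is sunny with a high of 24°C.", normalize_embeddings=True)

print(util.cos_sim(query, passage))
```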
8. The MTEB benchmark
Massive Text Embedding Benchmark — a composite benchmark released by Hugging Face, with a public leaderboard, that ranks models on diverse tasks like classification · clustering · retrieval · reranking · STS (semantic textual similarity).
Caveats:
- Many evaluations are English-centric. Korean scores appear under separate categories · multilingual variants.
- Benchmark scores and your own domain performance can differ. Your own evaluation set is more trustworthy.
- New models are added often.
9. Korean and multilingual
- multilingual-e5 · bge-m3 · Cohere embed v3 multilingual, etc., support multilingual including Korean.
- It's common to see quality drop when feeding Korean to English-only models.
- Evaluation materials measured on Korean data exist (efforts like Ko-MTEB), and new models are registered often.
10. Cache · recomputation policy
Common operational shapes:
- Content-hash-based cache — Store embedding results keyed by a hash of the chunk text, avoiding recomputation for identical text (a sketch follows the table definition below).
- Version pinning — Record the embedding model name · version in metadata.
- Model change = re-embed — Vector spaces of different models are not compatible.
- Batch jobs — Call in batches of a fixed size, retry failures, and alert on cost limits.
```sql
CREATE TABLE chunks (
  id            BIGINT PRIMARY KEY,
  content       TEXT,
  content_hash  TEXT,
  embedding     vector(1536),  -- requires the pgvector extension
  embed_model   TEXT,          -- 'text-embedding-3-small'
  embed_version TEXT,          -- '2024-01-25'
  embed_at      TIMESTAMPTZ
);
```
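A minimal content-hash cache sketch in Python, with an in-memory dict standing in for the table above and a hypothetical `embed_fn` wrapping whatever API you call. Hashing the model name together with the text makes a model swap miss the cache, which enforces the "model change = re-embed" rule:

```python
import hashlib

EMBED_MODEL = "text-embedding-3-small"  # pinned; changing it invalidates every key

_cache: dict[str, list[float]] = {}  # content_hash -> embedding

def content_hash(text: str) -> str:
    # Include the model name so embeddings from different models never collide.
    return hashlib.sha256(f"{EMBED_MODEL}\x00{text}".encode("utf-8")).hexdigest()

def embed_cached(text: str, embed_fn) -> list[float]:
    key = content_hash(text)
    if key not in _cache:
        _cache[key] = embed_fn(text)  # only pay for the API call on a cache miss
    return _cache[key]
```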
11. Dimension reduction · cost
API embeddings bill per token, so initial indexing of a large corpus can incur a large one-time cost. Switching to self-hosting (BGE · nomic · e5 + GPU or CPU) removes the per-call cost in exchange for added operational burden.
Dimension reduction — Slicing dimensions of a Matryoshka model, or post-hoc reduction such as PCA (a sketch follows). The impact on search accuracy varies by model · domain, so measure before committing.
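A post-hoc PCA sketch with scikit-learn, assuming a matrix of already-stored embeddings. Unlike Matryoshka slicing, the fitted projection itself must be persisted and applied to every future query:

```python
# pip install scikit-learn
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(10_000, 1536).astype(np.float32)  # stand-in for stored embeddings

pca = PCA(n_components=256)
X_small = pca.fit_transform(X)  # (10000, 256) reduced corpus

# Queries must pass through the SAME fitted projection, then be re-normalized.
q = np.random.rand(1536).astype(np.float32)
q_small = pca.transform(q.reshape(1, -1))[0]
q_small /= np.linalg.norm(q_small)
```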
12. Spots where you often get stuck
- Re-embedding is mandatory after a model change — Old vectors and a new query live in different spaces. Not compatible.
- Dimension mismatch — A vector(1536) column can't hold a 1024-dim vector. Use separate columns · tables per model.
- Missing prefix — Models like e5 · BGE change results based on the query/passage prefix.
- Normalization assumption — Assuming cosine distance on a non-normalized vector can give misleading results. Verify whether the model output is already normalized.
- Language mixing — When the same content mixes Korean · English, the two languages may sit close or far depending on the model. Measure in your own domain.
- Too-short text — Inputs under ~10 tokens carry little semantic signal and degrade search quality. Set a minimum chunk-length policy.
- Too-long text — Exceeding the model's input limit (512 · 8192 tokens) truncates or errors. Workarounds like chunking + weighted averaging are lossy (a sketch follows this list).
- Benchmark over-trust — MTEB rank #1 may not be #1 in your domain.
- Unbounded embedding cache growth — Without TTL · usage statistics, storage cost piles up.
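A sketch of the chunk-and-average workaround for over-long inputs, using a hypothetical `embed_fn` and length-weighted mean pooling; as noted above, pooling loses information compared to true long-context models:

```python
import numpy as np

def embed_long_text(text: str, embed_fn, chunk_chars: int = 2000) -> np.ndarray:
    """Split an over-long input into chunks, embed each, pool, re-normalize."""
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    vecs = np.array([embed_fn(c) for c in chunks])

    # Weight each chunk by its length so a short trailing fragment counts less.
    weights = np.array([len(c) for c in chunks], dtype=np.float32)
    pooled = (vecs * weights[:, None]).sum(axis=0) / weights.sum()
    return pooled / np.linalg.norm(pooled)  # unit length for cosine search
```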
Closing thoughts
For 90% of embedding work, an appropriate multilingual model + a content-hash cache + model-version metadata is enough. Swapping models drags re-embedding costs along, so the initial pin matters; and your own domain evaluation set is more reliable than MTEB scores.
Next
- agents-overview
- llm-landscape
References: OpenAI Embeddings · Cohere Embed · BAAI BGE · nomic-embed-text · Sentence-Transformers · MTEB Leaderboard · Matryoshka Representation Learning (Kusupati et al., 2022) · BEIR · pgvector.