Step 2 · 25 min

# Embeddings — text to vectors
Embeddings map text to high-dimensional vectors so "price tag", "receipt", and "bill" sit close together in meaning space.
## 1. What embeddings answer

- Word similarity (cat ↔ dog)
- Sentence meaning ("I want a refund" ↔ "return request")
- Cross-language (apple ↔ 사과, multilingual models only)

Traditional search (BM25 / TF-IDF) matches tokens and misses all three.
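To see why, look at raw token overlap — the only signal a lexical scorer like BM25 has to work with. A minimal sketch (`token_overlap` is an illustrative helper, not a real BM25 implementation):

```python
# Lexical search can only score documents that share tokens with the query.
def token_overlap(a: str, b: str) -> set[str]:
    return set(a.lower().split()) & set(b.lower().split())

print(token_overlap("I want a refund", "return request"))    # set()
print(token_overlap("cheap price tag", "price of the tag"))  # {'price', 'tag'}
```

The refund/return pair shares zero tokens, so it scores zero under any token-matching scheme — exactly the gap embeddings close.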
## 2. Dimensions and models
| Model | Dims | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Cheap + strong |
| OpenAI text-embedding-3-large | 3072 | High quality, 3x cost |
| Gemini text-embedding-004 | 768 | Free quota |
| bge-m3 (local) | 1024 | Multilingual |
| multilingual-e5-large | 1024 | Open, local-friendly |
For most projects, 768–1024 dims is the pragmatic balance between quality and storage cost.
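The storage side of that trade-off is easy to estimate: a float32 vector costs 4 bytes per dimension. A back-of-envelope sketch (`index_size_gb` is a hypothetical helper, and this ignores index overhead such as HNSW graph links):

```python
# float32 vectors take dims * 4 bytes each, before any index overhead.
def index_size_gb(n_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    return n_vectors * dims * bytes_per_float / 1024**3

print(f"{index_size_gb(1_000_000, 768):.2f} GB")   # ~2.86 GB
print(f"{index_size_gb(1_000_000, 3072):.2f} GB")  # ~11.44 GB
```

At a million chunks, moving from 768 to 3072 dims quadruples raw vector storage — one reason the mid-size models are the default choice.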
## 3. Gemini (free API)

```python
import google.generativeai as genai

genai.configure(api_key="...")

resp = genai.embed_content(
    model="models/text-embedding-004",
    content="Audit log — logAdminAction pattern",
    task_type="retrieval_document",
)
vec = resp["embedding"]  # list[float], 768 dims
```

`task_type` matters: index documents with `retrieval_document` and embed queries with `retrieval_query`.
## 4. Cosine similarity

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In PostgreSQL, pgvector's `<=>` operator returns cosine *distance* (1 − cosine similarity), so sort ascending to get the closest matches first.
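The distance/similarity flip is a classic off-by-inversion bug, so here is the arithmetic behind the operator in plain Python (no SQL — just a sketch of what ascending-order ranking does):

```python
import numpy as np

def cosine_distance(a, b):
    # pgvector-style cosine distance: 1 - cos(a, b); smaller = more similar.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = [1.0, 0.0]
docs = {"same": [2.0, 0.0], "diagonal": [1.0, 1.0], "orthogonal": [0.0, 1.0]}

# Ascending sort on distance == most similar first, mirroring ORDER BY ASC.
ranked = sorted(docs, key=lambda k: cosine_distance(query, docs[k]))
print(ranked)  # ['same', 'diagonal', 'orthogonal']
```

If your top results look like the *worst* matches, you almost certainly sorted a distance column descending.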
## 5. Quality sanity check

Prepare 10 sentence pairs with the same meaning and confirm the average cosine similarity is ≥ 0.85. Below that, switch to a multilingual (or otherwise stronger) model.
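The check fits in a few lines. A sketch with a hypothetical `avg_pair_similarity` helper — the `toy_embed` stub below is a deterministic stand-in for illustration only; swap in your real model's embedding call:

```python
import numpy as np

def avg_pair_similarity(pairs, embed):
    """pairs: list of (text_a, text_b); embed: callable, text -> vector."""
    sims = []
    for a, b in pairs:
        va, vb = np.asarray(embed(a), dtype=float), np.asarray(embed(b), dtype=float)
        sims.append(float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))))
    return sum(sims) / len(sims)

# Toy stand-in embedder (length + word count) -- replace with a real model.
def toy_embed(text):
    return [len(text), text.count(" ") + 1]

pairs = [("I want a refund", "return request")] * 10  # your 10 real pairs here
score = avg_pair_similarity(pairs, toy_embed)
print(f"avg similarity: {score:.2f}")
```

With a real model, gate on `score >= 0.85` in CI so a model swap that degrades paraphrase matching fails loudly.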
## 6. Gotchas

- Mixing query/doc task types between indexing and querying
- Embedding one long text without chunking (model token limits are typically 512–2048)
- Not re-embedding the whole corpus when the embedding model changes
- Missing normalisation (inner-product search over HNSW assumes unit-length vectors)
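The normalisation gotcha in two lines: once vectors are scaled to unit length, a plain dot product *is* the cosine similarity, which is what inner-product indexes rely on. A minimal sketch (`normalise` is an illustrative helper):

```python
import numpy as np

def normalise(v):
    # Scale to unit length so dot product == cosine similarity.
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a, b = normalise([3.0, 4.0]), normalise([4.0, 3.0])
print(round(float(a @ b), 6))  # 0.96 -- the cosine of the two raw vectors
```

Some APIs return pre-normalised vectors and some don't, so check once at ingest rather than assuming.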
## Closing
Embedding quality decides 60–70% of retrieval accuracy. An hour spent on model choice beats a day of prompt tuning.
## Next
- 03-pgvector-hnsw