Step 2 · 25 min

# Embeddings — text to vectors
Embeddings map text to high-dimensional vectors so "price tag", "receipt", and "bill" sit close together in meaning space.
## 1. What embeddings answer

- Word similarity (cat ↔ dog)
- Sentence meaning ("I want a refund" ↔ "return request")
- Cross-language (apple ↔ 사과, multilingual models only)

Traditional search (BM25 / TF-IDF) matches tokens and misses all three.
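To see why, look at raw token overlap — the only signal a lexical scorer like BM25 has to work with. A minimal sketch (`token_overlap` is an illustrative helper, not a real BM25 implementation):

```python
# Lexical search can only score documents that share tokens with the query.
def token_overlap(a: str, b: str) -> set[str]:
    return set(a.lower().split()) & set(b.lower().split())

print(token_overlap("I want a refund", "return request"))    # set()
print(token_overlap("cheap price tag", "price of the tag"))  # {'price', 'tag'}
```

The refund/return pair shares zero tokens, so it scores zero under any token-matching scheme — exactly the gap embeddings close.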
## 2. Dimensions and models
| Model | Dims | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Cheap + strong |
| OpenAI text-embedding-3-large | 3072 | High quality, 3x cost |
| Gemini text-embedding-004 | 768 | Free quota |
| bge-m3 (local) | 1024 | Multilingual |
| multilingual-e5-large | 1024 | Open, local-friendly |
For most projects, 768–1024 dims is the pragmatic balance between quality and storage cost.
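The storage side of that trade-off is easy to estimate: a float32 vector costs 4 bytes per dimension. A back-of-envelope sketch (`index_size_gb` is a hypothetical helper, and this ignores index overhead such as HNSW graph links):

```python
# float32 vectors take dims * 4 bytes each, before any index overhead.
def index_size_gb(n_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    return n_vectors * dims * bytes_per_float / 1024**3

print(f"{index_size_gb(1_000_000, 768):.2f} GB")   # ~2.86 GB
print(f"{index_size_gb(1_000_000, 3072):.2f} GB")  # ~11.44 GB
```

At a million chunks, moving from 768 to 3072 dims quadruples raw vector storage — one reason the mid-size models are the default choice.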
## 3. Gemini (free API)

```python
import google.generativeai as genai

genai.configure(api_key="...")

resp = genai.embed_content(
    model="models/text-embedding-004",
    content="Audit log — logAdminAction pattern",
    task_type="retrieval_document",
)
vec = resp["embedding"]  # list[float], 768 dims
```

`task_type` matters: index documents with `retrieval_document` and embed queries with `retrieval_query`.
## 4. Cosine similarity

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In PostgreSQL, pgvector's `<=>` operator returns cosine *distance* (1 − cosine similarity), so sort ascending to get the closest matches first.
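The distance/similarity flip is a classic off-by-inversion bug, so here is the arithmetic behind the operator in plain Python (no SQL — just a sketch of what ascending-order ranking does):

```python
import numpy as np

def cosine_distance(a, b):
    # pgvector-style cosine distance: 1 - cos(a, b); smaller = more similar.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = [1.0, 0.0]
docs = {"same": [2.0, 0.0], "diagonal": [1.0, 1.0], "orthogonal": [0.0, 1.0]}

# Ascending sort on distance == most similar first, mirroring ORDER BY ASC.
ranked = sorted(docs, key=lambda k: cosine_distance(query, docs[k]))
print(ranked)  # ['same', 'diagonal', 'orthogonal']
```

If your top results look like the *worst* matches, you almost certainly sorted a distance column descending.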
## 5. Quality sanity check

Prepare 10 sentence pairs with the same meaning and confirm the average cosine similarity is ≥ 0.85. Below that, switch to a multilingual (or otherwise stronger) model.
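The check fits in a few lines. A sketch with a hypothetical `avg_pair_similarity` helper — the `toy_embed` stub below is a deterministic stand-in for illustration only; swap in your real model's embedding call:

```python
import numpy as np

def avg_pair_similarity(pairs, embed):
    """pairs: list of (text_a, text_b); embed: callable, text -> vector."""
    sims = []
    for a, b in pairs:
        va, vb = np.asarray(embed(a), dtype=float), np.asarray(embed(b), dtype=float)
        sims.append(float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))))
    return sum(sims) / len(sims)

# Toy stand-in embedder (length + word count) -- replace with a real model.
def toy_embed(text):
    return [len(text), text.count(" ") + 1]

pairs = [("I want a refund", "return request")] * 10  # your 10 real pairs here
score = avg_pair_similarity(pairs, toy_embed)
print(f"avg similarity: {score:.2f}")
```

With a real model, gate on `score >= 0.85` in CI so a model swap that degrades paraphrase matching fails loudly.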
## 6. Gotchas

- Mixing query/doc task types between indexing and querying
- Embedding one long text without chunking (model token limits are typically 512–2048)
- Not re-embedding the whole corpus when the embedding model changes
- Missing normalisation (inner-product search over HNSW assumes unit-length vectors)
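The normalisation gotcha in two lines: once vectors are scaled to unit length, a plain dot product *is* the cosine similarity, which is what inner-product indexes rely on. A minimal sketch (`normalise` is an illustrative helper):

```python
import numpy as np

def normalise(v):
    # Scale to unit length so dot product == cosine similarity.
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a, b = normalise([3.0, 4.0]), normalise([4.0, 3.0])
print(round(float(a @ b), 6))  # 0.96 -- the cosine of the two raw vectors
```

Some APIs return pre-normalised vectors and some don't, so check once at ingest rather than assuming.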
## Closing
Embedding quality decides 60–70% of retrieval accuracy. An hour spent on model choice beats a day of prompt tuning.
## Next
- 03-pgvector-hnsw