Part 3: Embeddings and vector search

Keyword-based text search only matches exact words. A query about "enrolling" does not find a document that says "register", even though the two mean the same thing. Embeddings address this by representing text as dense vectors, so that texts with similar meanings end up close together even when the exact words differ.
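To make the limitation concrete, here is a minimal sketch using only the standard library. The two sentences and the `tokens` helper are made up for illustration; real keyword search adds weighting and ranking, but it has the same blind spot.

```python
import re

def tokens(text):
    # Lowercase and keep alphabetic runs only; no stemming or synonym handling.
    return set(re.findall(r"[a-z]+", text.lower()))

# Hypothetical query and document, not taken from the workshop dataset.
toy_query = "How do I go about enrolling?"
toy_doc = "You can register until Friday."

shared = tokens(toy_query) & tokens(toy_doc)
print(shared)  # empty set: no word appears in both texts
```

With no shared words, any score based purely on word overlap is zero for this pair.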

Embeddings explained

An embedding converts a document into a fixed-length array of numbers (a vector), typically with 16 to 768 dimensions. Documents about similar topics produce similar vectors.

The key property is context. Words that never co-occur in the same document can still end up close in embedding space. That happens when they appear in similar contexts.
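The effect can be seen on a tiny example. The sketch below uses a made-up four-document corpus and plain `np.linalg.svd` (an assumption chosen for determinism, not the workshop's pipeline): "enroll" and "register" never share a document, but both appear next to "course" and "today".

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "enroll" and "register" never co-occur.
toy_docs = [
    "enroll course today",
    "register course today",
    "cat eats fish",
    "dog eats fish",
]

vec = CountVectorizer()
X_words = vec.fit_transform(toy_docs).toarray().astype(float)  # documents x terms
vocab = vec.vocabulary_                                        # term -> column index

# Full SVD, then keep the two strongest latent dimensions.
_, _, Vt = np.linalg.svd(X_words, full_matrices=False)
term_vecs = Vt[:2].T  # one 2-dimensional vector per term

def cos(a, b):
    # Cosine similarity between two 1-D vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_related = cos(term_vecs[vocab["enroll"]], term_vecs[vocab["register"]])
sim_unrelated = cos(term_vecs[vocab["enroll"]], term_vecs[vocab["cat"]])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

On this corpus the first similarity should come out close to 1 and the second close to 0: the two words share their entire context, so the decomposition places them in the same direction even though they never appear together.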

SVD for embeddings

Singular Value Decomposition (SVD) compresses the sparse TF-IDF matrix into a dense matrix with fewer dimensions. It does not capture word order, because TF-IDF does not preserve word order in the first place. It does capture latent topics: groups of words that tend to appear together.

Use the text field matrix we already computed:

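from sklearn.decomposition import TruncatedSVD

X = matrices['text']        # sparse TF-IDF matrix for the text field
cv = transformers['text']   # the fitted vectorizer that produced X

# Reduce each document to 16 dense dimensions
svd = TruncatedSVD(n_components=16)
X_emb = svd.fit_transform(X)

X_emb[0]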

Each document is now a 16-dimensional vector. It replaces a sparse vector with thousands of dimensions.
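To see the change in shape without the workshop data, here is a self-contained sketch. The corpus and the `_toy` names are invented for illustration, and two components are used only because the corpus is so small.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the workshop documents.
toy_docs = [
    "Can I still register for the course after it starts",
    "Registration stays open until the course ends",
    "Late registration for the course is allowed",
    "Homework deadlines are listed in the calendar",
    "Submit homework before the deadline in the calendar",
    "The calendar shows every homework deadline",
]

tfidf_toy = TfidfVectorizer()
X_toy = tfidf_toy.fit_transform(toy_docs)   # sparse: one column per vocabulary term

svd_toy = TruncatedSVD(n_components=2, random_state=0)
X_toy_emb = svd_toy.fit_transform(X_toy)    # dense NumPy array: one row per document

print(X_toy.shape, sparse.issparse(X_toy))
print(X_toy_emb.shape, sparse.issparse(X_toy_emb))
```

The first line prints the sparse matrix shape (documents by vocabulary size), the second the dense shape (documents by `n_components`).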

Transform the query the same way:

query = 'I just signed up. Is it too late to join the course?'

Q = cv.transform([query])
Q_emb = svd.transform(Q)
Q_emb[0]

Compute similarity between the query and all documents:

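import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

score = cosine_similarity(X_emb, Q_emb).flatten()
idx = np.argsort(-score)[:10]  # positions of the 10 highest scores
list(df.iloc[idx].text)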

The results include documents that express the same idea in different words than the query, not just exact keyword matches. The n_components parameter controls a trade-off: more dimensions preserve more detail from the TF-IDF matrix but also keep more noise, while fewer dimensions generalize better but lose specificity. Sixteen is a reasonable starting point for a dataset this size.
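One rough way to explore this trade-off is to check how much of the variance in the TF-IDF matrix each setting keeps, using the `explained_variance_ratio_` attribute of a fitted `TruncatedSVD`. The sketch below runs on an invented corpus with hypothetical names. Treat the number as a proxy only: it measures reconstruction, not search quality.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; with real data, try larger values such as 8, 16, 32.
toy_docs = [
    "Can I still register for the course after it starts",
    "Registration stays open until the course ends",
    "Late registration for the course is allowed",
    "Homework deadlines are listed in the calendar",
    "Submit homework before the deadline in the calendar",
    "The calendar shows every homework deadline",
    "Certificates are issued after the final project",
    "The final project is graded by peer review",
]
X_toy = TfidfVectorizer().fit_transform(toy_docs)

kept_by_k = {}
for k in (1, 2, 4):
    svd_k = TruncatedSVD(n_components=k, random_state=0).fit(X_toy)
    # Share of the total variance retained by the k components.
    kept_by_k[k] = float(svd_k.explained_variance_ratio_.sum())

for k, kept in kept_by_k.items():
    print(k, round(kept, 3))
```

The retained share typically grows with k. In practice, the setting is better judged by inspecting search results for a handful of representative queries.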

NMF as an alternative

SVD produces both positive and negative values, which makes the dimensions hard to interpret. Non-Negative Matrix Factorization (NMF) produces only non-negative values. Each dimension can then be read as a topic, and its value shows how strongly the document relates to that topic.

from sklearn.decomposition import NMF

nmf = NMF(n_components=16)
X_emb = nmf.fit_transform(X)
X_emb[0]

Transform and search with the query:

Q = cv.transform([query])
Q_emb = nmf.transform(Q)
Q_emb[0]

score = cosine_similarity(X_emb, Q_emb).flatten()
idx = np.argsort(-score)[:10]  # positions of the 10 highest scores
list(df.iloc[idx].text)

Both SVD and NMF operate on bag-of-words representations, so they still ignore word order. For search this is often good enough, and the computational cost is low. When word order matters, for example distinguishing "dog bites man" from "man bites dog," BERT embeddings handle that. We cover them in Part 4: BERT embeddings.
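The word-order limitation is easy to verify. In this small sketch (variable names are illustrative), a bag-of-words vectorizer assigns both sentences exactly the same vector, so anything built on top of it, including SVD and NMF, cannot tell them apart.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

pair = ["dog bites man", "man bites dog"]

# Each row counts how often each vocabulary word occurs; position is discarded.
bow = CountVectorizer().fit_transform(pair).toarray()

same = bool(np.array_equal(bow[0], bow[1]))
print(same)  # True: both sentences contain the same words the same number of times
```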
