Part 4: BERT embeddings
SVD and NMF embeddings ignore word order entirely. BERT, a transformer model, uses positional embeddings and self-attention to produce representations that capture both meaning and structure. Two sentences with the same words in a different order produce different embeddings.
Setup
Install the Hugging Face libraries:
uv add transformers tqdm torch
Load the tokenizer and model:
import torch
from transformers import BertModel, BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()
The tokenizer converts text into integer token IDs. The model converts those tokens into contextual embeddings, where each token's representation depends on the surrounding tokens, not just the word.
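To make the ID mapping and padding concrete, here is a toy sketch with a hypothetical six-entry vocabulary (BERT's real vocabulary has about 30,000 entries, and its real special-token IDs are 101 for [CLS] and 102 for [SEP]):

```python
# Toy illustration only: a made-up vocabulary, not BERT's real one.
vocab = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "the": 1, "course": 2, "finishes": 3}

def toy_encode(words, max_len):
    # Wrap the words in [CLS] ... [SEP], then pad to a fixed length.
    ids = [vocab["[CLS]"]] + [vocab[w] for w in words] + [vocab["[SEP]"]]
    mask = [1] * len(ids)                 # 1 = real token, 0 = padding
    while len(ids) < max_len:
        ids.append(vocab["[PAD]"])
        mask.append(0)
    return ids, mask

ids, mask = toy_encode(["the", "course", "finishes"], max_len=7)
# ids  -> [101, 1, 2, 3, 102, 0, 0]
# mask -> [1, 1, 1, 1, 1, 0, 0]
```

The attention mask is what lets the model ignore padding positions when sentences in a batch have different lengths.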
Compute embeddings for a small example
Tokenize a pair of sentences:
texts = [
"Yes, we will keep all the materials after the course finishes.",
"You can follow the course at your own pace after it finishes",
]
encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
Run the model and take the mean of the last hidden state across all tokens. This produces one vector per sentence:
with torch.no_grad():
    outputs = model(**encoded_input)
hidden_states = outputs.last_hidden_state
sentence_embeddings = hidden_states.mean(dim=1)
sentence_embeddings.shape
The result is a tensor with shape (2, 768) - two sentences, each represented as a 768-dimensional vector.
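The mean over `dim=1` averages across the token axis, collapsing (sentences, tokens, hidden_size) into (sentences, hidden_size). A small NumPy sketch with made-up dimensions shows the same operation:

```python
import numpy as np

# 2 "sentences", 4 tokens each, 3-dimensional "hidden states"
# (BERT's real shape would be (2, seq_len, 768)).
hidden = np.arange(24, dtype=float).reshape(2, 4, 3)

# Average over the token axis, exactly like hidden_states.mean(dim=1).
sentence_vecs = hidden.mean(axis=1)
# sentence_vecs.shape -> (2, 3): one vector per sentence
```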
Batch computation for all documents
Computing embeddings one at a time is slow. Process documents in batches:
def make_batches(seq, n):
    result = []
    for i in range(0, len(seq), n):
        batch = seq[i:i+n]
        result.append(batch)
    return result
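A quick check of the batching logic (the helper is repeated here in condensed form so the snippet runs standalone): the last batch simply keeps whatever is left over.

```python
def make_batches(seq, n):
    # Same helper as above, condensed into a list comprehension.
    return [seq[i:i+n] for i in range(0, len(seq), n)]

batches = make_batches(list(range(10)), 4)
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```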
Compute embeddings for every FAQ document:
import numpy as np
from tqdm.auto import tqdm
texts = df['text'].tolist()
text_batches = make_batches(texts, 8)
all_embeddings = []
for batch in tqdm(text_batches):
    encoded_input = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**encoded_input)
    hidden_states = outputs.last_hidden_state
    batch_embeddings = hidden_states.mean(dim=1)
    batch_embeddings_np = batch_embeddings.cpu().numpy()
    all_embeddings.append(batch_embeddings_np)
X_emb = np.vstack(all_embeddings)
A batch size of 8 fits comfortably on most GPUs. On CPU, use 4 or lower to avoid running out of memory.
Search with BERT embeddings
Search works the same way as with SVD or NMF embeddings - compute cosine similarity between the query embedding and all document embeddings:
from sklearn.metrics.pairwise import cosine_similarity

query = "I just signed up. Is it too late to join the course?"
encoded_q = tokenizer([query], padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**encoded_q)
Q_emb = outputs.last_hidden_state.mean(dim=1).numpy()
score = cosine_similarity(X_emb, Q_emb).flatten()
idx = np.argsort(-score)[:10]
list(df.loc[idx].text)
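Under the hood, cosine similarity is just a normalized dot product. A NumPy sketch on small random data (5 stand-in documents and one query, 768-dimensional like BERT's vectors) shows an equivalent computation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 768))   # stand-in document embeddings
Q = rng.normal(size=(1, 768))   # stand-in query embedding

# Normalize rows to unit length, then dot products give cosine similarity.
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
Q_norm = Q / np.linalg.norm(Q, axis=1, keepdims=True)
score = (X_norm @ Q_norm.T).flatten()   # one similarity per document

top = np.argsort(-score)[:3]            # indices of the 3 best matches
```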
BERT embeddings capture semantic similarity better than SVD and NMF because they account for word order and context. The trade-off is speed: computing BERT embeddings takes seconds per batch, while SVD takes milliseconds for the entire dataset. For a small FAQ like ours the difference is manageable, but for larger collections you would precompute the embeddings and store them in a vector database.
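The simplest form of precomputation is saving the embedding matrix to disk with NumPy; a vector database adds fast indexing on top of the same idea. A minimal sketch, using a random stand-in array in place of the real `X_emb`:

```python
import os
import tempfile

import numpy as np

# Stand-in for the precomputed embedding matrix from the batch loop.
X_emb = np.random.default_rng(0).normal(size=(10, 768))

path = os.path.join(tempfile.gettempdir(), "faq_embeddings.npy")
np.save(path, X_emb)       # run once, after computing the embeddings
X_loaded = np.load(path)   # run at search time instead of re-embedding
```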
Continue with Where to go from here for practical tools and next steps.