
Part 4: BERT embeddings

SVD and NMF embeddings ignore word order entirely. BERT, a transformer model, encodes word order through positional embeddings and self-attention, producing embeddings that capture both meaning and structure. Two sentences containing the same words in a different order therefore produce different embeddings.

Setup

Install the Hugging Face transformers library, along with PyTorch and tqdm:

uv add transformers tqdm torch

Load the tokenizer and model:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # inference mode: disables dropout

The tokenizer converts text into integer token IDs. The model converts those tokens into contextual embeddings, where each token's representation depends on the surrounding tokens, not just the word.
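To see what this means in practice, tokenize a short phrase and inspect the pieces; the exact subword splits depend on the bert-base-uncased vocabulary:

tokens = tokenizer.tokenize("the course has finished")
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)     # WordPiece subword tokens
print(token_ids)  # the integer IDs the model receives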

Compute embeddings for a small example

Tokenize a pair of sentences:

texts = [
    "Yes, we will keep all the materials after the course finishes.",
    "You can follow the course at your own pace after it finishes",
]

encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
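The result is a dictionary of tensors. Padding makes both sentences the same length, and the attention mask records which positions hold real tokens (1) versus padding (0):

encoded_input['input_ids'].shape   # (2, length of the longest sentence in tokens)
encoded_input['attention_mask']    # 1 for real tokens, 0 for padding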

Run the model and take the mean of the last hidden state across all tokens. This produces one vector per sentence:

with torch.no_grad():
    outputs = model(**encoded_input)
    hidden_states = outputs.last_hidden_state

sentence_embeddings = hidden_states.mean(dim=1)
sentence_embeddings.shape

The result is a tensor with shape (2, 768) - two sentences, each represented as a 768-dimensional vector.
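One subtlety: padding=True pads the shorter sentence, and the plain mean above averages over those padding positions too. A common refinement, shown here as an optional sketch, weights the mean by the attention mask so only real tokens count:

mask = encoded_input['attention_mask'].unsqueeze(-1)  # (2, seq_len, 1)
summed = (hidden_states * mask).sum(dim=1)            # sum over real tokens only
counts = mask.sum(dim=1)                              # real-token count per sentence
masked_embeddings = summed / counts                   # (2, 768)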

Batch computation for all documents

Computing embeddings one at a time is slow. Process documents in batches:

def make_batches(seq, n):
    result = []
    for i in range(0, len(seq), n):
        batch = seq[i:i+n]
        result.append(batch)
    return result
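A quick check with a list that does not divide evenly shows that the final batch is simply shorter:

make_batches([1, 2, 3, 4, 5], 2)
# [[1, 2], [3, 4], [5]]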

Compute embeddings for every FAQ document:

import numpy as np
from tqdm.auto import tqdm

texts = df['text'].tolist()
text_batches = make_batches(texts, 8)

all_embeddings = []

for batch in tqdm(text_batches):
    encoded_input = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')

    with torch.no_grad():
        outputs = model(**encoded_input)
        hidden_states = outputs.last_hidden_state

        batch_embeddings = hidden_states.mean(dim=1)
        batch_embeddings_np = batch_embeddings.cpu().numpy()
        all_embeddings.append(batch_embeddings_np)

X_emb = np.vstack(all_embeddings)

A batch size of 8 fits comfortably on most GPUs. On CPU, use 4 or lower to avoid running out of memory.
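If you do have a GPU, a minimal variant of the same loop moves the model and each batch onto the device, then copies the results back for NumPy. This is a sketch; the rest of this section assumes the CPU setup above:

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

all_embeddings = []

for batch in tqdm(text_batches):
    encoded_input = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}

    with torch.no_grad():
        hidden_states = model(**encoded_input).last_hidden_state
        all_embeddings.append(hidden_states.mean(dim=1).cpu().numpy())

X_emb = np.vstack(all_embeddings)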

Search with BERT embeddings

Search works the same way as with SVD or NMF embeddings - compute cosine similarity between the query embedding and all document embeddings:

from sklearn.metrics.pairwise import cosine_similarity

query = "I just signed up. Is it too late to join the course?"

encoded_q = tokenizer([query], padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    outputs = model(**encoded_q)
    Q_emb = outputs.last_hidden_state.mean(dim=1).cpu().numpy()

score = cosine_similarity(X_emb, Q_emb).flatten()
idx = np.argsort(-score)[:10]
list(df.loc[idx].text)

BERT embeddings capture semantic similarity better than SVD and NMF because they account for word order and context. The trade-off is speed: computing BERT embeddings takes seconds per batch, while SVD takes milliseconds for the entire dataset. For a small FAQ like ours the difference is manageable, but for larger collections you would precompute the embeddings and store them in a vector database.
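Storing precomputed embeddings can be as simple as saving the array with NumPy (the filename here is just an illustration); a vector database adds indexing and approximate search on top of the same idea:

np.save("faq_embeddings.npy", X_emb)   # filename is illustrative
X_emb = np.load("faq_embeddings.npy")  # reload later without rerunning BERT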

Continue with Where to go from here for practical tools and next steps.
