Part 1: Text search and TF-IDF

Now that the dataset is loaded, we build the first version of the search engine. We represent documents and queries as vectors, then measure how close they are. This is the basic idea behind information retrieval.

Information retrieval basics

We need a way to compare a query against every document and rank by relevance.

The standard approach:

  • Represent each document as a vector of numbers
  • Represent the query in the same vector space
  • Compute similarity between the query vector and each document vector

The simplest representation is bag of words. Each dimension corresponds to a word, and its value is how many times that word appears. Order does not matter: "cat sat on mat" and "mat sat on cat" produce the same vector.
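The order-independence claim can be checked in a few lines. This is a minimal sketch in plain Python that splits on whitespace, which is an assumption made for illustration (CountVectorizer tokenizes differently). The helper name bag_of_words is hypothetical.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in text (order is ignored)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# A fixed, sorted vocabulary so every text maps onto the same dimensions
vocabulary = sorted({"cat", "sat", "on", "mat"})

v1 = bag_of_words("cat sat on mat", vocabulary)
v2 = bag_of_words("mat sat on cat", vocabulary)
v3 = bag_of_words("cat cat sat", vocabulary)

print(vocabulary)
print(v1 == v2)  # same words, different order
print(v3)        # repeated words raise the count in that dimension
```

The two reordered sentences map to the same vector, while the third differs because its word counts differ.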

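TF-IDF improves on raw counts by weighting each word by how informative it is. A word that appears in nearly every document gets a low weight. A word that appears in only a few documents gets a high weight. The weight is the product of term frequency (how often the word occurs in the document) and inverse document frequency (a value that shrinks as more documents contain the word).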
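To make the weighting concrete, here is a rough sketch that computes TF-IDF by hand. It assumes whitespace tokenization and the smoothed IDF variant ln((1 + n) / (1 + df)) + 1, which is the default in scikit-learn's TfidfVectorizer; other sources define IDF slightly differently. It also skips the length normalization that TfidfVectorizer applies afterwards. The toy documents and the names doc_freq, idf, and tfidf are made up for this example.

```python
import math
from collections import Counter

docs = [
    "course starts in january",
    "course prerequisites on github",
    "submit homework before the course",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def idf(term):
    """Smoothed inverse document frequency (assumed variant, see lead-in)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    """Weight each word in one document by term frequency times IDF."""
    tf = Counter(tokens)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items()):
    print(f"{term}: {weight:.3f}")
```

In this toy corpus "course" occurs in every document, so it ends up with the lowest weight in the first document, while "january" occurs only once in the corpus and is weighted higher.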

Bag of words with CountVectorizer

Start with a small example to see the mechanics:

docs_example = [
    "Course starts on 15th Jan 2024",
    "Prerequisites listed on GitHub",
    "Submit homeworks after start date",
    "Registration not required for participation",
    "Setup Google Cloud and Python before course",
]

We use CountVectorizer to turn these documents into a term-document matrix. Passing stop_words='english' drops common English words such as "on" and "and":

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(docs_example)

names = cv.get_feature_names_out()

# Transpose so that rows are words and columns are documents
df_docs = pd.DataFrame(X.toarray(), columns=names).T
df_docs

Each column is a document, each row is a word, and the values are raw counts. This is bag of words: we ignore word order and only track which words appear and how often.

TF-IDF with TfidfVectorizer

Replace the count vectorizer with TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer

cv = TfidfVectorizer(stop_words='english')
X = cv.fit_transform(docs_example)

names = cv.get_feature_names_out()

df_docs = pd.DataFrame(X.toarray(), columns=names).T
df_docs.round(2)

The numbers are now weighted: common words score lower, distinctive words score higher.

Query-document similarity

To search, represent the query in the same vector space using the same vectorizer:

query = "Do I need to know python to sign up for the January course?"

q = cv.transform([query])
q.toarray()
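One detail worth keeping in mind: transform reuses the vocabulary learned during fit, so query words the vectorizer has never seen are silently ignored. In the example above, "January" in the query does not match "Jan" in the first document, because they are different tokens. The sketch below imitates that behavior in plain Python with whitespace tokenization. The helpers fit_vocabulary and transform are hypothetical stand-ins, not scikit-learn functions.

```python
def fit_vocabulary(docs):
    """Map each distinct word in docs to a column index."""
    words = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(text, vocabulary):
    """Count vocabulary words in text; unknown words are skipped."""
    vec = [0] * len(vocabulary)
    for w in text.lower().split():
        if w in vocabulary:
            vec[vocabulary[w]] += 1
    return vec

docs = ["course starts in january", "setup python before course"]
vocabulary = fit_vocabulary(docs)

q_vec = transform("do i need python for the january course", vocabulary)

print(sorted(vocabulary))
print(q_vec)
```

The query vector has the same number of dimensions as the document vectors, and only the words known from the documents contribute to it.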

The dot product between the query vector and a document vector gives a relevance score. The more terms they share, and the higher the weights of those terms, the higher the score:

X.dot(q.T).toarray()

In practice we use cosine similarity, which divides the dot product by the lengths of both vectors so that long documents are not favored. Because TfidfVectorizer L2-normalizes its output by default, dot product and cosine similarity produce the same results here.

Compute cosine similarity:

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(X, q)
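The claim that dot product and cosine similarity coincide for normalized vectors can be verified by hand. The sketch below uses plain Python lists and made-up vectors. The helper names dot, l2_normalize, and cosine are chosen for this example only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    denom = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / denom if denom else 0.0

doc_vec = [3.0, 0.0, 4.0]    # length 5
query_vec = [1.0, 2.0, 2.0]  # length 3

raw_dot = dot(doc_vec, query_vec)  # depends on vector length
cos = cosine(doc_vec, query_vec)   # length-independent
unit_dot = dot(l2_normalize(doc_vec), l2_normalize(query_vec))

print(raw_dot, cos, unit_dot)
```

The raw dot product grows with vector length, but once both vectors are scaled to unit length, the dot product and the cosine similarity agree up to floating-point error.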

Vectorizing all the document fields

The FAQ documents have three text fields: section, question, and text.

Vectorize each one separately so we can weight them differently later:

fields = ['section', 'question', 'text']
transformers = {}
matrices = {}

for field in fields:
    cv = TfidfVectorizer(stop_words='english', min_df=3)
    X = cv.fit_transform(df[field])

    transformers[field] = cv
    matrices[field] = X

The min_df=3 parameter drops words that appear in fewer than three documents. This removes typos and rare terms that do not help with matching.
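The effect of a document-frequency cutoff is easy to see on a toy corpus. The following sketch mimics the idea in plain Python with whitespace tokenization. It is not how scikit-learn implements min_df, and the function filter_by_min_df and the sample sentences are invented for illustration.

```python
from collections import Counter

def filter_by_min_df(docs, min_df):
    """Return the sorted words that occur in at least min_df documents."""
    doc_freq = Counter()
    for d in docs:
        # set() counts each word once per document
        doc_freq.update(set(d.lower().split()))
    return sorted(w for w, n in doc_freq.items() if n >= min_df)

docs = [
    "course starts in january",
    "the course uses python",
    "install python before the course",
    "corse materials on github",  # "corse" is a deliberate typo
]

kept_3 = filter_by_min_df(docs, min_df=3)
kept_2 = filter_by_min_df(docs, min_df=2)

print(kept_3)
print(kept_2)
```

A higher cutoff shrinks the vocabulary quickly. The typo is dropped, but so are legitimate rare words, which is the trade-off when choosing min_df.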

Basic search

Search using just the text field:

query = "I just signed up. Is it too late to join the course?"

q = transformers['text'].transform([query])
score = cosine_similarity(matrices['text'], q).flatten()

Restrict the results to the data-engineering course by setting the scores of all other documents to zero:

import numpy as np

mask = (df.course == 'data-engineering-zoomcamp').values
score = score * mask

Get the top 10 results:

idx = np.argsort(-score)[:10]
df.iloc[idx].text
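One caveat with multiplying by the mask: filtered-out documents keep a score of zero, and taking the top 10 by sorted score still returns them when fewer than 10 documents have a positive score. A possible refinement is to drop zero-score entries before taking the top k. The sketch below shows the idea in plain Python rather than NumPy, with made-up scores. The function top_k is a hypothetical helper.

```python
def top_k(scores, mask, k):
    """Indices of the k best documents that pass the mask and match the query."""
    masked = [s if keep else 0.0 for s, keep in zip(scores, mask)]
    ranked = sorted(range(len(masked)), key=lambda i: masked[i], reverse=True)
    # Keep only documents with a positive score, then cut to k
    return [i for i in ranked if masked[i] > 0][:k]

scores = [0.9, 0.2, 0.0, 0.5, 0.7]
mask = [False, True, True, True, False]  # True = document is in the selected course

best = top_k(scores, mask, k=3)
print(best)
```

Here only two documents are both in the selected course and relevant to the query, so the function returns two indices even though three were requested.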

This works, but we are only using one field. The question field is often a better match because FAQ questions use the same language as user queries. Part 2 (Boosting, filtering, and the TextSearch class) fixes that by combining all three fields with boosting.
