Part 1: Text search and TF-IDF
Now that the dataset is loaded, we build the first version of the search engine. We represent documents and queries as vectors, then measure how close they are. This is the basic idea behind information retrieval.
Information retrieval basics
We need a way to compare a query against every document and rank by relevance.
The standard approach:
- Represent each document as a vector of numbers
- Represent the query in the same vector space
- Compute similarity between the query vector and each document vector
The simplest representation is bag of words. Each dimension corresponds to a word, and its value is how many times that word appears. Order does not matter: "cat sat on mat" and "mat sat on cat" produce the same vector.
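The order-insensitivity claim can be checked by hand. The following is a minimal sketch using only the standard library; the `bag_of_words` helper is a hypothetical name introduced for illustration, and it splits on whitespace, which is simpler than the tokenization scikit-learn performs.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(text.lower().split())

a = bag_of_words("cat sat on mat")
b = bag_of_words("mat sat on cat")

# Fix an ordering of the vocabulary so each word gets its own dimension.
vocab = sorted(set(a) | set(b))
vec_a = [a[word] for word in vocab]
vec_b = [b[word] for word in vocab]

print(vocab)   # ['cat', 'mat', 'on', 'sat']
print(vec_a)   # [1, 1, 1, 1]
print(vec_b)   # [1, 1, 1, 1]
```

Both sentences map to the same point in the vector space, so any similarity measure treats them as identical.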
TF-IDF improves on raw counts. A word that appears in every document gets a low weight. A word that appears in only a few documents gets a high weight. The weight multiplies term frequency (how often the word occurs in the document) by inverse document frequency (a value that shrinks as more documents contain the word).
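To make the weighting concrete, here is a rough sketch of the textbook variant, tf × log(N / df), on a made-up, pre-tokenized corpus. It is not what scikit-learn computes exactly: `TfidfVectorizer` by default uses a smoothed idf, ln((1 + N) / (1 + df)) + 1, and then L2-normalizes each document vector, so its numbers will differ. The `tf_idf` helper and the toy corpus are assumptions for illustration.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration (already tokenized and lowercased).
docs = [
    ["course", "starts", "january"],
    ["course", "prerequisites", "github"],
    ["course", "python", "setup"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Textbook weighting: raw term count times log(N / df)."""
    counts = Counter(doc)
    return {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}

weights = tf_idf(docs[2])
print(weights)
```

In this sketch "course" appears in all three documents, so its weight is zero, while "python" appears in only one and gets the largest weight available, log(3).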
Bag of words with CountVectorizer
Start with a small example to see the mechanics:
docs_example = [
"Course starts on 15th Jan 2024",
"Prerequisites listed on GitHub",
"Submit homeworks after start date",
"Registration not required for participation",
"Setup Google Cloud and Python before course",
]
We use CountVectorizer to turn these documents into a term-document matrix:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(docs_example)
names = cv.get_feature_names_out()
df_docs = pd.DataFrame(X.toarray(), columns=names).T
df_docs
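Each column is a document, each row is a word, and the values are raw counts. Because we passed stop_words='english', common words such as "on" and "for" are dropped from the vocabulary. This is bag of words: we ignore word order and only track which words appear and how often.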
TF-IDF with TfidfVectorizer
Replace the count vectorizer with TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(stop_words='english')
X = cv.fit_transform(docs_example)
names = cv.get_feature_names_out()
df_docs = pd.DataFrame(X.toarray(), columns=names).T
df_docs.round(2)
The numbers are now weighted: common words score lower, distinctive words score higher.
Query-document similarity
To search, represent the query in the same vector space using the same vectorizer:
query = "Do I need to know python to sign up for the January course?"
q = cv.transform([query])
q.toarray()
The dot product between the query vector and a document vector gives a relevance score.
The more words they share, the higher the score:
X.dot(q.T).toarray()
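In practice we use cosine similarity, which divides the dot product by the lengths of both vectors, so long documents are not favored just for being long.
Because TfidfVectorizer L2-normalizes each row by default (norm='l2'), every vector already has length 1, so dot product and
cosine similarity produce the same results here.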
Compute cosine similarity:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(X, q)
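The equivalence between dot product and cosine similarity for unit-length vectors can be verified with a few lines of plain Python. This is a sketch on two arbitrary vectors, not on the FAQ data; the helper names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def normalize(u):
    length = norm(u)
    return [a / length for a in u]

# Two arbitrary example vectors (not taken from the dataset).
doc = [3.0, 0.0, 4.0]
query = [1.0, 2.0, 2.0]

print(dot(doc, query))                        # raw dot product, grows with vector length
print(cosine(doc, query))                     # 11 / (5 * 3)
print(dot(normalize(doc), normalize(query)))  # matches the cosine value
```

Once both vectors are scaled to length 1, the denominator in the cosine formula is 1 and only the dot product remains.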
Vectorizing all the document fields
The FAQ documents have three text fields: section, question, and text.
Vectorize each one separately so we can weight them differently later:
fields = ['section', 'question', 'text']
transformers = {}
matrices = {}
for field in fields:
    cv = TfidfVectorizer(stop_words='english', min_df=3)
    X = cv.fit_transform(df[field])
    transformers[field] = cv
    matrices[field] = X
The min_df=3 parameter drops words that appear in fewer than three documents. This removes typos and rare terms that do not help with matching.
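The effect of a document-frequency threshold can be reproduced by hand. The sketch below uses a tiny made-up corpus and plain whitespace tokenization, so it only approximates what TfidfVectorizer does internally.

```python
from collections import Counter

# Tiny made-up corpus; "pyhton" is a typo that appears once.
docs = [
    "install python first",
    "python course setup",
    "course uses python",
    "pyhton setup guide",
]
min_df = 3  # same threshold as above

# Count the number of documents containing each term (not total occurrences).
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

# Keep only terms that reach the threshold.
vocabulary = sorted(term for term, n in doc_freq.items() if n >= min_df)
print(vocabulary)
```

With only four documents the threshold is harsh: "python" is the only term that appears in three of them, so everything else, including the typo, is dropped. On a larger collection the same rule removes mostly noise.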
Basic search
Search using just the text field:
query = "I just signed up. Is it too late to join the course?"
q = transformers['text'].transform([query])
score = cosine_similarity(matrices['text'], q).flatten()
Filter to only the data-engineering course:
import numpy as np
mask = (df.course == 'data-engineering-zoomcamp').values
score = score * mask
Get the top 10 results:
idx = np.argsort(-score)[:10]
df.iloc[idx].text
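One detail of the mask-and-sort approach: multiplying by the mask sets filtered-out documents to a score of zero but does not remove them, so they can still fill the tail of the top 10 when few documents match. The sketch below, on made-up scores, shows one possible variation that keeps only documents with a positive score; it is an illustration, not part of the original pipeline.

```python
import numpy as np

# Made-up similarity scores for six documents and a course filter.
score = np.array([0.10, 0.80, 0.00, 0.45, 0.30, 0.00])
mask = np.array([True, False, True, True, False, True])

# Multiplying by the boolean mask zeroes out documents from other courses.
filtered = score * mask

# Sort by descending score, then keep only documents that still score above
# zero, so filtered-out or non-matching documents never pad the result list.
order = np.argsort(-filtered)
top = [int(i) for i in order if filtered[i] > 0][:10]
print(top)
```

Document 1 has the highest raw score but belongs to another course, so it is excluded; the result contains only documents 3 and 0, in that order.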
This search works, but it uses only one field. The question field is often a
better match because FAQ questions use the same language as user queries. We
fix that in Part 2, "Boosting, filtering, and the TextSearch class", by combining
all three fields with boosting.