Part 2: Boosting, filtering, and the TextSearch class
The basic search from the previous step uses only the text field. In practice, the question field tends to match user queries better because both are phrased as questions. Here we combine the scores from all three fields (section, question, and text), weight them, and then wrap everything into a reusable class.
Boosting the question field
Combine scores from all three fields, giving the question field a 3x boost:
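query = "I just signed up. Is it too late to join the course?"
# Per-field weights; fields not listed here default to 1.0
boost = {'question': 3.0}
score = np.zeros(len(df))
for f in fields:
    b = boost.get(f, 1.0)
    # Vectorize the query with the vectorizer fitted on this field
    q = transformers[f].transform([query])
    s = cosine_similarity(matrices[f], q).flatten()
    # Add this field's weighted similarity to the running total
    score = score + b * s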
Boosting multiplies the similarity score for a field by a weight. A 3.0 boost
on question makes a question match count three times as much as a match in
text or section. The best weight depends on the dataset. For FAQ-style
content, 3.0 is a good starting point.
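To see how a boost can change the ranking, here is a minimal sketch with made-up per-field similarity scores for three documents. The numbers and the helper name combine_scores are hypothetical, chosen only for illustration; in the actual pipeline the per-field scores come from cosine_similarity.

```python
import numpy as np

# Hypothetical cosine similarities for 3 documents, one array per field.
# These numbers are made up for illustration only.
field_scores = {
    'section': np.array([0.10, 0.00, 0.05]),
    'question': np.array([0.20, 0.40, 0.00]),
    'text': np.array([0.50, 0.10, 0.30]),
}

def combine_scores(field_scores, boost=None):
    """Sum per-field scores, multiplying each field by its boost (default 1.0)."""
    boost = boost or {}
    n_docs = len(next(iter(field_scores.values())))
    total = np.zeros(n_docs)
    for field, scores in field_scores.items():
        total = total + boost.get(field, 1.0) * scores
    return total

unboosted = combine_scores(field_scores)
boosted = combine_scores(field_scores, boost={'question': 3.0})

# Without a boost, document 0 wins on its strong text match.
# With the question field boosted, document 1 moves to the top.
print(np.argmax(unboosted), np.argmax(boosted))
```

With these toy numbers the top document changes once the question field is boosted, which is the effect the weight is meant to have on FAQ-style data.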
Keyword filtering
Apply a filter to restrict results to a specific course:
filters = {
    'course': 'data-engineering-zoomcamp',
}
for field, value in filters.items():
    mask = (df[field] == value).values
    score = score * mask
Multiplying by the mask zeroes out scores for documents that do not match the filter. The combined score accounts for all fields with boosting and respects the filter.
Get the top results:
idx = np.argsort(-score)[:10]
results = df.iloc[idx]
results.to_dict(orient='records')
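One caveat worth checking: multiplying by the mask sets the score of non-matching documents to zero but does not remove them from the ranking. If fewer documents pass the filter than the number requested, the tail of the result list is filled with zero-score documents, which can include filtered-out ones. The sketch below uses made-up scores and a hypothetical mask to show this, along with one possible way to drop those entries. It is an illustration under these assumptions, not the approach used in this tutorial.

```python
import numpy as np

# Made-up combined scores for 5 documents (illustration only).
score = np.array([0.9, 0.4, 0.7, 0.2, 0.6])
# Hypothetical filter result: only documents 1 and 3 belong to the chosen course.
mask = np.array([False, True, False, True, False])

filtered = score * mask  # non-matching documents now score 0.0

# Asking for the top 3 when only 2 documents match:
# the third slot is taken by a filtered-out document with score 0.0.
top3 = np.argsort(-filtered)[:3]

# One possible fix: keep only indices whose document passed the filter.
top3_matching = [int(i) for i in top3 if mask[i]]
print(top3_matching)
```

Whether this matters depends on the data: with a large collection and a broad filter, the top results rarely reach the zero-score region.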
The TextSearch class
Put the whole pipeline (vectorization, boosting, filtering, ranking) into a class:
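import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class TextSearch:
    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.matrices = {}
        self.vectorizers = {}

    def fit(self, records, vectorizer_params=None):
        # Use None instead of a mutable default argument
        vectorizer_params = vectorizer_params or {}
        self.df = pd.DataFrame(records)
        # Index an 'answer' column under the name 'text'
        if 'answer' in self.df.columns:
            self.df = self.df.rename(columns={'answer': 'text'})
        # Fit one TF-IDF vectorizer and matrix per text field
        for f in self.text_fields:
            cv = TfidfVectorizer(**vectorizer_params)
            X = cv.fit_transform(self.df[f])
            self.matrices[f] = X
            self.vectorizers[f] = cv

    def search(self, query, n_results=10, boost=None, filters=None):
        boost = boost or {}
        filters = filters or {}
        score = np.zeros(len(self.df))
        # Sum the boosted cosine similarities across all text fields
        for f in self.text_fields:
            b = boost.get(f, 1.0)
            q = self.vectorizers[f].transform([query])
            s = cosine_similarity(self.matrices[f], q).flatten()
            score = score + b * s
        # Zero out documents that fail any filter
        for field, value in filters.items():
            mask = (self.df[field] == value).values
            score = score * mask
        # Indices of the highest-scoring documents, best first
        idx = np.argsort(-score)[:n_results]
        results = self.df.iloc[idx]
        return results.to_dict(orient='records')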
Use it:
index = TextSearch(text_fields=['section', 'question', 'text'])
index.fit(documents)
index.search(
    query='I just signed up. Is it too late to join the course?',
    n_results=5,
    boost={'question': 3.0},
    filters={'course': 'data-engineering-zoomcamp'},
)
This class became the minsearch library, a lightweight text search engine for Python.
You can install the production version with uv add minsearch.
Text search handles keyword matching, but it has a limitation: it only finds exact word matches. A query about "joining" will not match a document that says "enroll" because the words are different. We solve that with embeddings in Part 3: Embeddings and vector search.
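The exact-match limitation can be seen without any vectorizer at all. The sketch below assumes a simplified tokenizer (lowercase, words of two or more characters, similar in spirit to the default of TfidfVectorizer) and a hypothetical helper named shared_tokens; the example sentences are made up.

```python
import re

def tokenize(text):
    """Lowercase and split into words of 2+ characters (a simplifying assumption)."""
    return set(re.findall(r"\b\w\w+\b", text.lower()))

def shared_tokens(query, document):
    """Return the words that appear in both the query and the document."""
    return tokenize(query) & tokenize(document)

query = "joining late"
doc_exact = "Joining late is fine."
doc_synonym = "You can still enroll after week one."

# The first document shares words with the query; the second shares none,
# even though it answers the same question.
print(shared_tokens(query, doc_exact))
print(shared_tokens(query, doc_synonym))
```

With no shared words, the TF-IDF cosine similarity for the second document would be zero, so it could never rank for this query.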