Part 2: Boosting, filtering, and the TextSearch class

The basic search from the previous step only uses the text field. In practice, the question field tends to match user queries better because both are phrased as questions. We combine all three fields and weight them, then wrap everything into a reusable class.

Boosting the question field

Combine scores from all three fields, giving the question field a 3x boost:

query = "I just signed up. Is it too late to join the course?"

boost = {'question': 3.0}

# fields, transformers, and matrices come from the previous step
score = np.zeros(len(df))

for f in fields:
    b = boost.get(f, 1.0)  # fields without an explicit boost get weight 1.0
    q = transformers[f].transform([query])
    s = cosine_similarity(matrices[f], q).flatten()
    score = score + b * s

Boosting multiplies the similarity score for a field by a weight. A 3.0 boost on question means a match in the question field counts three times as much as a match in text or section. The best weight depends on the dataset, and 3.0 works well for FAQ-style content.
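To see the effect in isolation, here is a toy sketch with hypothetical per-field similarity scores for two documents (the numbers are made up, just to illustrate how the weighted sum changes the ranking):

```python
import numpy as np

# Hypothetical similarity scores: doc 1 matches the question field better,
# doc 0 matches the text field better.
sim_question = np.array([0.2, 0.6])
sim_text = np.array([0.5, 0.1])

plain = sim_question + sim_text           # ~[0.7, 0.7]: effectively a tie
boosted = 3.0 * sim_question + sim_text   # ~[1.1, 1.9]: doc 1 now wins

print(np.argmax(boosted))  # 1 -- the question match dominates
```

Without the boost the two documents are tied; with a 3x boost on question, the document that matches the question field clearly ranks first.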

Keyword filtering

Apply a filter to restrict results to a specific course:

filters = {
    'course': 'data-engineering-zoomcamp',
}

for field, value in filters.items():
    mask = (df[field] == value).values  # boolean array: True where the row matches
    score = score * mask

Multiplying by the mask zeroes out scores for documents that do not match the filter. The result is a combined score that accounts for all fields with boosting and respects the filter.
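A minimal sketch of the masking trick, using made-up scores and course labels for three documents:

```python
import numpy as np
import pandas as pd

# Hypothetical documents: two from one course, one from another.
df = pd.DataFrame({'course': ['data-engineering-zoomcamp',
                              'machine-learning-zoomcamp',
                              'data-engineering-zoomcamp']})
score = np.array([0.9, 0.8, 0.3])

mask = (df['course'] == 'data-engineering-zoomcamp').values
score = score * mask  # booleans act as 0/1, so non-matching docs drop to 0.0

print(score)  # [0.9 0.  0.3]
```

The second document scored well, but since it belongs to a different course its score is zeroed out and it can never appear in the results.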

Get the top results:

idx = np.argsort(-score)[:10]
results = df.iloc[idx]
results.to_dict(orient='records')
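The negation inside argsort is what makes this a descending sort. A small sketch with made-up scores:

```python
import numpy as np

score = np.array([0.2, 0.9, 0.0, 0.5])

# np.argsort sorts ascending, so negate the scores to rank highest first
idx = np.argsort(-score)[:2]

print(idx)  # [1 3] -- the positions of the two highest scores
```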

The TextSearch class

Put the whole pipeline (vectorization, boosting, filtering, ranking) into a class:

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class TextSearch:

    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.matrices = {}
        self.vectorizers = {}

    def fit(self, records, vectorizer_params={}):
        self.df = pd.DataFrame(records)
        if 'answer' in self.df.columns:
            self.df = self.df.rename(columns={'answer': 'text'})

        # fit one TF-IDF vectorizer per text field
        for f in self.text_fields:
            cv = TfidfVectorizer(**vectorizer_params)
            X = cv.fit_transform(self.df[f])
            self.matrices[f] = X
            self.vectorizers[f] = cv

    def search(self, query, n_results=10, boost={}, filters={}):
        score = np.zeros(len(self.df))

        # boosted cosine similarity, summed over all fields
        for f in self.text_fields:
            b = boost.get(f, 1.0)
            q = self.vectorizers[f].transform([query])
            s = cosine_similarity(self.matrices[f], q).flatten()
            score = score + b * s

        # zero out documents that do not match the filters
        for field, value in filters.items():
            mask = (self.df[field] == value).values
            score = score * mask

        # return the top-n documents by combined score
        idx = np.argsort(-score)[:n_results]
        results = self.df.iloc[idx]
        return results.to_dict(orient='records')

Use it:

index = TextSearch(text_fields=['section', 'question', 'text'])
index.fit(documents)

index.search(
    query='I just signed up. Is it too late to join the course?',
    n_results=5,
    boost={'question': 3.0},
    filters={'course': 'data-engineering-zoomcamp'},
)

This class became the minsearch library, a lightweight in-memory text search engine for Python built on scikit-learn and pandas.

You can install the production version with `uv add minsearch`.

The text search approach works well for keyword matching, but it has a limitation: it only finds exact word matches. A query about "joining" will not match a document that says "enroll" because the words are different. We solve that with embeddings in Part 3: Embeddings and vector search.
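The vocabulary gap is easy to demonstrate. In this sketch (with a hypothetical one-document corpus), the query shares no tokens with the document, so its TF-IDF vector is all zeros and the cosine similarity is exactly zero:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy one-document corpus, just to show the vocabulary gap.
docs = ["You can enroll even after the start date."]
cv = TfidfVectorizer()
X = cv.fit_transform(docs)

q = cv.transform(["joining late"])  # neither word appears in the corpus
sim = cosine_similarity(X, q).flatten()

print(sim[0])  # 0.0 -- with no shared tokens, TF-IDF sees no match at all
```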
