The basic search from the previous step only uses the text field. In practice, the question field tends to match user queries better because both are phrased as questions. We combine all three fields and weight them, then wrap everything into a reusable class.

Boosting the question field

Combine scores from all three fields, giving the question field a 3x boost:

query = "I just signed up. Is it too late to join the course?"

boost = {'question': 3.0}

score = np.zeros(len(df))

for f in fields:
    b = boost.get(f, 1.0)
    q = transformers[f].transform([query])
    s = cosine_similarity(matrices[f], q).flatten()
    score = score + b * s

Boosting multiplies the similarity score for a field by a weight. A 3.0 boost on question makes a question match count three times as much as a match in text or section. The best weight depends on the dataset. For FAQ-style content, 3.0 is a good starting point.
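To see how a boost can change the ranking, here is a minimal sketch with made-up per-field similarities for two documents. The numbers and the helper name combine are assumptions for illustration only; they do not come from the FAQ dataset.

```python
import numpy as np

# Hypothetical cosine similarities per field for two documents (index 0 and 1).
sims = {
    'section':  np.array([0.1, 0.1]),
    'question': np.array([0.2, 0.5]),
    'text':     np.array([0.7, 0.3]),
}

def combine(sims, boost):
    """Weighted sum of per-field similarities; fields without a boost use 1.0."""
    score = np.zeros(2)
    for field, s in sims.items():
        score = score + boost.get(field, 1.0) * s
    return score

plain = combine(sims, boost={})                   # every field weighted equally
boosted = combine(sims, boost={'question': 3.0})  # question counts three times

print(plain.argmax(), boosted.argmax())
```

Without the boost, document 0 ranks first because of its strong text match. With the 3.0 boost on question, document 1 overtakes it, which is the behaviour we want when the query is phrased like a question.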

Keyword filtering

Apply a filter to restrict results to a specific course:

filters = {
    'course': 'data-engineering-zoomcamp',
}

for field, value in filters.items():
    mask = (df[field] == value).values
    score = score * mask

Multiplying by the mask zeroes out scores for documents that do not match the filter. The combined score accounts for all fields with boosting and respects the filter.
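The mask mechanics are easier to follow on a tiny example. The sketch below uses a hypothetical three-row DataFrame and made-up scores (both are assumptions, not course data). It also shows that with more than one filter the masks multiply, so a document has to match every filter to keep its score.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three documents with made-up similarity scores.
df = pd.DataFrame({
    'course': [
        'data-engineering-zoomcamp',
        'machine-learning-zoomcamp',
        'data-engineering-zoomcamp',
    ],
    'section': ['General', 'General', 'Module 1'],
})
score = np.array([0.8, 0.9, 0.4])

filters = {'course': 'data-engineering-zoomcamp', 'section': 'General'}

for field, value in filters.items():
    # Boolean array: True where the row matches the filter value.
    mask = (df[field] == value).values
    # True/False behave like 1/0 in the multiplication.
    score = score * mask

print(score)
```

Only the first document matches both filters, so it is the only one that keeps a non-zero score.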

Get the top results:

idx = np.argsort(-score)[:10]
results = df.iloc[idx]
results.to_dict(orient='records')
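One thing to keep in mind: filtered-out documents still sit in the score array with a score of zero, so when fewer than 10 documents have a positive score, argsort fills the rest of the top 10 with zero-score rows. The sketch below, with made-up scores, shows one possible way to drop them. This extra step is an optional assumption on top of the tutorial code, not part of it.

```python
import numpy as np

# Made-up scores: only two documents survived filtering with a positive score.
score = np.array([0.0, 0.7, 0.0, 0.2, 0.0])
n_results = 3

# Negating the scores makes argsort return indices in descending score order.
idx = np.argsort(-score)[:n_results]

# The third index points at a zero-score document; keep only positive scores.
idx_nonzero = idx[score[idx] > 0]

print(idx_nonzero)
```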

The TextSearch class

Put the whole pipeline (vectorization, boosting, filtering, ranking) into a class:

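import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class TextSearch:

    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.matrices = {}     # field name -> TF-IDF matrix of all documents
        self.vectorizers = {}  # field name -> fitted TfidfVectorizer

    def fit(self, records, vectorizer_params=None):
        # None instead of {} avoids a shared mutable default argument
        vectorizer_params = vectorizer_params or {}

        self.df = pd.DataFrame(records)
        # store the 'answer' column under the name 'text'
        if 'answer' in self.df.columns:
            self.df = self.df.rename(columns={'answer': 'text'})

        # fit one vectorizer per text field
        for f in self.text_fields:
            cv = TfidfVectorizer(**vectorizer_params)
            X = cv.fit_transform(self.df[f])
            self.matrices[f] = X
            self.vectorizers[f] = cv

    def search(self, query, n_results=10, boost=None, filters=None):
        boost = boost or {}
        filters = filters or {}

        score = np.zeros(len(self.df))

        # weighted sum of per-field cosine similarities
        for f in self.text_fields:
            b = boost.get(f, 1.0)
            q = self.vectorizers[f].transform([query])
            s = cosine_similarity(self.matrices[f], q).flatten()
            score = score + b * s

        # zero out documents that do not match every filter
        for field, value in filters.items():
            mask = (self.df[field] == value).values
            score = score * mask

        # indices of the n_results highest scores, best first
        idx = np.argsort(-score)[:n_results]
        results = self.df.iloc[idx]
        return results.to_dict(orient='records')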

Use it:

index = TextSearch(text_fields=['section', 'question', 'text'])
index.fit(documents)

index.search(
    query='I just signed up. Is it too late to join the course?',
    n_results=5,
    boost={'question': 3.0},
    filters={'course': 'data-engineering-zoomcamp'},
)

This class became the minsearch library, a lightweight in-memory text search engine for Python built on pandas and scikit-learn.

You can install the production version with uv add minsearch.

Text search handles keyword matching, but it has a limitation: it only finds exact word matches. A query about "joining" will not match a document that says "enroll" because the words are different. We solve that with embeddings in Part 3: Embeddings and vector search.
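The limitation is easy to reproduce. The sketch below uses a hypothetical two-document corpus and a helper named similarity, both made up for this illustration. A query with no words in common with the documents scores zero everywhere, even though its meaning is close to the first document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical two-document corpus, made up for this illustration.
docs = [
    "You can still enroll after the start date",
    "Homework deadlines are listed in the calendar",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

def similarity(query):
    """Cosine similarity between the query and every document."""
    q = vectorizer.transform([query])
    return cosine_similarity(X, q).flatten()

# No word of this query is in the vocabulary, so every score is zero.
no_overlap = similarity("joining late")

# This query shares the word "enroll" with the first document only.
overlap = similarity("how do I enroll")

print(no_overlap, overlap)
```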
