Overview and setup

We build a search engine from scratch over the DataTalks.Club Zoomcamp FAQ documents. We start with simple text matching and TF-IDF, progress through vector embeddings and BERT, and end up with a search engine that understands synonyms and semantics, not just keywords.

The search engine you will build

The search engine retrieves relevant FAQ entries given a natural-language query. Along the way we implement:

  • TF-IDF text search with field boosting and keyword filtering
  • A reusable TextSearch class (the foundation of the minsearch library)
  • Vector search using SVD and NMF embeddings
  • BERT embeddings for semantic search that respects word order
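As a small taste of the first item, TF-IDF search can be sketched in a few lines with scikit-learn. This is a toy corpus standing in for the FAQ data, not the workshop's actual implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents standing in for FAQ answers
docs = [
    "How do I install Python and set up the environment?",
    "The course starts in January; registration is open.",
    "Use pip or uv to install the required packages.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)  # one TF-IDF vector per document
query_vector = vectorizer.transform(["installing packages with pip"])

# Cosine similarity between the query and every document
scores = cosine_similarity(query_vector, doc_vectors)[0]
best = scores.argmax()  # index of the best-matching document
print(docs[best])
```

The real implementation adds field boosting, keyword filtering, and a reusable class on top of this same idea.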

Prerequisites

You need the following:

  • Python 3.11 to 3.13 (3.14 not supported by PyTorch)
  • Familiarity with basic Python and pandas
  • No machine learning or NLP experience required

Project setup

Initialize the project and install the dependencies:

uv init
uv add requests pandas scikit-learn jupyter

Start Jupyter:

uv run jupyter notebook

Create a new notebook. Everything in this workshop runs inside a single notebook.

Download the FAQ dataset

The DataTalks.Club FAQ is published as JSON, one file per course. We use the Data Engineering Zoomcamp FAQ:

import requests
import pandas as pd

# Fetch the FAQ as a list of dicts, one per FAQ entry
documents = requests.get(
    'https://datatalks.club/faq/json/data-engineering-zoomcamp.json'
).json()

df = pd.DataFrame(documents)
df = df.rename(columns={'answer': 'text'})  # the search code expects a 'text' field
df.head()

Each row has four fields: course, section, question, and text (the answer, renamed above). We use all of them during search, but with different weights.
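As a quick sanity check of that structure, you can convert the frame back to a list of dicts and inspect one record. The snippet below rebuilds a one-row frame inline so it is self-contained; in the notebook you would use the df from the download step:

```python
import pandas as pd

# One hypothetical FAQ entry with the same shape as the downloaded data
df = pd.DataFrame([{
    'course': 'data-engineering-zoomcamp',
    'section': 'General course-related questions',
    'question': 'When does the course start?',
    'answer': 'The course starts in January.',
}])
df = df.rename(columns={'answer': 'text'})

# A list of dicts, one per FAQ entry, is a convenient shape for search code
records = df.to_dict(orient='records')
print(sorted(records[0].keys()))  # ['course', 'question', 'section', 'text']
```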

Continue with Part 1: Text search and TF-IDF to start building the search engine.
