Overview and setup
We build a search engine from scratch over the DataTalks.Club Zoomcamp FAQ documents. We start with simple text matching and TF-IDF. Then we progress through vector embeddings and BERT. By the end, the search engine understands synonyms and semantics, not just keywords.
The search engine you will build
The search engine retrieves relevant FAQ entries given a natural-language query.
Along the way we implement:
- TF-IDF text search with field boosting and keyword filtering
- A reusable TextSearch class that becomes the minsearch library
- Vector search using SVD and NMF embeddings
- BERT embeddings for semantic search that respects word order
Prerequisites
You need the following:
- Python 3.11 to 3.13 (3.14 is not supported by PyTorch)
- Familiarity with basic Python and pandas
- No machine learning or NLP experience required
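If you are unsure which interpreter your environment uses, a quick check like the one below can confirm it falls in the supported range. This is only a sketch: the bounds are copied from the prerequisites above, and the helper name is_supported is made up for illustration.

```python
import sys

# Supported range taken from the prerequisites: 3.11 up to and including 3.13.
MIN_VERSION = (3, 11)
MAX_VERSION = (3, 13)


def is_supported(version=None):
    """Return True if the (major, minor) version is within the supported range.

    `version` is any tuple-like starting with (major, minor); it defaults to
    the running interpreter's version.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


# Report the running interpreter's version and whether it is in range.
print(sys.version.split()[0], "supported:", is_supported())
```

Running it inside the notebook (rather than in a separate terminal) checks the interpreter that Jupyter actually uses.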
Project setup
Initialize the project and install the dependencies:
uv init
uv add requests pandas scikit-learn jupyter
Start Jupyter:
uv run jupyter notebook
Create a new notebook. Everything in this workshop runs inside a single notebook.
Download the FAQ dataset
The DataTalks.Club FAQ is published as JSON, one file per course.
We use the Data Engineering Zoomcamp FAQ:
import requests
import pandas as pd
documents = requests.get(
    'https://datatalks.club/faq/json/data-engineering-zoomcamp.json'
).json()
df = pd.DataFrame(documents)
df = df.rename(columns={'answer': 'text'})
df.head()
Each row has four fields: course, section, question, and text.
The text field holds the answer; we renamed it from answer above.
We use all fields during search, but with different weights.
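Since every field takes part in the search, it can be worth checking that no record is missing one. The sketch below is one possible way to do that on the raw records, before the rename, so the field is still called answer. The sample records are made-up placeholders rather than real FAQ content, and find_incomplete is a hypothetical helper name; in the notebook you would pass documents instead of sample_documents.

```python
# Placeholder records shaped like the raw FAQ entries (not real FAQ content).
sample_documents = [
    {"course": "data-engineering-zoomcamp", "section": "General",
     "question": "Placeholder question 1", "answer": "Placeholder answer 1"},
    {"course": "data-engineering-zoomcamp", "section": "Module 1",
     "question": "Placeholder question 2", "answer": "Placeholder answer 2"},
    # This record has an empty section, so the check should flag it.
    {"course": "data-engineering-zoomcamp", "section": "",
     "question": "Placeholder question 3", "answer": "Placeholder answer 3"},
]

# Field names as they appear in the raw JSON, before 'answer' becomes 'text'.
expected_fields = {"course", "section", "question", "answer"}


def find_incomplete(records, fields):
    """Return (index, bad_fields) pairs for records with missing or empty fields."""
    problems = []
    for i, record in enumerate(records):
        bad = sorted(f for f in fields if not record.get(f))
        if bad:
            problems.append((i, bad))
    return problems


print(find_incomplete(sample_documents, expected_fields))
# prints [(2, ['section'])]
```

An empty list means every record has all four fields filled in.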
Continue with Part 1: Text search and TF-IDF to start building the search engine.