Overview and setup
We build a search engine from scratch over the DataTalks.Club Zoomcamp FAQ documents. We start with simple text matching and TF-IDF, progress through vector embeddings and BERT, and end up with a search engine that understands synonyms and semantics, not just keywords.
The search engine you will build
The search engine retrieves relevant FAQ entries given a natural-language query. Along the way we implement:
- TF-IDF text search with field boosting and keyword filtering
- A reusable TextSearch class (the foundation of the minsearch library)
- Vector search using SVD and NMF embeddings
- BERT embeddings for semantic search that respects word order
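To preview the core idea before Part 1, here is a minimal sketch of TF-IDF search with scikit-learn: vectorize the documents, vectorize the query in the same space, and rank by cosine similarity. The example documents are made up for illustration; the workshop uses the real FAQ data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for FAQ entries (illustrative, not the real dataset)
docs = [
    "How do I install Kafka on Windows?",
    "Can I still join the course after it has started?",
    "Where do I submit homework for week 1?",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)  # one TF-IDF row per document

# Project the query into the same vocabulary space, then rank documents
query_vec = vectorizer.transform(["joining the course late"])
scores = cosine_similarity(query_vec, doc_matrix).flatten()
best = scores.argmax()  # index of the highest-scoring document
```

Note the limitation this workshop fixes later: the query matches only on shared keywords ("the", "course"), so a synonym like "enroll" would score zero. That is what the embedding-based parts address.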
Prerequisites
You need the following:
- Python 3.11 to 3.13 (3.14 not supported by PyTorch)
- Familiarity with basic Python and pandas
- No machine learning or NLP experience required
Project setup
Initialize the project and install the dependencies:
uv init
uv add requests pandas scikit-learn jupyter
Start Jupyter:
uv run jupyter notebook
Create a new notebook. Everything in this workshop runs inside a single notebook.
Download the FAQ dataset
The DataTalks.Club FAQ is published as JSON, one file per course. We use the Data Engineering Zoomcamp FAQ:
import requests
import pandas as pd
documents = requests.get(
'https://datatalks.club/faq/json/data-engineering-zoomcamp.json'
).json()
df = pd.DataFrame(documents)
df = df.rename(columns={'answer': 'text'})
df.head()
Each row has four fields: course, section, question, and text (the answer, renamed). We use all of them during search, but with different weights.
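If the download works, the DataFrame should look like the following. The sample record below is hand-written to mirror the shape of one FAQ entry (an assumption for illustration, not the real data), so you can see the rename step without a network call:

```python
import pandas as pd

# Hypothetical record in the same shape as one FAQ document (illustrative)
documents = [
    {
        'course': 'data-engineering-zoomcamp',
        'section': 'General course-related questions',
        'question': 'Can I still join the course?',
        'answer': 'Yes, you can register even after the start date.',
    },
]

df = pd.DataFrame(documents).rename(columns={'answer': 'text'})
print(df.columns.tolist())  # ['course', 'section', 'question', 'text']
```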
Continue with Part 1: Text search and TF-IDF to start building the search engine.