Overview and setup

We build a search engine from scratch over the DataTalks.Club Zoomcamp FAQ documents. We start with simple text matching and TF-IDF, progress through vector embeddings and BERT, and end up with a search engine that understands synonyms and semantics, not just keywords.

The search engine you will build

The search engine retrieves relevant FAQ entries given a natural-language query. Along the way we implement:

  • TF-IDF text search with field boosting and keyword filtering
  • A reusable TextSearch class (the foundation of the minsearch library)
  • Vector search using SVD and NMF embeddings
  • BERT embeddings for semantic search that respects word order
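As a small taste of the first item, TF-IDF search can be sketched in a few lines with scikit-learn. This is a toy corpus standing in for the FAQ data, not the workshop's actual implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents standing in for FAQ answers
docs = [
    "How do I install Python and set up the environment?",
    "The course starts in January; registration is open.",
    "Use pip or uv to install the required packages.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)  # one TF-IDF vector per document
query_vector = vectorizer.transform(["installing packages with pip"])

# Cosine similarity between the query and every document
scores = cosine_similarity(query_vector, doc_vectors)[0]
best = scores.argmax()  # index of the best-matching document
print(docs[best])
```

The real implementation adds field boosting, keyword filtering, and a reusable class on top of this same idea.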

Prerequisites

You need the following:

  • Python 3.11 to 3.13 (3.14 not supported by PyTorch)
  • Familiarity with basic Python and pandas
  • No machine learning or NLP experience required

Project setup

Initialize the project and install the dependencies:

uv init
uv add requests pandas scikit-learn jupyter

Start Jupyter:

uv run jupyter notebook

Create a new notebook. Everything in this workshop runs inside a single notebook.

Download the FAQ dataset

The DataTalks.Club FAQ is published as JSON, one file per course. We use the Data Engineering Zoomcamp FAQ:

import requests
import pandas as pd

# Fetch the FAQ as a list of dicts, one per FAQ entry
documents = requests.get(
    'https://datatalks.club/faq/json/data-engineering-zoomcamp.json'
).json()

df = pd.DataFrame(documents)
df = df.rename(columns={'answer': 'text'})  # the search code expects a 'text' field
df.head()

Each row has four fields: course, section, question, and text (the answer, renamed above). We use all of them during search, but with different weights.
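As a quick sanity check of that structure, you can convert the frame back to a list of dicts and inspect one record. The snippet below rebuilds a one-row frame inline so it is self-contained; in the notebook you would use the df from the download step:

```python
import pandas as pd

# One hypothetical FAQ entry with the same shape as the downloaded data
df = pd.DataFrame([{
    'course': 'data-engineering-zoomcamp',
    'section': 'General course-related questions',
    'question': 'When does the course start?',
    'answer': 'The course starts in January.',
}])
df = df.rename(columns={'answer': 'text'})

# A list of dicts, one per FAQ entry, is a convenient shape for search code
records = df.to_dict(orient='records')
print(sorted(records[0].keys()))  # ['course', 'question', 'section', 'text']
```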

Continue with Part 1: Text search and TF-IDF to start building the search engine.
