Overview and setup

We build a search engine from scratch over the DataTalks.Club Zoomcamp FAQ documents. We start with simple text matching and TF-IDF. Then we progress through vector embeddings and BERT. By the end, the search engine understands synonyms and semantics, not just keywords.

The search engine you will build

The search engine retrieves relevant FAQ entries given a natural-language query.

Along the way we implement:

  • TF-IDF text search with field boosting and keyword filtering
  • A reusable TextSearch class that becomes the minsearch library
  • Vector search using SVD and NMF embeddings
  • BERT embeddings for semantic search that respects word order

Prerequisites

You need the following:

  • Python 3.11 to 3.13 (3.14 not supported by PyTorch)
  • Familiarity with basic Python and pandas
  • No machine learning or NLP experience required
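
If you are unsure which interpreter your notebook is using, a quick check like the one below can confirm it. This is only a minimal sketch: the helper name is_supported and the two constants are made up for illustration, and the range simply mirrors the prerequisite stated above.

```python
import sys

# Supported range taken from the prerequisites above (assumed inclusive).
MIN_VERSION = (3, 11)
MAX_VERSION = (3, 13)


def is_supported(version=None):
    """Return True if the (major, minor) version falls in the supported range."""
    if version is None:
        version = sys.version_info
    return MIN_VERSION <= tuple(version[:2]) <= MAX_VERSION


if not is_supported():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} "
        "is outside the 3.11 to 3.13 range"
    )
```

Running this in the first notebook cell prints a warning only when the interpreter falls outside the range.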

Project setup

Initialize the project and install the dependencies:

uv init
uv add requests pandas scikit-learn jupyter

Start Jupyter:

uv run jupyter notebook

Create a new notebook. Everything in this workshop runs inside a single notebook.

Download the FAQ dataset

The DataTalks.Club FAQ is published as JSON, one file per course.

We use the Data Engineering Zoomcamp FAQ:

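import requests
import pandas as pd

# Download the FAQ JSON and stop early if the request failed
response = requests.get(
    'https://datatalks.club/faq/json/data-engineering-zoomcamp.json'
)
response.raise_for_status()
documents = response.json()

# Load the documents into a DataFrame and rename answer to text
df = pd.DataFrame(documents)
df = df.rename(columns={'answer': 'text'})
df.head()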

Each row has four fields: course, section, question, and text. The text field holds the answer; the code above renames the column from answer to text. We use all fields during search, but with different weights.
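
To get a feel for what "different weights" means, here is a deliberately simplified sketch. It uses plain keyword counting rather than TF-IDF, so it is not the method built in Part 1. The sample rows, the boosts values, and the helper names (tokenize, score, search) are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical rows that mimic the FAQ layout (not real FAQ entries).
rows = [
    {"course": "data-engineering-zoomcamp", "section": "General",
     "question": "Can I join the course after it has started?",
     "text": "Yes, you can still join and submit the homework."},
    {"course": "data-engineering-zoomcamp", "section": "Docker",
     "question": "How do I install Docker on Windows?",
     "text": "Download Docker Desktop and enable the WSL backend."},
    {"course": "data-engineering-zoomcamp", "section": "Docker",
     "question": "Why does my container exit immediately?",
     "text": "Check the logs and keep the main process in the foreground."},
]

# Assumed weights, for illustration only: a match in the question counts most.
boosts = {"question": 3.0, "text": 1.0, "section": 0.5}


def tokenize(value):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())


def score(query, row, boosts):
    """Sum, over fields, the weighted count of query terms found in the field."""
    query_terms = set(tokenize(query))
    total = 0.0
    for field, weight in boosts.items():
        counts = Counter(tokenize(row[field]))
        total += weight * sum(counts[term] for term in query_terms)
    return total


def search(query, rows, boosts):
    """Return (score, row) pairs with a positive score, best match first."""
    scored = [(score(query, row, boosts), row) for row in rows]
    matches = [pair for pair in scored if pair[0] > 0]
    return sorted(matches, key=lambda pair: pair[0], reverse=True)


results = search("install docker", rows, boosts)
for value, row in results:
    print(f"{value:.1f}  {row['question']}")
```

The row whose question contains both query words ranks first because question matches are multiplied by the largest weight. The row that matches only in its section still appears, but with a much lower score. To try the same idea on the real data, df.to_dict(orient='records') gives rows in this shape.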

Continue with Part 1: Text search and TF-IDF to start building the search engine.
