
Build Your Own Search Engine

May 14, 2026
Tags: search, rag, llm-engineering, information-retrieval

We build a search engine from scratch over the DataTalks.Club Zoomcamp FAQ documents: starting with TF-IDF text search ranked by cosine similarity, adding field boosting and keyword filtering, then moving through SVD/LSA and BERT embeddings to vector search. The TextSearch class built during the workshop became the basis for the minsearch library used in later RAG and agent workshops.

What we cover:

  • TF-IDF text search with sklearn: cosine similarity, field boosting, and keyword filtering (see the sketch after this list)
  • A reusable TextSearch class (the foundation of minsearch)
  • Vector search using SVD and NMF embeddings
  • BERT embeddings for semantic search that respects word order
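
To make the list concrete, here is a minimal sketch of such a class: a TF-IDF vectorizer per field, cosine similarity for ranking, per-field boost weights, and exact-match keyword filtering. It mirrors the shape of the workshop's TextSearch, but the code below is an illustration written for this page, not the actual minsearch implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class TextSearch:
    """Sketch of a multi-field TF-IDF search index (not the real minsearch)."""

    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.vectorizers = {}
        self.matrices = {}

    def fit(self, docs):
        self.docs = docs
        for field in self.text_fields:
            texts = [d.get(field, "") for d in docs]
            vectorizer = TfidfVectorizer()
            self.matrices[field] = vectorizer.fit_transform(texts)
            self.vectorizers[field] = vectorizer
        return self

    def search(self, query, filters=None, boost=None, num_results=5):
        boost = boost or {}
        scores = np.zeros(len(self.docs))
        # score each field separately and combine with boost weights
        for field in self.text_fields:
            query_vec = self.vectorizers[field].transform([query])
            sims = cosine_similarity(self.matrices[field], query_vec).flatten()
            scores += boost.get(field, 1.0) * sims
        # keyword filtering: zero out documents whose field doesn't match exactly
        if filters:
            for field, value in filters.items():
                mask = np.array([d.get(field) == value for d in self.docs])
                scores = scores * mask
        top = np.argsort(scores)[::-1][:num_results]
        return [self.docs[i] for i in top]
```

With the FAQ documents loaded as a list of dicts, usage could look like this (the question, text, and course field names match the Zoomcamp FAQ data; the 3x question boost is the one described below):

```python
index = TextSearch(text_fields=["question", "text"]).fit(documents)

results = index.search(
    "how do I join the course?",
    filters={"course": "data-engineering-zoomcamp"},
    boost={"question": 3.0},
)
```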

Originally delivered at a DataTalks.Club live session in 2024, updated in 2026 with refreshed examples and tooling.

The search engine you will build

The workshop covers two approaches to search over the same FAQ data:

```mermaid
flowchart LR
    DOCS["FAQ documents<br/>DE/ML/MLOps Zoomcamp"]
    TEXT["Text search<br/>TF-IDF + cosine similarity<br/>field boosting + filtering"]
    VEC["Vector search<br/>SVD / NMF / BERT embeddings<br/>cosine similarity"]
    CLASS["TextSearch class<br/>(became minsearch)"]
    DOCS --> TEXT
    DOCS --> VEC
    TEXT --> CLASS
    VEC --> CLASS
```

Text search uses TF-IDF vectorization with field boosting (the question field weighted 3x) and keyword filtering. Vector search replaces the sparse TF-IDF representations with dense embeddings (SVD, NMF, then BERT) to handle synonyms and word order. The TextSearch class combines TF-IDF across multiple fields with boost weights and keyword filters; it later became the minsearch library.
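
As a sketch of the vector-search side, the snippet below compresses the sparse TF-IDF matrix into dense LSA embeddings with scikit-learn's TruncatedSVD and ranks by cosine similarity; swapping in NMF from sklearn.decomposition works the same way. The 16-dimensional embedding size and variable names are illustrative choices, not values from the workshop.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# documents: the same list of FAQ dicts used for text search
texts = [d["question"] + " " + d["text"] for d in documents]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)          # sparse (n_docs, n_terms)

svd = TruncatedSVD(n_components=16)              # LSA: dense low-rank embeddings
doc_emb = svd.fit_transform(tfidf)               # dense (n_docs, 16)


def vector_search(query, num_results=5):
    query_emb = svd.transform(vectorizer.transform([query]))
    sims = cosine_similarity(doc_emb, query_emb).flatten()
    top = np.argsort(sims)[::-1][:num_results]
    return [documents[i] for i in top]
```

For the BERT step, one convenient route is the sentence-transformers library; this is an assumption for illustration, since the workshop may compute BERT embeddings directly with transformers, and the model name below is a common small default rather than necessarily the one used in the session. Unlike TF-IDF, these embeddings come from a model that reads tokens in sequence, so word order affects the result.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# assumed model choice; any sentence-embedding model would fit here
model = SentenceTransformer("all-MiniLM-L6-v2")

doc_emb = model.encode(texts)                    # dense document embeddings
query_emb = model.encode(["how do I join the course?"])

sims = cosine_similarity(doc_emb, query_emb).flatten()
```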