Build Your Own Search Engine
We build a search engine from scratch over DataTalks.Club Zoomcamp FAQ
documents, starting with TF-IDF text search, adding cosine similarity and
field boosting, then moving through SVD/LSA and BERT embeddings to vector
search. The TextSearch class built during the workshop became the basis
for the minsearch library used in later RAG and agent workshops.
What we cover:
- TF-IDF text search with sklearn, cosine similarity, field boosting and keyword filtering
- A reusable TextSearch class (the foundation of minsearch)
- Vector search using SVD and NMF embeddings
- BERT embeddings for semantic search that respects word order
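As a rough illustration of the step from sparse TF-IDF vectors to dense embeddings, the sketch below compresses TF-IDF vectors with scikit-learn's TruncatedSVD and ranks documents by cosine similarity in the reduced space. The toy documents, the `vector_search` helper and the choice of two components are assumptions made for this example, not the workshop's actual data or code.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents standing in for FAQ records (made up for illustration).
docs = [
    "You can still join the course after the start date.",
    "Install Docker and Python before the first module.",
    "Homework is submitted through the course platform form.",
    "Certificates require finishing the final project.",
]

# Sparse TF-IDF matrix of shape (n_docs, n_terms).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Dense embeddings of shape (n_docs, 2); real data would use far more components.
svd = TruncatedSVD(n_components=2, random_state=0)
X_emb = svd.fit_transform(X)


def vector_search(query, top_k=2):
    """Hypothetical helper: embed the query and rank documents by cosine similarity."""
    q_emb = svd.transform(vectorizer.transform([query]))
    scores = cosine_similarity(q_emb, X_emb).flatten()
    order = scores.argsort()[::-1][:top_k]
    return [(docs[i], float(scores[i])) for i in order]


results = vector_search("can I join the course late")
```

Swapping `TruncatedSVD` for `sklearn.decomposition.NMF` gives the NMF variant with the same fit/transform flow.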
Originally delivered at a DataTalks.Club live session in 2024, updated in 2026 with refreshed examples and tooling.
The search engine you will build
The workshop covers two approaches to search over the same FAQ data:
- Text search uses TF-IDF vectorization with field boosting (the question field is weighted 3x) and keyword filtering.
- Vector search replaces the sparse TF-IDF representations with dense embeddings: SVD and NMF to match synonyms, then BERT to also account for word order.
The TextSearch class combines TF-IDF scores across multiple fields with
boost weights and keyword filters, and later became the minsearch library.
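A minimal sketch of how per-field TF-IDF scores, boost weights and keyword filters could fit together is shown below. The class name `BoostedTextSearch`, its method signatures and the sample records are hypothetical, chosen for illustration; this is not the workshop's TextSearch code or the minsearch API.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class BoostedTextSearch:
    """Hypothetical sketch: one TF-IDF index per text field, combined with boosts."""

    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.vectorizers = {}
        self.matrices = {}
        self.docs = []

    def fit(self, docs):
        self.docs = docs
        for field in self.text_fields:
            vec = TfidfVectorizer(stop_words="english")
            self.matrices[field] = vec.fit_transform([d[field] for d in docs])
            self.vectorizers[field] = vec
        return self

    def search(self, query, boost=None, filters=None, top_k=5):
        boost = boost or {}
        filters = filters or {}
        scores = np.zeros(len(self.docs))
        # Sum cosine similarities per field, scaled by that field's boost weight.
        for field in self.text_fields:
            q = self.vectorizers[field].transform([query])
            sim = cosine_similarity(q, self.matrices[field]).flatten()
            scores += boost.get(field, 1.0) * sim
        # Keyword filters zero out documents whose field value does not match exactly.
        for key, value in filters.items():
            mask = np.array([d.get(key) == value for d in self.docs])
            scores = scores * mask
        order = np.argsort(-scores)[:top_k]
        return [self.docs[i] for i in order if scores[i] > 0]


# Made-up records shaped like FAQ entries (course, question, text).
docs = [
    {"course": "data-engineering-zoomcamp",
     "question": "Can I join after the start date?",
     "text": "Yes, you can still submit homework."},
    {"course": "data-engineering-zoomcamp",
     "question": "What are the prerequisites?",
     "text": "Basic Python and SQL. Join the Slack channel for help."},
    {"course": "machine-learning-zoomcamp",
     "question": "Can I join after the start date?",
     "text": "Yes, but deadlines still apply."},
]

index = BoostedTextSearch(["question", "text"]).fit(docs)
results = index.search(
    "join the course",
    boost={"question": 3.0},
    filters={"course": "data-engineering-zoomcamp"},
)
```

With the question field boosted 3x, a match in the question outranks a match that appears only in the answer text, and the filter drops the record from the other course even though its question matches.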