
Build Your Own Search Engine

May 14, 2026
Tags: search, rag, llm-engineering, information-retrieval

We build a search engine from scratch over the DataTalks.Club Zoomcamp FAQ documents: starting with TF-IDF text search ranked by cosine similarity, adding field boosting and keyword filtering, then moving through SVD/LSA and BERT embeddings to vector search. The TextSearch class built during the workshop became the basis for the minsearch library used in later RAG and agent workshops.

What we cover:

  • TF-IDF text search with sklearn: cosine similarity, field boosting, and keyword filtering (see the sketch after this list)
  • A reusable TextSearch class (the foundation of minsearch)
  • Vector search using SVD and NMF embeddings
  • BERT embeddings for semantic search that respects word order
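
To make the list concrete, here is a minimal sketch of such a class: a TF-IDF vectorizer per field, cosine similarity for ranking, per-field boost weights, and exact-match keyword filtering. It mirrors the shape of the workshop's TextSearch, but the code below is an illustration written for this page, not the actual minsearch implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class TextSearch:
    """Sketch of a multi-field TF-IDF search index (not the real minsearch)."""

    def __init__(self, text_fields):
        self.text_fields = text_fields
        self.vectorizers = {}
        self.matrices = {}

    def fit(self, docs):
        self.docs = docs
        for field in self.text_fields:
            texts = [d.get(field, "") for d in docs]
            vectorizer = TfidfVectorizer()
            self.matrices[field] = vectorizer.fit_transform(texts)
            self.vectorizers[field] = vectorizer
        return self

    def search(self, query, filters=None, boost=None, num_results=5):
        boost = boost or {}
        scores = np.zeros(len(self.docs))
        # score each field separately and combine with boost weights
        for field in self.text_fields:
            query_vec = self.vectorizers[field].transform([query])
            sims = cosine_similarity(self.matrices[field], query_vec).flatten()
            scores += boost.get(field, 1.0) * sims
        # keyword filtering: zero out documents whose field doesn't match exactly
        if filters:
            for field, value in filters.items():
                mask = np.array([d.get(field) == value for d in self.docs])
                scores = scores * mask
        top = np.argsort(scores)[::-1][:num_results]
        return [self.docs[i] for i in top]
```

With the FAQ documents loaded as a list of dicts, usage could look like this (the question, text, and course field names match the Zoomcamp FAQ data; the 3x question boost is the one described below):

```python
index = TextSearch(text_fields=["question", "text"]).fit(documents)

results = index.search(
    "how do I join the course?",
    filters={"course": "data-engineering-zoomcamp"},
    boost={"question": 3.0},
)
```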

Originally delivered at a DataTalks.Club live session in 2024, updated in 2026 with refreshed examples and tooling.

The search engine you will build

The workshop covers two approaches to search over the same FAQ data:

```mermaid
flowchart LR
    DOCS["FAQ documents<br/>DE/ML/MLOps Zoomcamp"]
    TEXT["Text search<br/>TF-IDF + cosine similarity<br/>field boosting + filtering"]
    VEC["Vector search<br/>SVD / NMF / BERT embeddings<br/>cosine similarity"]
    CLASS["TextSearch class<br/>(became minsearch)"]
    DOCS --> TEXT
    DOCS --> VEC
    TEXT --> CLASS
    VEC --> CLASS
```

Text search uses TF-IDF vectorization with field boosting (the question field weighted 3x) and keyword filtering. Vector search replaces the sparse TF-IDF representations with dense embeddings (SVD, NMF, then BERT) to handle synonyms and word order. The TextSearch class combines TF-IDF across multiple fields with boost weights and keyword filters; it later became the minsearch library.
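
As a sketch of the vector-search side, the snippet below compresses the sparse TF-IDF matrix into dense LSA embeddings with scikit-learn's TruncatedSVD and ranks by cosine similarity; swapping in NMF from sklearn.decomposition works the same way. The 16-dimensional embedding size and variable names are illustrative choices, not values from the workshop.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# documents: the same list of FAQ dicts used for text search
texts = [d["question"] + " " + d["text"] for d in documents]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)          # sparse (n_docs, n_terms)

svd = TruncatedSVD(n_components=16)              # LSA: dense low-rank embeddings
doc_emb = svd.fit_transform(tfidf)               # dense (n_docs, 16)


def vector_search(query, num_results=5):
    query_emb = svd.transform(vectorizer.transform([query]))
    sims = cosine_similarity(doc_emb, query_emb).flatten()
    top = np.argsort(sims)[::-1][:num_results]
    return [documents[i] for i in top]
```

For the BERT step, one convenient route is the sentence-transformers library; this is an assumption for illustration, since the workshop may compute BERT embeddings directly with transformers, and the model name below is a common small default rather than necessarily the one used in the session. Unlike TF-IDF, these embeddings come from a model that reads tokens in sequence, so word order affects the result.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# assumed model choice; any sentence-embedding model would fit here
model = SentenceTransformer("all-MiniLM-L6-v2")

doc_emb = model.encode(texts)                    # dense document embeddings
query_emb = model.encode(["how do I join the course?"])

sims = cosine_similarity(doc_emb, query_emb).flatten()
```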