Build Your Own Search Engine

May 21, 2024, 02:00 Europe/Berlin

We build a search engine from scratch over DataTalks.Club Zoomcamp FAQ documents. We start with TF-IDF text search, then add cosine similarity and field boosting. From there we move through SVD/LSA and BERT embeddings to vector search. The TextSearch class built during the workshop became the basis for the minsearch library used in later RAG and agent workshops.
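As a rough sketch of that starting point, assuming scikit-learn and a few made-up FAQ-style strings (not the workshop data), TF-IDF search with cosine similarity might look like this. The `search` helper and the sample documents are illustrative names, not the workshop's code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for FAQ entries (hypothetical, not the real dataset).
docs = [
    "How do I join the course after it has started?",
    "Where can I find the homework deadlines?",
    "How do I install Docker on Windows?",
]

# Learn the vocabulary and IDF weights, then turn each document
# into a sparse TF-IDF row vector.
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs)


def search(query, top_k=2):
    """Return the top_k (document, score) pairs by cosine similarity."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(docs[i], float(scores[i])) for i in ranked]


results = search("install docker")
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The query goes through the same fitted vectorizer as the documents, so both live in the same term space and cosine similarity compares them directly.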

What we cover:

  • TF-IDF text search with sklearn, cosine similarity, field boosting and keyword filtering
  • A reusable TextSearch class that becomes minsearch
  • Vector search using SVD and NMF embeddings
  • BERT embeddings for semantic search that respects word order
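To illustrate the vector search step from the list above, here is a minimal sketch that assumes scikit-learn's `TruncatedSVD` on top of a TF-IDF matrix. The sample documents, the choice of two components, and the `vector_search` helper are assumptions made for the example, not the workshop's settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents on two topics (hypothetical, not the real dataset).
docs = [
    "How do I install Docker on Windows?",
    "Docker install fails on my laptop",
    "Where can I find the homework deadlines?",
    "Can I submit homework after the deadline?",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # sparse, one column per term

# Compress the sparse term space into a small dense embedding per document.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_embeddings = svd.fit_transform(tfidf)  # dense, shape (n_docs, 2)


def vector_search(query, top_k=2):
    """Embed the query with the same pipeline and rank by cosine similarity."""
    query_embedding = svd.transform(vectorizer.transform([query]))
    scores = cosine_similarity(query_embedding, doc_embeddings).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(docs[i], float(scores[i])) for i in ranked]


results = vector_search("docker")
for text, score in results:
    print(f"{score:.3f}  {text}")
```

Swapping `TruncatedSVD` for `sklearn.decomposition.NMF` keeps the same fit and transform flow. A BERT model would replace the vectorizer and the decomposition with a single encoder, which is not shown here.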

Originally delivered at a DataTalks.Club live session in 2024, updated in 2026 with refreshed examples and tooling.

The search engine you will build

We take two approaches to search over the same FAQ data:

flowchart LR
    DOCS["FAQ documents DE/ML/MLOps Zoomcamp"]
    TEXT["Text search TF-IDF + cosine similarity field boosting + filtering"]
    VEC["Vector search SVD / NMF / BERT embeddings cosine similarity"]
    CLASS["TextSearch class (became minsearch)"]
    DOCS --> TEXT
    DOCS --> VEC
    TEXT --> CLASS
    VEC --> CLASS

Text search uses TF-IDF vectorization, weights the question field three times as much as the other fields, and filters by keyword. Vector search replaces the sparse TF-IDF representations with dense embeddings: SVD and NMF to handle synonyms, then BERT to also account for word order. Both paths share the TextSearch class we build along the way, which combines TF-IDF across multiple fields with boost weights and keyword filters.
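The multi-field idea can be sketched as follows. This is a hypothetical reconstruction, not the workshop's TextSearch class or the minsearch API: the class name `TextSearchSketch`, its method signatures, the field names (`question`, `text`, `course`), and the sample documents are all assumptions. It fits one TF-IDF matrix per text field, adds up boosted cosine similarities, and keeps only documents whose keyword fields match exactly.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


class TextSearchSketch:
    """Hypothetical sketch: per-field TF-IDF combined with boost weights."""

    def __init__(self, text_fields, keyword_fields):
        self.text_fields = text_fields
        self.keyword_fields = keyword_fields
        self.vectorizers = {}
        self.matrices = {}
        self.docs = []

    def fit(self, docs):
        self.docs = docs
        for field in self.text_fields:
            vectorizer = TfidfVectorizer(stop_words="english")
            texts = [doc[field] for doc in docs]
            self.matrices[field] = vectorizer.fit_transform(texts)
            self.vectorizers[field] = vectorizer
        return self

    def search(self, query, boost=None, filters=None, n_results=3):
        boost = boost or {}
        filters = filters or {}
        scores = np.zeros(len(self.docs))
        # Sum the per-field similarities, each scaled by its boost (default 1.0).
        for field in self.text_fields:
            query_vec = self.vectorizers[field].transform([query])
            sim = cosine_similarity(query_vec, self.matrices[field]).ravel()
            scores += boost.get(field, 1.0) * sim
        # Keyword filter: zero out documents that do not match exactly.
        for field, value in filters.items():
            mask = np.array([doc[field] == value for doc in self.docs])
            scores = scores * mask
        ranked = np.argsort(-scores)[:n_results]
        return [self.docs[i] for i in ranked if scores[i] > 0]


# Made-up documents for illustration only.
docs = [
    {"course": "de-zoomcamp", "question": "Can I still join the course?",
     "text": "Yes, you can join after the start date."},
    {"course": "ml-zoomcamp", "question": "Can I still join the course?",
     "text": "Yes, but check the homework deadlines."},
    {"course": "de-zoomcamp", "question": "How do I install Docker?",
     "text": "Follow the official guide, then join the Slack channel for help."},
]

index = TextSearchSketch(["question", "text"], ["course"]).fit(docs)
results = index.search(
    "join course",
    boost={"question": 3.0},            # question counts three times as much
    filters={"course": "de-zoomcamp"},  # exact-match keyword filter
)
```

Keeping a separate vectorizer per field is what makes boosting possible: each field contributes its own similarity score, so the weights can be changed at query time without refitting.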

Hosted by

Alexey Grigorev

Chief Agent Officer at AI Shipping Labs

Software engineer and machine learning practitioner with 15+ years of experience building production ML systems. I focus on practical, production-grade ML and AI systems, from early prototypes to reliable systems in production.

I'm the founder of DataTalks.Club, a free community that connects tens of thousands of practitioners worldwide, and the creator of the Zoomcamp series of free, code-first programs that have reached 100,000+ learners globally.

At AI Shipping Labs, I'm building the kind of environment that would have accelerated my own career growth. After years of teaching at scale, I wanted something more focused: a space for action-oriented builders who want to turn AI ideas into real projects. The community gives members the structure, accountability, and peer support to ship practical AI products consistently, even alongside their main jobs.

alexey@aishippinglabs.com
