Load the FAQ and build search

Before we bring in the model, we need the FAQ data the assistant will answer from. The agent also needs one real capability, search, before guardrails matter. In this part we load the Data Engineering Zoomcamp FAQ JSON, build a search index over it, and wrap search in a single Python function.

Load the FAQ JSON

Start from the Data Engineering Zoomcamp FAQ JSON:

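import requests

docs_url = "https://datatalks.club/faq/json/data-engineering-zoomcamp.json"

# Fetch the FAQ; the timeout keeps the cell from hanging on a stalled connection
response = requests.get(docs_url, timeout=30)
response.raise_for_status()  # stop early on an HTTP error
documents = response.json()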

Look at one entry before building the index:

documents[0]

Each document already has course, section, question, and answer.
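To confirm that every record has these four fields, not just the first one, a quick check along these lines can help. This is a minimal sketch: `missing_fields` is a hypothetical helper, and the sample records are made up for illustration. In the notebook you would pass the loaded `documents` instead.

```python
# Hypothetical helper: report which records lack any of the expected fields.
EXPECTED_FIELDS = ("course", "section", "question", "answer")


def missing_fields(records: list[dict]) -> dict[int, list[str]]:
    """Map record position -> expected fields that are absent or empty."""
    problems = {}
    for position, record in enumerate(records):
        missing = [name for name in EXPECTED_FIELDS if not record.get(name)]
        if missing:
            problems[position] = missing
    return problems


# Made-up sample records, standing in for the real FAQ data.
sample = [
    {"course": "data-engineering-zoomcamp", "section": "General",
     "question": "Can I still join?", "answer": "Yes."},
    {"course": "data-engineering-zoomcamp", "section": "General",
     "question": "Where is the answer?"},
]

print(missing_fields(sample))
```

An empty result means every record is complete and safe to index as-is.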

That check gives enough context for the rest of the workshop. The bot answers Data Engineering Zoomcamp questions from these FAQ records, so it's not a general assistant.

Build the index

Use minsearch for the in-memory index. It runs inside Python, so we don't need a database, vector store, or separate search service for the workshop.

That keeps the setup small, and because the FAQ dataset is also small, rebuilding the index in the notebook is fast enough.

from minsearch import Index

index = Index(
    text_fields=["question", "answer", "section"],
    keyword_fields=["course"],
)

index.fit(documents)

The question, answer, and section fields are searchable text. The course field is a keyword field: it is matched exactly, for filtering, rather than searched as text. We keep course in the index because the same pattern works when you later index multiple courses.
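The role of a keyword field is easiest to see with a filter. The sketch below is plain Python, not minsearch's implementation: it only illustrates the idea that a keyword field is compared by exact match, typically to narrow the candidates before any text scoring. The helper name and the sample records are hypothetical.

```python
def filter_by_keywords(records: list[dict], filters: dict[str, str]) -> list[dict]:
    """Keep records whose keyword fields exactly equal the requested values."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in filters.items())
    ]


# Made-up records from two courses that share the same question text.
sample = [
    {"course": "data-engineering-zoomcamp", "question": "How do I install Docker?"},
    {"course": "another-course", "question": "How do I install Docker?"},
]

only_de = filter_by_keywords(sample, {"course": "data-engineering-zoomcamp"})
print(len(only_de))
```

minsearch exposes a similar idea through the `filter_dict` argument of `search`; check the library's README for the exact behaviour before relying on it.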

Create a search function

Now we can search the dataset:

boost = {"question": 3.0, "section": 0.5}

index.search(
    query="Can I still join the course?",
    boost_dict=boost,
    num_results=5,
)

The boost gives more weight to matches in the FAQ question. Section matches still help, but they shouldn't dominate the result.
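To see why the boost changes the ranking, here is a toy scorer. It is a deliberately simplified stand-in: it counts the words a field shares with the query and multiplies each count by that field's boost, whereas minsearch computes its own text-similarity scores. The function name, the default weight of 1.0 for unlisted fields, and the sample records are all assumptions made for illustration.

```python
def toy_score(query: str, record: dict, boost: dict[str, float]) -> float:
    """Sum per-field word overlap with the query, weighted by each field's boost."""
    query_words = set(query.lower().split())
    total = 0.0
    for field in ("question", "answer", "section"):
        field_words = set(record.get(field, "").lower().split())
        overlap = len(query_words & field_words)
        # Fields without an explicit boost get weight 1.0 (an assumption of this toy).
        total += boost.get(field, 1.0) * overlap
    return total


boost = {"question": 3.0, "section": 0.5}

# Made-up records: the same two words match in different fields.
in_question = {"question": "can i join late", "answer": "yes", "section": "general"}
in_section = {"question": "what is dbt", "answer": "a tool", "section": "join late"}

query = "join late"
print(toy_score(query, in_question, boost))
print(toy_score(query, in_section, boost))
```

The same two matching words count six times more in the question than in the section, which is the effect the boost values are meant to have.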

Let's put this into one function:

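def search(query: str) -> list[dict]:
    """Return the top 5 FAQ entries matching the query."""
    # Weight question matches most; section matches count for less
    boost = {"question": 3.0, "section": 0.5}

    return index.search(
        query=query,
        boost_dict=boost,
        num_results=5,
    )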

Run a quick test:

search("Can I still join the course?")

Look at the returned FAQ entries. These are the records the agent will use later when it answers course questions.
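Full records are long, so a compact view can make it easier to judge whether the right entries came back. The helper below is a hypothetical convenience, not part of minsearch, and the sample records are made up to stand in for the output of search.

```python
def summarize_results(results: list[dict], width: int = 60) -> list[str]:
    """Return one short line per result: rank, section, truncated question."""
    lines = []
    for rank, record in enumerate(results, start=1):
        question = record.get("question", "")
        if len(question) > width:
            # Shorten long questions so each result fits on one line.
            question = question[: width - 3] + "..."
        lines.append(f"{rank}. [{record.get('section', '?')}] {question}")
    return lines


# Made-up records standing in for the output of search(...).
sample_results = [
    {"section": "General", "question": "Can I still join the course?", "answer": "..."},
    {"section": "Docker", "question": "How do I install Docker?", "answer": "..."},
]

for line in summarize_results(sample_results):
    print(line)
```

In the notebook, the same helper could be called on the list returned by `search`.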

Exercise

Try a few course questions:

  • Can I still join the course?
  • How do I set up Docker for the course?
  • Can I get a certificate in self-paced mode?

Compare the FAQ entries returned for each query and check that they match the question. In the next part we give this same function to the model as a tool.
