Load the FAQ and build search
Before guardrails matter, the agent needs one real capability: answering from the FAQ data. We load the Data Engineering Zoomcamp FAQ JSON, build a search index, and wrap search in one Python function.
Load the FAQ JSON
Start from the Data Engineering Zoomcamp FAQ JSON:
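import requests

docs_url = "https://datatalks.club/faq/json/data-engineering-zoomcamp.json"
# A timeout keeps the notebook from hanging if the server does not respond.
response = requests.get(docs_url, timeout=30)
response.raise_for_status()  # stop early on an HTTP error status
documents = response.json()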
Look at one entry before building the index:
documents[0]
Each document already has course, section, question, and answer.
That check gives enough context for the rest of the workshop. The bot answers Data Engineering Zoomcamp questions from these FAQ records, so it's not a general assistant.
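To confirm the field claim for every record and not only the first entry, a small check like the one below can help. This is a sketch: `EXPECTED_FIELDS` and `find_incomplete` are hypothetical names, and the two sample records (including their values) are made up to stand in for the real `documents` list.

```python
# Fields the text above says every FAQ record carries.
EXPECTED_FIELDS = {"course", "section", "question", "answer"}


def find_incomplete(docs: list[dict]) -> list[tuple[int, set[str]]]:
    """Return (position, missing_fields) for every record lacking a field."""
    problems = []
    for position, doc in enumerate(docs):
        missing = EXPECTED_FIELDS - set(doc)
        if missing:
            problems.append((position, missing))
    return problems


# Made-up sample records for illustration; in the notebook, pass `documents`.
sample_docs = [
    {
        "course": "data-engineering-zoomcamp",
        "section": "General",
        "question": "Can I still join the course?",
        "answer": "Yes.",
    },
    # This record deliberately lacks "section" and "answer".
    {"course": "data-engineering-zoomcamp", "question": "No answer yet?"},
]

incomplete = find_incomplete(sample_docs)
print(incomplete)
```

An empty list from `find_incomplete(documents)` would mean every record has all four fields, so the index can rely on them.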
Build the index
Use minsearch for the in-memory index. It runs inside Python, so we don't need a database, vector store, or separate search service for the workshop.
That keeps the setup small, and because the FAQ dataset is also small, rebuilding the index in the notebook is fast enough.
from minsearch import Index

index = Index(
    text_fields=["question", "answer", "section"],
    keyword_fields=["course"],
)
index.fit(documents)
The question, answer, and section fields are searchable text, while
the course field is a keyword field: it is matched exactly and used
for filtering, not for full-text search. We keep course in the index
because the same pattern works when you later index multiple courses.
Create a search function
Now we can search the dataset:
boost = {"question": 3.0, "section": 0.5}

index.search(
    query="Can I still join the course?",
    boost_dict=boost,
    num_results=5,
)
The boost gives more weight to matches in the FAQ question. Section
matches still help, but they shouldn't dominate the result.
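To see why per-field weights change the ranking, here is a toy illustration. It is not how minsearch computes relevance; `toy_score` is a hypothetical function that only counts overlapping words per field and multiplies each count by that field's boost (fields without a boost are assumed to default to 1.0). The two documents are made up.

```python
def toy_score(query: str, doc: dict, boost: dict[str, float]) -> float:
    """Count query words found in each field, scaled by that field's boost."""
    query_words = set(query.lower().split())
    score = 0.0
    for field in ("question", "answer", "section"):
        field_words = set(doc.get(field, "").lower().split())
        overlap = len(query_words & field_words)
        # Assumption for this sketch: unboosted fields weigh 1.0.
        score += boost.get(field, 1.0) * overlap
    return score


# Made-up records: doc_a matches in the question, doc_b only in the answer.
doc_a = {
    "question": "Can I join late",
    "answer": "Yes you can",
    "section": "General",
}
doc_b = {
    "question": "What is the schedule",
    "answer": "You can join late any time",
    "section": "General",
}

boost = {"question": 3.0, "section": 0.5}

# Without boosts, both records tie on two matching words.
unboosted = (toy_score("join late", doc_a, {}), toy_score("join late", doc_b, {}))
# With boosts, the question match pulls doc_a ahead.
boosted = (toy_score("join late", doc_a, boost), toy_score("join late", doc_b, boost))

print(unboosted, boosted)
```

In this toy setup the two records tie without boosts, and the record whose question matches wins once the question boost applies. That is the effect the boost is meant to have on FAQ search.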
Let's put this into one function:
def search(query: str) -> list[dict]:
    boost = {"question": 3.0, "section": 0.5}

    return index.search(
        query=query,
        boost_dict=boost,
        num_results=5,
    )
Run a quick test:
search("Can I still join the course?")
Look at the returned FAQ entries. These are the records the agent will use later when it answers course questions.
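Full records can be long, so a compact view makes the results easier to scan. The helper below is a sketch: `summarize` and its `width` parameter are hypothetical, and the sample results are made up to stand in for what `search(...)` returns.

```python
def summarize(results: list[dict], width: int = 60) -> list[str]:
    """Turn search results into one short line per FAQ entry."""
    lines = []
    for rank, doc in enumerate(results, start=1):
        question = doc.get("question", "")
        if len(question) > width:
            # Truncate long questions so each entry fits on one line.
            question = question[: width - 3] + "..."
        lines.append(f"{rank}. [{doc.get('section', '?')}] {question}")
    return lines


# Made-up results for illustration; in the notebook, pass search("...").
sample_results = [
    {
        "course": "data-engineering-zoomcamp",
        "section": "General",
        "question": "Can I still join the course?",
        "answer": "Yes.",
    },
    {
        "course": "data-engineering-zoomcamp",
        "section": "General",
        "question": "What can I do before the course starts? " * 3,
        "answer": "Set up your environment.",
    },
]

summary = summarize(sample_results)
for line in summary:
    print(line)
```

In the notebook, `summarize(search("Can I still join the course?"))` would list the rank, section, and question of each hit without printing the full answers.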
Exercise
Try a few course questions:
- Can I still join the course?
- How do I set up Docker for the course?
- Can I get a certificate in self-paced mode?
For each query, check whether the top FAQ entries actually answer the question. In the next part we give this same function to the model as a tool.