Part 1: Load the FAQ documents
We start with retrieval before we touch any agent framework. The agent needs a tool that can answer course-specific questions, and the simplest tool is a search function over the FAQ.
The parsed FAQ JSON groups documents by course. We flatten it into one list and keep the course name on every document so the search function can filter to the Data Engineering Zoomcamp.
import requests
docs_url = "https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json"
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()
documents = []
for course in documents_raw:
    course_name = course["course"]

    for doc in course["documents"]:
        doc["course"] = course_name
        documents.append(doc)
Each document has fields like question, text, section, and course. That is enough for a small retrieval tool: match the student question against the text fields, then restrict the results to the course the assistant is meant to support.
Build the in-memory index
The workshop uses minsearch because it is simple and runs in memory.
When the notebook process stops, any additions you made to the index
disappear. For this workshop, that is fine: we want to learn the agent flow,
not manage a persistent search service.
from minsearch import AppendableIndex
index = AppendableIndex(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
index.fit(documents)
The text_fields participate in text search. The keyword_fields can
be used for exact filters. We keep course as a keyword because a
student from one course should not receive FAQ entries from another
course.
Try a search directly before exposing anything to an LLM:
query = "I just discovered the course. Can I join now?"
index.search(
    query=query,
    filter_dict={"course": "data-engineering-zoomcamp"},
    num_results=5,
)
Check this before the model gets involved. If retrieval fails, function calling will only make the failure harder to diagnose.
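One quick way to eyeball the hits, assuming you store the results in a variable (minsearch returns the matching documents as plain dicts, so the fields above are all available):

results = index.search(
    query=query,
    filter_dict={"course": "data-engineering-zoomcamp"},
    num_results=5,
)

for hit in results:
    print(hit["question"], "|", hit["section"])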
Wrap search as a tool function
Now put the search call behind a function. The model will not call
index.search directly. It will call a tool with a stable public shape,
and that tool can hide the details of filters, boosts, and result count.
def search(query):
    boost = {"question": 3.0, "section": 0.5}

    results = index.search(
        query=query,
        filter_dict={"course": "data-engineering-zoomcamp"},
        boost_dict=boost,
        num_results=5,
    )

    return results
The boost weights matches in the question field three times as heavily as matches in the body text, and down-weights the section field. If the student asks something close to an existing FAQ question, that entry should rank higher.
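To see the boost in action, run the same query with and without boost_dict and compare the top hits. This is a throwaway check, not part of the tool:

plain = index.search(
    query="Can I join the course now?",
    filter_dict={"course": "data-engineering-zoomcamp"},
    num_results=3,
)
boosted = index.search(
    query="Can I join the course now?",
    filter_dict={"course": "data-engineering-zoomcamp"},
    boost_dict={"question": 3.0, "section": 0.5},
    num_results=3,
)

print("no boost:", [d["question"] for d in plain])
print("boost:   ", [d["question"] for d in boosted])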
Call the function by hand:
search("join course now")
This function is already useful without an LLM. The next step is to tell OpenAI that this function exists and let the model decide when to call it.
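As a preview, the Chat Completions API expects each tool to be described with a JSON schema. A minimal sketch for this search function could look like this; the name and description text are ours, only the structure is fixed by the API:

search_tool = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Search the course FAQ for entries relevant to the student's question.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query, usually the student's question.",
                }
            },
            "required": ["query"],
        },
    },
}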