Part 1: Load the FAQ documents

We start with retrieval before we touch any agent framework. The agent needs a tool that can answer course-specific questions, and the simplest such tool is a search function over the FAQ.

The parsed FAQ JSON groups documents by course. We flatten it into one list and keep the course name on every document. That lets the search function filter to the Data Engineering Zoomcamp.

import requests

docs_url = "https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json"
docs_response = requests.get(docs_url, timeout=30)
docs_response.raise_for_status()  # fail early if the download did not succeed
documents_raw = docs_response.json()

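documents = []

# Flatten the per-course groups into one list of documents.
for course in documents_raw:
    course_name = course["course"]

    for doc in course["documents"]:
        # Tag each document with its course so we can filter on it later.
        doc["course"] = course_name
        documents.append(doc)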

Each document includes question, text, and section fields alongside course, which is enough for a small retrieval tool. We match the student's question against the question, text, and section fields, then restrict the results to the one course the assistant supports.
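To make the flattening step concrete, here is a minimal sketch that runs the same loop over a tiny hand-written sample instead of the downloaded file. The sample entries and the `flatten` helper are invented for illustration; they only mimic the grouped structure described above. Unlike the loop above, this version copies each document instead of modifying it in place.

```python
# Hypothetical sample that mimics the grouped structure; not real FAQ data.
sample_raw = [
    {
        "course": "data-engineering-zoomcamp",
        "documents": [
            {"question": "Can I still join?", "text": "Yes.", "section": "General"},
        ],
    },
    {
        "course": "other-course",
        "documents": [
            {"question": "Where is the homework?", "text": "In the repo.", "section": "Homework"},
            {"question": "Is there a deadline?", "text": "See the calendar.", "section": "Homework"},
        ],
    },
]


def flatten(raw):
    """Return one flat list of documents, each tagged with its course."""
    flat = []
    for course in raw:
        for doc in course["documents"]:
            # Copy rather than mutate, so the input stays untouched.
            flat.append({**doc, "course": course["course"]})
    return flat


sample_docs = flatten(sample_raw)
print(len(sample_docs))
print(sorted(sample_docs[0].keys()))
```

Every flattened document carries its own course value, which is what the keyword filter relies on later.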

Build the in-memory index

We use minsearch because it's simple and runs in memory. Nothing is persisted: when the notebook process stops, the index is gone, including anything you appended to it, and you rebuild it by running the cells again. That's fine here: we want to learn the agent flow, not manage a persistent search service.

from minsearch import AppendableIndex

index = AppendableIndex(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

index.fit(documents)

The text_fields participate in text search, while the keyword_fields drive exact filters. We keep course as a keyword because a student from one course shouldn't receive FAQ entries from another course.
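The split between the two kinds of fields can be illustrated with a toy search written in plain Python. This is not how minsearch works internally; `toy_search` and the sample documents are hypothetical and exist only to show the idea of "filter exactly on keyword fields, then rank by matches in text fields".

```python
def toy_search(docs, query, text_fields, filter_dict):
    """Exact-match filter on keyword fields, then rank by word overlap."""
    query_words = set(query.lower().split())
    scored = []
    for doc in docs:
        # Keyword fields: every filter value must match exactly.
        if any(doc.get(key) != value for key, value in filter_dict.items()):
            continue
        # Text fields: count query words that appear in any text field.
        words = set()
        for field in text_fields:
            words.update(doc.get(field, "").lower().split())
        score = len(query_words & words)
        if score > 0:
            scored.append((score, doc))
    # Highest overlap first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


# Hypothetical documents, invented for this illustration.
toy_docs = [
    {"question": "Can I join the course late", "text": "Yes you can still join",
     "section": "General", "course": "data-engineering-zoomcamp"},
    {"question": "Can I join the course late", "text": "Registration is closed",
     "section": "General", "course": "machine-learning-zoomcamp"},
    {"question": "How do I install Docker", "text": "Follow the setup guide",
     "section": "Setup", "course": "data-engineering-zoomcamp"},
]

toy_hits = toy_search(
    toy_docs,
    "can I join",
    text_fields=["question", "text", "section"],
    filter_dict={"course": "data-engineering-zoomcamp"},
)
for hit in toy_hits:
    print(hit["course"], "|", hit["question"])
```

The second sample document matches the query just as well as the first, but the course filter removes it before any scoring happens.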

Try a search directly before exposing anything to an LLM:

query = "I just discovered the course. Can I join now?"

index.search(
    query=query,
    filter_dict={"course": "data-engineering-zoomcamp"},
    num_results=5,
)

Check that the results look relevant before the model gets involved. If retrieval fails, function calling will only make the failure harder to diagnose.
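A small helper can make that manual check repeatable. The sketch below assumes search results are a list of dictionaries with course and text keys, as in the flattened documents; `find_retrieval_problems` is a hypothetical name, not part of minsearch.

```python
def find_retrieval_problems(results, expected_course):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not results:
        problems.append("no results returned")
    for position, doc in enumerate(results):
        if doc.get("course") != expected_course:
            problems.append(
                f"result {position} belongs to {doc.get('course')!r}"
            )
        if not doc.get("text"):
            problems.append(f"result {position} has no text")
    return problems


# Hypothetical results, invented for this illustration.
good_results = [
    {"question": "Can I join late?", "text": "Yes.", "course": "data-engineering-zoomcamp"},
]
bad_results = [
    {"question": "Can I join late?", "text": "", "course": "other-course"},
]

print(find_retrieval_problems(good_results, "data-engineering-zoomcamp"))
print(find_retrieval_problems(bad_results, "data-engineering-zoomcamp"))
print(find_retrieval_problems([], "data-engineering-zoomcamp"))
```

In the notebook you would pass the output of index.search in place of the sample lists. The helper only catches structural problems; whether the top hit actually answers the question still needs a human look.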

Wrap search as a tool function

Now put the search call behind a function. Rather than call index.search directly, the model calls a tool with a stable public interface that hides the filters, boosts, and result count.

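def search(query):
    """Search the Data Engineering Zoomcamp FAQ and return the top 5 matches."""
    # Matches in the question count more, matches in the section count less.
    boost = {"question": 3.0, "section": 0.5}

    results = index.search(
        query=query,
        filter_dict={"course": "data-engineering-zoomcamp"},
        boost_dict=boost,
        num_results=5,
    )

    return results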

The boost gives matches in the FAQ question more weight than matches in the body text, and matches in the section name less. If the student asks something close to an existing FAQ question, that entry should rank higher.
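The effect of per-field weights can be seen in a toy scoring function. This is a simplified word-overlap score, not the scoring minsearch actually uses; `weighted_score` and the two sample documents are hypothetical, and the sketch assumes a field without an explicit boost gets a weight of 1.0.

```python
def weighted_score(doc, query, boost):
    """Sum of per-field word overlaps, each multiplied by its field weight."""
    query_words = set(query.lower().split())
    total = 0.0
    for field in ("question", "text", "section"):
        field_words = set(doc.get(field, "").lower().split())
        matches = len(query_words & field_words)
        # Assumption for this sketch: unlisted fields default to weight 1.0.
        total += boost.get(field, 1.0) * matches
    return total


# Hypothetical documents: one matches in the question, one in the body text.
doc_a = {"question": "How do I join the course", "text": "Fill in the form",
         "section": "General"}
doc_b = {"question": "What is the deadline",
         "text": "You can join the course any time before the deadline",
         "section": "General"}

toy_query = "join course"
toy_boost = {"question": 3.0, "section": 0.5}

print(weighted_score(doc_a, toy_query, {}), weighted_score(doc_b, toy_query, {}))
print(weighted_score(doc_a, toy_query, toy_boost), weighted_score(doc_b, toy_query, toy_boost))
```

Without boosts the two documents tie, because each contains both query words once. With the boost applied, the document whose question matches pulls ahead, which is the behavior we want for FAQ-style data.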

Call the function by hand:

search("join course now")

This function is already useful without an LLM. The next step is to tell OpenAI that this function exists and let the model decide when to call it.
