Part 1: Load the FAQ documents

We start with retrieval before we touch any agent framework. The agent needs a tool that can answer course-specific questions, and the simplest tool is a search function over the FAQ.

The parsed FAQ JSON groups documents by course. We flatten it into one list and keep the course name on every document so the search function can filter to the Data Engineering Zoomcamp.

import requests

docs_url = "https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json"
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course["course"]

    for doc in course["documents"]:
        doc["course"] = course_name
        documents.append(doc)

Each document has fields like question, text, section, and course. That is enough for a small retrieval tool: match the student question against the text fields, then restrict the results to the course the assistant is meant to support.
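To make the shape concrete, here is an illustrative flattened document; the wording is made up, but the field names match what the loop above produces:

```python
# Hypothetical example of one flattened FAQ document. The actual text
# comes from documents.json; only the field names matter here.
doc = {
    "question": "Can I still join the course after the start date?",
    "text": "Yes, you can still join and submit homework.",
    "section": "General course-related questions",
    "course": "data-engineering-zoomcamp",
}
```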

Build the in-memory index

The workshop uses minsearch because it is simple and runs in memory. When the notebook process stops, any additions you made to the index disappear. For this workshop, that is fine: we want to learn the agent flow, not manage a persistent search service.

from minsearch import AppendableIndex

index = AppendableIndex(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

index.fit(documents)

The text_fields participate in text search. The keyword_fields can be used for exact filters. We keep course as a keyword because a student from one course should not receive FAQ entries from another course.
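The split between the two field types can be sketched in plain Python. This toy function is not minsearch itself: it stands in for the real index to show that keyword fields are matched exactly as a hard filter, while text fields contribute to a relevance score (here, naive token overlap instead of minsearch's real scoring):

```python
# Toy illustration of keyword filtering vs. text scoring (not minsearch).
def toy_search(docs, query, course):
    # Keyword field: exact match, acts as a hard filter.
    candidates = [d for d in docs if d["course"] == course]

    # Text fields: scored by (naive) token overlap with the query.
    q_tokens = set(query.lower().split())

    def score(d):
        text = " ".join([d["question"], d["text"], d["section"]]).lower()
        return len(q_tokens & set(text.split()))

    return sorted(candidates, key=score, reverse=True)

docs = [
    {"question": "can i join late", "text": "yes", "section": "general",
     "course": "data-engineering-zoomcamp"},
    {"question": "can i join late", "text": "yes", "section": "general",
     "course": "ml-zoomcamp"},
]

results = toy_search(docs, "join late", "data-engineering-zoomcamp")
```

Even though both documents match the query text equally well, only the one from the requested course survives the filter.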

Try a search directly before exposing anything to an LLM:

query = "I just discovered the course. Can I join now?"

index.search(
    query=query,
    filter_dict={"course": "data-engineering-zoomcamp"},
    num_results=5,
)

Check this before the model gets involved. If retrieval fails, function calling will only make the failure harder to diagnose.

Wrap search as a tool function

Now put the search call behind a function. The model will not call index.search directly. It will call a tool with a stable public shape, and that tool can hide the details of filters, boosts, and result count.

def search(query):
    boost = {"question": 3.0, "section": 0.5}

    results = index.search(
        query=query,
        filter_dict={"course": "data-engineering-zoomcamp"},
        boost_dict=boost,
        num_results=5,
    )

    return results

The boost gives matches in the FAQ question more weight than matches in the body text. If the student asks something close to an existing FAQ question, that entry should rank higher.
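The effect of boosting can be sketched with made-up per-field scores. minsearch computes real relevance scores internally; this sketch only shows how per-field weights change the final ranking:

```python
# Sketch of field boosting with hypothetical per-field scores.
boost = {"question": 3.0, "section": 0.5}

def boosted_score(field_scores, boost):
    # Fields without an explicit boost weigh 1.0.
    return sum(score * boost.get(field, 1.0) for field, score in field_scores.items())

# An equal-strength match counts three times more in "question" than in "text".
question_hit = boosted_score({"question": 1.0, "text": 0.0}, boost)
text_hit = boosted_score({"question": 0.0, "text": 1.0}, boost)
```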

Call the function by hand:

search("join course now")

This function is already useful without an LLM. The next step is to tell OpenAI that this function exists and let the model decide when to call it.
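As a preview, the function can be described to the model as a JSON schema. This is a hedged sketch: the exact envelope differs between OpenAI's Chat Completions and Responses APIs, and the wiring is covered in the next part. The shape below follows the flat Responses-style tool definition:

```python
# Sketch of a tool definition for the search function; the exact
# structure depends on which OpenAI API variant you use.
search_tool = {
    "type": "function",
    "name": "search",
    "description": "Search the course FAQ for entries relevant to a student question.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The student question to look up in the FAQ.",
            }
        },
        "required": ["query"],
        "additionalProperties": False,
    },
}
```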
