Part 1: Load FAQ documents

We start by loading the course FAQ because every later guardrail needs a real agent to protect. We use docs.py for this job. It downloads a GitHub repository, filters files, parses markdown frontmatter, and returns dictionaries we can search.

Download the loader

In the first notebook cell we download docs.py from the workshop repo:

!wget https://raw.githubusercontent.com/alexeygrigorev/workshops/refs/heads/main/guardrails/docs.py

The helper is deliberately generic, so it can read any public GitHub repository. In this workshop, we point it at the DataTalks.Club FAQ repository and filter for Data Engineering Zoomcamp files.

The FAQ exists as a large rendered document too, but searching it by hand is awkward. A word like join can match SQL joins as well as course-join questions. We build a search layer so the agent can retrieve a small set of candidate FAQ entries first.

The reader class

docs.py starts with a small dataclass:

from dataclasses import dataclass

@dataclass
class RawRepositoryFile:
    filename: str
    content: str

The reader returns one RawRepositoryFile per file. Keep the filename with the content because the FAQ filenames still hold useful course and section context.

The constructor takes a repository owner and name plus two filters: allowed_extensions restricts the reader to markdown files, and filename_filter keeps only paths containing data-engineering. The read method downloads the repository zip, strips the top-level folder from each path, applies both filters, and decodes each remaining file as UTF-8. You import these from docs.py rather than typing them out. Read the full source at docs.py.
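To make the read steps concrete, here is a minimal sketch of the unzip, strip, filter, and decode sequence. It is not the code from docs.py: read_zip_bytes is a hypothetical helper, the archive is built in memory instead of being downloaded, and the file paths are made up for illustration.

```python
import io
import zipfile
from dataclasses import dataclass


@dataclass
class RawRepositoryFile:
    filename: str
    content: str


def read_zip_bytes(zip_bytes, allowed_extensions, filename_filter):
    """Hypothetical helper: extract matching text files from a repository zip."""
    files = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Drop the single top-level folder (for example "faq-main/").
            _, _, path = info.filename.partition("/")
            extension = path.rsplit(".", 1)[-1].lower() if "." in path else ""
            if extension not in allowed_extensions:
                continue
            if not filename_filter(path):
                continue
            content = archive.read(info).decode("utf-8")
            files.append(RawRepositoryFile(filename=path, content=content))
    return files


# Build a tiny in-memory archive so the sketch runs without network access.
# These paths are invented and do not reflect the real repository layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("faq-main/_questions/data-engineering-zoomcamp/general/001.md", "# Q1")
    archive.writestr("faq-main/_questions/machine-learning-zoomcamp/general/001.md", "# Q2")
    archive.writestr("faq-main/data-engineering/notes.txt", "not markdown")

sample_files = read_zip_bytes(
    buffer.getvalue(),
    allowed_extensions={"md"},
    filename_filter=lambda fp: "data-engineering" in fp.lower(),
)
print([f.filename for f in sample_files])
```

Only the first file survives: the second fails the filename filter and the third fails the extension filter. The surviving filename no longer carries the top-level folder.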

The filename becomes part of each parsed FAQ document. That makes it available later when search results are returned to the model.

Parse the FAQ markdown

parse_data uses python-frontmatter to turn each markdown file into a dictionary:

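import frontmatter  # provided by the python-frontmatter package

def parse_data(data_raw):
    data_parsed = []
    for f in data_raw:
        # Split the frontmatter from the markdown body
        post = frontmatter.loads(f.content)
        # Metadata keys plus a 'content' key holding the body
        data = post.to_dict()
        # Keep the path so search results can show their source
        data['filename'] = f.filename
        data_parsed.append(data)

    return data_parsed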
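If frontmatter is new to you, the sketch below shows roughly what that transformation produces. It is a simplified, standard-library stand-in for python-frontmatter, not the library itself: split_frontmatter is a hypothetical function, it only understands flat key: value lines, and it leaves every value as a string, whereas a real YAML parser would turn sort_order into an integer. The sample entry is invented.

```python
def split_frontmatter(text):
    """Simplified stand-in for frontmatter.loads(text).to_dict()."""
    metadata = {}
    body = text
    if text.startswith("---\n"):
        # The header sits between the opening and closing '---' lines.
        header, separator, rest = text[4:].partition("\n---\n")
        if separator:
            for line in header.splitlines():
                key, colon, value = line.partition(":")
                if colon:
                    metadata[key.strip()] = value.strip()
            body = rest
    record = dict(metadata)
    record["content"] = body.strip()
    return record


# Invented FAQ entry, shaped like the fields described in this part.
sample = """---
id: abc123
question: How do I join the course?
sort_order: 1
---
Register using the link on the course page.
"""

record = split_frontmatter(sample)
record["filename"] = "example/001.md"  # parse_data adds this key the same way
print(record)
```

The result is one flat dictionary: the metadata keys, a content key with the markdown body, and the filename added afterwards.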

Now use the reader in the notebook:

from docs import GithubRepositoryDataReader, parse_data

reader = GithubRepositoryDataReader(
    repo_owner="DataTalksClub",
    repo_name="faq",
    allowed_extensions={"md"},
    filename_filter=lambda fp: "data-engineering" in fp.lower()
)

faq_raw = reader.read()
faq_documents = parse_data(faq_raw)

print(f"Loaded {len(faq_documents)} FAQ entries")

You should see output like this:

Loaded 449 FAQ entries

Look at one document before building the agent:

faq_documents[4]

The record includes:

  • id
  • question
  • sort_order
  • content
  • filename

That's enough for search, and it also lets the model show where an answer came from.
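Before indexing, it can be worth confirming that every record carries these fields. The sketch below is one possible check, not part of docs.py: missing_fields is a hypothetical helper and the two sample records are invented.

```python
EXPECTED_FIELDS = {"id", "question", "sort_order", "content", "filename"}


def missing_fields(documents, expected=EXPECTED_FIELDS):
    """Return {index: sorted missing keys} for documents lacking expected fields."""
    problems = {}
    for index, doc in enumerate(documents):
        missing = expected - set(doc)
        if missing:
            problems[index] = sorted(missing)
    return problems


# Invented records: the second one lacks two fields on purpose.
sample_documents = [
    {"id": "a1", "question": "Q?", "sort_order": 1, "content": "A.", "filename": "x/001.md"},
    {"id": "b2", "question": "Q2?", "content": "A2."},
]

print(missing_fields(sample_documents))
```

In the notebook you would pass faq_documents instead of the sample list. An empty dictionary means every record has all five fields.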

Unused helpers in docs.py

docs.py also includes sliding_window and chunk_documents.

They split long documents into overlapping chunks:

from docs import sliding_window

chunks = sliding_window("hello world", size=5, step=3)

Those helpers are useful for longer documentation sets, but this notebook doesn't call them. The FAQ entries are already small enough to index directly.
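To see what overlapping chunks look like, here is a minimal sketch of the idea. sliding_window_sketch is a hypothetical re-implementation that returns plain slices; the version in docs.py may return a different structure or handle the final window differently.

```python
def sliding_window_sketch(seq, size, step):
    """Hypothetical sketch of overlapping windows; docs.py may differ in details."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    chunks = []
    for start in range(0, len(seq), step):
        chunks.append(seq[start:start + size])
        # Stop once a window has reached the end of the sequence.
        if start + size >= len(seq):
            break
    return chunks


print(sliding_window_sketch("hello world", size=5, step=3))
# ['hello', 'lo wo', 'world']
```

Because step is smaller than size, each window repeats the last two characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk.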

Continue with Part 2: Base FAQ agent to turn these documents into a searchable tool and a working agent.
