Part 1: Load FAQ documents

We start by loading the course FAQ because every later guardrail needs a real agent to protect. The notebook uses docs.py for this job: it downloads a GitHub repository, filters files, parses markdown frontmatter, and returns dictionaries we can search.

Download the loader

The first notebook cell downloads docs.py from the workshop repo:

wget https://raw.githubusercontent.com/alexeygrigorev/workshops/refs/heads/main/guardrails/docs.py

The helper is deliberately generic. It can read any public GitHub repository, but in this workshop we point it at the DataTalks.Club FAQ repository and filter for Data Engineering Zoomcamp files.

The FAQ exists as a large rendered document too, but searching it by hand is awkward. A word like join can match SQL joins as well as course-join questions. The workshop builds a search layer so the agent can retrieve a small set of candidate FAQ entries first.

The reader class

docs.py starts with a small dataclass:

from dataclasses import dataclass

@dataclass
class RawRepositoryFile:
    filename: str
    content: str

The reader returns one RawRepositoryFile per file. Keep the filename with the content because the FAQ filenames still carry useful course and section context.

The constructor builds the GitHub zip URL and stores the filters:

from collections.abc import Callable, Iterable

class GithubRepositoryDataReader:
    """
    Downloads and parses markdown and code files from a GitHub repository.
    """

    def __init__(self,
            repo_owner: str,
            repo_name: str,
            allowed_extensions: Iterable[str] | None = None,
            filename_filter: Callable[[str], bool] | None = None
        ):
        prefix = "https://codeload.github.com"
        self.url = (
            f"{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main"
        )
        self.allowed_extensions = allowed_extensions
        # Default to a filter that accepts everything so the skip logic
        # can call it unconditionally.
        self.filename_filter = filename_filter or (lambda _: True)

The allowed_extensions filter keeps us on markdown files, and filename_filter lets us keep only paths containing data-engineering. That means the notebook does not need to download one file at a time or know the FAQ folder structure in advance.
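To see how the two filters combine, here is a standalone sketch (the names below are illustrative and not part of docs.py):

```python
# Standalone illustration of the two-filter idea (not the docs.py code).
allowed_extensions = {"md"}
filename_filter = lambda fp: "data-engineering" in fp.lower()

def keep(filepath: str) -> bool:
    # Keep a file only if its extension is allowed AND the path passes the filter.
    ext = filepath.rsplit(".", 1)[-1].lower()
    return ext in allowed_extensions and filename_filter(filepath)

print(keep("data-engineering-zoomcamp/module-1.md"))  # True
print(keep("data-engineering-zoomcamp/notes.py"))     # False: wrong extension
print(keep("machine-learning-zoomcamp/faq.md"))       # False: wrong course
```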

Reading the zip file

The read method downloads the repository zip and passes it to _extract_files:

def read(self) -> list[RawRepositoryFile]:
    """
    Download and extract files from the GitHub repository.
    """
    resp = requests.get(self.url)
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")

    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    repository_data = self._extract_files(zf)
    zf.close()

    return repository_data

The extraction loop normalizes the path, applies the filters, reads the file as UTF-8, and stores the result:

def _extract_files(self, zf: zipfile.ZipFile) -> list[RawRepositoryFile]:
    data = []

    for file_info in zf.infolist():
        filepath = self._normalize_filepath(file_info.filename)

        if self._should_skip_file(filepath):
            continue

        with zf.open(file_info) as f_in:
            # decode() never returns None, so we can strip unconditionally
            content = f_in.read().decode("utf-8", errors="ignore").strip()

            file = RawRepositoryFile(
                filename=filepath,
                content=content
            )
            data.append(file)

    return data

The full helper in the workshop repo catches per-file exceptions and prints a traceback so one bad file does not stop the whole read. The control flow above is the part you need to understand for the notebook.
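That guarded loop might look roughly like this sketch (illustrative only; extract_files_safely is not the workshop's function name):

```python
import io
import traceback
import zipfile

def extract_files_safely(zf: zipfile.ZipFile) -> list[str]:
    # Sketch: decode each archive member, reporting (not raising) any failure
    # so one bad file does not abort the whole read.
    contents = []
    for file_info in zf.infolist():
        try:
            with zf.open(file_info) as f_in:
                contents.append(f_in.read().decode("utf-8", errors="ignore").strip())
        except Exception:
            print(f"Skipping {file_info.filename}:")
            traceback.print_exc()
    return contents
```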

File filters

The reader skips directories, hidden files, files with the wrong extension, and files rejected by filename_filter:

def _should_skip_file(self, filepath: str) -> bool:
    filepath = filepath.lower()

    if filepath.endswith("/"):
        return True

    filename = filepath.split("/")[-1]
    if filename.startswith("."):
        return True

    if self.allowed_extensions:
        ext = self._get_extension(filepath)
        if ext not in self.allowed_extensions:
            return True

    if not self.filename_filter(filepath):
        return True

    return False
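The _get_extension helper is not shown above. A minimal version consistent with how _should_skip_file uses it could look like this (written as a plain function here for clarity; in docs.py it is a method):

```python
def get_extension(filepath: str) -> str:
    """Return the lowercase extension without the dot, or '' if there is none."""
    filename = filepath.split("/")[-1]
    if "." not in filename:
        return ""
    return filename.rsplit(".", maxsplit=1)[-1].lower()

print(get_extension("cohorts/2024/module-1.MD"))  # md
print(get_extension("LICENSE"))                   # (empty string)
```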

The zip file contains a top-level folder such as faq-main/. The helper removes that prefix so filenames are stable:

def _normalize_filepath(self, filepath: str) -> str:
    """
    Removes the top-level directory from the file path inside the zip archive.
    'repo-main/path/to/file.py' -> 'path/to/file.py'
    """
    parts = filepath.split("/", maxsplit=1)
    if len(parts) > 1:
        return parts[1]
    else:
        return parts[0]

The filename becomes part of each parsed FAQ document, which makes it available later when search results are returned to the model.

Parse the FAQ markdown

parse_data uses python-frontmatter to turn each markdown file into a dictionary:

def parse_data(data_raw):
    data_parsed = []
    for f in data_raw:
        post = frontmatter.loads(f.content)
        data = post.to_dict()
        data['filename'] = f.filename
        data_parsed.append(data)

    return data_parsed
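Under the hood, python-frontmatter splits the YAML header from the markdown body and merges both into one dictionary, with the body stored under the content key. A stdlib-only sketch of the idea (not the library's actual implementation):

```python
def parse_frontmatter(text: str) -> dict:
    # Minimal sketch of what python-frontmatter does: split the
    # '---'-delimited header from the body, then merge them into one dict.
    _, header, body = text.split("---", maxsplit=2)
    data = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        data[key.strip()] = value.strip()
    data["content"] = body.strip()
    return data

doc = """---
id: 123abc
question: How do I join the course?
---
Just register before the start date."""

record = parse_frontmatter(doc)
print(record["question"])  # How do I join the course?
```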

Now use the reader in the notebook:

from docs import GithubRepositoryDataReader, parse_data

reader = GithubRepositoryDataReader(
    repo_owner="DataTalksClub",
    repo_name="faq",
    allowed_extensions={"md"},
    filename_filter=lambda fp: "data-engineering" in fp.lower()
)

faq_raw = reader.read()
faq_documents = parse_data(faq_raw)

print(f"Loaded {len(faq_documents)} FAQ entries")

Expected output:

Loaded 449 FAQ entries

Look at one document before building the agent:

faq_documents[4]

The record includes id, question, sort_order, content, and filename. That is enough for search and for showing the model where an answer came from.

Unused helpers in docs.py

docs.py also includes sliding_window and chunk_documents. They split long documents into overlapping chunks:

chunks = sliding_window("hello world", size=5, step=3)

Those helpers are useful for longer documentation sets, but this notebook does not call them. The FAQ entries are already small enough to index directly.
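A typical implementation of such a sliding window looks like this (a sketch; the workshop's version may differ in details):

```python
def sliding_window(text: str, size: int, step: int) -> list[str]:
    # Walk the string in strides of `step`, emitting windows of `size`
    # characters, so consecutive chunks overlap by (size - step) characters.
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

print(sliding_window("hello world", size=5, step=3))
# ['hello', 'lo wo', 'world']
```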

Continue with Part 2: Base FAQ agent to turn these documents into a searchable tool and a working agent.
