Part 1: Load FAQ documents
We start by loading the course FAQ because every later guardrail needs a
real agent to protect. The notebook uses docs.py for this job: it
downloads a GitHub repository, filters files, parses markdown frontmatter,
and returns dictionaries we can search.
Download the loader
The first notebook cell downloads docs.py from the workshop repo:
wget https://raw.githubusercontent.com/alexeygrigorev/workshops/refs/heads/main/guardrails/docs.py
The helper is deliberately generic. It can read any public GitHub repository, but in this workshop we point it at the DataTalks.Club FAQ repository and filter for Data Engineering Zoomcamp files.
The FAQ also exists as one large rendered document, but searching it by hand
is awkward: a word like "join" matches SQL joins as well as questions about
joining the course. The workshop builds a search layer so the agent can first
retrieve a small set of candidate FAQ entries.
The reader class
docs.py starts with its imports and a small dataclass:
import io
import zipfile
from collections.abc import Callable, Iterable
from dataclasses import dataclass
import requests

@dataclass
class RawRepositoryFile:
    filename: str
    content: str
The reader returns one RawRepositoryFile per file. Keep the filename with
the content because the FAQ filenames still carry useful course and section
context.
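To make that concrete, here is a hand-built record; the path below is invented for illustration, but real FAQ paths carry the course and section in the same way:
example = RawRepositoryFile(
    filename="data-engineering-zoomcamp/module-1/docker.md",  # hypothetical path
    content="Question: How do I install Docker?\n\nAnswer: ...",
)
print(example.filename.split("/")[0])  # the course name comes straight from the path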
The constructor builds the GitHub zip URL and stores the filters:
class GithubRepositoryDataReader:
    """
    Downloads and parses markdown and code files from a GitHub repository.
    """

    def __init__(
        self,
        repo_owner: str,
        repo_name: str,
        allowed_extensions: Iterable[str] | None = None,
        filename_filter: Callable[[str], bool] | None = None,
    ):
        prefix = "https://codeload.github.com"
        self.url = f"{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main"
        # Keep the filters around for _should_skip_file; accept every path
        # when no filename filter is given.
        self.allowed_extensions = set(allowed_extensions) if allowed_extensions else None
        self.filename_filter = filename_filter or (lambda _: True)
The allowed_extensions filter keeps us on markdown files, and
filename_filter lets us keep only paths containing data-engineering.
That means the notebook does not need to download one file at a time or
know the FAQ folder structure in advance.
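As a quick sanity check on how the filename filter behaves (the paths here are made up):
filename_filter = lambda fp: "data-engineering" in fp.lower()

filename_filter("data-engineering-zoomcamp/module-1/docker.md")  # True  -> kept
filename_filter("machine-learning-zoomcamp/module-1/intro.md")   # False -> skipped
The extension filter works the same way, except that it compares the file's extension against the allowed_extensions set.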
Reading the zip file
The read method downloads the repository zip and passes it to
_extract_files:
def read(self) -> list[RawRepositoryFile]:
"""
Download and extract files from the GitHub repository.
"""
resp = requests.get(self.url)
if resp.status_code != 200:
raise Exception(f"Failed to download repository: {resp.status_code}")
zf = zipfile.ZipFile(io.BytesIO(resp.content))
repository_data = self._extract_files(zf)
zf.close()
return repository_data
The extraction loop normalizes the path, applies the filters, reads the file as UTF-8, and stores the result:
def _extract_files(self, zf: zipfile.ZipFile) -> list[RawRepositoryFile]:
data = []
for file_info in zf.infolist():
filepath = self._normalize_filepath(file_info.filename)
if self._should_skip_file(filepath):
continue
with zf.open(file_info) as f_in:
content = f_in.read().decode("utf-8", errors="ignore")
if content is not None:
content = content.strip()
file = RawRepositoryFile(
filename=filepath,
content=content
)
data.append(file)
return data
The full helper in the workshop repo catches per-file exceptions and prints a traceback so one bad file does not stop the whole read. The control flow above is the part you need to understand for the notebook.
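The exact wrapper lives in the repo; as a rough sketch of the idea (the method name here is ours, not the verbatim docs.py code), the per-file error handling looks roughly like this:
import traceback

def _extract_files_with_logging(self, zf: zipfile.ZipFile) -> list[RawRepositoryFile]:
    data = []
    for file_info in zf.infolist():
        try:
            filepath = self._normalize_filepath(file_info.filename)
            if self._should_skip_file(filepath):
                continue
            with zf.open(file_info) as f_in:
                content = f_in.read().decode("utf-8", errors="ignore").strip()
            data.append(RawRepositoryFile(filename=filepath, content=content))
        except Exception:
            traceback.print_exc()  # report the bad file, keep processing the rest
    return data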
File filters
The reader skips directories, hidden files, files with the wrong
extension, and files rejected by filename_filter:
def _should_skip_file(self, filepath: str) -> bool:
filepath = filepath.lower()
if filepath.endswith("/"):
return True
filename = filepath.split("/")[-1]
if filename.startswith("."):
return True
if self.allowed_extensions:
ext = self._get_extension(filepath)
if ext not in self.allowed_extensions:
return True
if not self.filename_filter(filepath):
return True
return False
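_get_extension itself is not shown above. A minimal sketch of what it needs to do, assuming extensions are compared without the leading dot (which matches the {"md"} value used later in the notebook), would be:
def _get_extension(self, filepath: str) -> str:
    # "path/to/file.md" -> "md"; files without a dot get an empty extension
    filename = filepath.split("/")[-1]
    if "." not in filename:
        return ""
    return filename.rsplit(".", maxsplit=1)[-1]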
The zip file contains a top-level folder such as faq-main/. The helper
removes that prefix so filenames are stable:
def _normalize_filepath(self, filepath: str) -> str:
"""
Removes the top-level directory from the file path inside the zip archive.
'repo-main/path/to/file.py' -> 'path/to/file.py'
"""
parts = filepath.split("/", maxsplit=1)
if len(parts) > 1:
return parts[1]
else:
return parts[0]
The filename becomes part of each parsed FAQ document, which makes it available later when search results are returned to the model.
Parse the FAQ markdown
parse_data uses python-frontmatter to turn each markdown file into a
dictionary:
import frontmatter

def parse_data(data_raw):
data_parsed = []
for f in data_raw:
post = frontmatter.loads(f.content)
data = post.to_dict()
data['filename'] = f.filename
data_parsed.append(data)
return data_parsed
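To see what that does, here is a self-contained example with made-up frontmatter; the field names mirror the real FAQ files, but the values are illustrative:
import frontmatter

sample = """---
id: abc123
question: When does the course start?
sort_order: 1
---
Check the course calendar for the exact start date.
"""

post = frontmatter.loads(sample)
doc = post.to_dict()
# doc now contains 'id', 'question' and 'sort_order' from the YAML header,
# plus 'content' with the markdown body below it.
print(doc["question"])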
Now use the reader in the notebook:
from docs import GithubRepositoryDataReader, parse_data
reader = GithubRepositoryDataReader(
repo_owner="DataTalksClub",
repo_name="faq",
allowed_extensions={"md"},
filename_filter=lambda fp: "data-engineering" in fp.lower()
)
faq_raw = reader.read()
faq_documents = parse_data(faq_raw)
print(f"Loaded {len(faq_documents)} FAQ entries")
Expected output:
Loaded 449 FAQ entries
Look at one document before building the agent:
faq_documents[4]
The record includes id, question, sort_order, content,
and filename. That is enough for search and for showing the model where
an answer came from.
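For example, pulling individual fields out of a record is plain dictionary access:
doc = faq_documents[4]
print(doc["question"])
print(doc["filename"])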
Unused helpers in docs.py
docs.py also includes sliding_window and chunk_documents. They split
long documents into overlapping chunks:
chunks = sliding_window("hello world", size=5, step=3)
Those helpers are useful for longer documentation sets, but this notebook does not call them. The FAQ entries are already small enough to index directly.
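For reference, a sliding-window splitter along those lines can be sketched as follows; this illustrates the idea and is not necessarily the exact docs.py implementation:
def sliding_window(text: str, size: int, step: int) -> list[str]:
    # Take `size` characters, then move forward by `step`, so consecutive
    # chunks overlap by (size - step) characters.
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

sliding_window("hello world", size=5, step=3)
# -> ['hello', 'lo wo', 'world']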
Continue with Part 2: Base FAQ agent to turn these documents into a searchable tool and a working agent.