Part 1: Building a classic RAG system
Before we add agents, we build a classic RAG pipeline the straightforward way: search a document index, stuff the results into a prompt, and let the LLM answer. Once it works, we look at where it breaks - and those failure modes are exactly what motivates the move to agents.
RAG basics
RAG (Retrieval-Augmented Generation) connects an LLM to your data so it answers questions grounded in your documents instead of relying on what it memorized during training. Three components:
- Search - retrieve documents relevant to the question
- Prompt building - stitch the retrieved documents into a prompt
- LLM - generate an answer from that prompt
In code it looks like this:
def rag(question):
    search_results = search(question)
    user_prompt = build_prompt(question, search_results)
    return llm(RAG_INSTRUCTIONS, user_prompt)
The LLM only sees the documents we hand it, so its answers come from our data.
Loading the documentation
We use gitsource to fetch and parse Markdown files from the Evidently AI docs repo:
from gitsource import GithubRepositoryDataReader
reader = GithubRepositoryDataReader(
    repo_owner="evidentlyai",
    repo_name="docs",
    allowed_extensions={"md", "mdx"},
)
files = reader.read()
parsed_docs = [doc.parse() for doc in files]
Each parsed document has title, description, content, and filename fields.
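A quick way to sanity-check the loading step is to look at one parsed document. The snippet below assumes parse() returns plain dictionaries with those keys; adjust the field access if your version of gitsource returns objects instead.
# Peek at one parsed document (assumes dict-style access).
doc = parsed_docs[0]
print(doc["filename"])
print(doc["title"])
print(doc["content"][:200])  # first 200 characters of the body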
Chunking
The documents are too long to fit into a prompt as-is, so we split them into smaller chunks:
from gitsource import chunk_documents
chunked_docs = chunk_documents(parsed_docs, size=3000, step=1500)
With size=3000 and step=1500, each chunk is 3,000 characters long and the window advances 1,500 characters at a time, so consecutive chunks overlap by 1,500 characters and we don't lose information at chunk boundaries.
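If you want intuition for what the chunker does, a sliding window over the text is the core idea. This is only a sketch of that idea, not gitsource's actual implementation:
# Minimal sliding-window chunking sketch (not the gitsource code):
# take `size`-character windows and move forward by `step` characters,
# so consecutive chunks overlap by size - step characters.
def sliding_window(text, size=3000, step=1500):
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks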
Indexing and searching
We use minsearch, a small in-memory search library, to index the chunks:
from minsearch import Index
index = Index(
    text_fields=["title", "description", "content"],
    keyword_fields=["filename"],
)
index.fit(chunked_docs)
text_fields are tokenized and ranked with TF-IDF. keyword_fields support exact-match filters. Wrap it in a search function:
def search(query):
    return index.search(query=query, num_results=5)
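The keyword fields come in handy when you want to restrict a query to specific documents. A small example, assuming your minsearch version accepts a filter_dict argument (the filename below is made up for illustration):
# Keyword filtering example. filter_dict is the argument name in the
# minsearch versions we've used; the filename value is hypothetical.
results = index.search(
    query="how to create a dashboard",
    filter_dict={"filename": "docs/dashboards/overview.mdx"},
    num_results=5,
)
for r in results:
    print(r["title"])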
Building the prompt and calling the LLM
The LLM does not see the docs unless we pass them in. We build a prompt with instructions and a template:
RAG_INSTRUCTIONS = """
You're a documentation assistant. Answer the QUESTION based on the CONTEXT from our documentation.
Use only facts from the CONTEXT when answering.
If the answer isn't in the CONTEXT, say so.
""".strip()
RAG_PROMPT_TEMPLATE = """
<QUESTION>
{question}
</QUESTION>
<CONTEXT>
{context}
</CONTEXT>
""".strip()
Stitch the search results into the template:
import json
def build_prompt(question, search_results):
    context = json.dumps(search_results, indent=2)
    return RAG_PROMPT_TEMPLATE.format(
        question=question, context=context
    )
The llm function sends the prompt to OpenAI. We create the client once and reuse it:
from openai import OpenAI

openai_client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def llm(instructions, user_prompt, model="gpt-4o-mini"):
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": user_prompt},
    ]
    response = openai_client.responses.create(
        model=model, input=messages
    )
    return response.output_text
Now the full pipeline works:
def rag(question):
    search_results = search(question)
    user_prompt = build_prompt(question, search_results)
    return llm(RAG_INSTRUCTIONS, user_prompt)
Try it:
rag("How do I create a dashboard in Evidently?")
You should get an answer grounded in the knowledge base.
Limits of classic RAG
This works for clean, straightforward questions. But notice the limitations:
- Chunking loses context. If the answer is spread across chunks 1, 3, and 5 and we only retrieve chunks 2 and 4, we are stuck.
- One search per question. If the first query fails, we are done.
- The LLM has no say in retrieval. It only sees what came back.
- There is no way to open a document. A snippet is not always enough.
Try the same question with a typo:
rag("How do I create a dahsbord in Evidently?")
The answer gets noticeably worse or the system says it cannot find anything. The literal token dahsbord does not match anything in the index, and the pipeline has no way to fix that.
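You can see the failure at the retrieval level, before the LLM is even involved:
# Compare retrieval for the clean and misspelled queries. TF-IDF matches
# literal tokens, so the typo retrieves few or no dashboard-related chunks
# (exact results depend on what ends up in your index).
for query in ["create a dashboard", "create a dahsbord"]:
    results = search(query)
    print(query, "->", [r["filename"] for r in results])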
To do better, we need to put the LLM in the driver's seat and let it decide what to query. That is what an agent does. Continue with Part 2: From RAG to an agent.