Part 1: Building a classic RAG system

Before we add agents, we build a classic RAG pipeline the straightforward way: search a document index, stuff the results into a prompt, and let the LLM answer. Once it works, we look at where it breaks, and those failure modes are exactly what motivate the move to agents.

RAG basics

RAG (Retrieval-Augmented Generation) connects an LLM to your data so it answers questions grounded in your documents instead of relying on what it memorized during training. Three components:

  1. Search - retrieve documents relevant to the question
  2. Prompt building - stitch the retrieved documents into a prompt
  3. LLM - generate an answer from that prompt

In code it looks like this:

def rag(question):
    search_results = search(question)
    user_prompt = build_prompt(question, search_results)
    return llm(RAG_INSTRUCTIONS, user_prompt)

The LLM only sees the documents we hand it, so its answers come from our data.

Loading the documentation

We use gitsource to fetch and parse Markdown files from the Evidently AI docs repo:

from gitsource import GithubRepositoryDataReader

reader = GithubRepositoryDataReader(
    repo_owner="evidentlyai",
    repo_name="docs",
    allowed_extensions={"md", "mdx"},
)
files = reader.read()
parsed_docs = [doc.parse() for doc in files]

Each parsed document has title, description, content, and filename fields.
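For orientation, one parsed document is a plain dict with those four fields. The values below are made up for illustration; the real ones come from each file's front matter and Markdown body:

```python
# Illustrative shape of one parsed doc (fabricated values).
parsed_doc = {
    "title": "Dashboard overview",
    "description": "How dashboards work in Evidently",
    "content": "# Dashboard overview\n\nA dashboard shows ...",
    "filename": "docs/dashboards/overview.mdx",
}
```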

Chunking

The documents are too long to fit into a prompt as-is, so we split them into smaller chunks:

from gitsource import chunk_documents

chunked_docs = chunk_documents(parsed_docs, size=3000, step=1500)

chunk_documents(size=3000, step=1500) produces 3000-character chunks with 1500-character overlap so we don't lose information at chunk boundaries.
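As a mental model, the sliding window can be sketched in a few lines. This is a hypothetical helper, not gitsource's actual implementation:

```python
def sliding_window_chunks(text, size=3000, step=1500):
    # Each chunk starts `step` characters after the previous one and is
    # up to `size` characters long, so consecutive chunks share
    # size - step characters at the boundary.
    return [text[i:i + size] for i in range(0, len(text), step)]
```

With size=3000 and step=1500, a 7,000-character document becomes five chunks, and each pair of neighbors overlaps by 1,500 characters.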

Indexing and searching

We use minsearch, a small in-memory search library, to index the chunks:

from minsearch import Index

index = Index(
    text_fields=["title", "description", "content"],
    keyword_fields=["filename"],
)
index.fit(chunked_docs)

text_fields are tokenized and ranked with TF-IDF. keyword_fields support exact-match filters. Wrap it in a search function:

def search(query):
    return index.search(query=query, num_results=5)
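To make the two field types concrete, here is a toy version of what the index does. This is a deliberate simplification, not minsearch's actual scoring; real TF-IDF also downweights common terms:

```python
import re

def toy_search(docs, query, num_results=5, filters=None):
    # Text fields: tokenize and score by term overlap with the query.
    # Keyword fields: drop documents that don't match the filter exactly.
    query_terms = set(re.findall(r"\w+", query.lower()))
    scored = []
    for doc in docs:
        if filters and any(doc.get(k) != v for k, v in filters.items()):
            continue
        text = doc["title"] + " " + doc["content"]
        doc_terms = set(re.findall(r"\w+", text.lower()))
        overlap = len(query_terms & doc_terms)
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]
```

The exact-match behavior of keyword fields is what makes them useful for filtering by filename, while text fields tolerate partial matches.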

Building the prompt and calling the LLM

The LLM does not see the docs unless we pass them in. We build a prompt with instructions and a template:

RAG_INSTRUCTIONS = """
You're a documentation assistant. Answer the QUESTION based on the CONTEXT from our documentation.
Use only facts from the CONTEXT when answering.
If the answer isn't in the CONTEXT, say so.
""".strip()

RAG_PROMPT_TEMPLATE = """
<QUESTION>
{question}
</QUESTION>
<CONTEXT>
{context}
</CONTEXT>
""".strip()

Stitch the search results into the template:

import json

def build_prompt(question, search_results):
    context = json.dumps(search_results, indent=2)
    return RAG_PROMPT_TEMPLATE.format(
        question=question, context=context
    )
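Rendering the template with one fabricated search result (illustrative data, not a real index hit) shows exactly what the LLM receives:

```python
import json

template = "<QUESTION>\n{question}\n</QUESTION>\n<CONTEXT>\n{context}\n</CONTEXT>"

# One made-up search result standing in for the output of search().
fake_results = [{"title": "Dashboards", "content": "Open the Dashboard tab ..."}]

prompt = template.format(
    question="How do I create a dashboard?",
    context=json.dumps(fake_results, indent=2),
)
print(prompt)
```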

The llm function sends the prompt to OpenAI via the Responses API. It needs a client, which the OpenAI SDK configures from the OPENAI_API_KEY environment variable:

from openai import OpenAI

openai_client = OpenAI()

def llm(instructions, user_prompt, model="gpt-4o-mini"):
    messages = [
        {"role": "system", "content": instructions},
        {"role": "user", "content": user_prompt},
    ]
    response = openai_client.responses.create(
        model=model, input=messages
    )
    return response.output_text

Now the full pipeline works:

def rag(question):
    search_results = search(question)
    user_prompt = build_prompt(question, search_results)
    return llm(RAG_INSTRUCTIONS, user_prompt)

Try it:

rag("How do I create a dashboard in Evidently?")

You should get an answer grounded in the knowledge base.

Limits of classic RAG

This works for clean, straightforward questions. But notice the limitations:

  • Chunking loses context. If the answer is spread across chunks 1, 3, and 5 and we only retrieve chunks 2 and 4, we are stuck.
  • One search per question. If the first query fails, we are done.
  • The LLM has no say in retrieval. It only sees what came back.
  • There is no way to open a document. A snippet is not always enough.

Try the same question with a typo:

rag("How do I create a dahsbord in Evidently?")

The answer gets noticeably worse or the system says it cannot find anything. The literal token dahsbord does not match anything in the index, and the pipeline has no way to fix that.
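The failure is easy to verify: TF-IDF matching is purely lexical, so after tokenization the misspelled word shares nothing with the indexed term. A sketch with a simple regex tokenizer, similar in spirit to what the index does:

```python
import re

def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))

doc_terms = tokenize("How to create a dashboard in Evidently")

# The correctly spelled query hits the key term; the typo misses it entirely.
good_overlap = tokenize("create a dashboard") & doc_terms
typo_overlap = tokenize("create a dahsbord") & doc_terms
```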

To do better, we need to put the LLM in the driver's seat and let it decide what to query. That is what an agent does. Continue with Part 2: From RAG to an agent.
