Building a RAG System from Scratch with Python and ChromaDB

Retrieval Augmented Generation (RAG) is quickly becoming the go-to architecture for building AI applications that can answer questions, summarise documents, and engage in conversations using your own data. Instead of relying solely on the knowledge baked into large language models (LLMs), RAG allows you to inject relevant information from external sources, grounding the model’s responses and reducing hallucinations. Let’s dive into how you can build a basic RAG system from scratch using Python and ChromaDB.

Setting Up Your Environment

First things first, we need to install the necessary libraries. We'll use Ollama to generate embeddings, chromadb as the vector store, langchain for text splitting, tiktoken for token counting, and requests and python-dotenv for HTTP calls and configuration. I like to use Poetry for dependency management, but pip will work just fine.

poetry add chromadb langchain tiktoken requests python-dotenv

Create a .env file in your project root directory. This is where we will store the URL of our Ollama server:

OLLAMA_BASE_URL="http://localhost:11434"

Make sure you have Ollama running and have pulled a model, such as llama2.

ollama pull llama2

Now, let’s get started with the Python code. Create a file named rag.py and add the following import statements:

import os
from dotenv import load_dotenv
import requests
import tiktoken
import chromadb
from chromadb.utils import embedding_functions
from langchain.text_splitter import RecursiveCharacterTextSplitter

load_dotenv()

OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL")

if not OLLAMA_BASE_URL:
    raise ValueError("OLLAMA_BASE_URL not found in .env file.")

This sets up our environment, loads the .env file, and checks that our OLLAMA_BASE_URL is configured.

Document Loading and Chunking

The first step in building a RAG pipeline is loading your documents. For this example, let’s fetch some text from a URL.

def load_document(url: str) -> str:
    """Loads a document from a URL."""
    response = requests.get(url)
    response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
    return response.text

Now that we can load a document, we need to split it into smaller chunks. Chunking is a critical step, as it directly impacts the quality of the retrieval process. Smaller chunks might be more relevant but could lack context, while larger chunks might contain too much irrelevant information.

Here’s a simple example using RecursiveCharacterTextSplitter from Langchain:

def chunk_document(document: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Chunks a document into smaller pieces."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False,
    )
    chunks = text_splitter.split_text(document)
    return chunks

This function splits the document into chunks of 500 characters with an overlap of 50 characters. Experiment with different chunk sizes and overlaps to find the optimal settings for your data. I’ve found that 500/50 is a reasonable starting point for many use cases.
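To make the size/overlap tradeoff concrete, here's a simplified fixed-window chunker. This is not what RecursiveCharacterTextSplitter does (that splitter also tries to break on separators like paragraphs and sentences), but it shows how overlap preserves context across chunk boundaries:

```python
def simple_chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-window chunking: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so neighbours share chunk_overlap chars."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghij" * 10  # 100 characters
chunks = simple_chunk(text, chunk_size=30, chunk_overlap=5)

print(len(chunks))                       # 4 windows over 100 chars with step 25
print(chunks[0][-5:] == chunks[1][:5])   # True: neighbours share 5 characters
```

Because each window starts 25 characters after the previous one, a sentence that straddles a boundary still appears whole in at least one chunk, which is exactly why overlap helps retrieval.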

We also need a function to count tokens, as many LLMs have token limits. We can use tiktoken for this:

def count_tokens(text: str) -> int:
    """Counts the number of tokens in a text."""
    # tiktoken ships OpenAI tokenizers, so this is only an approximation
    # for Llama-family models -- but it's close enough for budget checks.
    enc = tiktoken.encoding_for_model("gpt-4")
    return len(enc.encode(text))
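Once you can count tokens, it's easy to enforce a context budget when you later assemble retrieved chunks into a prompt. Here's a small helper of my own (not part of the pipeline above) that greedily keeps chunks, in order, until the budget is spent; it's shown with a whitespace-based counter so it runs without tiktoken, but you would pass in the count_tokens function above:

```python
def fit_to_budget(chunks: list[str], budget: int, count_tokens) -> list[str]:
    """Greedily keep chunks, in order, while the running token total fits."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept

# Stand-in counter; swap in the tiktoken-based count_tokens above.
word_count = lambda s: len(s.split())

chunks = ["one two three", "four five", "six seven eight nine"]
print(fit_to_budget(chunks, budget=5, count_tokens=word_count))
# ['one two three', 'four five']
```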

Embedding with Ollama

Now comes the crucial part: converting our text chunks into vector embeddings. A vector embedding is a numerical representation of text that captures its semantic meaning. We'll use Ollama for this.

First, we define a custom embedding function that sends requests to the Ollama server. We use llama2 here to keep the example to a single model, although a dedicated embedding model (such as nomic-embed-text) generally gives better retrieval quality:

def ollama_embedding_function(texts: list[str]) -> list[list[float]]:
    """Embeds a list of texts using Ollama."""
    embeddings = []
    for text in texts:
        response = requests.post(
            f"{OLLAMA_BASE_URL}/api/embeddings",
            json={"prompt": text, "model": "llama2"},
        )
        response.raise_for_status()
        embeddings.append(response.json()["embedding"])
    return embeddings
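Retrieval works by comparing these vectors: the chunks whose embeddings lie closest to the query's embedding (typically by cosine similarity) are returned first. ChromaDB does this comparison for us, but a minimal, self-contained illustration of what's happening under the hood:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real Ollama embeddings have thousands of
# dimensions, but the comparison works exactly the same way):
query_vec = [0.9, 0.1, 0.0]
chunk_a = [0.8, 0.2, 0.1]   # points in a similar direction -> high similarity
chunk_b = [0.0, 0.1, 0.9]   # points in a different direction -> low similarity

print(cosine_similarity(query_vec, chunk_a) > cosine_similarity(query_vec, chunk_b))
# True
```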

Then, we create a ChromaDB client. Note that in ChromaDB the embedding function is attached to a collection rather than to the client, so the client itself never uses it:

def create_chroma_client(embedding_function=None) -> chromadb.Client:
    """Creates a ChromaDB client.

    embedding_function is unused here (ChromaDB attaches embedding functions
    per collection); it is accepted only so call sites read naturally.
    """
    client = chromadb.Client(
        chromadb.Settings(
            chroma_db_impl="duckdb+parquet",  # legacy (pre-0.4) API; newer versions use chromadb.PersistentClient(path="db")
            persist_directory="db",  # Optional: Persist the database to disk
            anonymized_telemetry=False,
        )
    )
    return client

And a helper function to get or create the collection, wiring in our Ollama embedding function so ChromaDB uses it when adding and querying documents:

def get_chroma_collection(client: chromadb.Client, name: str, embedding_function=ollama_embedding_function):
    """Gets or creates a ChromaDB collection with the given embedding function."""
    collection = client.get_or_create_collection(
        name=name, embedding_function=embedding_function
    )
    return collection

Storing Embeddings in ChromaDB

With our embedding function and ChromaDB client ready, we can now store the embeddings.

def store_embeddings(collection: chromadb.Collection, chunks: list[str]):
    """Stores embeddings in ChromaDB."""
    ids = [str(i) for i in range(len(chunks))]
    collection.add(
        documents=chunks,
        ids=ids,
    )

This function takes a list of text chunks and stores them in ChromaDB. Each chunk is assigned a unique ID.
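One caveat: sequential IDs mean that re-running the pipeline against a different document reuses the same IDs and collides with the old entries. A common alternative (my own suggestion, not something ChromaDB requires) is to derive each ID from the chunk's content, so identical chunks always map to the same record and re-ingesting is idempotent:

```python
import hashlib

def chunk_id(chunk: str) -> str:
    """Deterministic ID: hex SHA-256 digest of the chunk text."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

print(chunk_id("hello") == chunk_id("hello"))  # True: same text, same ID
print(len(chunk_id("hello")))                  # 64 hex characters
```

To use it, replace the ids line in store_embeddings with ids = [chunk_id(c) for c in chunks].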

Retrieval and Querying

Now for the exciting part: retrieving relevant chunks based on a query.

def query_chroma(collection: chromadb.Collection, query: str, n_results: int = 5) -> list[str]:
    """Queries ChromaDB and returns the top n results."""
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
    )
    return results["documents"][0]

This function takes a query string and returns the top n_results most relevant chunks from ChromaDB. The query_texts parameter takes a list of queries, but we’re only passing a single query in this case.

Bringing it All Together: The RAG Pipeline

Let’s tie everything together into a single function:

def rag_pipeline(url: str, query: str) -> list[str]:
    """The complete RAG pipeline."""
    document = load_document(url)
    chunks = chunk_document(document)

    # Basic token limit check
    if any(count_tokens(chunk) > 500 for chunk in chunks):
        print("Warning: Some chunks exceed 500 tokens. Consider smaller chunk sizes.")

    client = create_chroma_client(ollama_embedding_function)
    collection = get_chroma_collection(client, name="my_collection")
    store_embeddings(collection, chunks)
    results = query_chroma(collection, query)
    client.persist() # persist the database to disk
    return results

This function takes a URL and a query as input, and it returns the most relevant chunks from the document at that URL. It performs document loading, chunking, embedding, storage, and retrieval.

Finally, here’s an example of how to use the rag_pipeline function:

if __name__ == "__main__":
    url = "https://raw.githubusercontent.com/darrenjbetney/digital-garden/main/src/content/blog/building-a-rag-system-from-scratch-with-python-and-chromadb/document.md"  # Replace with your document URL
    query = "What is Retrieval Augmented Generation?"
    results = rag_pipeline(url, query)
    print(results)

The example URL above points to a markdown document with the following content. If you want to use your own document, host it somewhere reachable over HTTP, or adapt load_document to read a local file instead:

# What is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation (RAG) is an AI framework that enhances the capabilities of Large Language Models (LLMs) by allowing them to access and incorporate information from external sources during the generation process. This approach combines the power of pre-trained LLMs with the ability to retrieve relevant information from a knowledge base, enabling the model to generate more accurate, context-aware, and informative responses.

## How RAG Works

1.  **Retrieval:** When a user poses a question or provides a prompt, the RAG system first retrieves relevant documents or passages from an external knowledge base (e.g., a vector database, a document store, or the web). This retrieval process is typically based on semantic similarity between the user's query and the content in the knowledge base.

2.  **Augmentation:** The retrieved information is then combined with the original user prompt to create an augmented input. This augmented input provides the LLM with additional context and information that is relevant to the user's query.

3.  **Generation:** Finally, the LLM processes the augmented input and generates a response. Because the LLM has access to the retrieved information, it can generate more accurate, informative, and context-aware responses compared to relying solely on its pre-trained knowledge.

## Benefits of RAG

*   **Improved Accuracy:** By grounding the LLM's responses in external knowledge, RAG can significantly reduce the risk of generating inaccurate or hallucinated content.
*   **Increased Context Awareness:** RAG enables LLMs to generate responses that are more tailored to the specific context of the user's query.
*   **Enhanced Informative Content:** By incorporating information from external sources, RAG can provide users with more comprehensive and informative answers.
*   **Reduced Hallucinations:** RAG's reliance on retrieved information helps to mitigate the problem of LLMs generating fictional or nonsensical content.

Run the script:

python rag.py

You should see the most relevant chunks from the document printed to the console.
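Notice that our pipeline stops at retrieval: it returns raw chunks rather than a generated answer. To complete the "G" in RAG, you can stuff the retrieved chunks into a prompt and send it to Ollama's /api/generate endpoint. Here's a sketch; the prompt template and the ollama_generate helper are my own additions rather than part of rag.py above:

```python
import requests

OLLAMA_BASE_URL = "http://localhost:11434"  # or read from .env as in rag.py

def build_prompt(query: str, chunks: list[str]) -> str:
    """Combine the retrieved chunks and the user's question into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

def ollama_generate(prompt: str, model: str = "llama2") -> str:
    """Call Ollama's non-streaming generate endpoint and return the answer."""
    response = requests.post(
        f"{OLLAMA_BASE_URL}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    response.raise_for_status()
    return response.json()["response"]

# Usage (requires a running Ollama server):
#   answer = ollama_generate(build_prompt(query, query_chroma(collection, query)))
```

With this in place, rag_pipeline's return value becomes the context for a grounded answer instead of the final output.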

Wrapping Up

This is a basic example of a RAG pipeline, but it demonstrates the core principles. From here, you can explore more advanced techniques such as:

  • More Sophisticated Chunking: Experiment with different chunking strategies, such as semantic chunking, to improve retrieval accuracy.
  • Advanced Retrieval Methods: Explore different vector similarity search algorithms and techniques for filtering and ranking retrieved documents.
  • Query Expansion: Refine the user’s query to improve retrieval accuracy.
  • Contextual Compression: Compress the retrieved documents to fit within the LLM’s context window.

By building your own RAG system, you gain a deeper understanding of how these AI applications work and how to tailor them to your specific needs.
