Chunking Strategies for RAG: Size Matters More Than You Think

Retrieval-Augmented Generation (RAG) is only as good as the chunks it retrieves. You can have the fanciest large language model in the world, but if you’re feeding it poorly prepared data, the results will be underwhelming. The unsung hero of effective RAG pipelines is the humble chunker. It dictates how your documents are split into manageable pieces, and the decisions you make here have a massive impact on retrieval quality, context window utilisation, and overall system performance. This post dives into different chunking strategies, their pros and cons, and how to optimise them for your specific use case.
Why Chunking Matters: The Goldilocks Zone
Before we get into the nitty-gritty of different chunking methods, let’s quickly recap why chunking is so important in RAG. The core idea is to divide your source documents into smaller, more manageable segments that can be efficiently indexed and searched. These segments, or “chunks”, become the units of retrieval.
If your chunks are too small, you risk losing context. The LLM might receive isolated snippets of information that are difficult to interpret or lack the necessary surrounding details. On the other hand, if your chunks are too large, you might exceed the context window of your LLM, leading to truncation or reduced performance. Large chunks can also dilute the relevant information with irrelevant details, hindering the retrieval process.
The goal is to find the “Goldilocks zone” – a chunk size that provides enough context for the LLM to understand the information without overwhelming it. This optimal size depends heavily on the nature of your documents, the LLM you’re using, and the type of questions you’re expecting.
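To make the trade-off concrete, here is a minimal sketch that checks whether a set of chunks fits a token budget. It uses a whitespace word count scaled by a rough words-per-token ratio as a stand-in for your LLM’s real tokeniser; the function name and the 0.75 ratio are illustrative assumptions, not fixed rules.

```python
def fits_context_budget(chunks, max_tokens=512, words_per_token=0.75):
    """Roughly estimate whether each chunk fits a token budget.

    Uses a whitespace word count scaled by an approximate words-per-token
    ratio; swap in your model's actual tokeniser for accurate numbers.
    """
    results = []
    for chunk in chunks:
        est_tokens = int(len(chunk.split()) / words_per_token)
        results.append((est_tokens, est_tokens <= max_tokens))
    return results

chunks = ["A short chunk.", "word " * 1000]
print(fits_context_budget(chunks, max_tokens=512))
```

Running this against your own corpus before indexing is a cheap way to catch chunks that will be truncated at generation time.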
Chunking Strategies: A Head-to-Head Comparison
Let’s explore some common chunking strategies and their characteristics:
1. Fixed-Size Chunking
The simplest approach is to divide the document into chunks of a fixed length, measured in characters, words, or tokens.
```python
def fixed_size_chunking(text, chunk_size, chunk_overlap=0):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        # Step forward by chunk_size minus the overlap, so consecutive
        # chunks share some text.
        start += chunk_size - chunk_overlap
    return chunks

text = "This is a longer text to be chunked. We will split it into chunks of fixed size."
chunks = fixed_size_chunking(text, chunk_size=50, chunk_overlap=10)
print(chunks)
```

Pros:
- Easy to implement.
- Fast processing.
Cons:
- Ignores document structure and semantic boundaries. Can lead to chunks that split sentences or paragraphs in awkward places, impacting context.
- May require manual tweaking of chunk size and overlap to achieve acceptable results.
Fixed-size chunking is a good starting point for experimentation, especially with relatively homogeneous data where preserving fine-grained semantic boundaries isn’t critical.
2. Recursive Chunking
Recursive chunking aims to respect the inherent structure of a document by splitting it hierarchically. It starts by attempting to split the document at the highest level of structure (e.g., chapters, sections, paragraphs) and then recursively splits each segment until it reaches a desired chunk size. Langchain’s RecursiveCharacterTextSplitter is a popular implementation of this strategy.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
# Chapter 1: Introduction
This is the first chapter. It introduces the main concepts.
## Section 1.1: Background
Some background information.
This is a new paragraph.
"""

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    # Prioritise splitting by paragraph, then newline, then space.
    separators=["\n\n", "\n", " ", ""],
)
chunks = text_splitter.split_text(text)
print(chunks)
```

Pros:
- Preserves document structure and semantic relationships.
- Generally produces more coherent and contextually relevant chunks.
Cons:
- More complex to implement than fixed-size chunking.
- Performance can be slower due to the recursive nature of the algorithm.
- Requires careful selection of separators to align with the document structure.
This is often a great default choice, offering a good balance between simplicity and quality. Quartalis uses recursive chunking as a foundation for its advanced document processing features, allowing users to easily tailor the separators to suit their specific data formats.
3. Semantic Chunking
Semantic chunking focuses on grouping sentences or paragraphs that are semantically related. The idea is to create chunks that represent complete ideas or concepts. This can be achieved by using sentence embeddings to measure the semantic similarity between sentences and grouping them based on a similarity threshold.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunking(text, chunk_size=3, threshold=0.7):
    model = SentenceTransformer('all-MiniLM-L6-v2')
    # Naive sentence splitting; use a proper sentence tokeniser in production.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    current_embedding = embeddings[0]
    for i in range(1, len(sentences)):
        # Cosine similarity between the running chunk embedding and the next sentence.
        similarity = np.dot(current_embedding, embeddings[i]) / (
            np.linalg.norm(current_embedding) * np.linalg.norm(embeddings[i])
        )
        if similarity > threshold and len(current_chunk) < chunk_size:
            current_chunk.append(sentences[i])
            # Running average keeps the chunk embedding representative of its members.
            current_embedding = (current_embedding + embeddings[i]) / 2
        else:
            chunks.append(". ".join(current_chunk) + ".")
            current_chunk = [sentences[i]]
            current_embedding = embeddings[i]
    chunks.append(". ".join(current_chunk) + ".")
    return chunks

text = "This is the first sentence. The second sentence is similar. The third sentence is unrelated. The fourth sentence is also unrelated."
chunks = semantic_chunking(text)
print(chunks)
```

Pros:
- Creates chunks that are semantically coherent and represent complete ideas.
- Can improve retrieval accuracy by focusing on the underlying meaning of the text.
Cons:
- Computationally expensive, requiring the calculation of sentence embeddings.
- Sensitive to the choice of sentence embedding model and similarity threshold.
- May not be suitable for documents with complex or nuanced semantic structures.
Semantic chunking is best suited for scenarios where semantic accuracy is paramount and computational resources are available. It can be particularly effective for handling complex documents with intricate relationships between concepts.
4. Markdown-Aware Chunking
If your documents are in Markdown format, you can leverage the structure inherent in the Markdown syntax to guide the chunking process. This involves splitting the document based on headings, subheadings, lists, and other Markdown elements.
```python
import re

def markdown_chunking(text):
    # Split on ATX headings, capturing the headings themselves, so that
    # headings and their content alternate in the resulting list.
    chunks = re.split(r"(^#+\s.*$)", text, flags=re.MULTILINE)
    # Combine each heading with the content that follows it.
    formatted_chunks = []
    for i in range(1, len(chunks), 2):
        heading = chunks[i].strip()
        content = chunks[i + 1].strip() if i + 1 < len(chunks) else ""
        formatted_chunks.append(f"{heading}\n{content}")
    return formatted_chunks

markdown_text = """
# Title
Some introductory text.
## Section 1
Content of section 1.
## Section 2
Content of section 2.
"""

chunks = markdown_chunking(markdown_text)
print(chunks)
```

Pros:
- Preserves the logical structure of Markdown documents.
- Creates chunks that are well-organised and easy to understand.
- Can be combined with other chunking strategies for further refinement.
Cons:
- Specific to Markdown format.
- Requires careful handling of different Markdown elements.
- May not be suitable for documents with inconsistent or poorly formatted Markdown.
For projects like Quartalis’s documentation pipeline, which relies heavily on Markdown, this strategy provides a natural and effective way to create high-quality chunks.
Overlap: A Safety Net for Context
Chunk overlap, also known as “sliding window”, is a technique where consecutive chunks share some common text. This helps to maintain context across chunk boundaries and prevent information loss. The amount of overlap can be adjusted depending on the specific requirements of the application.
```python
def fixed_size_chunking(text, chunk_size, chunk_overlap=0):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - chunk_overlap
    return chunks

text = "This is a longer text to be chunked. We will split it into chunks with overlap."
chunks = fixed_size_chunking(text, chunk_size=30, chunk_overlap=10)
print(chunks)
```

In the above example, the chunk_overlap parameter specifies the number of characters to overlap between consecutive chunks. Choosing the right amount of overlap is crucial. Too little overlap, and you risk losing context. Too much, and you introduce redundancy and increase the size of your index. A good starting point is to use an overlap of 10-20% of the chunk size.
Benchmarking and Optimisation: Finding Your Sweet Spot
The optimal chunking strategy and chunk size depend on several factors, including the characteristics of your data, the LLM you’re using, and the specific task you’re trying to accomplish. The best way to determine the optimal configuration is through experimentation and benchmarking.
Here’s a general approach:
- Define Evaluation Metrics: Establish metrics to assess the quality of your RAG system. Common metrics include retrieval accuracy (the percentage of relevant chunks retrieved) and generation quality (measured using metrics like BLEU or ROUGE). You could also devise a human evaluation process.
- Create a Test Dataset: Prepare a representative test dataset that reflects the type of documents and queries you expect to handle in production.
- Experiment with Different Chunking Strategies: Implement different chunking strategies and chunk sizes.
- Measure Performance: Evaluate the performance of each configuration using your chosen metrics.
- Iterate and Refine: Analyse the results and iterate on your chunking strategy and chunk size until you achieve satisfactory performance.
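The steps above can be sketched as a tiny evaluation harness. Everything here is a hedged illustration: `retrieval_accuracy` is a simple hit-rate metric (did any retrieved chunk contain the expected answer string?), and the toy word-overlap retriever merely stands in for your real vector search.

```python
def retrieval_accuracy(retrieve, test_set, k=3):
    """Fraction of queries for which at least one of the top-k retrieved
    chunks contains the expected answer string (a simple hit-rate metric)."""
    hits = 0
    for query, expected in test_set:
        retrieved = retrieve(query, k)
        if any(expected.lower() in chunk.lower() for chunk in retrieved):
            hits += 1
    return hits / len(test_set)

def make_retriever(chunks):
    # Toy retriever: rank chunks by word overlap with the query.
    def retrieve(query, k):
        q_words = set(query.lower().split())
        scored = sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))
        return scored[:k]
    return retrieve

chunks = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
test_set = [("What is the capital of France?", "Paris")]
retrieve = make_retriever(chunks)
print(retrieval_accuracy(retrieve, test_set, k=1))
```

Swap in the chunk lists produced by each strategy and chunk size, hold the test set fixed, and compare the scores to pick a winner.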
Wrapping Up
Choosing the right chunking strategy is crucial for building effective RAG systems. While there’s no one-size-fits-all solution, understanding the trade-offs between different approaches and carefully tuning the chunk size and overlap can significantly improve retrieval accuracy and overall system performance. Don’t be afraid to experiment and iterate to find the sweet spot for your specific use case. The Quartalis platform is designed to make this experimentation process easier, with flexible chunking options and built-in evaluation tools. Good luck chunking!