Chunking Strategies for RAG: Size Matters More Than You Think

Retrieval-Augmented Generation (RAG) is only as good as the chunks it retrieves. You can have the fanciest large language model in the world, but if you’re feeding it poorly prepared data, the results will be underwhelming. The unsung hero of effective RAG pipelines is the humble chunker. It dictates how your documents are split into manageable pieces, and the decisions you make here have a massive impact on retrieval quality, context window utilisation, and overall system performance. This post dives into different chunking strategies, their pros and cons, and how to optimise them for your specific use case.
Why Chunking Matters: The Goldilocks Zone
Before we get into the nitty-gritty of different chunking methods, let’s quickly recap why chunking is so important in RAG. The core idea is to divide your source documents into smaller, more manageable segments that can be efficiently indexed and searched. These segments, or “chunks”, become the units of retrieval.
If your chunks are too small, you risk losing context. The LLM might receive isolated snippets of information that are difficult to interpret or lack the necessary surrounding details. On the other hand, if your chunks are too large, you might exceed the context window of your LLM, leading to truncation or reduced performance. Large chunks can also dilute the relevant information with irrelevant details, hindering the retrieval process.
The goal is to find the “Goldilocks zone” – a chunk size that provides enough context for the LLM to understand the information without overwhelming it. This optimal size depends heavily on the nature of your documents, the LLM you’re using, and the type of questions you’re expecting.
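To make the trade-off concrete, here is a minimal sketch that checks whether a set of chunks fits a token budget. It uses a whitespace word count scaled by a rough words-per-token ratio as a stand-in for your LLM’s real tokeniser; the function name and the 0.75 ratio are illustrative assumptions, not fixed rules.

```python
def fits_context_budget(chunks, max_tokens=512, words_per_token=0.75):
    """Roughly estimate whether each chunk fits a token budget.

    Uses a whitespace word count scaled by an approximate words-per-token
    ratio; swap in your model's actual tokeniser for accurate numbers.
    """
    results = []
    for chunk in chunks:
        est_tokens = int(len(chunk.split()) / words_per_token)
        results.append((est_tokens, est_tokens <= max_tokens))
    return results

chunks = ["A short chunk.", "word " * 1000]
print(fits_context_budget(chunks, max_tokens=512))
```

Running this against your own corpus before indexing is a cheap way to catch chunks that will be truncated at generation time.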
Chunking Strategies: A Head-to-Head Comparison
Let’s explore some common chunking strategies and their characteristics:
1. Fixed-Size Chunking
The simplest approach is to divide the document into chunks of a fixed length, measured in characters, words, or tokens.
```python
def fixed_size_chunking(text, chunk_size, chunk_overlap=0):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        # Step forward by chunk_size minus the overlap, so consecutive
        # chunks share some text.
        start += chunk_size - chunk_overlap
    return chunks

text = "This is a longer text to be chunked. We will split it into chunks of fixed size."
chunks = fixed_size_chunking(text, chunk_size=50, chunk_overlap=10)
print(chunks)
```

Pros:
- Easy to implement.
- Fast processing.
Cons:
- Ignores document structure and semantic boundaries. Can lead to chunks that split sentences or paragraphs in awkward places, impacting context.
- May require manual tweaking of chunk size and overlap to achieve acceptable results.
Fixed-size chunking is a good starting point for experimentation, especially with relatively homogeneous data where preserving fine-grained semantic boundaries isn’t critical.
2. Recursive Chunking
Recursive chunking aims to respect the inherent structure of a document by splitting it hierarchically. It starts by attempting to split the document at the highest level of structure (e.g., chapters, sections, paragraphs) and then recursively splits each segment until it reaches a desired chunk size. Langchain’s RecursiveCharacterTextSplitter is a popular implementation of this strategy.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = """
# Chapter 1: Introduction
This is the first chapter. It introduces the main concepts.
## Section 1.1: Background
Some background information.
This is a new paragraph.
"""

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    # Prioritise splitting by paragraph, then newline, then space.
    separators=["\n\n", "\n", " ", ""],
)
chunks = text_splitter.split_text(text)
print(chunks)
```

Pros:
- Preserves document structure and semantic relationships.
- Generally produces more coherent and contextually relevant chunks.
Cons:
- More complex to implement than fixed-size chunking.
- Performance can be slower due to the recursive nature of the algorithm.
- Requires careful selection of separators to align with the document structure.
This is often a great default choice, offering a good balance between simplicity and quality. Quartalis uses recursive chunking as a foundation for its advanced document processing features, allowing users to easily tailor the separators to suit their specific data formats.
3. Semantic Chunking
Semantic chunking focuses on grouping sentences or paragraphs that are semantically related. The idea is to create chunks that represent complete ideas or concepts. This can be achieved by using sentence embeddings to measure the semantic similarity between sentences and grouping them based on a similarity threshold.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunking(text, chunk_size=3, threshold=0.7):
    model = SentenceTransformer('all-MiniLM-L6-v2')
    # Naive sentence splitting; use a proper sentence tokeniser in production.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    embeddings = model.encode(sentences)
    chunks = []
    current_chunk = [sentences[0]]
    current_embedding = embeddings[0]
    for i in range(1, len(sentences)):
        # Cosine similarity between the running chunk embedding and the next sentence.
        similarity = np.dot(current_embedding, embeddings[i]) / (
            np.linalg.norm(current_embedding) * np.linalg.norm(embeddings[i])
        )
        if similarity > threshold and len(current_chunk) < chunk_size:
            current_chunk.append(sentences[i])
            # Running average keeps the chunk embedding representative of its members.
            current_embedding = (current_embedding + embeddings[i]) / 2
        else:
            chunks.append(". ".join(current_chunk) + ".")
            current_chunk = [sentences[i]]
            current_embedding = embeddings[i]
    chunks.append(". ".join(current_chunk) + ".")
    return chunks

text = "This is the first sentence. The second sentence is similar. The third sentence is unrelated. The fourth sentence is also unrelated."
chunks = semantic_chunking(text)
print(chunks)
```

Pros:
- Creates chunks that are semantically coherent and represent complete ideas.
- Can improve retrieval accuracy by focusing on the underlying meaning of the text.
Cons:
- Computationally expensive, requiring the calculation of sentence embeddings.
- Sensitive to the choice of sentence embedding model and similarity threshold.
- May not be suitable for documents with complex or nuanced semantic structures.
Semantic chunking is best suited for scenarios where semantic accuracy is paramount and computational resources are available. It can be particularly effective for handling complex documents with intricate relationships between concepts.
4. Markdown-Aware Chunking
If your documents are in Markdown format, you can leverage the structure inherent in the Markdown syntax to guide the chunking process. This involves splitting the document based on headings, subheadings, lists, and other Markdown elements.
```python
import re

def markdown_chunking(text):
    # Split on ATX headings, capturing the headings themselves, so that
    # headings and their content alternate in the resulting list.
    chunks = re.split(r"(^#+\s.*$)", text, flags=re.MULTILINE)
    # Combine each heading with the content that follows it.
    formatted_chunks = []
    for i in range(1, len(chunks), 2):
        heading = chunks[i].strip()
        content = chunks[i + 1].strip() if i + 1 < len(chunks) else ""
        formatted_chunks.append(f"{heading}\n{content}")
    return formatted_chunks

markdown_text = """
# Title
Some introductory text.
## Section 1
Content of section 1.
## Section 2
Content of section 2.
"""

chunks = markdown_chunking(markdown_text)
print(chunks)
```

Pros:
- Preserves the logical structure of Markdown documents.
- Creates chunks that are well-organised and easy to understand.
- Can be combined with other chunking strategies for further refinement.
Cons:
- Specific to Markdown format.
- Requires careful handling of different Markdown elements.
- May not be suitable for documents with inconsistent or poorly formatted Markdown.
For projects like Quartalis’s documentation pipeline, which relies heavily on Markdown, this strategy provides a natural and effective way to create high-quality chunks.
Overlap: A Safety Net for Context
Chunk overlap, also known as “sliding window”, is a technique where consecutive chunks share some common text. This helps to maintain context across chunk boundaries and prevent information loss. The amount of overlap can be adjusted depending on the specific requirements of the application.
```python
def fixed_size_chunking(text, chunk_size, chunk_overlap=0):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - chunk_overlap
    return chunks

text = "This is a longer text to be chunked. We will split it into chunks with overlap."
chunks = fixed_size_chunking(text, chunk_size=30, chunk_overlap=10)
print(chunks)
```

In the above example, the chunk_overlap parameter specifies the number of characters to overlap between consecutive chunks. Choosing the right amount of overlap is crucial. Too little overlap, and you risk losing context. Too much, and you introduce redundancy and increase the size of your index. A good starting point is to use an overlap of 10-20% of the chunk size.
Benchmarking and Optimisation: Finding Your Sweet Spot
The optimal chunking strategy and chunk size depend on several factors, including the characteristics of your data, the LLM you’re using, and the specific task you’re trying to accomplish. The best way to determine the optimal configuration is through experimentation and benchmarking.
Here’s a general approach:
- Define Evaluation Metrics: Establish metrics to assess the quality of your RAG system. Common metrics include retrieval accuracy (the percentage of relevant chunks retrieved) and generation quality (measured using metrics like BLEU or ROUGE). You could also devise a human evaluation process.
- Create a Test Dataset: Prepare a representative test dataset that reflects the type of documents and queries you expect to handle in production.
- Experiment with Different Chunking Strategies: Implement different chunking strategies and chunk sizes.
- Measure Performance: Evaluate the performance of each configuration using your chosen metrics.
- Iterate and Refine: Analyse the results and iterate on your chunking strategy and chunk size until you achieve satisfactory performance.
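The steps above can be sketched as a tiny evaluation harness. Everything here is a hedged illustration: `retrieval_accuracy` is a simple hit-rate metric (did any retrieved chunk contain the expected answer string?), and the toy word-overlap retriever merely stands in for your real vector search.

```python
def retrieval_accuracy(retrieve, test_set, k=3):
    """Fraction of queries for which at least one of the top-k retrieved
    chunks contains the expected answer string (a simple hit-rate metric)."""
    hits = 0
    for query, expected in test_set:
        retrieved = retrieve(query, k)
        if any(expected.lower() in chunk.lower() for chunk in retrieved):
            hits += 1
    return hits / len(test_set)

def make_retriever(chunks):
    # Toy retriever: rank chunks by word overlap with the query.
    def retrieve(query, k):
        q_words = set(query.lower().split())
        scored = sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))
        return scored[:k]
    return retrieve

chunks = ["Paris is the capital of France.", "Berlin is the capital of Germany."]
test_set = [("What is the capital of France?", "Paris")]
retrieve = make_retriever(chunks)
print(retrieval_accuracy(retrieve, test_set, k=1))
```

Swap in the chunk lists produced by each strategy and chunk size, hold the test set fixed, and compare the scores to pick a winner.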
Wrapping Up
Choosing the right chunking strategy is crucial for building effective RAG systems. While there’s no one-size-fits-all solution, understanding the trade-offs between different approaches and carefully tuning the chunk size and overlap can significantly improve retrieval accuracy and overall system performance. Don’t be afraid to experiment and iterate to find the sweet spot for your specific use case. The Quartalis platform is designed to make this experimentation process easier, with flexible chunking options and built-in evaluation tools. Good luck chunking!