Making RAG Conversation-Aware: Context That Remembers

Building a Retrieval-Augmented Generation (RAG) system that can answer questions based on a knowledge base is cool, but what if you want it to remember what you’ve already talked about? Creating a conversation-aware RAG is the next level, letting you build chatbots that feel more natural and understand the context of your ongoing conversation. This post will walk you through the key techniques for giving your RAG system a memory, complete with practical code examples.
The Challenge: RAG Without Context is Forgetful
Standard RAG systems treat each query in isolation. They take your question, retrieve relevant documents, and generate an answer. This works well for standalone queries, but falls apart when you start a conversation. Imagine asking:
“What’s the return policy?”
followed by:
“Is that different for sale items?”
A naive RAG system will treat the second question as if you’d never asked the first, failing to connect “that” to the return policy you just discussed. This leads to frustrating, repetitive interactions. We need to equip our RAG system with the ability to understand and use the context of the conversation.
Technique 1: Query Rewriting – Making it Explicit
One of the simplest and most effective techniques is query rewriting. The idea is to rewrite each incoming query to include the context from previous turns. This transforms implicit references into explicit questions that the RAG system can understand.
Let’s say we have a chat history like this:
```python
chat_history = [
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What about Germany?"}
]
```
The final query, “What about Germany?”, is ambiguous without context. We can use a language model to rewrite it:
```python
from transformers import pipeline

def rewrite_query(chat_history):
    """Rewrites the last query in the chat history to include context."""
    if len(chat_history) < 2:
        return chat_history[-1]["content"]  # Nothing to rewrite

    context = "\n".join(f"{msg['role']}: {msg['content']}" for msg in chat_history[:-1])
    current_query = chat_history[-1]["content"]
    prompt = f"""Given the following conversation:
{context}

Rewrite the last user query to be a standalone question:
{current_query}
"""
    pipe = pipeline("text2text-generation", model="google/flan-t5-base")  # A decent and accessible model
    rewritten_query = pipe(prompt, max_length=128)[0]["generated_text"]
    return rewritten_query

rewritten_query = rewrite_query(chat_history)
print(f"Original query: {chat_history[-1]['content']}")
print(f"Rewritten query: {rewritten_query}")
```
This code snippet uses a text2text-generation pipeline (in this case, google/flan-t5-base) to rewrite the query. The prompt provides the chat history as context, instructing the model to create a standalone question. The output would be something like:
```
Original query: What about Germany?
Rewritten query: What is the capital of Germany?
```
Now, the RAG system can process “What is the capital of Germany?” without needing to remember the previous question about France.
Technique 2: Conversation History Injection – Feeding the RAG System Memory
Another approach is to directly inject the conversation history into the RAG pipeline. Instead of rewriting the query, we augment the query with relevant snippets from the previous exchanges. This can be done in a few ways:
- Concatenate History: Simply append the entire conversation history to the query. This is the simplest approach, but it can quickly become unwieldy and exceed the context window of your language model.
- Selectively Inject: Use another language model (or a set of rules) to select the most relevant parts of the conversation history to include with the query. This requires more sophistication but scales better.
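Before moving on to selective injection, the naive concatenation approach above can be sketched in a few lines. This is my own minimal version: the `build_concatenated_query` helper and the character budget are illustrative, not from any particular library. It trims the oldest turns first when the budget is exceeded, on the assumption that recent turns matter most.

```python
def build_concatenated_query(chat_history, query, max_chars=2000):
    """Naively prepend the whole conversation to the query,
    dropping the oldest turns first if the budget is exceeded."""
    lines = [f"{msg['role']}: {msg['content']}" for msg in chat_history]
    history = "\n".join(lines)
    # Drop the oldest lines until everything fits in the budget
    while lines and len(history) + len(query) > max_chars:
        lines.pop(0)
        history = "\n".join(lines)
    return f"Context: {history}\nQuestion: {query}"
```

A character budget is a crude proxy for the model's token limit; in practice you would count tokens with your model's tokenizer instead.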
Here’s an example of selective injection:
```python
from transformers import pipeline

def select_relevant_history(chat_history, query, max_history_length=512):
    """Selects relevant portions of the chat history to include in the context."""
    # Create a simplified representation of the chat history for the scoring model
    history_strings = [f"{msg['role']}: {msg['content']}" for msg in chat_history]

    # Use a model to score the relevance of each historical message to the current query.
    # I'm using a zero-shot classification pipeline as an example.
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    candidate_labels = ["relevant", "irrelevant"]
    relevance_scores = []
    for h in history_strings:
        res = classifier(
            h,
            candidate_labels,
            hypothesis_template=f"This text is {{}} to the query: {query}",
        )
        # The pipeline sorts labels by score, so look up "relevant" explicitly
        relevance_scores.append(res["scores"][res["labels"].index("relevant")])

    # Keep only the messages with a relevance score above a threshold
    threshold = 0.7
    relevant_history = [h for i, h in enumerate(history_strings) if relevance_scores[i] > threshold]

    # Combine the relevant history into a single string, truncated if necessary
    context = "\n".join(relevant_history)
    if len(context) > max_history_length:
        context = context[:max_history_length] + "..."  # Truncate if too long
    return context

def rag_with_history(query, chat_history, knowledge_base):
    """Performs RAG with injected conversation history."""
    relevant_history = select_relevant_history(chat_history, query)
    augmented_query = f"Context: {relevant_history}\nQuestion: {query}"
    # RAG pipeline code goes here - simplified example:
    relevant_documents = knowledge_base.search(augmented_query)
    answer = generate_answer(augmented_query, relevant_documents)
    return answer
```
In this example, the select_relevant_history function uses a zero-shot classification model (facebook/bart-large-mnli) to determine the relevance of each message in the history. The current query is embedded in the hypothesis template so each message is judged against it, and only the messages deemed “relevant” are included in the context that’s prepended to the query. This relevance selection prevents overwhelming the RAG system with too much irrelevant information.
Technique 3: Topic Tracking – Remembering the Big Picture
Sometimes, the conversation’s context isn’t just about the immediate previous turn; it’s about the overall topic being discussed. Topic tracking involves identifying and maintaining a representation of the main topics of the conversation. This can be achieved using techniques like:
- Keyword Extraction: Extracting key phrases from each turn and combining them to form a topic summary.
- Topic Modelling: Using unsupervised learning techniques like LDA to identify underlying topics in the conversation history.
- Named Entity Recognition (NER): Identifying key entities in the text (people, places, organizations) which can help anchor the conversation.
Let’s look at a simple example using keyword extraction:
```python
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

def extract_keywords(text, top_n=5):
    """Extracts top keywords from a text, removing stop words."""
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(text)
    words = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
    word_counts = Counter(words)
    return [word for word, count in word_counts.most_common(top_n)]

def update_topic(current_topic, new_text, decay_factor=0.8):
    """Updates the current topic with new keywords, applying a decay factor."""
    new_keywords = extract_keywords(new_text)
    updated_topic = {}
    for keyword in current_topic:
        updated_topic[keyword] = current_topic[keyword] * decay_factor  # Decay old keywords
    for keyword in new_keywords:
        updated_topic[keyword] = updated_topic.get(keyword, 0) + 1  # Add new keywords, increment existing ones
    return dict(sorted(updated_topic.items(), key=lambda item: item[1], reverse=True))  # Sort by score
```
This code maintains a current_topic dictionary, which stores keywords and their associated scores. Each time new text is processed, the update_topic function adds new keywords and updates the scores of existing keywords, applying a decay factor to gradually fade out older topics. You can then inject these keywords into the RAG query for context.
For example, you might have a starting topic:
```python
current_topic = {"france": 1.0, "capital": 1.0}
```
Then update it with new information:
```python
new_text = "Germany's capital is Berlin."
current_topic = update_topic(current_topic, new_text)
print(current_topic)
```
Which would yield something like:
```
{'capital': 1.8, 'germany': 1, 'berlin': 1, 'france': 0.8}
```
As the conversation moves on, the topic will gradually shift away from France and towards Germany and Berlin. This allows the system to “remember” the overall subject of the conversation.
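Injecting the tracked keywords into the RAG query can be as simple as prepending the top-scoring ones. A minimal sketch, assuming a plain prepend strategy (the `inject_topic` helper and its `top_n` cutoff are my own names, not from the code above):

```python
def inject_topic(query, current_topic, top_n=3):
    """Prepend the highest-scoring topic keywords to a query for retrieval."""
    top_keywords = sorted(current_topic, key=current_topic.get, reverse=True)[:top_n]
    if not top_keywords:
        return query
    return f"Topic: {', '.join(top_keywords)}. {query}"
```

For example, with the topic state above, a vague follow-up like “What’s the population?” becomes “Topic: capital, germany, berlin. What’s the population?”, which gives the retriever something concrete to match against.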
Implementation Considerations with Quartalis
The techniques described above can be implemented within the Quartalis ecosystem. For example, the select_relevant_history function could be easily integrated as a custom component within a Quartalis RAG pipeline. You could create a custom “Context Enricher” node that takes the query and chat history as input and outputs an augmented query with injected context. The zero-shot classification model could be deployed and managed using the Quartalis model registry.
Similarly, the topic tracking logic could be implemented as a stateful component within the Quartalis engine. The current_topic dictionary could be stored as part of the pipeline’s state, allowing it to persist across multiple turns of the conversation. The Quartalis event system could be used to trigger the update_topic function whenever a new message is received.
Wrapping Up
Building conversation-aware RAG systems requires more than just a basic question-answering pipeline. By implementing techniques like query rewriting, conversation history injection, and topic tracking, you can create chatbots that understand the context of the conversation and provide more relevant and coherent responses. These methods, while presented individually, can also be combined for optimal performance. Experiment with the various techniques to determine what works best for your specific use case and data. By leveraging the tools and infrastructure offered by platforms like Quartalis, you can streamline the development and deployment of these sophisticated AI systems.