Deduplicating Vector Databases: Keeping Your Knowledge Clean

Maintaining data quality in vector databases is crucial for accurate and reliable AI applications. Duplicate entries not only waste storage space but can also skew search results and negatively impact the performance of retrieval-augmented generation (RAG) pipelines. In a recent project involving a ChromaDB instance with over 250,000 text chunks, we encountered a significant duplication issue. This post details the methods we employed to identify and remove these duplicates, ensuring a clean and efficient knowledge base.

The Problem: Duplicate Data in Vector Databases

Vector databases store data as high-dimensional vectors, enabling efficient similarity searches. These vectors are typically generated by embedding models from text, images, or other types of data. However, various factors can lead to duplicate entries in these databases. Common causes include:

  • Data ingestion errors: Bugs in the ingestion pipeline may accidentally re-insert the same data multiple times.
  • Redundant data sources: Combining data from multiple sources can introduce duplicates if those sources overlap.
  • Inconsistent data processing: Variations in preprocessing steps (e.g., tokenization, cleaning) can lead to semantically similar but technically distinct embeddings.
  • Accidental re-runs: Re-running an ETL job without idempotency safeguards (e.g., upserts keyed on a content hash) will naturally duplicate data.

In our case, the problem stemmed from a combination of redundant data sources and an initial lack of robust deduplication measures during the database setup. The result was a significant number of duplicate chunks, affecting the accuracy and efficiency of our RAG pipeline. The Quartalis ecosystem has some handy data quality checks that can identify some of these problems as part of the data ingestion pipeline — but in this case the problem crept in earlier!

Deduplication Techniques

We employed a multi-faceted approach to tackle the deduplication challenge, combining exact matching with semantic similarity analysis.

1. Exact Hash Deduplication

The simplest and most efficient method is to identify exact duplicates based on a hash of the content. This involves calculating a hash (e.g., SHA-256) for each text chunk and comparing the hashes. If two chunks have the same hash, they are considered identical. This technique is very fast and accurate but only catches exact matches.

Here’s a Python code snippet demonstrating exact hash deduplication:

import hashlib

def calculate_sha256(text):
    """Calculates the SHA-256 hash of a given text."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

def deduplicate_exact(data):
    """Deduplicates a list of text chunks based on exact hash matching."""
    seen_hashes = set()
    unique_data = []
    duplicates = []

    for item in data:
        text = item['text'] # Assuming your data is a list of dictionaries with a 'text' key
        hash_value = calculate_sha256(text)
        if hash_value not in seen_hashes:
            seen_hashes.add(hash_value)
            unique_data.append(item)
        else:
            duplicates.append(item)

    return unique_data, duplicates

# Example usage (assuming 'data' is a list of dictionaries with 'text' and 'embedding' keys)
unique_data, duplicates = deduplicate_exact(data)

print(f"Original size: {len(data)}")
print(f"Unique size after exact deduplication: {len(unique_data)}")
print(f"Number of exact duplicates: {len(duplicates)}")

This function iterates through the data, calculates the SHA-256 hash of each text chunk, and keeps track of the hashes encountered. If a hash is already in the seen_hashes set, the corresponding chunk is identified as a duplicate.

2. Semantic Similarity Deduplication

Exact matching only catches identical chunks. To identify near-duplicates or semantically similar chunks, we need to employ semantic similarity analysis. This involves comparing the embeddings of the text chunks.

Here’s the approach:

  1. Calculate Embeddings: If you don’t already have embeddings, calculate them for each text chunk using a suitable embedding model (e.g., Sentence Transformers, OpenAI embeddings).
  2. Calculate Similarity Matrix: Calculate the cosine similarity between all pairs of embeddings. This results in a similarity matrix where each entry (i, j) represents the cosine similarity between the i-th and j-th embeddings. Note that this is O(n²) in both time and memory, so for very large collections you should process in batches or use an approximate nearest-neighbor index.
  3. Identify Near-Duplicates: Set a similarity threshold (e.g., 0.95). If the cosine similarity between two embeddings exceeds this threshold, they are considered near-duplicates.
  4. Merge/Remove Duplicates: For each set of near-duplicates, choose one representative chunk to keep and remove the others. You might choose the longest chunk, the one with the most citations, or apply some other criterion.

Here’s a Python code snippet demonstrating semantic similarity deduplication using cosine similarity:

from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate_semantic(data, similarity_threshold=0.95):
    """Deduplicates a list of text chunks based on semantic similarity."""

    # 1. Calculate Embeddings
    model = SentenceTransformer('all-mpnet-base-v2') # Or any other suitable model
    sentences = [item['text'] for item in data]
    embeddings = model.encode(sentences)

    # 2. Calculate Similarity Matrix
    similarity_matrix = cosine_similarity(embeddings)

    # 3. Identify Near-Duplicates
    duplicates = set()
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            if similarity_matrix[i, j] > similarity_threshold:
                duplicates.add(j) # Add the higher index as the duplicate, keeping the first one

    # 4. Remove Duplicates
    unique_data = [item for i, item in enumerate(data) if i not in duplicates]

    print(f"Original size: {len(data)}")
    print(f"Unique size after semantic deduplication: {len(unique_data)}")
    print(f"Number of semantic duplicates: {len(duplicates)}")

    return unique_data

# Example usage (assuming 'data' is a list of dictionaries with a 'text' key)
unique_data = deduplicate_semantic(data)

This function calculates embeddings using the Sentence Transformers library, computes the cosine similarity matrix, and identifies near-duplicates based on the specified threshold. It then returns a list of unique data points.
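Step 4 of the approach mentioned keeping the longest chunk rather than simply the first one seen, which is what the snippet above does. As a rough sketch of that alternative (the function name and the greedy pairwise strategy here are our own illustration, not part of the pipeline above), the selection could look like:

```python
def keep_longest(data, similarity_matrix, threshold=0.95):
    """Greedy sketch: among near-duplicate pairs, keep the longer chunk.

    `similarity_matrix` is any 2-D indexable of pairwise cosine similarities.
    """
    removed = set()
    n = len(data)
    for i in range(n):
        if i in removed:
            continue
        for j in range(i + 1, n):
            if j in removed:
                continue
            if similarity_matrix[i][j] > threshold:
                if len(data[j]['text']) > len(data[i]['text']):
                    removed.add(i)   # j is longer: drop i and stop comparing it
                    break
                removed.add(j)       # i is at least as long: drop j
    return [item for k, item in enumerate(data) if k not in removed]

# Tiny hand-built example: chunks 0 and 1 are near-duplicates, and 1 is longer.
data = [{'text': 'short'}, {'text': 'a much longer chunk'}, {'text': 'unrelated'}]
sim = [[1.0, 0.97, 0.1],
       [0.97, 1.0, 0.1],
       [0.1, 0.1, 1.0]]
kept = keep_longest(data, sim)
print([item['text'] for item in kept])  # the longer duplicate and the unrelated chunk survive
```

The greedy pairwise pass is simple but order-dependent; for large duplicate clusters you may want to group near-duplicates first (e.g., with a union-find structure) and then pick one representative per group.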

Choosing the Right Threshold: The similarity threshold is a critical parameter. A higher threshold will result in fewer duplicates being removed, while a lower threshold may lead to the removal of chunks that are semantically distinct. The optimal threshold depends on the specific data and the desired level of granularity.
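The effect of the threshold is easy to see on toy data. The embeddings below are made up for illustration (they are not model outputs), but the pattern holds in general: as the threshold rises, fewer pairs clear it and fewer chunks are removed.

```python
import numpy as np

def count_near_duplicate_pairs(embeddings, threshold):
    """Count pairs (i, j), i < j, whose cosine similarity exceeds the threshold."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    n = len(embeddings)
    return sum(1 for i in range(n) for j in range(i + 1, n) if sim[i, j] > threshold)

# Toy vectors: rows (0, 1) are very similar, rows (2, 3) only moderately similar.
embeddings = np.array([
    [1.00, 0.00, 0.0],
    [0.97, 0.24, 0.0],
    [0.00, 1.00, 0.0],
    [0.37, 0.93, 0.0],
])

counts = {t: count_near_duplicate_pairs(embeddings, t) for t in (0.90, 0.95, 0.99)}
print(counts)  # higher thresholds admit fewer pairs
```

In practice, a useful tuning exercise is to run such a sweep on a labeled sample of your own chunks and inspect the borderline pairs by hand.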

3. Content Fingerprinting

Content fingerprinting involves extracting a set of representative features from each chunk and comparing these features to identify near-duplicates. This technique is less computationally expensive than calculating full embeddings but can still effectively identify semantically similar chunks.

One common approach is to use MinHash and Locality Sensitive Hashing (LSH). MinHash builds a signature for each chunk by taking, for each of several hash permutations, the minimum hash value over the chunk’s tokens. LSH then groups chunks with similar MinHash signatures into buckets, allowing for efficient near-duplicate detection.

Here’s a simplified conceptual example using Python:

from datasketch import MinHash, MinHashLSH

def fingerprint_deduplication(data, threshold=0.7):
    """Deduplicates a list of text chunks based on content fingerprinting using MinHash and LSH."""

    lsh = MinHashLSH(threshold=threshold, num_perm=128)  # Adjust num_perm for desired accuracy/performance

    # Create MinHash signatures for each text chunk
    minhashes = {}
    for i, item in enumerate(data):
        text = item['text']
        tokens = text.lower().split() # Simple tokenization, you might want something more sophisticated
        m = MinHash(num_perm=128)
        for token in tokens:
            m.update(token.encode('utf8'))
        minhashes[i] = m
        lsh.insert(i, m)

    # Query LSH for near-duplicates
    duplicates = set()
    for i, m in minhashes.items():
        for j in lsh.query(m):
            if j > i: # j > i also rules out self-comparison; keep the lower index, drop the higher
                duplicates.add(j)

    # Remove Duplicates
    unique_data = [item for i, item in enumerate(data) if i not in duplicates]

    print(f"Original size: {len(data)}")
    print(f"Unique size after fingerprint deduplication: {len(unique_data)}")
    print(f"Number of fingerprint duplicates: {len(duplicates)}")

    return unique_data

# Example usage (assuming 'data' is a list of dictionaries with a 'text' key)
unique_data = fingerprint_deduplication(data)

This example uses the datasketch library to implement MinHash and LSH. It tokenizes the text, creates MinHash signatures, inserts them into an LSH index, and then queries the index to identify near-duplicates. The threshold parameter controls the similarity threshold for LSH.

Choosing the Right Parameters: The num_perm parameter in MinHash controls the number of permutations used to generate the signature. A higher value increases accuracy but also increases computation time. The threshold parameter in LSH controls the similarity threshold for near-duplicate detection. These parameters should be tuned based on the specific data and desired performance.
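To see why num_perm matters, it helps to remember what a MinHash signature estimates: the fraction of positions where two signatures agree approximates the Jaccard similarity of the token sets, and more permutations tighten that estimate. The from-scratch sketch below is independent of datasketch and uses salted SHA-1 hashes as a stand-in for true permutations, which is a simplification:

```python
import hashlib

def salted_hash(salt, token):
    """Deterministic 64-bit hash; each salt simulates one hash permutation."""
    digest = hashlib.sha1(f"{salt}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(tokens, num_perm=256):
    """One minimum hash value per simulated permutation."""
    return [min(salted_hash(i, t) for t in tokens) for i in range(num_perm)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox leaps over a lazy dog".split())

true_jaccard = len(a & b) / len(a | b)  # 7 shared tokens / 10 total = 0.7
estimate = estimated_jaccard(minhash_signature(a), minhash_signature(b))
print(true_jaccard, round(estimate, 3))
```

The standard error of the estimate shrinks roughly as 1/sqrt(num_perm), which is the accuracy/performance trade-off the num_perm parameter controls.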

Automation and Monitoring

Deduplication should not be a one-time task. It should be integrated into the data ingestion pipeline and monitored regularly.

  • Automated Deduplication Pipeline: Implement an automated pipeline that performs deduplication as part of the data ingestion process. This pipeline should include both exact hash deduplication and semantic similarity analysis.
  • Regular Monitoring: Monitor the number of duplicates in the vector database over time. Set up alerts to notify you if the number of duplicates exceeds a certain threshold.
  • Data Quality Checks: Integrate data quality checks into the ingestion pipeline. These checks can identify potential sources of duplicates and prevent them from being introduced into the database. Quartalis provides data quality checking tools which can be easily integrated into any pipeline.
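A minimal monitoring check along these lines might look like the sketch below. The function names and the 2% alert threshold are illustrative choices of ours, not a Quartalis API:

```python
import hashlib

def duplicate_rate(chunks):
    """Fraction of chunks whose exact content has already appeared."""
    hashes = [hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks]
    return 1 - len(set(hashes)) / len(hashes)

def check_duplicate_rate(chunks, alert_threshold=0.02):
    """Return the duplicate rate; flag it when it crosses the alert threshold."""
    rate = duplicate_rate(chunks)
    if rate > alert_threshold:
        print(f"ALERT: duplicate rate {rate:.1%} exceeds {alert_threshold:.1%}")
    return rate

# One duplicate out of four chunks: a 25% rate, well above the 2% alert line.
rate = check_duplicate_rate(["alpha", "beta", "alpha", "gamma"])
```

In a real pipeline this check would run on each ingestion batch, with the rate exported to your metrics system rather than printed.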

Results

By combining exact hash deduplication, semantic similarity analysis, and content fingerprinting, we were able to significantly reduce the number of duplicates in our ChromaDB instance. Out of 257,000 initial chunks, we identified and removed over 4,000 duplicates, resulting in a cleaner, more efficient, and more accurate knowledge base. The precise overlap between the different deduplication methods depends heavily on the source data and the threshold values used. We found that running the cheap exact-match pass first shrank the dataset before the more expensive semantic and fingerprint passes ran, which improved overall performance.
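That ordering can be expressed as a small staged pipeline: the cheap O(n) exact pass runs first so the quadratic near-duplicate pass sees fewer chunks. The sketch below is our own framing, with a pluggable second stage; a no-op stands in for the semantic or MinHash pass:

```python
import hashlib

def dedup_pipeline(chunks, near_dup_stage):
    """Stage 1: exact hash dedup (cheap). Stage 2: near-duplicate pass on the survivors."""
    seen, exact_unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            exact_unique.append(chunk)
    return near_dup_stage(exact_unique)

# A no-op second stage stands in for the semantic or fingerprint pass.
result = dedup_pipeline(["a", "a", "b", "a"], near_dup_stage=lambda xs: xs)
print(result)
```

Because each stage only shrinks the data, the stages compose freely: exact, then fingerprint, then semantic is a reasonable cheap-to-expensive ordering.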

Wrapping Up

Deduplicating vector databases is an essential step in maintaining data quality and ensuring the accuracy and efficiency of AI applications. By combining exact matching with semantic similarity analysis or content fingerprinting, you can effectively identify and remove duplicate entries, resulting in a cleaner and more reliable knowledge base. Remember to automate the deduplication process and monitor data quality regularly to prevent the re-introduction of duplicates. This improved data quality will pay dividends in downstream tasks such as RAG and other information retrieval applications.

Need this built for your business?

Get In Touch