
AI Memory System

19-feature RAG pipeline processing 243K+ chunks with HyDE, CRAG, agentic RAG, semantic caching, cross-encoder reranking, and RAGAS evaluation.

The Problem

Standard AI chatbots have no memory. They treat every conversation as if it’s the first. For a personal AI assistant, this is useless — it needs to remember decisions made months ago, recall specific dates and facts, and build on previous context. The challenge: how do you give an AI genuine long-term memory without making it slow or unreliable?

The Solution

A 19-feature Retrieval-Augmented Generation (RAG) pipeline that indexes, deduplicates, compresses, and retrieves relevant context from over 243,000 document chunks (cleaned and deduplicated from an initial 257K). It goes far beyond basic “embed and search” — with query rewriting, multi-strategy retrieval, cross-encoder reranking, corrective feedback loops, and live evaluation metrics.

Architecture

  • Vector Store: ChromaDB with HNSW indexing, running on the Brain server
  • Embeddings: qwen3-embedding:8b (4096 dimensions, MTEB score 70.58) via Ollama, with automatic fallback from GPU to CPU
  • Retrieval: Hybrid search combining dense embeddings with BM25 sparse retrieval (see the sketch after this list)
  • Reranking: Cross-encoder model for precision reranking of retrieved chunks
  • Evaluation: RAGAS framework with 4 LLM-as-judge metrics
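Here's what the hybrid retrieval leg looks like as a minimal sketch, assuming a remote ChromaDB collection, Ollama's /api/embeddings endpoint, and the rank_bm25 package. The host, collection name, and reciprocal-rank-fusion constant are illustrative, not the production configuration.

```python
import chromadb
import requests
from rank_bm25 import BM25Okapi

# Illustrative host and collection names -- the real deployment differs.
client = chromadb.HttpClient(host="brain.local", port=8000)
collection = client.get_collection("memory_chunks")

def embed(text: str) -> list[float]:
    """Embed text with qwen3-embedding:8b via Ollama's embeddings endpoint."""
    resp = requests.post("http://localhost:11434/api/embeddings",
                         json={"model": "qwen3-embedding:8b", "prompt": text})
    return resp.json()["embedding"]

def hybrid_search(query: str, corpus: list[str], ids: list[str], k: int = 10) -> list[str]:
    # Dense leg: semantic nearest neighbours from ChromaDB's HNSW index.
    dense = collection.query(query_embeddings=[embed(query)], n_results=k)
    dense_ids = dense["ids"][0]

    # Sparse leg: BM25 keyword ranking (built per call here only to keep the sketch short).
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    scores = bm25.get_scores(query.split())
    sparse_ids = [ids[i] for i in sorted(range(len(ids)), key=lambda i: -scores[i])[:k]]

    # Reciprocal rank fusion: reward chunks that rank well in either leg.
    fused: dict[str, float] = {}
    for ranking in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (60 + rank)
    return sorted(fused, key=fused.get, reverse=True)[:k]
```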

The 19 Features

  1. Smart Retrieval — Adaptive chunk count based on query complexity
  2. HyDE (Hypothetical Document Embeddings) — Generates hypothetical answers to improve retrieval (sketched after this list)
  3. Contextual Compression — Strips irrelevant content from retrieved chunks
  4. Multi-Query Expansion — Rewrites queries from multiple angles
  5. Conversation-Aware Retrieval — Uses chat history to inform searches
  6. Query Routing — Directs queries to specialised retrieval strategies
  7. LLM Fact Extraction — Extracts and stores factual claims from conversations
  8. Recency Weighting — Prioritises newer information (sketched after this list)
  9. Embedding Fallback — GPU → CPU automatic fallback
  10. Topic Consolidation — 7 consolidated summaries across major topics
  11. Deduplication — Exact and near-duplicate removal (73 duplicates caught in the first batch)
  12. Feedback Loop — Tracks retrieval quality and adjusts parameters
  13. HNSW Tuning — Optimised index parameters for the collection size
  14. Cross-Encoder Reranking — Precision scoring of candidate chunks
  15. Hybrid BM25 Search — Combines keyword and semantic matching
  16. Semantic Caching — 500-entry cache with 60-minute TTL, cosine similarity ≥0.92 (sketched after this list)
  17. Corrective RAG (CRAG) — Verifies and retries on moderate confidence
  18. Agentic RAG Loop — Gap detection and multi-hop retrieval for complex queries
  19. RAGAS Evaluation — 4 LLM-as-judge metrics for continuous quality monitoring
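A few of these are easier to show than to describe. First, HyDE (feature 2), as a minimal sketch against Ollama's /api/generate and /api/embeddings endpoints; the generator model name and prompt are illustrative, not the production configuration.

```python
import requests

OLLAMA = "http://localhost:11434"  # illustrative local Ollama endpoint

def hyde_retrieve(query: str, collection, n_results: int = 8):
    """Hypothetical Document Embeddings: embed a guessed answer rather than
    the raw question, so the query vector lands closer to answer-shaped chunks."""
    # 1. Ask a local LLM for a plausible (possibly wrong) answer.
    hypothetical = requests.post(f"{OLLAMA}/api/generate", json={
        "model": "llama3.1",  # illustrative generator model
        "prompt": f"Write a short passage that answers: {query}",
        "stream": False,
    }).json()["response"]

    # 2. Embed the hypothetical answer instead of the original question.
    embedding = requests.post(f"{OLLAMA}/api/embeddings", json={
        "model": "qwen3-embedding:8b",
        "prompt": hypothetical,
    }).json()["embedding"]

    # 3. Retrieve real chunks that sit near the hypothetical answer.
    return collection.query(query_embeddings=[embedding], n_results=n_results)
```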
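Recency weighting (feature 8) is a small scoring tweak. This sketch assumes each chunk stores a Unix timestamp in its metadata; the half-life and blend weights are illustrative.

```python
import math
import time

def recency_weighted(similarity: float, chunk_ts: float, half_life_days: float = 90.0) -> float:
    """Blend semantic similarity with an exponential age decay so newer chunks
    outrank equally relevant older ones (half-life and 0.7/0.3 split are illustrative)."""
    age_days = (time.time() - chunk_ts) / 86400
    decay = math.exp(-math.log(2) * age_days / half_life_days)
    # Keep a floor so old-but-relevant chunks are dampened, never zeroed out.
    return similarity * (0.7 + 0.3 * decay)
```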
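And the semantic cache (feature 16), as an in-memory sketch: the ≥0.92 cosine threshold, 60-minute TTL, and 500-entry cap are the real settings, while the data structure and eviction policy here are illustrative.

```python
import time
import numpy as np

class SemanticCache:
    """Serve a cached answer when a new query is semantically close to a recent one."""

    def __init__(self, threshold: float = 0.92, ttl_s: int = 3600, max_entries: int = 500):
        self.threshold, self.ttl_s, self.max_entries = threshold, ttl_s, max_entries
        self.entries: list[tuple[np.ndarray, str, float]] = []  # (unit embedding, answer, timestamp)

    def get(self, query_emb: np.ndarray) -> str | None:
        now = time.time()
        self.entries = [e for e in self.entries if now - e[2] < self.ttl_s]  # expire by TTL
        q = query_emb / np.linalg.norm(query_emb)
        for emb, answer, _ in self.entries:
            if float(np.dot(q, emb)) >= self.threshold:  # cosine similarity on unit vectors
                return answer
        return None

    def put(self, query_emb: np.ndarray, answer: str) -> None:
        self.entries.append((query_emb / np.linalg.norm(query_emb), answer, time.time()))
        if len(self.entries) > self.max_entries:
            self.entries.pop(0)  # simple FIFO eviction once the 500-entry cap is hit
```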

Key Technical Decisions

  • Embedding migration: Moved from nomic-embed-text (768 dims) to qwen3-embedding:8b (4096 dims) — required full re-indexing of all chunks
  • Async wrappers: All ChromaDB calls wrapped in asyncio.to_thread() to prevent blocking the FastAPI event loop (sketched below)
  • Semantic cache threshold: Cosine similarity ≥0.92 balances hit rate against false matches
  • CRAG confidence thresholds: a moderate-confidence retrieval triggers verification and a retry; low confidence triggers full re-retrieval (also sketched below)
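The async wrapper pattern looks like this in practice; a minimal FastAPI sketch with illustrative host, collection, and endpoint names.

```python
import asyncio
import chromadb
from fastapi import FastAPI

app = FastAPI()
# Illustrative client and collection names -- the production setup differs.
collection = chromadb.HttpClient(host="brain.local", port=8000).get_collection("memory_chunks")

@app.get("/memory/search")
async def search(q: str, k: int = 8):
    # ChromaDB's client is synchronous; running the call on a worker thread keeps
    # the FastAPI event loop free to serve other requests while the query blocks.
    # (In the real pipeline the query is embedded with qwen3-embedding:8b first.)
    results = await asyncio.to_thread(collection.query, query_texts=[q], n_results=k)
    return {"ids": results["ids"][0], "documents": results["documents"][0]}
```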
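And the CRAG routing reduces to a small decision function over the cross-encoder's relevance score; the numeric thresholds here are illustrative, not the production values.

```python
def crag_route(relevance: float, high: float = 0.7, low: float = 0.3) -> str:
    """Corrective RAG: route on retrieval confidence (thresholds illustrative).

    high     -> use the retrieved chunks as-is
    moderate -> verify the chunks against the query, retry retrieval if they fail
    low      -> discard and re-retrieve from scratch (e.g. with rewritten queries)
    """
    if relevance >= high:
        return "use_retrieved_context"
    if relevance >= low:
        return "verify_and_retry"
    return "full_re_retrieval"
```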

Results

  • 243K+ chunks indexed and searchable, down from an initial 257K after deduplication and consolidation
  • Context build time of ~3s on a semantic-cache hit vs. ~33s cold
  • 7 consolidated topic summaries (DVLA, Quartalis, YouTube, Finance, Family, etc.)
  • 73 exact duplicates automatically removed
  • 7 new API endpoints for cache/CRAG/agentic/RAGAS monitoring
  • Full embedding migration completed with zero data loss

Tech Stack

Python, ChromaDB, LangChain, Ollama, qwen3-embedding:8b, cross-encoder, BM25, RAGAS, SQLite, FastAPI

Interested in something similar?

I build custom AI systems and infrastructure for businesses.

Get In Touch