AI Memory System
19-feature RAG pipeline processing 243K+ chunks with HyDE, CRAG, agentic RAG, semantic caching, cross-encoder reranking, and RAGAS evaluation.
The Problem
Standard AI chatbots have no memory. They treat every conversation as if it’s the first. For a personal AI assistant, this is useless — it needs to remember decisions made months ago, recall specific dates and facts, and build on previous context. The challenge: how do you give an AI genuine long-term memory without making it slow or unreliable?
The Solution
A 19-feature Retrieval-Augmented Generation (RAG) pipeline that indexes, deduplicates, compresses, and retrieves relevant context from over 243,000 document chunks (cleaned and deduplicated from an initial 257K). It goes far beyond basic “embed and search” — with query rewriting, multi-strategy retrieval, cross-encoder reranking, corrective feedback loops, and live evaluation metrics.
Architecture
- Vector Store: ChromaDB with HNSW indexing, running on the Brain server
- Embeddings: qwen3-embedding:8b (4096 dimensions, MTEB score 70.58) via Ollama, with automatic fallback from GPU to CPU
- Retrieval: Hybrid search combining dense embeddings with BM25 sparse retrieval
- Reranking: Cross-encoder model for precision reranking of retrieved chunks
- Evaluation: RAGAS framework with 4 LLM-as-judge metrics
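As a rough illustration of how the dense and sparse legs of this stack fit together, here is a minimal hybrid-search sketch. The host, collection name, and the 0.6/0.4 fusion weights are assumptions for the example, not the production configuration.

```python
# Hybrid retrieval sketch: dense ChromaDB search fused with BM25 keyword scores.
# Host, collection name, and fusion weights are illustrative assumptions.
import chromadb
import ollama
from rank_bm25 import BM25Okapi

client = chromadb.HttpClient(host="brain.local", port=8000)   # assumed Brain server address
collection = client.get_collection("memory_chunks")           # assumed collection name

def embed(text: str) -> list[float]:
    # qwen3-embedding:8b via Ollama returns a 4096-dimension vector
    return ollama.embeddings(model="qwen3-embedding:8b", prompt=text)["embedding"]

def hybrid_search(query: str, k: int = 10) -> list[str]:
    # Dense leg: vector search over the HNSW index (assumes a cosine-distance collection)
    dense = collection.query(query_embeddings=[embed(query)], n_results=k * 3)
    docs = dense["documents"][0]
    dense_scores = {d: 1.0 - dist for d, dist in zip(docs, dense["distances"][0])}

    # Sparse leg: BM25 keyword scores over the same candidates, normalised to [0, 1]
    raw = BM25Okapi([d.split() for d in docs]).get_scores(query.split())
    top = max(raw.max(), 1e-9)
    sparse_scores = {d: s / top for d, s in zip(docs, raw)}

    # Weighted fusion of the two signals (0.6/0.4 split is illustrative)
    fused = {d: 0.6 * dense_scores[d] + 0.4 * sparse_scores[d] for d in docs}
    return sorted(fused, key=fused.get, reverse=True)[:k]
```

In a full system the sparse leg would normally score against a prebuilt BM25 index of the whole corpus; restricting it to the dense candidates keeps the sketch self-contained.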
The 19 Features
- Smart Retrieval — Adaptive chunk count based on query complexity
- HyDE (Hypothetical Document Embeddings) — Generates hypothetical answers to improve retrieval
- Contextual Compression — Strips irrelevant content from retrieved chunks
- Multi-Query Expansion — Rewrites queries from multiple angles
- Conversation-Aware Retrieval — Uses chat history to inform searches
- Query Routing — Directs queries to specialised retrieval strategies
- LLM Fact Extraction — Extracts and stores factual claims from conversations
- Recency Weighting — Prioritises newer information
- Embedding Fallback — GPU → CPU automatic fallback
- Topic Consolidation — 7 consolidated summaries across major topics
- Deduplication — Exact and near-duplicate removal (73 duplicates caught in the first batch)
- Feedback Loop — Tracks retrieval quality and adjusts parameters
- HNSW Tuning — Optimised index parameters for the collection size
- Cross-Encoder Reranking — Precision scoring of candidate chunks
- Hybrid BM25 Search — Combines keyword and semantic matching
- Semantic Caching — 500-entry cache with 60-minute TTL, cosine similarity ≥0.92 (see the sketch after this list)
- Corrective RAG (CRAG) — Verifies and retries on moderate confidence
- Agentic RAG Loop — Gap detection and multi-hop retrieval for complex queries
- RAGAS Evaluation — 4 LLM-as-judge metrics for continuous quality monitoring
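The Semantic Caching feature can be sketched in a few lines. The 500-entry limit, 60-minute TTL, and ≥0.92 cosine threshold come from the feature list above; the `SemanticCache` class and its methods are hypothetical names for illustration.

```python
# Semantic cache sketch: returns a cached answer when a new query's embedding is
# cosine-similar (>= 0.92) to an earlier query. Sizes and TTL mirror the feature
# list; the embed() dependency is injected so the sketch stays self-contained.
import time
from collections import OrderedDict
from typing import Callable, Optional

import numpy as np

class SemanticCache:
    def __init__(self, embed: Callable[[str], list[float]],
                 max_entries: int = 500, ttl_s: int = 60 * 60, threshold: float = 0.92):
        self.embed = embed
        self.max_entries = max_entries
        self.ttl_s = ttl_s
        self.threshold = threshold
        self._entries: OrderedDict[str, tuple[np.ndarray, str, float]] = OrderedDict()

    def _normalise(self, text: str) -> np.ndarray:
        vec = np.asarray(self.embed(text), dtype=float)
        return vec / np.linalg.norm(vec)

    def get(self, query: str) -> Optional[str]:
        q = self._normalise(query)
        now = time.time()
        for key, (vec, answer, ts) in list(self._entries.items()):
            if now - ts > self.ttl_s:                   # expire stale entries
                del self._entries[key]
                continue
            if float(np.dot(q, vec)) >= self.threshold: # cosine-similarity hit
                return answer
        return None

    def put(self, query: str, answer: str) -> None:
        self._entries[query] = (self._normalise(query), answer, time.time())
        while len(self._entries) > self.max_entries:    # evict the oldest entry
            self._entries.popitem(last=False)
```

Usage is two calls around the pipeline: `cache.get(query)` before retrieval, and `cache.put(query, answer)` after generation.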
Key Technical Decisions
- Embedding migration: Moved from nomic-embed-text (768 dims) to qwen3-embedding:8b (4096 dims) — required full re-indexing of all chunks
- Async wrappers: All ChromaDB calls wrapped in
asyncio.to_thread()to prevent blocking the FastAPI event loop - Semantic cache threshold: Cosine similarity ≥0.92 balances hit rate against false matches
- CRAG confidence thresholds: moderate triggers verify+retry; low triggers full re-retrieval
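A minimal sketch of the async-wrapper pattern, assuming a FastAPI endpoint; the endpoint path, host, and collection name are placeholders, not the real API:

```python
# Async-wrapper sketch: ChromaDB's client is synchronous, so the blocking work is
# pushed onto a worker thread with asyncio.to_thread() to keep the event loop free.
# Endpoint path, host, and collection name are placeholders.
import asyncio

import chromadb
import ollama
from fastapi import FastAPI

app = FastAPI()
client = chromadb.HttpClient(host="brain.local", port=8000)   # placeholder host
collection = client.get_collection("memory_chunks")           # placeholder name

def _query_sync(text: str, k: int) -> dict:
    # Both the embedding call and the ChromaDB query are blocking I/O.
    vec = ollama.embeddings(model="qwen3-embedding:8b", prompt=text)["embedding"]
    return collection.query(query_embeddings=[vec], n_results=k)

@app.get("/memory/search")
async def search(q: str, k: int = 5) -> dict:
    # Run the blocking work on a thread so other requests keep being served.
    result = await asyncio.to_thread(_query_sync, q, k)
    return {"documents": result["documents"][0]}
```

Without the wrapper, a single slow vector query would stall every other request handled by the event loop.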
Results
- 243K+ chunks indexed and searchable (cleaned from an initial 257K via deduplication and consolidation)
- ~3s cached / ~33s cold context build time
- 7 consolidated topic summaries (DVLA, Quartalis, YouTube, Finance, Family, etc.)
- 73 exact duplicates automatically removed
- 7 new API endpoints for cache/CRAG/agentic/RAGAS monitoring
- Full embedding migration completed with zero data loss
Tech Stack
Python, ChromaDB, LangChain, Ollama, qwen3-embedding:8b, cross-encoder, BM25, RAGAS, SQLite, FastAPI
Interested in something similar?
I build custom AI systems and infrastructure for businesses.
Get In Touch