AI Memory System
19-feature RAG pipeline processing 243K+ chunks with HyDE, CRAG, agentic RAG, semantic caching, cross-encoder reranking, and RAGAS evaluation.
The Problem
Standard AI chatbots have no memory. They treat every conversation as if it’s the first. For a personal AI assistant, this is useless — it needs to remember decisions made months ago, recall specific dates and facts, and build on previous context. The challenge: how do you give an AI genuine long-term memory without making it slow or unreliable?
The Solution
A 19-feature Retrieval-Augmented Generation (RAG) pipeline that indexes, deduplicates, compresses, and retrieves relevant context from over 243,000 document chunks (cleaned and deduplicated from an initial 257K). It goes far beyond basic “embed and search” — with query rewriting, multi-strategy retrieval, cross-encoder reranking, corrective feedback loops, and live evaluation metrics.
Architecture
- Vector Store: ChromaDB with HNSW indexing, running on the Brain server
- Embeddings: qwen3-embedding:8b (4096 dimensions, MTEB score 70.58) via Ollama, with automatic fallback from GPU to CPU
- Retrieval: Hybrid search combining dense embeddings with BM25 sparse retrieval
- Reranking: Cross-encoder model for precision reranking of retrieved chunks
- Evaluation: RAGAS framework with 4 LLM-as-judge metrics
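As a rough illustration of how the dense and sparse legs of this stack fit together, here is a minimal hybrid-search sketch. The host, collection name, and the 0.6/0.4 fusion weights are assumptions for the example, not the production configuration.

```python
# Hybrid retrieval sketch: dense ChromaDB search fused with BM25 keyword scores.
# Host, collection name, and fusion weights are illustrative assumptions.
import chromadb
import ollama
from rank_bm25 import BM25Okapi

client = chromadb.HttpClient(host="brain.local", port=8000)   # assumed Brain server address
collection = client.get_collection("memory_chunks")           # assumed collection name

def embed(text: str) -> list[float]:
    # qwen3-embedding:8b via Ollama returns a 4096-dimension vector
    return ollama.embeddings(model="qwen3-embedding:8b", prompt=text)["embedding"]

def hybrid_search(query: str, k: int = 10) -> list[str]:
    # Dense leg: vector search over the HNSW index (assumes a cosine-distance collection)
    dense = collection.query(query_embeddings=[embed(query)], n_results=k * 3)
    docs = dense["documents"][0]
    dense_scores = {d: 1.0 - dist for d, dist in zip(docs, dense["distances"][0])}

    # Sparse leg: BM25 keyword scores over the same candidates, normalised to [0, 1]
    raw = BM25Okapi([d.split() for d in docs]).get_scores(query.split())
    top = max(raw.max(), 1e-9)
    sparse_scores = {d: s / top for d, s in zip(docs, raw)}

    # Weighted fusion of the two signals (0.6/0.4 split is illustrative)
    fused = {d: 0.6 * dense_scores[d] + 0.4 * sparse_scores[d] for d in docs}
    return sorted(fused, key=fused.get, reverse=True)[:k]
```

In a full system the sparse leg would normally score against a prebuilt BM25 index of the whole corpus; restricting it to the dense candidates keeps the sketch self-contained.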
The 19 Features
- Smart Retrieval — Adaptive chunk count based on query complexity
- HyDE (Hypothetical Document Embeddings) — Generates hypothetical answers to improve retrieval
- Contextual Compression — Strips irrelevant content from retrieved chunks
- Multi-Query Expansion — Rewrites queries from multiple angles
- Conversation-Aware Retrieval — Uses chat history to inform searches
- Query Routing — Directs queries to specialised retrieval strategies
- LLM Fact Extraction — Extracts and stores factual claims from conversations
- Recency Weighting — Prioritises newer information
- Embedding Fallback — GPU → CPU automatic fallback
- Topic Consolidation — 7 consolidated summaries across major topics
- Deduplication — Exact and near-duplicate removal (73 duplicates caught in the first batch)
- Feedback Loop — Tracks retrieval quality and adjusts parameters
- HNSW Tuning — Optimised index parameters for the collection size
- Cross-Encoder Reranking — Precision scoring of candidate chunks
- Hybrid BM25 Search — Combines keyword and semantic matching
- Semantic Caching — 500-entry cache with 60-minute TTL, cosine similarity ≥0.92 (see the sketch after this list)
- Corrective RAG (CRAG) — Verifies and retries on moderate confidence
- Agentic RAG Loop — Gap detection and multi-hop retrieval for complex queries
- RAGAS Evaluation — 4 LLM-as-judge metrics for continuous quality monitoring
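The Semantic Caching feature can be sketched in a few lines. The 500-entry limit, 60-minute TTL, and ≥0.92 cosine threshold come from the feature list above; the `SemanticCache` class and its methods are hypothetical names for illustration.

```python
# Semantic cache sketch: returns a cached answer when a new query's embedding is
# cosine-similar (>= 0.92) to an earlier query. Sizes and TTL mirror the feature
# list; the embed() dependency is injected so the sketch stays self-contained.
import time
from collections import OrderedDict
from typing import Callable, Optional

import numpy as np

class SemanticCache:
    def __init__(self, embed: Callable[[str], list[float]],
                 max_entries: int = 500, ttl_s: int = 60 * 60, threshold: float = 0.92):
        self.embed = embed
        self.max_entries = max_entries
        self.ttl_s = ttl_s
        self.threshold = threshold
        self._entries: OrderedDict[str, tuple[np.ndarray, str, float]] = OrderedDict()

    def _normalise(self, text: str) -> np.ndarray:
        vec = np.asarray(self.embed(text), dtype=float)
        return vec / np.linalg.norm(vec)

    def get(self, query: str) -> Optional[str]:
        q = self._normalise(query)
        now = time.time()
        for key, (vec, answer, ts) in list(self._entries.items()):
            if now - ts > self.ttl_s:                   # expire stale entries
                del self._entries[key]
                continue
            if float(np.dot(q, vec)) >= self.threshold: # cosine-similarity hit
                return answer
        return None

    def put(self, query: str, answer: str) -> None:
        self._entries[query] = (self._normalise(query), answer, time.time())
        while len(self._entries) > self.max_entries:    # evict the oldest entry
            self._entries.popitem(last=False)
```

Usage is two calls around the pipeline: `cache.get(query)` before retrieval, and `cache.put(query, answer)` after generation.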
Key Technical Decisions
- Embedding migration: Moved from nomic-embed-text (768 dims) to qwen3-embedding:8b (4096 dims) — required full re-indexing of all chunks
- Async wrappers: All ChromaDB calls wrapped in
asyncio.to_thread()to prevent blocking the FastAPI event loop - Semantic cache threshold: Cosine similarity ≥0.92 balances hit rate against false matches
- CRAG confidence thresholds: moderate triggers verify+retry; low triggers full re-retrieval
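A minimal sketch of the async-wrapper pattern, assuming a FastAPI endpoint; the endpoint path, host, and collection name are placeholders, not the real API:

```python
# Async-wrapper sketch: ChromaDB's client is synchronous, so the blocking work is
# pushed onto a worker thread with asyncio.to_thread() to keep the event loop free.
# Endpoint path, host, and collection name are placeholders.
import asyncio

import chromadb
import ollama
from fastapi import FastAPI

app = FastAPI()
client = chromadb.HttpClient(host="brain.local", port=8000)   # placeholder host
collection = client.get_collection("memory_chunks")           # placeholder name

def _query_sync(text: str, k: int) -> dict:
    # Both the embedding call and the ChromaDB query are blocking I/O.
    vec = ollama.embeddings(model="qwen3-embedding:8b", prompt=text)["embedding"]
    return collection.query(query_embeddings=[vec], n_results=k)

@app.get("/memory/search")
async def search(q: str, k: int = 5) -> dict:
    # Run the blocking work on a thread so other requests keep being served.
    result = await asyncio.to_thread(_query_sync, q, k)
    return {"documents": result["documents"][0]}
```

Without the wrapper, a single slow vector query would stall every other request handled by the event loop.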
Results
- 243K+ chunks indexed and searchable (cleaned from an initial 257K via deduplication and consolidation)
- ~3s cached / ~33s cold context build time
- 7 consolidated topic summaries (DVLA, Quartalis, YouTube, Finance, Family, etc.)
- 73 exact duplicates automatically removed
- 7 new API endpoints for cache/CRAG/agentic/RAGAS monitoring
- Full embedding migration completed with zero data loss
Tech Stack
Python, ChromaDB, LangChain, Ollama, qwen3-embedding:8b, cross-encoder, BM25, RAGAS, SQLite, FastAPI
Interested in something similar?
I build custom AI systems and infrastructure for businesses.
Get In Touch