Quartalis AI Ecosystem
Unified AI backend with 6 providers, auto-routing, WebSocket streaming, conversation management, financial advisor, and deep memory system — all self-hosted on bare metal.
The Problem
Most AI applications are tightly coupled to a single provider. When that provider goes down, imposes rate limits, or changes pricing, the entire system breaks. I needed an AI backend that could survive any single point of failure, route intelligently between providers, and maintain persistent context across conversations.
The Solution
Quartalis is a unified AI backend built with FastAPI that abstracts away provider complexity behind a single API. It manages 6 AI providers (local Ollama, Claude, Gemini, OpenAI, OpenRouter, DeepSeek) with automatic failover routing, WebSocket streaming for real-time responses, and a deep memory system that gives the AI genuine long-term recall.
Architecture
The system runs entirely on self-hosted infrastructure:
- Backend: FastAPI with async/await, running in Docker on an HP DL380 Gen9 server (Unraid)
- Primary AI: Local Ollama on a dedicated workstation with RTX 5070 Ti (192.168.0.92)
- Fallback Chain: Muscle GPU → Gemini → Claude → OpenRouter → DeepSeek
- Streaming: WebSocket connections with 15-second first-token timeout and automatic provider fallback
- Database: SQLite for conversations, settings, and financial data
- Memory: ChromaDB vector store with 243,000+ embedded chunks (19-feature RAG pipeline)
Key Features
Multi-Provider Auto-Routing
When the primary local model is unavailable or too slow (>15s to first token), the system automatically falls through to cloud providers. Each provider has a standardised interface via a base class, making it trivial to add new providers.
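The routing logic can be sketched roughly as follows. This is an illustrative sketch, not the actual Quartalis code: `BaseProvider`, `ProviderRouter`, and the method names are assumptions, but the shape matches the behaviour described above (ordered chain, 15-second first-token deadline, fall through on any failure).

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncIterator

class BaseProvider(ABC):
    """Standardised interface every provider implements (names illustrative)."""

    @abstractmethod
    async def stream(self, prompt: str) -> AsyncIterator[str]:
        """Yield response tokens as they arrive."""

class ProviderRouter:
    """Try providers in order; fall through on error or a slow first token."""

    def __init__(self, providers: list[BaseProvider], first_token_timeout: float = 15.0):
        self.providers = providers
        self.first_token_timeout = first_token_timeout

    async def stream(self, prompt: str) -> AsyncIterator[str]:
        for provider in self.providers:
            gen = provider.stream(prompt)
            try:
                # Only the first token is held to the deadline; after that,
                # tokens stream freely from whichever provider answered.
                first = await asyncio.wait_for(gen.__anext__(), self.first_token_timeout)
            except Exception:
                continue  # timeout, provider error, or empty stream: try the next one
            yield first
            async for token in gen:
                yield token
            return
        raise RuntimeError("all providers in the chain failed")
```

Adding a provider then means subclassing `BaseProvider` and appending it to the chain, which is what keeps new integrations small.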
WebSocket Streaming
Real-time token-by-token streaming over WebSocket, with proper ping/pong keepalive. The async architecture ensures that heavy memory retrieval operations don’t block the event loop — all ChromaDB calls are wrapped in asyncio.to_thread().
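The non-blocking retrieval pattern looks roughly like this. The helper names and prompt assembly are illustrative, not the actual Quartalis code; `collection.query` is ChromaDB's synchronous query call, which returns a dict of result lists.

```python
import asyncio

def retrieve_context(collection, query_text: str, n_results: int = 5):
    """Blocking vector-store lookup -- must run off the event loop."""
    return collection.query(query_texts=[query_text], n_results=n_results)

async def build_prompt(collection, user_message: str) -> str:
    # Offload the blocking ChromaDB call to a worker thread so the event
    # loop stays free to service WebSocket ping/pong and other clients.
    results = await asyncio.to_thread(retrieve_context, collection, user_message)
    chunks = results["documents"][0]  # documents for the first (only) query
    return "\n".join(chunks) + "\n\nUser: " + user_message
```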
Financial Command Centre
An AI-powered financial advisor module with UC (Universal Credit) rules engine, credit card analysis with utilisation tracking, and investment monitoring. All AI-suggested actions go through a pending approval system — nothing auto-executes.
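A minimal sketch of that approval flow, with illustrative names (`ApprovalQueue`, `propose`, `approve`) rather than the actual Quartalis schema. The key property is that proposing an action only records it; execution happens exclusively on an explicit approval call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable
import itertools

class Status(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"

@dataclass
class SuggestedAction:
    id: int
    description: str
    execute: Callable[[], None]
    status: Status = Status.PENDING

class ApprovalQueue:
    def __init__(self):
        self._actions: dict[int, SuggestedAction] = {}
        self._ids = itertools.count(1)

    def propose(self, description: str, execute: Callable[[], None]) -> int:
        """Record an AI suggestion -- it is never run automatically."""
        action_id = next(self._ids)
        self._actions[action_id] = SuggestedAction(action_id, description, execute)
        return action_id

    def approve(self, action_id: int) -> None:
        """Only an explicit human approval triggers execution."""
        action = self._actions[action_id]
        action.status = Status.APPROVED
        action.execute()
```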
Conversation Management
Full CRUD for conversations with automatic titling, message history, and context windowing. Each conversation maintains its own memory context.
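Context windowing can be sketched as follows. The token budget and the 4-characters-per-token heuristic are assumptions for illustration, not Quartalis's actual accounting; the idea is simply to keep the newest messages that fit.

```python
def window_messages(messages: list[dict], max_tokens: int = 4000) -> list[dict]:
    """Keep the most recent messages whose combined size fits the budget."""
    budget = max_tokens * 4  # rough chars-per-token heuristic (assumption)
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):  # walk newest-first
        size = len(msg["content"])
        if used + size > budget and kept:
            break  # budget exhausted; drop everything older
        kept.append(msg)
        used += size
    return list(reversed(kept))  # restore chronological order
```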
Technical Decisions
- FastAPI over Flask/Django: Async-native, WebSocket support built-in, automatic OpenAPI docs
- SQLite over PostgreSQL: Single-file database, zero configuration, perfect for single-server deployment
- Docker with host networking: Simplifies inter-container communication on the same server
- Provider abstraction: Base class pattern allows adding new AI providers in under 50 lines
Results
- 6 AI providers with automatic failover — zero downtime from provider outages
- Sub-200ms response initiation for cached queries via semantic cache
- 24/7 uptime on self-hosted infrastructure
- 18 API endpoints for the financial module alone
- 15-second first-token timeout with graceful fallback
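The semantic cache behind the sub-200ms figure can be sketched like this. Everything here is illustrative: the embedding callable, the 0.92 threshold, and the in-memory storage are assumptions, but the mechanism is the standard one — embed the incoming query and return a cached answer when a sufficiently similar query has been seen before.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed          # callable: text -> embedding vector
        self.threshold = threshold  # similarity needed for a cache hit
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str):
        """Return a cached answer for a semantically similar query, else None."""
        if not self.entries:
            return None
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]))
        return best[1] if cosine(qv, best[0]) >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```

A hit skips the provider chain entirely, which is what makes near-instant response initiation possible for repeated or rephrased queries.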
Tech Stack
Python, FastAPI, WebSocket, SQLite, Docker, Ollama, Claude API, Gemini API, OpenAI API, OpenRouter, DeepSeek, nginx, Cloudflare
Interested in something similar?
I build custom AI systems and infrastructure for businesses.
Get In Touch