Quartalis AI Ecosystem
Unified AI backend with 6 providers, auto-routing, WebSocket streaming, conversation management, financial advisor, and deep memory system — all self-hosted on bare metal.
The Problem
Most AI applications are tightly coupled to a single provider. When that provider goes down, imposes rate limits, or changes pricing, the entire system breaks. I needed an AI backend that could survive any single point of failure, route intelligently between providers, and maintain persistent context across conversations.
The Solution
Quartalis is a unified AI backend built with FastAPI that abstracts away provider complexity behind a single API. It manages 6 AI providers (local Ollama, Claude, Gemini, OpenAI, OpenRouter, DeepSeek) with automatic failover routing, WebSocket streaming for real-time responses, and a deep memory system that gives the AI genuine long-term recall.
Architecture
The system runs entirely on self-hosted infrastructure:
- Backend: FastAPI with async/await, running in Docker on an HP DL380 Gen9 server (Unraid)
- Primary AI: Local Ollama on a dedicated workstation with RTX 5070 Ti (192.168.0.92)
- Fallback Chain: Muscle GPU → Gemini → Claude → OpenRouter → DeepSeek
- Streaming: WebSocket connections with 15-second first-token timeout and automatic provider fallback
- Database: SQLite for conversations, settings, and financial data
- Memory: ChromaDB vector store with 243,000+ embedded chunks (19-feature RAG pipeline)
Key Features
Multi-Provider Auto-Routing
When the primary local model is unavailable or too slow (>15s to first token), the system automatically falls through to cloud providers. Each provider has a standardised interface via a base class, making it trivial to add new providers.
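The routing logic can be sketched roughly as follows. This is an illustrative sketch, not the actual Quartalis code: `BaseProvider`, `ProviderRouter`, and the method names are assumptions, but the shape matches the behaviour described above (ordered chain, 15-second first-token deadline, fall through on any failure).

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncIterator

class BaseProvider(ABC):
    """Standardised interface every provider implements (names illustrative)."""

    @abstractmethod
    async def stream(self, prompt: str) -> AsyncIterator[str]:
        """Yield response tokens as they arrive."""

class ProviderRouter:
    """Try providers in order; fall through on error or a slow first token."""

    def __init__(self, providers: list[BaseProvider], first_token_timeout: float = 15.0):
        self.providers = providers
        self.first_token_timeout = first_token_timeout

    async def stream(self, prompt: str) -> AsyncIterator[str]:
        for provider in self.providers:
            gen = provider.stream(prompt)
            try:
                # Only the first token is held to the deadline; after that,
                # tokens stream freely from whichever provider answered.
                first = await asyncio.wait_for(gen.__anext__(), self.first_token_timeout)
            except Exception:
                continue  # timeout, provider error, or empty stream: try the next one
            yield first
            async for token in gen:
                yield token
            return
        raise RuntimeError("all providers in the chain failed")
```

Adding a provider then means subclassing `BaseProvider` and appending it to the chain, which is what keeps new integrations small.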
WebSocket Streaming
Real-time token-by-token streaming over WebSocket, with proper ping/pong keepalive. The async architecture ensures that heavy memory retrieval operations don’t block the event loop — all ChromaDB calls are wrapped in asyncio.to_thread().
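The non-blocking retrieval pattern looks roughly like this. The helper names and prompt assembly are illustrative, not the actual Quartalis code; `collection.query` is ChromaDB's synchronous query call, which returns a dict of result lists.

```python
import asyncio

def retrieve_context(collection, query_text: str, n_results: int = 5):
    """Blocking vector-store lookup -- must run off the event loop."""
    return collection.query(query_texts=[query_text], n_results=n_results)

async def build_prompt(collection, user_message: str) -> str:
    # Offload the blocking ChromaDB call to a worker thread so the event
    # loop stays free to service WebSocket ping/pong and other clients.
    results = await asyncio.to_thread(retrieve_context, collection, user_message)
    chunks = results["documents"][0]  # documents for the first (only) query
    return "\n".join(chunks) + "\n\nUser: " + user_message
```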
Financial Command Centre
An AI-powered financial advisor module with UC (Universal Credit) rules engine, credit card analysis with utilisation tracking, and investment monitoring. All AI-suggested actions go through a pending approval system — nothing auto-executes.
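A minimal sketch of that approval flow, with illustrative names (`ApprovalQueue`, `propose`, `approve`) rather than the actual Quartalis schema. The key property is that proposing an action only records it; execution happens exclusively on an explicit approval call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable
import itertools

class Status(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"

@dataclass
class SuggestedAction:
    id: int
    description: str
    execute: Callable[[], None]
    status: Status = Status.PENDING

class ApprovalQueue:
    def __init__(self):
        self._actions: dict[int, SuggestedAction] = {}
        self._ids = itertools.count(1)

    def propose(self, description: str, execute: Callable[[], None]) -> int:
        """Record an AI suggestion -- it is never run automatically."""
        action_id = next(self._ids)
        self._actions[action_id] = SuggestedAction(action_id, description, execute)
        return action_id

    def approve(self, action_id: int) -> None:
        """Only an explicit human approval triggers execution."""
        action = self._actions[action_id]
        action.status = Status.APPROVED
        action.execute()
```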
Conversation Management
Full CRUD for conversations with automatic titling, message history, and context windowing. Each conversation maintains its own memory context.
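Context windowing can be sketched as follows. The token budget and the 4-characters-per-token heuristic are assumptions for illustration, not Quartalis's actual accounting; the idea is simply to keep the newest messages that fit.

```python
def window_messages(messages: list[dict], max_tokens: int = 4000) -> list[dict]:
    """Keep the most recent messages whose combined size fits the budget."""
    budget = max_tokens * 4  # rough chars-per-token heuristic (assumption)
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):  # walk newest-first
        size = len(msg["content"])
        if used + size > budget and kept:
            break  # budget exhausted; drop everything older
        kept.append(msg)
        used += size
    return list(reversed(kept))  # restore chronological order
```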
Technical Decisions
- FastAPI over Flask/Django: Async-native, WebSocket support built-in, automatic OpenAPI docs
- SQLite over PostgreSQL: Single-file database, zero configuration, perfect for single-server deployment
- Docker with host networking: Simplifies inter-container communication on the same server
- Provider abstraction: Base class pattern allows adding new AI providers in under 50 lines
Results
- 6 AI providers with automatic failover — zero downtime from provider outages
- Sub-200ms response initiation for cached queries via semantic cache
- 24/7 uptime on self-hosted infrastructure
- 18 API endpoints for the financial module alone
- 15-second first-token timeout with graceful fallback
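The semantic cache behind the sub-200ms figure can be sketched like this. Everything here is illustrative: the embedding callable, the 0.92 threshold, and the in-memory storage are assumptions, but the mechanism is the standard one — embed the incoming query and return a cached answer when a sufficiently similar query has been seen before.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed          # callable: text -> embedding vector
        self.threshold = threshold  # similarity needed for a cache hit
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str):
        """Return a cached answer for a semantically similar query, else None."""
        if not self.entries:
            return None
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]))
        return best[1] if cosine(qv, best[0]) >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```

A hit skips the provider chain entirely, which is what makes near-instant response initiation possible for repeated or rephrased queries.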
Tech Stack
Python, FastAPI, WebSocket, SQLite, Docker, Ollama, Claude API, Gemini API, OpenAI API, OpenRouter, DeepSeek, nginx, Cloudflare
Interested in something similar?
I build custom AI systems and infrastructure for businesses.
Get In Touch