FastAPI · Python · Docker · WebSocket · AI · SQLite

Quartalis AI Ecosystem

Unified AI backend with 6 providers, auto-routing, WebSocket streaming, conversation management, financial advisor, and deep memory system — all self-hosted on bare metal.

The Problem

Most AI applications are tightly coupled to a single provider. When that provider goes down, hits rate limits, or changes pricing — the entire system breaks. I needed an AI backend that could survive any single point of failure, route intelligently between providers, and maintain persistent context across conversations.

The Solution

Quartalis is a unified AI backend built with FastAPI that abstracts away provider complexity behind a single API. It manages 6 AI providers (local Ollama, Claude, Gemini, OpenAI, OpenRouter, DeepSeek) with automatic failover routing, WebSocket streaming for real-time responses, and a deep memory system that gives the AI genuine long-term recall.

Architecture

The system runs entirely on self-hosted infrastructure:

  • Backend: FastAPI with async/await, running in Docker on an HP DL380 Gen9 server (Unraid)
  • Primary AI: Local Ollama on a dedicated workstation with RTX 5070 Ti (192.168.0.92)
  • Fallback Chain: Muscle GPU → Gemini → Claude → OpenRouter → DeepSeek
  • Streaming: WebSocket connections with 15-second first-token timeout and automatic provider fallback
  • Database: SQLite for conversations, settings, and financial data
  • Memory: ChromaDB vector store with 243,000+ embedded chunks (19-feature RAG pipeline)

Key Features

Multi-Provider Auto-Routing

When the primary local model is unavailable or too slow (>15s to first token), the system automatically falls through to cloud providers. Each provider has a standardised interface via a base class, making it trivial to add new providers.
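The base-class pattern looks roughly like this — a sketch of the shape, with a toy `EchoProvider` standing in for a real backend:

```python
from abc import ABC, abstractmethod
from typing import AsyncIterator

class AIProvider(ABC):
    """Each provider implements one streaming method behind a common name."""
    name: str = "base"

    @abstractmethod
    def stream(self, prompt: str) -> AsyncIterator[str]:
        """Yield response tokens for the given prompt."""

class EchoProvider(AIProvider):
    """Toy provider used here only to show how small a subclass can be."""
    name = "echo"

    async def stream(self, prompt: str) -> AsyncIterator[str]:
        for word in prompt.split():
            yield word
```

A new provider only has to map its API into that one async generator, which is what keeps additions small.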

WebSocket Streaming

Real-time token-by-token streaming over WebSocket, with proper ping/pong keepalive. The async architecture ensures that heavy memory retrieval operations don’t block the event loop — all ChromaDB calls are wrapped in asyncio.to_thread().
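The event-loop point can be shown in miniature — a sketch with a `time.sleep` standing in for a synchronous ChromaDB query:

```python
import asyncio
import time

def blocking_memory_lookup(query: str) -> list[str]:
    """Stand-in for a synchronous vector-store query (e.g. ChromaDB)."""
    time.sleep(0.05)  # simulate blocking I/O
    return [f"chunk matching {query!r}"]

async def retrieve(query: str) -> list[str]:
    # The blocking call runs in a worker thread; the await yields control,
    # so WebSocket keepalives and other requests keep flowing meanwhile.
    return await asyncio.to_thread(blocking_memory_lookup, query)
```

Without `asyncio.to_thread()`, that sleep would stall every open WebSocket on the same loop.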

Financial Command Centre

An AI-powered financial advisor module with UC (Universal Credit) rules engine, credit card analysis with utilisation tracking, and investment monitoring. All AI-suggested actions go through a pending approval system — nothing auto-executes.
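The approval gate reduces to a simple pattern — this is an illustrative sketch (class and method names are hypothetical), not the production module:

```python
from dataclasses import dataclass

@dataclass
class PendingAction:
    description: str
    approved: bool = False

class ApprovalQueue:
    """AI suggestions are queued; only explicit approval makes them runnable."""

    def __init__(self) -> None:
        self._actions: list[PendingAction] = []

    def suggest(self, description: str) -> PendingAction:
        action = PendingAction(description)
        self._actions.append(action)
        return action

    def approve(self, action: PendingAction) -> None:
        action.approved = True

    def executable(self) -> list[PendingAction]:
        # Nothing auto-executes: only explicitly approved actions pass.
        return [a for a in self._actions if a.approved]
```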

Conversation Management

Full CRUD for conversations with automatic titling, message history, and context windowing. Each conversation maintains its own memory context.
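The storage side is plain SQLite — a minimal sketch of the idea (the schema here is assumed for illustration, not the real one):

```python
import sqlite3

def init_db(conn: sqlite3.Connection) -> None:
    # One table for conversations, one for their messages.
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS conversations (
            id INTEGER PRIMARY KEY, title TEXT
        );
        CREATE TABLE IF NOT EXISTS messages (
            id INTEGER PRIMARY KEY,
            conversation_id INTEGER REFERENCES conversations(id),
            role TEXT, content TEXT
        );
    """)

def create_conversation(conn, title: str) -> int:
    cur = conn.execute("INSERT INTO conversations (title) VALUES (?)", (title,))
    return cur.lastrowid

def add_message(conn, conv_id: int, role: str, content: str) -> None:
    conn.execute(
        "INSERT INTO messages (conversation_id, role, content) VALUES (?, ?, ?)",
        (conv_id, role, content),
    )

def history(conn, conv_id: int) -> list[tuple[str, str]]:
    # Ordered history feeds the context window for the next request.
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE conversation_id = ? ORDER BY id",
        (conv_id,),
    )
    return list(rows)
```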

Technical Decisions

  • FastAPI over Flask/Django: Async-native, WebSocket support built-in, automatic OpenAPI docs
  • SQLite over PostgreSQL: Single-file database, zero configuration, perfect for single-server deployment
  • Docker with host networking: Simplifies inter-container communication on the same server
  • Provider abstraction: Base class pattern allows adding new AI providers in under 50 lines

Results

  • 6 AI providers with automatic failover — zero downtime from provider outages
  • Sub-200ms response initiation for cached queries via semantic cache
  • 24/7 uptime on self-hosted infrastructure
  • 18 API endpoints for the financial module alone
  • 15-second first-token timeout with graceful fallback

Tech Stack

Python, FastAPI, WebSocket, SQLite, Docker, Ollama, Claude API, Gemini API, OpenAI API, OpenRouter, DeepSeek, nginx, Cloudflare

Interested in something similar?

I build custom AI systems and infrastructure for businesses.

Get In Touch