Moufida (Tunisian Arabic for 'useful/helpful') is a fully local-first, privacy-preserving AI desktop copilot built during the AI Minds hackathon in a 24-hour sprint. It runs entirely on local hardware with no mandatory cloud dependencies. A Tauri + Next.js transparent desktop overlay connects to 5 independent Python microservices: a multimodal retrieval engine that generates CLIP/BLIP/Whisper embeddings locally, a semantic search service using a custom XQdrant fork for explainable similarity scores, a natural-language file organisation service using DBSCAN clustering, a voice copilot with local STT and TTS, and an autonomous knowledge-gap detection engine backed by MongoDB and APScheduler.
Most modern AI tools require cloud connectivity and send user data to external servers. For sensitive domains (medical, legal, research), this is a hard blocker. The challenge was to build a fully featured, multimodal AI assistant that operates entirely on local hardware within 24 hours, with no privacy trade-offs, no vendor lock-in, and no latency penalty from cloud round-trips.
- 01
Built 5 independent FastAPI microservices (ports 8000–8400) behind a Tauri + Next.js transparent desktop overlay providing 6 panels: Agents, Copilot, Files, Graph, Insights, and Settings. Services communicate directly without a central gateway.
- 02
Engineered the Retrieval Service (port 8100): a multimodal ingestion engine that detects file modality (text/PDF/DOCX/image/audio/video/HTML), generates embeddings in shared CLIP space (with BLIP for image captioning and Whisper transcription for audio/video), stores vectors in Qdrant/XQdrant and metadata in MongoDB, and builds a dynamic similarity graph with timeline events. Includes a filesystem watcher and weekly digest generation.
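A minimal sketch of that ingestion flow, assuming sentence-transformers' `clip-ViT-B-32` checkpoint and a local Qdrant instance; `detect_modality`, the extension sets, and the collection name are illustrative, not the service's actual API:

```python
import uuid
from pathlib import Path

from PIL import Image
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")  # shared text/image space, 512-d
qdrant = QdrantClient(host="localhost", port=6333)

IMAGE_EXT = {".png", ".jpg", ".jpeg", ".webp"}
TEXT_EXT = {".txt", ".md", ".html"}

def detect_modality(path: Path) -> str:
    """Crude extension-based routing (hypothetical helper)."""
    suffix = path.suffix.lower()
    if suffix in IMAGE_EXT:
        return "image"
    if suffix in TEXT_EXT:
        return "text"
    return "other"  # PDF/DOCX need extraction; audio/video go to Whisper

def ingest(path: Path) -> None:
    modality = detect_modality(path)
    if modality == "image":
        vector = clip.encode(Image.open(path))  # CLIP image tower
    elif modality == "text":
        vector = clip.encode(path.read_text(errors="ignore")[:2000])
    else:
        return  # extract or transcribe first, then embed the resulting text
    qdrant.upsert(
        collection_name="moufida_files",  # assumes a 512-d collection exists
        points=[PointStruct(id=str(uuid.uuid4()),
                            vector=vector.tolist(),
                            payload={"path": str(path), "modality": modality})],
    )
```

Because text and images land in the same CLIP space, a text query can retrieve screenshots and a photo can retrieve related notes with no extra machinery.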
- 03
Implemented explainable semantic search via the Search Service (port 8400) using the custom XQdrant fork — every result can include a `score_explanation` field decomposing the vector similarity score for mathematical transparency. An agent search mode adds Qwen3 LLM query reformulation, per-result reasoning, and summary insights.
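A hypothetical client call showing what an explained result looks like; the `/search` route, request fields, and response shape are assumptions based on the description above, with only the `score_explanation` field taken from the XQdrant fork:

```python
import requests

resp = requests.post(
    "http://localhost:8400/search",  # assumed route on the Search Service
    json={"query": "invoices from last quarter", "limit": 5, "explain": True},
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json()["results"]:  # assumed payload shape
    print(hit["path"], round(hit["score"], 3))
    # which embedding dimensions contributed most to the similarity score
    print("  explanation:", hit.get("score_explanation"))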
- 04
Built the Organisation Service (port 8200) for natural-language filesystem planning: groups files using DBSCAN clustering (eps=0.35, min_samples=2) over their vector embeddings, generates LLM-assisted folder plans, executes a preview-then-apply dry-run workflow, and updates MongoDB metadata/timeline collections.
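The clustering step is easy to sketch with the parameters cited above (eps=0.35, min_samples=2); the cosine metric, placeholder data source, and variable names are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# CLIP embeddings for the candidate files, fetched from XQdrant
# (placeholder source and paths for the sketch)
vectors = np.load("file_embeddings.npy")
paths = [f"file_{i}" for i in range(len(vectors))]

labels = DBSCAN(eps=0.35, min_samples=2, metric="cosine").fit_predict(vectors)

clusters: dict[int, list[str]] = {}
for path, label in zip(paths, labels):
    if label == -1:
        continue  # DBSCAN noise: the file is left where it is
    clusters.setdefault(label, []).append(path)

# Each cluster becomes a candidate folder; the LLM names it, and the plan
# is previewed as a dry run before any file is actually moved.
for label, members in clusters.items():
    print(f"cluster {label}: {len(members)} files")
```

DBSCAN suits this job because it never forces a file into a folder: anything that doesn't sit near a dense cluster is labelled noise and stays put.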
- 05
Integrated a full voice pipeline in the Copilot Service (port 8300): faster-whisper (CTranslate2) for offline STT, Piper TTS for local text-to-speech, and a streaming LLM chat loop calling Qwen3 4B through an OpenAI-compatible ngrok endpoint. Added the Knowledge Gap Service (port 8000) with 5 APScheduler-driven detection strategies: topic sparsity, incomplete plans, unresolved questions, abandoned topics, and decisions without justification.
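A condensed sketch of one voice turn, assuming the standard faster-whisper and openai-python APIs plus Piper's CLI; the model names, ngrok URL, and voice file are placeholders:

```python
import subprocess

from faster_whisper import WhisperModel
from openai import OpenAI

stt = WhisperModel("small", device="cpu", compute_type="int8")  # CTranslate2
llm = OpenAI(base_url="https://example.ngrok.app/v1", api_key="ollama")

def voice_turn(wav_path: str) -> str:
    # 1. Offline STT
    segments, _info = stt.transcribe(wav_path)
    user_text = " ".join(seg.text.strip() for seg in segments)

    # 2. Chat completion against the shared Qwen3 host (tag is a placeholder)
    reply = llm.chat.completions.create(
        model="qwen3:4b",
        messages=[{"role": "user", "content": user_text}],
    ).choices[0].message.content

    # 3. Local TTS: piper reads text on stdin and writes a wav file
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx",
         "--output_file", "reply.wav"],
        input=reply.encode(), check=True,
    )
    return reply
```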
The Tauri desktop overlay (Next.js 6-panel UI) calls the 5 FastAPI services directly. Retrieval (8100) is the data backbone: it ingests files/URLs, generates CLIP/BLIP/Whisper embeddings locally, upserts vectors to XQdrant and metadata to MongoDB, and maintains a similarity graph. Search (8400) queries the shared XQdrant index with CLIP-embedded queries and optionally runs Qwen3 agent reasoning. Organisation (8200) fetches vectors and metadata, runs DBSCAN clustering, presents a dry-run plan, then applies confirmed moves. Copilot (8300) runs faster-whisper STT → Qwen3 chat → Piper TTS locally. Knowledge Gap (8000) uses APScheduler to scan MongoDB every 720 minutes (12 hours) for sparse, abandoned, or unresolved topics. Qwen3 4B runs via Ollama on a local machine and is exposed through an OpenAI-compatible endpoint over ngrok, so all services share one model host without moving data off-device. Prometheus metrics and custom observability hooks are wired into every service.
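A sketch of the scheduled gap scan using APScheduler and pymongo; the collection names and query are illustrative guesses at the schema, with only the 720-minute interval and the strategy names taken from the design above:

```python
from apscheduler.schedulers.blocking import BlockingScheduler
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["moufida"]  # assumed DB name

def scan_for_gaps() -> None:
    # One of the five strategies: unresolved questions, i.e. timeline
    # entries flagged as questions that never received an answer event.
    open_questions = db.timeline.count_documents(
        {"type": "question", "resolved": False}  # assumed schema
    )
    if open_questions:
        db.knowledge_gaps.insert_one(
            {"strategy": "unresolved_questions", "count": open_questions}
        )

# A standalone script blocks here; inside the FastAPI service a
# BackgroundScheduler would run alongside the request handlers instead.
scheduler = BlockingScheduler()
scheduler.add_job(scan_for_gaps, "interval", minutes=720)  # every 12 hours
scheduler.start()
```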
Transparent desktop overlay with 6 panels: Agents, Copilot, Files, Graph, Insights, Settings
Multimodal embedding pipeline — text/image/audio/video all embedded in shared CLIP space, locally
Modified Qdrant with `score_explanation` field for explainable vector similarity decomposition
Scikit-learn DBSCAN (eps=0.35) over vector embeddings for natural-language file organisation plans
Local STT via CTranslate2 Whisper and local TTS via Piper — fully offline voice pipeline
Local Ollama model exposed through an OpenAI-compatible endpoint over ngrok for search reasoning, organisation, and chat
MongoDB stores multimodal metadata, the similarity graph, timeline events, chat history, and knowledge gaps
Knowledge Gap service runs 5 gap detection strategies every 720 minutes (12 hours) via APScheduler
Full working prototype delivered in a single hackathon sprint
Retrieval, Search, Organisation, Copilot, Knowledge Gap — all independent
All embeddings, STT, and TTS run on-device — zero cloud data transfer
Text, PDF, DOCX, image, audio, video, and HTML ingested in shared CLIP space
XQdrant `score_explanation` gives per-dimension vector similarity breakdown
Topic sparsity, incomplete plans, unresolved questions, abandoned topics, unjustified decisions
Building 5 independent services in 24 hours works only if each service has a clearly bounded responsibility from the start — we defined the port map and API contracts in the first 30 minutes, then each engineer worked in parallel.
XQdrant's score_explanation field is a major UX differentiator: showing users *why* a document was retrieved (which dimensions contributed to the score) builds trust that 'smart' keyword search never could.
Qwen3 4B punches well above its weight class: it holds its own on benchmarks against models 2–3× its size on reasoning, instruction-following, and multilingual tasks. Running it locally via Ollama gave us near-instant cold-start and effectively zero marginal cost per query, which matters when 5 services are all calling it in parallel.
We initially tried to upgrade to Qwen3.5 VL for native vision-language understanding (directly describing or reasoning about ingested images without needing separate BLIP captioning). It turned out Ollama doesn't yet support the VL architecture introduced in the Qwen3.5 visual series — the multimodal projector layers aren't mapped in the GGUF backend yet — so we fell back to our BLIP-in-CLIP pipeline, which worked well enough for the 24h scope.
faster-whisper + Piper TTS is the fastest path we found to a fully local voice pipeline: faster-whisper runs on CTranslate2, Piper runs on ONNX, neither needs an internet connection, and combined latency (STT + TTS round-trip) was under 2 seconds on commodity hardware.
Knowledge gap detection via scheduled MongoDB analysis is underrated — flagging 'abandoned topics' and 'unresolved questions' automatically surfaces blind spots users didn't even know they had.