Why
Nothing leaves the machine unless a tool explicitly reaches the internet. That constraint — no cloud, no API keys — was the design brief, and an excuse to learn the local-AI stack hands-on, from voice activity detection to vector memory.
What it does
- Conversation — type or talk from any browser on the LAN; a local LLM (Ollama) replies, spoken aloud by a local Piper voice.
- Memory — long-term recall across sessions via Mem0 + ChromaDB with local embeddings.
- Tools — web search, weather, news, maps, YouTube Music — all server-side, all keyless.
- The orb — a Canvas-rendered ferrofluid orb driven by a Web Audio analyser, pulsing with the synthesized speech.
How the system works
- Browsermic + chat, WebSocket to server
- STTfaster-whisper, int8 on CPU
- LLMOllama, resident model, tool calls
- MemoryMem0 + ChromaDB embeddings
- TTSPiper ONNX → MP3 stream → orb
The pipeline is fully streaming and async (FastAPI + asyncio): tokens, audio chunks and orb-amplitude data flow over one WebSocket. The hard part is latency on consumer hardware — keeping the model resident, quantizing STT, and streaming TTS so the orb starts speaking before the sentence is even finished generating.