Named after Arsene Lupin, the gentleman thief. More about the AF part when the v0.1.6 branch lands in mid-to-late April.
A voice-first AI agent platform that closes the voice loop from browser UI through agent execution into developer tooling and back -- with Bayesian trust learning, fine-tuned intent routing, and solution caching built in.
FastAPI | Voice I/O | PEFT/LoRA | LanceDB | Claude Agent SDK | Bayesian Trust | MCP Protocol
Current version: v0.1.6 | License: Apache 2.0
Every agentic AI platform needs human oversight. Most implement it as a modal dialog: click approve, type feedback, wait. Lupin takes a fundamentally different approach -- voice-first human-in-the-loop.
Agents speak to you. You speak back. A Bayesian trust engine learns your preferences over time, escalating only when confidence is low and auto-approving when it has earned your trust. The result: human oversight that works from across the room, while you're multitasking, or even from your phone -- no screen required.
This is the missing piece in agentic AI: not just making agents smarter, but making human oversight effortless.
Talk to the computer, and it tells you, or does, something useful.
Currently, AI agents and chatbots are slow and expensive. They make silly mistakes. They're forgetful. And they work too hard reinventing the wheel.
Even the simplest vox-in and vox-out UX -- especially when coupled with agentic behaviors -- is hard. It's asynchronous, and usually frustratingly slow. It's a new way of interacting with computers, which requires a global rethinking of how the different UI control and display modalities interact.
Fine-tune small models for cheap, fast intent routing -- not prompt engineering, actual PEFT/LoRA fine-tuning. Escalate to frontier models only when complexity demands it. Cache solutions via vector search so agents never solve the same problem twice. Layer Bayesian trust learning so the system earns autonomy over time, minimizing human interruptions without sacrificing oversight. And voice-enable everything -- from the browser UI, through agent execution, into Claude Code developer sessions via 6 system hooks and an MCP voice server, and back again.
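The tiered flow described above can be sketched roughly as follows. This is an illustrative outline only, not Lupin's actual code: `RouteResult`, `handle_utterance`, the four callables, and the `min_confidence` threshold are all hypothetical names, and using router confidence as the escalation signal is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RouteResult:
    intent: str        # predicted command intent label
    confidence: float  # router confidence in [0, 1]


def handle_utterance(
    text: str,
    local_router: Callable[[str], RouteResult],
    cache_lookup: Callable[[str], Optional[str]],
    run_agent: Callable[[str, str], str],
    frontier_model: Callable[[str], str],
    min_confidence: float = 0.8,  # hypothetical escalation threshold
) -> str:
    # 1. Classify the request with the small, locally hosted router.
    route = local_router(text)

    # 2. A previously solved question is answered straight from the cache.
    cached = cache_lookup(text)
    if cached is not None:
        return cached

    # 3. Escalate to the expensive model only when the router is unsure.
    if route.confidence < min_confidence:
        return frontier_model(text)

    # 4. Otherwise dispatch to the agent registered for this intent.
    return run_agent(route.intent, text)
```

The ordering is the point of the design: the cheapest options (local routing, cache) are tried first, so the frontier model is only paid for on the requests that need it.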
flowchart TD
subgraph Input
MIC["Microphone"] --> ASR["ASR (Whisper)"]
TEXT["Text Input"] --> ROUTER
end
ASR --> ROUTER["Intent Router<br/>(PEFT/LoRA fine-tuned)"]
ROUTER --> SNAP{"Solution Snapshot<br/>Lookup (LanceDB)"}
SNAP -- "Cache Hit" --> TTS["TTS Output"]
SNAP -- "Cache Miss" --> CJ["CJ Flow Queue"]
subgraph CJ Flow
CJ --> SYNC["Sync Agents<br/>Math · Calendar · Calculator<br/>CRUD · Weather · DateTime"]
CJ --> ASYNC["Async Agents<br/>Deep Research · Podcast<br/>SWE Team · Claude Code"]
end
SYNC --> TTS
ASYNC --> PROXY["Decision Proxy<br/>(Bayesian Trust · L1-L5)"]
PROXY --> TTS
TTS --> WS["WebSocket<br/>(queue + audio channels)"]
WS --> BROWSER["Browser UI"]
subgraph Claude Code Voice Loop
HOOKS["6 System Hooks<br/>(PreToolUse · PostToolUse · Notification<br/>Stop · PermissionRequest · UserPromptSubmit)"] --> MCP["cosa-voice<br/>MCP Server"]
MCP --> ROUTER
end
style HOOKS fill:#f9f,stroke:#333,stroke-width:2px
style MCP fill:#f9f,stroke:#333,stroke-width:2px
Voice flows end-to-end: browser microphone through agent execution into Claude Code sessions and back via dual-channel WebSocket audio streaming.
17 specialized agents -- from sub-second sync responders to long-running autonomous research pipelines -- all routed through fine-tuned small models and unified by a single voice-first queue system.
| Agent | Purpose |
|---|---|
| MathAgent | Symbolic math via LLM |
| CalendarAgent | Date-aware scheduling |
| DateTimeAgent | Time queries and conversions |
| WeatherAgent | Weather lookups |
| TodoListAgent | Persistent task management |
| CalculatorAgent | Natural language calculator (508 LoRA templates), MathAgent fallback |
| CRUDAgent | Voice-controlled DataFrame create/read/update/delete |
| ReceptionistAgent | Top-level intent router |
| RuntimeArgumentExpeditor | LLM-powered gap analysis -- asks for missing arguments via voice |
| Agent | Purpose |
|---|---|
| DeepResearchAgent | Background research with automatic report generation |
| PodcastGeneratorAgent | Convert documents to audio podcast format |
| ResearchToPodcastAgent | Chained research-to-podcast pipeline |
| PresentationGeneratorAgent | Multi-phase pipeline: outline → elaborate → render → deliver (Phases 1-8) |
| ResearchToPresentationAgent | Chained research-to-presentation pipeline |
| ClaudeCodeAgent | Claude Agent SDK tasks (BOUNDED or INTERACTIVE mode) |
| SWETeamAgent | 4-phase dev team: Lead, Coder, Tester, Trust Proxy |
| Agent | Purpose |
|---|---|
| BugFixExpediter (BFE) | Dead-job auto-recovery: diagnose → propose → fix → git → retry |
| TestFixExpediter (TFE) | Test-failure auto-fix: cluster → diagnose → propose → fix → git → rerun |
| TestSuiteJob | Scheduled test-suite runs via CJ Flow with watchdog-triggered TFE handoff |
| Agent | Purpose |
|---|---|
| NotificationProxyAgent | Phi-4 fuzzy script matching for automated interactive testing |
| DecisionProxyAgent | Universal Prediction Engine (7 slices) · Bayesian Beta-Bernoulli trust · Thompson Sampling · Conformal prediction · L1-L5 escalation · Circuit breaker |
No other platform closes the voice loop this completely:
- Browser to agents: Dual-channel WebSocket architecture (queue events + audio streaming) with ASR (Whisper) to TTS pipeline, end to end
- Agents to developer tools: 6 Claude Code system hooks (PreToolUse, PostToolUse, Notification, Stop, PermissionRequest, UserPromptSubmit) bridge voice into every coding session
- Developer tools back to browser: cosa-voice MCP server provides 5 voice tools (notify, converse, ask_yes_no, ask_multiple_choice, ask_open_ended_batch)
- Session continuity: Stable session IDs survive context clears via write-once atomic lockfile -- no identity drift
- Stop hook gisting: Ultra-short TTS summaries of completed work via frontier model distillation
- Voice injection: tmux-based voice input into idle Claude Code sessions -- speak and it types
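One common way to get a write-once atomic lockfile like the one mentioned under session continuity is to write the ID to a temporary file and hard-link it into place. The sketch below shows that general technique under stated assumptions; it is not taken from the Lupin codebase, and `get_or_create_session_id` and the file naming are hypothetical.

```python
import os
import uuid
from pathlib import Path


def get_or_create_session_id(lock_path: Path) -> str:
    """Return a stable session ID; the lockfile is written at most once."""
    if lock_path.exists():
        # Fast path: an ID was already published, so reuse it unchanged.
        return lock_path.read_text(encoding="utf-8").strip()

    session_id = uuid.uuid4().hex
    # Write the full ID to a private temp file first, so readers never
    # observe a partially written lockfile.
    tmp_path = lock_path.with_name(f"{lock_path.name}.{os.getpid()}.{session_id}.tmp")
    tmp_path.write_text(session_id, encoding="utf-8")
    try:
        # os.link is atomic and fails if the target exists: one caller wins.
        os.link(tmp_path, lock_path)
    except FileExistsError:
        # Another process published first; adopt its ID instead of ours.
        session_id = lock_path.read_text(encoding="utf-8").strip()
    finally:
        tmp_path.unlink()
    return session_id
```

Because the lockfile is never rewritten after it is created, a later call (for example after a context clear) resolves to the same ID as the first one.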
While most platforms route via system prompts or keyword matching, Lupin fine-tunes:
- 39,871 training examples across 35 command intents
- PEFT/LoRA on Phi-4, Qwen, and Llama -- local GPU inference, zero API calls for routing
- Sub-second classification with GSM8K-validated post-quantization math reasoning
- Result: routing that is faster, cheaper, and more reliable than prompt-based alternatives
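For readers new to LoRA, the reason this kind of fine-tuning is cheap can be stated in one line. The base weight matrix $W_0$ stays frozen and only a low-rank update is trained (standard LoRA formulation; the symbols are generic, not Lupin-specific settings):

$$
W = W_0 + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
$$

The trainable parameter count per adapted matrix drops from $dk$ to $r(d + k)$. As a purely illustrative example (these dimensions are assumptions, not the project's configuration), $d = k = 4096$ and $r = 16$ gives $16 \times 8192 = 131{,}072$ trainable parameters against $16{,}777{,}216$ in the full matrix, which is about 0.78%.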
When an agent solves a problem, the solution is embedded and cached in LanceDB. Next time the same (or similar) question arrives, the answer comes from vector search -- not from re-running the agent.
| Operation | File-Based | LanceDB | Speedup |
|---|---|---|---|
| Search (exact) | 96 ms | 0.1 ms | 960x |
| Add snapshot | 827 ms | 15 ms | 55x |
| Search (fuzzy) | 120 ms | 0.3 ms | 400x |
Local GPU embeddings (CodeRankEmbed + nomic-embed-text-v1.5) vs OpenAI API:
| Operation | Content | Local GPU | OpenAI API | Speedup |
|---|---|---|---|---|
| Single embed | prose | 164 ms | 1,146 ms | 7x |
| Single embed | code | 70 ms | 1,211 ms | 17x |
| Batch (3) | prose | 8 ms | 2,989 ms | 374x |
| Batch (3) | code | 8 ms | 3,183 ms | 398x |
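The snapshot lookup idea (embed the question, return a stored solution when a sufficiently similar one exists) can be illustrated with a small in-memory stand-in. This sketch deliberately avoids the LanceDB API and uses a linear scan; `SnapshotCache` and the `min_similarity` threshold are hypothetical, and the embeddings are assumed to come from whichever embedding model is in use.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SnapshotCache:
    """Toy in-memory stand-in for a vector-backed solution cache."""

    def __init__(self, min_similarity: float = 0.9) -> None:
        self.min_similarity = min_similarity  # hypothetical hit threshold
        self._entries: list[tuple[list[float], str]] = []

    def add(self, embedding: Sequence[float], solution: str) -> None:
        self._entries.append((list(embedding), solution))

    def lookup(self, embedding: Sequence[float]) -> Optional[str]:
        # Find the closest stored question; a real vector store would use
        # an index instead of this linear scan.
        best_score, best_solution = 0.0, None
        for stored, solution in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_solution = score, solution
        # Only count it as a cache hit when the match is close enough.
        return best_solution if best_score >= self.min_similarity else None
```

A `None` result corresponds to the cache-miss branch, where the request is handed to an agent and the new solution can be added afterwards.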
The first decision proxy for AI agents with academic-grade statistical rigor:
- Universal Prediction Engine: 7 prediction slices with 87 unit tests and 21 end-to-end tests
- Bayesian Beta-Bernoulli trust model: Per-agent trust learning with conjugate prior updates
- Thompson Sampling: Exploration-exploitation balance for when to auto-approve vs. escalate
- Conformal prediction: Calibrated prediction sets with statistical coverage guarantees -- not heuristic confidence scores
- LanceDB-backed preference embeddings: Semantic similarity with response_type filtering
- L1-L5 trust escalation: Five trust levels from "always ask" to "full autonomy" with circuit breaker pattern
- Morning coffee batch review: Non-urgent decisions queued for human review at your convenience
- Ratification API: Post-hoc approval with trust feedback loop
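To make the Beta-Bernoulli and Thompson Sampling bullets concrete, here is a minimal sketch of how the two fit together. It is a generic illustration, not the Decision Proxy's implementation: `BetaTrust`, `should_auto_approve`, the uniform prior, and the `threshold` value are all assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class BetaTrust:
    """Beta posterior over the probability that the human would approve."""

    alpha: float = 1.0  # prior pseudo-count of approvals (uniform prior)
    beta: float = 1.0   # prior pseudo-count of rejections

    def update(self, approved: bool) -> None:
        # Conjugate update: each observed decision increments one count.
        if approved:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def sample(self, rng: random.Random) -> float:
        # Thompson Sampling step: draw one plausible approval rate.
        return rng.betavariate(self.alpha, self.beta)


def should_auto_approve(
    trust: BetaTrust, rng: random.Random, threshold: float = 0.9
) -> bool:
    # Auto-approve only if the sampled approval rate clears the threshold
    # (hypothetical value); otherwise escalate to the human.
    return trust.sample(rng) >= threshold
```

With few observations the posterior is wide, so samples often fall below the threshold and the system keeps asking. As approvals accumulate the posterior narrows near 1 and escalations become rare, which is the "earned autonomy" behavior described above.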
| Suite | Count | Coverage |
|---|---|---|
| Unit tests | 3,549+ | Core logic, trust engine, hooks, credentials, prediction engine, agentic orchestrators |
| WebSocket tests | 50 | Connection, auth, event routing, session management |
| Integration tests | 228+ | End-to-end API workflows against dedicated dual-container test server |
| E2E UI (Playwright) | 357+ | Full browser-driven flows including 12-page visual regression |
| Interactive proxy tests | 12 scenarios | Calculator, CRUD, and Expediter agents via auto-proxy |
Built and maintained by a single engineer. Every PR must pass all five tiers before merge.
# Prerequisites: Python 3.11+, GPU recommended, PostgreSQL
export LUPIN_ROOT=/path/to/lupin
# Configure credentials
src/scripts/lupin_config.py init
# Start the server
src/scripts/run-fastapi-lupin.sh # FastAPI on port 7999
src/scripts/run-lupin-gui.sh # Browser GUI client
# Run tests
pytest src/tests/unit/ # 3,549+ unit tests
src/scripts/run-websocket-smoke-tests.sh # 50 WebSocket tests
src/tests/run-integration-tests.sh --bg -v # Integration gate (dual-container, :8000)
src/scripts/run-e2e-ui-tests.sh --bg -v # 357+ Playwright tests incl. visual regression
# Install cosa-voice MCP server (for Claude Code voice I/O)
claude mcp add cosa-voice -- python ${LUPIN_ROOT}/src/lupin_mcp/cosa_voice_mcp.py
Config: src/conf/lupin-app.ini | Docker: docker build -f docker/lupin/Dockerfile . | GSM8K: src/scripts/run-gsm8k.sh --help
- REST API Reference — all HTTP and WebSocket endpoints
- WebSocket Architecture — dual-session design and event system
- Notification API — comprehensive notification reference with Mermaid diagrams
- CJ Flow Packaging Guide — how to add new QueueableJob types
- cosa-voice MCP Server — MCP server setup and tool reference
- Agentic Voice Workflow — building new agents with voice I/O
Bug Fix Expediter (dead-job auto-recovery), Test Fix Expediter (test-failure auto-fix), and the TestSuiteJob scheduler share a common foundation in src/cosa/agents/shared/. See the Agents subsystem documentation for the full subsystem:
- Bug Fix Expediter Guide — diagnose → propose → fix → git → retry pipeline
- Test Fix Expediter Guide — cluster → diagnose → propose → fix → git → rerun pipeline
- Test-Suite Scheduling Guide — TestSuiteJob + /schedule-tests skill
- Shared Fix Primitives Reference — PlanWriter, GitStrategist, FixExecutor
- Decision Proxy Admin Guide — Trust Dashboard and ratification how-to
- Automated Interactive Testing — proxy auto-answer testing guide
- WebSocket Troubleshooting — common issues and debugging procedures
Over 130 dated planning and research documents in src/rnd/.
Codebase metrics: Lupin parent vs CoSA comparison — 2026-04-12 snapshot of LoC distribution with mermaid diagram, 60/40 Python split, docstring-ratio observations, and operational implications of the CoSA-never-commit rule.
v0.1.6 (April 2026) — Presentation Generator agent (multi-phase outline → elaborate → render → deliver, 8 phases). CJ Flow persistence: PostgreSQL write-through for todo/running/done queues with startup recovery, timed execution + monopolize + pause flags, and Job History UI (5th collapsible section with time-window filter). Auto-recovery agent family: Bug Fix Expediter and Test Fix Expediter with Claude Agent SDK worktree isolation and Resume-with-overrides UI. Playwright E2E suite expanded from ~100 to 357 tests across 8 phases, including 12-page visual regression with deterministic font rendering. Dual-container test architecture (lupin-rest-test on :8000). set_session_topic() MCP tool for stop-hook context. Graceful STT degradation (server starts without GPU). Claude Agent SDK config migration to INI keys. 3,549+ unit tests.
v0.1.5 (March 2026) — Voice-first human-in-the-loop. Full voice loop inside Claude Code via 6 system hooks + cosa-voice MCP. Trust-aware Decision Proxy with Universal Prediction Engine, Bayesian Beta-Bernoulli trust, Thompson Sampling, and conformal prediction. Credential consolidation. Stable session identity architecture. 2,075+ tests.
v0.1.4 — cosa-voice MCP server, SWE Team Agent, Calculator Agent, CRUD Agent, Notification Proxy, unit tests expanded from 881 to 1,170, 39,871 training examples, local GPU embeddings
v0.1.3 — CJ Flow agentic job system, Deep Research + Podcast agents, Claude Agent SDK integration, JWT WebSocket auth, 100% test coverage
Lupin is an active research platform at v0.1.6. Developed by a solo engineer, it combines voice-first agent orchestration, PEFT fine-tuning, and Bayesian decision theory into a production-grade stack backed by 4,180+ automated tests across five tiers (unit, WebSocket, integration, Playwright E2E, interactive proxy), full CI discipline, and a FastAPI + PostgreSQL + LanceDB architecture. Through a series of ambitious refactorings made possible by Claude Code and the Planning is Prompting methodology, Lupin has evolved from single-user PoC sketches into a multi-user platform entering GCP testing.