deepily/lupin

Lupin AF

Named after Arsene Lupin, the gentleman thief. More about the AF part when the v0.1.6 branch lands in mid-to-late April.

A voice-first AI agent platform that closes the voice loop from browser UI through agent execution into developer tooling and back -- with Bayesian trust learning, fine-tuned intent routing, and solution caching built in.

FastAPI | Voice I/O | PEFT/LoRA | LanceDB | Claude Agent SDK | Bayesian Trust | MCP Protocol

Current version: v0.1.6 | License: Apache 2.0


Human in the loop, reimagined

Every agentic AI platform needs human oversight. Most implement it as a modal dialog: click approve, type feedback, wait. Lupin takes a fundamentally different approach -- voice-first human-in-the-loop.

Agents speak to you. You speak back. A Bayesian trust engine learns your preferences over time, escalating only when confidence is low and auto-approving when it has earned your trust. The result: human oversight that works from across the room, while you're multitasking, or even from your phone -- no screen required.

This is the missing piece in agentic AI: not just making agents smarter, but making human oversight effortless.


The dream

Talk to the computer, and it tells you, or does, something useful.

The problem

Currently, AI agents and chatbots are slow and expensive. They make silly mistakes. They're forgetful. And they work too hard reinventing the wheel.

What most people probably don't realize

Even the simplest vox-in and vox-out UX -- especially when coupled with agentic behaviors -- is hard. It's asynchronous, and usually frustratingly slow. It's a new way of interacting with computers, one that requires rethinking from the ground up how the different UI control and display modalities interact.

Lupin's approach

Fine-tune small models for cheap, fast intent routing -- not prompt engineering, actual PEFT/LoRA fine-tuning. Escalate to frontier models only when complexity demands it. Cache solutions via vector search so agents never solve the same problem twice. Layer Bayesian trust learning so the system earns autonomy over time, minimizing human interruptions without sacrificing oversight. And voice-enable everything -- from the browser UI, through agent execution, into Claude Code developer sessions via 6 system hooks and an MCP voice server, and back again.
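
A minimal sketch of the route-then-escalate idea above, assuming the small router returns an intent label together with a confidence score. The names `RouteResult`, `route_locally`, and `CONFIDENCE_FLOOR`, as well as the keyword-based stand-in classifier, are hypothetical and not taken from the Lupin codebase; a real deployment would call the fine-tuned model where the stand-in is.

```python
from dataclasses import dataclass

# Hypothetical threshold: below this, the local router's answer is not trusted.
CONFIDENCE_FLOOR = 0.80


@dataclass
class RouteResult:
    intent: str
    confidence: float
    handled_by: str  # "local" or "frontier"


def route_locally(utterance: str) -> tuple[str, float]:
    """Stand-in for a fine-tuned small model: returns (intent, confidence)."""
    text = utterance.lower()
    if "weather" in text:
        return "weather", 0.97
    if "what time" in text:
        return "datetime", 0.93
    return "unknown", 0.30


def route(utterance: str) -> RouteResult:
    """Use the cheap local router first; escalate only when it is unsure."""
    intent, confidence = route_locally(utterance)
    if confidence >= CONFIDENCE_FLOOR:
        return RouteResult(intent, confidence, handled_by="local")
    # Placeholder for a frontier-model call; no real API is invoked here.
    return RouteResult("needs_frontier_model", confidence, handled_by="frontier")
```

With these stand-ins, a weather question stays local, while an utterance the router does not recognize is marked for escalation.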


Architecture

flowchart TD
    subgraph Input
        MIC["Microphone"] --> ASR["ASR (Whisper)"]
        TEXT["Text Input"] --> ROUTER
    end

    ASR --> ROUTER["Intent Router<br/>(PEFT/LoRA fine-tuned)"]

    ROUTER --> SNAP{"Solution Snapshot<br/>Lookup (LanceDB)"}
    SNAP -- "Cache Hit" --> TTS["TTS Output"]
    SNAP -- "Cache Miss" --> CJ["CJ Flow Queue"]

    subgraph CJ Flow
        CJ --> SYNC["Sync Agents<br/>Math · Calendar · Calculator<br/>CRUD · Weather · DateTime"]
        CJ --> ASYNC["Async Agents<br/>Deep Research · Podcast<br/>SWE Team · Claude Code"]
    end

    SYNC --> TTS
    ASYNC --> PROXY["Decision Proxy<br/>(Bayesian Trust · L1-L5)"]
    PROXY --> TTS

    TTS --> WS["WebSocket<br/>(queue + audio channels)"]
    WS --> BROWSER["Browser UI"]

    subgraph Claude Code Voice Loop
        HOOKS["6 System Hooks<br/>(PreToolUse · PostToolUse · Notification<br/>Stop · PermissionRequest · UserPromptSubmit)"] --> MCP["cosa-voice<br/>MCP Server"]
        MCP --> ROUTER
    end

    style HOOKS fill:#f9f,stroke:#333,stroke-width:2px
    style MCP fill:#f9f,stroke:#333,stroke-width:2px

Voice flows end-to-end: browser microphone through agent execution into Claude Code sessions and back via dual-channel WebSocket audio streaming.
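
The flowchart above can be read as a small decision procedure. The sketch below traces which stages a routed request passes through; the stage names and lowercase intent keys are illustrative labels derived from the diagram, not identifiers from the Lupin source.

```python
# Agent groupings copied from the flowchart labels; the keys are illustrative.
SYNC_INTENTS = {"math", "calendar", "calculator", "crud", "weather", "datetime"}
ASYNC_INTENTS = {"deep_research", "podcast", "swe_team", "claude_code"}


def trace_path(intent: str, cache_hit: bool) -> list[str]:
    """Return the sequence of flowchart stages a routed request passes through."""
    path = ["router", "snapshot_lookup"]
    if cache_hit:
        # A cached solution skips agent execution entirely.
        return path + ["tts", "websocket", "browser"]
    path.append("cj_flow_queue")
    if intent in SYNC_INTENTS:
        path.append("sync_agent")
    elif intent in ASYNC_INTENTS:
        # Only async agents pass through the decision proxy in the diagram.
        path += ["async_agent", "decision_proxy"]
    else:
        raise ValueError(f"no agent registered for intent {intent!r}")
    return path + ["tts", "websocket", "browser"]
```

For example, `trace_path("podcast", cache_hit=False)` includes the decision proxy, while `trace_path("math", cache_hit=False)` does not.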


Agent ecosystem

17 specialized agents -- from sub-second sync responders to long-running autonomous research pipelines -- all routed through fine-tuned small models and unified by a single voice-first queue system.

Synchronous agents (respond in <1s via PEFT routing)

| Agent | Purpose |
| --- | --- |
| MathAgent | Symbolic math via LLM |
| CalendarAgent | Date-aware scheduling |
| DateTimeAgent | Time queries and conversions |
| WeatherAgent | Weather lookups |
| TodoListAgent | Persistent task management |
| CalculatorAgent | Natural language calculator (508 LoRA templates), MathAgent fallback |
| CRUDAgent | Voice-controlled DataFrame create/read/update/delete |
| ReceptionistAgent | Top-level intent router |
| RuntimeArgumentExpeditor | LLM-powered gap analysis -- asks for missing arguments via voice |

Long-running agents (async via CJ Flow queue)

| Agent | Purpose |
| --- | --- |
| DeepResearchAgent | Background research with automatic report generation |
| PodcastGeneratorAgent | Convert documents to audio podcast format |
| ResearchToPodcastAgent | Chained research-to-podcast pipeline |
| PresentationGeneratorAgent | Multi-phase pipeline: outline → elaborate → render → deliver (Phases 1-8) |
| ResearchToPresentationAgent | Chained research-to-presentation pipeline |
| ClaudeCodeAgent | Claude Agent SDK tasks (BOUNDED or INTERACTIVE mode) |
| SWETeamAgent | 4-phase dev team: Lead, Coder, Tester, Trust Proxy |

Auto-recovery agents (self-healing via Claude Agent SDK + worktree isolation)

| Agent | Purpose |
| --- | --- |
| BugFixExpediter (BFE) | Dead-job auto-recovery: diagnose → propose → fix → git → retry |
| TestFixExpediter (TFE) | Test-failure auto-fix: cluster → diagnose → propose → fix → git → rerun |
| TestSuiteJob | Scheduled test-suite runs via CJ Flow with watchdog-triggered TFE handoff |

Infrastructure agents

| Agent | Purpose |
| --- | --- |
| NotificationProxyAgent | Phi-4 fuzzy script matching for automated interactive testing |
| DecisionProxyAgent | Universal Prediction Engine (7 slices) · Bayesian Beta-Bernoulli trust · Thompson Sampling · Conformal prediction · L1-L5 escalation · Circuit breaker |

Key capabilities

Voice-first everywhere -- browser to agents to developer tooling

No other platform closes the voice loop this completely:

  • Browser to agents: Dual-channel WebSocket architecture (queue events + audio streaming) with ASR (Whisper) to TTS pipeline, end to end
  • Agents to developer tools: 6 Claude Code system hooks (PreToolUse, PostToolUse, Notification, Stop, PermissionRequest, UserPromptSubmit) bridge voice into every coding session
  • Developer tools back to browser: cosa-voice MCP server provides 5 voice tools (notify, converse, ask_yes_no, ask_multiple_choice, ask_open_ended_batch)
  • Session continuity: Stable session IDs survive context clears via write-once atomic lockfile -- no identity drift
  • Stop hook gisting: Ultra-short TTS summaries of completed work via frontier model distillation
  • Voice injection: tmux-based voice input into idle Claude Code sessions -- speak and it types
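
One common way to get a write-once identifier like the session ID described above is exclusive file creation. The sketch below is an assumption about how such a lockfile could work, not Lupin's actual implementation; the function name and file layout are hypothetical.

```python
import os
import tempfile
import uuid


def get_or_create_session_id(lock_path: str) -> str:
    """Return the ID stored at lock_path, creating it exactly once."""
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        # Someone already claimed the ID; reuse it instead of minting a new one.
        with open(lock_path, encoding="utf-8") as fh:
            return fh.read().strip()
    session_id = uuid.uuid4().hex
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(session_id)
    return session_id


# Usage: the second call simulates a restart after a context clear.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "session.lock")
    first = get_or_create_session_id(path)
    second = get_or_create_session_id(path)
    assert first == second
```

This simple version leaves a short window in which a concurrent reader could see the file before the ID is written. A stricter variant would write the ID to a temporary file and hard-link it into place.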

Intent routing via fine-tuned small models -- not prompt engineering

While most platforms route via system prompts or keyword matching, Lupin fine-tunes:

  • 39,871 training examples across 35 command intents
  • PEFT/LoRA on Phi-4, Qwen, and Llama -- local GPU inference, zero API calls for routing
  • Sub-second classification with GSM8K-validated post-quantization math reasoning
  • Result: routing that is faster, cheaper, and more reliable than prompt-based alternatives
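
To illustrate why LoRA keeps fine-tuning cheap: it freezes a dense weight matrix W of shape d × k and trains only a low-rank update BA, where B is d × r and A is r × k. The sketch below does just the parameter arithmetic; the layer size and rank are illustrative and do not reflect Lupin's actual training configuration.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Parameters in the low-rank factors B (d x r) and A (r x k)."""
    return r * (d + k)


def full_params(d: int, k: int) -> int:
    """Parameters in the frozen dense weight matrix W (d x k)."""
    return d * k


# Illustrative numbers only: one 4096 x 4096 projection with rank 16.
d, k, r = 4096, 4096, 16
full = full_params(d, k)               # 16,777,216
lora = lora_trainable_params(d, k, r)  # 131,072
fraction = lora / full                 # 0.0078125, i.e. under 1% of the layer
```

At this rank, the trainable factors are 1/128 the size of the layer they adapt, which is what makes fine-tuning on a local GPU practical.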

Solution snapshot memory -- agents that learn from their own work

When an agent solves a problem, the solution is embedded and cached in LanceDB. Next time the same (or similar) question arrives, the answer comes from vector search -- not from re-running the agent.

| Operation | File-Based | LanceDB | Speedup |
| --- | --- | --- | --- |
| Search (exact) | 96 ms | 0.1 ms | 960x |
| Add snapshot | 827 ms | 15 ms | 55x |
| Search (fuzzy) | 120 ms | 0.3 ms | 400x |
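
The lookup-before-execute pattern can be sketched with a toy in-memory store. This is not the LanceDB API; the `SnapshotCache` class, the cosine-similarity scoring, and the 0.95 threshold are assumptions made for illustration.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SnapshotCache:
    """Toy in-memory stand-in for a vector store; not the LanceDB API."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold  # hypothetical similarity cutoff
        self._entries: list[tuple[list[float], str]] = []

    def add(self, embedding: list[float], answer: str) -> None:
        self._entries.append((embedding, answer))

    def lookup(self, embedding: list[float]) -> Optional[str]:
        """Return the best cached answer if it clears the threshold, else None."""
        best_score, best_answer = 0.0, None
        for stored, answer in self._entries:
            score = cosine(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SnapshotCache()
cache.add([1.0, 0.0], "cached solution")
hit = cache.lookup([0.99, 0.05])   # near-duplicate question: served from cache
miss = cache.lookup([0.0, 1.0])    # unrelated question: falls through to an agent
```

A `None` result is the "Cache Miss" branch: the request goes to an agent, and its solution can then be added to the cache.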

Local GPU embeddings (CodeRankEmbed + nomic-embed-text-v1.5) vs OpenAI API:

| Operation | Content | Local GPU | OpenAI API | Speedup |
| --- | --- | --- | --- | --- |
| Single embed | prose | 164 ms | 1,146 ms | 7x |
| Single embed | code | 70 ms | 1,211 ms | 17x |
| Batch (3) | prose | 8 ms | 2,989 ms | 374x |
| Batch (3) | code | 8 ms | 3,183 ms | 398x |

Trust-aware decision proxy -- Bayesian autonomy that earns your confidence

The first decision proxy for AI agents with academic-grade statistical rigor:

  • Universal Prediction Engine: 7 prediction slices with 87 unit tests and 21 end-to-end tests
  • Bayesian Beta-Bernoulli trust model: Per-agent trust learning with conjugate prior updates
  • Thompson Sampling: Exploration-exploitation balance for when to auto-approve vs. escalate
  • Conformal prediction: Calibrated prediction sets with statistical coverage guarantees, rather than uncalibrated confidence scores
  • LanceDB-backed preference embeddings: Semantic similarity with response_type filtering
  • L1-L5 trust escalation: Five trust levels from "always ask" to "full autonomy" with circuit breaker pattern
  • Morning coffee batch review: Non-urgent decisions queued for human review at your convenience
  • Ratification API: Post-hoc approval with trust feedback loop
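
A minimal sketch of the Beta-Bernoulli trust update and a posterior-sampling approval decision, assuming a uniform Beta(1, 1) prior per agent. The `AgentTrust` class and the threshold-based decision rule are illustrative; the source does not specify how Lupin combines Thompson Sampling with its L1-L5 levels.

```python
import random
from dataclasses import dataclass


@dataclass
class AgentTrust:
    """Beta(alpha, beta) posterior over the probability that the human approves."""
    alpha: float = 1.0  # prior pseudo-count of approvals (uniform prior)
    beta: float = 1.0   # prior pseudo-count of rejections

    def update(self, approved: bool) -> None:
        # Conjugate update: a Bernoulli outcome adds one count to alpha or beta.
        if approved:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def should_auto_approve(self, threshold: float, rng: random.Random) -> bool:
        # Thompson-style decision: draw one plausible approval rate from the
        # posterior and auto-approve only if that draw clears the threshold.
        return rng.betavariate(self.alpha, self.beta) >= threshold


trust = AgentTrust()
for approved in [True] * 18 + [False] * 2:
    trust.update(approved)
# Posterior is now Beta(19, 3), with mean 19 / 22.
decision = trust.should_auto_approve(threshold=0.8, rng=random.Random(0))
```

Because the decision samples from the posterior rather than using its mean, an agent with little history will sometimes be escalated and sometimes auto-approved, which is the exploration-exploitation trade-off the bullet above refers to.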

Battle-tested -- 4,180+ automated tests

| Suite | Count | Coverage |
| --- | --- | --- |
| Unit tests | 3,549+ | Core logic, trust engine, hooks, credentials, prediction engine, agentic orchestrators |
| WebSocket tests | 50 | Connection, auth, event routing, session management |
| Integration tests | 228+ | End-to-end API workflows against dedicated dual-container test server |
| E2E UI (Playwright) | 357+ | Full browser-driven flows including 12-page visual regression |
| Interactive proxy tests | 12 scenarios | Calculator, CRUD, and Expediter agents via auto-proxy |

Built and maintained by a single engineer. Every PR must pass all five tiers before merge.


Quick start

# Prerequisites: Python 3.11+, GPU recommended, PostgreSQL
export LUPIN_ROOT=/path/to/lupin

# Configure credentials
src/scripts/lupin_config.py init

# Start the server
src/scripts/run-fastapi-lupin.sh          # FastAPI on port 7999
src/scripts/run-lupin-gui.sh              # Browser GUI client

# Run tests
pytest src/tests/unit/                     # 3,549+ unit tests
src/scripts/run-websocket-smoke-tests.sh   # 50 WebSocket tests
src/tests/run-integration-tests.sh --bg -v # Integration gate (dual-container, :8000)
src/scripts/run-e2e-ui-tests.sh --bg -v    # 357+ Playwright tests incl. visual regression

# Install cosa-voice MCP server (for Claude Code voice I/O)
claude mcp add cosa-voice -- python ${LUPIN_ROOT}/src/lupin_mcp/cosa_voice_mcp.py

Config: src/conf/lupin-app.ini | Docker: docker build -f docker/lupin/Dockerfile . | GSM8K: src/scripts/run-gsm8k.sh --help


Documentation

For developers

Agentic jobs, recovery & test scheduling

Bug Fix Expediter (dead-job auto-recovery), Test Fix Expediter (test-failure auto-fix), and the TestSuiteJob scheduler share a common foundation in src/cosa/agents/shared/. The Agents subsystem documentation covers the full subsystem.

For operators

R&D archive

Over 130 dated planning and research documents in src/rnd/.

Codebase metrics: Lupin parent vs CoSA comparison — 2026-04-12 snapshot of LoC distribution with mermaid diagram, 60/40 Python split, docstring-ratio observations, and operational implications of the CoSA-never-commit rule.


Version history

v0.1.6 (April 2026) — Presentation Generator agent (multi-phase outline → elaborate → render → deliver, 8 phases). CJ Flow persistence: PostgreSQL write-through for todo/running/done queues with startup recovery, timed execution + monopolize + pause flags, and Job History UI (5th collapsible section with time-window filter). Auto-recovery agent family: Bug Fix Expediter and Test Fix Expediter with Claude Agent SDK worktree isolation and Resume-with-overrides UI. Playwright E2E suite expanded from ~100 to 357 tests across 8 phases, including 12-page visual regression with deterministic font rendering. Dual-container test architecture (lupin-rest-test on :8000). set_session_topic() MCP tool for stop-hook context. Graceful STT degradation (server starts without GPU). Claude Agent SDK config migration to INI keys. 3,549+ unit tests.

v0.1.5 (March 2026) — Voice-first human-in-the-loop. Full voice loop inside Claude Code via 6 system hooks + cosa-voice MCP. Trust-aware Decision Proxy with Universal Prediction Engine, Bayesian Beta-Bernoulli trust, Thompson Sampling, and conformal prediction. Credential consolidation. Stable session identity architecture. 2,075+ tests.

v0.1.4 — cosa-voice MCP server, SWE Team Agent, Calculator Agent, CRUD Agent, Notification Proxy, 881 to 1170 unit tests, 39,871 training examples, local GPU embeddings

v0.1.3 — CJ Flow agentic job system, Deep Research + Podcast agents, Claude Agent SDK integration, JWT WebSocket auth, 100% test coverage

Full changelog


Project status

Lupin is an active research platform at v0.1.6. Developed by a solo engineer, it combines voice-first agent orchestration, PEFT fine-tuning, and Bayesian decision theory into a production-grade stack backed by 4,180+ automated tests across five tiers (unit, WebSocket, integration, Playwright E2E, interactive proxy), full CI discipline, and a FastAPI + PostgreSQL + LanceDB architecture. Through a series of ambitious refactorings made possible by Claude Code and the Planning is Prompting methodology, Lupin has evolved from single-user PoC sketches into a multi-user platform entering GCP testing.


License

Apache 2.0
