The intelligence is the box.
Run DeepSeek V4-Flash — 284 billion parameters, 1 million tokens of context — on the AMD Strix Halo desktop sitting under your desk. No cloud. No subscription. No data leaves your network. One install command.
▶ 60-second demo (video coming soon — shot list ready)
git clone https://github.com/Intuition-Labs-LLC/strix-mind
cd strix-mind && bin/mind-upOpen http://localhost:8080 in your browser. That's the chat.
You bought (or are considering) an AMD Strix Halo machine — a Framework Desktop, a Bosgame M5, a HP Z2 Mini G1a. 128 GB of unified memory. $3,400-ish. You wanted to run real AI on it, and the reviews you read couldn't agree on what was actually possible.
This repo is the answer. The model is DeepSeek V4-Flash (284B-parameter MoE, 1M native context, instruction-tuned, MIT-licensed, April 2026). The engine is bati.cpp with Vulkan kernels for the V4 architecture. The substrate is Mesa/RADV with RADV_DEBUG=novram — the one env-var setting that unlocks the unified memory pool. The agents are opencode, pi, hermes, and carl — best-in-class open-source surfaces, each pointed at one local endpoint.
We figured out what works. Now you don't have to.
- A private, multi-user web chat at
http://localhost:8080(the engine's built-in UI). - DeepSeek V4-Flash loaded once, served forever, talking to four agent surfaces simultaneously through one OpenAI-compatible endpoint.
- Multi-agent concurrency via continuous batching — the engine never idles while there's work in the request queue.
- No metering. No telemetry. No subscriptions. Your box, your model, your data.
| Surface | Purpose |
|---|---|
opencode |
Day-to-day coding TUI (150k★, 6.5M devs/month) |
pi |
Minimal agent loop; RPC dispatch target |
hermes |
Persistent personal agent — Telegram/Discord/Slack/Signal bridges, durable kanban |
carl |
Coherence-aware training studio — Phi telemetry on every generation |
AMD Ryzen AI Max+ 395 / Radeon 8060S, 128 GB unified memory, Bazzite 42, kernel 6.17, Mesa 26.0.
| Workload | Prefill | Decode |
|---|---|---|
| Single user, short prompt | — | ~19 t/s |
| Single user, 322-token prompt | 26.5 t/s | 19.0 t/s |
| 2 concurrent agents | — | 9.6 t/s each (19.3 aggregate) |
| 4 concurrent agents | — | 3.3 t/s each (13.4 aggregate) |
V4-Flash IQ2_XXS quantization (78 GB). f16 KV cache (required by V4's compressed-KV path). --batch-size 4096. Continuous batching enabled. Sweet spot: 1–2 concurrent agents.
# 1. Clone
git clone https://github.com/Intuition-Labs-LLC/strix-mind
cd strix-mind
# 2. Install prerequisites (one-time)
# See docs/INSTALL.md for the full sequence: bati.cpp build, model download,
# agent-surface install. Roughly 30 minutes start to finish on a fast network.
# 3. Boot the appliance
bin/mind-up # idempotent; bring everything online
bin/mind-status # green dots across the board
bin/mind-smoke # one-shot prompt round-trips
# 4. Use it
# Browser: http://localhost:8080 (built-in chat UI)
# or via the agent surfaces:
opencode # coding TUI
pi -p "what's in pwd?" # one-shot agent
hermes # personal-agent CLI
carl chat # Phi-aware conversation user / household
│
┌─────────┼──────────┬──────────┬──────────┬─────────┐
│ │ │ │ │ │
Open WebUI opencode pi hermes carl raw curl
(browser) (TUI) (agent) (personal) (train) (any)
│ │ │ │ │ │
└─────────┴────┬─────┴──────────┴──────────┴─────────┘
│ OpenAI /v1 over localhost:8080
▼
┌───────────────────────────┐
│ bati-server (V4-aware) │ ← continuous batching, HTTP queue
├───────────────────────────┤
│ DeepSeek V4-Flash IQ2_XXS│ ← 284B params, 1M ctx, 78 GB on disk
├───────────────────────────┤
│ Vulkan / RADV STRIX_HALO │ ← RADV_DEBUG=novram for unified mem
├───────────────────────────┤
│ AMD Ryzen AI Max+ 395 │ ← 16 cores, 128 GB unified, gfx1151
└───────────────────────────┘
Three things that took us a while to figure out — and that this repo encodes so you don't have to:
RADV_DEBUG=novramcollapses the small VRAM heap into the GTT pool. Without it, the Vulkan backend asks fordomains: 4(VRAM-only) allocations that don't fit, then OOM-cascades. With it, all 120 GiB of GTT is one unified pool that the engine can use.- Drop
--mlock. Pinning every page of a 78 GB model on a 128 GB box leaves no headroom; mmap + the kernel page cache handle hot/cold pages efficiently because only ~13B parameters are active per token in V4-Flash's MoE. - f16 KV cache is mandatory for V4-Flash. Quantizing K/V to q4_0 or q8_0 produces multilingual token-salad output — V4's compressed KV path (ratio-4 CSA) has an f16 pin. Non-V4 models can safely use quantized KV; V4 cannot.
Full failure-mode catalogue with reproduction recipes: docs/TROUBLESHOOTING.md.
- Not the inference engine. That's
bati.cpp. Bug reports for the engine go upstream. - Not the model. That's DeepSeek V4-Flash. Model behavior questions go upstream.
- Not a training framework. That's
carl-studio, which strix-mind hosts. - Not a multi-model router. One mind, loaded once, shared four ways. Swap models via
bin/mind-swap-modelorbin/mind-use-v4.
What strix-mind is: the orchestration layer that takes the upstream stack and makes it run as a single coherent appliance on AMD Strix Halo, with one command and a stable mental model.
| Component | State |
|---|---|
| V4-Flash IQ2_XXS load on bati Vulkan | ✓ Working, validated |
Substrate tuning baked into mind-up |
✓ Working |
| Continuous batching across agent surfaces | ✓ Working |
One-shot smoke test (mind-smoke) |
✓ Passing |
| Open WebUI integration (recipe) | ✓ Documented, manual setup |
| Public install.sh wrapper | ⏳ Planned for v0.2.0 |
| antirez-style minimal C reference port | ⏳ Planned for v1.0 |
See docs/STATUS.md for the live operational state and docs/TROUBLESHOOTING.md for everything we've personally broken.
docs/INSTALL.md— fresh-box install sequence, ~30 mindocs/STATUS.md— current operational truth, benchmarksdocs/TROUBLESHOOTING.md— every failure mode we've hitdocs/ARCHITECTURE.md— engineering rationaledocs/DEMO_SCRIPT.md— the 60-second video shot listNOTICE.md— credits with proper attribution
strix-mind is an orchestration layer over a tall stack of brilliant work by other people. Full attributions in NOTICE.md. The short version:
- bati.cpp — the only llama.cpp fork with working Vulkan V4-Flash kernels. Without bati, this appliance does not exist.
- llama.cpp (Georgi Gerganov + ggml authors) — the foundation. Continuous batching, the OpenAI-compatible server, the built-in chat UI we use at
localhost:8080. - DeepSeek AI — V4-Flash, MIT licensed, 284B parameters, the model.
- ssweens — the IQ2_XXS quantization (78 GB) that actually fits a 128 GB box.
- Mesa / RADV — the Vulkan driver.
RADV_DEBUG=novramis theirs. - opencode, pi, hermes-agent, carl-studio — the four agent surfaces.
- antirez — for proving over decades that small, hackable, sincerely-cited code is the right shape for systems software. We try to follow the form.
MIT. © 2026 Intuition Labs LLC. See LICENSE.
Compatible with every upstream we depend on. Take this, fork it, ship something better.
Built by Tej Desai with AI collaboration for Intuition Labs LLC. The intelligence is the box.