Skip to content

Intuition-Labs-LLC/strix-mind

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

strix-mind

The intelligence is the box.

Run DeepSeek V4-Flash — 284 billion parameters, 1 million tokens of context — on the AMD Strix Halo desktop sitting under your desk. No cloud. No subscription. No data leaves your network. One install command.

▶ 60-second demo (video coming soon — shot list ready)

git clone https://github.com/Intuition-Labs-LLC/strix-mind
cd strix-mind && bin/mind-up

Open http://localhost:8080 in your browser. That's the chat.


Why this exists

You bought (or are considering) an AMD Strix Halo machine — a Framework Desktop, a Bosgame M5, a HP Z2 Mini G1a. 128 GB of unified memory. $3,400-ish. You wanted to run real AI on it, and the reviews you read couldn't agree on what was actually possible.

This repo is the answer. The model is DeepSeek V4-Flash (284B-parameter MoE, 1M native context, instruction-tuned, MIT-licensed, April 2026). The engine is bati.cpp with Vulkan kernels for the V4 architecture. The substrate is Mesa/RADV with RADV_DEBUG=novram — the one env-var setting that unlocks the unified memory pool. The agents are opencode, pi, hermes, and carl — best-in-class open-source surfaces, each pointed at one local endpoint.

We figured out what works. Now you don't have to.

What you get

  • A private, multi-user web chat at http://localhost:8080 (the engine's built-in UI).
  • DeepSeek V4-Flash loaded once, served forever, talking to four agent surfaces simultaneously through one OpenAI-compatible endpoint.
  • Multi-agent concurrency via continuous batching — the engine never idles while there's work in the request queue.
  • No metering. No telemetry. No subscriptions. Your box, your model, your data.
Surface Purpose
opencode Day-to-day coding TUI (150k★, 6.5M devs/month)
pi Minimal agent loop; RPC dispatch target
hermes Persistent personal agent — Telegram/Discord/Slack/Signal bridges, durable kanban
carl Coherence-aware training studio — Phi telemetry on every generation

Performance, measured on a Framework Desktop

AMD Ryzen AI Max+ 395 / Radeon 8060S, 128 GB unified memory, Bazzite 42, kernel 6.17, Mesa 26.0.

Workload Prefill Decode
Single user, short prompt ~19 t/s
Single user, 322-token prompt 26.5 t/s 19.0 t/s
2 concurrent agents 9.6 t/s each (19.3 aggregate)
4 concurrent agents 3.3 t/s each (13.4 aggregate)

V4-Flash IQ2_XXS quantization (78 GB). f16 KV cache (required by V4's compressed-KV path). --batch-size 4096. Continuous batching enabled. Sweet spot: 1–2 concurrent agents.

Quick start

# 1. Clone
git clone https://github.com/Intuition-Labs-LLC/strix-mind
cd strix-mind

# 2. Install prerequisites (one-time)
#    See docs/INSTALL.md for the full sequence: bati.cpp build, model download,
#    agent-surface install. Roughly 30 minutes start to finish on a fast network.

# 3. Boot the appliance
bin/mind-up                 # idempotent; bring everything online
bin/mind-status             # green dots across the board
bin/mind-smoke              # one-shot prompt round-trips

# 4. Use it
#    Browser: http://localhost:8080  (built-in chat UI)
#    or via the agent surfaces:
opencode                    # coding TUI
pi -p "what's in pwd?"      # one-shot agent
hermes                      # personal-agent CLI
carl chat                   # Phi-aware conversation

Architecture

            user / household
                  │
        ┌─────────┼──────────┬──────────┬──────────┬─────────┐
        │         │          │          │          │         │
   Open WebUI   opencode    pi       hermes      carl    raw curl
   (browser)    (TUI)     (agent)   (personal)  (train)   (any)
        │         │          │          │          │         │
        └─────────┴────┬─────┴──────────┴──────────┴─────────┘
                       │  OpenAI /v1 over localhost:8080
                       ▼
           ┌───────────────────────────┐
           │  bati-server (V4-aware)   │  ← continuous batching, HTTP queue
           ├───────────────────────────┤
           │  DeepSeek V4-Flash IQ2_XXS│  ← 284B params, 1M ctx, 78 GB on disk
           ├───────────────────────────┤
           │  Vulkan / RADV STRIX_HALO │  ← RADV_DEBUG=novram for unified mem
           ├───────────────────────────┤
           │  AMD Ryzen AI Max+ 395    │  ← 16 cores, 128 GB unified, gfx1151
           └───────────────────────────┘

How it works (the substrate)

Three things that took us a while to figure out — and that this repo encodes so you don't have to:

  1. RADV_DEBUG=novram collapses the small VRAM heap into the GTT pool. Without it, the Vulkan backend asks for domains: 4 (VRAM-only) allocations that don't fit, then OOM-cascades. With it, all 120 GiB of GTT is one unified pool that the engine can use.
  2. Drop --mlock. Pinning every page of a 78 GB model on a 128 GB box leaves no headroom; mmap + the kernel page cache handle hot/cold pages efficiently because only ~13B parameters are active per token in V4-Flash's MoE.
  3. f16 KV cache is mandatory for V4-Flash. Quantizing K/V to q4_0 or q8_0 produces multilingual token-salad output — V4's compressed KV path (ratio-4 CSA) has an f16 pin. Non-V4 models can safely use quantized KV; V4 cannot.

Full failure-mode catalogue with reproduction recipes: docs/TROUBLESHOOTING.md.

What strix-mind is NOT

  • Not the inference engine. That's bati.cpp. Bug reports for the engine go upstream.
  • Not the model. That's DeepSeek V4-Flash. Model behavior questions go upstream.
  • Not a training framework. That's carl-studio, which strix-mind hosts.
  • Not a multi-model router. One mind, loaded once, shared four ways. Swap models via bin/mind-swap-model or bin/mind-use-v4.

What strix-mind is: the orchestration layer that takes the upstream stack and makes it run as a single coherent appliance on AMD Strix Halo, with one command and a stable mental model.

Status

Component State
V4-Flash IQ2_XXS load on bati Vulkan ✓ Working, validated
Substrate tuning baked into mind-up ✓ Working
Continuous batching across agent surfaces ✓ Working
One-shot smoke test (mind-smoke) ✓ Passing
Open WebUI integration (recipe) ✓ Documented, manual setup
Public install.sh wrapper ⏳ Planned for v0.2.0
antirez-style minimal C reference port ⏳ Planned for v1.0

See docs/STATUS.md for the live operational state and docs/TROUBLESHOOTING.md for everything we've personally broken.

Documentation

Credits

strix-mind is an orchestration layer over a tall stack of brilliant work by other people. Full attributions in NOTICE.md. The short version:

  • bati.cpp — the only llama.cpp fork with working Vulkan V4-Flash kernels. Without bati, this appliance does not exist.
  • llama.cpp (Georgi Gerganov + ggml authors) — the foundation. Continuous batching, the OpenAI-compatible server, the built-in chat UI we use at localhost:8080.
  • DeepSeek AI — V4-Flash, MIT licensed, 284B parameters, the model.
  • ssweens — the IQ2_XXS quantization (78 GB) that actually fits a 128 GB box.
  • Mesa / RADV — the Vulkan driver. RADV_DEBUG=novram is theirs.
  • opencode, pi, hermes-agent, carl-studio — the four agent surfaces.
  • antirez — for proving over decades that small, hackable, sincerely-cited code is the right shape for systems software. We try to follow the form.

License

MIT. © 2026 Intuition Labs LLC. See LICENSE.

Compatible with every upstream we depend on. Take this, fork it, ship something better.


Built by Tej Desai with AI collaboration for Intuition Labs LLC. The intelligence is the box.

About

Personal AI appliance for AMD Strix Halo workstations — DeepSeek V4-Flash served to four agent surfaces over one local endpoint. MIT.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages