feat: live dashboard monitor + serve loop improvements by msitarzewski · Pull Request #5 · danveloper/flash-moe

msitarzewski · 2026-03-21T15:16:25Z

Summary

ncurses dashboard — htop-style terminal monitor (dashboard.c) that reads /tmp/flash-moe-stats.json and shows real-time inference status, progress bars, TTFT, tok/s, and rolling averages
SSE streaming — per-token Server-Sent Events for /v1/chat/completions (OpenAI-compatible streaming)
Dashboard stats reporting — server writes live state (prefill progress, generation metrics, uptime) to JSON for the dashboard
Tool call parsing — detects <tool_call> blocks in model output and returns structured tool_calls in the response
Session state — save/restore KV cache and linear attention state for multi-turn conversations
GPU KV buffer — increased pre-allocation from 8K to 32K tokens
CPU 2-bit expert path — fallback compute path for 2-bit quantized experts
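To make the SSE streaming item concrete, here is a minimal sketch of how one per-token event could be framed. It is not taken from this PR: the helper names `json_escape` and `format_sse_delta` are hypothetical, and the payload is a deliberately reduced form of an OpenAI-style streaming chunk (a real chunk also carries fields such as `id`, `object`, and `model`). The only protocol facts relied on are that an SSE event is a `data:` line terminated by a blank line, and that token text must be JSON-escaped before it is embedded.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy src into dst with JSON string escaping.
 * Returns the number of bytes written, or -1 if dst is too small. */
static int json_escape(char *dst, size_t cap, const char *src)
{
    size_t n = 0;
    if (cap == 0) return -1;
    for (const unsigned char *p = (const unsigned char *)src; *p; p++) {
        char tmp[16];
        int len;
        switch (*p) {
        case '"':  len = snprintf(tmp, sizeof tmp, "\\\""); break;
        case '\\': len = snprintf(tmp, sizeof tmp, "\\\\"); break;
        case '\n': len = snprintf(tmp, sizeof tmp, "\\n");  break;
        case '\r': len = snprintf(tmp, sizeof tmp, "\\r");  break;
        case '\t': len = snprintf(tmp, sizeof tmp, "\\t");  break;
        default:
            if (*p < 0x20)  /* remaining control characters */
                len = snprintf(tmp, sizeof tmp, "\\u%04x", (unsigned)*p);
            else            /* printable ASCII and UTF-8 bytes pass through */
                len = snprintf(tmp, sizeof tmp, "%c", *p);
        }
        if (n + (size_t)len >= cap) return -1;
        memcpy(dst + n, tmp, (size_t)len);
        n += (size_t)len;
    }
    dst[n] = '\0';
    return (int)n;
}

/* Hypothetical helper: format one SSE event carrying a single token delta.
 * Returns the event length, or -1 if the token or the event does not fit. */
int format_sse_delta(char *out, size_t cap, const char *token)
{
    char esc[512];
    if (json_escape(esc, sizeof esc, token) < 0) return -1;
    int n = snprintf(out, cap,
        "data: {\"choices\":[{\"index\":0,\"delta\":{\"content\":\"%s\"}}]}\n\n",
        esc);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

A serve loop would write the returned bytes to the client socket after each generated token. OpenAI-compatible streams conventionally finish with a final `data: [DONE]` event.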

Testing

Tested end-to-end on Apple M5 Max (128GB RAM):

make && make dashboard builds cleanly
./infer --serve 6601 --2bit + ./dashboard — live monitoring works across idle/prefilling/generating states
./infer --serve 6601 (4-bit) — 10.5 tok/s served with correct SSE streaming
Terminal resize, disconnect/reconnect, and q exit all work correctly
Verified dashboard borders render correctly at various terminal widths
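The tok/s figure above is a single throughput measurement. The dashboard's rolling average over such measurements could be kept in a small ring buffer, sketched below. This is an illustration only: the type `rolling_avg`, the functions `rolling_push` and `rolling_mean`, and the window size `ROLL_WINDOW` are hypothetical, not names from `dashboard.c`.

```c
#include <stddef.h>

#define ROLL_WINDOW 8  /* hypothetical window size, in samples */

typedef struct {
    double samples[ROLL_WINDOW];
    size_t count;  /* number of valid samples, at most ROLL_WINDOW */
    size_t next;   /* slot that the next sample overwrites */
} rolling_avg;     /* must be zero-initialized before first use */

/* Record one tok/s sample, overwriting the oldest once the window is full. */
void rolling_push(rolling_avg *r, double tok_per_s)
{
    r->samples[r->next] = tok_per_s;
    r->next = (r->next + 1) % ROLL_WINDOW;
    if (r->count < ROLL_WINDOW) r->count++;
}

/* Mean of the samples currently in the window; 0.0 when empty. */
double rolling_mean(const rolling_avg *r)
{
    if (r->count == 0) return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < r->count; i++) sum += r->samples[i];
    return sum / (double)r->count;
}
```

A fixed window keeps memory constant and lets the displayed average follow recent requests instead of being dominated by the whole session's history.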

Test plan

make clean && make && make chat && make dashboard — all targets build
./infer --serve 6601 --2bit + ./chat --port 6601 — interactive chat works
./dashboard shows live stats during generation
Dashboard shows DISCONNECTED when server is stopped
Dashboard adapts to terminal resize
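The PR does not say how the dashboard decides the server is DISCONNECTED. One plausible approach, sketched here purely as an assumption, is to treat the stats file as stale when it is missing or has not been rewritten recently. The function name `stats_stale` and the age threshold are hypothetical; `stat` and `difftime` are the standard POSIX and C library calls.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical check: returns true when the stats file is missing or was
 * last modified more than max_age_s seconds before `now`. */
bool stats_stale(const char *path, time_t now, double max_age_s)
{
    struct stat st;
    if (stat(path, &st) != 0) return true;  /* missing or unreadable */
    return difftime(now, st.st_mtime) > max_age_s;
}
```

A dashboard refresh loop could call `stats_stale("/tmp/flash-moe-stats.json", time(NULL), 5.0)` on each tick and switch to the DISCONNECTED view when it returns true. The 5-second threshold is illustrative only.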

🤖 Generated with Claude Code

Dashboard:
- ncurses-based htop-style terminal monitor (dashboard.c)
- Reads /tmp/flash-moe-stats.json written by the inference server
- Shows real-time status, progress bars, TTFT, tok/s, rolling averages
- Auto-adapts to terminal width, clean exit with q or Ctrl+C

Serve loop:
- SSE streaming with per-token delta events
- Dashboard stats reporting (server state, prefill progress, generation metrics)
- Tool call parsing from model output (<tool_call> blocks)
- Session state save/restore for multi-turn conversations
- GPU KV buffer increased to 32K pre-allocation
- CPU 2-bit expert forward path for fallback compute

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
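As an illustration of the tool call parsing item, the sketch below locates the first `<tool_call>` block in generated text and returns its trimmed payload. It is a minimal sketch under stated assumptions, not the PR's parser: the function name `find_tool_call` is hypothetical, and the closing tag `</tool_call>` is assumed, since the description only names the opening tag.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: find the first complete <tool_call>...</tool_call>
 * block in text. On success, returns a pointer to the whitespace-trimmed
 * payload inside text and stores its length in *len. Returns NULL when no
 * complete block is present (for example while the model is still
 * generating the block). */
const char *find_tool_call(const char *text, size_t *len)
{
    static const char open_tag[]  = "<tool_call>";
    static const char close_tag[] = "</tool_call>";  /* assumed closing tag */

    const char *start = strstr(text, open_tag);
    if (!start) return NULL;
    start += sizeof open_tag - 1;

    const char *end = strstr(start, close_tag);
    if (!end) return NULL;

    /* Trim surrounding whitespace so the payload is ready for a JSON parser. */
    while (start < end && isspace((unsigned char)*start)) start++;
    while (end > start && isspace((unsigned char)end[-1])) end--;

    *len = (size_t)(end - start);
    return start;
}
```

The returned span is not NUL-terminated, so a caller would copy `*len` bytes before handing the payload to a JSON parser and building the structured `tool_calls` entry in the response.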

msitarzewski wants to merge 1 commit into danveloper:main from msitarzewski:feat/dashboard-serve-improvements
