feat: live dashboard monitor + serve loop improvements #5

Open
msitarzewski wants to merge 1 commit into danveloper:main from msitarzewski:feat/dashboard-serve-improvements

@msitarzewski
Summary

  • ncurses dashboard — htop-style terminal monitor (dashboard.c) that reads /tmp/flash-moe-stats.json and shows real-time inference status, progress bars, TTFT, tok/s, and rolling averages
  • SSE streaming — per-token Server-Sent Events for /v1/chat/completions (OpenAI-compatible streaming)
  • Dashboard stats reporting — server writes live state (prefill progress, generation metrics, uptime) to JSON for the dashboard
  • Tool call parsing — detects <tool_call> blocks in model output and returns structured tool_calls in the response
  • Session state — save/restore KV cache and linear attention state for multi-turn conversations
  • GPU KV buffer — increased pre-allocation from 8K to 32K tokens
  • CPU 2-bit expert path — fallback compute path for 2-bit quantized experts
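
To illustrate the SSE streaming item, here is a minimal sketch of how one generated token might be framed as a Server-Sent Event carrying an OpenAI-style `chat.completion.chunk` delta. The function names `json_escape` and `format_sse_delta`, the buffer sizes, and the reduced set of JSON fields are assumptions made for illustration; they are not taken from this PR's code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: escape `in` so it can sit inside a JSON string literal.
 * Returns bytes written (excluding the NUL), or -1 if `out` is too small. */
static int json_escape(const char *in, char *out, size_t cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    if (cap == 0) return -1;
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        char buf[8];
        int len;
        switch (*p) {
        case '"':  len = snprintf(buf, sizeof buf, "\\\""); break;
        case '\\': len = snprintf(buf, sizeof buf, "\\\\"); break;
        case '\n': len = snprintf(buf, sizeof buf, "\\n");  break;
        case '\r': len = snprintf(buf, sizeof buf, "\\r");  break;
        case '\t': len = snprintf(buf, sizeof buf, "\\t");  break;
        default:
            if (*p < 0x20) {
                len = snprintf(buf, sizeof buf, "\\u%04x", *p);
            } else {
                buf[0] = (char)*p;
                buf[1] = '\0';
                len = 1;
            }
        }
        if (n + (size_t)len + 1 > cap) return -1; /* keep room for the NUL */
        memcpy(out + n, buf, (size_t)len);
        n += (size_t)len;
    }
    out[n] = '\0';
    return (int)n;
}

/* Hypothetical helper: format one token as an SSE event.
 * An SSE event is a "data: ..." line followed by a blank line.
 * Returns the event length, or -1 if it does not fit. */
int format_sse_delta(const char *token, char *out, size_t cap)
{
    char esc[1024]; /* assumed upper bound for a single escaped token */
    if (json_escape(token, esc, sizeof esc) < 0) return -1;
    int n = snprintf(out, cap,
        "data: {\"object\":\"chat.completion.chunk\","
        "\"choices\":[{\"index\":0,\"delta\":{\"content\":\"%s\"}}]}\n\n",
        esc);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

A serve loop could call `format_sse_delta` once per token and write the result to the client socket. OpenAI-compatible streams conventionally end with a final `data: [DONE]` event.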

Testing

Tested end-to-end on Apple M5 Max (128GB RAM):

  • make && make dashboard builds cleanly
  • ./infer --serve 6601 --2bit + ./dashboard — live monitoring works across idle/prefilling/generating states
  • ./infer --serve 6601 (4-bit) — 10.5 tok/s served with correct SSE streaming
  • Terminal resize, disconnect/reconnect, and q exit all work correctly
  • Verified dashboard borders render correctly at various terminal widths

Test plan

  • make clean && make && make chat && make dashboard — all targets build
  • ./infer --serve 6601 --2bit + ./chat --port 6601 — interactive chat works
  • ./dashboard shows live stats during generation
  • Dashboard shows DISCONNECTED when server is stopped
  • Dashboard adapts to terminal resize
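
One way the DISCONNECTED state in the test plan could be detected is by checking how recently the stats file was modified. The sketch below assumes that approach; the PR does not state how `dashboard.c` actually decides that the server is gone, and the names `is_stale` and `server_disconnected` as well as the age threshold are hypothetical.

```c
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: true if `mtime` is more than `max_age_s` seconds
 * older than `now`. */
int is_stale(time_t mtime, time_t now, int max_age_s)
{
    assert(max_age_s >= 0);
    return difftime(now, mtime) > (double)max_age_s;
}

/* Hypothetical helper: treat the server as disconnected when the stats file
 * is missing or has not been rewritten within `max_age_s` seconds. */
int server_disconnected(const char *path, int max_age_s)
{
    struct stat st;
    if (stat(path, &st) != 0) return 1; /* file missing or unreadable */
    return is_stale(st.st_mtime, time(NULL), max_age_s);
}
```

A monitor loop could call `server_disconnected("/tmp/flash-moe-stats.json", 2)` on each refresh and switch its status label accordingly. The two-second threshold is only an example value.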

🤖 Generated with Claude Code

Dashboard:
- ncurses-based htop-style terminal monitor (dashboard.c)
- Reads /tmp/flash-moe-stats.json written by the inference server
- Shows real-time status, progress bars, TTFT, tok/s, rolling averages
- Auto-adapts to terminal width, clean exit with q or Ctrl+C
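
As a rough illustration of a progress bar that adapts to terminal width, the following sketch renders a text bar into a caller-supplied buffer for a given column count. The function name `render_bar` and the `#` / `.` glyphs are assumptions; the actual drawing code in `dashboard.c` may look quite different.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: render "[####....]" into `out`, exactly `width`
 * columns wide including the brackets. `fraction` is clamped to [0, 1].
 * Returns the number of filled cells, or -1 if the bar cannot be drawn. */
int render_bar(double fraction, int width, char *out, size_t cap)
{
    assert(out != NULL);
    if (width < 3 || cap < (size_t)width + 1) return -1;
    if (fraction < 0.0) fraction = 0.0;
    if (fraction > 1.0) fraction = 1.0;

    int inner = width - 2;                      /* cells between the brackets */
    int filled = (int)(fraction * inner + 0.5); /* round to nearest cell */

    out[0] = '[';
    memset(out + 1, '.', (size_t)inner);        /* empty cells first */
    memset(out + 1, '#', (size_t)filled);       /* then overwrite the filled part */
    out[width - 1] = ']';
    out[width] = '\0';
    return filled;
}
```

In an ncurses program the width would typically come from `getmaxyx(stdscr, rows, cols)` on each redraw, so the bar is recomputed whenever the terminal is resized.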

Serve loop:
- SSE streaming with per-token delta events
- Dashboard stats reporting (server state, prefill progress, generation metrics)
- Tool call parsing from model output (<tool_call> blocks)
- Session state save/restore for multi-turn conversations
- GPU KV buffer increased to 32K pre-allocation
- CPU 2-bit expert forward path for fallback compute
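
The tool call parsing step can be sketched as a search for a complete `<tool_call>` ... `</tool_call>` pair in the generated text. This is a minimal sketch: the function name `extract_tool_call` is hypothetical, and treating the payload as an opaque string (often JSON in practice) is an assumption, since the PR only says that such blocks are detected and returned as structured `tool_calls`.

```c
#include <assert.h>
#include <string.h>

static int is_ws(char c)
{
    return c == ' ' || c == '\n' || c == '\r' || c == '\t';
}

/* Hypothetical helper: copy the text between the first <tool_call> and the
 * next </tool_call> into `out`, trimmed of surrounding whitespace.
 * Returns the payload length, 0 if there is no complete block (or the block
 * is empty), or -1 if `out` is too small. */
int extract_tool_call(const char *text, char *out, size_t cap)
{
    static const char open_tag[] = "<tool_call>";
    static const char close_tag[] = "</tool_call>";
    assert(text != NULL && out != NULL);

    const char *start = strstr(text, open_tag);
    if (!start) return 0;
    start += sizeof open_tag - 1;

    const char *end = strstr(start, close_tag);
    if (!end) return 0; /* block not closed, e.g. output still streaming */

    while (start < end && is_ws(*start)) start++;
    while (end > start && is_ws(end[-1])) end--;

    size_t len = (size_t)(end - start);
    if (len + 1 > cap) return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return (int)len;
}
```

The extracted payload would still need to be parsed and validated before it is placed in the response; that part is left out here.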

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>