Your server is slow. You SSH in. Now what?
melisai answers that question in 10 seconds. Single Go binary — no agents, no daemons, no config files. It runs 67 BCC/eBPF tools + 8 procfs collectors, detects 37 anomalies, computes a health score, and tells you exactly what sysctl to change. Works standalone or as an MCP server for Claude/Cursor to diagnose servers interactively.
$ sudo melisai collect --profile quick -o report.json
melisai v0.4.1 | profile=quick | duration=10s
Tier 1 (procfs) ████████████████████████████████████████ 8/8 2.1s
Tier 2 (BCC) ████████████████████████████████████████ 4/4 10.3s
Health Score: 68 / 100 ⚠️
Anomalies: cpu_utilization CRITICAL (98.7%)
load_average WARNING (3.2x CPUs)
Recommendations: 2
Report saved to report.json
| Feature | What it does |
|---|---|
| Health Score | Single 0-100 number. Green/yellow/red. AI agents use it to decide next steps |
| 37 Anomaly Rules | Rate-based detection (not cumulative counters). Catches problems happening right now |
| 35 Recommendations | Copy-paste sysctl commands with evidence. Each tagged as "fix" or "optimization" |
| 67 BCC Tools | ~84% of Brendan Gregg's BPF toolkit. Histograms, events, stack traces |
| GPU/PCIe Topology | Detects NVIDIA GPUs, maps PCI→NUMA, flags cross-NUMA GPU-NIC pairs (30-50% DMA penalty) |
| Page Reclaim Tracking | Direct reclaim rate, compaction stalls, THP splits — the invisible latency killers |
| NUMA Analysis | Per-node miss ratio, distance matrix, CPU mapping. Catches 30-50% latency penalties |
| Network Deep Dive | Conntrack, softnet, IRQ distribution, accept queue depth, 30+ TCP/UDP sysctls |
| MCP Server | Claude Desktop / Cursor connect over stdio. Interactive server diagnostics |
| Before/After Diff | Compare two reports. See what improved, what regressed |
| Observer Effect Mitigation | Two-phase collection: baselines first, then BCC tools. PID exclusion |
| 22 Chapter Documentation | Book-level guide (EN + RU) covering CPU, memory, disk, network, GPU, NUMA, THP |
One command. 60 seconds. Here's what melisai found on a production Docker host (8 CPUs, 32 GB RAM, HDD RAID):
Health Score: 46 / 100
Anomalies (6):
[CRITICAL] tcp_retransmits 134/sec
[WARNING] disk_utilization 80.5%
[WARNING] cpu_psi_pressure 17.6%
[WARNING] io_psi_pressure 12.8%
[WARNING] disk_avg_latency 12.1ms
[WARNING] tcp_close_wait 1 socket
What melisai found that manual debugging wouldn't:
The root cause chain — HDD latency (p99=49ms) caused page reclaim pressure (PruneCalled=105), which made the kernel prune TCP buffers, triggering 31 retransmits/sec, and cascading into 226K dropped packets on the WireGuard tunnel. Four separate symptoms, one root cause.
| What | Manual | melisai |
|---|---|---|
| Find the cascade: HDD → reclaim → tcp prune → retransmits → VPN drops | Requires years of kernel experience | Automatic — 37 rules check it all |
Know to check PruneCalled in /proc/net/netstat |
You have to know it exists | Collected and evaluated automatically |
| Identify the process eating 245% CPU + 8.6 GB RAM | top shows it, but not in context |
Correlated with health score and anomalies |
| Collect 17,330 stack traces for flame graph | perf record + manual setup |
Included in one --profile deep run |
| Get copy-paste sysctl fixes with evidence | Google + trial and error | 7 recommendations with exact commands |
The diagnosis: server outgrew its hardware. SSD would fix 80% of the problems. The other 20% is workload sizing.
- SRE / DevOps — "server is slow, find out why" in one command
- ML Engineers — GPU-NIC NUMA topology, PCIe bandwidth diagnostics
- Database Admins — THP splits, page reclaim, dirty writeback, NUMA miss ratio
- Platform Teams — container CPU throttling, cgroup memory pressure, K8s NUMA topology
- AI Agents — structured JSON + MCP protocol = autonomous server diagnostics
| melisai | Monitoring agents | |
|---|---|---|
| Deployment | Single binary, run once | Daemon + config + server |
| Time to answer | 10 seconds | Set up dashboards, wait for data |
| BPF tools | 67 tools in one run | Separate setup per tool |
| Recommendations | Exact sysctl commands | Raw metrics only |
| AI-native | MCP server, structured JSON, AI prompt | Requires adapter/integration |
| GPU topology | Cross-NUMA detection | Not typically covered |
| Cost | Free, Apache 2.0 | Free to $$$ |
melisai is not a replacement for monitoring — it's a diagnostic tool you run when something is already wrong (or to validate that everything is right).
# Build (requires Go 1.23+, cross-compile from macOS/Linux)
GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o melisai ./cmd/melisai/
# Deploy and run
scp melisai root@server:/usr/local/bin/
ssh root@server "melisai collect --profile quick -o /tmp/report.json"
# Install BCC tools (first time only)
ssh root@server "melisai install"No config files. No YAML. No environment variables.
melisai includes a built-in Model Context Protocol server. AI agents connect over stdio and interactively diagnose system performance.
Claude Desktop / Cursor config (claude_desktop_config.json):
{
"mcpServers": {
"melisai": {
"command": "ssh",
"args": ["root@your-server", "/usr/local/bin/melisai", "mcp"]
}
}
}| Tool | What it does | Time |
|---|---|---|
get_health |
Quick 0-100 score + anomalies. Tier 1 only | ~2s |
collect_metrics |
Full profile with BCC/eBPF tools. Args: profile, focus, pid |
10-60s |
explain_anomaly |
Root causes + recommendations for any anomaly ID | instant |
list_anomalies |
All 37 anomaly metric IDs with descriptions | instant |
Agent melisai
│ │
├── get_health ──────────────────► │ "score: 68, cpu_utilization CRITICAL"
│ │
├── explain_anomaly ─────────────► │ "Root causes, what to check, sysctl fixes..."
│ anomaly_id: cpu_utilization │
│ │
├── collect_metrics ─────────────► │ Full JSON: 67 BCC tools, histograms,
│ profile: standard │ events, stack traces, AI prompt
│ │
└── (agent analyzes & recommends) │
Three collection tiers with automatic fallback:
Tier 1: 8 collectors (/proc, /sys, ss, ethtool, nvidia-smi, dmesg)
Network deep diagnostics, page reclaim & THP tracking,
NUMA topology analysis, GPU/PCIe cross-NUMA detection
Always works. No root needed.
Tier 2: 67 BCC tools — runqlat, biolatency, tcpconnlat, etc.
Root + bcc-tools required. `melisai install` handles setup.
Tier 3: Native eBPF (cilium/ebpf) — tcpretrans kprobe
Root + kernel ≥ 5.8 + BTF. Zero external dependencies.
Collection runs in two phases to eliminate observer effect:
- Phase 1 — Tier 1 collectors capture clean baselines
- Phase 2 — BCC/eBPF tools run without contaminating baselines
| Section | Content |
|---|---|
summary.health_score |
Weighted 0-100 score (CPU 1.5x, Memory 1.5x, Disk 1.0x, Network 1.0x) |
summary.anomalies[] |
Detected issues with severity, metric, value, threshold |
summary.resources |
USE metrics per resource (utilization, saturation, errors) |
summary.recommendations[] |
Exact commands with type (fix/optimization) and evidence |
categories.* |
Raw data: histograms, events, stack traces per subsystem |
ai_context.prompt |
Dynamic prompt with system context and 27 anti-patterns |
| Profile | Duration | What runs | Best for |
|---|---|---|---|
| quick | 10s | Tier 1 + biolatency, tcpretrans, opensnoop, oomkill | Health checks, CI gates |
| standard | 30s | All Tier 1 + all 67 BCC tools | Regular diagnostics |
| deep | 60s | Everything + memleak, biostacks, wakeuptime, biotop | Root cause analysis |
# Quick health check
sudo melisai collect --profile quick -o report.json
# Full analysis with AI prompt
sudo melisai collect --profile standard --ai-prompt -o report.json
# Deep dive focused on disk
sudo melisai collect --profile deep --focus disk -o report.json
# Profile a specific process
sudo melisai collect --profile standard --pid 12345 -o app.json
# Profile a container
sudo melisai collect --profile standard --cgroup /sys/fs/cgroup/system.slice/nginx.service -o nginx.json
# Compare before/after a change
melisai diff before.json after.json -o diff.jsonAll rate-based rules use two-point sampling (delta over interval). No false positives from cumulative counters on long-uptime systems.
| Metric | Warning | Critical | Source |
|---|---|---|---|
| cpu_utilization | 80% | 95% | /proc/stat |
| cpu_iowait | 10% | 30% | /proc/stat |
| load_average | 2x CPUs | 4x CPUs | /proc/loadavg |
| memory_utilization | 85% | 95% | /proc/meminfo |
| swap_usage | 10% | 50% | /proc/meminfo |
| direct_reclaim_rate | 10/s | 1000/s | /proc/vmstat (rate) |
| compaction_stall_rate | 1/s | 100/s | /proc/vmstat (rate) |
| thp_split_rate | 1/s | 100/s | /proc/vmstat (rate) |
| numa_miss_ratio | 5% | 20% | /sys/devices/system/node |
| disk_utilization | 70% | 90% | /proc/diskstats |
| disk_avg_latency | 5ms | 50ms | /proc/diskstats |
| tcp_retransmits | 10/s | 50/s | /proc/net/snmp (rate) |
| conntrack_usage_pct | 70% | 90% | /proc/sys/net/netfilter |
| softnet_dropped | 1/s | 100/s | /proc/net/softnet_stat (rate) |
| listen_overflows | 1/s | 100/s | /proc/net/netstat (rate) |
| tcp_close_wait | 1 | 100 | ss |
| tcp_rcvq_drop | 1/s | 100/s | /proc/net/netstat (rate) |
| tcp_zero_window_drop | 1/s | 50/s | /proc/net/netstat (rate) |
| listen_queue_saturation | 70% | 90% | ss -tnl fill % |
| irq_imbalance | 5x | 20x | /proc/softirqs (rate) |
| gpu_nic_cross_numa | 1 pair | 1 pair | sysfs PCI NUMA |
| ... and 16 more (BCC histograms, PSI, container, NIC, UDP) |
~84% coverage of Brendan Gregg's BPF observability tools.
| Subsystem | Count | Tools |
|---|---|---|
| CPU | 10 | runqlat, runqlen, cpudist, hardirqs, softirqs, runqslower, cpufreq, cpuunclaimed, llcstat, funccount |
| Disk | 21 | biolatency, biosnoop, biotop, bitesize, ext4slower/dist, btrfsslower/dist, xfsslower/dist, nfsslower/dist, zfsslower/dist, fileslower, filelife, mountsnoop, mdflush, scsilatency, nvmelatency, vfsstat |
| Memory | 7 | cachestat, oomkill, drsnoop, shmsnoop, numamove, memleak, slabratetop |
| Network | 14 | tcpconnlat, tcpretrans, tcprtt, tcpdrop, tcpstates, tcpconnect, tcpaccept, tcplife, udpconnect, sofdsnoop, sockstat, skbdrop, tcpsynbl, gethostlatency |
| Process | 9 | execsnoop, opensnoop, killsnoop, threadsnoop, syncsnoop, exitsnoop, statsnoop, capable, syscount |
| Stacks | 6 | profile, offcputime, wakeuptime, offwaketime, biostacks, stackcount |
24 tools support --pid filtering for per-process analysis.
# Health score
jq '.summary.health_score' report.json
# All anomalies
jq '.summary.anomalies[]' report.json
# Copy-paste recommendations
jq '.summary.recommendations[] | {type, title, commands}' report.json
# Network sysctls
jq '.categories.network[0].data.sysctls' report.json
# Page reclaim pressure
jq '.categories.memory[0].data.reclaim' report.json
# NUMA miss ratio per node
jq '.categories.memory[0].data.numa_nodes[] | {node, miss_ratio, cpus}' report.json
# Top CPU processes
jq '.categories.process[0].data.top_by_cpu[:5]' report.json
# BCC histogram percentiles
jq '.categories.cpu[].histograms[]? | {name, p50, p99, max}' report.jsonComprehensive book-level guide covering Linux performance theory and melisai implementation:
| Chapters | Topic |
|---|---|
| 00-01 | Introduction, Linux fundamentals (/proc, /sys, cgroups, PSI) |
| 02-07 | CPU, Memory, Disk, Network, Process, Container analysis |
| 08-10 | System collector, BCC tools registry, Native eBPF |
| 11-13 | Anomaly detection (37 rules), Recommendations engine, AI integration |
| 14-17 | Report diffing, Orchestrator, Output formats, Appendix |
| 18 | GPU & PCIe Topology — nvidia-smi, NUMA mapping, GPUDirect RDMA |
| 19 | Page Reclaim & THP — watermarks, direct reclaim, compaction, THP defrag |
| 20 | NUMA Optimization — distance matrix, numactl, K8s topology manager |
| 21 | Production Tuning Checklist — all sysctls + one-liner tuning script |
All chapters available in English and Russian: doc/en/ | doc/ru/
cmd/melisai/ CLI (cobra) + MCP subcommand
internal/
├── collector/ 8 Tier 1 collectors + BCC adapter + GPU/PCIe
├── executor/ BCC runner, security, 67 parsers, registry
├── ebpf/ Native eBPF loader (cilium/ebpf, CO-RE)
├── mcp/ MCP server (4 tools, stdio JSON-RPC)
├── model/ Types, USE metrics, 37 anomalies, health score
├── observer/ PID tracker, overhead measurement
├── orchestrator/ Two-phase execution, signal handling, profiles
├── output/ JSON, FlameGraph SVG, AI prompt generator
├── diff/ Report comparison engine
└── installer/ Distro detection, package installation
Security: BCC binaries verified (root-owned, not world-writable, allowed paths only). No shell execution. Symlink resolution. Output capped at 50MB per tool.
| Minimum | Notes | |
|---|---|---|
| Build | Go 1.23+ | Cross-compile: GOOS=linux GOARCH=amd64 |
| Tier 1 | Any Linux kernel | No root needed |
| Tier 2 | bcc-tools installed | sudo melisai install handles this |
| Tier 3 | Kernel ≥ 5.8 with BTF | Falls back to Tier 2 automatically |
Ubuntu 24.04 (full validation), Ubuntu 22.04, Debian 12, Fedora 39, CentOS Stream 9
go test ./... -v # 458 tests
go test ./... -race # with race detector
make lint # golangci-lint
make build # cross-compile to linux/amd64
make test-validation # Linux + root + stress-ng workload tests- Concurrent ethtool calls (parallel NIC enrichment)
-
--compactflag for 256+ CPU servers - Per-cgroup memory pressure tracking
- AMD ROCm GPU support
- OpenTelemetry export
- Web UI for report visualization
Contributions welcome! Areas where help is especially appreciated:
- New BCC tool parsers — we're at 67/80 of Gregg's toolkit
- AMD GPU support — ROCm equivalent of our nvidia-smi integration
- Documentation — translations, tutorials, real-world case studies
- Testing — malformed input edge cases, container-specific scenarios
Please open an issue first to discuss significant changes.
| Problem | Fix |
|---|---|
tool "X" not found |
sudo melisai install |
binary "X" not owned by root |
chown root:root /usr/sbin/X-bpfcc |
| Empty histogram data | Normal — no events during window |
exit status 1 from BCC tool |
Check dmesg for BPF errors |
Being honest about what melisai is not:
- Not a monitoring system. It's a diagnostic tool — you run it when something is wrong (or to validate that everything is right). For continuous monitoring, use Prometheus/Grafana/Datadog alongside melisai.
- No historical trends. melisai captures a point-in-time snapshot. It can't tell you "retransmits spiked yesterday at 3 PM."
- No alerting. It doesn't run as a daemon or send notifications. Use it on-demand or via cron + MCP.
- BCC tools need kernel headers. Some distros require
linux-headers-$(uname -r)for Tier 2 tools. Tier 1 always works. - System-level only. melisai sees that a process uses 245% CPU, but not why inside the application. For app-level profiling, use pprof/async-profiler on the process melisai identified.
- NVIDIA GPUs only. AMD ROCm support is on the roadmap but not implemented yet.
Built on the shoulders of Brendan Gregg's BPF ecosystem, the USE Method, and cilium/ebpf.
