feat(sight): add agent activity monitor via schedmon BPF#822

Open
jfeng18 wants to merge 1 commit into
alibaba:mainfrom
jfeng18:feat/activity-monitor

jfeng18 commented Jun 10, 2026

Contributor

Supersedes #662.

Summary

  • Replace cgroup-based idle-burst scheduler with observation-only activity monitor
  • Add schedmon BPF probe (tp_btf/sched_switch + sched_wakeup) for per-thread sleep/wakeup tracking
  • Add ActivityMonitor state machine with per-family idle/active tracking and configurable debounce
  • No cgroup operations, no cpu.idle toggle — CPU policy belongs in container spec
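The per-family idle/active tracking with debounce summarized above can be sketched roughly as follows. This is an illustrative sketch, not the code in scheduler/mod.rs: the `SchedEvent` and `Family` types, the `on_event`/`tick`/`state` method names, the millisecond timestamps, and the rule that a family turns idle only after all of its threads have been asleep for the full threshold are all assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet};

/// Observed state of one process family (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Activity {
    Active,
    Idle,
}

/// Simplified stand-in for events delivered from the BPF ring buffer.
#[derive(Debug, Clone, Copy)]
pub enum SchedEvent {
    Sleep { tid: u32, ts_ms: u64 },
    Wakeup { tid: u32, ts_ms: u64 },
}

struct Family {
    awake: HashSet<u32>,           // threads currently not sleeping
    all_asleep_since: Option<u64>, // timestamp when the last awake thread slept
    state: Activity,
}

pub struct ActivityMonitor {
    idle_threshold_ms: u64,
    families: HashMap<u32, Family>,
}

impl ActivityMonitor {
    pub fn new(idle_threshold_ms: u64) -> Self {
        Self { idle_threshold_ms, families: HashMap::new() }
    }

    /// Feed one event. Returns Some(Active) when an idle family wakes up.
    pub fn on_event(&mut self, family: u32, ev: SchedEvent) -> Option<Activity> {
        let fam = self.families.entry(family).or_insert_with(|| Family {
            awake: HashSet::new(),
            all_asleep_since: None,
            state: Activity::Active,
        });
        match ev {
            SchedEvent::Wakeup { tid, .. } => {
                fam.awake.insert(tid);
                fam.all_asleep_since = None; // cancel any pending idle transition
                if fam.state == Activity::Idle {
                    fam.state = Activity::Active;
                    return Some(Activity::Active);
                }
                None
            }
            SchedEvent::Sleep { tid, ts_ms } => {
                fam.awake.remove(&tid);
                if fam.awake.is_empty() && fam.all_asleep_since.is_none() {
                    fam.all_asleep_since = Some(ts_ms); // start the debounce window
                }
                None
            }
        }
    }

    /// Periodic check. Returns the families that just crossed the idle threshold.
    pub fn tick(&mut self, now_ms: u64) -> Vec<u32> {
        let threshold = self.idle_threshold_ms;
        let mut newly_idle = Vec::new();
        for (id, fam) in self.families.iter_mut() {
            if fam.state != Activity::Active {
                continue;
            }
            if let Some(since) = fam.all_asleep_since {
                if now_ms.saturating_sub(since) >= threshold {
                    fam.state = Activity::Idle;
                    newly_idle.push(*id);
                }
            }
        }
        newly_idle
    }

    pub fn state(&self, family: u32) -> Option<Activity> {
        self.families.get(&family).map(|f| f.state)
    }
}
```

In this sketch a wakeup flips the family back to active immediately, while the transition to idle is delayed by the threshold, so short sleeps between bursts of work do not register as idle periods. The monitor only records state and never touches cgroups.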

Why: The original design (#662) set cpu.idle=1 on idle agents, which drops the cgroup's CFS weight to 3 (versus 1024 for a normal nice-0 task). When an API response wakes the agent, its thread cannot preempt any non-idle task and may wait 10+ ms before it runs, which is the opposite of the intended acceleration.
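To put the weight gap in numbers, here is a small sketch of the proportional-share arithmetic. The weights 3 and 1024 come from the description above; the `cfs_share` helper is hypothetical. It is a simplification that models only long-run share on one runqueue and does not model the wakeup-preemption behavior that causes the latency.

```rust
/// CFS weight of an idle-class entity, as stated in the PR description.
pub const WEIGHT_IDLE: u64 = 3;
/// CFS weight of a normal nice-0 task.
pub const WEIGHT_NICE0: u64 = 1024;

/// Fraction of CPU time an entity with `weight` would receive when it
/// competes with entities of the given weights on the same runqueue.
/// Hypothetical helper for illustration only.
pub fn cfs_share(weight: u64, competitors: &[u64]) -> f64 {
    let total = weight + competitors.iter().sum::<u64>();
    weight as f64 / total as f64
}
```

Under this model, `cfs_share(WEIGHT_IDLE, &[WEIGHT_NICE0])` is 3/1027, or under 0.3% of the CPU against a single busy normal task, compared with 50% for two nice-0 tasks.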

Changed files

| File | What |
| --- | --- |
| schedmon.bpf.c / schedmon.h | New BPF tracepoint programs |
| probes/schedmon.rs | Rust wrapper with shared map reuse |
| scheduler/mod.rs | ActivityMonitor (337 lines, 9 tests) |
| probes/probes.rs | SchedMon integration + ring buffer dispatch |
| unified.rs | Event handling + process lifecycle wiring |
| config.rs | activity_monitor.enabled + idle_threshold_ms |
| event.rs / build.rs / lib.rs / probes/mod.rs / parser/unified.rs | Plumbing |

Test plan

  • Unit tests: 553 passed, 0 failed (full suite)
  • Scheduler tests: 9/9 passed (state machine)
  • Schedmon tests: 3/3 passed (probe wrapper)
  • ECS E2E: BPF probe loads, Sched events flow through ring buffer to ActivityMonitor
  • ECS E2E: Cosh agent auto-discovered, Sched events received for traced process
  • Adversarial workflow review: 0 blocking issues

Replace the cgroup-based idle-burst scheduler (which caused CFS weight
starvation on wakeup) with a pure observation-only activity monitor.

The schedmon BPF probe attaches to tp_btf/sched_switch and
tp_btf/sched_wakeup to track per-thread sleep/wakeup events for traced
agent processes. The userspace ActivityMonitor aggregates these into
per-family idle/active state with configurable debounce threshold.

No cgroup operations, no cpu.idle toggle, no weight manipulation —
CPU scheduling policy belongs in the container spec, not in the
observability layer.

New files:
- schedmon.bpf.c / schedmon.h: BPF tracepoint programs
- probes/schedmon.rs: Rust wrapper with map reuse
- scheduler/mod.rs: ActivityMonitor state machine (9 tests)

Config: activity_monitor.enabled + idle_threshold_ms in agentsight.json

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jfeng18 commented Jun 10, 2026

Contributor Author

Supersedes #662 — reworked from cgroup-based scheduler to observation-only activity monitor per review feedback on CFS weight starvation.

github-actions bot added the component:sight (src/agentsight/) label Jun 10, 2026
jfeng18 force-pushed the feat/activity-monitor branch from 62606eb to a3fc6f8 June 10, 2026 10:24