feat(sight): agent activity monitor via schedmon BPF #662

Closed
jfeng18 wants to merge 10 commits into alibaba:main from jfeng18:feat/idle-burst-scheduler

@jfeng18 jfeng18 commented May 29, 2026

Contributor

Summary

Replaces the cgroup-based idle-burst-idle scheduler with a pure observability activity monitor. The BPF scheduling probes are retained; the cgroup actuation layer is removed.

Why the change

Reviewer feedback (执云/智彻) and kernel code analysis identified fundamental flaws in the cpu.idle toggle approach:

  1. Architectural race: userspace round-trip (BPF event → daemon → cgroup write) is slower than the kernel's first scheduling decision after wakeup
  2. Active anti-pattern: cpu.idle=1 sets weight to WEIGHT_IDLEPRIO=3 (kernel/sched/fair.c:14141), making waking tasks unable to preempt non-idle tasks (line 9087) until userspace restores cpu.idle=0
  3. No-op when idle: setting cpu.idle on sleeping tasks has no scheduling effect (nothing on runqueue)
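
To put point 2 in rough numbers, here is a minimal sketch of how small a weight of 3 is next to a competing nice-0 entity. It assumes plain proportional sharing between two entities at the same level and uses the standard nice-0 load weight of 1024 for the competitor; `cpu_share` is a hypothetical helper, and the wakeup preemption check itself is not modeled.

```rust
/// Weight the kernel gives an idle entity (WEIGHT_IDLEPRIO, as cited above).
const WEIGHT_IDLEPRIO: u64 = 3;
/// Load weight of a nice-0 task, used here as the competing non-idle entity
/// (an assumption chosen for illustration).
const NICE_0_WEIGHT: u64 = 1024;

/// Fraction of CPU time an entity with `weight` would receive when competing
/// against entities whose weights sum to `others`, under proportional sharing.
fn cpu_share(weight: u64, others: u64) -> f64 {
    weight as f64 / (weight + others) as f64
}
```

Under these assumptions `cpu_share(WEIGHT_IDLEPRIO, NICE_0_WEIGHT)` is 3/1027, i.e. under 0.3% of the CPU, which is the state a waking agent stays in until userspace writes cpu.idle=0.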

The correct approach for CPU acceleration or high-density deployments is a static cpu.weight in the container spec, or a future sched_ext BPF scheduler (in-kernel, zero round-trip). ECS already has CONFIG_SCHED_CLASS_EXT=y.

What's kept

  • schedmon BPF probes (tp_btf/sched_switch + sched_wakeup): per-tid sleep/wakeup detection
  • Activity state machine: per-family idle/active tracking with debounce
  • Metrics: idle_to_active_count, active_to_idle_count, duration tracking
  • All unit tests (state machine correctness)
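
For readers unfamiliar with the retained pieces, the following is a minimal sketch of a per-family idle/active state machine with debounce and transition counters. It is not the PR's code: the type names, the millisecond timestamps, and every method except `on_idle_tick` (named in the commit messages) are assumptions, and duration tracking is left out.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Idle,
    Active,
}

#[derive(Debug, Default)]
struct Metrics {
    idle_to_active_count: u64,
    active_to_idle_count: u64,
}

struct Family {
    state: State,
    runnable: HashSet<u32>,      // tids currently runnable
    empty_since_ms: Option<u64>, // when the runnable set first emptied
    metrics: Metrics,
}

impl Family {
    fn new() -> Self {
        Family {
            state: State::Idle,
            runnable: HashSet::new(),
            empty_since_ms: None,
            metrics: Metrics::default(),
        }
    }
}

struct ActivityMonitor {
    idle_threshold_ms: u64,
    families: HashMap<u32, Family>, // keyed by Agent root pid
}

impl ActivityMonitor {
    fn new(idle_threshold_ms: u64) -> Self {
        ActivityMonitor { idle_threshold_ms, families: HashMap::new() }
    }

    /// A thread became runnable: the family is ACTIVE immediately.
    fn on_wakeup(&mut self, root_pid: u32, tid: u32) {
        let fam = self.families.entry(root_pid).or_insert_with(Family::new);
        fam.runnable.insert(tid);
        fam.empty_since_ms = None; // cancel any pending idle debounce
        if fam.state == State::Idle {
            fam.state = State::Active;
            fam.metrics.idle_to_active_count += 1;
        }
    }

    /// A thread blocked: start the debounce only when the set first empties.
    fn on_sleep(&mut self, root_pid: u32, tid: u32, now_ms: u64) {
        if let Some(fam) = self.families.get_mut(&root_pid) {
            fam.runnable.remove(&tid);
            if fam.runnable.is_empty() && fam.empty_since_ms.is_none() {
                fam.empty_since_ms = Some(now_ms);
            }
        }
    }

    /// Periodic tick: finalize ACTIVE -> IDLE once the debounce has elapsed.
    fn on_idle_tick(&mut self, now_ms: u64) {
        let threshold = self.idle_threshold_ms;
        for fam in self.families.values_mut() {
            if fam.state != State::Active {
                continue;
            }
            if let Some(since) = fam.empty_since_ms {
                if now_ms.saturating_sub(since) >= threshold {
                    fam.state = State::Idle;
                    fam.empty_since_ms = None;
                    fam.metrics.active_to_idle_count += 1;
                }
            }
        }
    }
}
```

The asymmetry is deliberate in this sketch: wakeups flip the family to ACTIVE at once, while the IDLE transition waits for the threshold, so short sleeps between bursts do not inflate the transition counters.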

What's removed

  • All cgroup operations (create/remove/write cpu.idle/cpu.weight/migrate pid)
  • active_weight, cgroup_root config fields
  • Drop/cleanup/sweep logic
  • README zero-intrusion scoping docs (no longer needed — purely observational)

Stacked on #661 (lineage tree). Review top commits only.

Test plan

  • 466 unit tests pass, 9 activity monitor tests
  • Workflow adversarial review: PASS
  • ECS E2E: verify idle/active log transitions with real agent workload
  • Reviewer sign-off

@github-actions github-actions Bot added the component:sight src/agentsight/ label May 29, 2026
jfeng18 added a commit to jfeng18/anolisa that referenced this pull request Jun 1, 2026
Distinguishes default zero-intrusion monitoring from opt-in subsystems
(e.g. the idle-burst-idle scheduler enabled by --enable-scheduler).

Sweeps all three product-summary surfaces under src/agentsight: README.md,
README_CN.md (full-width punctuation, plain-language phrasing), and the RPM
%description in agentsight.spec.in.

NOTE: depends on commit 8a7a036 in this same PR, which introduces the
--enable-scheduler flag. Do not split this README change out of alibaba#662 or
land it ahead of 8a7a036; --enable-scheduler does not exist on main yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
jfeng18 added a commit to jfeng18/anolisa that referenced this pull request Jun 3, 2026
@jfeng18 jfeng18 force-pushed the feat/idle-burst-scheduler branch from 35342ed to bed254c Compare June 3, 2026 11:24
@jfeng18 jfeng18 changed the title feat(sight): idle-burst-idle scheduler for Agent families feat(sight): agent activity monitor via schedmon BPF Jun 4, 2026
jfeng18 added a commit to jfeng18/anolisa that referenced this pull request Jun 4, 2026
EVENT_SOURCE_SCHED=7 (from alibaba#662 schedmon) would collide with
EVENT_SOURCE_LSM=7 on merge. Renumber LSM to 8.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jfeng18 jfeng18 force-pushed the feat/idle-burst-scheduler branch 2 times, most recently from 3a33e35 to 907ad01 Compare June 4, 2026 23:09
jfeng18 added a commit to jfeng18/anolisa that referenced this pull request Jun 6, 2026
@jfeng18 jfeng18 force-pushed the feat/idle-burst-scheduler branch from 907ad01 to 5d5be6b Compare June 6, 2026 07:16
jfeng18 added a commit to jfeng18/anolisa that referenced this pull request Jun 6, 2026
jfeng18 added a commit to jfeng18/anolisa that referenced this pull request Jun 6, 2026
@jfeng18 jfeng18 force-pushed the feat/idle-burst-scheduler branch from 5d5be6b to 7ab3d31 Compare June 6, 2026 10:22
jfeng18 and others added 10 commits June 6, 2026 20:56
Introduce a userspace process lineage tree that tracks Agent process
families (Agent -> SubAgent -> Tool / Skill). Nodes carry pid/ppid,
process type, AGENT_MODE flag, comm and an optional agent name, and
maintain parent->child links on insert/remove.

classify() assigns a type from the process's ancestry and environment:
a child of an Agent/SubAgent becomes SubAgent (if it matches an agent
pattern) or Tool; a parentless process with AGENT_MODE=1 becomes an
Agent root; everything else stays Unknown. subtree()/roots() expose the
forest for queries.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
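
The classification rules in the commit above can be sketched as a single match. This is an illustration only: the `ProcType` enum and the `classify` signature are assumptions (the real function reads ancestry and environment from the tree), and the Skill type mentioned in the commit is omitted because its rule is not described.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProcType {
    Agent,
    SubAgent,
    Tool,
    Unknown,
}

/// `parent` is the type of the parent node in the tree, or None when the
/// process has no parent in the tree.
fn classify(parent: Option<ProcType>, agent_mode: bool, matches_agent_pattern: bool) -> ProcType {
    match parent {
        // Child of an Agent/SubAgent: SubAgent if it looks like an agent, else Tool.
        Some(ProcType::Agent) | Some(ProcType::SubAgent) => {
            if matches_agent_pattern {
                ProcType::SubAgent
            } else {
                ProcType::Tool
            }
        }
        // Parentless process with AGENT_MODE=1: a new Agent root.
        None if agent_mode => ProcType::Agent,
        // Everything else stays Unknown.
        _ => ProcType::Unknown,
    }
}
```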
Wire the lineage tree into the event loop. proctrace exec/exit events
maintain the tree (insert+classify on exec, remove on exit), inferring
AGENT_MODE and agent-pattern matches from the pid->agent-name cache to
avoid redundant /proc reads.

Add scanner helpers read_ppid() and has_agent_mode() that read
/proc/<pid>/stat and /proc/<pid>/environ, used by the procmon path to
auto-detect AGENT_MODE=1 roots.

ensure_lineage_node() closes a race: proctrace does not emit an exec
event for an AGENT_MODE root (it was not yet in traced_processes when it
execed), so the procmon detection path inserts and classifies the node
directly, making detection order-independent.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Walk a process's ancestry to the nearest Agent root (bounded to 64 hops
to guard against cycles), returning None when no Agent ancestor exists.
This is the prerequisite the scheduler uses to group a process into its
Agent family.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
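
A bounded ancestry walk of the kind described in the commit above might look like the sketch below. The `Node` layout and the `find_root` signature are assumptions, and the sketch treats the starting process itself as a candidate root, which the commit message does not specify.

```rust
use std::collections::HashMap;

/// Hop limit from the commit message; guards against ppid cycles.
const MAX_HOPS: usize = 64;

struct Node {
    ppid: u32,
    is_agent_root: bool,
}

/// Walk from `pid` toward its ancestors and return the nearest Agent root,
/// or None if the chain leaves the tree or exceeds MAX_HOPS.
fn find_root(nodes: &HashMap<u32, Node>, pid: u32) -> Option<u32> {
    let mut cur = pid;
    for _ in 0..MAX_HOPS {
        let node = nodes.get(&cur)?; // ancestor not tracked: no Agent family
        if node.is_agent_root {
            return Some(cur);
        }
        cur = node.ppid;
    }
    None // hop budget exhausted, e.g. a cycle in the ppid links
}
```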
Detect idle/active transitions of traced processes via the BTF-typed
sched_switch / sched_wakeup tracepoints. tp_btf gives direct typed access
to the task_struct, which is what makes correct detection possible:

- sched_switch reads the raw prev->__state and the `preempt` flag, so a
  task that is only preempted while still runnable is not misread as
  going to sleep (the format-struct prev_state field is the encoded
  TASK_REPORT value, which is essentially never 0 and would flag every
  context switch as a sleep).
- both tracepoints resolve tgid (to filter traced Agent families) and
  tid (the actual thread) from the task_struct, and emit per-tid, so a
  multithreaded process can be ACTIVE while any one of its threads runs.

Per-tid state-dedup (no time cooldown) avoids re-emitting the same state
while still delivering every genuine transition; the LRU map self-evicts
since schedmon has no thread-exit hook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
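
The two decisions described in the commit above (sleep vs. preemption, and per-tid state-dedup) can be modeled in userspace Rust as follows. This is a sketch of the logic, not the BPF program: the function and type names are hypothetical, and a plain `HashMap` stands in for the BPF LRU map, so nothing is evicted here.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SchedState {
    Running,
    Sleeping,
}

/// Decide whether a sched_switch means `prev` went to sleep.
/// `prev_state` models the raw prev->__state (0 means TASK_RUNNING);
/// a preempted task stays runnable regardless of its state bits.
fn switched_out_to_sleep(preempt: bool, prev_state: u32) -> bool {
    !preempt && prev_state != 0
}

/// Per-tid state-dedup: report an event only when the state changes.
struct Dedup {
    last: HashMap<u32, SchedState>,
}

impl Dedup {
    fn new() -> Self {
        Dedup { last: HashMap::new() }
    }

    /// Returns true if this (tid, state) is a genuine transition to emit.
    fn observe(&mut self, tid: u32, state: SchedState) -> bool {
        self.last.insert(tid, state) != Some(state)
    }
}
```

Because the dedup keys on state rather than time, a rapid sleep/wake/sleep sequence still produces three events, while repeated switches in the same state produce one.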
Add the SchedMon probe (reuses the shared traced_processes and ring-buffer
maps, attaches the two BTF tracepoints) and the Event::Sched variant
carrying (tgid, tid, event_type). The unified parser treats Sched events
as a no-op — they are consumed by the scheduler, not parsed into messages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Group an Agent family (keyed by the Agent root PID) into a cgroup v2 cpu
cgroup and drive cpu.idle from the family's aggregate scheduling state:
ACTIVE (cpu.idle=0, cpu.weight=active_weight) while any thread is runnable,
IDLE (cpu.idle=1, SCHED_IDLE) once every thread has been blocked longer
than idle_threshold_ms. A per-tid runnable set makes "any thread runnable"
correct for multithreaded processes.

Details that the kernel forces:
- cpu.idle is the idle mechanism; we never write cpu.weight while idle
  (the kernel rejects it and ignores the value), and clear cpu.idle before
  restoring cpu.weight on the ACTIVE transition.
- cpu controller is enabled top-down from the v2 root to cgroup_root so the
  agent-* leaves actually expose cpu.idle/cpu.weight.

Robust teardown so cgroups never leak or strand processes at idle weight:
- reap_exited_families() removes a family once its cgroup.procs is empty
  (proctrace only emits exit for its own child_pids, so an AGENT_MODE root
  would otherwise never be torn down);
- remove_cgroup() evacuates any remaining (fork-without-exec) processes to
  the cgroup root before rmdir to avoid EBUSY leaks;
- a startup sweep clears empty agent-* dirs left by a previous SIGKILL, and
  Drop cleans up on graceful shutdown.

active_weight is clamped to the kernel-valid range; the idle debounce starts
only when the runnable set first empties (not on every sleep edge).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wire the scheduler end-to-end: dispatch the schedmon probe (enabled only
when the scheduler is on), route Event::Sched to on_sched_event, register a
process with its Agent family on classification (via lineage find_root) and
remove it on exit, and finalize debounced idle transitions from a shared
on_idle_tick() called by both the CLI run loop and the FFI driver loop (so
the scheduler can still reach idle in embedded mode).

Adds the --enable-scheduler CLI flag and a JSON `scheduler` config block
(active_weight, idle_threshold_ms, cgroup_root); warns when the config file's
enabled value overrides the CLI flag.

Verified on kernel 6.6.102 with a multithreaded CPU-bound agent: BURST ->
ACTIVE within ~10ms, sustained SLEEP -> IDLE within ~150ms, clean per-cycle
transitions, no cgroup leaks, no cpu.idle/weight write errors.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Distinguishes default zero-intrusion monitoring from opt-in subsystems
(e.g. the idle-burst-idle scheduler enabled by --enable-scheduler).

Sweeps all three product-summary surfaces under src/agentsight: README.md,
README_CN.md (full-width punctuation, plain-language phrasing), and the RPM
%description in agentsight.spec.in.

NOTE: depends on commit 8a7a036 in this same PR, which introduces the
--enable-scheduler flag. Do not split this README change out of alibaba#662 or
land it ahead of 8a7a036; --enable-scheduler does not exist on main yet.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The cpu.idle toggle is architecturally flawed: the userspace round-trip
(BPF→daemon→cgroup write) is slower than the kernel's first scheduling
decision on wakeup. Worse, cpu.idle=1 sets weight to WEIGHT_IDLEPRIO=3,
making the waking agent unable to preempt non-idle tasks until
userspace restores cpu.idle=0. Net effect: acceleration anti-pattern.

Keep: BPF schedmon probes (sched_switch/sched_wakeup), per-tid activity
tracking state machine, debounced idle/active transitions, unit tests.

Remove: all cgroup operations (create/remove/write cpu.idle/cpu.weight/
migrate pid), Drop cleanup, sweep_stale_cgroups, active_weight and
cgroup_root config fields.

Add: per-family transition metrics (idle_to_active_count,
active_to_idle_count, duration tracking) for capacity planning.

Revert docs commit that scoped "zero-intrusion" claim — no longer
needed since the monitor is purely observational.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
B1: procmon.bpf.c used get_task_ns_pid() for event->pid but host tgid
for ppid — inconsistent in containers. Use host tgid (from
bpf_get_current_pid_tgid()) for both, matching proctrace convention.

B2: a root agent's exit only triggered ProcMonEvent::Exit (not proctrace
VariableEvent::Exit), so its lineage tree node was never cleaned up. Call
lineage_tree.remove() and activity_monitor.remove_process() in the
ProcMonEvent::Exit handler.

I1: LineageTree::remove() now reparents children to grandparent instead
of orphaning them (mirrors kernel subreaper behavior).

Found via workflow kernel-code cross-reference review against
cloud-kernel 6.6 branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
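
As an illustration of fix I1 in the commit above, here is a minimal sketch of a remove that reparents children to the grandparent. The `Node` fields and the `insert` helper are assumptions made to keep the example self-contained; only the `LineageTree::remove()` name comes from the commit message.

```rust
use std::collections::HashMap;

#[derive(Debug)]
struct Node {
    ppid: Option<u32>,
    children: Vec<u32>,
}

#[derive(Default)]
struct LineageTree {
    nodes: HashMap<u32, Node>,
}

impl LineageTree {
    /// Insert `pid` and link it under `ppid` if that parent is tracked.
    fn insert(&mut self, pid: u32, ppid: Option<u32>) {
        if let Some(p) = ppid {
            if let Some(parent) = self.nodes.get_mut(&p) {
                parent.children.push(pid);
            }
        }
        self.nodes.insert(pid, Node { ppid, children: Vec::new() });
    }

    /// Remove `pid`, handing its children to its parent (their grandparent).
    /// If the removed node had no tracked parent, the children become roots.
    fn remove(&mut self, pid: u32) {
        let node = match self.nodes.remove(&pid) {
            Some(n) => n,
            None => return,
        };
        let grandparent = node.ppid.filter(|p| self.nodes.contains_key(p));
        if let Some(gp) = grandparent {
            if let Some(g) = self.nodes.get_mut(&gp) {
                g.children.retain(|&c| c != pid); // unlink the removed node
                g.children.extend(node.children.iter().copied());
            }
        }
        for child in &node.children {
            if let Some(c) = self.nodes.get_mut(child) {
                c.ppid = grandparent; // no child keeps a dangling ppid
            }
        }
    }
}
```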
@jfeng18 jfeng18 force-pushed the feat/idle-burst-scheduler branch from 7ab3d31 to c599945 Compare June 6, 2026 13:24
jfeng18 added a commit to jfeng18/anolisa that referenced this pull request Jun 6, 2026
jfeng18 commented Jun 10, 2026

Contributor Author

Superseded by #822 — rewrote from cgroup-based scheduler to observation-only activity monitor per review feedback.

Labels

component:sight src/agentsight/
