
feat(sight): add ReactiveExporter for observe→act pipeline#804

Open
jfeng18 wants to merge 4 commits into alibaba:main from jfeng18:feat/reactive-exporter

@jfeng18 jfeng18 commented Jun 9, 2026

Contributor

What

The first piece of dynamic orchestration in ANOLISA: agentsight can now react to observed events, not just record them.

ReactiveExporter is a new GenAIExporter that inspects each LLM call event for critical signals and triggers actions:

  • Critical interruption (crash / OOM / SIGKILL in error field) → spawn ws-ckpt checkpoint to save workspace state
  • Token waste advisory (>50K input tokens with zero prompt caching) → log a recommendation
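The two rules above can be sketched as plain predicates. This is a minimal illustration, not the PR's code: `Signal`, `classify_error`, `wastes_tokens`, and the marker list are hypothetical names, and the matching is a naive case-insensitive substring check, which the real detector may well refine.

```rust
/// Hypothetical outcome of inspecting one LLM call's error field.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Signal {
    CriticalInterruption,
    None,
}

/// Illustrative markers taken from the description (crash / OOM / SIGKILL).
/// Naive substring matching: a word such as "room" would also match "oom".
const CRITICAL_MARKERS: [&str; 3] = ["crash", "oom", "sigkill"];

/// Classifies the optional error text of a call.
pub fn classify_error(error: Option<&str>) -> Signal {
    let text = match error {
        Some(t) => t.to_ascii_lowercase(),
        None => return Signal::None,
    };
    if CRITICAL_MARKERS.iter().any(|m| text.contains(m)) {
        Signal::CriticalInterruption
    } else {
        Signal::None
    }
}

/// Per-call advisory rule: more than 50K input tokens and no cached tokens.
pub fn wastes_tokens(input_tokens: u64, cached_tokens: u64) -> bool {
    input_tokens > 50_000 && cached_tokens == 0
}
```

Keeping detection as pure functions like these makes it testable without spawning ws-ckpt.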

Why

From the north-star ("max power inference + dynamic workflow orchestration") gap analysis: agentsight scores 2/10 on "dynamic orchestration" because modules only observe — they never act. This PR is the 0→1: the first time a module's observation triggers another module's action.

Design

  • Non-blocking: export() does try_send on a bounded channel (cap 32). Background thread runs ws-ckpt.
  • Timeout-protected: try_wait poll loop with 10s deadline + kill. A stuck ws-ckpt never blocks the thread or Drop.
  • Debounced: at most 1 checkpoint per interval (default 30s). Prevents storm cascades.
  • Graceful: ws-ckpt not installed → new() returns None → not registered (zero cost).
  • Default disabled: needs explicit config. Zero impact on existing deployments.
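A minimal sketch of the non-blocking and debounce points, using only the standard library. The capacity of 32 and the 30s interval come from the description; `CheckpointRequest`, `reactive_channel`, `try_enqueue`, and `Debouncer` are hypothetical names and not necessarily how the PR structures it.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::time::{Duration, Instant};

/// Hypothetical message passed from export() to the background worker.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CheckpointRequest {
    pub reason: String,
}

/// Bounded channel; capacity 32 follows the PR description.
pub fn reactive_channel() -> (SyncSender<CheckpointRequest>, Receiver<CheckpointRequest>) {
    sync_channel(32)
}

/// Never blocks the caller: returns false if the queue is full
/// or the worker side has been dropped.
pub fn try_enqueue(tx: &SyncSender<CheckpointRequest>, reason: &str) -> bool {
    match tx.try_send(CheckpointRequest { reason: reason.to_string() }) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
    }
}

/// Allows at most one action per `interval`.
pub struct Debouncer {
    interval: Duration,
    last: Option<Instant>,
}

impl Debouncer {
    pub fn new(interval: Duration) -> Self {
        Self { interval, last: None }
    }

    /// Returns true and records `now` if the interval has elapsed
    /// since the last allowed action; otherwise returns false.
    pub fn allow(&mut self, now: Instant) -> bool {
        match self.last {
            Some(prev) if now.saturating_duration_since(prev) < self.interval => false,
            _ => {
                self.last = Some(now);
                true
            }
        }
    }
}
```

Dropping a request when the queue is full trades completeness for latency: the observed process is never slowed down, and the debounce window would discard most of a burst anyway.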

Config

{ "reactive": { "enabled": true, "debounce_secs": 30, "workspace": "/root" } }

Testing

  • ✅ 8 unit tests (detection, advisory, disabled, integration: ws-ckpt spawn + timeout + debounce + Drop)
  • ✅ Full regression suite (546 tests)
  • ✅ ECS E2E: registers in trace mode, zero false positives on normal traffic
  • ✅ ECS integration test: crash event → spawn → timeout → kill → debounce → Drop clean (13.2s)
  • ✅ Adversarial review (7 agents): 1 confirmed finding (timeout hang) → fixed

🤖 Generated with Claude Code

@jfeng18 jfeng18 requested a review from chengshuyi as a code owner June 9, 2026 12:06
@github-actions github-actions Bot added the component:sight src/agentsight/ label Jun 9, 2026

jfeng18 commented Jun 9, 2026

Contributor Author

E2E verification: full signal chain on live ECS

Triggered a real RetryStorm (6 identical invalid API calls → 5+ auth_error in one conversation) with the reactive exporter enabled and ws-ckpt daemon running:

--- interruption_events ---
auth_error|b59780476b41bbe7f6e4461c7199e096
retry_storm|b59780476b41bbe7f6e4461c7199e096

--- ws-ckpt snapshots ---
WORKSPACE            SNAPSHOT                         CREATED             MESSAGE
---------------------------------------------------------------------------------
/root/test-workspace auto-20260609T140312-retry_storm 2026-06-09 14:03:12 reactive: retry_storm (conv=b59780476b41bbe7f6e4461c7199e096)

--- reactive log ---
[2026-06-09T14:03:12Z INFO  agentsight::genai::reactive] [reactive] checkpoint created: auto-20260609T140312-retry_storm

Full chain verified: eBPF capture → interruption detection → RetryStorm insert → notify_interruption → background thread → ws-ckpt spawn → real btrfs snapshot created on disk.

jfeng18 and others added 4 commits June 10, 2026 10:54
The first piece of dynamic orchestration in ANOLISA: agentsight can now
react to observed events, not just record them.

ReactiveExporter is a GenAIExporter that inspects each LLM call event
for critical signals and triggers actions:

  - Critical interruption (crash/OOM/SIGKILL in error) → spawn ws-ckpt
    checkpoint to save workspace state automatically
  - Token waste advisory (>50K input with no prompt caching) → log a
    recommendation

Design:
  - Non-blocking: export() does try_send on a bounded channel (cap 32);
    background thread owns the receiver and runs ws-ckpt
  - Timeout-protected: try_wait poll loop with 10s deadline + kill,
    so a stuck ws-ckpt never blocks the thread or Drop
  - Debounced: at most 1 checkpoint per configurable interval (default
    30s), preventing storm-triggered cascades
  - Graceful: if ws-ckpt is not installed, new() returns None and no
    exporter is registered (zero runtime cost)
  - Default disabled: requires explicit config to activate
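The timeout-protected wait described above could look roughly like the following. It is a sketch built on `std::process::Child::try_wait`; the helper name `wait_with_deadline` and the 50 ms poll interval are assumptions, and the caller would pass the 10s deadline mentioned in the design notes.

```rust
use std::io;
use std::process::{Child, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// Waits for `child` for at most `timeout`.
/// Returns Ok(Some(status)) if the child exited on its own,
/// or Ok(None) if the deadline passed and the child was killed.
pub fn wait_with_deadline(child: &mut Child, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let deadline = Instant::now() + timeout;
    loop {
        // try_wait never blocks: it returns None while the child is running.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // The child may have exited between the poll and the kill,
            // so a kill error is ignored here.
            let _ = child.kill();
            // Reap the process so it does not linger as a zombie.
            child.wait()?;
            return Ok(None);
        }
        // Poll interval is an assumption, not taken from the PR.
        thread::sleep(Duration::from_millis(50));
    }
}
```

Because the loop always returns within roughly `timeout` plus one poll interval, a worker thread that uses it can still be joined promptly from `Drop`.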

Config (agentsight.json):
  { "reactive": { "enabled": true, "debounce_secs": 30, "workspace": "/root" } }

Tested: 8 unit tests (detection, advisory, disabled, integration with
real ws-ckpt spawn + timeout + debounce + clean Drop), full 546-test
regression, ECS E2E (registration confirmed, zero false positives on
normal traffic, integration test passes).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Three new capabilities:

1. context_overflow checkpoint: fires on "context_length_exceeded" /
   "maximum context length" errors — saves workspace before the agent
   potentially loses context or crashes.

2. interruption subscription: notify_interruption() public method lets
   unified.rs forward Critical interruptions (RetryStorm, DeadLoop)
   from the existing detection pipeline — zero detection logic
   duplication, pure event forwarding.

3. cumulative no-cache advisory: per-agent token accumulation in the
   background thread. When an agent exceeds 200K input tokens in one
   hour with no prompt caching, logs a one-time actionable advisory.
   Debounced per-agent per-hour.

Replaces the old per-call 50K check_advisory (too aggressive, no state)
with the cumulative approach (more accurate, fewer false positives).
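One way to picture the cumulative advisory is a per-agent counter with an hourly window. The 200K threshold and one-hour window come from the commit message; `AdvisoryTracker`, `AgentUsage`, and the exact reset rules (new window after an hour, running total cleared on any cache hit) are assumptions for illustration.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const WINDOW: Duration = Duration::from_secs(3600);
const THRESHOLD: u64 = 200_000;

/// Hypothetical per-agent state kept by the background thread.
struct AgentUsage {
    window_start: Instant,
    uncached_input: u64,
    advised: bool,
}

#[derive(Default)]
pub struct AdvisoryTracker {
    agents: HashMap<String, AgentUsage>,
}

impl AdvisoryTracker {
    /// Records one call. Returns true when the advisory should be logged,
    /// which happens at most once per agent per window.
    pub fn record(&mut self, agent: &str, input_tokens: u64, cached_tokens: u64, now: Instant) -> bool {
        let usage = self.agents.entry(agent.to_string()).or_insert(AgentUsage {
            window_start: now,
            uncached_input: 0,
            advised: false,
        });
        // Start a fresh window once the previous one has expired.
        if now.saturating_duration_since(usage.window_start) >= WINDOW {
            *usage = AgentUsage { window_start: now, uncached_input: 0, advised: false };
        }
        // Assumed rule: a cache hit shows caching is in use, so reset the total.
        if cached_tokens > 0 {
            usage.uncached_input = 0;
            return false;
        }
        usage.uncached_input = usage.uncached_input.saturating_add(input_tokens);
        if usage.uncached_input > THRESHOLD && !usage.advised {
            usage.advised = true;
            return true;
        }
        false
    }
}
```

With this sketch, five uncached calls of 50K tokens each cross the threshold on the fifth call, while any single 50K call stays silent.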

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- export_context_overflow_triggers_checkpoint: full pipeline test
  (export → channel → background thread → ws-ckpt spawn attempt)
- notify_interruption_triggers_checkpoint: verifies the interruption
  subscription path (unified.rs forward → checkpoint attempt)
- cumulative_advisory_fires_at_threshold: 5×50K = 250K tokens with no
  cache → advisory fires; also tests cache-hit resets and clean Drop

All three verify the background thread processes messages correctly,
doesn't hang, and shuts down cleanly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Connects the ReactiveExporter to the existing interruption detection
pipeline in unified.rs, completing the RetryStorm/DeadLoop → checkpoint
signal path.

Changes:
- Add ReactiveNotifier (lightweight, Clone, Send+Sync) holding a
  SyncSender clone. ReactiveExporter::new() now returns the tuple
  (Self, ReactiveNotifier).
- Store reactive_notifier on AgentSight struct.
- Call notify_interruption("retry_storm") after RetryStorm insert
  (guarded by exists_for_conversation dedup — fires at most once per
  conversation).
- Call notify_interruption("dead_loop") after DeadLoop insert (guarded
  by should_detect + LoopDetector.detect — fires only on genuine new
  pattern detection).

Both calls are non-blocking (try_send), properly guarded against
duplicates, and debounced by the background thread (30s default).
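A minimal sketch of the notifier idea: a cheap, cloneable handle that wraps a `SyncSender` so several call sites can forward interruptions to one worker. `ReactiveMsg` and `notifier_pair` are hypothetical, and the real `notify_interruption` may take more arguments (the E2E log above suggests a conversation id is also carried).

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Hypothetical message forwarded to the background thread.
#[derive(Debug, PartialEq, Eq)]
pub enum ReactiveMsg {
    Interruption { kind: String },
}

/// Lightweight handle; cloning it only clones the sender.
#[derive(Clone)]
pub struct ReactiveNotifier {
    tx: SyncSender<ReactiveMsg>,
}

impl ReactiveNotifier {
    /// Non-blocking: returns false if the queue is full or the worker is gone.
    pub fn notify_interruption(&self, kind: &str) -> bool {
        self.tx
            .try_send(ReactiveMsg::Interruption { kind: kind.to_string() })
            .is_ok()
    }
}

/// Builds a notifier plus the receiver that the worker thread would own.
pub fn notifier_pair(capacity: usize) -> (ReactiveNotifier, Receiver<ReactiveMsg>) {
    let (tx, rx) = sync_channel(capacity);
    (ReactiveNotifier { tx }, rx)
}
```

Since `SyncSender<T>` is `Send + Sync` for `T: Send`, a handle like this can be stored on a shared struct and called from the RetryStorm and DeadLoop paths without extra locking.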

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jfeng18 jfeng18 force-pushed the feat/reactive-exporter branch from 1d71d7c to c31c4ee Compare June 10, 2026 02:56
