Skip to content

Add task progress watchdog for stuck workers#948

Open
googs1025 wants to merge 4 commits into
agentscope-ai:mainfrom
googs1025:feat/issue-947-progress-watchdog
Open

Add task progress watchdog for stuck workers#948
googs1025 wants to merge 4 commits into
agentscope-ai:mainfrom
googs1025:feat/issue-947-progress-watchdog

Conversation

@googs1025

@googs1025 googs1025 commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

摘要

  • 新增 Manager 任务进展 watchdog 脚本,对最新任务进展块计算指纹,并在 state.json 中记录重复进展、缺失进展、显式阻塞和长时间运行状态。
  • 支持 Expected next update: <UTC ISO> 长任务宽限窗口,避免大型编译、测试、训练、集成任务在正常运行期间被过早判断为疑似卡住。
  • 将 watchdog 接入 Manager heartbeat 指引和 task-management 参考文档,并更新各 Worker runtime 的 progress 指引。
  • 新增聚焦 shell 测试,覆盖正常进展、重复进展、缺失 progress log、显式 blocker 和 long-running grace。

Closes #947.

测试计划

  • bash tests/test-26-progress-watchdog.sh
  • bash -n manager/agent/skills/task-management/scripts/check-progress-watchdog.sh
  • bash -n tests/test-26-progress-watchdog.sh
  • git diff --check upstream/main..HEAD

@github-actions

github-actions Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

📊 CI Metrics Report

Summary

Metric Current Baseline Change
LLM Calls 167 81 +86 ↑ +106.2%
Input Tokens 4806319 2803871 +2002448 ↑ +71.4%
Output Tokens 53508 16791 +36717 ↑ +218.7%
Total Tokens 4859827 2820662 +2039165 ↑ +72.3%

By Role

Role Metric Current Baseline Change
🧠 Manager LLM Calls 83 68 +15 ↑ +22.1%
Input Tokens 2694244 2502214 +192030 ↑ +7.7%
Output Tokens 17531 13725 +3806 ↑ +27.7%
Total Tokens 2711775 2515939 +195836 ↑ +7.8%
🔧 Workers LLM Calls 84 13 +71 ↑ +546.2%
Input Tokens 2112075 301657 +1810418 ↑ +600.2%
Output Tokens 35977 3066 +32911 ↑ +1073.4%
Total Tokens 2148052 304723 +1843329 ↑ +604.9%

Per-Test Breakdown

Test Mgr Calls Wkr Calls Δ Calls Mgr In Wkr In Mgr Out Wkr Out Δ Tokens Trend
02-create-worker 6 0 -6 ↓ -50.0% 180687 0 929 0 -177006 ↓ -49.4% ✅ improved
03-assign-task 9 5 -1 ↓ -6.7% 267990 113808 1997 833 -89028 ↓ -18.8% ✅ improved
04-human-intervene 13 11 +11 ↑ +84.6% 369406 244178 2067 1269 +183922 ↑ +42.5% ⚠️ regressed
05-heartbeat 6 1 0 — 0% 197223 25977 1633 82 -50337 ↓ -18.3% — unchanged
06-multi-worker 49 67 +82 ↑ +241.2% 1678938 1728112 10905 33793 +2171614 ↑ +169.6% ⚠️ regressed

Trends

2 test(s) improved (fewer LLM calls)
⚠️ 2 test(s) regressed (more LLM calls)


Generated by HiClaw CI on 2026-06-22 16:17:07 UTC


📦 Download debug logs & test artifacts

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Manager-side “task progress watchdog” to detect stuck finite tasks by fingerprinting the latest progress-log block across heartbeat cycles, then wiring that signal into the Manager’s heartbeat guidance and task-management references.

Changes:

  • Introduce check-progress-watchdog.sh to read latest task progress, compute a fingerprint, and update watchdog fields in ~/state.json.
  • Update Manager heartbeat instructions and task-management docs to incorporate watchdog status (normal/repeated/blocked/unknown).
  • Add a focused shell test covering normal, repeated, missing-progress, and blocked-progress scenarios.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/test-26-progress-watchdog.sh New unit-style shell test for watchdog behavior and state.json updates.
manager/agent/skills/task-management/SKILL.md Adds heartbeat guidance to use the watchdog; updates reference wording.
manager/agent/skills/task-management/scripts/check-progress-watchdog.sh New watchdog script: extracts latest progress block, fingerprints it, updates state.json, emits JSON status.
manager/agent/skills/task-management/references/state-management.md Documents watchdog script usage and the new state.json fields/statuses.
manager/agent/HEARTBEAT.md Adds watchdog invocation and escalation guidance to the finite-task heartbeat flow.
changelog/current.md Records the manager feature addition for the next release notes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test-26-progress-watchdog.sh Outdated
#!/bin/bash
# test-26-progress-watchdog.sh - Unit-style test for Manager task progress watchdog

set -euo pipefail

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The test no longer uses set -e; watchdog invocations are wrapped in run_watchdog so non-zero exits print an explicit failure message.

Comment on lines +13 to +14
- **Always use `manage-state.sh` to modify state.json** — never edit manually with jq. The script handles atomicity, deduplication, and initialization
- **Use the progress watchdog during heartbeat** — `check-progress-watchdog.sh` compares task progress logs across heartbeat cycles so you can distinguish normal long-running work from repeated or stale progress

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The gotcha now says to use task-management scripts for state writes, with manage-state.sh owning task lifecycle fields and check-progress-watchdog.sh owning watchdog snapshots.

previous_count=$(jq -r --arg id "${TASK_ID}" '.active_tasks[] | select(.task_id == $id) | .stale_heartbeat_count // 0' "${STATE_FILE}")
now="$(_ts)"

if printf '%s\n' "${latest_block}" | grep -Eiq '\b(blocked|blocker)\b'; then

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Replaced the non-portable grep -E word-boundary pattern with an awk tokenizer that matches blocked / blocker as standalone words.

Comment on lines +151 to +153
(.active_tasks[] | select(.task_id == $id)) |= (
(if $action == "progress_changed" or $action == "progress_blocked" then .last_progress_at = $now else . end)
| .last_progress_fingerprint = $fingerprint
@github-actions

Copy link
Copy Markdown
Contributor

❌ Integration Tests Failed (llm-interaction-2 / mgr=copaw / wk=hermes)

Commit: 286ba3f
Workflow run: #1365

Test Results
No test output captured.
Debug Log (tail)
No debug logs available.

📦 Download full debug logs & test artifacts

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comment on lines +133 to +135
jq --arg id "${TASK_ID}" --arg now "${now}" '
(.active_tasks[] | select(.task_id == $id)) |= (
.stale_heartbeat_count = ((.stale_heartbeat_count // 0) + 1)
--arg expected_next_update_at "${expected_next_update_at}" \
--arg progress_changed "${progress_changed}" \
--argjson count "${count}" '
(.active_tasks[] | select(.task_id == $id)) |= (

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. The normal/repeated/blocked/long-running update path now also uses the finite selector, and the previous fingerprint/count reads are finite-scoped as well. Added duplicate task_id regression coverage for both progress-present and missing-progress paths.

@googs1025 googs1025 force-pushed the feat/issue-947-progress-watchdog branch from ffe829c to 8bf39c2 Compare June 21, 2026 03:59
@googs1025 googs1025 force-pushed the feat/issue-947-progress-watchdog branch from 8bf39c2 to add0eb2 Compare June 22, 2026 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add task progress watchdog for stuck workers

2 participants