diff --git a/.agents/skills/deep-research/SKILL.md b/.agents/skills/deep-research/SKILL.md
index dce039f..1da6f64 100644
--- a/.agents/skills/deep-research/SKILL.md
+++ b/.agents/skills/deep-research/SKILL.md
@@ -46,6 +46,16 @@ Iterate until evidence quality is sufficient:
 8. Run contradiction/counter-evidence checks.
 9. Synthesize and produce final report.
 
+## Re-entry Policy (Mid-Run)
+
+When called during an ongoing run (not only at run start):
+
+1. Treat invocation as valid and do not require starting a new run by default.
+2. Recompute objective delta versus current stage plan.
+3. If objective changed materially, reset research focus and run fresh query batches.
+4. If objective is similar, perform incremental deep research using existing evidence as baseline.
+5. If skipped due to sufficient evidence freshness, emit `dr_skip_reason` with explicit date windows and source counts.
+
 ## Scoping-to-Planning Handoff Policy
 
 When deep research is used for open-ended scoping (`idea-exploration`), hand off findings to `research-plan` as the required default next step. Skip only if the user explicitly opts out.
@@ -234,10 +244,11 @@ Degrade rules:
 
 ## Memory and Search Policy
 
-1. Memory lookup is optional and situational.
-2. Use memory when likely to reduce repeated search effort.
-3. Use search/deep research directly when topic is new, urgent, or time-sensitive.
-4. If memory is skipped, note reason in report trail.
+1. Global memory bootstrap (from `run-governor` / `research-workflow`) is mandatory for non-trivial runs.
+2. Within deep-research, additional memory retrieval is optional and situational.
+3. Use incremental memory retrieval when it can reduce repeated search effort or contradiction resolution cost.
+4. Use search/deep research directly when topic is new, urgent, or time-sensitive.
+5. If incremental memory retrieval is skipped, note reason in report trail.
 
 ## Type-Aware Reporting Requirements
 
diff --git a/.agents/skills/experiment-execution/SKILL.md b/.agents/skills/experiment-execution/SKILL.md
index 0396a75..455149d 100644
--- a/.agents/skills/experiment-execution/SKILL.md
+++ b/.agents/skills/experiment-execution/SKILL.md
@@ -81,7 +81,7 @@ Retry behavior should be mode-aware and evidence-driven.
 
 1. Choose control mode: direct SSH, SSH+session manager, scheduler, or existing remote agent.
 2. Declare remote model: remote-native or local-driver.
-3. If project-context has remote profile, confirm reuse policy before launch.
+3. Use remote profile reuse decision from `run-governor`; if missing, request exactly one confirmation via `human-checkpoint`.
 4. Validate connectivity and runtime basics before expensive launch when uncertainty exists.
 
 ## Logging and Failure Handling
diff --git a/.agents/skills/memory-manager/SKILL.md b/.agents/skills/memory-manager/SKILL.md
index cab38b7..b9a3d00 100644
--- a/.agents/skills/memory-manager/SKILL.md
+++ b/.agents/skills/memory-manager/SKILL.md
@@ -85,6 +85,69 @@ Treat stale working state as risk:
 3. Force review before high-resource actions.
 4. Force review after interruptions or unexpected failures.
 
+## Invocation Schedule (Balanced, Non-Aggressive)
+
+1. Mandatory once-per-run operations:
+   - bootstrap `retrieve/init-working` after intake and before planning/execution
+   - close-out writeback before final task completion
+2. Trigger-based operations between bootstrap and close-out:
+   - stage transition
+   - replan
+   - significant failure or new error signature
+   - before high-resource action
+   - before final answer/report handoff
+3. Periodic `working` refresh is required when either is true:
+   - at least 15 minutes since last memory operation
+   - at least 3 execution cycles since last memory operation
+4. Cooldown:
+   - no more than one non-forced memory operation per cycle
+   - skip when state delta is negligible
+5. Anti-overuse policy:
+   - do not write memory after every command/tool call
+   - prefer compact delta updates over full rewrites
+   - skip repeated retrieval if last retrieval is fresh and task/error signature is unchanged
+6. Command-gap fallback:
+   - if 5 consecutive commands/actions complete without a memory update, force one `working` refresh.
+   - treat this as a low-cost sync update (delta-first, concise).
+7. When skipped, log `memory_skip_reason` for auditability.
+
+## Post-Compression Recovery (Required)
+
+When memory is auto-compressed/summarized:
+
+1. Immediately run a `working` re-read before the next execution step.
+2. Rebuild `working` fields from recent evidence:
+   - latest stage report
+   - latest action/observation logs
+   - latest todo diff (`todo_active/todo_done/todo_blocked`)
+3. Publish a compact "post-compression state snapshot" and continue only after snapshot is consistent.
+
+## Layered Retrieval Timing
+
+Use layer-specific retrieval timing to avoid over-calling:
+
+1. `working` retrieve:
+   - mandatory bootstrap
+   - periodic refresh by Invocation Schedule
+   - mandatory after memory compression
+2. `episode` retrieve:
+   - at run start for same project/task_type
+   - at replan or major failure to avoid repeating failed paths
+3. `procedure` retrieve:
+   - before executing a new stage plan
+   - before high-resource or irreversible actions
+   - when repeated failure indicates a known SOP may exist
+4. `insight` retrieve:
+   - during planning/replanning for hypothesis shaping
+   - when evidence conflicts or root cause is unclear
+   - before final report/answer to run contradiction/boundary checks
+5. `persona` retrieve:
+   - once at run start
+   - on interaction mode switch or explicit user preference change
+   - before final user-facing delivery for style/alignment consistency
+6. Retrieval cooldown:
+   - `procedure/insight/persona` at most once per stage unless a new trigger appears.
+
 ## Recovery on Context Drift
 
 If execution becomes repetitive or confused:
@@ -130,3 +193,4 @@ For each memory operation, emit:
 4. `Rationale`
 5. `Evidence`
 6. `Result`
+7. `Trigger` (`bootstrap|stage-change|replan|error|high-resource|periodic|close-out`)
diff --git a/.agents/skills/project-context/SKILL.md b/.agents/skills/project-context/SKILL.md
index b8a46e6..e4f6079 100644
--- a/.agents/skills/project-context/SKILL.md
+++ b/.agents/skills/project-context/SKILL.md
@@ -46,7 +46,7 @@ Do not ask for all fields at once.
 1. infer task type (`report|sft|rl|eval|generic`)
 2. load existing `context.json` + `secrets.json`
 3. auto-detect non-sensitive environment values where possible
-4. if execution target is `remote`, show stored remote profile and ask whether to reuse it
+4. if execution target is `remote`, consume reuse decision from `run-governor` first; ask only if decision is missing
 5. ask only for missing required fields for the current task
 6. during execution, allow blocker-only delta prompts (e.g. missing API URL/key)
 7. persist immediately for reuse
@@ -64,9 +64,9 @@ If new missing fields appear later, run preflight again and collect only deltas.
 
 Recommended order in research execution:
 
-1. `run-governor` initializes mode and `run_id`
-2. `run-governor` collects `local|remote` target
-3. `project-context` preflight resolves runtime context and remote reuse decision
+1. `run-governor` collects and confirms mode + `local|remote` target
+2. `run-governor` initializes `run_id`
+3. `project-context` preflight resolves runtime context and consumes remote reuse decision
 4. `experiment-execution` runs with resolved context
 5. `project-context` snapshot writes run-scoped frozen context
 
diff --git a/.agents/skills/research-workflow/SKILL.md b/.agents/skills/research-workflow/SKILL.md
index 62cf119..d6943eb 100644
--- a/.agents/skills/research-workflow/SKILL.md
+++ b/.agents/skills/research-workflow/SKILL.md
@@ -14,16 +14,18 @@ Drive AI R&D tasks with small, testable, evidence-first steps while respecting t
 For non-trivial tasks, run this order:
 
 1. Initialize run policy with `run-governor`.
-2. Understand user objective and current code/evidence state.
-3. Clarify ambiguous requirements through `human-checkpoint`.
-4. Complete intake checkpoint before planning or decomposition.
-5. Run deep research when needed.
-6. Build an execution plan (use `research-plan` for planning-heavy requests).
-7. Confirm plan as required by mode.
-8. Execute with working-memory todo tracking.
-9. Replan on major issues when needed.
-10. Emit stage reports and maintain report index.
-11. Close task, then optionally publish shared memory.
+2. Resolve runtime context with `project-context` before experiment/report/eval execution.
+3. Understand user objective and current code/evidence state.
+4. Clarify ambiguous requirements through `human-checkpoint`.
+5. Complete intake checkpoint before planning or decomposition.
+6. Run one `memory-manager` bootstrap (`retrieve/init-working`).
+7. Run deep research when needed.
+8. Build an execution plan (use `research-plan` for planning-heavy requests).
+9. Confirm plan as required by mode.
+10. Execute with trigger-based working-memory updates.
+11. Replan on major issues when needed.
+12. Emit stage reports and maintain report index.
+13. Close task, write memory close-out, then optionally publish shared memory.
 
 ## Mode-Aware Interaction Policy
 
@@ -50,6 +52,18 @@ Route required user interactions through `human-checkpoint`:
 3. Apply this routing to intake clarification, plan confirmation, replan confirmation, and parameter approvals.
 4. Log channel choice as `interaction_channel=request_user_input|plain-text-fallback` and include `fallback_reason` when used.
 
+## Mid-Run Intent Switch Gate (Mandatory)
+
+On each new user message:
+
+1. Re-evaluate objective and skill routing before executing the next pending action.
+2. If user intent shifts to research/scoping/comparison/root-cause inquiry, activate `deep-research` immediately.
+3. Do not continue stale execution plans when the objective changed materially.
+4. If `deep-research` is skipped, emit `dr_skip_reason` with freshness evidence (date/timestamp and source coverage), then continue.
+5. Cooldown:
+   - no more than one non-forced deep-research call per stage.
+   - bypass cooldown when objective changed, contradiction appears, or high-impact uncertainty remains unresolved.
+
 ## Default Execution Loop
 
 Repeat this loop until completion:
@@ -57,7 +71,7 @@ Repeat this loop until completion:
 1. Update success criteria.
 2. Collect or refresh evidence.
 3. Plan the smallest useful next action.
-4. Refresh working todo state.
+4. Refresh working todo state only when memory trigger conditions are met.
 5. Act.
 6. Observe outputs.
 7. Evaluate result quality and risk.
@@ -67,17 +81,32 @@ Repeat this loop until completion:
 
 Use these in combination:
 
-1. Treat memory as an optional accelerator, not a hard prerequisite.
-2. Use search/deep research directly when topic is time-sensitive, new, or currently blocked.
-3. For open-ended research/scoping requests, run deep research before giving decomposition or roadmap recommendations.
-4. For unknown errors, use this branch:
+1. `memory-manager` bootstrap is mandatory before planning/execution for non-trivial runs.
+2. Between bootstrap and close-out, memory operations are trigger-based and non-aggressive.
+3. Trigger memory operation when one of the following occurs:
+   - stage transition
+   - replan
+   - significant error or new error signature
+   - memory auto-compression/summarization completed
+   - before high-resource action
+   - before final answer/report handoff
+4. Periodic `working` memory refresh is required when either holds:
+   - at least 15 minutes since last memory operation
+   - at least 3 execution cycles since last memory operation
+5. Command-gap fallback: if 5 consecutive commands/actions finish without a memory update, force one concise `working` refresh.
+6. Cooldown: no more than one non-forced memory operation per cycle.
+7. Avoid per-command memory writes; batch observations into one delta update.
+8. Use search/deep research directly when topic is time-sensitive, new, or currently blocked.
+9. For open-ended research/scoping requests, run deep research before giving decomposition or roadmap recommendations.
+9.1 For mid-run new research requests, run deep research re-entry before further execution.
+10. For unknown errors, use this branch:
    - local evidence triage (logs, stack trace, recent changes)
    - targeted search
    - deep research (debug-investigation) if still unresolved
    - minimal fix validation
-5. If skipping memory before search, record reason in the stage report.
-6. If intake information is missing, trigger `human-checkpoint` before deep research or planning.
-7. If deep research was used for open-ended scoping, hand off to `research-plan` to convert findings into an execution-ready plan. Skip only if the user explicitly opts out.
+11. If skipping memory due to cooldown or low-value delta, record reason in the stage report.
+12. If intake information is missing, trigger `human-checkpoint` before deep research or planning.
+13. If deep research was used for open-ended scoping, hand off to `research-plan` to convert findings into an execution-ready plan. Skip only if the user explicitly opts out.
 
 ## Replanning Policy
 
diff --git a/.agents/skills/run-governor/SKILL.md b/.agents/skills/run-governor/SKILL.md
index 89a0f7b..78d5520 100644
--- a/.agents/skills/run-governor/SKILL.md
+++ b/.agents/skills/run-governor/SKILL.md
@@ -78,6 +78,16 @@ Hard constraints:
 6. Confirmation collection must be mediated by `human-checkpoint`.
 7. Any assumption for mode/target is non-compliant, even when likely.
 
+## Memory Bootstrap Gate
+
+Before transitioning from initialization to execution workflow:
+
+1. Set `memory_policy=balanced-triggered` unless user explicitly overrides.
+2. Ensure one `memory-manager` bootstrap operation is complete:
+   - `retrieve` or `init-working` for current project/task context.
+3. If bootstrap is missing, mark status `blocked-awaiting-memory-bootstrap`.
+4. This gate enforces only the bootstrap, not per-step memory writes.
+
 ## Run Identity and Directories
 
 Use one run identifier:
@@ -176,6 +186,7 @@ For each run-governor action, emit:
 7. `Confirmation`: `user_confirmed_mode`, `user_confirmed_execution_target`, and whether initialization is permitted (`YES|NO`)
 8. `Compliance`: `gate_status=pass|blocked`, with blocked reason when applicable
 9. `Interaction`: `interaction_transport` and optional `fallback_reason`
+10. `Memory`: `memory_policy` and `memory_bootstrap_done=YES|NO`
 
 ## Violation Recovery Policy
 
diff --git a/AGENTS.md b/AGENTS.md
index a038dee..b858727 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -17,6 +17,33 @@ This workspace is for AI research and development tasks (reproduction, debugging
 12. Follow `REPO_CONVENTIONS.md` for artifact placement and commit hygiene.
 13. If a run was initialized before confirmation, stop and run violation recovery: acknowledge, ask whether to keep/clean artifacts, and wait for explicit reconfirmation before continuing.
 
+## Memory Invocation Guardrails (Balanced)
+1. `memory-manager` is mandatory for non-trivial runs, but only as a control-plane step, not per command.
+2. Mandatory calls per non-trivial run:
+   - one bootstrap `retrieve/init-working` before planning or execution
+   - one close-out writeback before task completion
+3. Conditional calls between bootstrap and close-out are trigger-based only:
+   - stage change
+   - replan
+   - significant failure or new error signature
+   - before high-resource action
+   - before final report/answer handoff
+4. Periodic refresh is allowed when either is true:
+   - at least 15 minutes since last memory operation
+   - at least 3 execution cycles since last memory operation
+5. Cooldown rule: do not invoke `memory-manager` more than once in a cycle unless forced by safety/high-resource/failure triggers.
+6. If memory is skipped due to cooldown or low delta, record `memory_skip_reason` in the stage report.
+
+## Deep-Research Re-entry Guardrails
+1. On every new user message, re-run skill routing before continuing prior stage actions.
+2. If the new message contains research-intent signals, `deep-research` MUST be activated even mid-run.
+3. Research-intent signals include (semantic match, Chinese or English):
+   - 调研/研究/对比/综述/文献/证据/机制/根因/为什么/可行性/路线图
+   - research/investigate/compare/survey/literature/evidence/mechanism/root-cause/why/feasibility/roadmap
+4. If skipping `deep-research`, emit `dr_skip_reason` with concrete evidence freshness info (source date / timestamp), not a generic statement.
+5. Cooldown for non-forced deep-research calls:
+   - at most once per stage unless objective changed or new contradiction/high-impact uncertainty appears.
+
 ## Skill Paths
 - `.agents/skills/run-governor`
 - `.agents/skills/research-workflow`