diff --git a/.agents/skills/deep-research/SKILL.md b/.agents/skills/deep-research/SKILL.md index b18fd33..9b74329 100644 --- a/.agents/skills/deep-research/SKILL.md +++ b/.agents/skills/deep-research/SKILL.md @@ -2,7 +2,7 @@ name: deep-research description: |- Deep evidence-first research with broad discovery, verification, and traceable citations. Prefer invoking via research-workflow. - TRIGGER when (MANDATORY — you MUST invoke this skill, no exceptions): user message contains ANY of these keywords or synonyms — 调研/研究/对比/综述/文献/证据/机制/根因/为什么/可行性/路线图/分析/探索, or research/investigate/compare/survey/literature/evidence/mechanism/root-cause/why/feasibility/roadmap/analyze/explore — or asks to verify claims, analyze tradeoffs, scope a new topic, or conduct literature review. Also MUST trigger mid-run when new research-intent signals appear. If in doubt whether to trigger, trigger. Skipping when keywords match is a routing violation. + TRIGGER when (MANDATORY — you MUST invoke this skill, no exceptions): user message contains ANY of these keywords or synonyms — 调研/研究/对比/综述/文献/证据/机制/根因/为什么/可行性/路线图/分析/探索, or research/investigate/compare/survey/literature/evidence/mechanism/root-cause/why/feasibility/roadmap/analyze/explore — or asks to verify claims, analyze tradeoffs, scope a new topic, or conduct literature review. Also use this skill as the default gateway for external search. Skipping when keywords match is a routing violation. DO NOT TRIGGER when: user asks for paper-writing output (use paper-writing), experiment launch (use experiment-execution), or plan-only without evidence (use research-plan). --- @@ -10,7 +10,18 @@ description: |- ## Mission -Produce a deeply researched, evidence-grounded answer with clear provenance and actionable conclusions. +Produce a deeply researched, evidence-grounded answer with clear provenance and actionable conclusions, and act as the default gateway for external search in research runs. 
+ +## Search Routing Gate (Mandatory) + +All external search during non-trivial research runs must enter through `deep-research`. + +Rules: + +1. do not bypass `deep-research` with ad hoc direct search when fresh external evidence is needed +2. `deep-research` may choose a lighter or deeper execution depth internally, but it may not silently skip actual search +3. every `deep-research` run must perform real WebSearch calls and keep an auditable query trail +4. if search is skipped because existing evidence is already fresh enough, emit `dr_skip_reason` with explicit date windows and source counts ## Research Type Selection @@ -26,40 +37,64 @@ If templates do not fit exactly, adapt structure freely but keep depth, verifica ## Intake Checkpoint Gate (Mandatory Before Search) -Before selecting mode or running any WebSearch queries: +Before selecting depth or running any WebSearch queries: + +1. confirm `intake_checkpoint_complete=YES` +2. intake must at least define: objective/scope, constraints, and expected deliverable format +3. route missing-information requests through `human-checkpoint` +4. in `moderate` or `detailed`, prefer the built-in user-question tool (`request_user_input`) +5. if the built-in tool is unavailable, degrade to concise plain-text questions +6. if intake is incomplete, remain in clarification phase and do not run search, decomposition, or synthesis + +## Frontier-First Scout (Mandatory) + +Every `deep-research` run must begin with a `frontier-first scout` before final depth selection. + +Scout requirements: -1. Confirm `intake_checkpoint_complete=YES`. -2. Intake must at least define: objective/scope, constraints, and expected deliverable format. -3. Route missing-information requests through `human-checkpoint`. -4. In `moderate` or `detailed`, prefer built-in user-question tool (`request_user_input`). -5. If built-in tool is unavailable, degrade to concise plain-text questions. -6. 
If intake is incomplete, remain in clarification phase and do not run search, decomposition, or synthesis. +1. run 6-10 queries in total +2. cover at least: + - `bleeding-edge` topic queries + - `frontier` topic queries + - one verification query family + - one counter-evidence or criticism query family +3. capture representative freshness, source quality, and contradiction density +4. use scout evidence to choose final depth + +Scout rules: + +1. scout is mandatory even when a lighter depth is later selected +2. scout counts toward total query budget +3. scout may justify upgrading to `deep` or downgrading to `light` +4. scout may not justify "no search" ## Default Workflow Iterate until evidence quality is sufficient: -1. Confirm intake checkpoint is complete. -2. Restate objective and success criteria. -3. Set explicit `As of: YYYY-MM-DD`. -4. Run staged time-window search with Codex WebSearch. -5. Extract claim-level evidence. -6. Build key-work cards when the topic is paper-centric. -7. Verify high-impact claims independently. -8. Run contradiction/counter-evidence checks. -9. Synthesize and produce final report. +1. confirm intake checkpoint is complete +2. restate objective and success criteria +3. set explicit `As of: YYYY-MM-DD` +4. run the mandatory frontier-first scout +5. select execution depth +6. run staged time-window search with Codex WebSearch +7. extract claim-level evidence +8. build key-work cards when the topic is paper-centric +9. verify high-impact claims independently +10. run contradiction/counter-evidence checks +11. synthesize and produce final report When the topic has implementation, benchmark, reproduction, or planning implications, also apply [references/codebase-and-data-research-rules.md](references/codebase-and-data-research-rules.md). ## Re-entry Policy (Mid-Run) -When called during an ongoing run (not only at run start): +When called during an ongoing run: -1. 
Treat invocation as valid and do not require starting a new run by default. -2. Recompute objective delta versus current stage plan. -3. If objective changed materially, reset research focus and run fresh query batches. -4. If objective is similar, perform incremental deep research using existing evidence as baseline. -5. If skipped due to sufficient evidence freshness, emit `dr_skip_reason` with explicit date windows and source counts. +1. treat invocation as valid and do not require starting a new run by default +2. recompute objective delta versus current stage plan +3. if objective changed materially, reset research focus and run fresh query batches +4. if objective is similar, perform incremental deep research using existing evidence as baseline +5. if skipped due to sufficient evidence freshness, emit `dr_skip_reason` with explicit date windows and source counts ## Scoping-to-Planning Handoff Policy @@ -67,9 +102,9 @@ When deep research is used for open-ended scoping (`idea-exploration`), hand off Handoff expectations: -1. Preserve core hypotheses, constraints, and evidence-backed tradeoffs. -2. Identify recommended direction and at least one fallback direction. -3. Convert conclusions into executable planning inputs (experiments, implementation prerequisites, data/workload requirements, risks). +1. preserve core hypotheses, constraints, and evidence-backed tradeoffs +2. identify recommended direction and at least one fallback direction +3. convert conclusions into executable planning inputs (experiments, implementation prerequisites, data/workload requirements, risks) ## Completion Gate (Mandatory) @@ -79,16 +114,20 @@ Before synthesis, print: 1. `intake_checkpoint_complete=YES|NO` 2. `intake_channel=request_user_input|plain-text-fallback|none` -3. `selected_mode=quick|default-auditable|deep` -4. `mode_reason=` -5. `total_queries=` -6. `bleeding_edge_queries=` -7. `frontier_queries=` -8. `recent_queries=` -9. `mid_term_queries=` -10. `classic_queries=` -11. 
`degrade_used=YES|NO` -12. `gate_pass=YES|NO` +3. `search_entry=deep-research` +4. `frontier_first_scout=YES|NO` +5. `selected_depth=light|default-auditable|deep` +6. `depth_reason=` +7. `dr_degrade_reason=` +8. `total_queries=` +9. `scout_queries=` +10. `bleeding_edge_queries=` +11. `frontier_queries=` +12. `recent_queries=` +13. `mid_term_queries=` +14. `classic_queries=` +15. `degrade_used=YES|NO` +16. `gate_pass=YES|NO` If `degrade_used=YES`, also print: @@ -98,86 +137,63 @@ If `degrade_used=YES`, also print: 4. `degrade_queries_run=` 5. `degrade_reason=` -Gate thresholds must be evaluated against the selected mode's minimums. - If `gate_pass=NO`, continue searching and do not finalize. -## Query Budget and Depth Rules +## Search Depth Rules -Support three depth modes: -Select one mode before search starts and record the reason. +Support three execution depths: -1. `quick`: - - total: 20-30 - - stage minimums: `bleeding-edge >= 5`, `frontier >= 4`, `recent >= 4`, `mid-term >= 3`, `classic >= 2` -2. `default-auditable`: - - total: target 60 (acceptable 50-80) +1. `light` + - only for narrow, low-ambiguity verification after scout + - total: 12-24 queries + - stage minimums: `bleeding-edge >= 3`, `frontier >= 3`, `recent >= 2`, `mid-term >= 1`, `classic >= 1` +2. `default-auditable` + - default for bounded but non-trivial research questions + - total: target 50-80 queries - stage minimums: `bleeding-edge >= 12`, `frontier >= 10`, `recent >= 10`, `mid-term >= 8`, `classic >= 6` -3. `deep`: - - total: 100-140 +3. `deep` + - use for broad or open-ended exploration, roadmap design, deep comparisons, or high-uncertainty topics + - total: 100-140 queries - stage minimums: `bleeding-edge >= 28`, `frontier >= 22`, `recent >= 20`, `mid-term >= 16`, `classic >= 10` -Mode selection precedence: - -1. User override wins if explicitly specified (for example: `mode=quick|default-auditable|deep`). -2. 
If user does not specify, auto-select using scope and research-intent signals (not risk-first). - - `quick`: only for simple, single-point, directly verifiable questions (definition checks, yes/no fact checks, one-paper claim verification). - - `default-auditable`: default for all non-simple research questions with bounded scope. - - `deep`: prioritize when scope is broad or open-ended, especially for research idea exploration ("can X and Y be combined", "how to design a roadmap", "landscape + recipe + tradeoffs"). -3. If ambiguous, do not choose `quick`; choose `default-auditable` or `deep` based on breadth. -4. Practical guardrail: if the task asks for representative works plus training recipes/mechanisms, use `deep` by default. -5. `quick` hard disqualifiers (if any item is true, `quick` is forbidden): - - user asks for research/landscape/survey/roadmap/recipe/mechanism comparison - - user asks whether two methods can be combined and how to do it - - paper-centric deep-dive policy is triggered - - task requires contradiction analysis instead of a single factual verification -6. Mandatory auto-selection algorithm when user does not specify mode: - - step A: check `quick` hard disqualifiers; if any true, candidate mode must be `default-auditable` or `deep` - - step B: if request is open-ended idea exploration (for example "can X and Y be combined", "give landscape + recipe + tradeoffs"), select `deep` - - step C: otherwise select `default-auditable` -7. Language robustness rule: map intent by semantics, not language surface. Treat equivalent Chinese and English phrases (for example "研究"/"research", "调研"/"investigate", "综述"/"survey", "路线图"/"roadmap", "可不可以结合"/"can X and Y be combined", "怎么做"/"how to implement") as identical depth signals. - -If any stage minimum for the selected mode is missed, continue searching before synthesis. - -## Mode Sanity Check (Mandatory Before Search) - -Print this mini-check immediately after selecting mode: - -1. `mode_candidate=` -2. 
`quick_disqualifiers_hit=` +Selection rules: + +1. user override wins if explicitly specified +2. if the user does not specify, default to `default-auditable` +3. select `deep` when scope is broad, open-ended, contradiction-heavy, or asks for landscape plus recipe or mechanism analysis +4. `light` is not a default mode +5. `light` is allowed only when scout confirms the task is narrow, directly verifiable, and low-ambiguity +6. if ambiguous, do not choose `light` +7. if the prompt mentions 2 or more research-intent terms, do not choose `light` unless the user explicitly forces it + +## Depth Sanity Check (Mandatory Before Full Search) + +Print this mini-check immediately after selecting depth: + +1. `depth_candidate=` +2. `light_disqualifiers_hit=` 3. `open_ended_exploration=YES|NO` 4. `paper_centric=YES|NO` -5. `mode_sanity_pass=YES|NO` +5. `depth_sanity_pass=YES|NO` Rules: -1. If `quick_disqualifiers_hit` is non-empty and `mode_candidate=quick`, set `mode_sanity_pass=NO` and reselect mode before any query. -2. If `open_ended_exploration=YES` and user did not explicitly force `quick`, do not use `quick`. -3. If `paper_centric=YES` and user asks for mechanisms/recipes/comparisons, do not use `quick`. - -## Mode Regression Examples (Use as Tie-Breakers) - -1. Prompt: "验证论文 X 的某个具体结论是否成立" / "Verify whether claim X in paper Y holds" -> expected `quick` -2. Prompt: "帮我调研 A 方法和 B 方法的差异与适用边界" / "Compare method A vs B and their boundaries" -> expected `default-auditable` -3. Prompt: "用 deep research 研究 SFT 和 RL 能不能结合,给训练路线" / "Use deep research to study whether SFT and RL can be combined and propose a recipe" -> expected `deep` -4. Prompt: "给出这个方向的重要论文和方法演进,并提供落地 recipe" / "Provide key papers, method evolution, and an implementation recipe" -> expected `deep` -5. Prompt: "最近 3 个月某模型价格是否变动" / "Did this model's price change in the last 3 months?" -> expected `quick` -6. 
Prompt: "写一份该技术路线的文献综述(含反证)" / "Write a literature review with contradictions/counter-evidence" -> expected `default-auditable` or `deep` (prefer `deep` when open-ended) -7. Multilingual intent-trigger rule: if 2+ intent terms appear, never `quick`. - - Chinese terms: "研究", "调研", "综述", "路线图", "机制", "对比", "可不可以结合", "怎么做" - - English terms: "research", "investigate", "survey", "landscape", "roadmap", "mechanism", "compare", "can be combined", "how to implement" +1. if `light_disqualifiers_hit` is non-empty and `depth_candidate=light`, set `depth_sanity_pass=NO` and reselect before more search +2. if `open_ended_exploration=YES` and user did not explicitly force `light`, do not use `light` +3. if `paper_centric=YES` and the user asks for mechanisms/recipes/comparisons, do not use `light` ## Search Execution Policy (Codex Native) -1. Use Codex WebSearch directly in-session; do not require external browser interaction. -2. Do not depend on external search APIs for baseline operation. -3. Treat date text in query strings as recall hints only; do not rely on parser-specific `after:`/`before:` behavior for final stage assignment. -4. Use date-window targeting during retrieval (for example recency filters and window-scoped query batches), then assign stage by automatic published-date validation. -5. Compute `days_from_as_of` for each source and map to exactly one stage using the stage boundary rules below. -6. If source date is unknown, keep with uncertainty label and lower priority. -7. Do not claim deep-research completion without actual WebSearch calls and auditable query logs. +1. use Codex WebSearch directly in-session; do not require external browser interaction +2. do not depend on external search APIs for baseline operation +3. treat date text in query strings as recall hints only; do not rely on parser-specific `after:`/`before:` behavior for final stage assignment +4. use date-window targeting during retrieval, then assign stage by published-date validation +5. 
compute `days_from_as_of` for each source and map to exactly one stage using the stage boundary rules below +6. if source date is unknown, keep with uncertainty label and lower priority +7. do not claim deep-research completion without actual WebSearch calls and auditable query logs +8. prioritize `bleeding-edge`, then `frontier`, then `recent` whenever the user cares about the latest or fastest-moving evidence -## Staged Time Windows (Paper-Centric) +## Staged Time Windows Use five mandatory evidence stages and record source counts for each. Define `days_from_as_of = as_of_date - published_date` (integer days). Stages are mutually exclusive: @@ -188,72 +204,56 @@ Define `days_from_as_of = as_of_date - published_date` (integer days). Stages ar 4. `mid-term` (366-730 days): `366 <= days_from_as_of <= 730` 5. `classic` (>730 days): `days_from_as_of > 730` -When discussing "latest" evidence, prioritize `bleeding-edge`, then `frontier`, then `recent`. - -Allocate budget by stage (must bias to newer windows): - -1. `bleeding-edge`: 15-25% of total queries -2. `frontier`: 12-22% of total queries -3. `recent`: 16-26% of total queries -4. `mid-term`: 12-22% of total queries -5. `classic`: 8-15% of total queries - Freshness floor: -1. `bleeding-edge + frontier >= 35%` (normal) -2. `bleeding-edge + frontier + recent >= 60%` (normal and degraded) +1. `bleeding-edge + frontier >= 35%` for normal runs +2. `bleeding-edge + frontier + recent >= 60%` for all finalized runs ## Stage Search Sequence Per stage, run at least these query families: 1. canonical topic terms -2. synonym/alias expansion +2. synonym or alias expansion 3. counter-evidence and criticism 4. verification queries for high-impact claims Use dynamic query-family expansion: -1. Build seed terms from user question terms and canonical topic terms. -2. Expand with method aliases discovered from high-confidence retrieved sources. -3. Do not hard-code universal mandatory method keywords for all topics. 
- -Round definitions: - -1. A round is one expansion pass for a stage and may add 1-3 queries per stage. -2. Query-family coverage is checked at stage completion, not required in every single round. +1. build seed terms from user question terms and canonical topic terms +2. expand with aliases discovered from high-confidence retrieved sources +3. do not hard-code universal mandatory method keywords for all topics -Minimum rounds by mode: +Minimum rounds by depth: -1. `quick`: `bleeding-edge/frontier/recent >= 2` rounds, `mid-term/classic >= 1` round -2. `default-auditable`: `bleeding-edge/frontier/recent >= 3` rounds, `mid-term >= 2` rounds, `classic >= 1` round -3. `deep`: `bleeding-edge/frontier/recent >= 4` rounds, `mid-term >= 3` rounds, `classic >= 2` rounds +1. `light`: `bleeding-edge/frontier/recent >= 1` round, `mid-term/classic >= 1` round +2. `default-auditable`: `bleeding-edge/frontier/recent >= 3` rounds, `mid-term >= 2` rounds, `classic >= 1` round +3. `deep`: `bleeding-edge/frontier/recent >= 4` rounds, `mid-term >= 3` rounds, `classic >= 2` rounds -## Stage Deficit Degrade Policy (Allowed with Exhaustion) +## Stage Deficit Degrade Policy If a stage minimum is not met, allow controlled degradation only after an exhaustion pass. -Exhaustion pass minimums (per deficit stage): +Exhaustion pass minimums per deficit stage: -1. `quick`: at least 8 additional stage-targeted queries +1. `light`: at least 6 additional stage-targeted queries 2. `default-auditable`: at least 18 additional stage-targeted queries 3. `deep`: at least 32 additional stage-targeted queries -4. The additional queries must cover all four query families and at least one extra expansion round beyond mode minimum. Degrade rules: -1. Only adjacent fallback is allowed: `bleeding-edge -> frontier`, `frontier -> recent`, `recent -> mid-term`. -2. At most one degrade hop per stage. -3. Borrowed amount cannot exceed 50% of the deficit stage minimum. -4. 
Even after degradation, keep `bleeding-edge + frontier >= 30%` and `bleeding-edge + frontier + recent >= 60%`. +1. only adjacent fallback is allowed: `bleeding-edge -> frontier`, `frontier -> recent`, `recent -> mid-term` +2. at most one degrade hop per stage +3. borrowed amount cannot exceed 50% of the deficit stage minimum +4. even after degradation, keep `bleeding-edge + frontier >= 30%` and `bleeding-edge + frontier + recent >= 60%` ## Memory and Search Policy -1. Global memory bootstrap (from `run-governor` / `research-workflow`) is mandatory for non-trivial runs. -2. Within deep-research, additional memory retrieval is optional and situational. -3. Use incremental memory retrieval when it can reduce repeated search effort or contradiction resolution cost. -4. Use search/deep research directly when topic is new, urgent, or time-sensitive. -5. If incremental memory retrieval is skipped, note reason in report trail. +1. global memory bootstrap is mandatory for non-trivial runs +2. before heavy search batches, use the current memory snapshot or retrieve relevant `insight`/`procedure` memory when it can reduce redundant search or contradiction cost +3. when scout or full search uncovers a repeated issue already covered by memory, incorporate that memory explicitly rather than rediscovering it silently +4. use search directly and aggressively when the topic is new, urgent, or time-sensitive +5. if a lighter depth is chosen, report why `default-auditable` was not needed ## Type-Aware Reporting Requirements @@ -261,7 +261,7 @@ Always include: 1. objective and scope 2. evidence-based conclusions -3. contradictions/uncertainties +3. contradictions and uncertainties 4. anchored citations 5. research trail summary 6. saved report path @@ -270,9 +270,8 @@ Type-specific emphasis: 1. `debug-investigation` - include error signature, reproduction context, fix candidates, validation outcomes - - benchmark/matrix sections are optional unless directly relevant 2. 
`design-decision` - - compare alternatives, constraints, cost/risk tradeoffs + - compare alternatives, constraints, and cost/risk tradeoffs 3. `implementation-strategy` - include staged rollout options and operational prerequisites 4. `conflict-resolution` @@ -280,43 +279,37 @@ Type-specific emphasis: 5. `idea-exploration` - include landscape, mechanisms, opportunities, and boundaries -## Representative Works Deep-Dive Policy (Mandatory for Paper-Centric Topics) +## Representative Works Deep-Dive Policy Trigger this policy when user asks for any of: -1. "important works", "representative papers", "state of the art", "research landscape" -2. method comparison across papers (for example: SFT vs RLHF vs DPO) -3. roadmap/recipe requests grounded in prior work +1. important works, representative papers, state of the art, or research landscape +2. method comparison across papers +3. roadmap or recipe requests grounded in prior work When triggered, include a dedicated `Key Works Deep Dive` section and meet minimum coverage: -1. `quick`: 3-5 key works +1. `light`: 3-5 key works 2. `default-auditable`: 6-10 key works 3. `deep`: 10-15 key works -For each key work, provide all required fields: +For each key work, provide: -1. problem addressed (1-2 lines) -2. method/training objective (with concrete loss/optimization framing if available) -3. setup and data regime (what supervision/reward signal is used) +1. problem addressed +2. method or training objective +3. setup and data regime 4. headline results and where they hold 5. limitations or failure boundary -6. why this work matters to the user's question -7. citation(s), with primary source required for every key work - -Depth constraints: - -1. Do not list papers as one-line bullets only. -2. Keep at least two works with explicit contradiction or negative evidence discussion. -3. Prefer tables for side-by-side comparison, then add short narrative synthesis. +6. why the work matters to the user's question +7. 
primary citation ## Evidence and Citation Policy -1. Cite in text as `[[S#]](#ref-s#)`. -2. Keep references anchored with published and accessed dates. -3. Distinguish fact, inference, and uncertainty. -4. Prefer canonical primary sources. -5. Do not rely on weak secondary sources for core conclusions. +1. cite in text as `[[S#]](#ref-s#)` +2. keep references anchored with published and accessed dates +3. distinguish fact, inference, and uncertainty +4. prefer canonical primary sources +5. do not rely on weak secondary sources for core conclusions ## Quality Gate @@ -327,52 +320,14 @@ Finalize only when: 3. citations are complete and internally consistent 4. report depth matches task type 5. language matches user language -6. if paper-centric policy is triggered, key-work count meets selected mode minimum -7. if paper-centric policy is triggered, each key work has required fields and a primary citation +6. if paper-centric policy is triggered, key-work count meets selected depth minimum +7. every finalized search run records both scout and full-query totals 8. if degradation is used, exhaustion minimums and freshness floor are explicitly satisfied and reported -9. mode choice passed `Mode Sanity Check`; if not, rerun with corrected mode before finalizing +9. the selected depth passed `Depth Sanity Check` ## Persistence Policy -1. Always output full report in chat. -2. Save exactly one final report file per deep-research run. -3. Default save path under run logs: +1. always output full report in chat +2. save exactly one final report file per deep-research run +3. default save path under run logs: - `/logs/runs//reports/deep-research-.md` -4. If save fails, report failure reason and still provide full report in chat. - -## Required Output Structure - -Include at minimum: - -1. As-of Date and Scope -2. Intake Checkpoint Status -3. Gate Check -4. Executive Synthesis -5. Comprehensive Analysis -6. Key Works Deep Dive (when paper-centric policy is triggered) -7. 
Type-Specific Section(s) -8. Research Trail Summary -9. Conclusion and Next Step -10. Saved Report Path and Save Status -11. References - -Additionally include stage coverage counters: - -1. bleeding_edge_sources= -2. frontier_sources= -3. recent_sources= -4. mid_term_sources= -5. classic_sources= - -Also include two auditable tables: - -1. Query Log with fields: `query_id`, `stage`, `query_text`, `date_filter`, `hits_used` -2. Source Log with fields: `source_id`, `title`, `url`, `published_date`, `stage`, `primary_or_secondary` - -If paper-centric policy is triggered, include a third auditable table: - -3. Key Works Matrix with fields: `work_id`, `method_family`, `supervision_signal`, `optimization_type`, `main_gain`, `known_risk`, `best_use_case` - -If degradation is used, include a fourth auditable table: - -4. Degrade Log with fields: `stage`, `required_min`, `achieved_before_degrade`, `additional_queries_run`, `fallback_stage`, `borrowed_count`, `reason` diff --git a/.agents/skills/deep-research/agents/openai.yaml b/.agents/skills/deep-research/agents/openai.yaml index 9220d0b..0113638 100644 --- a/.agents/skills/deep-research/agents/openai.yaml +++ b/.agents/skills/deep-research/agents/openai.yaml @@ -1,4 +1,4 @@ interface: display_name: "Deep Research" - short_description: "Auditable deep research with mandatory key-work deep dives for paper-centric tasks." - default_prompt: "Use deep research to produce evidence-first reports with staged time-window search, contradiction checks, and detailed key-work analysis when the topic is research/paper centric." + short_description: "Default external-search gateway with frontier-first scouting and auditable deep research." 
+ default_prompt: "Use deep research as the default gateway for external search, start with a frontier-first scout, choose light/default/deep execution depth explicitly, run auditable staged searches with contradiction checks, and produce evidence-first reports with detailed key-work analysis when the topic is research or paper centric." diff --git a/.agents/skills/experiment-execution/SKILL.md b/.agents/skills/experiment-execution/SKILL.md index ad620d3..1a23bf2 100644 --- a/.agents/skills/experiment-execution/SKILL.md +++ b/.agents/skills/experiment-execution/SKILL.md @@ -66,14 +66,81 @@ If setup is clear and safe, direct execution is allowed. 4. Resolve only blocking gaps. 5. Launch smallest valid step first when uncertainty is high. 6. Record commands, node assignments, log paths, run IDs. -7. Replan on major failures. +7. If the launched action is long-running, immediately enter watch mode instead of treating launch as completion. +8. After each poll, continue with monitoring, diagnosis, recovery, or result collection; do not default to "job started, come back later." +9. Replan on major failures. + +## Watch Mode Policy + +Long-running experiment execution is an active responsibility, not a fire-and-forget step. + +After launching a long-running job: + +1. stay in watch mode by default +2. poll logs, checkpoints, scheduler state, or metrics on a model-chosen cadence +3. after each poll: + - if `running`, choose the next sleep interval and continue watching + - if `completed`, inspect outputs and continue validation/analysis + - if `stalled`, inspect evidence, retrieve memory, and attempt recovery or replan + - if `failed`, diagnose immediately and attempt the smallest safe recovery +4. ask the user only for hard blockers, major safety/resource approvals, or true decision points +5. only allow explicit fire-and-forget behavior when the user clearly requested it + +## Watch-Loop Execution Template + +Use this template after each experiment poll: + +1. 
read `status`, `followup_action`, `progress_changed`, and `last_log_tail` +2. branch immediately: + - `continue-watch` or `wait-and-poll` + - choose the next sleep interval + - keep monitoring + - `collect-results` + - inspect outputs, metrics, checkpoints, and artifacts + - continue validation and analysis + - `diagnose-stall` + - inspect logs + - retrieve `procedure` and `episode` + - attempt the smallest safe recovery + - `diagnose-failure` + - inspect failure evidence + - retrieve memory + - attempt recovery or replan + - `replan` + - update route and continue execution +3. write working state update before the next wait or recovery attempt +4. do not stop at "job is still running" unless fire-and-forget was explicitly requested + +## Short Iterative Evaluation Loop + +Short local edit-and-evaluate cycles must be handled as an owned execution loop, not as a one-shot task. + +When the task is iterative optimization: + +1. compile an evaluation ladder: + - baseline or previous-best reference + - primary regression set + - promotion gate for larger evaluation + - final target evaluation +2. prefer broader representative sets over a few hand-picked cases +3. after each batch: + - run the current gate set + - compare score against baseline and best-so-far + - inspect regressions, not just aggregate score + - decide `iterate`, `replan`, or `promote-to-next-gate` +4. if the new result is the best-so-far and the user requested preservation, snapshot the relevant prompt/config/code/results before the next risky change +5. if the current gate is unmet, do not stop merely because one iteration completed cleanly +6. only hand back to the user when: + - compiled targets are met + - a true hard blocker remains + - a safety/resource gate requires approval ## Unknown Error Branch When execution fails with unknown error: 1. local evidence triage (stack, logs, env, recent diffs) -2. optional memory retrieval if likely useful +2. 
retrieve relevant `procedure` and `episode` memory 3. targeted search 4. deep research (debug-investigation) if unresolved 5. apply smallest fix and validate @@ -97,6 +164,7 @@ Record stable paths for: 4. artifacts On failures, record owner and cleanup plan. +On stalled jobs, record recovery attempt and next watch step. ## Data Analysis Visualization Policy @@ -114,6 +182,7 @@ Do not launch full run when required inputs are still unknown and not explicitly In `full-auto`, continue only if risk is acceptable and no major safety issue exists. In `full-auto`, if remote profile is complete, reuse it by default unless explicitly overridden. +For iterative optimization tasks, do not stop after a single batch while the active evaluation gate or non-regression guard is still unmet. ## Output Contract @@ -146,4 +215,9 @@ analysis_artifacts: figures: next_action: checkpoint_needed: +goal_status: + primary_target: + active_gate: + best_so_far: + done_allowed: ``` diff --git a/.agents/skills/experiment-execution/agents/openai.yaml b/.agents/skills/experiment-execution/agents/openai.yaml index 69ef0d8..959f952 100644 --- a/.agents/skills/experiment-execution/agents/openai.yaml +++ b/.agents/skills/experiment-execution/agents/openai.yaml @@ -1,4 +1,4 @@ interface: display_name: "Experiment Execution" short_description: "Run experiments with mode-aware validation and traceable outputs." - default_prompt: "Use experiment execution to launch local/remote runs with run_id paths, conditional smoke checks, and evidence-backed error recovery." + default_prompt: "Use experiment execution to launch local or remote runs with run_id paths, conditional smoke checks, and evidence-backed error recovery. 
For iterative optimization tasks, compile an evaluation ladder with baseline, active regression gate, promotion gate, and final target, compare each batch against baseline and best-so-far, preserve strong variants when requested, and keep iterating until the active gate is satisfied or a true blocker remains. For long-running jobs, stay in watch mode and branch on followup_action: continue-watch or wait-and-poll means choose the next sleep interval and keep monitoring; collect-results means inspect outputs and continue validation against the compiled gates; diagnose-stall or diagnose-failure means inspect evidence, retrieve memory, and attempt the smallest safe recovery; replan means update the route and continue." diff --git a/.agents/skills/memory-manager/SKILL.md b/.agents/skills/memory-manager/SKILL.md index af17265..ce328e4 100644 --- a/.agents/skills/memory-manager/SKILL.md +++ b/.agents/skills/memory-manager/SKILL.md @@ -2,15 +2,15 @@ name: memory-manager description: |- Manage long-term AI R&D memory: retrieval, writeback, promotion, and shared export. - TRIGGER when: run bootstrap (retrieve), task completion (writeback), stage change, replan, significant failure, before high-resource action, before final report, or compaction markers detected (Compact/压缩/Summary). - DO NOT TRIGGER when: already called this cycle (cooldown), unless forced by safety/failure/high-resource triggers. + TRIGGER when: run bootstrap, each new user turn, each execution batch, significant failure, replan, high-resource action, long-action resume, final report handoff, or compaction markers detected (Compact/压缩/Summary). + DO NOT TRIGGER when: the exact same retrieval was just performed, freshness is still valid, and no new objective/stage/error signal appeared. --- # Memory Manager ## Mission -Build compounding capability by turning execution traces into reusable, evidence-linked memory. 
+Build compounding capability by turning execution traces into reusable, evidence-linked memory, with retrieval centered on prior experience rather than only current working state. ## Load References @@ -24,11 +24,19 @@ Load these files before writing or promoting records: Manage these layers: -1. `working`: run-scoped current state and todo tracking. -2. `episode`: concrete run case record. -3. `procedure`: reusable SOP from repeated success. -4. `insight`: cross-task abstraction with boundaries. -5. `persona`: behavior config only. +1. `working` + - run-scoped continuity state + - resume after compaction, interruption, or long waits +2. `episode` + - concrete run case records + - useful for similar errors, repeated attempts, and local history +3. `procedure` + - highest-priority execution memory + - default retrieval layer before acting +4. `insight` + - cross-task abstraction, tradeoffs, boundaries, and contradiction handling +5. `persona` + - behavior config only ## Working Memory Contract @@ -42,25 +50,47 @@ Manage these layers: 6. `next_step` 7. `blockers` 8. `evidence_refs` -9. `todo_active` -10. `todo_done` -11. `todo_blocked` +9. `active_action_ids` +10. `todo_active` +11. `todo_done` +12. `todo_blocked` Todo granularity should be task-level (small stages/subtasks), not command-level. -## Retrieval Policy +## Experience-First Retrieval Policy -Retrieve early when useful, but do not block execution: +Prior experience retrieval is the default. `working` is important for continuity, but it is not the only retrieval path and should not crowd out reusable experience. -1. Query by `project`, `task_type`, `error_signature` first. -2. Upgrade retrieval from optional to mandatory before continuing when either of these triggers is present: - - you are modifying `memory-manager` or another Memory-related skill/instruction - - a status, state, or context file contains compaction markers such as `Compact`, `压缩`, `Summary`, or similar summary/compression techniques -3. 
In mandatory-retrieval cases, read prior Memory first and treat the result as required context recovery rather than a best-effort lookup. -4. Add tags and FTS when exact filters miss. -5. Prefer `active` procedures/insights when confidence is similar. -6. Flag stale entries with low confidence. -7. If retrieval is low-yield and task is time-sensitive, continue with search/deep research directly only when the mandatory-retrieval triggers are absent. +Mandatory retrieval triggers: + +1. every new user turn +2. every execution batch before acting +3. every replan +4. every significant failure or new error signature +5. every high-resource or irreversible action +6. every long-action resume or post-poll decision +7. before final answer or report handoff +8. when modifying `memory-manager` or another Memory-related skill/instruction +9. when compaction markers such as `Compact`, `压缩`, or `Summary` appear + +Default retrieval order: + +1. `procedure` + - mandatory before every execution batch +2. `episode` + - mandatory when a similar failure, repeated attempt, or same task type is present +3. `insight` + - mandatory during planning, tradeoff analysis, contradiction handling, or final answer shaping +4. `working` + - mandatory for resume, compaction recovery, long-action reconciliation, and final handoff + +Query strategy: + +1. query by `project`, `task_type`, `error_signature`, and stage first +2. add tags and FTS when exact filters miss +3. prefer `active` procedures/insights when confidence is similar +4. prefer recent local episodes over shared memory unless local retrieval is clearly low-yield +5. if retrieval is low-yield, keep going, but record `memory_skip_reason` or `memory_low_yield_reason` ## Shared Retrieval Policy @@ -80,13 +110,15 @@ Treat shared memory as an optional read-only source, not as project-local memory ## Writeback Policy -Write conservatively and continuously: +Write conservatively, but more frequently than before: -1. 
Update `working` on each meaningful state transition. -2. Write `episode` at milestones, major failure, replan, or human intervention. -3. Create `procedure` draft after repeated successful pattern. -4. Create `insight` draft after cross-task recurring evidence. -5. Store evidence pointers, not narrative only. +1. write a concise `working` delta after every execution batch +2. write a concise `working` delta after every long-action poll cycle that changes status or next step +3. write `episode` at milestones, major failure, replan, or human intervention +4. create `procedure` draft after repeated successful pattern or validated recovery workflow +5. create `insight` draft after cross-task recurring evidence +6. store evidence pointers, not narrative only +7. when a completed long-running action produces results that affect later decisions, record the result summary before leaving watch mode ## Error-Resolution Memory @@ -98,124 +130,123 @@ For significant errors, capture: 4. observed outcomes 5. final fix (if any) 6. unresolved hypotheses +7. retrieved procedures/episodes that influenced the fix ## Working Freshness Rules -Treat stale working state as risk: +Treat stale continuity state as risk: -1. Refresh after plan changes, tool-call batches, or diagnosis updates. -2. Review at least every 15 minutes in active execution. -3. Force review before high-resource actions. -4. Force review after interruptions or unexpected failures. +1. refresh after plan changes, tool-call batches, or diagnosis updates +2. refresh after long-action polls that change status +3. review at least every 15 minutes in active execution +4. force review before high-resource actions +5. force review after interruptions or unexpected failures -## Invocation Schedule (Balanced, Non-Aggressive) +## Invocation Schedule (Experience-First, Frequent but Targeted) 1. 
Mandatory once-per-run operations: - bootstrap `retrieve/init-working` after intake and before planning/execution - close-out writeback before final task completion -2. Trigger-based operations between bootstrap and close-out: - - stage transition - - replan - - significant failure or new error signature - - before high-resource action - - before final answer/report handoff -3. Periodic `working` refresh is required when either is true: - - at least 15 minutes since last memory operation - - at least 3 execution cycles since last memory operation -4. Cooldown: - - no more than one non-forced memory operation per cycle - - skip when state delta is negligible -5. Anti-overuse policy: - - do not write memory after every command/tool call - - prefer compact delta updates over full rewrites - - skip repeated retrieval if last retrieval is fresh and task/error signature is unchanged -6. Command-gap fallback: - - if 5 consecutive commands/actions complete without a memory update, force one `working` refresh. - - treat this as a low-cost sync update (delta-first, concise). -7. When skipped, log `memory_skip_reason` for auditability. +2. Mandatory per-turn operations: + - retrieve relevant experience on every new user turn +3. Mandatory per-batch operations: + - retrieve `procedure` before every execution batch + - write `working` delta after every execution batch +4. Mandatory trigger-based operations: + - retrieve `episode` on problem, failure, repeated attempt, or new error signature + - retrieve `insight` on planning/replanning/tradeoff/final answer + - retrieve `procedure` plus `episode` before high-resource actions + - reread `working` during resume, compaction recovery, long-action reconciliation, and final handoff + - retrieve `procedure` plus `episode` immediately after stalled or failed poll outcomes + - retrieve `insight` after completed poll outcomes when interpretation or next-step selection is needed +5. 
Cooldown: + - skip only duplicate retrievals when objective, stage, and error signature are unchanged and the same hit set is still fresh + - cooldown does not suppress a new-trigger retrieval +6. When skipped, log `memory_skip_reason` for auditability. ## Post-Compression Recovery (Required) When memory is auto-compressed/summarized: -1. Immediately run a `working` re-read before the next execution step. -2. Rebuild `working` fields from recent evidence: +1. immediately run a `working` reread before the next execution step +2. rebuild `working` fields from recent evidence: - latest stage report - latest action/observation logs - latest todo diff (`todo_active/todo_done/todo_blocked`) -3. Publish a compact "post-compression state snapshot" and continue only after snapshot is consistent. + - active long-action records +3. publish a compact post-compression state snapshot and continue only after snapshot is consistent ## Layered Retrieval Timing -Use layer-specific retrieval timing to avoid over-calling: +Use layer-specific timing to keep retrieval frequent but useful: -1. `working` retrieve: - - mandatory bootstrap - - periodic refresh by Invocation Schedule - - mandatory after memory compression +1. `procedure` retrieve: + - before every execution batch + - before high-resource or irreversible actions + - after stalled or failed background jobs 2. `episode` retrieve: - at run start for same project/task_type - - at replan or major failure to avoid repeating failed paths -3. `procedure` retrieve: - - before executing a new stage plan - - before high-resource or irreversible actions - - when repeated failure indicates a known SOP may exist -4. `insight` retrieve: + - at replan or major failure + - when repeated failure indicates recent local history may help +3. 
`insight` retrieve: - during planning/replanning for hypothesis shaping - when evidence conflicts or root cause is unclear - - before final report/answer to run contradiction/boundary checks + - before final report/answer to run boundary checks +4. `working` retrieve: + - bootstrap + - resume/reconcile + - after memory compression + - before final handoff 5. `persona` retrieve: - once at run start - on interaction mode switch or explicit user preference change - - before final user-facing delivery for style/alignment consistency -6. Retrieval cooldown: - - `procedure/insight/persona` at most once per stage unless a new trigger appears. + - before final user-facing delivery ## Recovery on Context Drift If execution becomes repetitive or confused: -1. Rebuild working state from action and observation logs. -2. Run targeted retrieval by project/task/error signature. -3. If drift followed a compaction step or summary-style recovery, read prior Memory before publishing or trusting a compact state summary. -4. Publish compact state summary before continuing. +1. rebuild working state from action and observation logs +2. run targeted retrieval by project/task/error signature +3. if drift followed compaction or summary-style recovery, read prior Memory before publishing or trusting a compact state summary +4. publish compact state summary before continuing ## Compaction Recovery Policy When context may have been compressed: -1. Inspect available status/state/context files for markers such as `Compact`, `压缩`, `Summary`, or equivalent summary/compression techniques. -2. If any marker is present, call `memory-manager` to read prior Memory before editing instructions, planning next actions, or resuming execution. -3. If prior Memory cannot be read, treat that as an active blocker because key context may be missing. -4. Record the compaction trigger and retrieval result in working state or the next stage report. +1. 
inspect available status/state/context files for markers such as `Compact`, `压缩`, `Summary`, or equivalent summary/compression techniques +2. if any marker is present, call `memory-manager` to read prior Memory before editing instructions, planning next actions, or resuming execution +3. if prior Memory cannot be read, treat that as an active blocker because key context may be missing +4. record the compaction trigger and retrieval result in working state or the next stage report ## Promotion Policy Promote only with evidence: -1. `procedure draft -> active` after successful reuse and stable boundaries. -2. `insight draft -> active` after multi-episode support. -3. Require human review for safety-critical or expensive procedures. -4. Deprecate entries when contradictions accumulate. +1. `procedure draft -> active` after successful reuse and stable boundaries +2. `insight draft -> active` after multi-episode support +3. require human review for safety-critical or expensive procedures +4. deprecate entries when contradictions accumulate ## Shared Export Policy Treat shared export as post-task work: -1. Do not export during main task execution. -2. Export only verified/high-value records. -3. Never export noisy `working` state. -4. Require `human-checkpoint` before publishing. -5. Sync the shared repo before export so dedupe/conflict checks run against the latest branch tip. +1. do not export during main task execution +2. export only verified/high-value records +3. never export noisy `working` state +4. require `human-checkpoint` before publishing +5. sync the shared repo before export so dedupe/conflict checks run against the latest branch tip ## Shared Repository Contract When exporting: -1. Target `https://github.com/TenureAI/open-research-memory`. -2. Use pull-based flow: local export -> `codex/*` branch -> PR -> review -> merge. -3. Never push directly to `main`. -4. Enforce schema and required sections. +1. 
target `https://github.com/TenureAI/open-research-memory` +2. use pull-based flow: local export -> `codex/*` branch -> PR -> review -> merge +3. never push directly to `main` +4. enforce schema and required sections ## Shared Retrieval Helper @@ -235,9 +266,11 @@ python3 .agents/skills/memory-manager/scripts/shared_memory_retrieval.py \ For each memory operation, emit: 1. `Run` -2. `Action` (retrieve/write/promote/deprecate/export) +2. `Action` (`retrieve|write|promote|deprecate|export`) 3. `Target` -4. `Rationale` -5. `Evidence` -6. `Result` -7. `Trigger` (`bootstrap|stage-change|replan|error|high-resource|periodic|close-out`) +4. `Layers` +5. `Rationale` +6. `Query` +7. `Hits` +8. `Working Update` +9. `memory_skip_reason` when applicable diff --git a/.agents/skills/memory-manager/agents/openai.yaml b/.agents/skills/memory-manager/agents/openai.yaml index efad7ac..08d7434 100644 --- a/.agents/skills/memory-manager/agents/openai.yaml +++ b/.agents/skills/memory-manager/agents/openai.yaml @@ -1,4 +1,4 @@ interface: display_name: "Memory Manager" - short_description: "Maintain working todo memory and reusable research records." - default_prompt: "Use memory manager to keep working state fresh, track active/done/blocked todos, write evidence-linked episode/procedure/insight records, and use the shared memory repo only as a read-only retrieval source unless an approved export is happening." + short_description: "Retrieve reusable experience aggressively and keep working continuity durable." + default_prompt: "Use memory manager to retrieve procedures, episodes, and insights aggressively before acting, keep working state durable for resume/recovery, write concise working deltas after each batch, and use the shared memory repo only as a read-only retrieval source unless an approved export is happening." 
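The memory-operation status contract above can be sketched as a small builder. This is an illustrative assumption — no `build_memory_status` helper ships with the scripts in this patch — but it shows how the `Run`/`Action`/`Target`/`Layers`/`Rationale`/`Query`/`Hits`/`Working Update` fields and the conditional `memory_skip_reason` fit together:

```python
def build_memory_status(run, action, target, layers, rationale,
                        query="", hits=None, working_update="",
                        memory_skip_reason=""):
    """Assemble one auditable status record for a memory operation.

    Hypothetical helper mirroring the Status Output Contract fields.
    """
    # Action must be one of the contract's enumerated operations.
    allowed_actions = {"retrieve", "write", "promote", "deprecate", "export"}
    if action not in allowed_actions:
        raise ValueError(f"unknown action: {action}")
    record = {
        "Run": run,
        "Action": action,
        "Target": target,
        "Layers": layers,
        "Rationale": rationale,
        "Query": query,
        "Hits": hits or [],
        "Working Update": working_update,
    }
    # memory_skip_reason is only emitted when the operation was skipped.
    if memory_skip_reason:
        record["memory_skip_reason"] = memory_skip_reason
    return record


# Example: a pre-batch procedure retrieval (values are illustrative).
status = build_memory_status(
    run="run-001",
    action="retrieve",
    target="procedure",
    layers=["procedure", "episode"],
    rationale="pre-batch experience retrieval",
    query="remote launch recovery",
    hits=["procedures/launch-recovery.md"],
)
```

A skipped duplicate retrieval would instead pass `memory_skip_reason`, which is the only field the contract treats as optional.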
diff --git a/.agents/skills/memory-manager/references/memory-layout.md b/.agents/skills/memory-manager/references/memory-layout.md index 7d18c52..81617a8 100644 --- a/.agents/skills/memory-manager/references/memory-layout.md +++ b/.agents/skills/memory-manager/references/memory-layout.md @@ -7,6 +7,8 @@ Use this layout during research execution: working/ state.yaml todo.yaml + actions/ + .yaml reports/ index.md stage-*.md @@ -34,3 +36,4 @@ Notes: 2. Keep long-term memory in `.project_local//memory/` plus index metadata in `index.db`. 3. Treat old `memory/` and `.agent/memory.db` layouts as legacy and migrate when touched. 4. Shared memory repos live outside `.project_local` and are treated as read-only retrieval sources, not as run-scoped state. +5. Long-running action records in `actions/` are part of continuity recovery and should be considered when rebuilding `working`. diff --git a/.agents/skills/memory-manager/scripts/memory_store.py b/.agents/skills/memory-manager/scripts/memory_store.py new file mode 100644 index 0000000..261bb2f --- /dev/null +++ b/.agents/skills/memory-manager/scripts/memory_store.py @@ -0,0 +1,230 @@ +#!/usr/bin/env python3 +"""Helpers for local memory retrieval and working-state persistence.""" + +from __future__ import annotations + +import ast +import json +import os +import re +import uuid +from dataclasses import dataclass +from datetime import datetime, timezone +from pathlib import Path +from typing import Any, Dict, Iterable, List, Tuple + + +ROOT_FOLDERS = { + "episodes": "episode", + "procedures": "procedure", + "insights": "insight", +} + + +@dataclass +class Record: + path: Path + metadata: Dict[str, Any] + body: str + + +def utc_now() -> str: + return datetime.now(timezone.utc).replace(microsecond=0).isoformat() + + +def parse_scalar(value: str) -> Any: + if not value: + return "" + if value[0] in {'"', "'"} and value[-1] == value[0]: + return value[1:-1] + if value.startswith("[") and value.endswith("]"): + try: + parsed = 
ast.literal_eval(value) + except (SyntaxError, ValueError): + return value + return parsed if isinstance(parsed, list) else value + lowered = value.lower() + if lowered == "true": + return True + if lowered == "false": + return False + try: + if "." in value: + return float(value) + return int(value) + except ValueError: + return value + + +def parse_frontmatter(text: str) -> Tuple[Dict[str, Any], str]: + if not text.startswith("---\n"): + return {}, text + lines = text.splitlines() + end_idx = None + for idx in range(1, len(lines)): + if lines[idx].strip() == "---": + end_idx = idx + break + if end_idx is None: + return {}, text + metadata: Dict[str, Any] = {} + for line in lines[1:end_idx]: + if not line.strip() or ":" not in line: + continue + key, raw_value = line.split(":", 1) + metadata[key.strip()] = parse_scalar(raw_value.strip()) + body = "\n".join(lines[end_idx + 1 :]).strip() + return metadata, body + + +def normalize_terms(text: str) -> List[str]: + return [term for term in re.split(r"[^a-z0-9_+-]+", text.lower()) if len(term) >= 2] + + +def read_structured(path: Path) -> Dict[str, Any]: + if not path.exists(): + return {} + text = path.read_text(encoding="utf-8").strip() + if not text: + return {} + try: + payload = json.loads(text) + return payload if isinstance(payload, dict) else {} + except json.JSONDecodeError: + try: + import yaml # type: ignore + except ImportError: + raise RuntimeError(f"Structured file is not JSON and PyYAML is unavailable: {path}") + payload = yaml.safe_load(text) + return payload if isinstance(payload, dict) else {} + + +def write_structured(path: Path, payload: Dict[str, Any]) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + tmp = path.with_name(f"{path.name}.{os.getpid()}.{uuid.uuid4().hex}.tmp") + with tmp.open("w", encoding="utf-8") as handle: + json.dump(payload, handle, ensure_ascii=True, indent=2, sort_keys=True) + handle.write("\n") + os.replace(tmp, path) + + +def resolve_memory_root(memory_root: str, 
project_root: str, project_slug: str) -> Path: + if memory_root: + return Path(memory_root).resolve() + root = Path(project_root).resolve() + if not project_slug: + project_local = root / ".project_local" + if project_local.exists(): + children = [child for child in project_local.iterdir() if child.is_dir()] + if len(children) == 1: + project_slug = children[0].name + if not project_slug: + raise ValueError("Unable to resolve project slug; pass --memory-root or --project-slug") + return root / ".project_local" / project_slug / "memory" + + +def load_records(memory_root: Path) -> List[Record]: + records: List[Record] = [] + for folder, expected_type in ROOT_FOLDERS.items(): + root = memory_root / folder + if not root.exists(): + continue + for path in sorted(root.rglob("*.md")): + text = path.read_text(encoding="utf-8") + metadata, body = parse_frontmatter(text) + if not metadata: + continue + declared_type = str(metadata.get("type", "")).strip() + if declared_type and declared_type != expected_type: + continue + metadata.setdefault("type", expected_type) + metadata.setdefault("tags", []) + records.append(Record(path=path, metadata=metadata, body=body)) + return records + + +def matches_filters(record: Record, filters: Dict[str, Any]) -> bool: + metadata = record.metadata + record_type = str(metadata.get("type", "")).strip() + if filters.get("type") and record_type != filters["type"]: + return False + if filters.get("task_type") and str(metadata.get("task_type", "")).strip() != filters["task_type"]: + return False + if filters.get("project") and str(metadata.get("project", "")).strip() != filters["project"]: + return False + if filters.get("error_signature"): + if filters["error_signature"].lower() not in str(metadata.get("error_signature", "")).lower(): + return False + tags = {str(tag).lower() for tag in metadata.get("tags", []) if str(tag).strip()} + requested_tags = {str(tag).lower() for tag in filters.get("tags", [])} + if requested_tags and not 
requested_tags.issubset(tags):
+        return False
+    return True
+
+
+def score_record(record: Record, query_terms: Iterable[str]) -> Tuple[int, List[str]]:
+    terms = list(query_terms)  # materialize once so a generator input is not consumed before the final check
+    metadata = record.metadata
+    title = str(metadata.get("title", "")).lower()
+    tags = " ".join(str(tag).lower() for tag in metadata.get("tags", []))
+    error_signature = str(metadata.get("error_signature", "")).lower()
+    project = str(metadata.get("project", "")).lower()
+    task_type = str(metadata.get("task_type", "")).lower()
+    body = record.body.lower()
+    record_type = str(metadata.get("type", "")).lower()
+    status = str(metadata.get("status", "")).lower()
+
+    score = 0
+    matched: List[str] = []
+    for term in terms:
+        term_score = 0
+        if term in title:
+            term_score += 4
+        if term in tags:
+            term_score += 3
+        if term in error_signature:
+            term_score += 3
+        if term in project or term in task_type:
+            term_score += 2
+        if term in body:
+            term_score += 1
+        if term_score:
+            matched.append(term)
+        score += term_score
+
+    if record_type == "procedure":
+        score += 3
+    elif record_type == "episode":
+        score += 2
+    elif record_type == "insight":
+        score += 1
+
+    if status == "active":
+        score += 2
+    elif status == "draft":
+        score += 1
+
+    if not terms:
+        score = max(score, 1)
+
+    return score, sorted(set(matched))
+
+
+def format_record(record: Record, score: int, matched_terms: List[str], memory_root: Path) -> Dict[str, Any]:
+    body_preview = " ".join(record.body.split())
+    if len(body_preview) > 220:
+        body_preview = body_preview[:217] + "..."
+ return { + "id": record.metadata.get("id", ""), + "title": record.metadata.get("title", ""), + "type": record.metadata.get("type", ""), + "status": record.metadata.get("status", ""), + "task_type": record.metadata.get("task_type", ""), + "project": record.metadata.get("project", ""), + "tags": record.metadata.get("tags", []), + "error_signature": record.metadata.get("error_signature", ""), + "score": score, + "matched_terms": matched_terms, + "path": str(record.path.relative_to(memory_root)), + "preview": body_preview, + "source": "project-local-memory", + } diff --git a/.agents/skills/memory-manager/scripts/retrieve_local_memory.py b/.agents/skills/memory-manager/scripts/retrieve_local_memory.py new file mode 100644 index 0000000..c36fb44 --- /dev/null +++ b/.agents/skills/memory-manager/scripts/retrieve_local_memory.py @@ -0,0 +1,70 @@ +#!/usr/bin/env python3 +"""Retrieve local experience memory from project-local memory storage.""" + +from __future__ import annotations + +import argparse +import json +from pathlib import Path + +from memory_store import format_record, load_records, matches_filters, normalize_terms, resolve_memory_root, score_record + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(description="Search local project memory") + parser.add_argument("--memory-root", default="", help="Explicit path to memory root") + parser.add_argument("--project-root", default="", help="Project root for inferring .project_local memory") + parser.add_argument("--project-slug", default="", help="Project slug for inferring memory path") + parser.add_argument("--query", default="", help="Free-text query") + parser.add_argument("--type", choices=["episode", "procedure", "insight"], default="") + parser.add_argument("--task-type", default="") + parser.add_argument("--project", default="") + parser.add_argument("--error-signature", default="") + parser.add_argument("--tag", action="append", default=[]) + parser.add_argument("--limit", 
type=int, default=5) + parser.add_argument("--json", action="store_true") + return parser + + +def main() -> int: + args = build_parser().parse_args() + memory_root = resolve_memory_root(args.memory_root, args.project_root, args.project_slug) + if not memory_root.exists(): + payload = {"memory_root": str(memory_root), "results": []} + print(json.dumps(payload, ensure_ascii=True, indent=2)) + return 0 + + filters = { + "type": args.type, + "task_type": args.task_type, + "project": args.project, + "error_signature": args.error_signature, + "tags": args.tag, + } + query_terms = normalize_terms(args.query) + ranked = [] + for record in load_records(memory_root): + if not matches_filters(record, filters): + continue + score, matched_terms = score_record(record, query_terms) + if score <= 0: + continue + ranked.append((score, matched_terms, record)) + + ranked.sort(key=lambda item: str(item[2].metadata.get("updated_at", "")), reverse=True) + ranked.sort(key=lambda item: item[0], reverse=True) + results = [ + format_record(record, score, matched_terms, memory_root) + for score, matched_terms, record in ranked[: max(args.limit, 1)] + ] + payload = { + "memory_root": str(memory_root), + "query": args.query, + "results": results, + } + print(json.dumps(payload, ensure_ascii=True, indent=2)) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.agents/skills/memory-manager/scripts/retrieve_working_state.py b/.agents/skills/memory-manager/scripts/retrieve_working_state.py new file mode 100644 index 0000000..cb71695 --- /dev/null +++ b/.agents/skills/memory-manager/scripts/retrieve_working_state.py @@ -0,0 +1,35 @@ +#!/usr/bin/env python3 +"""Read durable working state for a run.""" + +from __future__ import annotations + +import argparse +import json +from pathlib import Path + +from memory_store import read_structured + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(description="Read working state for a run") 
+ parser.add_argument("--run-root", required=True, help="Path to logs/runs/") + return parser + + +def main() -> int: + args = build_parser().parse_args() + run_root = Path(args.run_root).resolve() + payload = { + "run_root": str(run_root), + "state": read_structured(run_root / "working" / "state.yaml"), + "todo": read_structured(run_root / "working" / "todo.yaml"), + } + actions_root = run_root / "actions" + if actions_root.exists(): + payload["active_actions"] = sorted(path.name for path in actions_root.glob("*.yaml")) + print(json.dumps(payload, ensure_ascii=True, indent=2)) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.agents/skills/memory-manager/scripts/write_working_delta.py b/.agents/skills/memory-manager/scripts/write_working_delta.py new file mode 100644 index 0000000..620d546 --- /dev/null +++ b/.agents/skills/memory-manager/scripts/write_working_delta.py @@ -0,0 +1,85 @@ +#!/usr/bin/env python3 +"""Write a concise working-state delta for a run.""" + +from __future__ import annotations + +import argparse +import json +from pathlib import Path +from typing import Dict, List + +from memory_store import read_structured, utc_now, write_structured + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(description="Write working-state delta") + parser.add_argument("--run-root", required=True, help="Path to logs/runs/") + parser.add_argument("--goal", default="") + parser.add_argument("--stage", default="") + parser.add_argument("--hypothesis", default="") + parser.add_argument("--last-action", default="") + parser.add_argument("--last-observation", default="") + parser.add_argument("--next-step", default="") + parser.add_argument("--blocker", action="append", default=[]) + parser.add_argument("--evidence-ref", action="append", default=[]) + parser.add_argument("--active-action-id", action="append", default=[]) + parser.add_argument("--todo-active", action="append", default=[]) + 
parser.add_argument("--todo-done", action="append", default=[]) + parser.add_argument("--todo-blocked", action="append", default=[]) + return parser + + +def maybe_set(payload: Dict[str, object], key: str, value: str) -> None: + if value: + payload[key] = value + + +def replace_if_present(payload: Dict[str, object], key: str, values: List[str]) -> None: + if values: + payload[key] = values + + +def main() -> int: + args = build_parser().parse_args() + run_root = Path(args.run_root).resolve() + state_path = run_root / "working" / "state.yaml" + todo_path = run_root / "working" / "todo.yaml" + + state = read_structured(state_path) + todo = read_structured(todo_path) + + maybe_set(state, "goal", args.goal) + maybe_set(state, "stage", args.stage) + maybe_set(state, "hypothesis", args.hypothesis) + maybe_set(state, "last_action", args.last_action) + maybe_set(state, "last_observation", args.last_observation) + maybe_set(state, "next_step", args.next_step) + replace_if_present(state, "blockers", args.blocker) + replace_if_present(state, "evidence_refs", args.evidence_ref) + replace_if_present(state, "active_action_ids", args.active_action_id) + state["updated_at"] = utc_now() + + replace_if_present(todo, "todo_active", args.todo_active) + replace_if_present(todo, "todo_done", args.todo_done) + replace_if_present(todo, "todo_blocked", args.todo_blocked) + todo["updated_at"] = utc_now() + + write_structured(state_path, state) + write_structured(todo_path, todo) + + print( + json.dumps( + { + "run_root": str(run_root), + "state": state, + "todo": todo, + }, + ensure_ascii=True, + indent=2, + ) + ) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.agents/skills/research-workflow/SKILL.md b/.agents/skills/research-workflow/SKILL.md index e9b9e80..ca340e6 100644 --- a/.agents/skills/research-workflow/SKILL.md +++ b/.agents/skills/research-workflow/SKILL.md @@ -10,7 +10,7 @@ description: |- ## Mission -Drive AI R&D tasks with small, testable, 
evidence-first steps while respecting the selected interaction mode. +Drive AI R&D tasks with small, testable, evidence-first steps while supporting durable long-running execution, aggressive experience retrieval, and DR-first external search. ## Orchestration Order @@ -18,13 +18,13 @@ Drive AI R&D tasks with small, testable, evidence-first steps while respecting t For non-trivial tasks, run this order: -1. **`Skill(skill: "run-governor")`** — Initialize run policy (mode + run_id). MUST call before any execution. +1. **`Skill(skill: "run-governor")`** — Initialize run policy (mode + run_id) and reconcile active long actions before any execution. 2. **`Skill(skill: "project-context")`** — Resolve runtime context before experiment/report/eval execution. MUST call when env setup is needed. 3. Understand user objective and current code/evidence state. 4. **`Skill(skill: "human-checkpoint")`** — Clarify ambiguous requirements. MUST call when intake is incomplete. 5. Complete intake checkpoint before planning or decomposition. -6. **`Skill(skill: "memory-manager")`** — Run one bootstrap (`retrieve/init-working`). MUST call before planning. -7. **`Skill(skill: "deep-research")`** — Run deep research. MUST call when user message contains any research-intent keyword (调研/研究/对比/综述/文献/证据/机制/根因/为什么/可行性/路线图/分析/探索 or English equivalents). Do NOT answer research questions yourself — invoke the skill. +6. **`Skill(skill: "memory-manager")`** — Run bootstrap retrieval and initialize durable working state. MUST call before planning. +7. **`Skill(skill: "deep-research")`** — Use as the default external search gateway whenever outside evidence is needed, and always when research-intent keywords appear. 8. **`Skill(skill: "research-plan")`** — Build execution plan. MUST call for planning-heavy requests or after deep-research scoping. 9. Confirm plan as required by mode. 10. **`Skill(skill: "experiment-execution")`** — Execute experiment. MUST call for any actual run/launch/monitor. 
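The `write_working_delta.py` script earlier in this diff merges CLI flags into existing state non-destructively: scalar fields are overwritten only when a non-empty value is supplied, and list fields are replaced wholesale only when at least one flag occurrence was given. A minimal self-contained sketch of those merge semantics, using the same two helpers (the sample field values are hypothetical):

```python
# Sketch of the delta-merge semantics in write_working_delta.py:
# scalars update only when non-empty; lists replace only when provided.
from typing import Dict, List


def maybe_set(payload: Dict[str, object], key: str, value: str) -> None:
    # Empty string means "flag not passed": keep the existing value.
    if value:
        payload[key] = value


def replace_if_present(payload: Dict[str, object], key: str, values: List[str]) -> None:
    # Empty list means "no --flag occurrences": keep the existing list.
    if values:
        payload[key] = values


# Hypothetical prior contents of working/state.yaml:
state: Dict[str, object] = {"goal": "raise eval accuracy", "blockers": ["flaky runner"]}

maybe_set(state, "goal", "")                # no-op: goal unchanged
maybe_set(state, "stage", "execution")      # sets a new field
replace_if_present(state, "blockers", [])   # no-op: blockers kept

print(state)
```

This is why repeated invocations can each update one or two fields without clobbering the rest of `working/state.yaml`.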
@@ -63,72 +63,237 @@ Route required user interactions through `human-checkpoint`: On each new user message: 1. Re-evaluate objective and skill routing before executing the next pending action. -2. If user intent shifts to research/scoping/comparison/root-cause inquiry, call `Skill(skill: "deep-research")` immediately — do NOT answer the research question yourself. -3. Do not continue stale execution plans when the objective changed materially. -4. If `deep-research` is skipped, emit `dr_skip_reason` with freshness evidence (date/timestamp and source coverage), then continue. -5. Cooldown: +2. Retrieve relevant memory before new planning or execution continues. +3. If user intent shifts to research/scoping/comparison/root-cause inquiry, call `Skill(skill: "deep-research")` immediately — do NOT answer the research question yourself. +4. Do not continue stale execution plans when the objective changed materially. +5. If `deep-research` is skipped, emit `dr_skip_reason` with freshness evidence (date/timestamp and source coverage), then continue. +6. Cooldown: - no more than one non-forced deep-research call per stage. - bypass cooldown when objective changed, contradiction appears, or high-impact uncertainty remains unresolved. -## Default Execution Loop +## Goal Compilation Gate (Mandatory) + +Before planning the first execution batch, translate the user request into machine-checkable success gates. Do not leave stopping conditions as prose only. + +Compile at least: + +1. `objective_summary`: one-sentence task goal +2. `primary_target`: final target metric or observable end state +3. `promotion_gates`: staged thresholds that must be passed before larger or costlier evaluation +4. `non_regression_guards`: what must not degrade materially +5. `backup_policy`: when to preserve best-so-far variants +6. `stop_allowed_only_if`: exact conditions that permit `done` +7. 
`hard_blockers`: the small set of conditions that justify asking the user + +Examples of required compilation: + +1. "keep iterating" or "do not stop" -> `completion_policy=until-target-or-hard-blocker` +2. "target 100% accuracy" -> stretch target plus current promotion gate +3. "use about 30 examples first, then try 100" -> ordered evaluation ladder +4. "do not overfit a few cases" -> non-regression guard on held-out or previously-correct samples +5. "backup strong versions" -> preserve best-so-far artifact before riskier changes + +If any of these are measurable but omitted from the compiled goal state, treat that as a workflow defect and repair it before continuing. + +## Durable Execution Loop Repeat this loop until completion: -1. Update success criteria. -2. Collect or refresh evidence. -3. Plan the smallest useful next action. -4. Refresh working todo state only when memory trigger conditions are met. -5. Act. -6. Observe outputs. -7. Evaluate result quality and risk. -8. Decide: iterate, replan, checkpoint, or done. +1. Compile or refresh success criteria, promotion gates, and done guards. +2. Retrieve prior experience: + - `procedure` retrieval is mandatory before each execution batch + - add `episode` retrieval when a problem, repeated attempt, or new error signature appears + - add `insight` retrieval for planning, tradeoffs, or final answer shaping +3. Route external search through `deep-research` when fresh outside evidence is needed. +4. Plan the smallest useful next batch. +5. If the batch contains a `long_action`, persist an action record before waiting. +6. If the batch contains a `long_action`, switch to `watch mode`. +7. Act or poll an active action. +8. Observe outputs or liveness state. +9. If a poll occurred, immediately branch into post-poll handling rather than stopping at a status update. +10. Write working delta. +11. Evaluate result quality, target progress, and non-regression status against the compiled gates. +12. 
Decide: iterate, replan, checkpoint, or done. + +## Short Iterative Optimization Policy + +Treat local edit-evaluate loops with the same ownership standard as long-running jobs. They are not "one batch and done" tasks. + +For iterative optimization tasks: + +1. establish a baseline score before the first risky change when practical +2. define the primary regression set and any promotion gate set before broadening evaluation +3. after each batch: + - compare against baseline and best-so-far + - check non-regression guards + - update which promotion gate is next +4. if a new best result appears and the user requested preservation or the route is high-risk, snapshot the best-so-far variant before proceeding +5. if targets are still unmet and a safe next step exists, default to `iterate`, not `done` +6. if repeated attempts plateau or regress, default to `replan` +7. ask the user only for a true blocker, safety/resource gate, or objective ambiguity that cannot be resolved locally + +## Long-Running Action Policy + +Treat an action as long-running when it is expected to exceed 5 minutes, launches a remote or async job, or likely outlives the current model turn. + +Before launching a long-running action: + +1. persist an action record through `run-governor` +2. record: + - command + - cwd + - log path + - success/failure signals + - expected duration + - poll interval + - resume step +3. update working state with the active `action_id` +4. choose an initial poll interval, but keep the schedule under model control +5. if the action is likely to outlive the current turn, start an external watcher loop before leaving the turn + +While the current session remains active: + +1. use poll loops rather than blocking waits +2. the loop should be `model chooses sleep -> run watcher/poll -> inspect -> model chooses next sleep -> continue` +3. if no new output appears, classify as `running` or `stalled`, not `done` +4. 
the model may shorten the interval when there is fresh progress, warning signs, or expected completion is near +5. the model may lengthen the interval when the process is healthy but idle +6. watcher scripts should report status, not own the polling strategy +7. after each poll, update working state and next poll time +8. after each poll, immediately choose one of: + - continue-watch + - collect-results + - diagnose-stall + - diagnose-failure + - replan +9. do not hand the task back to the user merely because the job is long-running +10. replan only after explicit lack-of-progress evidence, not because output paused briefly + +## Post-Poll Handling Policy + +After every long-action poll, the agent must keep ownership of the task and continue from the observed state. + +Required branches: + +1. `running` + - if the process is healthy, choose the next sleep interval and continue watch mode + - if progress appeared, update working state and keep monitoring +2. `completed` + - inspect outputs, logs, checkpoints, or metrics immediately + - run downstream validation, analysis, or report generation instead of waiting for the user to come back +3. `stalled` + - inspect logs and latest outputs + - retrieve `procedure` and `episode` memory before the next recovery attempt + - attempt the smallest safe recovery or replan +4. `failed` + - inspect failure evidence immediately + - retrieve relevant memory + - attempt the smallest safe recovery, or replan if needed +5. `checkpoint` + - ask the user only if a true hard blocker, major safety risk, or explicit decision point remains + +Default rule: + +1. Never respond with the equivalent of "the job is running, come back later" unless the user explicitly wants fire-and-forget behavior. +2. 
Long-running work stays inside the watch loop until one of these is true: + - success criteria are met + - a hard blocker requires the user + - a safety/resource gate requires explicit approval + - an external automation takeover was explicitly established + +## Watch-Loop Prompt Template + +When running in watch mode, use this template after every poll so followup handling stays consistent instead of ad hoc. + +Template: + +1. poll the action +2. read: + - `status` + - `status_changed` + - `progress_changed` + - `followup_action` + - `last_log_tail` +3. execute the branch for `followup_action` +4. update working delta +5. either choose the next sleep interval or continue into diagnosis/result handling immediately + +Branch template: + +1. `followup_action=continue-watch` + - summarize what changed + - decide the next sleep interval + - remain in watch mode +2. `followup_action=wait-and-poll` + - confirm the process is still healthy + - choose the next sleep interval + - remain in watch mode +3. `followup_action=collect-results` + - inspect outputs, checkpoints, logs, metrics, or artifacts immediately + - validate against success criteria + - continue to downstream analysis/reporting rather than stopping at "completed" +4. `followup_action=diagnose-stall` + - inspect logs and latest outputs + - retrieve `procedure` plus relevant `episode` memory + - attempt the smallest safe recovery or replan +5. `followup_action=diagnose-failure` + - inspect failure evidence + - retrieve relevant memory + - run the smallest safe recovery path, or replan if needed +6. `followup_action=replan` + - update hypothesis and next step + - restart the loop with a new execution batch + +Compact prompt form: + +```text +Watch loop: +1. Poll the active action. +2. Read status, status_changed, progress_changed, followup_action, and last_log_tail. +3. If followup_action=continue-watch or wait-and-poll, choose the next sleep interval and keep ownership. +4. 
If followup_action=collect-results, inspect outputs immediately and continue analysis/validation. +5. If followup_action=diagnose-stall or diagnose-failure, inspect evidence, retrieve memory, and attempt the smallest safe recovery. +6. If followup_action=replan, update the route and continue execution. +7. Do not stop at a mere status update, and do not hand the task back to the user unless a true blocker remains. +``` ## Search, Memory, and Deep-Research Policy Use these in combination: -1. `memory-manager` bootstrap is mandatory before planning/execution for non-trivial runs — call `Skill(skill: "memory-manager")`. -2. Between bootstrap and close-out, memory operations are trigger-based and non-aggressive. -3. Trigger memory operation when one of the following occurs: - - stage transition - - replan - - significant error or new error signature - - the current task modifies `memory-manager` or another Memory-related skill/policy - - state/context files show compaction markers such as `Compact`, `压缩`, `Summary`, or equivalent summary/compression techniques - - memory auto-compression/summarization completed - - before high-resource action - - before final answer/report handoff -4. In memory-skill-edit or compaction cases, call `memory-manager` to read prior Memory before planning, editing, or resuming execution. -5. Periodic `working` memory refresh is required when either holds: - - at least 15 minutes since last memory operation - - at least 3 execution cycles since last memory operation -6. Command-gap fallback: if 5 consecutive commands/actions finish without a memory update, force one concise `working` refresh. -7. Cooldown: no more than one non-forced memory operation per cycle. -8. Avoid per-command memory writes; batch observations into one delta update. -9. Use search/deep research directly when topic is time-sensitive, new, or currently blocked. -10. 
If project-local memory retrieval is low-yield, shared-memory retrieval may query the configured local shared repo as a read-only source. -11. Do not sync the shared repo on every cycle; prefer the current local checkout and sync only on explicit gap handling or before export. -12. For open-ended research/scoping requests, call `Skill(skill: "deep-research")` before giving decomposition or roadmap recommendations — do NOT synthesize research yourself. -12.1 For mid-run new research requests, call `Skill(skill: "deep-research")` re-entry before further execution. -13. For unknown errors, use this branch: +1. `memory-manager` bootstrap is mandatory before planning/execution for non-trivial runs. +2. Every new user turn must retrieve relevant memory before planning continues. +3. Every execution batch must retrieve `procedure` memory before acting. +4. If a problem appears, a new error signature is observed, or repeated attempts are failing, retrieve `episode` memory before the next fix attempt. +5. Before high-resource or irreversible actions, retrieve `procedure` plus relevant `episode` memory. +6. During planning, tradeoff analysis, contradiction resolution, or final answer shaping, retrieve `insight` memory. +7. `working` memory reads are mandatory for resume/reconcile, compaction recovery, long-action resume, and final handoff; they are not the only retrieval path. +8. After every execution batch and every long-action poll cycle, write a concise `working` delta. +9. After every stalled or failed poll result, retrieve `procedure` plus relevant `episode` memory before the next fix attempt. +10. After every completed poll result, inspect outputs and retrieve `insight` memory if interpretation, report writing, or tradeoff analysis is needed. +11. All external search must route through `deep-research`; do not bypass it with ad hoc search. +12. `deep-research` may choose light/default/deep execution depth internally, but it may not silently skip actual search. +13. 
If project-local memory retrieval is low-yield, shared-memory retrieval may query the configured local shared repo as a read-only source. +14. If memory or DR is skipped, record `memory_skip_reason` or `dr_skip_reason` with concrete evidence. +15. For unknown errors, use this branch: - local evidence triage (logs, stack trace, recent changes) + - local memory retrieval (`procedure`, then `episode`) - shared-memory retrieval when reusable SOPs or prior debug cases are likely relevant - - targeted search - - deep research (`Skill(skill: "deep-research")` with debug-investigation type) if still unresolved + - `deep-research` search if the issue is still unresolved or freshness-sensitive - minimal fix validation -14. If compaction is detected, treat missing memory retrieval as a workflow violation and recover by reading prior Memory before continuing. -15. If skipping memory due to cooldown or low-value delta outside the memory-skill-edit or compaction cases, record reason in the stage report. -16. If intake information is missing, call `Skill(skill: "human-checkpoint")` before deep research or planning. +16. If compaction is detected, treat missing working reread as a workflow violation and recover before continuing. 17. If deep research was used for open-ended scoping, call `Skill(skill: "research-plan")` to convert findings into an execution-ready plan. Skip only if the user explicitly opts out. ## Replanning Policy Trigger replan when: -1. Major assumption fails. -2. Repeated attempts show no improvement. -3. New evidence changes route significantly. -4. Resource/risk profile changes. +1. major assumption fails +2. repeated attempts show no improvement +3. new evidence changes route significantly +4. resource/risk profile changes +5. a long-running action remains `stalled` across multiple polls +6. post-poll handling repeatedly produces no safe next step Mode controls whether replan confirmation is required. 
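The replanning triggers above work together with the watch-loop policy earlier in this section: the model owns the sleep schedule, shortens it on fresh progress or trouble, lengthens it when the job is healthy but idle, and branches on `followup_action` after every poll. A minimal self-contained sketch of such a model-owned poll loop; the `PollResult` shape, the interval bounds, and the halving/doubling heuristic are illustrative assumptions, not a real scheduler API:

```python
# Sketch of a model-owned watch loop with adaptive sleep intervals.
import time
from dataclasses import dataclass


@dataclass
class PollResult:
    status: str            # pending|running|stalled|failed|completed|cancelled
    progress_changed: bool
    followup_action: str   # continue-watch|wait-and-poll|collect-results|...


def next_interval(current: float, poll: PollResult,
                  lo: float = 15.0, hi: float = 600.0) -> float:
    """Shorten on fresh progress or warning signs; lengthen when healthy but idle."""
    if poll.progress_changed or poll.followup_action in ("diagnose-stall", "diagnose-failure"):
        return max(lo, current / 2)
    return min(hi, current * 2)


def watch(poll_fn, interval: float = 60.0, sleep_fn=time.sleep) -> PollResult:
    """Poll until a terminal branch; the loop, not the watcher script, owns the schedule."""
    while True:
        poll = poll_fn()
        if poll.followup_action == "collect-results":
            return poll   # inspect outputs immediately, then continue analysis
        if poll.followup_action in ("diagnose-stall", "diagnose-failure", "replan"):
            return poll   # hand off to in-agent recovery or replanning
        interval = next_interval(interval, poll)
        sleep_fn(interval)  # model-chosen sleep, then poll again
```

The key design choice is that the watcher only reports `PollResult`; the loop stays in control of timing and never exits with a bare status update, matching the "do not stop at a mere status update" rule.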
@@ -140,7 +305,8 @@ At each stage completion or major todo completion: 2. Update `reports/index.md` with status and timestamp. 3. In chat, provide a detailed summary plus report path. 4. Do not block execution only because a stage report was emitted. -5. If the stage delivers data-analysis results, include visualization outputs and saved figure paths (default: `/logs/runs//reports/figures/`). +5. If the stage depends on background work, include active action ids, latest liveness state, and next poll plan. +6. If the stage delivers data-analysis results, include visualization outputs and saved figure paths (default: `/logs/runs//reports/figures/`). ## Mandatory Visualization Policy @@ -148,11 +314,11 @@ At each stage completion or major todo completion: 1. When writing any report (stage report, final report, evaluation summary) that contains numerical results, tables, or comparisons, you MUST generate matplotlib/code-based figures before finalizing the report. 2. At minimum, generate: - - A **ranking/comparison chart** (bar chart) when multiple strategies, methods, or configurations are compared. - - A **breakdown chart** (grouped bar or heatmap) when per-category/per-subject/per-level data is available. - - A **trend/line chart** when results vary across an ordered dimension (difficulty level, training step, etc.). + - a ranking/comparison chart when multiple strategies, methods, or configurations are compared + - a breakdown chart when per-category/per-subject/per-level data is available + - a trend chart when results vary across an ordered dimension 3. Save all figures to `/logs/runs//reports/figures/` and embed them in the report markdown with relative paths. -4. Prefer code-generated assets (matplotlib, seaborn) that can be regenerated. Save the generation script alongside the figures. +4. Prefer code-generated assets that can be regenerated. Save the generation script alongside the figures. 5. 
If the report scope is large or the visualizations need polished formatting, invoke `Skill(skill: "paper-writing")` to handle the report writing and figure integration. 6. If you are unsure whether a report needs visualization, it does. Over-visualizing is acceptable; under-visualizing is a violation. @@ -169,34 +335,45 @@ Do not export shared memory during core task execution. Use this order: -1. `done`: success criteria met with evidence. +1. `done`: all compiled hard gates are satisfied with evidence. 2. `checkpoint`: decision requires mode-based confirmation or safety gating. -3. `iterate`: validated small next step exists. -4. `replan`: current route is weak or stale. +3. `iterate`: target is still unmet and a validated next step exists. +4. `replan`: current route is weak, stale, or repeatedly failing to improve. 5. `blocked`: hard blocker requires user input. +Done guard: + +1. If `completion_policy=until-target-or-hard-blocker`, `done` is forbidden while the active promotion gate or hard target remains unmet. +2. A single successful batch, a clean run, or a partial fix is not sufficient for `done`. +3. If the final stretch target is not yet met but the current promotion gate is met, advance to the next gate instead of stopping. + ## Evidence Standard Treat conclusions as valid only when backed by one or more: -1. Reproducible command output. -2. Measurable metric movement. -3. File diff tied to behavior change. -4. Corroborated external source. +1. reproducible command output +2. measurable metric movement +3. file diff tied to behavior change +4. corroborated external source +5. durable liveness evidence for long-running actions ## Required Cycle Output At end of each cycle, emit: -1. `Run`: run_id, mode, current stage. -2. `State`: what is true now. -3. `Evidence`: key observations. -4. `Todo`: active/done/blocked highlights. -5. `Next Step`: smallest safe action. -6. `Replan Need`: yes or no, with reason. -7. `Checkpoint Need`: yes or no, with reason. 
-8. `Report Path`: stage report path or pending path. -9. `Interaction Channel`: `request_user_input|plain-text-fallback|none`. +1. `Run`: run_id, mode, current stage +2. `State`: what is true now +3. `Evidence`: key observations +4. `Todo`: active/done/blocked highlights +5. `Next Step`: smallest safe action +6. `Replan Need`: yes or no, with reason +7. `Checkpoint Need`: yes or no, with reason +8. `Report Path`: stage report path or pending path +9. `Interaction Channel`: `request_user_input|plain-text-fallback|none` +10. `Goal Status`: primary target, active promotion gate, non-regression status, and whether `done` is currently allowed +11. `Memory`: retrieved layers, key hits, and `memory_skip_reason` when relevant +12. `Deep Research`: `dr_used=YES|NO`, depth, and `dr_skip_reason` or downgrade reason +13. `Liveness`: active action ids, current status, next poll time, and current poll interval when background work exists ## Violation Recovery Policy @@ -205,3 +382,10 @@ If user interaction was handled outside required routing in non-`full-auto` mode 1. Acknowledge non-compliance. 2. Re-run the missed checkpoint using `human-checkpoint` and channel policy. 3. Re-evaluate downstream conclusions that depended on the missed checkpoint. + +If a cycle ended while a long-running action had no watch or resume plan: + +1. Acknowledge the missing liveness protection. +2. Persist or reconstruct the action record. +3. Establish a poll or watcher plan before resuming normal work. +4. Record the recovery step in the next stage report. diff --git a/.agents/skills/research-workflow/agents/openai.yaml b/.agents/skills/research-workflow/agents/openai.yaml index 43d86c5..0f75fc2 100644 --- a/.agents/skills/research-workflow/agents/openai.yaml +++ b/.agents/skills/research-workflow/agents/openai.yaml @@ -1,4 +1,4 @@ interface: display_name: "Research Workflow" - short_description: "Mode-aware evidence loop for AI R&D execution." 
- default_prompt: "Use research workflow to understand requirements, plan, execute with working todo tracking, replan on major issues, and emit stage reports." + short_description: "Durable evidence loop for AI R&D execution with long-action polling and experience retrieval." + default_prompt: "Use research workflow to understand requirements, compile the user request into machine-checkable success criteria, promotion gates, non-regression guards, and done guards, then retrieve prior experience aggressively, route external search through deep research, and keep ownership until those compiled gates are satisfied or a true blocker remains. Treat short edit-evaluate loops with the same persistence as long-running actions. In watch mode always branch on followup_action: continue-watch or wait-and-poll means choose the next sleep and keep ownership; collect-results means inspect outputs immediately and validate against the compiled gates; diagnose-stall or diagnose-failure means inspect evidence, retrieve memory, and attempt the smallest safe recovery; replan means update the route and continue execution. Write working deltas, replan on major issues, and do not mark done while the active gate is unmet." diff --git a/.agents/skills/run-governor/SKILL.md b/.agents/skills/run-governor/SKILL.md index 353cc0f..c824839 100644 --- a/.agents/skills/run-governor/SKILL.md +++ b/.agents/skills/run-governor/SKILL.md @@ -1,7 +1,7 @@ --- name: run-governor description: |- - Govern run-level execution policy: mode selection, run_id, directory layout, stage reporting, safety allowances. + Govern run-level execution policy: mode selection, durable run tracking, long-action watch/resume policy, stage reporting, and safety allowances. TRIGGER when: starting a non-trivial research task (set mode + run_id), switching local/remote target, creating a new run, or mode-aware policy decisions needed. 
DO NOT TRIGGER when: trivial single-step tasks, pure info queries, or run already initialized with no policy change needed. --- @@ -10,7 +10,7 @@ description: |- ## Mission -Set and enforce run-level execution policy so research runs stay consistent, auditable, and mode-aware. +Set and enforce run-level execution policy so research runs stay consistent, auditable, durable across long waits, and mode-aware. ## Mode Selection Policy @@ -19,6 +19,7 @@ At the start of a research run, ask the user to choose one mode: 1. `full-auto` - Minimize user interruptions. - Ask only for hard blockers or major safety risks. + - Keep ownership until compiled success criteria are met, or a true blocker/safety gate requires interruption. 2. `moderate` (recommended) - Confirm during plan finalization. - Confirm before high-resource actions. @@ -33,6 +34,41 @@ Additional rules: 3. If mode selection is pending, keep the run in `pending-confirmation` and do not initialize run artifacts. 4. Never auto-default mode from timeout. A mode must be explicitly confirmed by the user before initialization. 5. After mode is selected, do not auto-continue after confirmation timeouts in non-`full-auto` modes. +6. While confirmation is pending, read-only intake, code inspection, and evidence gathering are allowed; creating run artifacts or launching jobs is not. + +## Persistent Completion Policy + +When the user expresses persistence or target-seeking intent, compile it into run policy rather than treating it as stylistic wording. + +Trigger phrases include examples such as: + +1. "keep iterating" +2. "do not stop" +3. "until target" +4. "try many iterations" +5. "optimize until it works" +6. "reach 90%/100%" or similar target metrics + +Required policy fields: + +1. `execution_intent`: `single-pass|persistent-optimization` +2. `completion_policy`: `normal|until-target-or-hard-blocker` +3. `done_guard`: `normal|forbid_done_if_target_unmet` +4. 
`promotion_gates`: ordered thresholds such as "pass 30-case regression before 100-case sweep" +5. `non_regression_guards`: conditions that must not degrade materially +6. `backup_policy`: when to snapshot best-so-far prompts/configs/code/results + +Rules: + +1. `full-auto` controls interruption frequency; it does not weaken persistence requirements. +2. If user intent is `persistent-optimization`, one edit/test cycle is never sufficient reason to stop. +3. In `full-auto` plus `persistent-optimization`, remain in the execution loop until one of these is true: + - compiled hard targets are met + - a true hard blocker remains after reasonable recovery attempts + - a major safety/resource gate requires approval + - the user explicitly stops or changes the objective +4. If the target is aspirational but measurable, store both the stretch target and the current promotion gate. +5. If the user asked to preserve strong variants, snapshot best-so-far artifacts before higher-risk changes. ## Interaction Transport Policy @@ -75,7 +111,7 @@ Before creating any run files or directories, collect and confirm both fields fr Hard constraints: 1. If either confirmation is missing, mark status `blocked-awaiting-user-confirmation`. -2. While blocked, do not create `run_id`, run directories, manifests, policy files, working files, reports, or runtime snapshots. +2. While blocked, do not create `run_id`, run directories, manifests, policy files, working files, reports, runtime snapshots, or background watchers. 3. `moderate` is only a recommendation label and cannot be applied unless user-confirmed. 4. For `moderate` or `detailed`, ask via built-in question tool first; if unavailable, use plain-text fallback. 5. If user asks to proceed without specifying values, ask a direct clarification question and remain blocked. @@ -86,11 +122,95 @@ Hard constraints: Before transitioning from initialization to execution workflow: -1. 
Set `memory_policy=balanced-triggered` unless user explicitly overrides. +1. Set `memory_policy=experience-first-continuous` unless user explicitly overrides. 2. Ensure one `memory-manager` bootstrap operation is complete: - `retrieve` or `init-working` for current project/task context. 3. If bootstrap is missing, mark status `blocked-awaiting-memory-bootstrap`. -4. This gate enforces only the bootstrap, not per-step memory writes. +4. This gate enforces bootstrap plus retrieval policy declaration; ongoing retrieval cadence is enforced by `research-workflow` and `memory-manager`. + +## Durable Execution Policy + +Treat long waits as first-class runtime state, not as something the model should silently "keep waiting" for. + +Classify an action as `long_action` when any is true: + +1. expected duration is over 5 minutes +2. action launches training, evaluation, inference batches, indexing, or remote jobs +3. action uses `sleep`, async polling, or scheduler submission +4. action is high-resource (`L2` or `L3`) +5. action likely outlives the current model turn + +For every `long_action`, do all of the following before waiting: + +1. create an action record under `actions/.yaml` +2. persist: + - `action_id` + - `status` + - `kind` + - `command` + - `cwd` + - `expected_duration_seconds` + - `poll_interval_seconds` + - `launch_time` + - `last_heartbeat` + - `next_poll_at` + - `success_signal` + - `failure_signal` + - `resume_step` +3. if the command is launched locally, capture `pid` and log path +4. immediately enter `watch mode` + - if the current session remains active, use a poll loop (`model chooses sleep -> watch/poll -> inspect state -> model chooses next sleep`) + - if the action may outlive the current turn, start an external watch loop or automation before ending the turn +5. never mark a `long_action` done without an explicit poll/reconcile step +6. 
choose an initial poll interval, but keep control with the model + - scripts may store the chosen interval and suggested next poll time + - scripts must not become the primary owner of polling strategy + - the model may shorten or lengthen the next sleep after each poll +7. keep ownership after launch + - launching a job is not sufficient completion + - after every poll, the agent must continue with monitoring, diagnosis, recovery, result collection, or replan + - do not hand the task back to the user only because the job is long-running + +Allowed liveness states: + +1. `pending` +2. `running` +3. `stalled` +4. `failed` +5. `completed` +6. `cancelled` + +## Active Run Registry + +Maintain these durable records in `/logs/runs//`: + +1. `actions/index.json` + - list of active and historical `action_id`s + - last sweep time +2. `actions/.yaml` + - one persisted record per long action +3. `actions/watch.log` + - optional watcher loop output + +Registry rules: + +1. scan the registry before each new planning cycle +2. if `next_poll_at <= now`, reconcile action state before planning unrelated work +3. do not assume a long-running command completed only because chat output paused +4. if a process ended without a success or failure signal, mark `stalled` and inspect before continuing +5. treat `next_poll_at` as the model-provided wake-up hint for watchers; do not busy-wait +6. do not treat a `running` record as a reason to stop work; it is a reason to stay in watch mode + +## Resume Gate + +At the start of every resumed run or fresh model turn: + +1. read active run metadata and the latest `working/state.yaml` +2. reconcile all `pending|running|stalled` actions +3. update each action with fresh liveness evidence +4. restore the next safe step from `resume_step` or `working.next_step` +5. immediately branch into watch continuation, result collection, diagnosis, or replan based on reconciled action state +6. 
only continue unrelated planning after active-action handling is complete or explicitly delegated ## Run Identity and Directories @@ -119,9 +239,13 @@ Maintain these files in `/logs/runs/<run_id>/`: 1. `run_policy.yaml` - mode + - execution intent and completion policy + - compiled target gates and non-regression guards + - done guard and backup policy - high-resource heuristic bands - safety policy notes - per-run action allowances + - watch policy 2. `run_manifest.yaml` - local project root - runtime project root @@ -132,10 +256,14 @@ Maintain these files in `/logs/runs/<run_id>/`: - optional additional project roots - output directory mapping 3. `working/state.yaml` - - objective, current phase, hypothesis, blockers, next step + - objective, current phase, hypothesis, blockers, next step, active action ids 4. `working/todo.yaml` - `todo_active`, `todo_done`, `todo_blocked` -5. `reports/index.md` +5. `actions/index.json` + - active action registry +6. `actions/<action_id>.yaml` + - durable long-action state +7. `reports/index.md` - stage list with status (`done|running|blocked`), file paths, and last update time ## Multi-Project and New-Topic Policy @@ -166,8 +294,14 @@ Use approximate decision bands: 2. `L2` medium: roughly 2-10 GPU-hours or 20-100 USD equivalent. 3. `L3` high: over 10 GPU-hours, over 100 USD equivalent, or long multi-node runs. -In `moderate` and `detailed`, confirm before L2/L3 actions. -In `full-auto`, proceed unless major safety risk is present. +Before any `L2` or `L3` action: + +1. run targeted memory retrieval for relevant procedures and episodes +2. create a long-action record if the job will not finish immediately +3. persist a watch plan and a model-selected poll interval +4. in `moderate` and `detailed`, confirm before launch +5. in `full-auto`, proceed unless major safety risk is present +6. 
after launch, remain responsible for monitoring and post-poll handling until a true blocker or completion condition exists ## Stage Reporting Contract @@ -177,21 +311,27 @@ At each stage completion: 2. Update `reports/index.md` status and timestamp. 3. In chat, provide a detailed stage summary plus report path. 4. Do not require user reply just because a stage report was emitted. +5. Include active long-action status when the stage depends on background work. ## Required Output for Run-Governor Operations For each run-governor action, emit: 1. `Run`: run_id and active mode -2. `Action`: initialize, switch-mode, update-policy, new-topic-check, or stage-report +2. `Action`: initialize, switch-mode, update-policy, new-topic-check, resume, watch-update, or stage-report 3. `Decision`: what was chosen and why 4. `Execution`: local/remote choice and reuse decision 5. `Paths`: affected control/output/context paths -6. `Next`: next actionable step -7. `Confirmation`: `user_confirmed_mode`, `user_confirmed_execution_target`, and whether initialization is permitted (`YES|NO`) -8. `Compliance`: `gate_status=pass|blocked`, with blocked reason when applicable -9. `Interaction`: `interaction_transport` and optional `fallback_reason` -10. `Memory`: `memory_policy` and `memory_bootstrap_done=YES|NO` +6. `Completion`: execution intent, completion policy, done guard, and active promotion gate +7. `Next`: next actionable step +8. `Confirmation`: `user_confirmed_mode`, `user_confirmed_execution_target`, and whether initialization is permitted (`YES|NO`) +9. `Compliance`: `gate_status=pass|blocked`, with blocked reason when applicable +10. `Interaction`: `interaction_transport` and optional `fallback_reason` +11. `Memory`: `memory_policy` and `memory_bootstrap_done=YES|NO` +12. `Liveness`: active action count, newly launched action ids, and whether reconciliation is clean (`YES|NO`) +13. `Watch`: `watch_mode=session-loop|external-loop|none` and next poll time when applicable +14. 
`Polling`: current interval, next poll time, and whether the model revised the interval this cycle +15. `Followup`: `continue-watch|collect-results|diagnose-stall|diagnose-failure|replan|checkpoint` ## Violation Recovery Policy @@ -201,3 +341,10 @@ If initialization occurred before required confirmation: 2. Ask whether to keep or clean the created artifacts. 3. Do not continue execution until user re-confirms `mode` and `execution_target`. 4. Record the incident and recovery choice in the next stage report. + +If a long action was launched without a persisted action record: + +1. Stop normal planning. +2. Reconstruct and persist the missing action record immediately. +3. Reconcile log, pid, and latest output before continuing. +4. Record the recovery step in the next stage report. diff --git a/.agents/skills/run-governor/agents/openai.yaml b/.agents/skills/run-governor/agents/openai.yaml index 89cb221..d9d06de 100644 --- a/.agents/skills/run-governor/agents/openai.yaml +++ b/.agents/skills/run-governor/agents/openai.yaml @@ -1,4 +1,4 @@ interface: display_name: "Run Governor" - short_description: "Control run mode, paths, stage reports, and per-run safety allowances." - default_prompt: "Use run governor to initialize run_id, choose interaction mode, enforce stage reporting, and manage per-run safety policy for research execution." + short_description: "Control run mode, durable action tracking, stage reports, and per-run safety allowances." + default_prompt: "Use run governor to initialize run policy, choose interaction mode, reconcile active long-running actions, and compile persistent user intent into execution_intent, completion_policy, promotion gates, non-regression guards, backup policy, and done guards. Full-auto reduces interruptions; it does not authorize premature completion. Enforce watch/resume policy, emit stage reports, and manage per-run safety policy for research execution." 
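The `Followup` and `Polling` fields above reduce to a small decision rule applied after each poll. A minimal self-contained sketch of that rule follows; the function names `classify_followup` and `next_poll_delay` and the backoff bounds are illustrative assumptions, not part of the shipped run-governor scripts:

```python
# Illustrative sketch of the post-poll decision rule described above.
# The authoritative logic lives in run-governor's state_io module; this
# only shows how status plus progress evidence selects the next move.

def classify_followup(status: str, progress_changed: bool) -> str:
    """Map a polled liveness status to the next agent move."""
    if status == "completed":
        return "collect-results"
    if status == "failed":
        return "diagnose-failure"
    if status == "stalled":
        return "diagnose-stall"
    # Still pending/running: stay in watch mode either way.
    return "continue-watch" if progress_changed else "wait-and-poll"


def next_poll_delay(interval_seconds: int, progress_changed: bool) -> int:
    """Model-owned backoff: poll sooner when logs move, back off when idle."""
    if progress_changed:
        return max(1, interval_seconds // 2)
    return min(interval_seconds * 2, 3600)
```

Under these assumptions, a watch loop would call this pair after every poll and feed the chosen delay back into `next_poll_at`, keeping the model rather than any script in charge of polling cadence.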
diff --git a/.agents/skills/run-governor/references/run-layout.md b/.agents/skills/run-governor/references/run-layout.md index e67e311..11a2829 100644 --- a/.agents/skills/run-governor/references/run-layout.md +++ b/.agents/skills/run-governor/references/run-layout.md @@ -9,6 +9,10 @@ Use this layout for each research run. working/ state.yaml todo.yaml + actions/ + index.json + <action_id>.yaml + watch.log reports/ index.md stage-01-<name>.md @@ -34,3 +38,4 @@ Notes: 4. In local execution, `runtime_project_root` can equal `local_project_root`. 5. In remote execution, `runtime_project_root` should be remote and explicit. 6. `run_manifest.yaml` may also record resolved shared-memory source metadata such as repo path, URL, branch, and sync policy. +7. `actions/` is the durable registry for long-running actions that may outlive the current model turn. diff --git a/.agents/skills/run-governor/references/stage-report-schema.md b/.agents/skills/run-governor/references/stage-report-schema.md index d739ebf..1d7487a 100644 --- a/.agents/skills/run-governor/references/stage-report-schema.md +++ b/.agents/skills/run-governor/references/stage-report-schema.md @@ -9,6 +9,8 @@ Stage report detail should scale with task complexity, but keep these minimum fields: 5. Decisions made 6. Next step 7. Paths to artifacts and logs +8. Active long-action status, if any +9. 
Pending watch or resume actions, if any ## reports/index.md entry diff --git a/.agents/skills/run-governor/scripts/launch_long_action.py b/.agents/skills/run-governor/scripts/launch_long_action.py new file mode 100644 index 0000000..7274d13 --- /dev/null +++ b/.agents/skills/run-governor/scripts/launch_long_action.py @@ -0,0 +1,97 @@ +#!/usr/bin/env python3 +"""Launch a long-running action and persist durable state.""" + +from __future__ import annotations + +import argparse +import json +import os +import subprocess +from pathlib import Path + +from state_io import ensure_run_layout, next_poll_timestamp, save_action, utc_now + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(description="Launch and register a long-running action") + parser.add_argument("--run-root", required=True, help="Path to logs/runs/<run_id>") + parser.add_argument("--action-id", default="", help="Stable action id; auto-generated if omitted") + parser.add_argument("--kind", default="long_action", help="Action kind label") + parser.add_argument("--shell-command", default="", help="Command to launch via shell") + parser.add_argument("--cwd", default="", help="Working directory for launch") + parser.add_argument("--expected-duration-seconds", type=int, default=600) + parser.add_argument("--poll-interval-seconds", type=int, default=120) + parser.add_argument("--success-signal", default="", help="Substring indicating success in logs") + parser.add_argument("--failure-signal", default="", help="Substring indicating failure in logs") + parser.add_argument("--resume-step", default="", help="Suggested resume step after polling") + parser.add_argument("--log-path", default="", help="Optional explicit log path") + parser.add_argument("--no-launch", action="store_true", help="Persist only; do not launch a command") + return parser + + +def main() -> int: + args = build_parser().parse_args() + run_root = Path(args.run_root).resolve() + actions_root, _ = 
ensure_run_layout(run_root) + + timestamp = utc_now().replace("+00:00", "Z").replace(":", "").replace("-", "") + action_id = args.action_id.strip() or f"act_{timestamp.lower()}" + cwd = str(Path(args.cwd).resolve()) if args.cwd else os.getcwd() + log_path = Path(args.log_path).resolve() if args.log_path else (actions_root / f"{action_id}.log") + + record = { + "action_id": action_id, + "status": "pending", + "kind": args.kind, + "command": args.shell_command, + "cwd": cwd, + "expected_duration_seconds": max(args.expected_duration_seconds, 1), + "poll_interval_seconds": max(args.poll_interval_seconds, 1), + "launch_time": utc_now(), + "last_heartbeat": "", + "last_poll_at": "", + "next_poll_at": next_poll_timestamp(max(args.poll_interval_seconds, 1)), + "success_signal": args.success_signal, + "failure_signal": args.failure_signal, + "resume_step": args.resume_step, + "log_path": str(log_path), + "pid": None, + } + + if args.shell_command and not args.no_launch: + log_path.parent.mkdir(parents=True, exist_ok=True) + with log_path.open("a", encoding="utf-8") as handle: + process = subprocess.Popen( + args.shell_command, + shell=True, + cwd=cwd, + stdout=handle, + stderr=subprocess.STDOUT, + start_new_session=True, + executable=os.environ.get("SHELL") or "/bin/sh", + ) + record["pid"] = process.pid + record["status"] = "running" + record["last_heartbeat"] = utc_now() + + save_action(run_root, record) + print( + json.dumps( + { + "run_root": str(run_root), + "action_id": action_id, + "status": record["status"], + "pid": record["pid"], + "poll_interval_seconds": record["poll_interval_seconds"], + "next_poll_at": record["next_poll_at"], + "log_path": record["log_path"], + }, + ensure_ascii=True, + indent=2, + ) + ) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.agents/skills/run-governor/scripts/poll_long_action.py b/.agents/skills/run-governor/scripts/poll_long_action.py new file mode 100644 index 0000000..3d14629 --- /dev/null +++ 
b/.agents/skills/run-governor/scripts/poll_long_action.py @@ -0,0 +1,41 @@ +#!/usr/bin/env python3 +"""Poll a long-running action and refresh durable liveness state.""" + +from __future__ import annotations + +import argparse +import json +from pathlib import Path + +from state_io import load_action, next_poll_timestamp, reconcile_action, save_action + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(description="Poll a long-running action") + parser.add_argument("--run-root", required=True, help="Path to logs/runs/<run_id>") + parser.add_argument("--action-id", required=True, help="Action id to poll") + parser.add_argument("--tail-lines", type=int, default=20, help="Number of log lines to retain") + parser.add_argument("--poll-interval-seconds", type=int, default=0, help="Optional model-selected interval for the next poll") + parser.add_argument("--next-poll-seconds", type=int, default=0, help="Optional explicit delay for the next poll") + return parser + + +def main() -> int: + args = build_parser().parse_args() + run_root = Path(args.run_root).resolve() + payload = load_action(run_root, args.action_id) + if args.poll_interval_seconds > 0: + payload["poll_interval_seconds"] = max(args.poll_interval_seconds, 1) + updated, summary = reconcile_action(payload, tail_lines=max(args.tail_lines, 0)) + if args.next_poll_seconds > 0 and str(updated.get("status", "")) in {"pending", "running", "stalled"}: + updated["poll_interval_seconds"] = max(args.next_poll_seconds, 1) + updated["next_poll_at"] = next_poll_timestamp(max(args.next_poll_seconds, 1)) + summary["poll_interval_seconds"] = updated["poll_interval_seconds"] + summary["next_poll_at"] = updated["next_poll_at"] + save_action(run_root, updated) + print(json.dumps(summary, ensure_ascii=True, indent=2)) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.agents/skills/run-governor/scripts/resume_run.py b/.agents/skills/run-governor/scripts/resume_run.py new 
file mode 100644 index 0000000..16a9989 --- /dev/null +++ b/.agents/skills/run-governor/scripts/resume_run.py @@ -0,0 +1,53 @@ +#!/usr/bin/env python3 +"""Reconcile active long-running actions and summarize resume state.""" + +from __future__ import annotations + +import argparse +import json +from pathlib import Path +from typing import Any, Dict, List + +from state_io import ACTIVE_ACTION_STATES, action_due, load_action, load_index, load_working_state, reconcile_action, save_action, save_index, utc_now + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(description="Resume a durable research run") + parser.add_argument("--run-root", required=True, help="Path to logs/runs/<run_id>") + parser.add_argument("--tail-lines", type=int, default=20) + return parser + + +def main() -> int: + args = build_parser().parse_args() + run_root = Path(args.run_root).resolve() + index = load_index(run_root) + active: List[Dict[str, Any]] = [] + + for action_id in index.get("action_ids", []): + payload = load_action(run_root, action_id) + if str(payload.get("status", "")) in ACTIVE_ACTION_STATES or action_due(payload): + payload, summary = reconcile_action(payload, tail_lines=max(args.tail_lines, 0)) + save_action(run_root, payload) + active.append(summary) + + index["last_sweep_at"] = utc_now() + save_index(run_root, index) + working = load_working_state(run_root) + print( + json.dumps( + { + "run_root": str(run_root), + "active_actions": active, + "working_state": working.get("state", {}), + "working_todo": working.get("todo", {}), + }, + ensure_ascii=True, + indent=2, + ) + ) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/.agents/skills/run-governor/scripts/state_io.py b/.agents/skills/run-governor/scripts/state_io.py new file mode 100644 index 0000000..c85293b --- /dev/null +++ b/.agents/skills/run-governor/scripts/state_io.py @@ -0,0 +1,222 @@ +#!/usr/bin/env python3 +"""Shared helpers for durable run and 
long-action state.""" + +from __future__ import annotations + +import json +import os +import uuid +from datetime import datetime, timedelta, timezone +from pathlib import Path +from typing import Any, Dict, List, Tuple + + +ACTIVE_ACTION_STATES = {"pending", "running", "stalled"} +FINAL_ACTION_STATES = {"failed", "completed", "cancelled"} + + +def utc_now() -> str: + return datetime.now(timezone.utc).replace(microsecond=0).isoformat() + + +def parse_iso(value: str) -> datetime: + normalized = value.strip() + if normalized.endswith("Z"): + normalized = normalized[:-1] + "+00:00" + return datetime.fromisoformat(normalized) + + +def next_poll_timestamp(seconds: int) -> str: + return (datetime.now(timezone.utc) + timedelta(seconds=max(seconds, 1))).replace(microsecond=0).isoformat() + + +def read_structured(path: Path) -> Dict[str, Any]: + if not path.exists(): + return {} + text = path.read_text(encoding="utf-8").strip() + if not text: + return {} + try: + payload = json.loads(text) + return payload if isinstance(payload, dict) else {} + except json.JSONDecodeError: + try: + import yaml # type: ignore + except ImportError: + raise RuntimeError(f"Structured file is not JSON and PyYAML is unavailable: {path}") + payload = yaml.safe_load(text) + return payload if isinstance(payload, dict) else {} + + +def write_structured(path: Path, payload: Dict[str, Any]) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + tmp = path.with_name(f"{path.name}.{os.getpid()}.{uuid.uuid4().hex}.tmp") + with tmp.open("w", encoding="utf-8") as handle: + json.dump(payload, handle, ensure_ascii=True, indent=2, sort_keys=True) + handle.write("\n") + os.replace(tmp, path) + + +def ensure_run_layout(run_root: Path) -> Tuple[Path, Path]: + actions_root = run_root / "actions" + working_root = run_root / "working" + actions_root.mkdir(parents=True, exist_ok=True) + working_root.mkdir(parents=True, exist_ok=True) + return actions_root, working_root + + +def action_file(run_root: Path, 
action_id: str) -> Path: + return run_root / "actions" / f"{action_id}.yaml" + + +def load_index(run_root: Path) -> Dict[str, Any]: + _, _ = ensure_run_layout(run_root) + index_path = run_root / "actions" / "index.json" + payload = read_structured(index_path) + payload.setdefault("action_ids", []) + payload.setdefault("last_sweep_at", "") + return payload + + +def save_index(run_root: Path, payload: Dict[str, Any]) -> None: + write_structured(run_root / "actions" / "index.json", payload) + + +def register_action(run_root: Path, action_id: str) -> None: + payload = load_index(run_root) + action_ids = [item for item in payload.get("action_ids", []) if isinstance(item, str)] + if action_id not in action_ids: + action_ids.append(action_id) + payload["action_ids"] = sorted(action_ids) + save_index(run_root, payload) + + +def load_action(run_root: Path, action_id: str) -> Dict[str, Any]: + payload = read_structured(action_file(run_root, action_id)) + if not payload: + raise FileNotFoundError(f"Missing action record: {action_id}") + return payload + + +def save_action(run_root: Path, payload: Dict[str, Any]) -> None: + action_id = str(payload.get("action_id", "")).strip() + if not action_id: + raise ValueError("Action payload missing action_id") + register_action(run_root, action_id) + write_structured(action_file(run_root, action_id), payload) + + +def pid_alive(pid: int) -> bool: + try: + os.kill(pid, 0) + except OSError: + return False + return True + + +def tail_text(path: Path, tail_lines: int) -> str: + if not path.exists(): + return "" + lines = path.read_text(encoding="utf-8", errors="replace").splitlines() + if tail_lines <= 0: + return "" + return "\n".join(lines[-tail_lines:]) + + +def signal_hit(text: str, signal_text: str) -> bool: + return bool(signal_text and signal_text in text) + + +def followup_action_for_status(status: str, progress_changed: bool) -> str: + if status == "completed": + return "collect-results" + if status == "failed": + return 
"diagnose-failure" + if status == "stalled": + return "diagnose-stall" + if progress_changed: + return "continue-watch" + return "wait-and-poll" + + +def reconcile_action(payload: Dict[str, Any], tail_lines: int = 20) -> Tuple[Dict[str, Any], Dict[str, Any]]: + record = dict(payload) + now = utc_now() + previous_status = str(record.get("status", "pending")) + status = previous_status + poll_interval_seconds = int(record.get("poll_interval_seconds", 120) or 120) + previous_log_tail = str(record.get("last_log_tail", "")) + previous_heartbeat = str(record.get("last_heartbeat", "")) + log_path = Path(str(record.get("log_path", "")).strip()) if str(record.get("log_path", "")).strip() else None + log_tail = tail_text(log_path, tail_lines) if log_path else "" + success_signal = str(record.get("success_signal", "")).strip() + failure_signal = str(record.get("failure_signal", "")).strip() + pid = record.get("pid") + + alive = False + if isinstance(pid, int) and pid > 0: + alive = pid_alive(pid) + + if signal_hit(log_tail, failure_signal): + status = "failed" + elif signal_hit(log_tail, success_signal): + status = "completed" + elif alive: + status = "running" + elif status in FINAL_ACTION_STATES: + status = status + elif pid: + status = "stalled" + + record["status"] = status + record["last_poll_at"] = now + if log_path and log_path.exists(): + record["last_heartbeat"] = datetime.fromtimestamp(log_path.stat().st_mtime, tz=timezone.utc).replace(microsecond=0).isoformat() + elif alive: + record["last_heartbeat"] = now + + progress_changed = log_tail != previous_log_tail or str(record.get("last_heartbeat", "")) != previous_heartbeat + next_interval = int(record.get("poll_interval_seconds", poll_interval_seconds) or poll_interval_seconds) + record["poll_interval_seconds"] = max(next_interval, 1) + status_changed = status != previous_status + followup_action = followup_action_for_status(status, progress_changed) + + if status in ACTIVE_ACTION_STATES: + record["next_poll_at"] = 
next_poll_timestamp(record["poll_interval_seconds"]) + else: + record["next_poll_at"] = "" + + record["last_log_tail"] = log_tail + summary = { + "action_id": record.get("action_id", ""), + "status": status, + "pid": pid, + "alive": alive, + "status_changed": status_changed, + "progress_changed": progress_changed, + "followup_action": followup_action, + "poll_interval_seconds": record["poll_interval_seconds"], + "next_poll_at": record.get("next_poll_at", ""), + "log_path": str(log_path) if log_path else "", + "last_log_tail": log_tail, + } + return record, summary + + +def action_due(payload: Dict[str, Any], now: datetime | None = None) -> bool: + if str(payload.get("status", "")) not in ACTIVE_ACTION_STATES: + return False + next_poll_at = str(payload.get("next_poll_at", "")).strip() + if not next_poll_at: + return True + current = now or datetime.now(timezone.utc) + return parse_iso(next_poll_at) <= current + + +def load_working_state(run_root: Path) -> Dict[str, Any]: + working_state = read_structured(run_root / "working" / "state.yaml") + todo_state = read_structured(run_root / "working" / "todo.yaml") + return { + "state": working_state, + "todo": todo_state, + } diff --git a/.agents/skills/run-governor/scripts/watch_active_runs.py b/.agents/skills/run-governor/scripts/watch_active_runs.py new file mode 100644 index 0000000..46c1ba2 --- /dev/null +++ b/.agents/skills/run-governor/scripts/watch_active_runs.py @@ -0,0 +1,91 @@ +#!/usr/bin/env python3 +"""Background watcher for active durable run actions.""" + +from __future__ import annotations + +import argparse +import json +import time +from datetime import datetime, timezone +from pathlib import Path +from typing import Dict, List, Optional + +from state_io import ACTIVE_ACTION_STATES, action_due, load_action, load_index, parse_iso, reconcile_action, save_action, save_index, utc_now + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(description="Watch active durable run 
actions") + parser.add_argument("--logs-root", default="", help="Path to logs/runs root") + parser.add_argument("--run-root", default="", help="Optional specific run root") + parser.add_argument("--sleep-seconds", type=int, default=120, help="Loop interval when not using --once") + parser.add_argument("--tail-lines", type=int, default=20) + parser.add_argument("--once", action="store_true", help="Run one sweep and exit") + return parser + + +def iter_run_roots(args: argparse.Namespace) -> List[Path]: + if args.run_root: + return [Path(args.run_root).resolve()] + logs_root = Path(args.logs_root).resolve() + if not logs_root.exists(): + return [] + return sorted(path for path in logs_root.iterdir() if path.is_dir()) + + +def sweep_run(run_root: Path, tail_lines: int) -> Dict[str, object]: + index = load_index(run_root) + updates = [] + now = datetime.now(timezone.utc) + for action_id in index.get("action_ids", []): + payload = load_action(run_root, action_id) + if str(payload.get("status", "")) not in ACTIVE_ACTION_STATES and not action_due(payload, now): + continue + if not action_due(payload, now): + continue + payload, summary = reconcile_action(payload, tail_lines=tail_lines) + save_action(run_root, payload) + updates.append(summary) + index["last_sweep_at"] = utc_now() + save_index(run_root, index) + return { + "run_root": str(run_root), + "updates": updates, + } + + +def compute_global_sleep_seconds(run_roots: List[Path], fallback_seconds: int) -> int: + now = datetime.now(timezone.utc) + next_due_seconds: List[int] = [] + for run_root in run_roots: + index = load_index(run_root) + for action_id in index.get("action_ids", []): + payload = load_action(run_root, action_id) + status = str(payload.get("status", "")) + if status not in ACTIVE_ACTION_STATES: + continue + next_poll_at = str(payload.get("next_poll_at", "")).strip() + if not next_poll_at: + next_due_seconds.append(1) + continue + delta = int((parse_iso(next_poll_at) - now).total_seconds()) + 
next_due_seconds.append(max(delta, 1)) + if not next_due_seconds: + return fallback_seconds + return max(1, min(fallback_seconds, min(next_due_seconds))) + + +def main() -> int: + args = build_parser().parse_args() + sleep_seconds = max(args.sleep_seconds, 1) + + while True: + run_roots = iter_run_roots(args) + summaries = [sweep_run(run_root, max(args.tail_lines, 0)) for run_root in run_roots] + print(json.dumps({"timestamp": utc_now(), "runs": summaries}, ensure_ascii=True, indent=2)) + if args.once: + return 0 + time.sleep(compute_global_sleep_seconds(run_roots, sleep_seconds)) + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/AGENTS.md b/AGENTS.md index 094de09..aae299d 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -33,12 +33,12 @@ Before producing any substantive response, you MUST run this mental checklist: ## Default Operating Rules 1. Start each non-trivial research task with `run-governor`, but do not initialize `run_id` paths before explicit user confirmation of both `mode` and execution target (`local|remote`). 2. Use `research-workflow` as the default orchestration loop. -3. Use `memory-manager` to maintain working todo state and long-term memory. +3. Use `memory-manager` in `experience-first` mode: reusable `procedure/episode/insight` retrieval comes before relying on `working` state alone. 4. If you modify `memory-manager` or any Memory-related skill, or detect compaction markers in state/context files such as `Compact`, `压缩`, `Summary`, or similar summary/compression techniques, invoke `memory-manager` to read prior Memory before continuing so key context is not dropped. 5. Trigger `human-checkpoint` using mode-aware policy, always for major safety risks and shared-memory publication. -6. Use `experiment-execution` only for actual run execution. +6. Use `experiment-execution` only for actual run execution, and keep ownership after launch for monitoring, diagnosis, recovery, and result collection. 7. 
Use `project-context` to collect and persist per-project private runtime context before experiments or report/eval execution. -8. Use `deep-research` for deep external investigation and evidence synthesis, including early-stage project scoping when a user wants to write a research study or paper on a topic, unless the user is explicitly asking for a paper-writing deliverable right now. +8. Use `deep-research` as the default gateway for external search and deep external investigation, including early-stage project scoping when a user wants to write a research study or paper on a topic, unless the user is explicitly asking for a paper-writing deliverable right now. 9. Use `research-plan` when the user asks for a proposal, roadmap, ablation/evaluation plan, study design, or pre-implementation research decomposition. 10. After open-ended scoping in `deep-research`, hand off findings into `research-plan` by default; skip only if the user explicitly opts out. 11. Use `paper-writing` only when the user explicitly asks for a paper-writing deliverable such as drafting or revising a paper, section, or rebuttal. Do not use it for topic scoping, literature investigation, feasibility analysis, experiment design, or experiment execution. @@ -47,23 +47,89 @@ Before producing any substantive response, you MUST run this mental checklist: 14. Follow `REPO_CONVENTIONS.md` for artifact placement and commit hygiene. 15. If a run was initialized before confirmation, stop and run violation recovery: acknowledge, ask whether to keep/clean artifacts, and wait for explicit reconfirmation before continuing. 16. **Mandatory Visualization**: Every report with quantitative results MUST include code-generated visualizations (matplotlib). Always generate figures when writing stage reports or final reports. If the report is complex, invoke `paper-writing` for polished formatting. Under-visualizing is a violation. +17. 
For long-running work, do not treat launch as completion: persist an action record, enter watch mode, poll on a model-chosen cadence, and continue until success criteria, a true blocker, or a gated approval point is reached. +18. Do not respond with the equivalent of "the job is running, come back later" unless the user explicitly requested fire-and-forget behavior. -## Memory Invocation Guardrails (Balanced) -1. `memory-manager` is mandatory for non-trivial runs, but only as a control-plane step, not per command. -2. Mandatory calls per non-trivial run: +## Persistent Optimization Guardrails +1. Interpret `full-auto` as an interruption policy, not a completion policy. +2. If the user says things such as “keep iterating”, “do not stop”, “try many iterations”, “until target”, or gives explicit target metrics like `90%` or `100%`, compile that into `persistent-optimization` behavior. +3. For persistent-optimization tasks, compile the user request into machine-checkable fields before execution: + - `primary_target` + - `promotion_gates` + - `non_regression_guards` + - `backup_policy` + - `stop_allowed_only_if` +4. Do not leave stopping conditions as prose only when they can be converted into measurable gates. +5. If the user asks to preserve strong variants, snapshot best-so-far prompts/configs/code/results before higher-risk changes. +6. `full-auto` plus explicit persistence means the agent keeps ownership until one of these is true: + - compiled hard targets are met + - a true hard blocker remains after reasonable recovery attempts + - a major safety/resource gate requires approval + - the user explicitly changes or stops the objective + +## Goal and Done Guardrails +1. At the start of each non-trivial execution loop, refresh the compiled goal state and active promotion gate. +2. `done` is allowed only when all compiled hard gates are satisfied with evidence. +3. 
If `completion_policy=until-target-or-hard-blocker`, `done` is forbidden while the active promotion gate or hard target remains unmet. +4. A single clean run, a partial fix, or one successful batch is not sufficient reason to stop. +5. If the current promotion gate is met but the final target is not, promote to the next gate instead of stopping. +6. If targets remain unmet and a safe next step exists, default to `iterate`. +7. If repeated attempts plateau or regress materially, default to `replan`. + +## Short Iterative Execution Guardrails +1. Apply the same ownership standard to short local edit-evaluate loops as to long-running jobs. +2. For iterative optimization tasks, define an evaluation ladder before broadening scope: + - baseline or previous-best reference + - representative regression set + - promotion gate for larger evaluation + - final target evaluation +3. Prefer broader representative sets over a few hand-picked cases. +4. After each batch: + - compare against baseline and best-so-far + - inspect regressions, not only aggregate score + - check non-regression guards + - choose `iterate`, `replan`, or `promote-to-next-gate` +5. Do not stop after a single iteration merely because execution completed cleanly. +6. For prompts like “先用 30 个左右的题目集合测效果,再考虑上 100” (“test on a set of about 30 problems first, then consider scaling to 100”), treat the smaller set as a required promotion gate rather than a suggestion. + +## Long-Running Execution Guardrails +1. Classify an action as long-running when it is expected to exceed 5 minutes, launches async or remote work, is high-resource, or is likely to outlive the current model turn. +2. Before waiting on a long-running action: + - persist `actions/<action_id>.yaml` + - record command, cwd, expected duration, poll interval, log path, success/failure signals, and resume step + - update working state with the active `action_id` +3. 
While the current session is active, use a watch loop: + - model chooses sleep + - poll the action + - inspect `status`, `progress_changed`, `followup_action`, and recent logs + - choose the next sleep or branch into diagnosis/result handling +4. Allowed liveness states are `pending`, `running`, `stalled`, `failed`, `completed`, and `cancelled`. +5. After every poll, keep ownership and branch immediately: + - `continue-watch` or `wait-and-poll` + - `collect-results` + - `diagnose-stall` + - `diagnose-failure` + - `replan` +6. At the start of every resumed turn, reconcile active actions before unrelated planning. + +## Memory Invocation Guardrails (Experience-First) +1. `memory-manager` is mandatory for non-trivial runs, but retrieval should center on reusable experience, not only `working` state. +2. Mandatory per non-trivial run: - one bootstrap `retrieve/init-working` before planning or execution - one close-out writeback before task completion -3. Conditional calls between bootstrap and close-out are trigger-based only: - - stage change - - replan - - significant failure or new error signature - - before high-resource action - - before final report/answer handoff -4. Periodic refresh is allowed when either is true: - - at least 15 minutes since last memory operation - - at least 3 execution cycles since last memory operation -5. Cooldown rule: do not invoke `memory-manager` more than once in a cycle unless forced by safety/high-resource/failure triggers. -6. If memory is skipped due to cooldown or low delta, record `memory_skip_reason` in the stage report. +3. Mandatory per turn and per batch: + - retrieve relevant memory on every new user turn + - retrieve `procedure` before every execution batch + - write a concise `working` delta after every execution batch +4. 
Mandatory trigger-based retrieval: + - retrieve `episode` on significant failure, repeated attempt, stalled job, or new error signature + - retrieve `insight` during planning, replanning, contradiction handling, tradeoff analysis, or final answer shaping + - retrieve `procedure` plus relevant `episode` before high-resource or irreversible actions + - reread `working` during resume, compaction recovery, long-action reconciliation, and final handoff +5. After long-action polls: + - on `stalled` or `failed`, retrieve `procedure` plus relevant `episode` before the next fix attempt + - on `completed`, retrieve `insight` when interpretation or next-step selection is needed +6. If memory is skipped due to duplicate retrieval, freshness, or low yield, record `memory_skip_reason`. ## Deep-Research Re-entry Guardrails 1. On every new user message, re-run skill routing before continuing prior stage actions. @@ -71,10 +137,29 @@ Before producing any substantive response, you MUST run this mental checklist: 3. Research-intent signals include (semantic match, Chinese or English): - 调研/研究/对比/综述/文献/证据/机制/根因/为什么/可行性/路线图 - research/investigate/compare/survey/literature/evidence/mechanism/root-cause/why/feasibility/roadmap -4. If skipping `deep-research`, emit `dr_skip_reason` with concrete evidence freshness info (source date / timestamp), not a generic statement. -5. Cooldown for non-forced deep-research calls: +4. All external search for non-trivial research runs must route through `deep-research`; do not bypass it with ad hoc search. +5. Every `deep-research` run must begin with a frontier-first scout before final depth selection. +6. Default depth is `default-auditable`; `light` is a downgrade path only after scout and may not be the silent default. +7. Do not claim deep-research completion without actual WebSearch calls and an auditable query trail. +8. 
If skipping `deep-research`, emit `dr_skip_reason` with concrete evidence freshness info (source date / timestamp), not a generic statement. +9. Cooldown for non-forced deep-research calls: - at most once per stage unless objective changed or new contradiction/high-impact uncertainty appears. +## Experiment Watch Guardrails +1. When `experiment-execution` launches a long-running training, evaluation, benchmark, or inference job, it must enter watch mode by default. +2. After each experiment poll: + - if `running`, choose the next sleep interval and keep monitoring + - if `completed`, inspect outputs, checkpoints, metrics, and artifacts immediately + - if `stalled`, inspect evidence, retrieve memory, and attempt the smallest safe recovery or replan + - if `failed`, diagnose immediately, retrieve memory, and attempt the smallest safe recovery +3. Unknown execution errors should follow this branch: + - local evidence triage + - `procedure` and `episode` retrieval + - targeted search + - `deep-research` if still unresolved or freshness-sensitive + - minimal fix validation +4. Only allow fire-and-forget experiment behavior when the user clearly requested it. + ## Paper-Writing Trigger Guardrails 1. Activate `paper-writing` only when the user explicitly asks for a paper-writing output. 2. Valid triggers include drafting or revising a paper, a named paper section, or rebuttal text. diff --git a/demos/phd_zero_demo_e2e_prompting_tricks_v0_0316_480p.mov b/demos/phd_zero_demo_e2e_prompting_tricks_v0_0316_480p.mov deleted file mode 100644 index bab147e..0000000 Binary files a/demos/phd_zero_demo_e2e_prompting_tricks_v0_0316_480p.mov and /dev/null differ
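
Review note: the Persistent Optimization and Goal/Done guardrails in this diff require compiling stopping conditions into machine-checkable gates rather than prose. A minimal Python sketch of what such a compiled check could look like; the field names (`primary_target`, `promotion_gates`, `non_regression_guards`) come from the diff, while the `GoalState` class and the decision logic are illustrative assumptions, not the skill's actual implementation:

```python
# Hypothetical sketch of the compiled goal state from the guardrails above.
# Field names follow the diff; the class shape and logic are assumptions.
from dataclasses import dataclass, field

@dataclass
class GoalState:
    primary_target: float                  # hard final target, e.g. 0.90 accuracy
    promotion_gates: list                  # ascending gate thresholds, e.g. [0.70, 0.80]
    non_regression_guards: dict = field(default_factory=dict)  # metric -> minimum floor
    gate_index: int = 0                    # which promotion gate is currently active

def next_step(state: GoalState, score: float, metrics: dict) -> str:
    """Return 'done', 'promote-to-next-gate', 'iterate', or 'replan'."""
    # Non-regression guards are hard checks: any breach forces a replan.
    for metric, floor in state.non_regression_guards.items():
        if metrics.get(metric, floor) < floor:
            return "replan"
    # 'done' is allowed only when the compiled hard target is met.
    if score >= state.primary_target:
        return "done"
    # A met promotion gate promotes to the next gate instead of stopping.
    if (state.gate_index < len(state.promotion_gates)
            and score >= state.promotion_gates[state.gate_index]):
        return "promote-to-next-gate"
    # Targets unmet but no guard breached: keep iterating by default.
    return "iterate"
```

Under this sketch, a single clean batch that passes an intermediate gate yields `promote-to-next-gate`, never `done`, which is the behavior guardrails 4 and 5 of "Goal and Done Guardrails" ask for.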