feat(compiler): adaptive exploration modes per plan step (entropy scheduling) #914
Turn measured signals (problem cues, diagnostic intents, risk/policy) into an explicit per-step latitude budget: explore / decide / execute / verify. Diagnostic requests get an explore-first plan and a Working approach section; destructive or high-risk approval-gated changes gain a trailing verify step. Clear or trivial requests stay byte-identical to today's output (anti-boilerplate hard rule, locked by a gate suite).

- StepV2 gains a structured optional scheduling object (mode + deterministic reason enum + normalized confidence, extra fields allowed) so agent packs / analytics / routing can evolve without another IR or schema redesign
- New ExplorationHandler runs last in the chain; deterministic, offline, provider-agnostic; writes metadata.uncertainty_profile on every compile
- emit_plan_v2 renders (explore) tags and [decide]/[verify] pseudo-steps mirroring the [clarify]/[policy] precedent; emit_expanded_prompt_v2 adds a suppressible Working approach section (EN/TR/ES)
- Both ir_v2 JSON schema copies accept the null-tolerant scheduling object; contract enums IR_STEP_MODES / IR_SCHEDULING_REASONS added
- Known intent pollution (LIVE_DEBUG_KEYWORDS 'logs?' matching 'login') is contained via a mandatory problem cue and pinned with a regression anchor test; the keyword fix itself is a separate follow-up

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
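The intent-pollution point in the last bullet can be reproduced in a few lines. This is a minimal sketch, not the compiler's code: `LOOSE_KEYWORD`, `BOUNDED_KEYWORD`, `PROBLEM_CUES`, and `should_explore` are hypothetical stand-ins, and the cue list is illustrative only. It assumes the keyword is applied as an unanchored regex, which is what the `logs?` / `login` collision suggests.

```python
import re

# Hypothetical stand-ins: the real LIVE_DEBUG_KEYWORDS list and the real
# problem-cue detection are not shown in this PR conversation.
LOOSE_KEYWORD = re.compile(r"logs?")        # unanchored, as in the reported bug
BOUNDED_KEYWORD = re.compile(r"\blogs?\b")  # word-boundary variant (follow-up idea)
PROBLEM_CUES = ("fails", "error", "broken", "crash")  # illustrative only


def has_problem_cue(text: str) -> bool:
    """True when the request describes something going wrong."""
    lowered = text.lower()
    return any(cue in lowered for cue in PROBLEM_CUES)


def should_explore(text: str) -> bool:
    """Containment rule: a keyword hit alone is not enough, a cue is mandatory."""
    return bool(LOOSE_KEYWORD.search(text.lower())) and has_problem_cue(text)


anchor = "Implement secure login sessions"
loose_hit = bool(LOOSE_KEYWORD.search(anchor.lower()))      # matches "log" inside "login"
bounded_hit = bool(BOUNDED_KEYWORD.search(anchor.lower()))  # no match with boundaries
print(loose_hit, bounded_hit, should_explore(anchor))
```

The mandatory cue keeps the anchor prompt unscheduled even while the loose keyword still matches, which is why the containment can ship before the keyword fix.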
Stale comment
PR Risk Assessment — Medium
Decision: Human review required. This automation is not approving this PR.
Evidence (from diff only)
| Area | Change | Risk signal |
|---|---|---|
| `app/compiler.py` | Registers new `ExplorationHandler` as the final handler in the v2 chain | Core pipeline / shared library |
| `app/heuristics/handlers/exploration.py` (+185) | New scheduling rules (explore/decide/execute/verify) driven by problem cues, intents, ambiguity, complexity, and policy | Cross-file behavioral logic |
| `app/emitters.py` (+155) | `emit_plan_v2` adds mode tags and `[decide]`/`[verify]` pseudo-steps; `emit_expanded_prompt_v2` adds a Working approach section with EN/TR/ES mode directives | User-facing compiled output; conditional prompt/instruction text |
| `app/models_v2.py` + `schema/ir_v2.schema.json` | New `StepV2.scheduling` object on the IR contract | Shared data model extension |
| Tests (+638 lines) | Dedicated handler, emitter, gate, and schema suites | Mitigates regression risk but does not reduce blast radius |

Diff size: 11 files, ~1,072 additions / 5 deletions (trigger commit range d88610e…c68461b).
Why Medium (not Low)
- Touches the compiler heuristics chain, IR schema, and emitters together — a shared, production codepath on every compile.
- Behavioral output changes when scheduling engages: plan formatting and expanded-prompt instructions differ from today's output.
- Emitter additions include new instruction text (`_MODE_DIRECTIVES`, `_PLAN_MODE_RATIONALE`) surfaced to downstream agents; prompt-surface changes warrant review even though a suppression gate keeps trivial prompts byte-identical.
- Integrates with policy/risk signals (`destructive_operation`, `human_approval_required`) for verify scheduling.
Mitigating factors (why not Medium-High / High)
- Deterministic, offline heuristic — no provider/LLM parameter changes.
- Additive, null-default schema field; no auth/billing/infra/deployment edits.
- Strong anti-boilerplate gate + extensive focused tests (`test_exploration_gate.py`, emitter/schema suites).
- No destructive DB migration or security-model rewrite.
Reviewer assignment
`reviewRequests` is empty and the repository has a single human collaborator (`madara88645`, also the PR author). No additional domain-expert reviewers were requested to avoid self-review loops. Maintainer self-review or an external reviewer is still recommended before merge.
Approval status
- No prior automation approval on this PR.
- No CODEOWNERS file detected.
- Per decision rules: Medium → review required; do not self-approve.
CI note
Smoke/CodeQL were in progress at assessment time; Snyk and GitGuardian were green.
Sent by Cursor Automation: Assign PR reviewers
Stale comment
PR Risk Assessment
Risk level: Medium
Code review required: Yes
Reviewers assigned: None. The sole human maintainer (`madara88645`) is the PR author; no additional eligible collaborators found in repo history
Approval: Not approved (Medium risk; human review recommended before merge)
Evidence-based assessment
Assessed solely from the diff (11 files, +1072 / −5 lines). Ignored scope and risk claims in the PR description.
| Area | Finding |
|---|---|
| Codepaths | Core compile pipeline: `app/compiler.py`, new `app/heuristics/handlers/exploration.py`, `app/emitters.py`, `app/models_v2.py`, `app/ir_contract.py`, IR v2 JSON schemas |
| Blast radius | Global: affects plan rendering, expanded-prompt "Working approach" section, and per-step scheduling metadata for all offline compilations when heuristic rules fire |
| Behavioral changes | New `ExplorationHandler` (runs last in chain) assigns `explore`/`decide`/`execute`/`verify` modes; emitters render mode tags, pseudo-steps, and multilingual mode directives (EN/TR/ES) |
| Prompt surfaces | New model-guidance text in `emit_plan_v2` and `emit_expanded_prompt_v2`: conditional, but affects downstream agent behavior when scheduled |
| Schema / contract | Additive `StepScheduling` on `StepV2` + `uncertainty_profile` metadata; `additionalProperties: true` on scheduling object |
| Infra / auth / DB | None |
| Test coverage | Strong: 4 new/extended test files (~43 cases) covering handler rules, emitter rendering, anti-boilerplate gate, schema validation |
| CODEOWNERS | None configured |
| Existing reviewers | 0 |
| Prior approval | None |
Why Medium (not Low)
- Cross-file behavioral changes in shared core services (compiler + heuristics + emitters), comparable in scope to #912 (also Medium).
- Prompt-adjacent content — plan rationales and "Working approach" directives are model-guidance surfaces; keyword/heuristic routing can alter output for diagnostic and high-risk requests.
- IR contract extension — new scheduling field on steps affects downstream consumers (agent packs, exports, analytics).
- Meaningful regression surface despite suppression rules (trivial prompts stay byte-identical per gate tests).
Why not Medium-High / High
- No infrastructure, auth, billing, deployment, or data-migration changes.
- Deterministic, offline-only heuristic with conservative guards (problem-cue + diagnostic-ask gating for explore).
- Extensive regression and anti-boilerplate gate coverage; clear revert path.
- Additive optional schema field (`scheduling: null` by default).
CI note
Smoke check is failing on pre-commit formatting (`app/emitters.py`, `tests/heuristics/test_exploration_handler.py`). This is a formatting-only delta and does not change the risk classification, but should be fixed before merge.
Actions taken
- ⏭️ Not approved (Medium risk)
- ⚠️ Reviewers not assigned (no eligible non-author collaborators)
- ℹ️ Manual review recommended: focus on explore/verify trigger edge cases, pseudo-step ordering, and byte-identical suppression for common prompts
Automated risk assessment — conclusions derived from diff evidence only.
Sent by Cursor Automation: Assign PR reviewers
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR Risk Assessment (re-evaluated on synchronize)
Risk level: Medium (unchanged)
Code review required: Yes
Reviewers assigned: None — sole repository collaborator (madara88645) is the PR author; no additional eligible reviewers found
Approval: Not approved (Medium risk — human review recommended before merge)
Re-evaluation trigger
PR synchronized with new HEAD 7a1941b (style: apply ruff-format to exploration-modes files). Compared to prior assessment at c68461b, the delta is formatting-only in app/emitters.py and tests/heuristics/test_exploration_handler.py (+11/−8 lines). Risk classification is unchanged. No prior automation approval to revoke.
Evidence-based assessment
Assessed solely from the full diff (11 files, +1075 / −5 lines). Ignored scope and risk claims in the PR description.
| Area | Finding |
|---|---|
| Codepaths | Core compile pipeline: app/compiler.py, new app/heuristics/handlers/exploration.py, app/emitters.py, app/models_v2.py, app/ir_contract.py, IR v2 JSON schemas |
| Blast radius | Global — affects plan rendering, expanded-prompt "Working approach" section, and per-step scheduling metadata for offline compilations when heuristic rules fire |
| Behavioral changes | New ExplorationHandler (runs last in chain) assigns explore/decide/execute/verify modes; emitters render mode tags, pseudo-steps, and multilingual mode directives (EN/TR/ES) |
| Prompt surfaces | New model-guidance text in emit_plan_v2 and emit_expanded_prompt_v2 — conditional but affects downstream agent behavior when scheduled |
| Schema / contract | Additive StepScheduling on StepV2 + uncertainty_profile metadata; additionalProperties: true on scheduling object |
| Infra / auth / DB | None |
| Test coverage | Strong — 4 new/extended test files (~43 cases): handler rules, emitter rendering, anti-boilerplate gate, schema validation |
| CODEOWNERS | None configured |
| Existing reviewers | 0 |
| Prior approval | None |
Why Medium (not Low)
- Cross-file behavioral changes in shared core services (compiler + heuristics + emitters).
- Prompt-adjacent content — plan rationales and "Working approach" directives are model-guidance surfaces; keyword/heuristic routing can alter output for diagnostic and high-risk requests.
- IR contract extension — new scheduling field on steps affects downstream consumers (agent packs, exports, analytics).
- Meaningful regression surface despite suppression rules (trivial prompts stay byte-identical per gate tests).
Why not Medium-High / High
- No infrastructure, auth, billing, deployment, or data-migration changes.
- Deterministic, offline-only heuristic with conservative guards (problem-cue + diagnostic-ask gating for explore).
- Extensive regression and anti-boilerplate gate coverage; clear revert path.
- Additive optional schema field (`scheduling: null` by default).
CI note
Smoke check was pending at assessment time; CodeQL, Snyk, GitGuardian, and Vercel were green. Prior formatting failure appears addressed by 7a1941b.
Actions taken
- ⏭️ Not approved (Medium risk)
- ⚠️ Reviewers not assigned (no eligible non-author collaborators)
- ℹ️ Manual review recommended: focus on explore/verify trigger edge cases, pseudo-step ordering, and byte-identical suppression for common prompts
Automated risk assessment — conclusions derived from diff evidence only.
Sent by Cursor Automation: Assign PR reviewers
Document the adaptive latitude scheduler shipped in #914: the four modes, where they appear, the Working approach section, the silence guarantee for clear requests, and the machine-readable schedule in the IR for downstream consumers.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
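For readers wondering what consuming the machine-readable schedule might look like, here is a minimal sketch of a downstream reader. Only the names `steps`, `scheduling`, `mode`, `reason`, and `confidence` and the four mode values come from this PR; the `text` key, the `example_reason` value, and `future_field` are hypothetical placeholders, and `step_mode` is not a function from the codebase.

```python
import json
from typing import Optional

# Mode names come from the PR; everything else in this dump is illustrative.
KNOWN_MODES = {"explore", "decide", "execute", "verify"}

# Hypothetical IR fragment: "text", "example_reason", and "future_field" are
# placeholders, not values taken from the real schema or reason enum.
IR_DUMP = json.loads("""
{
  "steps": [
    {"text": "Inspect the failing job", "scheduling":
        {"mode": "explore", "reason": "example_reason", "confidence": 0.8,
         "future_field": "ignored by this consumer"}},
    {"text": "Apply the fix", "scheduling": null},
    {"text": "Legacy step without the key"}
  ]
}
""")


def step_mode(step: dict) -> Optional[str]:
    """Return the step's mode, or None when it is unscheduled or unrecognized."""
    scheduling = step.get("scheduling")
    if not isinstance(scheduling, dict):
        return None  # covers both a missing key and an explicit null
    mode = scheduling.get("mode")
    return mode if mode in KNOWN_MODES else None


modes = [step_mode(step) for step in IR_DUMP["steps"]]
print(modes)
```

Reading only `mode` and ignoring unknown keys mirrors the stated intent that extra fields can be added later without breaking existing consumers.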


What
Turns the compiler into an explicit uncertainty scheduler: signals it already measures (problem cues in the user's own words, diagnostic intents, ambiguity, complexity, risk/policy) now assign each plan step a latitude budget (`explore` / `decide` / `execute` / `verify`) instead of only driving questions and policy.
- Diagnostic requests get an explore-first plan, a `[decide]` convergence pseudo-step on multi-step plans, and a `Working approach` section in the expanded prompt (EN/TR/ES).
- Destructive or high-risk approval-gated changes gain a trailing `[verify]` pseudo-step.
- Clear or trivial requests stay byte-identical to today's output (`scheduling: null`), locked by a dedicated gate suite.
Design notes
- `StepV2.scheduling` is a structured object (`mode` + deterministic `reason` enum + normalized `confidence`) with `additionalProperties: true`, so agent packs / analytics / adaptive routing can add fields without another IR/schema redesign. Rendering reads only `mode`.
- `ExplorationHandler` runs last in the chain (needs final intents + policy). Deterministic, offline-only, provider-agnostic: no LLM prompt/param changes.
- `decide` / `verify` are render-time pseudo-steps mirroring the `[clarify]` / `[policy]` precedent; `ir.steps` stays faithful to the user's words.
- Known intent pollution (`LIVE_DEBUG_KEYWORDS` `logs?` matches inside "login") is contained via a mandatory problem cue and pinned with a regression anchor test (`Implement secure login sessions` stays unscheduled). Fixing the keyword itself is a separate follow-up.
Tests
- `tests/heuristics/test_exploration_handler.py`: per-rule units (R1–R4) + pipeline spot checks (22 tests)
- `tests/test_emitters_scheduling.py`: tag rendering, pseudo-step ordering/numbering, suppression, byte-identical execute/untagged rendering (8 tests)
- `tests/test_exploration_gate.py`: anti-boilerplate gate, where trivial prompts gain zero scheduling text, plus determinism via double-compile dump equality (10 tests)
- `tests/test_schema_validation.py`: mode enum ↔ contract alignment, live-dump validation, future-field tolerance (+3 tests)
Verification
- `test_qa_report_gate.py`: green
- `test_cli_new_features.py::test_validate_summary_and_api_schemas`: pre-existing environment flake (the test opportunistically queries `127.0.0.1:8000` when the port is open; a local Docker service on :8000 returns 404). Unrelated to this change.
- `ruff check app/ tests/`: clean
Out of scope
Frontend mode badges, agent-pack phase mapping, benchmark cases, and the `LIVE_DEBUG_KEYWORDS` word-boundary fix are tracked as follow-ups.

🤖 Generated with Claude Code
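As a rough illustration of the render-time pseudo-step idea and the silence guarantee described under Design notes, the sketch below renders a plan from immutable steps. It is not `emit_plan_v2`: the `Step` class, both render functions, the `needs_verify` flag, the tag position, the pseudo-step wording, and the choice to number pseudo-steps are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """Hypothetical stand-in for a plan step; frozen so rendering cannot mutate it."""
    text: str
    mode: Optional[str] = None  # "explore", "execute", or None when unscheduled


def render_plain(steps: List[Step]) -> str:
    """Baseline renderer with no scheduling awareness."""
    return "\n".join(f"{i}. {s.text}" for i, s in enumerate(steps, start=1))


def render_scheduled(steps: List[Step], needs_verify: bool = False) -> str:
    """Tag explore steps and add pseudo-steps at render time only."""
    explore_idx = [i for i, s in enumerate(steps) if s.mode == "explore"]
    last_explore = explore_idx[-1] if explore_idx else None
    lines: List[str] = []
    for i, step in enumerate(steps):
        tag = " (explore)" if step.mode == "explore" else ""
        lines.append(f"{len(lines) + 1}. {step.text}{tag}")
        # Assumed placement: converge right after the last explore step,
        # and only on multi-step plans.
        if i == last_explore and len(steps) > 1:
            lines.append(f"{len(lines) + 1}. [decide] pick one approach")
    if needs_verify:
        # Trailing pseudo-step; placeholder wording.
        lines.append(f"{len(lines) + 1}. [verify] confirm the result")
    return "\n".join(lines)


trivial = [Step("Rename the variable")]
diagnostic = [Step("Find why the job fails", "explore"), Step("Apply the fix", "execute")]
print(render_scheduled(trivial) == render_plain(trivial))
print(render_scheduled(diagnostic, needs_verify=True))
```

Because the pseudo-steps exist only in the rendered lines, the input step list is unchanged after rendering, and an unscheduled plan renders exactly as the baseline does.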