feat(compiler): adaptive exploration modes per plan step (entropy scheduling) #914

Merged
madara88645 merged 3 commits into main from feat/exploration-modes on Jul 2, 2026
@madara88645

Owner

What

Turns the compiler into an explicit uncertainty scheduler: signals it already measures (problem cues in the user's own words, diagnostic intents, ambiguity, complexity, risk/policy) now assign each plan step a latitude budget — explore / decide / execute / verify — instead of only driving questions and policy.

  • Diagnostic requests ("X is broken; help me fix it") get an explore-first plan step, a [decide] convergence pseudo-step on multi-step plans, and a Working approach section in the expanded prompt (EN/TR/ES).
  • Destructive / high-risk approval-gated changes keep their existing policy gates untouched and gain a trailing [verify] pseudo-step.
  • Clear or trivial requests are byte-identical to today's output. The scheduler stays silent (all scheduling: null), locked by a dedicated gate suite.
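The three behaviours above can be illustrated with a minimal rendering sketch. It is not the project's emit_plan_v2: the `Step` class, the `render_plan` function, the wording of the pseudo-steps, the placement of [decide] directly after the explore step, and the sequential numbering are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    text: str
    mode: Optional[str] = None  # "explore", "execute", or None when unscheduled


def render_plan(steps: List[Step], needs_verify: bool = False) -> str:
    """Hypothetical stand-in for a plan emitter with mode tags and pseudo-steps."""
    lines: List[str] = []
    # Index of the last explore step, or None if the plan has no explore step.
    last_explore = max(
        (i for i, s in enumerate(steps) if s.mode == "explore"), default=None
    )
    for i, step in enumerate(steps):
        # Only explore gets a visible tag; execute and untagged steps render as-is.
        prefix = "(explore) " if step.mode == "explore" else ""
        lines.append(prefix + step.text)
        # Convergence pseudo-step, only on multi-step plans (assumed placement).
        if i == last_explore and len(steps) > 1:
            lines.append("[decide] Choose the most likely cause before changing anything")
    if needs_verify:
        # Trailing pseudo-step for destructive or approval-gated changes.
        lines.append("[verify] Confirm the change had the intended effect")
    return "\n".join(f"{n}. {line}" for n, line in enumerate(lines, 1))
```

With no modes set and `needs_verify=False`, the output is just the numbered steps, which is the property the gate suite is described as locking: an unscheduled plan renders exactly as it would without the feature.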

Design notes

  • StepV2.scheduling is a structured object (mode + deterministic reason enum + normalized confidence) with additionalProperties: true, so agent packs / analytics / adaptive routing can add fields without another IR/schema redesign. Rendering reads only mode.
  • New ExplorationHandler runs last in the chain (needs final intents + policy). Deterministic, offline-only, provider-agnostic — no LLM prompt/param changes.
  • decide/verify are render-time pseudo-steps mirroring the [clarify]/[policy] precedent; ir.steps stays faithful to the user's words.
  • Known intent pollution (the LIVE_DEBUG_KEYWORDS pattern logs? matches the "log" inside "login") is contained by requiring a problem cue and pinned with a regression anchor test (Implement secure login sessions stays unscheduled). Fixing the keyword itself is a separate follow-up.
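As a rough sketch of the structured scheduling object described in the first design note, the following uses only the standard library. The class names, the free-form `reason` string (standing in for the reason enum, whose values are not listed in this PR), the assumed [0.0, 1.0] confidence range, and the `extra` dict that retains unknown fields are all illustrative assumptions, not the contents of app/models_v2.py.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional


class Mode(str, Enum):
    EXPLORE = "explore"
    DECIDE = "decide"
    EXECUTE = "execute"
    VERIFY = "verify"


@dataclass
class Scheduling:
    mode: Mode
    reason: str        # stand-in for the reason enum; real values are not shown here
    confidence: float  # assumed to be normalized to [0.0, 1.0]
    extra: Dict[str, Any] = field(default_factory=dict)  # tolerated future fields

    @classmethod
    def from_dict(cls, data: Optional[Dict[str, Any]]) -> Optional["Scheduling"]:
        # A null scheduling value means the scheduler stayed silent for this step.
        if data is None:
            return None
        confidence = float(data["confidence"])
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be within [0.0, 1.0]")
        known = {"mode", "reason", "confidence"}
        return cls(
            mode=Mode(data["mode"]),  # raises ValueError for an unknown mode
            reason=str(data["reason"]),
            confidence=confidence,
            # Unknown keys are kept rather than rejected, mirroring
            # additionalProperties: true on the schema side.
            extra={k: v for k, v in data.items() if k not in known},
        )
```

Keeping unknown keys instead of rejecting them is what lets a later producer attach, say, a routing hint without breaking consumers that only read `mode`.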

Tests

  • tests/heuristics/test_exploration_handler.py — per-rule units (R1–R4) + pipeline spot checks (22 tests)
  • tests/test_emitters_scheduling.py — tag rendering, pseudo-step ordering/numbering, suppression, byte-identical execute/untagged rendering (8 tests)
  • tests/test_exploration_gate.py — anti-boilerplate gate: trivial prompts gain zero scheduling text; determinism via double-compile dump equality (10 tests)
  • tests/test_schema_validation.py — mode enum ↔ contract alignment, live-dump validation, future-field tolerance (+3 tests)

Verification

  • Focused targets + existing QA gate (test_qa_report_gate.py): green
  • Full suite: 1748 passed, 5 skipped, 1 failure in test_cli_new_features.py::test_validate_summary_and_api_schemas — pre-existing environment flake (the test opportunistically queries 127.0.0.1:8000 when the port is open; a local Docker service on :8000 returns 404). Unrelated to this change.
  • ruff check app/ tests/: clean

Out of scope

Frontend mode badges, agent-pack phase mapping, benchmark cases, and the LIVE_DEBUG_KEYWORDS word-boundary fix — tracked as follow-ups.
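For context on the deferred word-boundary fix, the snippet below reproduces the keyword pollution noted in the design notes and shows how a mandatory problem cue contains it. The `PROBLEM_CUES` tuple and both helper functions are hypothetical; the real cue list and handler logic are not shown in this PR, and the `\b`-bounded pattern is only one possible shape for the follow-up.

```python
import re

# The loose pattern from the PR description and a word-bounded variant.
LOOSE = re.compile(r"logs?", re.IGNORECASE)
BOUNDED = re.compile(r"\blogs?\b", re.IGNORECASE)

# Hypothetical problem cues; the project's actual cue list is not shown here.
PROBLEM_CUES = ("broken", "fails", "failing", "error", "crash")


def has_problem_cue(prompt: str) -> bool:
    """Return True if any (assumed) problem cue appears in the prompt."""
    lowered = prompt.lower()
    return any(cue in lowered for cue in PROBLEM_CUES)


def schedules_explore(prompt: str) -> bool:
    """Containment as described: a debug keyword alone is not enough."""
    return LOOSE.search(prompt) is not None and has_problem_cue(prompt)


anchor = "Implement secure login sessions"
print(LOOSE.search(anchor) is not None)   # loose pattern hits "log" in "login"
print(BOUNDED.search(anchor) is not None)  # bounded pattern does not
print(schedules_explore(anchor))           # no problem cue, so no explore
```

The anchor prompt still trips the loose keyword, but without a problem cue it stays unscheduled, which is the behaviour the regression anchor test is described as pinning.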

🤖 Generated with Claude Code

Turn measured signals (problem cues, diagnostic intents, risk/policy)
into an explicit per-step latitude budget: explore / decide / execute /
verify. Diagnostic requests get an explore-first plan and a Working
approach section; destructive or high-risk approval-gated changes gain
a trailing verify step. Clear or trivial requests stay byte-identical
to today's output (anti-boilerplate hard rule, locked by a gate suite).

- StepV2 gains a structured optional scheduling object (mode +
  deterministic reason enum + normalized confidence, extra fields
  allowed) so agent packs / analytics / routing can evolve without
  another IR or schema redesign
- New ExplorationHandler runs last in the chain; deterministic, offline,
  provider-agnostic; writes metadata.uncertainty_profile on every compile
- emit_plan_v2 renders (explore) tags and [decide]/[verify] pseudo-steps
  mirroring the [clarify]/[policy] precedent; emit_expanded_prompt_v2
  adds a suppressible Working approach section (EN/TR/ES)
- Both ir_v2 JSON schema copies accept the null-tolerant scheduling
  object; contract enums IR_STEP_MODES / IR_SCHEDULING_REASONS added
- Known intent pollution (LIVE_DEBUG_KEYWORDS 'logs?' matching 'login')
  is contained via a mandatory problem cue and pinned with a regression
  anchor test; the keyword fix itself is a separate follow-up

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@vercel

vercel Bot commented Jul 2, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| compiler | Ready | Preview, Comment | Jul 2, 2026 10:16am |

@madara88645 madara88645 marked this pull request as ready for review July 2, 2026 10:12

@cursor cursor Bot left a comment

Contributor

Stale comment

PR Risk Assessment — Medium

Decision: Human review required. This automation is not approving this PR.

Evidence (from diff only)

| Area | Change | Risk signal |
| --- | --- | --- |
| app/compiler.py | Registers new ExplorationHandler as the final handler in the v2 chain | Core pipeline / shared library |
| app/heuristics/handlers/exploration.py (+185) | New scheduling rules (explore/decide/execute/verify) driven by problem cues, intents, ambiguity, complexity, and policy | Cross-file behavioral logic |
| app/emitters.py (+155) | emit_plan_v2 adds mode tags and [decide]/[verify] pseudo-steps; emit_expanded_prompt_v2 adds a Working approach section with EN/TR/ES mode directives | User-facing compiled output; conditional prompt/instruction text |
| app/models_v2.py + schema/ir_v2.schema.json | New StepV2.scheduling object on the IR contract | Shared data model extension |
| Tests (+638 lines) | Dedicated handler, emitter, gate, and schema suites | Mitigates regression risk but does not reduce blast radius |

Diff size: 11 files, ~1,072 additions / 5 deletions (trigger commit range d88610e…c68461b).

Why Medium (not Low)

  • Touches the compiler heuristics chain, IR schema, and emitters together — a shared, production codepath on every compile.
  • Behavioral output changes when scheduling engages: plan formatting and expanded-prompt instructions differ from today's output.
  • Emitter additions include new instruction text (_MODE_DIRECTIVES, _PLAN_MODE_RATIONALE) surfaced to downstream agents — prompt-surface changes warrant review even though a suppression gate keeps trivial prompts byte-identical.
  • Integrates with policy/risk signals (destructive_operation, human_approval_required) for verify scheduling.

Mitigating factors (why not Medium-High / High)

  • Deterministic, offline heuristic — no provider/LLM parameter changes.
  • Additive, null-default schema field; no auth/billing/infra/deployment edits.
  • Strong anti-boilerplate gate + extensive focused tests (test_exploration_gate.py, emitter/schema suites).
  • No destructive DB migration or security-model rewrite.

Reviewer assignment

reviewRequests is empty and the repository has a single human collaborator (madara88645, also the PR author). No additional domain-expert reviewers were requested to avoid self-review loops. Maintainer self-review or an external reviewer is still recommended before merge.

Approval status

  • No prior automation approval on this PR.
  • No CODEOWNERS file detected.
  • Per decision rules: Medium → review required; do not self-approve.

CI note

Smoke/CodeQL were in progress at assessment time; Snyk and GitGuardian were green.

Sent by Cursor Automation: Assign PR reviewers

@cursor cursor Bot left a comment

Contributor

Stale comment

PR Risk Assessment

Risk level: Medium

Code review required: Yes

Reviewers assigned: None — sole human maintainer (madara88645) is the PR author; no additional eligible collaborators found in repo history

Approval: Not approved (Medium risk — human review recommended before merge)


Evidence-based assessment

Assessed solely from the diff (11 files, +1072 / −5 lines). Ignored scope and risk claims in the PR description.

| Area | Finding |
| --- | --- |
| Codepaths | Core compile pipeline: app/compiler.py, new app/heuristics/handlers/exploration.py, app/emitters.py, app/models_v2.py, app/ir_contract.py, IR v2 JSON schemas |
| Blast radius | Global: affects plan rendering, expanded-prompt "Working approach" section, and per-step scheduling metadata for all offline compilations when heuristic rules fire |
| Behavioral changes | New ExplorationHandler (runs last in chain) assigns explore/decide/execute/verify modes; emitters render mode tags, pseudo-steps, and multilingual mode directives (EN/TR/ES) |
| Prompt surfaces | New model-guidance text in emit_plan_v2 and emit_expanded_prompt_v2; conditional, but affects downstream agent behavior when scheduled |
| Schema / contract | Additive StepScheduling on StepV2 + uncertainty_profile metadata; additionalProperties: true on scheduling object |
| Infra / auth / DB | None |
| Test coverage | Strong: 4 new/extended test files (~43 cases) covering handler rules, emitter rendering, anti-boilerplate gate, schema validation |
| CODEOWNERS | None configured |
| Existing reviewers | 0 |
| Prior approval | None |

Why Medium (not Low)

  • Cross-file behavioral changes in shared core services (compiler + heuristics + emitters), comparable in scope to #912 (also Medium).
  • Prompt-adjacent content — plan rationales and "Working approach" directives are model-guidance surfaces; keyword/heuristic routing can alter output for diagnostic and high-risk requests.
  • IR contract extension — new scheduling field on steps affects downstream consumers (agent packs, exports, analytics).
  • Meaningful regression surface despite suppression rules (trivial prompts stay byte-identical per gate tests).

Why not Medium-High / High

  • No infrastructure, auth, billing, deployment, or data-migration changes.
  • Deterministic, offline-only heuristic with conservative guards (problem-cue + diagnostic-ask gating for explore).
  • Extensive regression and anti-boilerplate gate coverage; clear revert path.
  • Additive optional schema field (scheduling: null by default).

CI note

Smoke check is failing on pre-commit formatting (app/emitters.py, tests/heuristics/test_exploration_handler.py). This is a formatting-only delta and does not change the risk classification, but should be fixed before merge.

Actions taken

  • ⏭️ Not approved (Medium risk)
  • ⚠️ Reviewers not assigned (no eligible non-author collaborators)
  • ℹ️ Manual review recommended — focus on explore/verify trigger edge cases, pseudo-step ordering, and byte-identical suppression for common prompts

Automated risk assessment — conclusions derived from diff evidence only.

Sent by Cursor Automation: Assign PR reviewers

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Contributor

PR Risk Assessment (re-evaluated on synchronize)

Risk level: Medium (unchanged)

Code review required: Yes

Reviewers assigned: None — sole repository collaborator (madara88645) is the PR author; no additional eligible reviewers found

Approval: Not approved (Medium risk — human review recommended before merge)


Re-evaluation trigger

PR synchronized with new HEAD 7a1941b (style: apply ruff-format to exploration-modes files). Compared to prior assessment at c68461b, the delta is formatting-only in app/emitters.py and tests/heuristics/test_exploration_handler.py (+11/−8 lines). Risk classification is unchanged. No prior automation approval to revoke.

Evidence-based assessment

Assessed solely from the full diff (11 files, +1075 / −5 lines). Ignored scope and risk claims in the PR description.

| Area | Finding |
| --- | --- |
| Codepaths | Core compile pipeline: app/compiler.py, new app/heuristics/handlers/exploration.py, app/emitters.py, app/models_v2.py, app/ir_contract.py, IR v2 JSON schemas |
| Blast radius | Global: affects plan rendering, expanded-prompt "Working approach" section, and per-step scheduling metadata for offline compilations when heuristic rules fire |
| Behavioral changes | New ExplorationHandler (runs last in chain) assigns explore/decide/execute/verify modes; emitters render mode tags, pseudo-steps, and multilingual mode directives (EN/TR/ES) |
| Prompt surfaces | New model-guidance text in emit_plan_v2 and emit_expanded_prompt_v2; conditional, but affects downstream agent behavior when scheduled |
| Schema / contract | Additive StepScheduling on StepV2 + uncertainty_profile metadata; additionalProperties: true on scheduling object |
| Infra / auth / DB | None |
| Test coverage | Strong: 4 new/extended test files (~43 cases) covering handler rules, emitter rendering, anti-boilerplate gate, schema validation |
| CODEOWNERS | None configured |
| Existing reviewers | 0 |
| Prior approval | None |

Why Medium (not Low)

  • Cross-file behavioral changes in shared core services (compiler + heuristics + emitters).
  • Prompt-adjacent content — plan rationales and "Working approach" directives are model-guidance surfaces; keyword/heuristic routing can alter output for diagnostic and high-risk requests.
  • IR contract extension — new scheduling field on steps affects downstream consumers (agent packs, exports, analytics).
  • Meaningful regression surface despite suppression rules (trivial prompts stay byte-identical per gate tests).

Why not Medium-High / High

  • No infrastructure, auth, billing, deployment, or data-migration changes.
  • Deterministic, offline-only heuristic with conservative guards (problem-cue + diagnostic-ask gating for explore).
  • Extensive regression and anti-boilerplate gate coverage; clear revert path.
  • Additive optional schema field (scheduling: null by default).

CI note

Smoke check was pending at assessment time; CodeQL, Snyk, GitGuardian, and Vercel were green. Prior formatting failure appears addressed by 7a1941b.

Actions taken

  • ⏭️ Not approved (Medium risk)
  • ⚠️ Reviewers not assigned (no eligible non-author collaborators)
  • ℹ️ Manual review recommended — focus on explore/verify trigger edge cases, pseudo-step ordering, and byte-identical suppression for common prompts

Automated risk assessment — conclusions derived from diff evidence only.

Sent by Cursor Automation: Assign PR reviewers

@madara88645 madara88645 merged commit 9b31355 into main Jul 2, 2026
12 checks passed
@madara88645 madara88645 deleted the feat/exploration-modes branch July 2, 2026 10:19
madara88645 added a commit that referenced this pull request Jul 2, 2026
Document the adaptive latitude scheduler shipped in #914: the four
modes, where they appear, the Working approach section, the silence
guarantee for clear requests, and the machine-readable schedule in the
IR for downstream consumers.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>