Skip to content

feat(TP-196 + TP-197): multi-segment engine hardening + dashboard segment-progress indicators#577

Merged
HenryLach merged 32 commits into
mainfrom
feat/tp-196-tp-197-segment-followups
May 13, 2026
Merged

feat(TP-196 + TP-197): multi-segment engine hardening + dashboard segment-progress indicators#577
HenryLach merged 32 commits into
mainfrom
feat/tp-196-tp-197-segment-followups

Conversation

@HenryLach
Copy link
Copy Markdown
Owner

Bundles two task packets run in parallel because their file scopes don't overlap. Closes 5 segment-related issues that survived the #51 closure as polish/hardening work.

Issues closed

TP-196 — Multi-segment engine hardening (#462 + #502 + #503 + #508)

Size M, Review Level 2, 9 in-batch reviews (1 plan + 8 code, all APPROVE).

#502 (foundational, Step 2): SegmentScopeMode unification

Promoted SegmentScopeMode to a first-class type in types.ts. Lane-runner now computes it once via computeSegmentScopeMode(...) and threads via lane config. Key sites unified on isSegmentScoped:

  • Prompt segment overlay
  • "Active segment ID" prompt line
  • System prompt segment overlay
  • TASKPLANE_ACTIVE_SEGMENT_ID / TASKPLANE_SEGMENT_ID env vars (now cleared in FULL_TASK mode)
  • request_segment_expansion tool registration (gated via env-driven extension logic)

#462 (Step 3): .DONE authority guards built atop #502's flag

Three defense-in-depth guards:

  • Monitor guard in resolveTaskMonitorState: demotes .DONE when active segment is known non-final (segmentIds.length > 1 && activeSegmentId != lastSegment)
  • Resume guard in collectDoneTaskIdsForResume: requires all persisted segments to be terminal-success before trusting .DONE
  • Discovery safeguard (warn-only): doctor-style warning for .DONE + incomplete-frontier inconsistency
  • NEW extensions/tests/done-authority-multi-segment.test.ts (372 LOC): non-final unlink failure, transient .DONE race, resume with stale marker

#508 (Step 4): Early-exit optimization

Pre-spawn segment-completion check in lane-runner.ts iteration loop — skips wasted spawn when all segment checkboxes are already complete. NEW extensions/tests/early-exit-segment-spawn-skip.test.ts (351 LOC) covers the optimization + soft-progress interaction.

#503 (Step 5): SegmentScopeMode prompt-injection regression tests

NEW extensions/tests/segment-scope-mode-prompt.test.ts (855 LOC) covering:

  • FULL_TASK prompt content assertions
  • SEGMENT_SCOPED prompt content assertions
  • Polyrepo single-segment regression (worker proceeds past Step 0)
  • Legacy/partial-marker fallback case (no silent one-step scoping)

TP-197 — Dashboard segment-progress indicators (#464)

Size S-M, Review Level 1, 2 plan reviews (1 REVISE → 1 APPROVE).

  • dashboard/public/app.js (+69 LOC): taskSegmentPillRow() renderer; segment-aware progress bar; current-segment highlight
  • dashboard/public/style.css (+73 LOC): .seg-pill styles for ✅ / ⏳ / ⬚ states + responsive layout
  • dashboard/server.cjs: no changes (segment data was already exposed via the API)
  • NEW extensions/tests/dashboard-segment-pill-row.test.ts (231 LOC): rendering assertions + XSS guards + malformed-input resilience
  • docs/tutorials/use-the-dashboard.md: documents the new indicators

Sage post-merge fold (commit 533a6fa)

Sage spot-check caught one real regression: the in-batch TP-197 implementation changed .task-row's grid-template-rows from auto auto to auto auto auto unconditionally. With row-gap: 8px, this added a visible 8px gap below row 2 for single-segment tasks where row 3 is empty. Violated the PROMPT's "non-segmented batches render identically to today" contract.

Fix (sage's recommended pattern):

  • Default .task-row stays at grid-template-rows: auto auto (pre-TP-197 layout)
  • New rule .task-row.has-segments adds the 3rd row only when set
  • JS computes taskRowClass = hasSegmentPillRow ? "task-row has-segments" : "task-row" and renders <div class="${taskRowClass}">

Layout-parity regression tests added (+2 in dashboard-segment-pill-row.test.ts):

  • Source check: default .task-row has 2-row grid; .has-segments opts into 3
  • Source check: app.js conditionally adds the class via ternary

Sage's other findings (no fold needed):

Validation

Gate Result
npm run typecheck exit 0
npm run lint exit 0 / 284 warnings / 671 infos
npm run format:check exit 0
npm run test:fast 3691 passing / 1 skipped / 0 failed (+64 from 3627 baseline)
First batch under post-TP-194 hard gates ✅ All four gates green throughout

What this does NOT do

  • Doesn't change .DONE suppression behavior (TP-145 stays as-is)
  • Doesn't address other open polish items not in the 5-issue triage list
  • Doesn't add dashboard/public/ to Biome lint scope (intentionally out per spec section 3)

Release framing

Both packets are small targeted hardening/UX. Suggest v0.30.1 patch release after merge (CHANGELOG entries from both packets already in [Unreleased]).

Alternatively, hold for v0.31.0 with Tier-1.5 follow-ups (TS strictness ratchet, CHANGELOG fragments) — your call.

HenryLach added 30 commits May 10, 2026 19:38
…ify scope decision

- Add 'export type SegmentScopeMode = FULL_TASK | SEGMENT_SCOPED' to types.ts
- Add computeSegmentScopeMode() helper in lane-runner.ts \u2014 single source of
  truth for the FULL_TASK vs SEGMENT_SCOPED decision (TP-196 / #502).
- Iteration loop now derives both 'segmentScopeMode' (typed) and the legacy
  'isSegmentScoped' boolean alias from one computation, preventing the
  multi-condition drift documented in #502.
- 16 new tests (11 unit + 5 source-analysis contracts) in
  segment-scoped-lane-runner.test.ts cover the helper's truth table and
  guard the unified gating sites.
- TASKPLANE_ACTIVE_SEGMENT_ID and the segment system-prompt overlay are
  already gated via the (now unified) isSegmentScoped alias; the
  request_segment_expansion tool registration inherits this gating through
  the env var.

Targeted suite: 62/62 pass. Full fast suite: 3643 pass / 0 fail.
Per R002 code review: the segment-scoped prompt injection branch was still
keyed on the raw 'stepSegmentMap && currentRepoId && repoStepNumbers &&
remainingSteps.length > 0' composite, leaving the very drift surface #502
is meant to eliminate.

- lane-runner.ts: outer gate is now 'if (isSegmentScoped)'; an inner
  defensive guard logs a WARN and skips when currentStepMapping/mySegment
  is unexpectedly absent (should never trip in practice because
  computeSegmentScopeMode already verifies those preconditions).
- segment-scoped-lane-runner.test.ts: update 4.1 + 7.3 to assert the
  mode-driven gate + the absence of the legacy composite-condition string.

Targeted: 62/62 pass. Full fast suite: 3643 pass / 0 fail. Gates green.
Adds a horizontal pill row of per-segment status badges to each multi-segment
task row in the dashboard. Each pill shows an icon (succeeded / running /
pending / failed / stalled / skipped) plus the repoId for that segment, with
the currently-executing segment visually emphasized.

This closes the operator visibility gap left by TP-145's .DONE suppression
for non-final segments: previously, multi-segment lanes sat 'running' with
no segment-level signal during the suppression window. Now operators can
see at a glance which segments have finished, which is running, and which
remain.

Implementation:
- dashboard/public/app.js: new taskSegmentPillRow() helper; integrated
  into renderLanesTasks() as grid row 3 of .task-row (mirrors the
  task-title-subtitle pattern from TP-485). Returns '' for single-
  segment tasks so non-segmented tasks render byte-identical DOM to today.
- dashboard/public/style.css: extends .task-row grid-template-rows to
  'auto auto auto'; adds .task-segment-row container plus .seg-pill +
  status variants. Pill row lives OUTSIDE .task-step intentionally so
  the existing @media (max-width: 900px) { .task-step { display: none } }
  rule does NOT hide segment context on narrow viewports.
- extensions/tests/dashboard-segment-pill-row.test.ts: 11 new unit tests
  cover multi-segment rendering, current-segment emphasis, single-segment
  fallback (no regression), malformed input handling, and XSS-safe escaping.

No server.cjs change required: batch.segments[], task.segmentIds, and
runtimeLaneSnapshots[*].segmentId were already exposed (verified in Step 0).
The progress bar is unchanged \u2014 TP-174 already segment-scoped it via
v2Progress; the new pill row provides the missing context that makes the
existing bar legible as 'current segment's progress.'

Tests: 3638 pass / 0 fail / 1 skipped (was 3627; +11 new).
Gates: typecheck 0, lint 0, format:check 0.
…ery)

Three defense-in-depth guards harden the .DONE authority model against
stale or premature markers in multi-segment tasks:

1. Monitor guard (execution.ts::resolveTaskMonitorState):
   - New optional 'multiSegmentContext: { isFinalSegment, segmentId }' parameter.
   - When isFinalSegment === false and .DONE is present, log a WARN via
     execLog and SKIP Priority 1, falling through to lower priorities.
   - monitorLanes populates the context from task.task.segmentIds +
     task.task.activeSegmentId (active === last in deterministic order).

2. Resume guard (resume.ts::collectDoneTaskIdsForResume):
   - New internal helper isSegmentFrontierCompleteForResume().
   - When persisted segments exist AND any segment is not succeeded/skipped,
     refuse to add the taskId to the done set. Emit console.warn carrying
     '#462 guard' so operators see the inconsistency. The on-disk .DONE is
     left alone; resume re-reconciles instead of silently marking complete.

3. Discovery safeguard (discovery.ts::checkDoneAuthoritySafeguard):
   - New exported helper warning when .DONE coexists with unchecked
     checkboxes in STATUS.md. Wired into scanAreaForTasks (no behaviour
     change to the scan itself \u2014 doctor-style warning only).

Tests:
- New: extensions/tests/done-authority-multi-segment.test.ts (14 tests).
- Updated: resume-segment-frontier.test.ts 'keeps .DONE authoritative...'
  test now asserts the inverted (post-#462) contract.
- Adjusted: engine-runtime-v2-routing.test.ts 14.5 widens the slice window
  to accommodate the new monitor-guard prelude in resolveTaskMonitorState.

Full fast suite: 3657 pass / 0 fail. Typecheck / lint / format:check clean.
…ted iteration

Adds an explicit pre-spawn check in the lane-runner iteration loop. When
the lane is iterating a segment-scoped task AND all of its repoStepNumbers
are already isSegmentComplete (e.g., a worker already checked every box
on the final iteration but exited slightly before the loop's existing
'remainingSteps.length === 0' break could fire), the engine now logs a
'Pre-spawn segment-completion check' execution-log line and breaks out
of the loop \u2014 saving the ~30-60s + token cost of spawning a worker that
would find nothing to do.

Defense properties:
- Lives AFTER the existing 'remainingSteps.length === 0' break and BEFORE
  'totalIterations++' so a fully-complete segment skips the spawn without
  incrementing the iteration counter.
- Gated on
  so FULL_TASK iterations remain unaffected.
- Falls through to the existing post-loop completion handling \u2014 no new
  branching.

5 new source-analysis tests in segment-scoped-lane-runner.test.ts (sections
10.0\u201310.4) verify the check exists at the spawn boundary, uses
isSegmentComplete on every repoStepNumber, breaks out of the loop, and
is properly gated.

Full fast suite: 3662 pass / 0 fail. Gates green.
…havioural test (R005)

Per R005 code review: Step 4 needed an end-to-end behavioural test
verifying that 'spawnAgent' is NEVER called when a segment-scoped task's
checkboxes are already all complete.

Changes:
- Extract the inline iteration-loop check into a pure helper:
  'shouldSkipSpawnForCompleteSegment(statusContent, repoStepNumbers,
  currentRepoId): boolean'. The lane-runner now delegates to this helper.
- New test file 'extensions/tests/early-exit-segment-spawn-skip.test.ts'
  with two describe blocks:
    1. Helper-level behavioural tests (6 cases) covering the decision
       contract directly with realistic STATUS.md fixtures.
    2. End-to-end executeTaskV2 test that mocks spawnAgent via
       'mock.module("../taskplane/agent-host.ts", ...)' and asserts
       'spawnAgentCallCount === 0' and 'iterations === 0' for a fixture
       worktree whose segment checkboxes are all checked.
- Update segment-scoped-lane-runner.test.ts 10.x source-analysis tests
  to assert the new helper-based wiring.

Full fast suite: 3669 pass / 0 fail. Gates green.
Adds extensions/tests/segment-scope-mode-prompt.test.ts \u2014 9 behavioural
tests across 4 describe blocks, mocking spawnAgent to capture the worker
prompt and env, then driving executeTaskV2 with realistic fixtures:

1. FULL_TASK (3 tests): prompt does NOT contain 'Active segment ID',
   'Your checkboxes for this step:', 'Other segments in this step (NOT
   yours)', or 'Segment-scoped context'; TASKPLANE_ACTIVE_SEGMENT_ID and
   TASKPLANE_SEGMENT_ID env vars are hard-cleared to ''; system prompt
   is BASE only (no segment overlay).

2. SEGMENT_SCOPED (3 tests): prompt DOES contain 'Active segment ID:
   <segmentId>', 'Your checkboxes for this step:', 'Other segments in
   this step (NOT yours', 'Segment-scoped context'; env carries the
   active segment ID; system prompt appends the segment overlay AFTER
   the base prompt.

3. Polyrepo single-segment regression (1 test): task with segment
   markers for only ONE repo still injects the segment-scoped block,
   proving the worker is not silently scoped to step 0.

4. Legacy / partial-marker fallback (2 tests): (a) task with NO segment
   markers at all falls back to FULL_TASK; (b) task with markers for
   OTHER repos but not the active repo also falls back to FULL_TASK.

Architectural note: the original #503 wording asks for assertions that
the prompt includes 'SegmentScopeMode: FULL_TASK' literal text. That
prose was deliberately removed in commit 97816c0 ("hard mode separation
for worker segment scoping") in favor of separate system-prompt files
("the prompt IS the mode"). These tests assert the architecturally-
current contract \u2014 prompt content + env vars + system-prompt overlay
reflect the mode \u2014 which preserves the intent of #503 while honoring
the post-#502 design.

Full fast suite: 3678 pass / 0 fail. Gates green.
… (R008)

Per R008: test 3.1 'proves the worker proceeds beyond Step 0' was relying
on a too-aggressive mock that checked off ALL unchecked boxes per spawn,
collapsing the loop to a single iteration and making the test pass even
if the engine silently scoped to Step 0.

Fix:
- Added 'workerAdvanceMode' toggle on the mock spawnAgent: 'all' (default)
  for the FULL_TASK/SEGMENT_SCOPED prompt-content tests that just need
  one spawn captured, and 'first' for the polyrepo regression test which
  needs the loop to iterate step-by-step.
- Test 3.1 now asserts capturedSpawns.length >= 2 AND that the second
  prompt mentions 'Step 1' + 'Create endpoint'. A silent-scope-to-Step-0
  regression would fail this test.
- Reviewer style: use optional chaining for env access in the mock.

Full fast suite: 3678 pass / 0 fail. Gates green.
…ATUS finalize

Adds a 4-paragraph Fixed entry under [Unreleased] in CHANGELOG.md covering:
1. .DONE authority guards (#462 \u2014 monitor + resume + discovery)
2. SegmentScopeMode unification (#502 + #503)
3. Wasted-iteration elimination (#508)
4. Validation summary (gates green, 3678 pass / 0 fail / 1 skip, +51 tests)

Adds four issue-close comment drafts (#462, #502, #503, #508) to the STATUS.md
Notes section for the operator to post after merge.

Marks STATUS.md Status: ✅ Complete (task done).
…t-merge fold)

Sage caught that the in-batch implementation changed .task-row's
'grid-template-rows' from 'auto auto' to 'auto auto auto'
unconditionally. 'auto' rows don't fully collapse when 'row-gap' is
set, so a non-segmented task got an 8px visible gap below row 2
where the empty pill row sat. Violates the PROMPT's 'non-segmented
batches render identically to today' contract.

## Fix (sage's recommended pattern)

dashboard/public/style.css:
  - Default .task-row keeps the pre-TP-197 'grid-template-rows: auto auto'
  - New rule '.task-row.has-segments' adds the 3rd row only when needed

dashboard/public/app.js:
  - Compute taskRowClass = hasSegmentPillRow ? 'task-row has-segments' : 'task-row'
  - Template renders <div class="${taskRowClass}">

## Layout-parity regression tests

extensions/tests/dashboard-segment-pill-row.test.ts (+2 tests):
  - source check: default .task-row has 2-row grid; .has-segments has 3
  - source check: app.js conditionally adds the class via ternary

These source-pattern guards prevent re-introduction of the
unconditional 3-row template or the dropped conditional-class
pattern in future edits.

## Validation

- Format pass auto-fixed my test code (1 file, Biome's wrap+quote-style preferences)
- All 4 hard gates exit 0
- 3691 tests passing (was 3689 before this fold; +2 new sage-fold tests)
- Dashboard tests: 13/13 pass (was 11; +2 layout-parity guards)
…ontrast bump (polyrepo-test folds)

Two user-flagged folds caught during operator polyrepo test:

## Fold 1: Wave-chip lane parallelization across all waves

User observed that wave chips in the top-of-dashboard 'Waves' bar
showed different separators based on which wave was active:
  - Active wave: parallel/serial separators worked correctly (e.g.,
    'W2 [TP-004 | TP-005]')
  - Inactive waves: fell back to comma-separated ('W3 [TP-004, TP-005]')

The separator visibly 'flickered' as waves transitioned active —
W1 used '|' when active, then reverted to ',' when W2 became active.

### Root cause

formatWaveLaneBreakdown() built its task→lane map only from
batch.lanes, which is Runtime V2 live state and only populated for
the currently-active wave. For past/future waves, taskToLane.has(t)
returned false → hasAnyLaneData false → comma fallback.

### Fix

Derive task→lane map from batch.tasks[].laneNumber (persisted for
the entire batch lifecycle), with batch.lanes as a fallback. The
parallelization indicator is now stable across all wave states.

Call-site updated to pass batch.tasks as the new third argument.
formatWaveLaneBreakdown signature changed:
  (taskIds, lanes, waveNumber) →
  (taskIds, lanes, tasks, waveNumber)

### Tests (NEW: extensions/tests/dashboard-wave-lane-breakdown.test.ts, 8 cases)

  - multi-lane wave uses '|' separator
  - single-lane wave uses '→' separator
  - REGRESSION: derives task→lane from tasks when lanes is empty
  - REGRESSION: derives task→lane from tasks when lanes is unrelated
  - backward-compat: falls back to lanes when tasks is empty
  - no lane data: comma-separated legacy fallback
  - mixed: same lane '→', different lanes '|'
  - source: pending pill CSS uses --text-muted (fold 2 lock-down)

## Fold 2: Pending segment pill contrast bump

User observed that the pending state of the new TP-197 segment pills
(⬚ web-client in screenshot) had very low contrast on the dark theme,
making the repo name hard to read.

### Root cause

.seg-pill.seg-pending used:
  color: var(--text-faint)   /* #484f58 on dark = ~3.7:1 contrast */

Below WCAG AA (4.5:1) for normal text.

### Fix

color: var(--text-muted)     /* #8b949e on dark = ~6.5:1 contrast */

Also: border-color: var(--border-default, var(--border-subtle)) so
the pill picks up a slightly stronger border if --border-default is
ever defined; falls back to --border-subtle (the current behavior)
otherwise.

Visually: pending pills are now readable while remaining clearly
muted vs the saturated success (green) and running (accent) states.

## Validation

  - npm run typecheck → exit 0
  - npm run lint → exit 0
  - npm run format:check → exit 0 (Biome auto-formatted 1 file in this commit)
  - npm run test:fast → 3699 passing (+8 from prior baseline 3691)
  - Manual visual confirm pending in polyrepo test will be operator's job
HenryLach added 2 commits May 11, 2026 17:52
…p rendering (polyrepo-test fold 2)

User's polyrepo retest after the previous fold caught a new issue:
future-wave chips were showing '→' (serial) for tasks that would
actually run in parallel once their wave started. Specifically W2
and W3 chips both showed 'TP-004 → TP-005' before W2 became active,
then switched to 'TP-004 | TP-005' when W2 started.

## Root cause

The previous fold (commit 46d7b69) made formatWaveLaneBreakdown
derive task→lane from batch.tasks[].laneNumber. But the persistence
layer assigns laneNumber=0 as a SENTINEL for unallocated tasks
(persistence.ts:1378 + 2538: 'lane?.laneNumber ?? outcome?.laneNumber ?? 0').
Real lane numbers start at 1.

My previous code accepted 0 as a real lane → all pending tasks got
grouped under fake 'lane 0' → rendered as serial '→' display. Worse
than the original behavior, which at least fell back to comma
('I don't know').

## Fix

Treat laneNumber >= 1 as a real lane assignment; skip 0 (sentinel).
Per-task allocation:

  if (
    t && t.taskId &&
    typeof t.laneNumber === 'number' &&
    t.laneNumber >= 1 &&
    !taskToLane.has(t.taskId)
  ) {
    taskToLane.set(t.taskId, t.laneNumber);
  }

Now:
  - Future-wave tasks with sentinel laneNumber=0 → not added to map
  - hasAnyLaneData becomes false for that wave
  - Comma fallback display ('TP-004, TP-005')
  - When the wave starts and real lanes are allocated, laneNumber
    updates to 1/2/3/... → wave chip switches to '|' display

The display goes from comma (unknown) → '|' (parallel) when allocation
happens. No more misleading '→' for pending tasks.

## Regression tests (+2)

Added 2 tests to extensions/tests/dashboard-wave-lane-breakdown.test.ts:

  - 'laneNumber=0 sentinel does NOT count as a real lane'
    Both tasks with sentinel 0 → comma fallback (not '→')

  - 'mixed sentinel-0 and real laneNumbers — only real ones count'
    Tasks with mixed values: real ones render in their lanes, sentinel-0
    tasks fall into the 'unassigned' trailing group

## Validation

  - typecheck/lint/format:check all exit 0
  - 3701 tests passing (+2 from previous baseline 3699)
  - Sentinel-0 behavior locked down via source-pattern + behavioral tests
…header contrast (polyrepo-test fold 3)

User caught three things in the final-state merge-agents panel:

## 1. Removed SESSION ID and DETAILS columns (dead weight)

SESSION ID was HARDCODED to '\u2014' in every row \u2014 the dashboard had a
column placeholder but never populated it. DETAILS was only used for
mr.failureReason (rare failure cases); for the common all-merges-
succeeded case it was always '\u2014' too. User has never seen real
values in either column in actual runs.

Both columns removed from the merge-table header and all three row
templates (main merge result, per-repo sub-row, active session).
When a real failure happens, the operator sees status='failed' in
the Status column (still present) and can dig into engine logs for
the reason. If structured failure data warrants a dedicated column
in the future, easy to add back.

## 2. Bumped merge-table header contrast

.merge-table th was using --text-faint (#484f58 on dark = ~3.7:1
contrast, below WCAG AA). Bumped to --text-muted (#8b949e = ~6.5:1).
Matches other section headers in the dashboard (style.css lines
224, 248, 458 already use --text-muted for the same role) \u2014 the
merge table was an inconsistency, now fixed.

## 3. Off-by-one wave attribution (#509) NOT addressed here

User also observed (3rd time confirmed) that Wave 2 chip displays
telemetry referencing W1's merge-result and Wave 4 displays W3's
\u2014 the off-by-one wave attribution from issue #509. That's an
existing tracked issue with a separate deeper diagnostic path. Will
comment on #509 with the latest reproduction evidence separately
from this fold (out of scope for a cosmetic-cleanup fold).

## Regression tests (+2)

extensions/tests/dashboard-wave-lane-breakdown.test.ts:

  - 'merge-agents table header has 4 columns; Session ID + Details
    removed' \u2014 locks down the column removal so future edits don't
    bring back the dead columns
  - '.merge-table th uses --text-muted' \u2014 locks down the contrast
    bump so future style consolidation doesn't revert

## Validation

  - typecheck/lint/format:check all exit 0
  - 3703 tests passing (+2 from 3701 baseline)
  - Visual fix: 4 columns instead of 6, headers readable on dark theme
@HenryLach HenryLach merged commit 931dedb into main May 13, 2026
1 check passed
@HenryLach HenryLach deleted the feat/tp-196-tp-197-segment-followups branch May 13, 2026 02:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant