Skip to content

refactor(troubleshoot): tiered signature-index investigator (v2 architecture)#1820

Draft
dmorosanu wants to merge 6 commits into
mainfrom
feat/troubleshoot-v2
Draft

refactor(troubleshoot): tiered signature-index investigator (v2 architecture)#1820
dmorosanu wants to merge 6 commits into
mainfrom
feat/troubleshoot-v2

Conversation

@dmorosanu

@dmorosanu dmorosanu commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Why

Eval data across 184 replay scenarios showed the previous 7-sub-agent orchestration (triage → scope-checker → generator → sequential testers → depth-verifier → presenter) cost 25–55 min per investigation while its marginal value sits in the playbook knowledge, not the choreography: 159/184 scenarios pass without the skill, 12 more flake-pass on retry, and the 13 stable-fails are all cases where a playbook fact or decision-tree discriminator is required. This PR keeps the knowledge and removes the choreography.

What changed

  • SKILL.md rewritten as a single-context tiered protocol (~145 lines): anchor → extract signals (mandatory AggregateException/inner-exception unwrap) → route via a greppable signature index → walk the matched playbook's decision tree → mandatory format-forced verification checklist (cause named verbatim, evidence pinned vs sibling causes, runtime-evidence gate, resolution-branch alignment, causal precedence) → present. The previous Critical Rules section is restructured into ## 1. Invariants — every load-bearing rule is preserved or strengthened (no-CLI-discovery, retry caps, empty≠absent, live≠historical, correlation, raw-file redirect, symptom≠cause, fix-approval gate); the removed rules were orchestration mechanics (never run uip yourself, sequential testers, verbatim presenter hand-off) that no longer have a boundary to protect.
  • agents/ (7 files) and schemas/ (4 files) deleted. Their content is redistributed: tester gates → SKILL.md walk rules; depth-verifier → inline checklist + escalation verifier prompt; presenter.md → references/presenting.md (near-verbatim, including the interactive Healing-Agent apply-flow and the approval gate: user source files are never modified without explicit approval, decline/non-answer = no edit).
  • Tier 0 — signature index. All 215 playbooks now declare signatures: frontmatter (631 signatures; 22 justified silent: true); scripts/build-signature-index.py generates references/signature-index.md (grep-only routing table + no-signature symptom routing + signal-extraction cheatsheet) and lints: every playbook routable or silent, duplicate (kind,value) claims require discriminating notes, exclusion targets must exist. Playbook bodies untouched.
  • Tier 2 — escalation (references/escalation.md, loaded only on 6 defined triggers): 2–4 parallel read-only hypothesis probes + adjudication + a conditional fresh-eyes verifier; bounded spawn budget.
  • Investigation state simplified to .local/investigations/raw/*.json + notes.md (state.json/hypotheses.json/needs_input.json removed; generate_scenario.py needs no code change — it globs by basename).
  • Test-side docs updated (tests CLAUDE.md input table + forbidden-criteria rationale). Three excel fixture project.json descriptions neutralized — they spelled out the scenario's root cause in agent-visible text.

Validation

  • Signature lint + index freshness clean; description-length hook and skill-status check pass.
  • Local sentinel — 15/15 pass (13 stable-fails + 2 known flaky, judge-graded, skill loaded from this branch): 13 at 1.000, all ≥0.8. Per-scenario wall time 5.7–17.7 min, avg ~9.4 min vs 25–55 min under v1. Three first-pass misses (no-healing-agent, replace-text-silent, excel-rr-sheet-bytes) were transcript-diagnosed, fixed (6422dcea8), and re-run to 3× 1.000.
  • CI full suite (185 tasks, ubuntu, parallelism 4) — first pass 173/185 = 93.5% in a single 2h15m job (matches v1's ~93% nightly rate at ~1/10 the per-task cost). After the fixes above, every one of the 185 tasks has ≥1 passing CI run; the only test to fail both CI tracks (replace-text-silent) is fixed and CI-verified.

Speed: v1 vs v2

v1 (old) v2 (this branch)
Per scenario, local (same proxy/model, parallelism 3) 25–55 min 5.7–17.7 min, avg ~9.4
Per task, CI (185-task suite, j=4) 25–55 min avg 3.5 · median 3.2 · p90 5.1 · max 16 min
Full suite, CI overnight-cron only (~20–40 h at v1 pace) one 2h15m job

3–5× faster locally, ~7–10× per task on CI, quality up not down (15/15 on the former hard core).

Follow-ups

  • .github/workflows/validate-signature-index.yml (index lint PR gate) is authored but not in this branch — the push credential lacks workflow scope. Will be added once granted.
  • Suite is green; lower per-task run_limits (5400s task_timeout is now ~10× headroom) in a separate PR.

@dmorosanu

Copy link
Copy Markdown
Contributor Author

Sentinel validation complete (local, judge-graded, skill loaded from this branch):

15/15 pass — the 13 previously stable-failing scenarios plus getasset-activity-silent-failure and healing-agent-no-license. 12 passed first-pass; 3 misses were diagnosed from transcripts and fixed in 6422dce:

  • no-healing-agent (judge 0.2 → 1.0): source-acquisition failure — the playbook forbade checking the working directory for the project while the workflow source sat in cwd. Restored cwd-first discovery (SKILL.md §5.4 precedence + playbook step 5), matching the old tester's auto-discovery order.
  • replace-text-silent (0.5 → 1.0): agent bundled a speculative "second bug" fix for a property the failing run never evaluated. New checklist rule §6.6: fixes must trace to the confirmed cause; unexercised code paths are unverified observations only.
  • excel-rr-sheet-bytes (0.2 → 1.0): ground-truth precision — the raw Get Workbook Sheets payload preserves the NBSP bytes, so byte-verified branch identification is correct client behavior; RESOLUTION.md now accepts it alongside the byte-compare recommendation.

Wall time per scenario: 6–18 min (vs 25–55 min under the previous architecture), all at parallelism 3 through the local proxy. Full-suite CI run on this branch: https://github.com/UiPath/skills/actions/runs/28592769248

@dmorosanu

Copy link
Copy Markdown
Contributor Author

Full-suite validation complete — dual CI coverage (every task ran on two independent CI tracks) plus a fix round:

Track Result Per-task wall time
Full-suite run (185 tasks, one job) 173/185 avg 3.5 min, median 3.2, p90 5.1, max 16 (total 2h15m)
Sequential 6-task batches (31 runs, same 185 tasks) 178/185 ~10 min per 6-task batch incl. setup
Re-run of all 8 remaining failures after fixes (64a9a33) 8/8, scores 0.925–1.000 2–5 min

Every one of the 185 tasks now has at least one passing CI run on this branch; the only scenario that failed both tracks (replace-text-silent) is fixed and re-verified. Fix commits since the sentinel round: c34bafc (XAML expression-binding false-positive guard, post-apply resolution restatement, fix-scope hardening) and 64a9a33 (solution-root resource glob — the IS overview documented the wrong layout, argument-null source-read mandate, contradiction-terminal rule, getasset manifest order-tolerant mock rules + allowed_tools cleanup). No coder-eval or judge changes.

Known residual (pre-existing, not this PR): skill activation is a coin flip on some "why did my job fail" phrasings — 4 single-track failures were correct diagnoses (judge 0.8–1.0) that activated uipath-platform instead of uipath-troubleshoot and lost only skill_triggered. Recommended follow-up outside this PR's scope: add the faulted-job→uipath-troubleshoot redirect to uipath-platform's when_to_use.

@dmorosanu

Copy link
Copy Markdown
Contributor Author

Routing-fix validation (e7e1c9b — front-loaded the faulted-job→uipath-troubleshoot redirect in uipath-platform's when_to_use; the previous redirects sat past the 1536-char listing truncation and never reached the model):

3 CI rounds × the 4 routing-affected tasks = 12 samples:

Task R1 R2 R3
gsuite-connection-invalid pass pass pass
connector-general-disabled pass pass pass
connector-general-no-access pass misroute pass
uia-alter-if-disabled pass pass pass

Activation success: 11/12 (92%) vs ~50% observed pre-fix on these prompts (each of the four misrouted on one of its two pre-fix track runs). The residual miss is the most connection-token-heavy prompt in the suite — the coin flip is attenuated, not eliminated. All 11 correctly-routed runs scored 1.000.

@dmorosanu dmorosanu force-pushed the feat/troubleshoot-v2 branch from e7e1c9b to 1ac0ae4 Compare July 3, 2026 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant