Fix PR-triage eval trigger: dispatch evaluation.yml instead of bot-applied label #746

Merged: JanKrivanek merged 1 commit into dotnet:main from JanKrivanek:fix/triage-eval-dispatch on Jun 12, 2026

@JanKrivanek

Member

Problem

When the PR-triage worker decided a PR was ready-for-eval, it added the
evaluate-now label and relied on evaluation.yml's
pull_request_target: [labeled] trigger to start evaluation. That trigger never
fires for the bot.

The cause is GitHub's recursion guard: events emitted by the default
GITHUB_TOKEN do not start new workflow runs, and labeled is one of them. The
worker applies the label as github-actions[bot] (its GITHUB_TOKEN), so the
labeled webhook is suppressed and no evaluation runs.

Live repro — PR #745

| Time (UTC) | Actor | Event | Result |
| --- | --- | --- | --- |
| 13:33:39Z | github-actions[bot] | added evaluate-now | no run fired |
| 15:39:24Z | a maintainer | /evaluate | evaluation ran |

The same worker run (workflow_dispatch, dispatched by the batch via
GITHUB_TOKEN) is itself an A/B proof of the guard:

  • workflow_dispatch → fired (the worker executed)
  • its own labeled (the label it added) → did not fire

workflow_dispatch and repository_dispatch are the only token-initiated events
exempt from the recursion guard.
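
As a mental model of that rule, the sketch below classifies an event name by whether a GITHUB_TOKEN-initiated occurrence can start a workflow run. The helper name token_event_starts_workflow is made up for illustration; it only encodes the two exemptions named above and does not query GitHub.

```shell
# Simplified model of the recursion guard described above; not a GitHub API.
# token_event_starts_workflow is a hypothetical helper name.
token_event_starts_workflow() {
  case "$1" in
    # The two token-initiated events that still start workflow runs.
    workflow_dispatch|repository_dispatch) echo yes ;;
    # Everything else (labeled, push, ...) is suppressed for GITHUB_TOKEN.
    *) echo no ;;
  esac
}
```

Calling it with labeled prints no, and calling it with workflow_dispatch prints yes, which mirrors the A/B observation from the worker run.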

Fix

The worker now dispatches evaluation.yml directly via gh workflow run
(workflow_dispatch) with a pr_number input, instead of adding a label that
can't fire. The dispatch is routed through the existing gate job, so the path
is identical to /evaluate (same permission checks, PR fetch, fork handling,
concurrency group, commit status, and result comment).
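
A minimal sketch of such a dispatch, assuming the input is passed with gh's -f key=value flag. The function name dispatch_eval and the DRY_RUN switch are illustrative and not taken from pr-triage-act.sh; the real do_eval_trigger may pass further inputs or flags.

```shell
# Sketch only: dispatch evaluation.yml for one PR via workflow_dispatch.
# dispatch_eval and DRY_RUN are hypothetical names used for illustration.
dispatch_eval() {
  pr_number=$1
  # Build the command as positional parameters so quoting stays intact.
  set -- gh workflow run evaluation.yml -f "pr_number=$pr_number"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "$*"
    return 0
  fi
  # Requires an authenticated gh and a token with actions: write.
  "$@"
}
```

With DRY_RUN=1 the function only prints the command, which makes the dispatch path easy to inspect without starting a run.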

  • evaluation.yml: new workflow_dispatch inputs pr_number + head_sha;
    gate.if and the concurrency group gain a third entry point; gate reads and
    numerically validates pr_number (via env, no interpolation); discover only
    runs a full/plugin eval for a no-pr_number dispatch.
  • pr-triage-act.sh: do_eval_trigger dispatches evaluation.yml rather
    than labeling. Idempotency (eval_run_exists_for_head) gains a second path:
    a dispatched run's head_sha is the default branch (not the PR head), so it is
    matched by the deterministic run name Evaluate PR #<n> @ <sha7>.
  • pr-triage.yml: worker granted actions: write (required for
    gh workflow run).
  • The evaluate-now label remains a valid human entry point (a human
    adding the label is not subject to the recursion guard).
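
The validation and run-name matching above can be sketched as follows. All three helper names are hypothetical, and the actual gate job and eval_run_exists_for_head may be written differently. The sketch assumes pr_number reaches the script as an environment variable or argument (never interpolated into the script text) and that run titles arrive one per line on stdin, for example from a gh run list query.

```shell
# Hypothetical helpers; names are illustrative, not the script's own.

# Accept only a positive decimal integer (no signs, spaces, or shell syntax).
is_valid_pr_number() {
  case "$1" in
    ''|*[!0-9]*|0*) return 1 ;;
    *) return 0 ;;
  esac
}

# Build the deterministic run name from a PR number and a commit SHA,
# truncating the SHA to its first 7 characters.
eval_run_name() {
  printf 'Evaluate PR #%s @ %.7s\n' "$1" "$2"
}

# Read run titles on stdin, one per line; succeed if any line is exactly
# the expected run name for this PR number and head SHA.
eval_run_named_exists() {
  grep -Fxq -- "$(eval_run_name "$1" "$2")"
}
```

Matching on the full line with a fixed string avoids a false hit when one PR number is a prefix of another, such as #74 and #746.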

Verification

  • Recursion-guard behavior is proven empirically from this repo's own history
    (see the A/B table above) — workflow_dispatch via GITHUB_TOKEN already
    drives the batch → worker chain.
  • evaluation.yml + pr-triage.yml parse cleanly; pr-triage-act.sh passes
    bash -n.

Note: workflow_dispatch always runs the default-branch copy of
evaluation.yml, so the new dispatch path can only be exercised end-to-end
once this is merged to main. The pre-merge evidence above plus syntax
validation is the available verification.

Docs

Updated pr-triage-workflows.md (architecture
diagram + the three evaluation.yml entry points).

The triage worker added the 'evaluate-now' label via GITHUB_TOKEN, but label events emitted by GITHUB_TOKEN do not start workflows (GitHub's recursion guard), so evaluation.yml's pull_request_target:[labeled] entry point never fired for the bot (repro: PR dotnet#745). workflow_dispatch and repository_dispatch are the only token-initiated events exempt from that guard.

The worker now dispatches evaluation.yml directly via 'gh workflow run' with a pr_number input, routed through the existing gate job so the path is identical to /evaluate. A dispatched run's head_sha is the default branch (not the PR head), so idempotency now matches the deterministic run name 'Evaluate PR #<n> @ <sha7>'. The 'evaluate-now' label remains a valid human entry point. Worker granted actions:write for 'gh workflow run'.
@github-actions

Contributor

Note

This PR is from a fork and modifies infrastructure files (eng/ or .github/).

Changes to infrastructure typically need to be submitted from a branch in dotnet/skills (not a fork) so that CI workflows run with the correct permissions and secrets.

Please consider recreating this PR from an upstream branch. If you don't have push access to dotnet/skills, ask a maintainer to push your branch for you.

@JanKrivanek JanKrivanek marked this pull request as ready for review June 10, 2026 16:30
Copilot AI review requested due to automatic review settings June 10, 2026 16:30
@JanKrivanek JanKrivanek requested a review from dbreshears as a code owner June 10, 2026 16:30
@JanKrivanek JanKrivanek enabled auto-merge (squash) June 10, 2026 16:30

Copilot AI left a comment

Contributor

Pull request overview

Updates the PR-triage automation so that when a PR is deemed ready-for-eval, the worker triggers evaluation.yml via workflow_dispatch (instead of relying on a bot-applied label that can’t emit workflow-triggering events under GitHub’s recursion guard).

Changes:

  • Add workflow_dispatch inputs to evaluation.yml and route pr_number dispatches through the existing gate PR pipeline/concurrency group.
  • Update the triage worker script to dispatch evaluation.yml via gh workflow run and enhance idempotency detection for dispatch-triggered runs.
  • Grant the triage worker actions: write so it can dispatch evaluation.yml, and update design docs to reflect the new trigger path.

Summary per file:

| File | Description |
| --- | --- |
| docs/design/pr-triage-workflows.md | Updates architecture + evaluation entry-point documentation to reflect dispatch-based triggering. |
| .github/workflows/pr-triage.yml | Expands permissions to allow dispatching evaluation.yml from the worker. |
| .github/workflows/evaluation.yml | Adds dispatch inputs and updates gate/concurrency logic to support triage dispatch entry point. |
| .github/scripts/pr-triage-act.sh | Switches eval triggering from label application to workflow dispatch; adds dispatch-run idempotency lookup. |

Copilot's findings

  • Files reviewed: 4/4 changed files
  • Comments generated: 1

Comment thread .github/workflows/evaluation.yml
@github-actions github-actions Bot added the waiting-on-author label (PR state label) Jun 10, 2026
@github-actions

Contributor

👋 @JanKrivanek — this PR has 1 unresolved review thread(s). When you're ready, please address the feedback and push an update; the triage bot will pick up the next state automatically. (Add the no-stale label to silence further pings.)

@JanKrivanek

Member Author

/evaluate

@github-actions github-actions Bot added the pr-state/ready-for-eval (PR is mergeable and awaiting evaluation) and evaluate-now (Trigger evaluation.yml for current PR head (transient)) labels and removed the waiting-on-author (PR state label) label Jun 12, 2026
@JanKrivanek JanKrivanek merged commit 0c0f6f0 into dotnet:main Jun 12, 2026
23 checks passed
github-actions Bot added a commit that referenced this pull request Jun 12, 2026
@github-actions

Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| technology-selection | ML.NET classification on tabular data | 3.0/5 → 4.0/5 🟢 | ✅ technology-selection; tools: skill, stop_bash / ✅ technology-selection; tools: skill | 🟡 0.39 | |
| technology-selection | LLM integration with MEAI abstraction | 1.0/5 → 1.0/5 | ⚠️ NOT ACTIVATED | 🟡 0.39 | [1] |
| technology-selection | Reject LLM for tabular classification | 3.0/5 → 4.0/5 🟢 | ✅ technology-selection; tools: skill | 🟡 0.39 | |
| technology-selection | Agentic workflow with guardrails | 2.0/5 → 3.0/5 🟢 | ✅ technology-selection; tools: skill | 🟡 0.39 | |
| technology-selection | Natural-language scenario decomposition — RAG chatbot | 3.0/5 → 4.0/5 🟢 | ✅ technology-selection; tools: skill / ✅ technology-selection; tools: skill, bash | 🟡 0.39 | [2] |
| technology-selection | RAG pipeline with vector search | 4.0/5 → 5.0/5 🟢 | ✅ technology-selection; tools: skill | 🟡 0.39 | |
| mcp-csharp-debug | Debug an MCP server with MCP Inspector | 4.0/5 → 4.0/5 | ✅ mcp-csharp-debug; tools: report_intent, skill / ✅ mcp-csharp-debug; tools: skill | ✅ 0.07 | [3] |
| mcp-csharp-debug | Configure VS Code to use an MCP server | 4.0/5 → 4.0/5 | ✅ mcp-csharp-debug; tools: skill, report_intent, view, glob / ✅ mcp-csharp-debug; tools: skill, report_intent | ✅ 0.07 | |
| mcp-csharp-debug | Debug a failing MCP server tool | 5.0/5 → 4.0/5 🔴 | ✅ mcp-csharp-debug; tools: report_intent, skill / ✅ mcp-csharp-debug; tools: skill | ✅ 0.07 | |
| mcp-csharp-publish | Publish an MCP server as a NuGet tool package | 3.0/5 → 4.0/5 🟢 | ✅ mcp-csharp-publish; tools: skill | 🟡 0.21 | [4] |
| mcp-csharp-publish | Deploy an HTTP MCP server to Azure Container Apps | 3.0/5 → 5.0/5 🟢 | ✅ mcp-csharp-publish; tools: skill, report_intent, view | 🟡 0.21 | |
| mcp-csharp-publish | Publish to the MCP Registry | 1.0/5 → 3.0/5 🟢 | ✅ mcp-csharp-publish; tools: skill | 🟡 0.21 | |
| mcp-csharp-create | Implement MCP tools with proper attributes and DI | 4.0/5 → 5.0/5 🟢 | ✅ mcp-csharp-create; tools: skill, view | ✅ 0.12 | |
| mcp-csharp-create | Create an HTTP MCP server with tools and resources | 4.0/5 → 5.0/5 🟢 | ✅ mcp-csharp-create; tools: skill | ✅ 0.12 | |
| mcp-csharp-create | Create an MCP server with tools, prompts, and proper logging | 4.0/5 → 5.0/5 🟢 | ✅ mcp-csharp-create; tools: skill | ✅ 0.12 | |
| mcp-csharp-test | Write unit and integration tests for an MCP server | 3.0/5 → 4.0/5 🟢 | ✅ mcp-csharp-test; tools: skill, report_intent, view | 🟡 0.21 | |
| mcp-csharp-test | Test an HTTP MCP server with WebApplicationFactory | 3.0/5 → 4.0/5 🟢 | ✅ mcp-csharp-test; tools: skill, report_intent, view | 🟡 0.21 | |
| mcp-csharp-test | Create evaluations for an MCP server | 2.0/5 → 5.0/5 🟢 | ✅ mcp-csharp-test; tools: skill, view | 🟡 0.21 | |
| exp-mock-usage-analysis | Detect unused and unreachable mock setups | 3.0/5 → 5.0/5 🟢 | ✅ exp-mock-usage-analysis; tools: skill | ✅ 0.09 | |
| exp-mock-usage-analysis | Detect redundant mock configurations duplicated across tests | 3.0/5 → 4.0/5 🟢 | ✅ exp-mock-usage-analysis; tools: skill | ✅ 0.09 | |
| exp-mock-usage-analysis | Detect mocking of stable framework types | 3.0/5 → 5.0/5 🟢 | ✅ exp-mock-usage-analysis; tools: skill | ✅ 0.09 | |
| exp-mock-usage-analysis | Analyze mock usage in NSubstitute tests | 3.0/5 → 5.0/5 🟢 | ✅ exp-mock-usage-analysis; tools: skill | ✅ 0.09 | |
| exp-mock-usage-analysis | Analyze mock usage in FakeItEasy tests | 4.0/5 → 5.0/5 🟢 | ✅ exp-mock-usage-analysis; tools: skill | ✅ 0.09 | |
| exp-mock-usage-analysis | Detect excessive mock configuration sprawl | 3.0/5 → 4.0/5 🟢 | ✅ exp-mock-usage-analysis; tools: skill | ✅ 0.09 | |
| exp-test-maintainability | Recommend data-driven patterns with display names for unclear parameters | 4.0/5 → 4.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.13 | [5] |
| exp-test-maintainability | Recognize well-maintained tests that need minimal changes | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.13 | [6] |
| exp-test-maintainability | Detect repeated object construction and setup across test methods | 3.0/5 → 4.0/5 🟢 | ✅ exp-test-maintainability; tools: skill | ✅ 0.13 | |
| exp-test-maintainability | Recognize tests with minimal boilerplate that need no refactoring | 3.0/5 → 5.0/5 🟢 | ✅ exp-test-maintainability; tools: skill | ✅ 0.13 | |
| exp-simd-vectorization | Optimize manual min/max with TensorPrimitives | 1.0/5 → 4.0/5 🟢 | ✅ exp-simd-vectorization; tools: skill, create, bash | ✅ 0.17 | |
| exp-simd-vectorization | Optimize manual product with TensorPrimitives | 1.0/5 → 5.0/5 🟢 | ✅ exp-simd-vectorization; tools: skill, glob, create, bash | ✅ 0.17 | |
| exp-simd-vectorization | No optimization opportunity — dictionary-based lookup service | 5.0/5 → 5.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.17 | [7] |
| exp-simd-vectorization | Optimize int array conditional increment with SIMD | 3.0/5 → 3.0/5 | ✅ exp-simd-vectorization; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.17 | |
| exp-simd-vectorization | Optimize byte buffer bit reversal with SIMD | 4.0/5 → 4.0/5 | ✅ exp-simd-vectorization; tools: skill | ✅ 0.17 | [8] |

[1] (Isolated) Quality unchanged but weighted score is -0.8% due to: efficiency metrics
[2] (Isolated) Quality improved but weighted score is -4.2% due to: tokens (54057 → 92250), time (36.5s → 46.3s)
[3] (Plugin) Quality unchanged but weighted score is -8.4% due to: tokens (12732 → 30244), tool calls (0 → 1), time (11.2s → 15.1s)
[4] (Plugin) Quality unchanged but weighted score is -5.1% due to: tokens (38459 → 65947), tool calls (3 → 5)
[5] (Isolated) Quality unchanged but weighted score is -0.1% due to: efficiency metrics
[6] (Plugin) Quality unchanged but weighted score is -0.4% due to: efficiency metrics
[7] (Isolated) Quality unchanged but weighted score is -2.4% due to: judgment
[8] (Isolated) Quality unchanged but weighted score is -8.8% due to: quality, tool calls (9 → 11)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

To investigate failures, paste this to your AI coding agent:

For PR 746 in dotnet/skills, download eval artifacts with gh run download 27418730270 --repo dotnet/skills --pattern "skill-validator-results-*" --dir ./eval-results, then fetch https://raw.githubusercontent.com/dotnet/skills/5cbaa520aa377f40dc753209ddb67edddb61c16b/eng/skill-validator/src/docs/InvestigatingResults.md and follow it to analyze the results.json files. Diagnose each failure, suggest fixes to the eval.yaml and skill content, and tell me what to fix first.

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
📊 Session Analytics (preview) -- aggregated metrics across evaluation sessions
