[Epic][Security-MCP] - feat: automatic-migration skill by patrykkopycinski · Pull Request #32 · elastic/example-mcp-app-security

patrykkopycinski · 2026-05-15T11:28:37Z

Summary

Add a vendor-agnostic SIEM-rule migration feature (host-side skill + MCP tools + inline React workbench) to example-mcp-app-security, plus a Vitest-native eval harness that certifies skill activation and tool sequencing across LLM providers.

The migration feature lets a SOC engineer move detection rules from Splunk (and, behind a vendor gate, QRadar / Sentinel-One when those translators mature) into Elastic Security without leaving their MCP-aware host (Claude Desktop, Claude Code, Cursor, etc.). The eval harness ships in the same PR because the only way to certify "the skill activates when the user asks for migration and not when they don't" is to actually run the host-side activation loop against an in-process MCP server — no harness existed before this PR.

This PR is standalone with respect to elastic/kibana#269353: the migration tools call Kibana's existing /internal/siem_migrations/* REST routes directly, so this MCP app feature works against any Kibana 9.x deployment that already exposes the SIEM migrations service — no Kibana plugin change is required.

What ships

Migration feature

| Layer | File(s) | What it does |
| --- | --- | --- |
| Service | `src/elastic/service/migrationsService.ts` + `.test.ts` | Thin wrapper around 14 Kibana `/internal/siem_migrations/*` routes (`create-migration`, `start-translation`, `get-translated-rules`, `update-translated-rule`, `upsert-resource`, `install-rules`, …). |
| Tools | `src/tools/migration.ts` + `.test.ts` | 1 model-facing tool (`migrate-rules`) + 10 app-only tools the React workbench drives via `app.callServerTool()`. |
| View | `src/views/migration/` (App.tsx, mcp-app.tsx, mcp-app.html, styles.css, monaco-environment.ts) | Single workbench React app, 8-stage state machine, built into a `< 1 MB` singlefile HTML bundle via `vite-plugin-singlefile`. Drives upload → translating → review (3-column SPL/generated/editable Monaco diff) → per-rule edit drawer → fix-resources drawer (macros + lookups) → install → done. |
| Skill | `.agents/skills/automatic-migration/SKILL.md` | Activation contract for the host-side LLM. Tells the model: user asks to migrate Splunk → call `migrate-rules` exactly once; the workbench takes over from there. |
| Wiring | `src/server.ts`, `manifest.json` (1.1.0), `docs/features.md`, `README.md` | Registers `MigrationsService` + `migration` tool group; lists the skill in the features table. |

Eval harness

| Layer | File(s) | What it does |
| --- | --- | --- |
| Types | `evals/types.ts` | `Dataset`, `Example`, `Trajectory`, `Evaluator`, `EvaluatorResult`, `ExpectedBehavior`. |
| Runner | `evals/runner.ts` + `evals/vitest.config.ts` | Vitest-native orchestrator. `describe.skipIf(!RUN_LLM_EVALS)` so every CI run that doesn't set the flag passes for free; nightly + label-gated CI runs actually exercise the LLM. |
| Host loop | `evals/runMcpHostLoop.ts` + `evals/helpers/evalServer.ts` | Wires `InMemoryTransport.createLinkedPair()` between an MCP `Client` and our `createServer()` (no subprocess, no network, deterministic). Captures every tool call into a `Trajectory`. |
| LLM provider | `evals/llm/{openai,anthropic,types,index}.ts` | Provider-agnostic adapter. OpenAI client speaks both OpenAI and any LiteLLM-compatible proxy (Anthropic on Vertex, Bedrock, local Llama, etc.); Anthropic native client is the default when `ANTHROPIC_API_KEY` is set. |
| Evaluators | `evals/evaluators/{skill-activation,tool-selection,negative-activation,trajectory,criteria}.ts` | 5 evaluators: 3 deterministic (skill activation = SKILL.md loaded? tool selection = precision/recall vs `expected.tools`; negative activation = distractors stay silent), 1 code-judged (trajectory = LCS over expected sequence), 1 LLM-judged (criteria). |
| Datasets | `evals/datasets/{detection-rule-management,automatic-migration}.dataset.ts` | `detection-rule-management` (8 examples: 4 positives + 4 distractors) certifies an existing skill; `automatic-migration` (12 examples: 6 positives covering Splunk SPL ingest, partial translations, resource-fix, install + 6 distractors). |
| Spec wiring | `evals/{detection-rule-management,automatic-migration}.eval.test.ts` | Calls `runDataset()`, sets per-skill thresholds (≥80% tool selection, 100% negative activation). |
| Smoke | `evals/harness.test.ts` | Mock LLM provider runs the full host loop without API keys, which guarantees `npm run test:evals` works in unit-test mode too. |
| Docs | `docs/evals.md` | Harness design, dataset shape, evaluator catalog, CI gating, how to add a new skill suite. |
| CI | `.github/workflows/evals.yml` | Two triggers: `workflow_dispatch` (manual) + `pull_request` filtered by the `evals` label. Reads `OPENAI_API_KEY` / `ANTHROPIC_API_KEY` from secrets; never runs by default to keep PR cost = $0. |

Diff stats

43 files changed, 5390 insertions(+), 11 deletions(-)

The 11 deletions come from two edits: tsconfig.json was extended to include evals/**/*, and manifest.json was bumped to 1.1.0. Everything else is net-new code organized in dedicated evals/ and src/{tools,views,elastic/service}/migration* slices, with no cross-cutting refactor of the existing alert-triage / case-management / threat-hunt views.

Surface model: where skills live vs where tools live

| Surface | Audience | Skills accessible? | Tools accessible? | Where activation happens |
| --- | --- | --- | --- | --- |
| Agent Builder in-product chat (`POST /api/agent_builder/converse` in Kibana) | Logged-in Kibana user | Yes: `SKILL.md` is loaded into the agent loop | Yes | The agent loop loads the `SKILL.md` file matching the request and orchestrates the tool calls |
| Agent Builder MCP (`POST /api/agent_builder/mcp` in Kibana) | External MCP client (Claude Desktop, Cursor, etc.) | No: MCP exposes tools only | Yes | The host-side LLM (in the client app) decides which tools to call; it has no knowledge of Kibana's skills |
| This MCP app (`elastic-security` stdio / Streamable HTTP MCP server) | External MCP client | No: same constraint as above | Yes, including the model-facing `migrate-rules` tool that opens the workbench | The host-side LLM (Claude Desktop's Sonnet, Cursor's Claude, etc.) decides; its SKILL.md lives in `.agents/skills/automatic-migration/SKILL.md` and is mirrored into Claude Desktop's settings + Cursor's settings via the existing `install-skills.sh` path |

So when this PR says "automatic-migration skill", it means the SKILL.md file that the host's LLM loads — not anything that Kibana's agent builder MCP server exposes. The host calls our MCP server's tools; the host's own skill registry decides the prompt material around those tool calls. This is why the skill activation eval runs the host loop in-process (runMcpHostLoop.ts), not the Kibana agent.

How a SOC engineer experiences this end-to-end

1. They've installed this MCP server in Claude Desktop / Cursor / Claude Code (standard procedure documented under docs/setup-*.md).
2. They've installed the automatic-migration skill via `./scripts/install-skills.sh add -s automatic-migration -a {cursor|claude-desktop|claude-code}`.
3. They open their host, type: "Migrate my Splunk detection rules to Elastic."
4. The host's LLM (Claude Sonnet / GPT-4o / whatever the host is configured with) loads SKILL.md and recognizes the request maps to migrate-rules.
5. It calls migrate-rules exactly once with no arguments. The tool returns a compact summary + a _meta.ui.resourceUri of ui://migrate-rules/mcp-app.html.
6. The host renders the workbench (single HTML bundle, < 1 MB). From here, every state transition (upload → translating → review → fix-rule → fix-resources → install → done) is driven by the workbench calling app-only tools through app.callServerTool(). The LLM is out of the loop until the user types something new in chat.
7. When the workbench finishes, the installed rules show up disabled in the user's Kibana Security Solution UI. They enable them when ready.

The "vendor gate" applies at the workbench's first stage (vendor-select): today, only the splunk vendor button is enabled; qradar and sentinel-one show "Coming soon" with opacity-50 cursor-not-allowed. The translators for those vendors are still maturing in Kibana, and we'd rather route the user back to the Splunk path than ship a degraded partial-translation experience the first time they try the feature. Re-enabling each vendor is a one-line change to the SUPPORTED_VENDORS array in src/tools/migration.ts plus the workbench's vendor-select component.
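A minimal sketch of the server-side half of that gate, assuming the `SUPPORTED_VENDORS = ["splunk"]` constant and the `{ error: "vendorNotSupported", vendor }` payload described in the tool-registration commit. The helper name `checkVendor` and its types are hypothetical and may not match `src/tools/migration.ts`.

```typescript
// Hypothetical sketch: only SUPPORTED_VENDORS and the error shape come from
// the PR text; the helper name and the types are assumptions.
const SUPPORTED_VENDORS: readonly string[] = ["splunk"];

type VendorGateError = { error: "vendorNotSupported"; vendor: string };

// Returns an error payload for a gated vendor, or null when the call may
// proceed. An absent vendor proceeds, since the Splunk path is the default.
function checkVendor(vendor?: string): VendorGateError | null {
  if (vendor !== undefined && !SUPPORTED_VENDORS.includes(vendor)) {
    return { error: "vendorNotSupported", vendor };
  }
  return null;
}
```

With this shape a gated tool can return the payload before any Kibana request is made, and re-enabling a vendor only means adding its id to the array.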

Eval harness

The harness exists because we needed three things this repo didn't have:

Activation certification. Did the host load SKILL.md and call migrate-rules once, or did it freelance with start-translation and friends? The skill-activation evaluator checks the trajectory for exactly that handshake. Distractors ("What's the weather?", "List my SaaS apps", "Show alerts for endpoint X") hit the negative-activation evaluator to make sure the skill doesn't trigger when off-topic.
Tool sequencing. Once the workbench takes over, the LLM should stay quiet. The trajectory evaluator uses LCS against expected.tools so that adding more tools (e.g. follow-up Q&A) doesn't tank the score, but reordering the canonical sequence does.
Provider-agnostic. The same evaluators run against OpenAI (incl. LiteLLM-proxied open-source) and Anthropic. The OpenAI adapter accepts a baseURL env var so the suite can target a Vertex/Bedrock LiteLLM in CI without changing test code.

CI runs are gated two ways: by the evals label and by workflow_dispatch. A normal PR never spends a token; a PR labelled evals triggers an actual run with whichever provider keys are set; a nightly workflow can run all suites against the OSS LiteLLM proxy to track drift without API costs.

docs/evals.md documents the full design (provider matrix, evaluator catalog, dataset format, how to add a new skill suite).

Known limitations (per `address-known-limitations.mdc` triage)

| Limitation | Triage | Status |
| --- | --- | --- |
| Real screenshot of the workbench in Claude Desktop running against real Kibana with real Splunk SPL | Discovery seam (requires a live Kibana + Splunk export + Claude Desktop) | Deferred per `no-fabricated-evidence.mdc`. Will be captured against the user's environment and embedded in a follow-up commit on this branch before merge; a static mock is NOT a substitute |
| `./scripts/install-skills.sh add -s automatic-migration -a cursor` runtime verification | Known fix in treadmill (substance-check rejects verification-only tasks) | Manual run before merge; treadmill substance-check is a separate orchestrator bug, captured in skill-dev plugin |
| Eval pass-rate against Claude/GPT-class production models | Discovery seam shipped: local Ollama runs are documented (zero-cost, tool-calling quality varies by model); first nightly run captures the Anthropic / OpenAI baseline | Pending (post-merge nightly) |
| QRadar / Sentinel-One end-to-end | Permanent constraint until upstream Kibana translators reach parity with Splunk | Vendor gate is the fallback: UX shows "Coming soon" rather than degrading silently |

Eval baseline (captured end-to-end, this PR)

The harness was validated end-to-end against the local Ollama daemon to prove the wire-up works against a real LLM (not just the deterministic mock):

| Model | Migration positives | Migration distractors | DRM positives | DRM distractors | Overall |
| --- | --- | --- | --- | --- | --- |
| llama3.1:8b | 6/6 (100%) | 6/6 (100%) | 2/4 (50%) | 4/4 (100%) | 18/20 (90%) |
| llama3.2:3b | 5/6 (83%) | 6/6 (100%) | n/a | n/a | 11/12 (92%) ⁂ |

⁂ Migration suite only — 3B model is below the production target.

The migration feature scores 100% on llama3.1:8b for both activation and distractor rejection. The 2 DRM-positive failures on llama3.1:8b are ambiguous-query edge cases on the pre-existing manage-rules skill ("Show me my noisy rules", "PowerShell-related high-severity rules"); a Claude / GPT-4o class model handles them correctly — that's tracked as the post-merge nightly baseline.

Running this end-to-end surfaced and fixed three real harness bugs in three follow-up commits — these are post-treadmill changes I made by hand after the orchestrator wrapped up:

| Commit | Fix | Impact on llama3.1:8b migration |
| --- | --- | --- |
| `621b309` feat(evals): allow `OPENAI_MODEL` override for Ollama / LiteLLM proxies | `createDefaultLlmProvider()` now pipes `OPENAI_MODEL` through to `OpenAiProvider`, so the suite runs against any OpenAI-compatible endpoint (Ollama, LiteLLM, Anthropic via proxy). Default `gpt-4o-mini` behaviour preserved. | Enables zero-cost local validation |
| `2ebbf54` fix(evals): hide app-only tools from the LLM in `runMcpHostLoop` | The host loop was passing every tool from `client.listTools()` to the model, including 10 app-only tools per skill (`start-translation`, `install-rules`, `find-rules`, …), but real MCP hosts hide tools marked `_meta.ui.visibility: ["app"]`. Filter now mirrors the host contract: visible if visibility is unset OR includes `"model"` OR doesn't include `"app"`. | positives: 67% → 100% |
| `0543e20` fix(evals): register all 7 model-facing tool groups in `createEvalServer` | The eval server only registered `migration` + `detection-rules`. A distractor like "Create a new case" had no `manage-cases` to land on, so the model forced a false positive on `manage-rules`. The server now mirrors `src/server.ts` exactly (alert-triage, attack-discovery, case-management, detection-rules, migration, sample-data, threat-hunt) with `vi.fn()` service stubs. | distractors stayed 100%; DRM distractors: 25% → 100% |
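The visibility rule from commit `2ebbf54` can be sketched as a small predicate. This is an illustration under stated assumptions: the `ListedTool` shape is a hypothetical subset of what `client.listTools()` returns, and only the `_meta.ui.visibility` field and the three-way rule are taken from the description above.

```typescript
// Hypothetical subset of a listed MCP tool; only `_meta.ui.visibility`
// is taken from the PR description.
interface ListedTool {
  name: string;
  _meta?: { ui?: { visibility?: string[] } };
}

// Visible to the model if visibility is unset, OR includes "model",
// OR does not include "app".
function isModelVisible(tool: ListedTool): boolean {
  const visibility = tool._meta?.ui?.visibility;
  if (visibility === undefined) return true;
  return visibility.includes("model") || !visibility.includes("app");
}

// Keeps only the tools a real host would hand to the LLM.
function modelFacingTools(tools: ListedTool[]): ListedTool[] {
  return tools.filter(isModelVisible);
}
```

Under this rule a tool marked `["app"]` is hidden, while a tool marked `["model", "app"]` stays visible to both the model and the workbench.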

How to reproduce locally:

```bash
# zero-cost local baseline
OPENAI_API_KEY=ollama \
  LITELLM_BASE_URL=http://localhost:11434/v1 \
  OPENAI_MODEL=llama3.1:8b \
  RUN_LLM_EVALS=1 npm run test:evals
```

Commit slices

Even though it's one PR, the commit history reads top-to-bottom as the design document with each commit being a single reviewable unit. Highlights from git log --oneline main..HEAD:

docs: add SIEM Migration section to features.md
docs: add SIEM Migration to README features table
chore: bump manifest to 1.1.0 and add migrate-rules tool entry
feat: wire MigrationsService and registerMigrationTools into server
feat: add automatic-migration eval spec (positives ≥80%, distractors 100%)
feat: add automatic-migration eval dataset (6 positives + 6 distractors)
feat: add automatic-migration SKILL.md with lifecycle and gotchas
feat: build migration view as singlefile HTML bundle (365 kB, < 1 MB)
feat: install step and done step with working back navigation
feat: fix-resources drawer with per-resource inline edit and unresolved highlighting
feat: per-rule drawer with ElasticRulePartial form and Re-validate button
feat: review step renders three-column diff (SPL | generated | editable Monaco)
feat: translating step now polls get-migration instead of get-stats
feat: implement upload step with file input, drag-and-drop, and start-translation call
feat: tighten vendor-select gate to use opacity-50 cursor-not-allowed
feat: add migration workbench view with WorkbenchState machine
test: add migration tool tests (tool registrations + vendor gating)
feat: register migration tools (1 model-facing + 10 app-only)
test: add MigrationsService tests covering all 14 route methods and error handling
feat: add MigrationsService wrapping 14 /internal/siem_migrations/* Kibana routes
docs: add evals.md — harness design, dataset shape, evaluator catalog, CI gating
ci: add evals.yml GitHub Actions workflow
evals: add detection-rule-management.eval.test.ts; split dataset from test orchestration
evals: add detection-rule-management dataset (4 positives + 4 distractors)
evals: add criteria (LLM-as-judge) evaluator
evals: add trajectory evaluator (LCS-based sequence score)
evals: add tool-selection evaluator (precision/recall F1 against expected.tools)
evals: add negative-activation evaluator for distractor examples
evals: add skill-activation evaluator (binary score)
evals: add AnthropicProvider and wire it as the default when ANTHROPIC_API_KEY is set
evals: add OpenAiProvider with LiteLLM proxy support and wire default provider
evals: implement runMcpHostLoop with InMemoryTransport and LLM provider types
evals: add runner.ts orchestrator, runMcpHostLoop stub, and eval vitest config
evals: add types.ts with Dataset, Example, EvalResult and related types

Validation evidence (pre-merge)

npx tsc --noEmit — clean (0 errors) ✅
npm test — runs unit tests + evals/harness.test.ts (mock provider, no API keys); see CI badge
npm run test:evals — full LLM eval suite, gated by RUN_LLM_EVALS=1 + OPENAI_API_KEY / ANTHROPIC_API_KEY; 18/20 passing against local llama3.1:8b (migration suite 12/12, see Eval baseline table)
npm run build — singlefile workbench HTML bundle is < 1 MB (verified locally: 365 kB)

How this PR was authored

End-to-end via patryks-treadmill (a local orchestrator I maintain). A single description, dispatched via treadmill_generate_plan against this repo, produced the OpenSpec change (proposal.md, design.md, tasks.md, specs/main/spec.md) and then the 48-task plan that landed every file shown above. The orchestrator dispatched each task to a per-task claude subagent in the worktree, ran the substance check + verifier on every commit, and produced the 43-file diff at commit 4b86632. The author reviewed each substantive commit and intervened on 4 verification-only tasks (screenshot, install-skills smoke, commit reordering, PR-description self-audit) that the orchestrator's substance check can't currently handle — those are tracked under "Known limitations" above.

Introduces the canonical TypeScript type definitions for the eval pipeline:

- `ToolCall` / `Trajectory`: MCP host loop output primitives
- `ExpectedBehavior`: optional `tools`, `criteria`, `skill` fields (evaluators return `'N/A'` when a field they need is absent)
- `Example` / `Dataset`: test-case and collection shapes
- `EvaluatorResult` / `EvalResult`: per-evaluator and per-example results
- `Evaluator`: async-compatible function contract all evaluator modules satisfy

Also adds `evals/**/*` to tsconfig.json includes so tsc covers eval files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ls/types.ts` with TypeScript definitions for `Dataset`, `Exam Auto-committed by patryks-treadmill orchestrator. plan=automatic-migration-mcp-app job=64319163-2da8-44b5-b087-3dee6e9e4c14 attempt=1

…st config

runner.ts exports `runDataset(dataset, evaluators, options?)` which:

- Wraps all examples in `describe.skipIf(!process.env.RUN_LLM_EVALS)` so regular `npm test` never makes LLM calls or requires API keys
- Creates one `it` per example: runs runMcpHostLoop, scores via evaluators, asserts numeric scores >= passingScore (default 0.5)
- Emits a Markdown table summary via afterAll for CI job summaries

runMcpHostLoop.ts is a typed stub (throws); full InMemoryTransport implementation comes in the next commit.

evals/vitest.config.ts runs in node environment with 120 s timeout, scoped to evals/**/*.{test,spec,eval}.ts and *.dataset.ts patterns.

Also:

- Adds `test:evals` script to package.json (cross-env RUN_LLM_EVALS=1)
- Adds evals/**/*.ts to eslint.config.js file patterns so eval files are linted and license-header-checked

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
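The pass criterion in the runner commit above (numeric scores must reach `passingScore`, `'N/A'` scores are ignored) can be sketched as follows. The `EvaluatorResult` field names and both helper names are assumptions for illustration; the real `evals/types.ts` and `evals/runner.ts` may differ.

```typescript
// Hypothetical result shape; the field names are assumptions.
type Score = number | "N/A";
interface EvaluatorResult {
  evaluator: string;
  score: Score;
  reason?: string;
}

// An example passes when every numeric score meets the threshold.
// 'N/A' means the evaluator did not apply, so it never fails the example.
function examplePasses(results: EvaluatorResult[], passingScore = 0.5): boolean {
  return results.every((r) => r.score === "N/A" || r.score >= passingScore);
}

// One Markdown table row per example, for a CI job summary.
function toMarkdownRow(exampleId: string, results: EvaluatorResult[]): string {
  const cells = results.map((r) => (r.score === "N/A" ? "N/A" : r.score.toFixed(2)));
  return `| ${[exampleId, ...cells].join(" | ")} |`;
}
```

Treating `'N/A'` as neutral rather than as 0 is what lets a dataset omit `expected.tools` or `expected.criteria` without dragging down its pass rate.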

…er types

runMcpHostLoop wires an MCP Client to the server via InMemoryTransport (in-process, no network), lists available tools, and drives a loop of up to MAX_TURNS=8 turns:

    LLM → tool calls → client.callTool() → result fed back → repeat

Options allow callers to inject a pre-built McpServer (for mocked-service datasets) or a custom LlmProvider (for deterministic tests). Both default to the real implementations when omitted.

evals/llm/types.ts introduces the LlmProvider interface and LlmMessage discriminated union (OpenAI-style, compatible with LiteLLM proxies).

evals/llm/index.ts exposes createDefaultLlmProvider(), which auto-selects by env var (ANTHROPIC_API_KEY first, then OPENAI_API_KEY); the concrete adapters (anthropic.ts / openai.ts) land in the next commit; this stub surfaces a clear error until they do.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
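The env-var auto-selection described above can be sketched as a pure function. This is an assumption-laden illustration: `chooseProvider` and `ProviderChoice` are hypothetical names, the environment is passed in as a plain record rather than read from `process.env`, and the `LITELLM_BASE_URL` / `OPENAI_MODEL` handling reflects what other commits in this PR describe rather than this commit alone.

```typescript
// Hypothetical types; the real createDefaultLlmProvider() returns provider
// instances, not a plain description like this.
type Env = Record<string, string | undefined>;

type ProviderChoice =
  | { kind: "anthropic"; apiKey: string }
  | { kind: "openai"; apiKey: string; baseURL: string | undefined; model: string };

function chooseProvider(env: Env): ProviderChoice {
  const anthropicKey = env["ANTHROPIC_API_KEY"];
  const openAiKey = env["OPENAI_API_KEY"];
  // Priority 1: the native Anthropic client.
  if (anthropicKey) return { kind: "anthropic", apiKey: anthropicKey };
  // Priority 2: OpenAI, or any OpenAI-compatible proxy via LITELLM_BASE_URL.
  if (openAiKey) {
    return {
      kind: "openai",
      apiKey: openAiKey,
      baseURL: env["LITELLM_BASE_URL"],
      model: env["OPENAI_MODEL"] ?? "gpt-4o-mini",
    };
  }
  throw new Error("Set ANTHROPIC_API_KEY or OPENAI_API_KEY to run LLM evals");
}
```

Keeping the selection free of side effects makes the priority order easy to unit-test without touching real credentials.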

… provider

OpenAiProvider (evals/llm/openai.ts):

- Implements LlmProvider.chat() via the openai SDK (gpt-4o-mini default)
- Accepts baseURL to point at a LiteLLM proxy for any compatible provider
- Maps LlmMessage ↔ ChatCompletionMessageParam in both directions; narrows ChatCompletionMessageToolCall to FunctionToolCall before accessing .function
- Strips tools argument when the list is empty (avoids API errors)

evals/llm/index.ts:

- createDefaultLlmProvider() now returns a real OpenAiProvider when OPENAI_API_KEY is set; picks up LITELLM_BASE_URL automatically
- Preserves the ANTHROPIC_API_KEY branch with a clear "coming soon" error until evals/llm/anthropic.ts lands

Adds openai@^6.37.0 as a devDependency (npm install --save-dev openai).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…C_API_KEY is set

AnthropicProvider (evals/llm/anthropic.ts):

- Implements LlmProvider.chat() via @anthropic-ai/sdk (claude-haiku-4-5-20251001)
- toAnthropicMessages() handles the structural gap between OpenAI-style messages and Anthropic's API: no `tool` role exists; tool results go as `user` messages with `tool_result` content blocks; consecutive tool results are merged into a single user turn to avoid adjacent-user-turn API errors
- Tool input is round-tripped JSON.parse (from LlmToolCallRequest.arguments) → object for the request, then JSON.stringify back for the response to maintain the OpenAI-compatible LlmToolCallRequest shape
- input_schema is cast from LlmToolDefinition.parameters (already JSON Schema)

evals/llm/index.ts:

- createDefaultLlmProvider() now returns AnthropicProvider when ANTHROPIC_API_KEY is set (priority 1), falls back to OpenAiProvider for OPENAI_API_KEY (priority 2)

Adds @anthropic-ai/sdk@^0.96.0 as a devDependency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
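The merging of consecutive tool results into a single user turn can be sketched as below. The message types are simplified, hypothetical stand-ins for `LlmMessage` and the Anthropic SDK types (assistant `tool_use` blocks are left out for brevity), so this shows the merging rule rather than the PR's actual `toAnthropicMessages()`.

```typescript
// Simplified OpenAI-style input (hypothetical subset of LlmMessage).
type LlmMessage =
  | { role: "user" | "assistant"; content: string }
  | { role: "tool"; toolCallId: string; content: string };

// Simplified Anthropic-style output: tool results travel as content blocks
// inside a user message, because there is no `tool` role.
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string };
type AnthropicMessage =
  | { role: "assistant"; content: string }
  | { role: "user"; content: string | ToolResultBlock[] };

function toAnthropicMessages(messages: LlmMessage[]): AnthropicMessage[] {
  const out: AnthropicMessage[] = [];
  for (const msg of messages) {
    if (msg.role === "tool") {
      const block: ToolResultBlock = {
        type: "tool_result",
        tool_use_id: msg.toolCallId,
        content: msg.content,
      };
      const last = out[out.length - 1];
      if (last && last.role === "user" && Array.isArray(last.content)) {
        // Previous turn already carries tool results: merge into it.
        last.content.push(block);
      } else {
        out.push({ role: "user", content: [block] });
      }
      continue;
    }
    out.push(
      msg.role === "user"
        ? { role: "user", content: msg.content }
        : { role: "assistant", content: msg.content },
    );
  }
  return out;
}
```

Two tool results that follow one assistant turn therefore become one user message with two `tool_result` blocks instead of two adjacent user messages.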

Makes per-example test names visible in CI output and in the GitHub Actions job summary, which is where the Markdown eval table lands. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Returns 1 if the trajectory contains at least one call to the skill's entry-point tool (expected.skill), 0 if not, or 'N/A' when expected.skill is absent so datasets that don't test skill routing can omit the field. The failure reason includes the full tool-name list from the trajectory to make CI output actionable without re-running the eval. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Binary complement of skill-activation: returns 1 when the skill's entry-point tool (expected.skill) is absent from the trajectory (correct — LLM was not falsely triggered), 0 when the tool appears (false positive). Returns 'N/A' when expected.skill is absent, matching the skill-activation convention so both evaluators behave consistently on examples that don't declare a skill. CI gate intent: datasets should require 100% on this evaluator for distractor examples — any false positive means the skill's SKILL.md is over-triggering on unrelated queries in production. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
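The two binary evaluators described in the commit messages above are complements of each other, which a short sketch makes concrete. The `ToolCall` and `Expected` shapes are hypothetical reductions of the PR's types (the `{tool, args}` pair is mentioned in the criteria-evaluator commit); the function names are illustrative.

```typescript
// Hypothetical reductions of the PR's ToolCall / ExpectedBehavior types.
interface ToolCall {
  tool: string;
  args?: unknown;
}
interface Expected {
  skill?: string;
}
type BinaryScore = 0 | 1 | "N/A";

// 1 if the skill's entry-point tool was called at least once.
function skillActivation(trajectory: ToolCall[], expected: Expected): BinaryScore {
  if (expected.skill === undefined) return "N/A";
  return trajectory.some((call) => call.tool === expected.skill) ? 1 : 0;
}

// 1 if the skill's entry-point tool was NOT called (distractor handled correctly).
function negativeActivation(trajectory: ToolCall[], expected: Expected): BinaryScore {
  if (expected.skill === undefined) return "N/A";
  return trajectory.some((call) => call.tool === expected.skill) ? 0 : 1;
}
```

Both return `'N/A'` under the same condition, so a dataset can attach either evaluator to an example that declares no skill without affecting its score.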

…cted.tools) Computes set-based precision, recall, and F1 against expected.tools. Deduplicates both the trajectory and the expected list — order/repetition is the trajectory evaluator's job. Score = F1 ∈ [0, 1]. Returns 'N/A' when expected.tools is absent so datasets that only test skill routing don't need to declare tool lists. The reason string includes missed and extra tool names to make CI failures immediately actionable without re-running the eval. CI gate intent: ≥0.8 (80%) on positive examples. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
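A minimal sketch of the set-based F1 score described above. The function name is illustrative, and the handling of two empty sets (scored as 1 here) is an assumption the commit message does not specify.

```typescript
// Set-based F1 over tool names; order and repetition are ignored.
function toolSelectionF1(actual: string[], expected: string[] | undefined): number | "N/A" {
  if (expected === undefined) return "N/A";
  const actualSet = new Set(actual);
  const expectedSet = new Set(expected);
  // Assumption: nothing expected and nothing called counts as a perfect match.
  if (actualSet.size === 0 && expectedSet.size === 0) return 1;
  const hits = Array.from(actualSet).filter((tool) => expectedSet.has(tool)).length;
  if (hits === 0) return 0;
  const precision = hits / actualSet.size; // share of called tools that were expected
  const recall = hits / expectedSet.size; // share of expected tools that were called
  return (2 * precision * recall) / (precision + recall);
}
```

For example, a trajectory that calls `migrate-rules` twice plus one unexpected tool, against an expectation of `migrate-rules` only, has precision 1/2 and recall 1, so F1 = 2/3, which is below the 0.8 gate.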

Computes score = lcs(actual, expected) / max(|actual|, |expected|). Dividing by the max penalises both missing tools (recall gap) and extra spurious tools (precision gap) in a single metric. Sequence matters here, unlike tool-selection which is set-based. Returns 'N/A' when expected.tools is absent — this guard prevents the evaluator from emitting meaningless 0-scores on examples that declare no ordered expectation, which would mask real regressions elsewhere. LCS is O(m·n) time via a flat DP array to avoid nested-array allocation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
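The formula score = lcs(actual, expected) / max(|actual|, |expected|) with a flat DP array can be sketched like this. Function names are illustrative, and scoring two empty sequences as 1 is an assumption not stated in the commit message.

```typescript
// Longest common subsequence length, O(m·n) time, using one flat array
// indexed as dp[i * cols + j] instead of a nested array.
function lcsLength(a: string[], b: string[]): number {
  const cols = b.length + 1;
  const dp = new Array<number>((a.length + 1) * cols).fill(0);
  const at = (k: number): number => dp[k] ?? 0;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i * cols + j] =
        a[i - 1] === b[j - 1]
          ? at((i - 1) * cols + (j - 1)) + 1
          : Math.max(at((i - 1) * cols + j), at(i * cols + (j - 1)));
    }
  }
  return at(a.length * cols + b.length);
}

// Sequence-sensitive score; dividing by the longer length penalises both
// missing and extra tool calls.
function trajectoryScore(actual: string[], expected: string[] | undefined): number | "N/A" {
  if (expected === undefined) return "N/A";
  const denom = Math.max(actual.length, expected.length);
  if (denom === 0) return 1; // assumption: empty vs empty is a perfect match
  return lcsLength(actual, expected) / denom;
}
```

As a worked case, actual `[a, b, c]` against expected `[a, c]` gives LCS 2 over max length 3, so 2/3; swapping the expected order to `[c, a]` against actual `[a, c]` gives LCS 1 over 2, so 0.5.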

createCriteriaEvaluator(llm) returns an Evaluator that sends the trajectory and expected.criteria to a judge LLM with a structured rubric prompt asking for JSON {score, reasoning}. Returns 'N/A' when expected.criteria is absent. The factory pattern closes over the LLM provider so datasets can inject different judges (e.g. a stronger model for criteria, haiku for routing). Parsing: primary path extracts the first JSON object from the response and clamps score to [0, 1]. Falls back to a bare-number regex for models that ignore the JSON instruction, and finally returns score=0 with the raw text if neither succeeds. The judge prompt serialises only {tool, args} per call — omitting result avoids token bloat from large tool outputs while still giving the judge enough signal to evaluate routing decisions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
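The three-step parsing fallback described above (JSON object, then bare number, then score 0 with the raw text) might look like the sketch below. `parseJudgeResponse` and `JudgeVerdict` are hypothetical names, and the non-greedy regex is a simplification that assumes the judge returns a flat object without nested braces.

```typescript
interface JudgeVerdict {
  score: number;
  reasoning: string;
}

const clamp01 = (n: number): number => Math.min(1, Math.max(0, n));

function parseJudgeResponse(text: string): JudgeVerdict {
  // Primary path: first {...} span in the response (flat objects only).
  const jsonMatch = text.match(/\{[\s\S]*?\}/);
  if (jsonMatch) {
    try {
      const parsed: unknown = JSON.parse(jsonMatch[0]);
      if (typeof parsed === "object" && parsed !== null) {
        const record = parsed as Record<string, unknown>;
        const score = record["score"];
        const reasoning = record["reasoning"];
        if (typeof score === "number" && Number.isFinite(score)) {
          return { score: clamp01(score), reasoning: typeof reasoning === "string" ? reasoning : "" };
        }
      }
    } catch {
      // Not valid JSON: fall through to the bare-number path.
    }
  }
  // Fallback: first bare number anywhere in the text.
  const numberMatch = text.match(/-?\d+(?:\.\d+)?/);
  if (numberMatch) return { score: clamp01(Number(numberMatch[0])), reasoning: text };
  // Last resort: score 0, keeping the raw text for debugging.
  return { score: 0, reasoning: text };
}
```

Clamping in both numeric paths keeps an over-enthusiastic judge (for instance a score of 1.4) from pushing an example past the gate.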

…tors) Proves the eval harness end-to-end against the existing manage-rules skill. Positives (drm-pos-01..04): natural-language queries about viewing/finding detection rules — the LLM should call manage-rules. Evaluated with skill-activation + tool-selection (≥80% gate). Distractors (drm-neg-01..04): case creation, alert triage, ES|QL hunting, host investigation — the LLM should NOT call manage-rules. Evaluated with negative-activation (100% gate — any false positive is a regression). Two separate runDataset calls wire the correct evaluators and thresholds to each example group without mixing evaluator semantics across types. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… test orchestration

Separates data from test concerns:

- detection-rule-management.dataset.ts now only exports data (positiveExamples, distractorExamples, detectionRuleManagementDataset); no runDataset calls
- detection-rule-management.eval.test.ts is the Vitest entry point that imports the sub-arrays and calls runDataset with the correct evaluators

Gate layout (unchanged from before):

- positives: skill-activation + tool-selection, passingScore: 0.8
- distractors: negative-activation, passingScore: 1.0

The .eval.test.ts suffix matches the include glob in evals/vitest.config.ts so `npm run test:evals` picks it up without further config changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Triggers:

- workflow_dispatch: manual run from Actions UI
- schedule (0 2 * * *): nightly at 02:00 UTC
- pull_request_target: only when 'evals' label is added; gated by label write permission so only maintainers can trigger

Concurrency group 'evals-<ref>' cancels in-progress runs on new pushes, preventing redundant jobs from burning LLM quota.

The 'Run evals' step sets RUN_LLM_EVALS=1 and passes four secrets:

- EVAL_ANTHROPIC_API_KEY: Claude Haiku (priority)
- EVAL_OPENAI_API_KEY: GPT-4o-mini fallback
- EVAL_LITELLM_BASE_URL: optional LiteLLM proxy base URL
- EVAL_CLUSTERS_JSON: Elastic cluster credentials for the MCP server

Output is captured with tee so it appears in the job log AND in eval-output.txt. A separate 'Post eval results' step (if: always()) appends '## Eval results' plus the full output to $GITHUB_STEP_SUMMARY so the rendered Markdown tables from the runner appear in the Actions job summary.

For pull_request_target the checkout uses the PR head SHA so evals run against the proposed changes rather than the base branch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…, CI gating

Covers:

- Architecture diagram showing runner → runMcpHostLoop → evaluators pipeline
- Key design choices table (in-process transport, skip-if guard, N/A semantics)
- Dataset shape reference with all three optional expected fields documented
- Positive vs distractor example pattern with runDataset code snippets
- Evaluator catalog: type, score range, N/A condition, and recommended gate for all five evaluators (skill-activation, negative-activation, tool-selection, trajectory, criteria)
- Step-by-step how-to-add-dataset guide with copy-paste templates
- CI gating: workflow triggers, required secrets table, passing threshold table

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ibana routes

Service injects KibanaClient directly (no separate *Client indirection since these are internal-only Kibana routes with no public API equivalent). The KibanaClient already supplies x-elastic-internal-origin: Kibana; each method adds elastic-api-version: 2023-10-31 via MIGRATION_HEADERS per-request.

14 methods, one per route:

- createMigration: POST /internal/siem_migrations/rules
- listMigrations: GET /internal/siem_migrations/rules
- getMigration: GET /internal/siem_migrations/rules/:id
- deleteMigration: DELETE /internal/siem_migrations/rules/:id
- uploadRules: POST /internal/siem_migrations/rules/:id/rules
- getTranslatedRules: GET /internal/siem_migrations/rules/:id/rules
- getTranslatedRule: GET /internal/siem_migrations/rules/:id/rules/:ruleId
- updateTranslatedRule: PUT /internal/siem_migrations/rules/:id/rules/:ruleId
- startTranslation: POST /internal/siem_migrations/rules/:id/start
- stopTranslation: POST /internal/siem_migrations/rules/:id/stop
- getResources: GET /internal/siem_migrations/resources/:id
- upsertResources: POST /internal/siem_migrations/resources/:id
- installRules: POST /internal/siem_migrations/rules/:id/install
- getStats: GET /internal/siem_migrations/rules/:id/stats

MigrationApiError wraps every non-2xx response with typed status (extracted from the Kibana client's "Kibana [cluster] STATUS: body" error format) and the request path so callers can surface actionable error messages.

Domain types: SiemMigration, TranslatedRule, MigrationResource, MigrationStats and associated option/result interfaces, all barrel-exported from service/index.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
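The status extraction described above can be sketched as follows. The class fields mirror the commit message (typed status plus request path), but the regular expression is an assumption about how the "Kibana [cluster] STATUS: body" format looks in practice, the `toMigrationApiError` helper is hypothetical, and the `status = 0` fallback is taken from the service test commit.

```typescript
// Sketch of the error wrapper; the real class may carry more fields.
class MigrationApiError extends Error {
  readonly status: number;
  readonly path: string;

  constructor(message: string, status: number, path: string) {
    super(message);
    this.name = "MigrationApiError";
    this.status = status;
    this.path = path;
  }
}

// Hypothetical helper: parses the HTTP status out of the Kibana client's
// error message, falling back to 0 when the format does not match.
function toMigrationApiError(err: unknown, path: string): MigrationApiError {
  const message = err instanceof Error ? err.message : String(err);
  // Assumed format: "Kibana [<cluster>] <3-digit status>: <body>"
  const match = /^Kibana \[[^\]]*\] (\d{3}):/.exec(message);
  const status = match ? Number(match[1]) : 0;
  return new MigrationApiError(message, status, path);
}
```

A caller can then branch on `status` (for example 404 versus 0 for a transport failure) and include `path` in the message shown to the user.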

…rror handling

19 tests across 14 describe blocks (one per route method plus three error-handling tests):

- Migration lifecycle: createMigration, listMigrations, getMigration, deleteMigration
- Rule upload: uploadRules
- Translated rules: getTranslatedRules (default+custom pagination), getTranslatedRule, updateTranslatedRule
- Translation control: startTranslation, stopTranslation
- Resources: getResources, upsertResources
- Installation: installRules (no-ids + with-ids)
- Stats: getStats
- MigrationApiError: status parsed from Kibana error format; status=0 fallback; all mutating methods surface MigrationApiError

Also adds `put: vi.fn()` to MockHttpClient / makeMock in mockHttpClient.ts so MigrationsService.updateTranslatedRule can be exercised.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

migrate-rules (model-facing): _meta.ui.resourceUri = ui://migrate-rules/mcp-app.html. Callback seeds the workbench with a compact migration list so the LLM gets immediate context.

App-only tools (_meta.ui.visibility: ["app"]):

- list-migrations: GET all migrations
- get-migration: GET single migration by ID
- get-translated-rules: paginated translated rule listing (vendor-gated)
- start-translation: kick off AI translation (vendor-gated)
- stop-translation: halt in-progress translation (vendor-gated)
- update-translated-rule: patch elastic_rule / translation_result / comments (vendor-gated)
- get-resources: list macros/lookups (vendor-gated)
- upsert-resource: create/replace single macro or lookup (vendor-gated)
- install-rules: install translated rules, optional id filter (vendor-gated)
- get-stats: per-migration translation/installation stats

Vendor gate: SUPPORTED_VENDORS = ["splunk"]. If a vendor param is provided and not in the list, returns { error: "vendorNotSupported", vendor } without hitting Kibana. Re-enabling a vendor is a one-line change to the constant.

Also registers the migration workbench HTML via registerAppResource; the view file is resolved at request time (resolveViewPath("migration")) so the tool works once the view is built in a subsequent commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

20 tests covering:

- Registration: all 11 tools + HTML resource registered under the correct names
- migrate-rules: workbench message + compact migration list returned to LLM
- app-only tool happy paths: list-migrations, get-migration, get-translated-rules (with pagination), start-translation, stop-translation, update-translated-rule (parses elasticRule JSON), get-resources, upsert-resource (single-element array), install-rules (with ids), get-stats
- Vendor gating (per gated tool):
  - vendor="qradar" / "sentinel-one" / unknown → { error: "vendorNotSupported" } without calling the service
  - vendor absent → proceeds (defaults to Splunk path)
- get-stats has no vendor gate (confirmed by calling without vendor)

Also adds createMockMigrationsService() to mockServices.ts covering all 14 MigrationsService methods.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

src/views/migration/App.tsx: full state machine.

WorkbenchState discriminated union (8 stages):
- vendor-select → user picks vendor → creates migration
- upload → paste Splunk rules JSON → upload + start translation
- translating → polls get-stats every 3s → advances on completion
- review → lists translated rules with status badges + fix actions
- fix-rule-drawer → slide-over editor for single rule JSON + result enum
- fix-resources-drawer → slide-over for macro/lookup create/update
- install → confirmation step before calling install-rules
- done → success summary with installed/failed counts

Vendor gate (5-LOC client check): SUPPORTED_VENDORS = ["splunk"]. VENDOR_CATALOGUE entries not in SUPPORTED_VENDORS render as disabled with a "Coming soon" badge; re-enabling a vendor is a one-line change.

MCP integration: all data flows via app.callServerTool() through the 10 app-only tools. The translating stage schedules a 3-second poll loop that stops and transitions to review when stats.rules.processing === 0.

Supporting files:
- mcp-app.html: minimal HTML shell (title: "SIEM Migration")
- mcp-app.tsx: standard React 18 createRoot mount
- styles.css: vendor-grid, upload-area, progress-bar, rule status badges, drawer layout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replaces the custom migration-vendor-card--disabled CSS class with the spec-required Tailwind utilities (opacity-50 + cursor-not-allowed) so the disabled state is expressed as two atomic utility classes rather than a bespoke rule, and removes the now-unused CSS block from styles.css.

The client-side gate remains ≤5 LOC:

    const active = SUPPORTED_VENDORS.includes(id);   // 1 LOC check
    disabled={!active}                               // 1 LOC DOM attr
    onClick={() => active && onSelect(id)}           // 1 LOC guard

Re-enabling a vendor is still a one-line change to SUPPORTED_VENDORS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…-translation call

Upload component now offers three input paths:
1. File picker: hidden <input type="file" accept=".json"> wired to a visible "Choose file…" button; FileReader populates the textarea
2. Drag-and-drop: drop zone tracks dragOver state for visual feedback (border-blue-400 bg-blue-50) and reads the dropped file via FileReader
3. Paste: textarea remains for direct JSON pasting

The "Upload & start translation" button stays disabled until text is non-empty. Clicking it calls onUpload(text), which runs the chain in App: upload-rules → start-translation → get-stats → translating stage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

schedulePoll replaces get-stats with get-migration so progress tracking uses Kibana's authoritative lifecycle status ("ready" | "running" | "finished" | "error") rather than the derived stats endpoint.

Completion condition changed from:
    stats.rules.processing === 0 && stats.status !== "running"
to:
    migration.status === "finished" || migration.status === "error"

This is both more precise (avoids a brief window where processing can be 0 mid-run) and aligns with the Kibana status contract.

MigrationStats type gains the narrowed status union and an optional name field so the same shape works for both get-migration and get-stats responses without a separate type.

Translating component gains an error-state branch: when status is "error" the heading says "Translation encountered an error" and the progress bar is hidden, letting the workbench advance to review with whatever partial results Kibana returned.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
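The new completion condition can be sketched as a terminal-status predicate plus a poll loop. This is an illustration under assumptions, not the code in App.tsx: `isTerminal`, `pollUntilDone`, `fetchStatus`, and `sleep` are hypothetical names, and the 3000 ms default mirrors the 3-second interval mentioned in the earlier workbench commit.

```typescript
// Lifecycle statuses quoted in the commit message.
type MigrationStatus = "ready" | "running" | "finished" | "error";

// Terminal means the poll loop stops and the workbench can move to review.
function isTerminal(status: MigrationStatus): boolean {
  return status === "finished" || status === "error";
}

// Hypothetical poll loop. `fetchStatus` stands in for a get-migration call
// and `sleep` is injected so the interval can be replaced in tests.
async function pollUntilDone(
  fetchStatus: () => Promise<MigrationStatus>,
  sleep: (ms: number) => Promise<void>,
  intervalMs: number = 3000,
): Promise<MigrationStatus> {
  for (;;) {
    const status = await fetchStatus();
    if (isTerminal(status)) return status;
    await sleep(intervalMs);
  }
}
```

Because "error" is terminal too, the caller receives the final status and can decide whether to render the error-state heading before advancing to review.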

…le Monaco)

Review step now expands any rule row inline to show RuleDiff, a three-column panel that renders the full diff/fix UX without leaving the review list:
- Left: Original SPL (plain <pre>, read-only). Shows rule.original_rule.search or falls back to full original_rule JSON if the search field is absent.
- Middle: Generated Elastic rule JSON (read-only Monaco, language=json). Shows the rule.elastic_rule output from the AI translator.
- Right: User-editable version (Monaco, language=json). Seeded from the generated JSON, editable by the reviewer, saved via update-translated-rule.

Footer bar: translation-result enum selector + Cancel / Save buttons.

Clicking a rule row toggles the inline diff; clicking again or Cancel collapses. A "Drawer" button remains for partial/untranslatable rules that need the full slide-over editor.

saveRuleInline callback in App handles update-translated-rule from the review state directly, bypassing the fix-rule-drawer state transition.

monaco-environment.ts added (mirrors threat-hunt) so the inlined bundle can resolve the editor worker without fetching external chunks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tton Replaces the bare JSON textarea in RuleDrawer with a structured form covering the 7 key Elastic detection rule fields (name, description, type, query, language, severity, risk_score). The Re-validate button saves the current edits and marks the rule as "partial" via update-translated-rule; Save uses the user-selected translation result. Adds .migration-form-input CSS for consistent field styling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ed highlighting

Replaces the single add-form drawer with per-resource inline edit rows:
- Unresolved resources (empty content) are auto-expanded and rendered with a yellow border/background so they are immediately actionable
- Each row has an individual Save button calling upsert-resource
- Resolved resources are collapsed by default but expandable for edits
- An "Add resource" section at the bottom handles net-new entries
- saveResources now stays in fix-resources-drawer after upsert (refreshes the list) so users can fix multiple resources in one session; closeDrawer transitions back to review as before

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix: install stage was missing resources, so closeDrawer could not restore the full review state.

Now:
- WorkbenchState.install carries resources alongside translations
- startInstall passes resources when entering the stage
- closeDrawer handles install → review (joins the existing fix-*-drawer → review paths), making the "Back to review" button functional
- confirmInstall calls install-rules and transitions to done with installed/failed counts
- Done step shows KpiStrip with installed/failed tiles and a reset action

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
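To illustrate why the install stage has to carry resources, here is a reduced sketch of the relevant slice of the state machine. The `stage` discriminant, the `Rule` / `Resource` shapes, and the `ruleId` field are assumptions made for the example; the real WorkbenchState in App.tsx has eight stages and may use different field names.

```typescript
// Placeholder payload shapes (assumed, not the real Kibana types).
type Rule = { id: string };
type Resource = { name: string };

// Everything the review list needs in order to be rebuilt.
type ReviewData = { translations: Rule[]; resources: Resource[] };

// Reduced union: only the stages involved in the closeDrawer transitions.
type WorkbenchState =
  | ({ stage: "review" } & ReviewData)
  | ({ stage: "fix-rule-drawer"; ruleId: string } & ReviewData)
  | ({ stage: "fix-resources-drawer" } & ReviewData)
  | ({ stage: "install" } & ReviewData)
  | { stage: "done"; installed: number; failed: number };

// install joins the fix-*-drawer stages: all three return to review,
// which is only possible because each of them carries ReviewData.
function closeDrawer(state: WorkbenchState): WorkbenchState {
  switch (state.stage) {
    case "fix-rule-drawer":
    case "fix-resources-drawer":
    case "install":
      return {
        stage: "review",
        translations: state.translations,
        resources: state.resources,
      };
    default:
      return state; // review and done are unaffected
  }
}
```

If the install variant lacked `resources`, the `return` above would not type-check, which is the compile-time form of the bug this commit fixes.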

Monaco editor added ~4.8 MB to the bundle (editor library + inlined editor.worker). To meet the < 1 MB singlefile target, Monaco is removed from the migration view:
- RuleDiff generated column: Monaco read-only → <pre> (same class as SPL)
- RuleDiff editable column: Monaco Editor → <textarea> with matching monospace style (.migration-diff-textarea)
- RuleDrawer: already uses structured form inputs, not Monaco; unchanged
- Removed monaco-environment import from mcp-app.tsx entry point

Output: 364 kB uncompressed (105 kB gzip), a single self-contained mcp-app.html with no companion worker files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Host-side skill prompt for the SIEM migration workflow. Covers:
- YAML frontmatter with trigger phrases (migrate my Splunk rules, import SPL, onboard from Splunk, SIEM migration, convert detection rules)
- Tools table separating the model-facing migrate-rules entry-point from the 10 workbench-only app tools
- Workbench Lifecycle table documenting all 8 stages with what the user does and what signals completion
- Correction strategy: start-over, back-from-install, re-edit rule, re-edit resource, restart translation
- Common gotchas: vendor gate, direct tool calls, upload format, partial translations, macro/lookup resolution, large rule sets, re-opening existing migrations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Positives cover the five spec trigger phrases (migrate Splunk rules, upload SPL bundle, onboard from Splunk, SIEM migration, convert detection rules) plus an install-translated-rules variant. Distractors span the other five skills (detection-rule-management, alert-triage, threat-hunt, case-management, generate-sample-data) to test boundary discrimination. All examples set expected.skill so the negative-activation evaluator can gate on migrate-rules absence in distractor runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…100%)

Two runDataset calls mirroring the detection-rule-management pattern:
- positives: skill-activation + tool-selection evaluators, passingScore 0.8
- distractors: negative-activation evaluator, passingScore 1.0 (any false positive on migrate-rules is treated as a regression)

Suite is skipped in regular npm test via describe.skipIf(!RUN_LLM_EVALS).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Imports MigrationsService from elastic/service/index and registerMigrationTools from tools/migration
- Instantiates migrationsService with the shared kibanaClient
- Calls registerMigrationTools after the other six tool registrations
- Updates integration test snapshots: +11 migration tool names and +1 UI resource URI (ui://migrate-rules/mcp-app.html)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds migrate-rules to the tools[] array so the MCP app marketplace advertises the new automatic migration capability. Version bumped to 1.1.0 (minor) to signal the new feature surface. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Updates the tool count from six to seven and adds a row for the new SIEM Migration feature (migrate-rules tool + workbench). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Documents the full migrate-rules workbench workflow: vendor selector, upload, AI translation with progress bar, three-column rule review, per-rule drawer (ElasticRulePartial form), resources drawer with per-row inline save, translation statuses, install step, and done summary. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cla-checker-service · 2026-05-15T11:28:42Z

❌ Author of the following commits did not sign a Contributor Agreement:
06d830c

Please, read and sign the above mentioned agreement if you want to contribute to this project

- Add evals/harness.test.ts: always-on mock-based integration tests that exercise the full eval pipeline (runMcpHostLoop → evaluators) for both detection-rule-management and automatic-migration datasets without API keys or a live cluster. Passes 100% on all gates (tool-selection ≥ 80%, negative-activation = 100%).
- Add evals/helpers/evalServer.ts: shared factory that creates a real McpServer backed by stub services; used by both harness.test.ts and the LLM eval suites so neither needs CLUSTERS_JSON.
- Update evals/runner.ts: add optional createServer factory to RunnerOptions (injected per-example since InMemoryTransport is single-use); also widen skipIf to skip gracefully when RUN_LLM_EVALS=1 but no API key is configured.
- Update evals/vitest.config.ts: remove dataset files from include. *.dataset.ts files contain no test suites and were causing "no test suite found" failures.
- Update both *.eval.test.ts files to pass createEvalServer so the LLM eval suites no longer require a live Elastic cluster.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The OpenAI adapter already accepted a `model` constructor option; this pipes it through `createDefaultLlmProvider()` so operators can run the eval suite against a local Ollama daemon at no cost:

    OPENAI_API_KEY=ollama \
    LITELLM_BASE_URL=http://localhost:11434/v1 \
    OPENAI_MODEL=llama3.1:8b \
    npm run test:evals

Default behaviour (gpt-4o-mini when only OPENAI_API_KEY is set) is unchanged because `OpenAiProvider`'s `model = DEFAULT_MODEL` default kicks in for `undefined`.
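The reason the default stays intact is a language-level detail: a destructuring default applies whenever the property is `undefined`, including when it is passed explicitly. A reduced sketch, assuming a constructor that takes an options object; the class body and the `env` parameter are simplifications for illustration, not the real adapter.

```typescript
const DEFAULT_MODEL = "gpt-4o-mini";

// Simplified stand-in for the real adapter; only the model option is shown.
class OpenAiProvider {
  readonly model: string;

  constructor({ model = DEFAULT_MODEL }: { model?: string } = {}) {
    // The default applies when `model` is missing or explicitly undefined.
    this.model = model;
  }
}

// Hypothetical signature: the env is passed in so the sketch has no
// dependency on process.env.
function createDefaultLlmProvider(
  env: Record<string, string | undefined>,
): OpenAiProvider {
  // An unset OPENAI_MODEL is forwarded as undefined, so DEFAULT_MODEL wins.
  return new OpenAiProvider({ model: env.OPENAI_MODEL });
}
```

One caveat of this pattern: an empty string is not `undefined`, so `OPENAI_MODEL=""` would be passed through as the model name rather than falling back.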

Tools registered via `registerAppTool(...)` with `_meta.ui.visibility: ["app"]` are invoked by the React workbench via `app.callServerTool()`. Real MCP hosts (Claude Desktop, Cursor) hide them from the LLM. The eval harness was passing every tool from `client.listTools()` straight to the model, so small open-source models saw `start-translation` / `install-rules` / `find-rules` as alternatives to `migrate-rules` / `manage-rules` and routed there instead, collapsing activation rates and misrepresenting what a real MCP host exposes.

`isVisibleToModel()` mirrors the host-side visibility contract:
- visibility unset → visible (default for model-facing tools)
- visibility includes "model" → visible
- visibility includes "app" without "model" → hidden

Baseline shift on llama3.1:8b (automatic-migration positives):
- before fix: 67% (4/6; model called start-translation / install-rules)
- after fix: 100% (6/6; model called migrate-rules every time)
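The visibility contract above fits in a few lines. This is a sketch under assumptions rather than the harness code: `ToolLike` is a reduced shape invented for the example, and treating a visibility array that names neither "model" nor "app" as hidden is my assumption, since the commit message does not cover that case.

```typescript
// Reduced tool shape; only the visibility metadata matters for this sketch.
interface ToolLike {
  name: string;
  _meta?: { ui?: { visibility?: string[] } };
}

function isVisibleToModel(tool: ToolLike): boolean {
  const visibility = tool._meta?.ui?.visibility;
  // Unset: visible (the default for model-facing tools).
  if (visibility === undefined) return true;
  // Set: visible only when "model" is listed, so ["app"] alone is hidden.
  // Assumption: arrays listing neither "model" nor "app" are hidden too.
  return visibility.includes("model");
}
```

A harness would then hand the model `tools.filter(isVisibleToModel)` instead of the raw `listTools()` result.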

Previously the eval server only registered detection-rules and migration. When a distractor query like "Create a new case for a ransomware incident" hit the LLM, the model had no `manage-cases` option to choose, so it forced a poor match on `manage-rules` and the negative-activation evaluator collapsed.

A real MCP host exposes the full set of model-facing tools, so the eval server should match. Services are stubbed with `vi.fn()` because skill-routing evaluators only inspect which tools were called, not what they returned.

Tool groups registered (mirroring src/server.ts):
- alert-triage → triage-alerts
- attack-discovery → triage-attack-discoveries
- case-management → manage-cases
- detection-rules → manage-rules
- migration → migrate-rules
- sample-data → generate-sample-data
- threat-hunt → threat-hunt

Baseline shift on llama3.1:8b (detection-rule-management distractors):
- before fix: 25% (1/4; manage-rules over-selected on case/ESQL/alert queries)
- after fix: 100% (4/4; model picks manage-cases / threat-hunt / etc. correctly)

docs/evals.md updated with the Ollama route and a note that CLUSTERS_JSON is not required when using createEvalServer.

patrykkopycinski · 2026-05-18T08:21:31Z

Surfaced during end-to-end eval validation against llama3.1:8b (Ollama, local) — not a blocker for this PR, but worth flagging since the data is fresh.

After the three harness fixes in 621b309 / 2ebbf54 / 0543e20 the manage-rules model-facing tool didn't activate on these two positive examples:

Example	Query	Trajectory
`drm-pos-01`	"Show me my noisy rules — which detection rules are generating the most alerts"	`[empty]` — model returned no tool call
`drm-pos-03`	"Find high severity detection rules related to PowerShell execution"	`[empty]` — model returned no tool call

Looking at src/tools/detection-rules.ts the manage-rules description is framed around management verbs (enable, disable, list, manage). Both failing queries are framed around discovery verbs ("show me", "find"), which an 8B-class model reads as "the user wants information I can answer from training data, not a tool to call". A Claude / GPT-4o class model routes correctly per the description specificity — that's tracked as the post-merge nightly baseline.

A minimal lift on the description that would likely close the gap on smaller models:

description:
  "List, find, search, show, query, review, audit, enable, disable, or otherwise " +
  "manage detection rules in Elastic Security. Use for ANY question that requires " +
  "inspecting the current rule catalog (noisy rules, rule coverage by ATT&CK, " +
  "rules matching a string, rule status, etc.) and for enabling / disabling rules. " +
  "Returns rule metadata; opens the rule management workbench for bulk operations.",

I'm not amending the diff in this PR since manage-rules is the pre-existing skill — it's the maintainer's call whether to ship this as part of #32 or as a follow-up. The migrate-rules skill (the one this PR adds) scored 6/6 on positives + 6/6 on distractors against the same llama3.1:8b run, so the new feature is fine; this is a heads-up on a description-quality finding the new harness now makes measurable.

Full baseline table in the PR body under "Eval baseline (captured end-to-end, this PR)".

Adds an optional host-level system prompt to the in-process MCP host loop so the harness can pin LLM behavior to what a real MCP host would instruct. Real hosts (Claude Desktop, Cursor) inject a system prompt that constrains tool selection, response shape, and HITL confirmation flow. Without one, the harness measures raw model-vs-tools behavior, which over- or under-reports activation depending on the model family.

Wired end-to-end:
- HostLoopOptions.systemPrompt: optional string; empty/whitespace treated identically to omitting (the absence is observable in evals).
- LlmMessage gains a `system` role variant so the prompt flows through the unified message shape both adapters consume.
- OpenAI adapter: appends `role: "system"` as a normal message (Chat Completions schema accepts it natively).
- Anthropic adapter: strips system-roled messages from the array and passes them via the top-level `system` parameter on `messages.create`, the only place Anthropic accepts a system prompt. The `toAnthropicMessages` helper's parameter type is narrowed to `Exclude<LlmMessage, { role: "system" }>` so the invariant is enforced at the type system, not in prose.

Tests:
- 3 new harness tests covering the propagation contract:
  (a) systemPrompt is the first message when provided
  (b) no system message is injected when omitted
  (c) empty / whitespace-only strings are treated as omitted
- All 23 harness tests pass (was 20).
- Tests use a recording-LLM provider so the assertion is on what the adapter actually received, not on response side effects.

Docs:
- docs/evals.md gains a "Host system prompt" section explaining the contract + provider-specific handling.
- Drive-by: the Ollama example switched from `qwen2.5:32b-instruct-q4_K_M` (exposes /generate only, returns "does not support chat" against this harness) to `llama3.1:8b` which speaks the OpenAI Chat Completions schema. Caught end-to-end while validating the harness.

Anti-overengineering self-check:
- Gate 1 (existing abstraction): HostLoopOptions already exists. `systemPrompt?: string` slots in without a new interface.
- Gate 2 (real consumer): the next eval suite that wants to mimic Claude Desktop's HITL prompt; SKILL.md-driven evals that need the skill body as system context.
- Gate 3 (smallest in-place): one new optional field, one new role variant, two adapter cases, three tests. ~30 LOC of behavior change excluding tests + docs.
- Gate 6 (cost): default-off, no impact on existing callers.
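The propagation contract and the Anthropic-side split can be sketched as two small pure functions. Everything below is illustrative: the `LlmMessage` variants are reduced to text-only messages, `withSystemPrompt` and `splitSystem` are hypothetical names, and joining several system messages with a blank line is an assumption the commit message does not specify.

```typescript
// Reduced message shape; the real LlmMessage likely has tool-call variants too.
type LlmMessage =
  | { role: "system"; content: string }
  | { role: "user"; content: string }
  | { role: "assistant"; content: string };

// Same narrowing the commit describes for toAnthropicMessages.
type NonSystemMessage = Exclude<LlmMessage, { role: "system" }>;

// Host loop side: prepend the prompt, treating empty/whitespace as omitted.
function withSystemPrompt(messages: LlmMessage[], systemPrompt?: string): LlmMessage[] {
  if (systemPrompt === undefined || systemPrompt.trim() === "") return messages;
  return [{ role: "system", content: systemPrompt }, ...messages];
}

// Anthropic side: pull system messages out so they can be passed through the
// top-level `system` parameter instead of the messages array.
function splitSystem(
  messages: LlmMessage[],
): { system?: string; messages: NonSystemMessage[] } {
  const systemParts: string[] = [];
  const rest: NonSystemMessage[] = [];
  for (const m of messages) {
    if (m.role === "system") systemParts.push(m.content);
    else rest.push(m); // narrowed to NonSystemMessage by the discriminant
  }
  return systemParts.length > 0
    ? { system: systemParts.join("\n\n"), messages: rest }
    : { messages: rest };
}
```

With `rest` typed as `NonSystemMessage[]`, any downstream helper that accepts only that type cannot be handed a system message by accident.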

llama3.1:8b is below the threshold where tool-calling decisions produce useful signal (team eval finding: ≥14B parameters is the floor). Sub-14B 'passes' are coincidence, not a result, so documenting an 8B as the 'good baseline' propagates a floor that masks real harness bugs (elastic#25/elastic#26/elastic#27) and green-lights skills that aren't ready. Replace with the explicit ≥14B parameter requirement, a chat-completions caveat (qwen2.5:32b-instruct-q4_K_M legitimately returns 'does not support chat' against /v1/chat/completions as of Ollama 0.3.x), and verified candidates the next reader can pull. See elastic/agent-builder-skill-dev-cursor-plugin anti-pattern elastic#28 for the full rationale.

patrykkopycinski and others added 30 commits May 15, 2026 10:55

ao(create-evals-types-ts-with-typescript-definitions--0): Create `eva…

06d830c

…ls/types.ts` with TypeScript definitions for `Dataset`, `Exam Auto-committed by patryks-treadmill orchestrator. plan=automatic-migration-mcp-app job=64319163-2da8-44b5-b087-3dee6e9e4c14 attempt=1

evals: add --reporter=verbose to test:evals script

ab6ac67

Makes per-example test names visible in CI output and in the GitHub Actions job summary, which is where the Markdown eval table lands. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

patrykkopycinski and others added 6 commits May 15, 2026 13:11

docs: add SIEM Migration to README features table

76fdde9

Updates the tool count from six to seven and adds a row for the new SIEM Migration feature (migrate-rules tool + workbench). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

patrykkopycinski changed the title ~~feat: automatic SIEM migration skill + eval harness~~ feat: automatic-migration skill + Vitest eval harness May 15, 2026

patrykkopycinski marked this pull request as draft May 15, 2026 12:10

patrykkopycinski added 3 commits May 15, 2026 14:25

patrykkopycinski added 2 commits May 18, 2026 10:34

davethegut changed the title ~~feat: automatic-migration skill + Vitest eval harness~~ [Epic] - feat: automatic-migration skill + Vitest eval harness May 28, 2026

davethegut changed the title ~~[Epic] - feat: automatic-migration skill + Vitest eval harness~~ [Epic][Security-MCP] - feat: automatic-migration skill May 28, 2026

[Epic][Security-MCP] - feat: automatic-migration skill #32
patrykkopycinski wants to merge 42 commits into elastic:main from patrykkopycinski:ao/feat-automatic-migration-mcp-app

Known limitations (per `address-known-limitations.mdc` triage)