[Epic][Security-MCP] - feat: automatic-migration skill #32
patrykkopycinski wants to merge 42 commits into
Introduces the canonical TypeScript type definitions for the eval pipeline:
- `ToolCall` / `Trajectory`: MCP host loop output primitives
- `ExpectedBehavior`: optional `tools`, `criteria`, `skill` fields (evaluators
  return `'N/A'` when a field they need is absent)
- `Example` / `Dataset`: test-case and collection shapes
- `EvaluatorResult` / `EvalResult`: per-evaluator and per-example results
- `Evaluator`: async-compatible function contract all evaluator modules satisfy
Also adds `evals/**/*` to tsconfig.json includes so tsc covers eval files.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
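As a rough illustration of the shapes listed above, the sketch below shows one way they could fit together. Only the three optional `expected` fields and the `'N/A'` convention come from the commit message; every other field name (`id`, `input`, `finalText`, and so on) and the `isNotApplicable` helper are assumptions made for the example.

```typescript
// Hypothetical reconstruction; field names other than tools/criteria/skill
// are assumptions, not the contents of evals/types.ts.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
  result?: unknown;
}

interface Trajectory {
  toolCalls: ToolCall[];
  finalText: string;
}

interface ExpectedBehavior {
  tools?: string[];
  criteria?: string;
  skill?: string;
}

interface Example {
  id: string;
  input: string;
  expected: ExpectedBehavior;
}

interface Dataset {
  name: string;
  examples: Example[];
}

interface EvaluatorResult {
  evaluator: string;
  score: number | "N/A";
  reason?: string;
}

interface EvalResult {
  exampleId: string;
  results: EvaluatorResult[];
}

// Async-compatible: an evaluator may return a value or a promise of one.
type Evaluator = (
  example: Example,
  trajectory: Trajectory,
) => EvaluatorResult | Promise<EvaluatorResult>;

// Hypothetical helper: an evaluator that needs `field` reports 'N/A' when
// the dataset author left it out.
function isNotApplicable(
  expected: ExpectedBehavior,
  field: keyof ExpectedBehavior,
): boolean {
  return expected[field] === undefined;
}
```

Keeping all three `expected` fields optional is what lets a dataset that only tests routing omit `tools` and `criteria` entirely.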
…ls/types.ts` with TypeScript definitions for `Dataset`, `Exam Auto-committed by patryks-treadmill orchestrator. plan=automatic-migration-mcp-app job=64319163-2da8-44b5-b087-3dee6e9e4c14 attempt=1
…st config
runner.ts exports `runDataset(dataset, evaluators, options?)` which:
- Wraps all examples in `describe.skipIf(!process.env.RUN_LLM_EVALS)` so
  regular `npm test` never makes LLM calls or requires API keys
- Creates one `it` per example: runs runMcpHostLoop, scores via evaluators,
  asserts numeric scores >= passingScore (default 0.5)
- Emits a Markdown table summary via afterAll for CI job summaries
runMcpHostLoop.ts is a typed stub (throws); full InMemoryTransport
implementation comes in the next commit.
evals/vitest.config.ts runs in node environment with 120 s timeout,
scoped to evals/**/*.{test,spec,eval}.ts and *.dataset.ts patterns.
Also:
- Adds `test:evals` script to package.json (cross-env RUN_LLM_EVALS=1)
- Adds evals/**/*.ts to eslint.config.js file patterns so eval files
are linted and license-header-checked
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
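The gate the runner applies per example can be sketched as a pure function. This is an illustration only: `failingEvaluators` and `toMarkdownRow` are hypothetical names, and the real runner performs the check inside Vitest `it` blocks rather than returning a list.

```typescript
type Score = number | "N/A";

interface EvaluatorResult {
  evaluator: string;
  score: Score;
}

// Only numeric scores are gated; 'N/A' results are skipped, not failed.
function failingEvaluators(
  results: EvaluatorResult[],
  passingScore = 0.5,
): string[] {
  return results
    .filter((r) => typeof r.score === "number" && r.score < passingScore)
    .map((r) => r.evaluator);
}

// One Markdown table row per example, suitable for a CI job summary.
function toMarkdownRow(exampleId: string, results: EvaluatorResult[]): string {
  const cells = results.map((r) =>
    typeof r.score === "number" ? r.score.toFixed(2) : "N/A",
  );
  return `| ${[exampleId].concat(cells).join(" | ")} |`;
}
```

A score exactly equal to `passingScore` passes, matching the `>=` comparison described in the commit message.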
…er types
runMcpHostLoop wires an MCP Client to the server via InMemoryTransport
(in-process, no network), lists available tools, and drives a loop of up to
MAX_TURNS=8 turns:
  LLM → tool calls → client.callTool() → result fed back → repeat
Options allow callers to inject a pre-built McpServer (for mocked-service
datasets) or a custom LlmProvider (for deterministic tests). Both default to
the real implementations when omitted.
evals/llm/types.ts introduces the LlmProvider interface and LlmMessage
discriminated union (OpenAI-style, compatible with LiteLLM proxies).
evals/llm/index.ts exposes createDefaultLlmProvider(), which auto-selects by
env var (ANTHROPIC_API_KEY first, then OPENAI_API_KEY); the concrete adapters
(anthropic.ts / openai.ts) land in the next commit. Until they do, this stub
surfaces a clear error.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… provider
OpenAiProvider (evals/llm/openai.ts):
- Implements LlmProvider.chat() via the openai SDK (gpt-4o-mini default)
- Accepts baseURL to point at a LiteLLM proxy for any compatible provider
- Maps LlmMessage ↔ ChatCompletionMessageParam in both directions; narrows
  ChatCompletionMessageToolCall to FunctionToolCall before accessing .function
- Strips tools argument when the list is empty (avoids API errors)
evals/llm/index.ts:
- createDefaultLlmProvider() now returns a real OpenAiProvider when
  OPENAI_API_KEY is set; picks up LITELLM_BASE_URL automatically
- Preserves the ANTHROPIC_API_KEY branch with a clear "coming soon" error
  until evals/llm/anthropic.ts lands
Adds openai@^6.37.0 as a devDependency (npm install --save-dev openai).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…C_API_KEY is set
AnthropicProvider (evals/llm/anthropic.ts):
- Implements LlmProvider.chat() via @anthropic-ai/sdk (claude-haiku-4-5-20251001)
- toAnthropicMessages() handles the structural gap between OpenAI-style
  messages and Anthropic's API: no `tool` role exists; tool results go as
  `user` messages with `tool_result` content blocks; consecutive tool results
  are merged into a single user turn to avoid adjacent-user-turn API errors
- Tool input is round-tripped: JSON.parse (from LlmToolCallRequest.arguments)
  → object for the request, then JSON.stringify back for the response to
  maintain the OpenAI-compatible LlmToolCallRequest shape
- input_schema is cast from LlmToolDefinition.parameters (already JSON Schema)
evals/llm/index.ts:
- createDefaultLlmProvider() now returns AnthropicProvider when
  ANTHROPIC_API_KEY is set (priority 1), falls back to OpenAiProvider for
  OPENAI_API_KEY (priority 2)
Adds @anthropic-ai/sdk@^0.96.0 as a devDependency.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
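The tool-result merging described for `toAnthropicMessages()` can be sketched without the SDK. The message and block types below are simplified, hypothetical stand-ins (field names such as `toolCallId` and `toolCalls` are assumptions), so treat this as an outline of the merging rule rather than the adapter's actual code.

```typescript
// Simplified OpenAI-style unified messages (hypothetical field names).
type LlmToolCallRequest = { id: string; name: string; arguments: string };
type LlmMessage =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; toolCalls?: LlmToolCallRequest[] }
  | { role: "tool"; toolCallId: string; content: string };

// Anthropic-style content blocks, reduced to what this sketch needs.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string };
type AnthropicMessage = { role: "user" | "assistant"; content: ContentBlock[] };

function toAnthropicMessages(messages: LlmMessage[]): AnthropicMessage[] {
  const out: AnthropicMessage[] = [];
  for (const msg of messages) {
    if (msg.role === "tool") {
      const block: ContentBlock = {
        type: "tool_result",
        tool_use_id: msg.toolCallId,
        content: msg.content,
      };
      const last = out[out.length - 1];
      // Merge into the previous turn only if it is a user turn made up
      // entirely of tool results; otherwise open a new user turn.
      if (
        last &&
        last.role === "user" &&
        last.content.every((b) => b.type === "tool_result")
      ) {
        last.content.push(block);
      } else {
        out.push({ role: "user", content: [block] });
      }
    } else if (msg.role === "assistant") {
      const blocks: ContentBlock[] = [];
      if (msg.content) blocks.push({ type: "text", text: msg.content });
      for (const call of msg.toolCalls ?? []) {
        // arguments is a JSON string in the OpenAI-style shape.
        blocks.push({
          type: "tool_use",
          id: call.id,
          name: call.name,
          input: JSON.parse(call.arguments),
        });
      }
      out.push({ role: "assistant", content: blocks });
    } else {
      out.push({ role: "user", content: [{ type: "text", text: msg.content }] });
    }
  }
  return out;
}
```

Two back-to-back `tool` messages therefore collapse into one `user` turn carrying two `tool_result` blocks, which is the case that would otherwise produce adjacent user turns.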
Makes per-example test names visible in CI output and in the GitHub Actions job summary, which is where the Markdown eval table lands. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Returns 1 if the trajectory contains at least one call to the skill's entry-point tool (expected.skill), 0 if not, or 'N/A' when expected.skill is absent so datasets that don't test skill routing can omit the field. The failure reason includes the full tool-name list from the trajectory to make CI output actionable without re-running the eval. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Binary complement of skill-activation: returns 1 when the skill's entry-point tool (expected.skill) is absent from the trajectory (correct — LLM was not falsely triggered), 0 when the tool appears (false positive). Returns 'N/A' when expected.skill is absent, matching the skill-activation convention so both evaluators behave consistently on examples that don't declare a skill. CI gate intent: datasets should require 100% on this evaluator for distractor examples — any false positive means the skill's SKILL.md is over-triggering on unrelated queries in production. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
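The two routing evaluators are mirror images of each other, which a small sketch makes concrete. The function names and the flat `calledTools` argument are assumptions for illustration; the real evaluators take the `Example` / `Trajectory` shapes.

```typescript
type Score = number | "N/A";

interface RoutingResult {
  score: Score;
  reason: string;
}

// 1 when the skill's entry-point tool appears in the trajectory, else 0.
function skillActivation(calledTools: string[], skill?: string): RoutingResult {
  if (skill === undefined) {
    return { score: "N/A", reason: "expected.skill not set" };
  }
  const hit = calledTools.includes(skill);
  return {
    score: hit ? 1 : 0,
    reason: hit
      ? `called ${skill}`
      : `${skill} not called; trajectory tools: [${calledTools.join(", ")}]`,
  };
}

// Binary complement: 1 when the tool is absent (no false positive), else 0.
function negativeActivation(calledTools: string[], skill?: string): RoutingResult {
  if (skill === undefined) {
    return { score: "N/A", reason: "expected.skill not set" };
  }
  const hit = calledTools.includes(skill);
  return {
    score: hit ? 0 : 1,
    reason: hit
      ? `false positive: ${skill} was called`
      : `${skill} correctly not called`,
  };
}
```

On any example that declares `expected.skill`, the two scores always sum to 1, so a dataset picks whichever evaluator matches the example's intent (positive or distractor).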
…cted.tools) Computes set-based precision, recall, and F1 against expected.tools. Deduplicates both the trajectory and the expected list — order/repetition is the trajectory evaluator's job. Score = F1 ∈ [0, 1]. Returns 'N/A' when expected.tools is absent so datasets that only test skill routing don't need to declare tool lists. The reason string includes missed and extra tool names to make CI failures immediately actionable without re-running the eval. CI gate intent: ≥0.8 (80%) on positive examples. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
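A minimal sketch of the set-based precision / recall / F1 computation follows. The function name is hypothetical, and scoring two empty sets as 1 is an assumption the commit message does not address.

```typescript
interface ToolSelectionScore {
  precision: number;
  recall: number;
  f1: number;
  missed: string[];
  extra: string[];
}

function toolSelectionF1(actual: string[], expected: string[]): ToolSelectionScore {
  // Deduplicate both sides; order and repetition are ignored here.
  const actualSet = new Set(actual);
  const expectedSet = new Set(expected);
  const actualList = Array.from(actualSet);
  const expectedList = Array.from(expectedSet);

  const hits = actualList.filter((t) => expectedSet.has(t)).length;
  const missed = expectedList.filter((t) => !actualSet.has(t));
  const extra = actualList.filter((t) => !expectedSet.has(t));

  // Assumption: nothing expected and nothing called counts as a perfect match.
  if (actualSet.size === 0 && expectedSet.size === 0) {
    return { precision: 1, recall: 1, f1: 1, missed, extra };
  }

  const precision = actualSet.size === 0 ? 0 : hits / actualSet.size;
  const recall = expectedSet.size === 0 ? 0 : hits / expectedSet.size;
  const f1 =
    precision + recall === 0
      ? 0
      : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1, missed, extra };
}
```

For example, a trajectory of `["a", "b", "b"]` against expected `["a", "c"]` deduplicates to `{a, b}` versus `{a, c}`: one hit, precision 0.5, recall 0.5, F1 0.5, with `c` missed and `b` extra.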
Computes score = lcs(actual, expected) / max(|actual|, |expected|). Dividing by the max penalises both missing tools (recall gap) and extra spurious tools (precision gap) in a single metric. Sequence matters here, unlike tool-selection which is set-based. Returns 'N/A' when expected.tools is absent — this guard prevents the evaluator from emitting meaningless 0-scores on examples that declare no ordered expectation, which would mask real regressions elsewhere. LCS is O(m·n) time via a flat DP array to avoid nested-array allocation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
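The LCS-based score with a flat DP table might look like the sketch below. Function names are hypothetical, and returning 1 when both sequences are empty is an assumption (the described evaluator returns 'N/A' before reaching that case when `expected.tools` is absent).

```typescript
// Longest common subsequence length, O(m·n) time, using one flat typed
// array indexed as dp[i * width + j] instead of an array of arrays.
function lcsLength(a: string[], b: string[]): number {
  const m = a.length;
  const n = b.length;
  const width = n + 1;
  const dp = new Uint32Array((m + 1) * width); // zero-initialised
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      dp[i * width + j] =
        a[i - 1] === b[j - 1]
          ? dp[(i - 1) * width + (j - 1)] + 1
          : Math.max(dp[(i - 1) * width + j], dp[i * width + (j - 1)]);
    }
  }
  return dp[m * width + n];
}

// score = lcs(actual, expected) / max(|actual|, |expected|)
function trajectoryScore(actual: string[], expected: string[]): number {
  const denom = Math.max(actual.length, expected.length);
  if (denom === 0) return 1; // assumption: two empty sequences match
  return lcsLength(actual, expected) / denom;
}
```

With actual `["a", "x", "b"]` and expected `["a", "b"]`, the LCS is 2 and the longer sequence has 3 entries, so the spurious `x` pulls the score down to 2/3. Swapping the order (`["b", "a"]` versus `["a", "b"]`) gives 1/2, which the set-based tool-selection metric would score as a perfect match.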
createCriteriaEvaluator(llm) returns an Evaluator that sends the trajectory
and expected.criteria to a judge LLM with a structured rubric prompt asking
for JSON {score, reasoning}. Returns 'N/A' when expected.criteria is absent.
The factory pattern closes over the LLM provider so datasets can inject
different judges (e.g. a stronger model for criteria, haiku for routing).
Parsing: primary path extracts the first JSON object from the response and
clamps score to [0, 1]. Falls back to a bare-number regex for models that
ignore the JSON instruction, and finally returns score=0 with the raw text
if neither succeeds.
The judge prompt serialises only {tool, args} per call — omitting result
avoids token bloat from large tool outputs while still giving the judge
enough signal to evaluate routing decisions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
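The three-stage parsing fallback can be sketched as below. `parseJudgeResponse` is a hypothetical name, and the regexes are assumptions: in particular, the JSON pattern here spans from the first `{` to the last `}`, which is a simplification of "extract the first JSON object".

```typescript
interface JudgeVerdict {
  score: number;
  reasoning: string;
}

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

function parseJudgeResponse(raw: string): JudgeVerdict {
  // 1. Primary path: a JSON object somewhere in the response.
  const jsonMatch = raw.match(/\{[\s\S]*\}/);
  if (jsonMatch) {
    try {
      const parsed = JSON.parse(jsonMatch[0]) as {
        score?: unknown;
        reasoning?: unknown;
      };
      const score = Number(parsed.score);
      if (Number.isFinite(score)) {
        return {
          score: clamp01(score),
          reasoning: typeof parsed.reasoning === "string" ? parsed.reasoning : "",
        };
      }
    } catch {
      // Malformed JSON: fall through to the bare-number path.
    }
  }

  // 2. Fallback: the first bare number in the text.
  const numMatch = raw.match(/-?\d+(?:\.\d+)?/);
  if (numMatch) {
    return { score: clamp01(Number(numMatch[0])), reasoning: raw };
  }

  // 3. Last resort: score 0, keep the raw text for debugging.
  return { score: 0, reasoning: raw };
}
```

Clamping happens on both numeric paths, so a judge that answers `1.4` or `-2` still yields a score inside [0, 1].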
…tors)
Proves the eval harness end-to-end against the existing manage-rules skill.
Positives (drm-pos-01..04): natural-language queries about viewing/finding
detection rules; the LLM should call manage-rules. Evaluated with
skill-activation + tool-selection (≥80% gate).
Distractors (drm-neg-01..04): case creation, alert triage, ES|QL hunting,
host investigation; the LLM should NOT call manage-rules. Evaluated with
negative-activation (100% gate: any false positive is a regression).
Two separate runDataset calls wire the correct evaluators and thresholds to
each example group without mixing evaluator semantics across types.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… test orchestration
Separates data from test concerns:
- detection-rule-management.dataset.ts now only exports data (positiveExamples,
  distractorExamples, detectionRuleManagementDataset); no runDataset calls
- detection-rule-management.eval.test.ts is the Vitest entry point that imports
  the sub-arrays and calls runDataset with the correct evaluators
Gate layout (unchanged from before):
  positives:   skill-activation + tool-selection, passingScore: 0.8
  distractors: negative-activation, passingScore: 1.0
The .eval.test.ts suffix matches the include glob in evals/vitest.config.ts
so `npm run test:evals` picks it up without further config changes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Triggers:
- workflow_dispatch      manual run from Actions UI
- schedule (0 2 * * *)   nightly at 02:00 UTC
- pull_request_target    only when 'evals' label is added; gated by label
                         write permission so only maintainers can trigger
Concurrency group 'evals-<ref>' cancels in-progress runs on new pushes,
preventing redundant jobs from burning LLM quota.
The 'Run evals' step sets RUN_LLM_EVALS=1 and passes four secrets:
  EVAL_ANTHROPIC_API_KEY   Claude Haiku (priority)
  EVAL_OPENAI_API_KEY      GPT-4o-mini fallback
  EVAL_LITELLM_BASE_URL    optional LiteLLM proxy base URL
  EVAL_CLUSTERS_JSON       Elastic cluster credentials for the MCP server
Output is captured with tee so it appears in the job log AND in eval-output.txt.
A separate 'Post eval results' step (if: always()) appends '## Eval results'
plus the full output to $GITHUB_STEP_SUMMARY so the rendered Markdown tables
from the runner appear in the Actions job summary.
For pull_request_target the checkout uses the PR head SHA so evals run against
the proposed changes rather than the base branch.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, CI gating
Covers:
- Architecture diagram showing runner → runMcpHostLoop → evaluators pipeline
- Key design choices table (in-process transport, skip-if guard, N/A semantics)
- Dataset shape reference with all three optional expected fields documented
- Positive vs distractor example pattern with runDataset code snippets
- Evaluator catalog: type, score range, N/A condition, and recommended gate
  for all five evaluators (skill-activation, negative-activation,
  tool-selection, trajectory, criteria)
- Step-by-step how-to-add-dataset guide with copy-paste templates
- CI gating: workflow triggers, required secrets table, passing threshold table
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ibana routes
Service injects KibanaClient directly (no separate *Client indirection since
these are internal-only Kibana routes with no public API equivalent). The
KibanaClient already supplies x-elastic-internal-origin: Kibana; each method
adds elastic-api-version: 2023-10-31 via MIGRATION_HEADERS per-request.
14 methods, one per route:
  createMigration       POST    /internal/siem_migrations/rules
  listMigrations        GET     /internal/siem_migrations/rules
  getMigration          GET     /internal/siem_migrations/rules/:id
  deleteMigration       DELETE  /internal/siem_migrations/rules/:id
  uploadRules           POST    /internal/siem_migrations/rules/:id/rules
  getTranslatedRules    GET     /internal/siem_migrations/rules/:id/rules
  getTranslatedRule     GET     /internal/siem_migrations/rules/:id/rules/:ruleId
  updateTranslatedRule  PUT     /internal/siem_migrations/rules/:id/rules/:ruleId
  startTranslation      POST    /internal/siem_migrations/rules/:id/start
  stopTranslation       POST    /internal/siem_migrations/rules/:id/stop
  getResources          GET     /internal/siem_migrations/resources/:id
  upsertResources       POST    /internal/siem_migrations/resources/:id
  installRules          POST    /internal/siem_migrations/rules/:id/install
  getStats              GET     /internal/siem_migrations/rules/:id/stats
MigrationApiError wraps every non-2xx response with typed status (extracted
from the Kibana client's "Kibana [cluster] STATUS: body" error format) and the
request path so callers can surface actionable error messages.
Domain types: SiemMigration, TranslatedRule, MigrationResource, MigrationStats
and associated option/result interfaces, all barrel-exported from
service/index.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
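Extracting the status from the "Kibana [cluster] STATUS: body" format could be done roughly as follows. The regex, the `fromKibanaError` factory name, and the constructor signature are assumptions; the status=0 fallback for unparseable messages follows the behaviour the accompanying tests describe.

```typescript
class MigrationApiError extends Error {
  readonly status: number;
  readonly path: string;

  constructor(status: number, path: string, message: string) {
    super(message);
    this.name = "MigrationApiError";
    this.status = status;
    this.path = path;
  }

  // Hypothetical factory: parses "Kibana [cluster] STATUS: body" and falls
  // back to status 0 when the message does not match that format.
  static fromKibanaError(err: unknown, path: string): MigrationApiError {
    const message = err instanceof Error ? err.message : String(err);
    const match = /^Kibana \[[^\]]*\] (\d{3}):/.exec(message);
    const status = match ? Number(match[1]) : 0;
    return new MigrationApiError(status, path, message);
  }
}
```

A service method would then wrap its client call in `try`/`catch` and rethrow `MigrationApiError.fromKibanaError(err, path)`, so callers can branch on `status` (for example 404 versus 409) without re-parsing strings.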
…rror handling
19 tests across 14 describe blocks — one per route method plus three
error-handling tests:
  Migration lifecycle:  createMigration, listMigrations, getMigration, deleteMigration
  Rule upload:          uploadRules
  Translated rules:     getTranslatedRules (default+custom pagination), getTranslatedRule, updateTranslatedRule
  Translation control:  startTranslation, stopTranslation
  Resources:            getResources, upsertResources
  Installation:         installRules (no-ids + with-ids)
  Stats:                getStats
  MigrationApiError:    status parsed from Kibana error format; status=0 fallback;
                        all mutating methods surface MigrationApiError
Also adds `put: vi.fn()` to MockHttpClient / makeMock in mockHttpClient.ts
so MigrationsService.updateTranslatedRule can be exercised.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
migrate-rules (model-facing):
_meta.ui.resourceUri = ui://migrate-rules/mcp-app.html
Callback seeds the workbench with a compact migration list so the LLM
gets immediate context.
App-only tools (_meta.ui.visibility: ["app"]):
  list-migrations         GET all migrations
  get-migration           GET single migration by ID
  get-translated-rules    paginated translated rule listing (vendor-gated)
  start-translation       kick off AI translation (vendor-gated)
  stop-translation        halt in-progress translation (vendor-gated)
  update-translated-rule  patch elastic_rule / translation_result / comments (vendor-gated)
  get-resources           list macros/lookups (vendor-gated)
  upsert-resource         create/replace single macro or lookup (vendor-gated)
  install-rules           install translated rules, optional id filter (vendor-gated)
  get-stats               per-migration translation/installation stats
Vendor gate: SUPPORTED_VENDORS = ["splunk"]. If a vendor param is provided
and not in the list, returns { error: "vendorNotSupported", vendor } without
hitting Kibana. Re-enabling a vendor is a one-line change to the constant.
Also registers the migration workbench HTML via registerAppResource; the view
file is resolved at request time (resolveViewPath("migration")) so the tool
works once the view is built in a subsequent commit.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
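The server-side vendor gate described above amounts to a small guard that runs before any Kibana call. A minimal sketch, assuming a hypothetical `gateVendor` helper name:

```typescript
// Re-enabling a vendor is a one-line change to this constant.
const SUPPORTED_VENDORS: readonly string[] = ["splunk"];

interface VendorNotSupported {
  error: "vendorNotSupported";
  vendor: string;
}

// Returns the error payload for an unsupported vendor, or null when the
// call may proceed. An absent vendor proceeds (defaults to the Splunk path).
function gateVendor(vendor?: string): VendorNotSupported | null {
  if (vendor !== undefined && !SUPPORTED_VENDORS.includes(vendor)) {
    return { error: "vendorNotSupported", vendor };
  }
  return null;
}
```

A gated tool handler would call `gateVendor(args.vendor)` first and return the payload as its result when it is non-null, so the service is never invoked for an unsupported vendor.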
20 tests covering:
  Registration: all 11 tools + HTML resource registered under the correct names
  migrate-rules: workbench message + compact migration list returned to LLM
  app-only tool happy paths:
    list-migrations, get-migration, get-translated-rules (with pagination),
    start-translation, stop-translation, update-translated-rule (parses
    elasticRule JSON), get-resources, upsert-resource (single-element array),
    install-rules (with ids), get-stats
  Vendor gating (per gated tool):
    - vendor="qradar" / "sentinel-one" / unknown → { error: "vendorNotSupported" }
      without calling the service
    - vendor absent → proceeds (defaults to Splunk path)
  get-stats has no vendor gate (confirmed by calling without vendor)
Also adds createMockMigrationsService() to mockServices.ts covering all
14 MigrationsService methods.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
src/views/migration/App.tsx — full state machine:
WorkbenchState discriminated union (8 stages):
  vendor-select         → user picks vendor → creates migration
  upload                → paste Splunk rules JSON → upload + start translation
  translating           → polls get-stats every 3s → advances on completion
  review                → lists translated rules with status badges + fix actions
  fix-rule-drawer       → slide-over editor for single rule JSON + result enum
  fix-resources-drawer  → slide-over for macro/lookup create/update
  install               → confirmation step before calling install-rules
  done                  → success summary with installed/failed counts
Vendor gate (5-LOC client check):
  SUPPORTED_VENDORS = ["splunk"]
  VENDOR_CATALOGUE entries not in SUPPORTED_VENDORS render as disabled
  with "Coming soon" badge; re-enabling a vendor is a one-line change.
MCP integration:
  All data via app.callServerTool() through the 10 app-only tools.
  translating stage schedules a 3-second poll loop that stops and
  transitions to review when stats.rules.processing === 0.
Supporting files:
  mcp-app.html  minimal HTML shell (title: "SIEM Migration")
  mcp-app.tsx   standard React 18 createRoot mount
  styles.css    vendor-grid, upload-area, progress-bar, rule status
                badges, drawer layout
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the custom migration-vendor-card--disabled CSS class with the
spec-required Tailwind utilities (opacity-50 + cursor-not-allowed) so the
disabled state is expressed as two atomic utility classes rather than a
bespoke rule, and removes the now-unused CSS block from styles.css.
The client-side gate remains ≤5 LOC:
    const active = SUPPORTED_VENDORS.includes(id);   // 1 LOC check
    disabled={!active}                               // 1 LOC DOM attr
    onClick={() => active && onSelect(id)}           // 1 LOC guard
Re-enabling a vendor is still a one-line change to SUPPORTED_VENDORS.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-translation call
Upload component now offers three input paths:
1. File picker — hidden <input type="file" accept=".json"> wired to a
visible "Choose file…" button; FileReader populates the textarea
2. Drag-and-drop — drop zone tracks dragOver state for visual feedback
(border-blue-400 bg-blue-50) and reads the dropped file via FileReader
3. Paste — textarea remains for direct JSON pasting
"Upload & start translation" button stays disabled until text is non-empty.
Clicking it calls onUpload(text) which runs the chain in App:
upload-rules → start-translation → get-stats → translating stage
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
schedulePoll replaces get-stats with get-migration so progress tracking
uses Kibana's authoritative lifecycle status ("ready" | "running" |
"finished" | "error") rather than the derived stats endpoint.
Completion condition changed from:
    stats.rules.processing === 0 && stats.status !== "running"
to:
    migration.status === "finished" || migration.status === "error"
This is both more precise (avoids a brief window where processing can
be 0 mid-run) and aligns with the Kibana status contract.
MigrationStats type gains the narrowed status union and an optional name
field so the same shape works for both get-migration and get-stats
responses without a separate type.
Translating component gains an error-state branch: when status is "error"
the heading says "Translation encountered an error" and the progress bar
is hidden, letting the workbench advance to review with whatever partial
results Kibana returned.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
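The revised completion condition can be illustrated with a promise-based loop. This is a simplification: the workbench schedules its poll from React state via `schedulePoll`, whereas the hypothetical `pollUntilTerminal` below takes the status fetcher as a parameter so it can be exercised without Kibana.

```typescript
type MigrationStatus = "ready" | "running" | "finished" | "error";

// Both "finished" and "error" end the translating stage; "error" still
// advances to review with whatever partial results exist.
function isTerminal(status: MigrationStatus): boolean {
  return status === "finished" || status === "error";
}

async function pollUntilTerminal(
  getStatus: () => Promise<MigrationStatus>,
  intervalMs = 3000,
): Promise<MigrationStatus> {
  for (;;) {
    const status = await getStatus();
    if (isTerminal(status)) return status;
    // Wait before the next poll; 3 s matches the interval described above.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Keying on the lifecycle status rather than `processing === 0` means a momentary zero in the derived stats cannot end the loop early.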
…le Monaco)
Review step now expands any rule row inline to show RuleDiff — a three-column
panel that renders the full diff/fix UX without leaving the review list:
  Left:   Original SPL (plain <pre>, read-only): shows rule.original_rule.search
          or falls back to full original_rule JSON if the search field is absent.
  Middle: Generated Elastic rule JSON (read-only Monaco, language=json):
          shows the rule.elastic_rule output from the AI translator.
  Right:  User-editable version (Monaco, language=json): seeded from the
          generated JSON, editable by the reviewer, saved via update-translated-rule.
  Footer bar: translation-result enum selector + Cancel / Save buttons.
Clicking a rule row toggles the inline diff; clicking again or Cancel collapses.
A "Drawer" button remains for partial/untranslatable rules that need the full
slide-over editor.
saveRuleInline callback in App handles update-translated-rule from the review
state directly, bypassing the fix-rule-drawer state transition.
monaco-environment.ts added (mirrors threat-hunt) so the inlined bundle can
resolve the editor worker without fetching external chunks.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tton Replaces the bare JSON textarea in RuleDrawer with a structured form covering the 7 key Elastic detection rule fields (name, description, type, query, language, severity, risk_score). The Re-validate button saves the current edits and marks the rule as "partial" via update-translated-rule; Save uses the user-selected translation result. Adds .migration-form-input CSS for consistent field styling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ed highlighting
Replaces the single add-form drawer with per-resource inline edit rows:
- Unresolved resources (empty content) are auto-expanded and rendered with a
  yellow border/background so they are immediately actionable
- Each row has an individual Save button calling upsert-resource
- Resolved resources are collapsed by default but expandable for edits
- An "Add resource" section at the bottom handles net-new entries
- saveResources now stays in fix-resources-drawer after upsert (refreshes the
  list) so users can fix multiple resources in one session; closeDrawer
  transitions back to review as before
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix: install stage was missing resources, so closeDrawer could not restore the
full review state. Now:
- WorkbenchState.install carries resources alongside translations
- startInstall passes resources when entering the stage
- closeDrawer handles install → review (joins the existing fix-*-drawer →
  review paths), making the "Back to review" button functional
- confirmInstall calls install-rules and transitions to done with
  installed/failed counts
- Done step shows KpiStrip with installed/failed tiles and a reset action
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Monaco editor added ~4.8 MB to the bundle (editor library + inlined
editor.worker). To meet the < 1 MB singlefile target, Monaco is removed from
the migration view:
- RuleDiff generated column: Monaco read-only → <pre> (same class as SPL)
- RuleDiff editable column: Monaco Editor → <textarea> with matching monospace
  style (.migration-diff-textarea)
- RuleDrawer: already uses structured form inputs, not Monaco; unchanged
- Removed monaco-environment import from mcp-app.tsx entry point
Output: 364 kB uncompressed (105 kB gzip), a single self-contained
mcp-app.html with no companion worker files.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Host-side skill prompt for the SIEM migration workflow. Covers:
- YAML frontmatter with trigger phrases (migrate my Splunk rules, import SPL,
  onboard from Splunk, SIEM migration, convert detection rules)
- Tools table separating the model-facing migrate-rules entry-point from the
  10 workbench-only app tools
- Workbench Lifecycle table documenting all 8 stages with what the user does
  and what signals completion
- Correction strategy: start-over, back-from-install, re-edit rule, re-edit
  resource, restart translation
- Common gotchas: vendor gate, direct tool calls, upload format, partial
  translations, macro/lookup resolution, large rule sets, re-opening existing
  migrations
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Positives cover the five spec trigger phrases (migrate Splunk rules, upload SPL bundle, onboard from Splunk, SIEM migration, convert detection rules) plus an install-translated-rules variant. Distractors span the other five skills (detection-rule-management, alert-triage, threat-hunt, case-management, generate-sample-data) to test boundary discrimination. All examples set expected.skill so the negative-activation evaluator can gate on migrate-rules absence in distractor runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…100%)
Two runDataset calls mirroring the detection-rule-management pattern:
- positives: skill-activation + tool-selection evaluators, passingScore 0.8
- distractors: negative-activation evaluator, passingScore 1.0 (any false
  positive on migrate-rules is treated as a regression)
Suite is skipped in regular npm test via describe.skipIf(!RUN_LLM_EVALS).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Imports MigrationsService from elastic/service/index and
  registerMigrationTools from tools/migration
- Instantiates migrationsService with the shared kibanaClient
- Calls registerMigrationTools after the other six tool registrations
- Updates integration test snapshots: +11 migration tool names and +1 UI
  resource URI (ui://migrate-rules/mcp-app.html)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds migrate-rules to the tools[] array so the MCP app marketplace advertises the new automatic migration capability. Version bumped to 1.1.0 (minor) to signal the new feature surface. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Updates the tool count from six to seven and adds a row for the new SIEM Migration feature (migrate-rules tool + workbench). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the full migrate-rules workbench workflow: vendor selector, upload, AI translation with progress bar, three-column rule review, per-rule drawer (ElasticRulePartial form), resources drawer with per-row inline save, translation statuses, install step, and done summary. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
❌ Author of the following commits did not sign a Contributor Agreement:
Please read and sign the above-mentioned agreement if you want to contribute to this project.
- Add evals/harness.test.ts: always-on mock-based integration tests that
  exercise the full eval pipeline (runMcpHostLoop → evaluators) for both
  detection-rule-management and automatic-migration datasets without API keys
  or a live cluster. Passes 100% on all gates (tool-selection ≥ 80%,
  negative-activation = 100%).
- Add evals/helpers/evalServer.ts: shared factory that creates a real
  McpServer backed by stub services; used by both harness.test.ts and the LLM
  eval suites so neither needs CLUSTERS_JSON.
- Update evals/runner.ts: add optional createServer factory to RunnerOptions
  (injected per-example since InMemoryTransport is single-use); also widen
  skipIf to skip gracefully when RUN_LLM_EVALS=1 but no API key is configured.
- Update evals/vitest.config.ts: remove dataset files from include, since
  *.dataset.ts files contain no test suites and were causing "no test suite
  found" failures.
- Update both *.eval.test.ts files to pass createEvalServer so the LLM eval
  suites no longer require a live Elastic cluster.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The OpenAI adapter already accepted a `model` constructor option; this
pipes it through `createDefaultLlmProvider()` so operators can run the
eval suite against a local Ollama daemon at no cost:
    OPENAI_API_KEY=ollama \
    LITELLM_BASE_URL=http://localhost:11434/v1 \
    OPENAI_MODEL=llama3.1:8b \
    npm run test:evals
Default behaviour (gpt-4o-mini when only OPENAI_API_KEY is set) is
unchanged because `OpenAiProvider`'s `model = DEFAULT_MODEL` default
kicks in for `undefined`.
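The selection order across the provider commits (Anthropic key first, then OpenAI key with optional base URL and model) can be sketched as a pure function over the environment. `chooseProvider` and the `ProviderChoice` shape are hypothetical; the real `createDefaultLlmProvider()` constructs adapter instances rather than returning a descriptor.

```typescript
type ProviderChoice =
  | { kind: "anthropic" }
  | { kind: "openai"; baseURL?: string; model?: string };

function chooseProvider(env: Record<string, string | undefined>): ProviderChoice {
  // Priority 1: Anthropic.
  if (env.ANTHROPIC_API_KEY) return { kind: "anthropic" };
  // Priority 2: OpenAI-compatible endpoint (OpenAI, LiteLLM proxy, Ollama).
  if (env.OPENAI_API_KEY) {
    return {
      kind: "openai",
      baseURL: env.LITELLM_BASE_URL,
      // Left undefined when unset so the adapter's own default model applies.
      model: env.OPENAI_MODEL,
    };
  }
  throw new Error("Set ANTHROPIC_API_KEY or OPENAI_API_KEY to run LLM evals");
}
```

Passing `process.env` as the argument reproduces the env-var behaviour; passing a literal object keeps the function testable.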
Tools registered via `registerAppTool(...)` with `_meta.ui.visibility: ["app"]`
are invoked by the React workbench via `app.callServerTool()`. Real MCP hosts
(Claude Desktop, Cursor) hide them from the LLM. The eval harness was passing
every tool from `client.listTools()` straight to the model, so small
open-source models saw `start-translation` / `install-rules` / `find-rules` as
alternatives to `migrate-rules` / `manage-rules` and routed there instead,
collapsing activation rates and misrepresenting what a real MCP host exposes.
`isVisibleToModel()` mirrors the host-side visibility contract:
- visibility unset → visible (default for model-facing tools)
- visibility includes "model" → visible
- visibility includes "app" without "model" → hidden
Baseline shift on llama3.1:8b (automatic-migration positives):
  before fix:  67% (4/6; model called start-translation / install-rules)
  after fix:  100% (6/6; model called migrate-rules every time)
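The three visibility rules translate into a short predicate. The `ToolLike` shape below is a reduced, hypothetical view of a listed tool, and treating visibility arrays that contain neither "app" nor "model" as visible is an assumption, since the contract above does not cover that case.

```typescript
interface ToolLike {
  name: string;
  _meta?: { ui?: { visibility?: string[] } };
}

function isVisibleToModel(tool: ToolLike): boolean {
  const visibility = tool._meta?.ui?.visibility;
  if (visibility === undefined) return true; // unset → visible
  if (visibility.includes("model")) return true; // explicitly model-facing
  // "app" without "model" → hidden; anything else is assumed visible.
  return !visibility.includes("app");
}

// Usage: filter the listed tools before handing them to the LLM.
function modelFacingTools(tools: ToolLike[]): ToolLike[] {
  return tools.filter(isVisibleToModel);
}
```

Applying the filter to the output of `client.listTools()` before building the LLM tool list is what keeps app-only tools such as `start-translation` out of the model's options.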
Previously the eval server only registered detection-rules and migration. When
a distractor query like "Create a new case for a ransomware incident" hit the
LLM, the model had no `manage-cases` option to choose, so it forced a poor
match on `manage-rules` and the negative-activation evaluator collapsed.
A real MCP host exposes the full set of model-facing tools, and the eval
server should match. Services are stubbed with `vi.fn()` because skill-routing
evaluators only inspect which tools were called, not what they returned.
Tool groups registered (mirroring src/server.ts):
- alert-triage → triage-alerts
- attack-discovery → triage-attack-discoveries
- case-management → manage-cases
- detection-rules → manage-rules
- migration → migrate-rules
- sample-data → generate-sample-data
- threat-hunt → threat-hunt
Baseline shift on llama3.1:8b (detection-rule-management distractors):
  before fix:  25% (1/4; manage-rules over-selected on case/ESQL/alert queries)
  after fix:  100% (4/4; model picks manage-cases / threat-hunt / etc. correctly)
docs/evals.md updated with the Ollama route and a note that CLUSTERS_JSON is
not required when using createEvalServer.
Surfaced during end-to-end eval validation against
After the three harness fixes in 621b309 / 2ebbf54 / 0543e20 the
Looking at
A minimal lift on the description that would likely close the gap on smaller models:
    description:
      "List, find, search, show, query, review, audit, enable, disable, or otherwise " +
      "manage detection rules in Elastic Security. Use for ANY question that requires " +
      "inspecting the current rule catalog (noisy rules, rule coverage by ATT&CK, " +
      "rules matching a string, rule status, etc.) and for enabling / disabling rules. " +
      "Returns rule metadata; opens the rule management workbench for bulk operations.",
I'm not amending the diff in this PR since
Full baseline table in the PR body under "Eval baseline (captured end-to-end, this PR)".
Adds an optional host-level system prompt to the in-process MCP host
loop so the harness can pin LLM behavior to what a real MCP host
would instruct. Real hosts (Claude Desktop, Cursor) inject a system
prompt that constrains tool selection, response shape, and HITL
confirmation flow. Without one, the harness measures raw
model-vs-tools behavior — which over- or under-reports activation
depending on the model family.
Wired end-to-end:
- HostLoopOptions.systemPrompt: optional string; empty/whitespace
treated identically to omitting (the absence is observable in evals).
- LlmMessage gains a `system` role variant so the prompt flows
through the unified message shape both adapters consume.
- OpenAI adapter: appends `role: "system"` as a normal message
(Chat Completions schema accepts it natively).
- Anthropic adapter: strips system-roled messages from the array
and passes them via the top-level `system` parameter on
`messages.create` — the only place Anthropic accepts a system
prompt. The `toAnthropicMessages` helper's parameter type is
narrowed to `Exclude<LlmMessage, { role: "system" }>` so the
invariant is enforced at the type system, not in prose.
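The system-prompt wiring above can be sketched with a reduced message union. `withSystemPrompt` and `splitSystem` are hypothetical helper names, and joining multiple system messages with a blank line is an assumption; the `Exclude` alias mirrors the narrowing the commit describes.

```typescript
type LlmMessage =
  | { role: "system"; content: string }
  | { role: "user"; content: string }
  | { role: "assistant"; content: string };

// The type the Anthropic message converter accepts: no system role allowed.
type NonSystemMessage = Exclude<LlmMessage, { role: "system" }>;

// Host loop side: prepend the prompt; empty / whitespace counts as omitted.
function withSystemPrompt(
  messages: NonSystemMessage[],
  systemPrompt?: string,
): LlmMessage[] {
  if (systemPrompt === undefined || systemPrompt.trim() === "") {
    return messages.slice();
  }
  const head: LlmMessage[] = [{ role: "system", content: systemPrompt }];
  return head.concat(messages);
}

// Anthropic adapter side: pull system messages out so they can be passed
// via the top-level `system` parameter instead of the messages array.
function splitSystem(messages: LlmMessage[]): {
  system?: string;
  rest: NonSystemMessage[];
} {
  const systemParts: string[] = [];
  const rest: NonSystemMessage[] = [];
  for (const m of messages) {
    if (m.role === "system") systemParts.push(m.content);
    else rest.push(m); // narrowed to NonSystemMessage by the role check
  }
  const system = systemParts.join("\n\n");
  return system.length > 0 ? { system, rest } : { rest };
}
```

Because `rest` is typed as `NonSystemMessage[]`, handing an unsplit `LlmMessage[]` to a converter that takes `NonSystemMessage[]` fails at compile time, which is the invariant the commit relies on.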
Tests:
- 3 new harness tests covering the propagation contract:
(a) systemPrompt is the first message when provided
(b) no system message is injected when omitted
(c) empty / whitespace-only strings are treated as omitted
- All 23 harness tests pass (was 20).
- Tests use a recording-LLM provider so the assertion is on what
the adapter actually received, not on response side effects.
Docs:
- docs/evals.md gains a "Host system prompt" section explaining
the contract + provider-specific handling.
- Drive-by: the Ollama example switched from
`qwen2.5:32b-instruct-q4_K_M` (exposes /generate only, returns
"does not support chat" against this harness) to `llama3.1:8b`
which speaks the OpenAI Chat Completions schema. Caught
end-to-end while validating the harness.
Anti-overengineering self-check:
- Gate 1 (existing abstraction): HostLoopOptions already exists.
`systemPrompt?: string` slots in without a new interface.
- Gate 2 (real consumer): the next eval suite that wants to mimic
Claude Desktop's HITL prompt; SKILL.md-driven evals that need
the skill body as system context.
- Gate 3 (smallest in-place): one new optional field, one new
role variant, two adapter cases, three tests. ~30 LOC of
behavior change excluding tests + docs.
- Gate 6 (cost): default-off, no impact on existing callers.
llama3.1:8b is below the threshold where tool-calling decisions produce useful signal (team eval finding: ≥14B parameters is the floor). Sub-14B 'passes' are coincidence, not a result, so documenting an 8B as the 'good baseline' propagates a floor that masks real harness bugs (elastic#25/elastic#26/elastic#27) and green-lights skills that aren't ready.
Replace with the explicit ≥14B parameter requirement, a chat-completions caveat (qwen2.5:32b-instruct-q4_K_M legitimately returns 'does not support chat' against /v1/chat/completions as of Ollama 0.3.x), and verified candidates the next reader can pull. See elastic/agent-builder-skill-dev-cursor-plugin anti-pattern elastic#28 for the full rationale.
Summary
Add a vendor-agnostic SIEM-rule migration feature (host-side skill + MCP tools + inline React workbench) to example-mcp-app-security, plus a Vitest-native eval harness that certifies skill activation and tool sequencing across LLM providers.
The migration feature lets a SOC engineer move detection rules from Splunk (and, behind a vendor gate, QRadar / Sentinel-One when those translators mature) into Elastic Security without leaving their MCP-aware host (Claude Desktop, Claude Code, Cursor, etc.). The eval harness ships in the same PR because the only way to certify "the skill activates when the user asks for migration and not when they don't" is to actually run the host-side activation loop against an in-process MCP server; no harness existed before this PR.
This PR is standalone with respect to elastic/kibana#269353: the migration tools call Kibana's existing /internal/siem_migrations/* REST routes directly, so this MCP app feature works against any Kibana 9.x deployment that already exposes the SIEM migrations service; no Kibana plugin change is required.
What ships
Migration feature
- src/elastic/service/migrationsService.ts + .test.ts: /internal/siem_migrations/* routes (create-migration, start-translation, get-translated-rules, update-translated-rule, upsert-resource, install-rules, …).
- src/tools/migration.ts + .test.ts: migrate-rules) + 10 app-only tools the React workbench drives via app.callServerTool().
- src/views/migration/ (App.tsx, mcp-app.tsx, mcp-app.html, styles.css, monaco-environment.ts): < 1 MB singlefile HTML bundle via vite-plugin-singlefile. Drives upload → translating → review (3-column SPL/generated/editable Monaco diff) → per-rule edit drawer → fix-resources drawer (macros + lookups) → install → done.
- .agents/skills/automatic-migration/SKILL.md: migrate-rules exactly once; the workbench takes over from there.
- src/server.ts, manifest.json (1.1.0), docs/features.md, README.md: MigrationsService + migration tool group; lists the skill in the features table.
Eval harness
- evals/types.ts: Dataset, Example, Trajectory, Evaluator, EvaluatorResult, ExpectedBehavior.
- evals/runner.ts + evals/vitest.config.ts: describe.skipIf(!RUN_LLM_EVALS) so every CI run that doesn't set the flag passes for free; nightly + label-gated CI runs actually exercise the LLM.
- evals/runMcpHostLoop.ts + evals/helpers/evalServer.ts: InMemoryTransport.createLinkedPair() between an MCP Client and our createServer() (no subprocess, no network, deterministic). Captures every tool call into a Trajectory.
- evals/llm/{openai,anthropic,types,index}.ts: ANTHROPIC_API_KEY is set.
- evals/evaluators/{skill-activation,tool-selection,negative-activation,trajectory,criteria}.ts: expected.tools; negative activation = distractors stay silent), 1 code-judged (trajectory = LCS over expected sequence), 1 LLM-judged (criteria).
- evals/datasets/{detection-rule-management,automatic-migration}.dataset.ts: detection-rule-management (8 examples: 4 positives + 4 distractors) certifies an existing skill; automatic-migration (12 examples: 6 positives covering Splunk SPL ingest, partial translations, resource-fix, install + 6 distractors).
- evals/{detection-rule-management,automatic-migration}.eval.test.ts: runDataset(), sets per-skill thresholds (≥80% tool selection, 100% negative activation).
- evals/harness.test.ts: npm run test:evals works in unit-test mode too.
- docs/evals.md
- .github/workflows/evals.yml: workflow_dispatch (manual) + pull_request filtered by the evals label. Reads OPENAI_API_KEY / ANTHROPIC_API_KEY from secrets; never runs by default to keep PR cost = $0.
Diff stats
43 files changed, 5390 insertions(+), 11 deletions(-)
Of those 11 deletions: tsconfig.json was extended to include evals/**/*, and manifest.json bumped to 1.1.0. The remainder is net-new code organized in dedicated evals/ and src/{tools,views,elastic/service}/migration* slices, with no cross-cutting refactor of the existing alert-triage / case-management / threat-hunt views.
Surface model: where skills live vs where tools live
POST /api/agent_builder/converse in Kibana) SKILL.md is loaded into the agent loop SKILL.md file matching the request and orchestrates the tool calls POST /api/agent_builder/mcp in Kibana) elastic-security stdio / Streamable HTTP MCP server) migrate-rules tool that opens the workbench .agents/skills/automatic-migration/SKILL.md and is mirrored into Claude Desktop's settings + Cursor's settings via the existing install-skills.sh path
So when this PR says "automatic-migration skill", it means the SKILL.md file that the host's LLM loads, not anything that Kibana's agent builder MCP server exposes. The host calls our MCP server's tools; the host's own skill registry decides the prompt material around those tool calls. This is why the skill activation eval runs the host loop in-process (runMcpHostLoop.ts), not the Kibana agent.
How a SOC engineer experiences this end-to-end
docs/setup-*.md).
automatic-migration skill via ./scripts/install-skills.sh add -s automatic-migration -a {cursor|claude-desktop|claude-code}.
SKILL.md and recognizes the request maps to migrate-rules.
migrate-rules exactly once with no arguments. The tool returns a compact summary + a _meta.ui.resourceUri of ui://migrate-rules/mcp-app.html.
upload → translating → review → fix-rule → fix-resources → install → done) is driven by the workbench calling app-only tools through app.callServerTool(). The LLM is out of the loop until the user types something new in chat.
The "vendor gate" in step 4 means: today, only the splunk vendor button is enabled; qradar and sentinel-one show "Coming soon" with opacity-50 cursor-not-allowed. The translators for those vendors are still maturing in Kibana, and we'd rather route the user back to the Splunk path than ship a degraded partial-translation experience the first time they try the feature. Re-enabling each vendor is a one-line change to the SUPPORTED_VENDORS array in src/tools/migration.ts plus the workbench's vendor-select component.
Eval harness
The harness exists because we needed three things this repo didn't have:
- SKILL.md and call migrate-rules once, or did it freelance with start-translation and friends? The skill-activation evaluator checks the trajectory for exactly that handshake. Distractors ("What's the weather?", "List my SaaS apps", "Show alerts for endpoint X") hit the negative-activation evaluator to make sure the skill doesn't trigger when off-topic.
- trajectory evaluator uses LCS against expected.tools so that adding more tools (e.g. follow-up Q&A) doesn't tank the score, but reordering the canonical sequence does.
- baseURL env var so the suite can target a Vertex/Bedrock LiteLLM in CI without changing test code.
The CI is evals label-gated + workflow_dispatch-triggered: a normal PR never spends a token; a PR labelled evals triggers an actual run with whichever provider keys are set; a nightly workflow can run all suites against the OSS LiteLLM proxy to track drift without API costs. docs/evals.md documents the full design (provider matrix, evaluator catalog, dataset format, how to add a new skill suite).
Known limitations (per address-known-limitations.mdc triage)
no-fabricated-evidence.mdc: will be captured against the user's environment and embedded in a follow-up commit on this branch before merge; a static mock is NOT a substitute
./scripts/install-skills.sh add -s automatic-migration -a cursor runtime verification
Eval baseline (captured end-to-end, this PR)
The harness was validated end-to-end against the local Ollama daemon to prove the wire-up works against a real LLM (not just the deterministic mock):
⁂ Migration suite only — 3B model is below the production target.
The migration feature scores 100% on llama3.1:8b for both activation and distractor rejection. The 2 DRM-positive failures on llama3.1:8b are ambiguous-query edge cases on the pre-existing manage-rules skill ("Show me my noisy rules", "PowerShell-related high-severity rules"); a Claude / GPT-4o class model handles them correctly, which is tracked as the post-merge nightly baseline.
Running this end-to-end surfaced and fixed three real harness bugs in three follow-up commits. These are post-treadmill changes I made by hand after the orchestrator wrapped up:
- 621b309 feat(evals): allow OPENAI_MODEL override for Ollama / LiteLLM proxies
  createDefaultLlmProvider() now pipes OPENAI_MODEL through to OpenAiProvider, so the suite runs against any OpenAI-compatible endpoint (Ollama, LiteLLM, Anthropic via proxy). Default gpt-4o-mini behaviour preserved.
- 2ebbf54 fix(evals): hide app-only tools from the LLM in runMcpHostLoop
  client.listTools() to the model, including 10 app-only tools per skill (start-translation, install-rules, find-rules, …), but real MCP hosts hide tools marked _meta.ui.visibility: ["app"]. Filter now mirrors the host contract: visible if visibility is unset OR includes "model" OR doesn't include "app".
- 0543e20 fix(evals): register all 7 model-facing tool groups in createEvalServer
  migration + detection-rules. A distractor like "Create a new case" had no manage-cases to land on, so the model forced a false positive on manage-rules. The server now mirrors src/server.ts exactly (alert-triage, attack-discovery, case-management, detection-rules, migration, sample-data, threat-hunt) with vi.fn() service stubs.
How to reproduce locally:
# zero-cost local baseline
OPENAI_API_KEY=ollama \
LITELLM_BASE_URL=http://localhost:11434/v1 \
OPENAI_MODEL=llama3.1:8b \
RUN_LLM_EVALS=1 npm run test:evals
Commit slices
Even though it's one PR, the commit history reads top-to-bottom as the design document, with each commit being a single reviewable unit. Highlights from git log --oneline main..HEAD:
Validation evidence (pre-merge)
- npx tsc --noEmit: clean (0 errors) ✅
- npm test: runs unit tests + evals/harness.test.ts (mock provider, no API keys); see CI badge
- npm run test:evals: full LLM eval suite, gated by RUN_LLM_EVALS=1 + OPENAI_API_KEY / ANTHROPIC_API_KEY; 18/20 passing against local llama3.1:8b (migration suite 12/12, see Eval baseline table)
- npm run build: singlefile workbench HTML bundle is < 1 MB (verified locally: 365 kB)
How this PR was authored
End-to-end via patryks-treadmill (a local orchestrator I maintain). A single description, dispatched via treadmill_generate_plan against this repo, produced the OpenSpec change (proposal.md, design.md, tasks.md, specs/main/spec.md) and then the 48-task plan that landed every file shown above. The orchestrator dispatched each task to a per-task claude subagent in the worktree, ran the substance check + verifier on every commit, and produced the 43-file diff at commit 4b86632. The author reviewed each substantive commit and intervened on 4 verification-only tasks (screenshot, install-skills smoke, commit reordering, PR-description self-audit) that the orchestrator's substance check can't currently handle; those are tracked under "Known limitations" above.