Draft
42 commits
5bb73b5
evals: add types.ts with Dataset, Example, EvalResult and related types
patrykkopycinski May 15, 2026
06d830c
ao(create-evals-types-ts-with-typescript-definitions--0): Create `eva…
May 15, 2026
b3ad86e
evals: add runner.ts orchestrator, runMcpHostLoop stub, and eval vite…
patrykkopycinski May 15, 2026
21b3030
evals: implement runMcpHostLoop with InMemoryTransport and LLM provid…
patrykkopycinski May 15, 2026
066f7cf
evals: add OpenAiProvider with LiteLLM proxy support and wire default…
patrykkopycinski May 15, 2026
9f47372
evals: add AnthropicProvider and wire it as the default when ANTHROPI…
patrykkopycinski May 15, 2026
ab6ac67
evals: add --reporter=verbose to test:evals script
patrykkopycinski May 15, 2026
9c7c1dd
evals: add skill-activation evaluator (binary score)
patrykkopycinski May 15, 2026
7849ed5
evals: add negative-activation evaluator for distractor examples
patrykkopycinski May 15, 2026
ed6ce7d
evals: add tool-selection evaluator (precision/recall F1 against expe…
patrykkopycinski May 15, 2026
304df8d
evals: add trajectory evaluator (LCS-based sequence score)
patrykkopycinski May 15, 2026
b838b00
evals: add criteria (LLM-as-judge) evaluator
patrykkopycinski May 15, 2026
60eebb3
evals: add detection-rule-management dataset (4 positives + 4 distrac…
patrykkopycinski May 15, 2026
726b3bd
evals: add detection-rule-management.eval.test.ts; split dataset from…
patrykkopycinski May 15, 2026
7784485
ci: add evals.yml GitHub Actions workflow
patrykkopycinski May 15, 2026
ac864b8
docs: add evals.md — harness design, dataset shape, evaluator catalog…
patrykkopycinski May 15, 2026
e9a23fa
feat: add MigrationsService wrapping 14 /internal/siem_migrations/* K…
patrykkopycinski May 15, 2026
2a4ec7d
test: add MigrationsService tests covering all 14 route methods and e…
patrykkopycinski May 15, 2026
16e4d06
feat: register migration tools (1 model-facing + 10 app-only)
patrykkopycinski May 15, 2026
6b9c8bc
test: add migration tool tests (tool registrations + vendor gating)
patrykkopycinski May 15, 2026
1c31779
feat: add migration workbench view with WorkbenchState machine
patrykkopycinski May 15, 2026
be15e34
feat: tighten vendor-select gate to use opacity-50 cursor-not-allowed
patrykkopycinski May 15, 2026
b109d6f
feat: implement upload step with file input, drag-and-drop, and start…
patrykkopycinski May 15, 2026
8edcbbe
feat: translating step now polls get-migration instead of get-stats
patrykkopycinski May 15, 2026
db5e4c3
feat: review step renders three-column diff (SPL | generated | editab…
patrykkopycinski May 15, 2026
c99801a
feat: per-rule drawer with ElasticRulePartial form and Re-validate bu…
patrykkopycinski May 15, 2026
4aca820
feat: fix-resources drawer with per-resource inline edit and unresolv…
patrykkopycinski May 15, 2026
e8c2c4c
feat: install step and done step with working back navigation
patrykkopycinski May 15, 2026
ac12f0b
feat: build migration view as singlefile HTML bundle (365 kB, < 1 MB)
patrykkopycinski May 15, 2026
79a55ff
feat: add automatic-migration SKILL.md with lifecycle and gotchas
patrykkopycinski May 15, 2026
3b21a5c
feat: add automatic-migration eval dataset (6 positives + 6 distractors)
patrykkopycinski May 15, 2026
e8bf035
feat: add automatic-migration eval spec (positives ≥80%, distractors …
patrykkopycinski May 15, 2026
81efa42
feat: wire MigrationsService and registerMigrationTools into server
patrykkopycinski May 15, 2026
3df4a7b
chore: bump manifest to 1.1.0 and add migrate-rules tool entry
patrykkopycinski May 15, 2026
76fdde9
docs: add SIEM Migration to README features table
patrykkopycinski May 15, 2026
f66edc3
docs: add SIEM Migration section to features.md
patrykkopycinski May 15, 2026
4b86632
test(evals): add mock harness tests and fix eval suite robustness
patrykkopycinski May 15, 2026
621b309
feat(evals): allow OPENAI_MODEL override for Ollama / LiteLLM proxies
patrykkopycinski May 15, 2026
2ebbf54
fix(evals): hide app-only tools from the LLM in runMcpHostLoop
patrykkopycinski May 15, 2026
0543e20
fix(evals): register all 7 model-facing tool groups in createEvalServer
patrykkopycinski May 15, 2026
fad9e24
feat(evals): HostLoopOptions.systemPrompt + system-role on LlmMessage
patrykkopycinski May 18, 2026
615fa8f
docs(evals): ≥14B local-LLM floor + drop llama3.1:8b baseline
patrykkopycinski May 18, 2026
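
Reviewer note on the evaluator commits above (ed6ce7d, 304df8d): the commit messages name a precision/recall F1 score for tool selection and an LCS-based sequence score for trajectories. The sketch below shows one way those two scores could be computed. It is not taken from the PR; the function names (`toolSelectionF1`, `lcsLength`, `trajectoryScore`), the set-based treatment of tool names, and the normalization by the longer sequence are all assumptions made for illustration.

```typescript
// Hypothetical helpers; names and normalization choices are assumptions,
// not the PR's actual evaluator code.

/** F1 of the tools the model called against the expected tools (as sets). */
function toolSelectionF1(expected: string[], actual: string[]): number {
  const expectedSet = new Set(expected);
  const actualUnique = Array.from(new Set(actual));
  // Assumption: nothing expected and nothing called counts as a perfect score.
  if (expectedSet.size === 0 && actualUnique.length === 0) return 1;
  const truePositives = actualUnique.filter((tool) => expectedSet.has(tool)).length;
  if (truePositives === 0) return 0;
  const precision = truePositives / actualUnique.length;
  const recall = truePositives / expectedSet.size;
  return (2 * precision * recall) / (precision + recall);
}

/** Length of the longest common subsequence, using two rolling DP rows. */
function lcsLength(a: string[], b: string[]): number {
  let prev: number[] = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [0];
    for (let j = 1; j <= b.length; j++) {
      curr[j] =
        a[i - 1] === b[j - 1]
          ? prev[j - 1] + 1
          : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[b.length];
}

/** LCS length divided by the longer sequence length (assumed normalization). */
function trajectoryScore(expected: string[], actual: string[]): number {
  const longest = Math.max(expected.length, actual.length);
  if (longest === 0) return 1;
  return lcsLength(expected, actual) / longest;
}
```

Dividing by the longer sequence penalizes both missing and extra tool calls; dividing by the expected length alone would ignore extra calls. Which of these the PR uses is not visible from the commit list.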
87 changes: 87 additions & 0 deletions .github/workflows/evals.yml
@@ -0,0 +1,87 @@
# Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
# or more contributor license agreements. Licensed under the Elastic License
# 2.0; you may not use this file except in compliance with the Elastic License
# 2.0.

name: Evals

on:
  # Manually trigger a run from the Actions UI (useful for ad-hoc evaluation).
  workflow_dispatch:

  # Nightly run at 02:00 UTC to catch regressions before the work day starts.
  schedule:
    - cron: "0 2 * * *"

  # Run when a PR is labeled with `evals`. Labels require write permission, so
  # this implicitly limits triggering to maintainers — acceptable because
  # pull_request_target runs with base-repo secrets.
  pull_request_target:
    types: [labeled]

# Cancel any in-progress run for the same ref so a fast push doesn't queue up
# redundant eval jobs that waste LLM quota.
concurrency:
  group: evals-${{ github.ref }}
  cancel-in-progress: true

jobs:
  evals:
    name: LLM Eval Suite
    runs-on: ubuntu-latest

    # For pull_request_target, gate strictly on the evals label so the job
    # doesn't fire for every other label event.
    if: |
      github.event_name == 'workflow_dispatch' ||
      github.event_name == 'schedule' ||
      (github.event_name == 'pull_request_target' && github.event.label.name == 'evals')

    steps:
      - uses: actions/checkout@v4
        with:
          # For pull_request_target, check out the PR head so the eval runs
          # against the proposed changes, not the base branch.
          ref: >-
            ${{
              github.event_name == 'pull_request_target'
                && github.event.pull_request.head.sha
                || github.sha
            }}

      - uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: npm

      - name: Install dependencies
        run: npm ci

      - name: Run evals
        env:
          RUN_LLM_EVALS: "1"
          # Set ANTHROPIC_API_KEY to use Claude Haiku (preferred); fall back to
          # OPENAI_API_KEY for GPT-4o-mini. Set LITELLM_BASE_URL (populated from
          # the EVAL_LITELLM_BASE_URL secret) to route through a LiteLLM proxy
          # instead of the direct OpenAI endpoint.
          ANTHROPIC_API_KEY: ${{ secrets.EVAL_ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.EVAL_OPENAI_API_KEY }}
          LITELLM_BASE_URL: ${{ secrets.EVAL_LITELLM_BASE_URL }}
          # JSON array describing the Elastic cluster the MCP server targets.
          # Shape: [{"name":"primary","elasticsearchUrl":"...","kibanaUrl":"...","elasticsearchApiKey":"..."}]
          CLUSTERS_JSON: ${{ secrets.EVAL_CLUSTERS_JSON }}
        run: |
          set -o pipefail
          npm run test:evals 2>&1 | tee eval-output.txt

      - name: Post eval results to job summary
        if: always()
        run: |
          if [ -f eval-output.txt ]; then
            echo "## Eval results" >> "$GITHUB_STEP_SUMMARY"
            echo "" >> "$GITHUB_STEP_SUMMARY"
            cat eval-output.txt >> "$GITHUB_STEP_SUMMARY"
          else
            echo "## Eval results" >> "$GITHUB_STEP_SUMMARY"
            echo "" >> "$GITHUB_STEP_SUMMARY"
            echo "_No eval output captured._" >> "$GITHUB_STEP_SUMMARY"
          fi
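
Reviewer note on the `Run evals` step above: the env comments describe a provider preference (Anthropic first, then OpenAI, optionally via a LiteLLM proxy) and a `CLUSTERS_JSON` shape. The sketch below shows how a harness might interpret those variables. It is an illustration only, not the PR's implementation: `chooseProvider`, `parseClusters`, `ProviderChoice`, and `ClusterConfig` are hypothetical names, and applying the LiteLLM base URL only to the OpenAI path is an assumption based on the commit message for 066f7cf. Empty strings are treated as unset because GitHub Actions passes an unconfigured secret to the step as an empty string.

```typescript
// Hypothetical sketch; function and type names are not from the PR.
type Env = Record<string, string | undefined>;

type ProviderChoice =
  | { kind: "anthropic"; apiKey: string }
  | { kind: "openai"; apiKey: string; baseUrl?: string };

/** Prefer Anthropic, fall back to OpenAI; empty strings count as unset. */
function chooseProvider(env: Env): ProviderChoice | null {
  const anthropicKey = env["ANTHROPIC_API_KEY"];
  if (anthropicKey) return { kind: "anthropic", apiKey: anthropicKey };

  const openAiKey = env["OPENAI_API_KEY"];
  if (!openAiKey) return null;

  // Assumption: the proxy URL only affects the OpenAI-compatible provider.
  const baseUrl = env["LITELLM_BASE_URL"];
  return baseUrl
    ? { kind: "openai", apiKey: openAiKey, baseUrl }
    : { kind: "openai", apiKey: openAiKey };
}

interface ClusterConfig {
  name: string;
  elasticsearchUrl: string;
  kibanaUrl: string;
  elasticsearchApiKey: string;
}

const CLUSTER_FIELDS = [
  "name",
  "elasticsearchUrl",
  "kibanaUrl",
  "elasticsearchApiKey",
] as const;

/** Parse CLUSTERS_JSON and check the four string fields shown in the workflow comment. */
function parseClusters(raw: string | undefined): ClusterConfig[] {
  if (!raw) throw new Error("CLUSTERS_JSON is not set");
  const parsed: unknown = JSON.parse(raw);
  if (!Array.isArray(parsed)) throw new Error("CLUSTERS_JSON must be a JSON array");

  return parsed.map((entry: unknown, index: number) => {
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`CLUSTERS_JSON[${index}] must be an object`);
    }
    const record = entry as Record<string, unknown>;
    for (const field of CLUSTER_FIELDS) {
      if (typeof record[field] !== "string" || record[field] === "") {
        throw new Error(`CLUSTERS_JSON[${index}].${field} must be a non-empty string`);
      }
    }
    return {
      name: record["name"] as string,
      elasticsearchUrl: record["elasticsearchUrl"] as string,
      kibanaUrl: record["kibanaUrl"] as string,
      elasticsearchApiKey: record["elasticsearchApiKey"] as string,
    };
  });
}
```

Failing fast on a malformed `CLUSTERS_JSON` keeps a misconfigured secret from surfacing later as an opaque connection error in the middle of an eval run.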
3 changes: 2 additions & 1 deletion README.md
@@ -14,7 +14,7 @@ An [MCP App](https://modelcontextprotocol.io/extensions/apps/overview) that brin

## What This Does

-This project provides six interactive security operations tools, each with a rich React-based UI that renders inline when Claude (or another MCP host) calls the tool:
+This project provides seven interactive security operations tools, each with a rich React-based UI that renders inline when Claude (or another MCP host) calls the tool:

| Tool | What It Does |
|------|-------------|
@@ -24,6 +24,7 @@ This project provides six interactive security operations tools, each with a ric
| **Detection Rules** | Browse, tune, and manage detection rules with KQL search and noisy rules analysis |
| **Threat Hunt** | ES\|QL workbench with clickable entities and a D3 investigation graph |
| **Sample Data** | Generate ECS security events for demos across 4 attack chain scenarios |
+| **SIEM Migration** | Migrate detection rules from Splunk to Elastic Security — upload SPL, AI-translate, review per-rule diff, fix resources, and install |

See [docs/features.md](docs/features.md) for a full breakdown of each tool's capabilities.
