Evaluation and Baselines

Nullsec-S1 is evaluated on application-security verdict quality, not broad general reasoning benchmarks. The RC2/v1.1 release benchmark is a 111-case suite covering all 16 Nullsec security categories.

What the suite measures

The shared benchmark metrics include:

  • detection precision / recall / F1
  • false-safe rate (unsafe code marked safe after Safety Layer enforcement)
  • hallucination rate (findings on clean cases)
  • OWASP coverage
  • structural patch correctness
  • secure-generation score
  • per-category recall
  • failed case IDs and per-case debug reasons

Cases with no output or malformed output are scored as real misses.
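The metrics above can be sketched at the case level as follows. This is a minimal illustration, not the suite's scorer: the `CaseResult` record, its field names, and the choice of denominators (unsafe cases for the false-safe rate, clean cases for the hallucination rate) are assumptions. A missing or malformed output is represented as `None` and counted as a miss.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    # Hypothetical per-case record; the real report schema may differ.
    expected_unsafe: bool
    predicted_unsafe: Optional[bool]  # None = no output or malformed output


def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def score(results: list[CaseResult]) -> dict[str, float]:
    tp = fp = fn = 0
    unsafe_total = unsafe_marked_safe = 0
    clean_total = clean_with_findings = 0

    for r in results:
        if r.expected_unsafe:
            unsafe_total += 1
            if r.predicted_unsafe is True:
                tp += 1
            else:
                fn += 1  # explicit "safe" and missing/malformed output are both misses
                if r.predicted_unsafe is False:
                    unsafe_marked_safe += 1
        else:
            clean_total += 1
            if r.predicted_unsafe is True:
                fp += 1
                clean_with_findings += 1

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "false_safe_rate": safe_div(unsafe_marked_safe, unsafe_total),
        "hallucination_rate": safe_div(clean_with_findings, clean_total),
    }
```

In this sketch a system that produces no output for an unsafe case loses recall but does not raise the false-safe rate, since it never asserted that the code was safe.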

Why not MMLU or GSM8K?

Nullsec-S1 is not a general-purpose model and is not tuned for academic benchmarks. It is trained and evaluated to produce structured application-security verdicts over AI-generated apps, agents, MCP tools, wallet/Web3 flows, and common appsec failures. MMLU/GSM8K may be useful for other model profiles, but they do not measure Nullsec's core job: security detection, secure patching, and false-safe prevention.

Baselines

Baseline results must be generated by scripts. Do not hand-enter numbers.

Nullsec-S1

Use the trained PEFT / QLoRA adapter from GitHub Release v1.0.0-rc25 (source of record) or the Hugging Face adapter mirror: https://huggingface.co/Trynullsec/nullsec-s1. Users still need the base model Qwen/Qwen2.5-Coder-7B-Instruct.

python benchmarks/run_all.py --mode model --adapter outputs/nullsec-s1-qlora

Base Qwen2.5-Coder-7B

This runs the same benchmark prompt against the base model with no Nullsec adapter. It requires a GPU when run with --mode model.

python benchmarks/baselines/base_qwen.py --mode model

Report path:

benchmarks/reports/baselines/qwen2_5_coder_7b/SUITE.json

Malformed or non-JSON base-model output is counted as a miss, following the same rule the suite applies to every system.
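One way to apply that rule is to parse each raw output strictly and treat anything unparseable as "no verdict". The sketch below is an assumption about how this could look, not the runner's actual code; the `verdict` key and the `REQUIRED_KEYS` set are hypothetical placeholders for whatever the real verdict schema requires.

```python
import json
from typing import Optional

# Hypothetical minimum schema; the real Nullsec verdict format may require more keys.
REQUIRED_KEYS = {"verdict"}


def parse_verdict(raw: str) -> Optional[dict]:
    """Return the parsed verdict, or None if the output is empty, non-JSON, or malformed."""
    if not raw or not raw.strip():
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # prose, truncated JSON, etc.
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None  # valid JSON, but not a verdict object
    return data
```

A `None` result would then be scored as a miss rather than dropped from the denominator.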

Semgrep

Semgrep is optional and CPU-only. Install it first:

python -m pip install semgrep
python benchmarks/baselines/semgrep_baseline.py

If Semgrep is missing, the runner exits with:

Semgrep is not installed. Install with: python -m pip install semgrep
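A pre-flight check of this kind can be sketched with the standard library. The function name `require_semgrep` and the injectable `which` parameter are illustrative assumptions; only the message text comes from the runner's documented behaviour.

```python
import shutil
from typing import Callable, Optional

INSTALL_HINT = "Semgrep is not installed. Install with: python -m pip install semgrep"


def require_semgrep(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Return the path of the semgrep executable, or exit with the install hint."""
    path = which("semgrep")
    if path is None:
        # SystemExit with a string prints it to stderr and exits with status 1.
        raise SystemExit(INSTALL_HINT)
    return path
```

Checking before the run starts avoids producing a partial report when the tool is absent.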

Report path:

benchmarks/reports/baselines/semgrep/SUITE.json

Semgrep coverage is partial. It is a static rule engine and does not reasonably cover every Nullsec category (for example prompt injection, many MCP tool-abuse semantics, wallet transaction policies, and many smart-contract risks without specialized rules). Unsupported categories are documented in the report.

Docker fallback for Semgrep is a future enhancement, not implemented today.

Claude API

Claude comparisons are optional hosted-model baselines. They require an API key and an explicit model id; no default model is hardcoded because provider model IDs and dates must be recorded in the report.

Smoke test:

export ANTHROPIC_API_KEY=...
export ANTHROPIC_MODEL=...
python benchmarks/baselines/claude_api.py --limit 5 --sleep 1

Full run (costs money; run intentionally):

python benchmarks/baselines/claude_api.py --sleep 1

Report and raw cache:

benchmarks/reports/baselines/claude/SUITE.json
benchmarks/reports/baselines/claude/raw_outputs.jsonl

Use --resume to skip already-cached case ids if a run is interrupted.
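Resuming from the raw cache could work roughly like the sketch below. It assumes each line of raw_outputs.jsonl is a JSON object carrying a `case_id` key; that key name and both function names are hypothetical, and the real runner may track progress differently.

```python
import json
from pathlib import Path


def cached_case_ids(cache_path: Path) -> set[str]:
    """Collect case ids already present in a JSONL cache file."""
    ids: set[str] = set()
    if not cache_path.exists():
        return ids
    with cache_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a truncated last line from an interrupted run
            if isinstance(record, dict) and "case_id" in record:
                ids.add(str(record["case_id"]))
    return ids


def pending_cases(cases: list[dict], cache_path: Path) -> list[dict]:
    """Return the cases that still need an API call."""
    done = cached_case_ids(cache_path)
    return [c for c in cases if str(c["case_id"]) not in done]
```

Skipping a truncated final line means the interrupted case is simply re-run, which keeps paid API calls to the minimum needed.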

OpenAI / Codex API

OpenAI/Codex comparisons are optional hosted-model baselines. They require an API key and an explicit model id via OPENAI_MODEL or --model.

Smoke test:

export OPENAI_API_KEY=...
export OPENAI_MODEL=...
python benchmarks/baselines/openai_api.py --limit 5 --sleep 1

Full run (costs money; run intentionally):

python benchmarks/baselines/openai_api.py --sleep 1

Report and raw cache:

benchmarks/reports/baselines/openai/SUITE.json
benchmarks/reports/baselines/openai/raw_outputs.jsonl

Provider models can change over time. Reports record the exact provider, model id, run date, dataset, and raw-output cache path. Do not compare hosted-model results without those fields.
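Before comparing two hosted-model reports, a small guard can confirm that the provenance fields are present and that both runs used the same dataset. The `metadata` block and every key name below are assumptions made for illustration; check them against the actual SUITE.json layout.

```python
# Hypothetical key names for the provenance fields listed above.
REQUIRED_FIELDS = ("provider", "model_id", "run_date", "dataset", "raw_outputs_path")


def missing_metadata(report: dict) -> list[str]:
    """Return the provenance fields that are absent or empty in a report."""
    meta = report.get("metadata", {})
    return [field for field in REQUIRED_FIELDS if not meta.get(field)]


def comparable(report_a: dict, report_b: dict) -> bool:
    """True only if both reports are fully annotated and share a dataset."""
    if missing_metadata(report_a) or missing_metadata(report_b):
        return False
    return report_a["metadata"]["dataset"] == report_b["metadata"]["dataset"]
```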

Hosted model runners use temperature=0 where the provider/model supports it. Some models reject or deprecate explicit temperature parameters; in that case the runner retries once without temperature and records the omission in report metadata.
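The retry-once behaviour can be sketched independently of any provider SDK. Here `send` is a hypothetical callable wrapping the provider request, and `TemperatureRejected` is a hypothetical exception that the wrapper would raise when the provider refuses the parameter; neither is a real SDK name.

```python
from typing import Callable


class TemperatureRejected(Exception):
    """Hypothetical error: the provider/model does not accept an explicit temperature."""


def call_with_temperature_fallback(
    send: Callable[..., str], prompt: str
) -> tuple[str, dict]:
    """Try temperature=0 first; on rejection, retry once without it and record that."""
    try:
        text = send(prompt, temperature=0)
        return text, {"temperature": 0, "temperature_omitted": False}
    except TemperatureRejected:
        # Single retry with provider defaults; any further error propagates.
        text = send(prompt)
        return text, {"temperature": None, "temperature_omitted": True}
```

Returning the metadata alongside the text makes it straightforward to write the omission into the report for that run.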

Comparison table

Generate a Markdown comparison from existing reports:

python benchmarks/compare_baselines.py \
  --nullsec benchmarks/reports/SUITE.json \
  --base benchmarks/reports/baselines/qwen2_5_coder_7b/SUITE.json \
  --semgrep benchmarks/reports/baselines/semgrep/SUITE.json \
  --claude benchmarks/reports/baselines/claude/SUITE.json \
  --openai benchmarks/reports/baselines/openai/SUITE.json \
  --out benchmarks/reports/baselines/COMPARISON.md

The generated comparison is a report artifact and should not be committed unless explicitly approved.

Baseline comparison

Generated with benchmarks/compare_baselines.py from local reports. Raw generated reports remain ignored under benchmarks/reports/.

| System / tool | Total cases | Outputs / analyzable | Precision | Recall | F1 | false_safe_rate | hallucination_rate | Notes / coverage limits |
|---|---|---|---|---|---|---|---|---|
| Nullsec-S1 | 111 | 110 | 0.9423 | 0.9074 | 0.9245 | 0.0 | 0.0667 | RC2/v1.1 release or local run |
| Qwen2.5-Coder-7B-Instruct (base, no Nullsec adapter) | 111 | 4 | 0.3333 | 0.0093 | 0.018 | 0.0 | 0.5 | base model, no Nullsec adapter |
| Semgrep (local rules baseline) | 111 | 111 | 0.8627 | 0.4074 | 0.5535 | 0.5625 | 0.3333 | static rules; partial category coverage |
| Claude API baseline (claude-opus-4-8) | 111 | 68 | 0.8889 | 0.5185 | 0.655 | 0.0 | 0.1429 | hosted API baseline; model id/date in report |
| OpenAI/Codex API baseline (gpt-5.3-codex) | 111 | 105 | 0.6169 | 0.8796 | 0.7252 | 0.0 | 0.6 | hosted API baseline; model id/date in report |
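As a sanity check on the table, F1 should equal the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the rounded table values; the tolerance of 5e-4 is an assumption chosen to absorb the four-decimal rounding of the inputs.

```python
# (system, precision, recall, reported F1), copied from the comparison table.
ROWS = [
    ("Nullsec-S1", 0.9423, 0.9074, 0.9245),
    ("Qwen2.5-Coder-7B-Instruct (base)", 0.3333, 0.0093, 0.018),
    ("Semgrep", 0.8627, 0.4074, 0.5535),
    ("Claude API baseline", 0.8889, 0.5185, 0.655),
    ("OpenAI/Codex API baseline", 0.6169, 0.8796, 0.7252),
]

TOLERANCE = 5e-4  # assumed allowance for rounding in the published figures


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def inconsistent_rows(rows, tol: float = TOLERANCE) -> list[str]:
    """Names of rows whose reported F1 disagrees with the recomputed value."""
    return [
        name
        for name, precision, recall, reported in rows
        if abs(f1(precision, recall) - reported) > tol
    ]


print(inconsistent_rows(ROWS))
```

An empty list means every reported F1 agrees with its precision and recall to within the stated tolerance.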

Output-count note

The GitHub Release records 111/111 raw model outputs. The comparison table's Outputs / analyzable column uses results.summary.total_outputs from the report, which counts outputs that were alignable and scorable as structured verdicts by the benchmark pipeline. For the Nullsec-S1 report used here, one raw output was not alignable for scoring, so the comparison table shows 110.
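The gap between raw and analyzable outputs can be read straight from a report. The sketch below assumes the report is a JSON file with the `results.summary.total_outputs` path named above; the raw-output count is passed in by the caller because this document does not name the field that stores it.

```python
import json
from pathlib import Path


def total_outputs(report_path: Path) -> int:
    """Read results.summary.total_outputs from a SUITE.json report."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    return int(report["results"]["summary"]["total_outputs"])


def unalignable_outputs(report_path: Path, raw_output_count: int) -> int:
    """Raw outputs that could not be aligned and scored as structured verdicts."""
    return raw_output_count - total_outputs(report_path)
```

For the Nullsec-S1 report described here, 111 raw outputs and a `total_outputs` of 110 would give one unalignable output.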

Interpretation

  • Nullsec-S1 has the highest precision, recall, and F1 (0.9245) of the five systems in the table, and the lowest hallucination rate (0.0667), on this repo-authored benchmark.
  • Base Qwen2.5-Coder-7B-Instruct mostly failed to produce scorable Nullsec-style JSON security verdicts. This shows why the fine-tune and deterministic alignment layer matter for this output format and task.
  • Semgrep detects some static patterns with high precision, but has partial category coverage and lower recall on this benchmark. This is a local-rules Semgrep baseline on the Nullsec benchmark, not a general claim about Semgrep quality.
  • Claude Opus 4.8 has high precision but lower recall and fewer analyzable structured outputs on this benchmark.
  • OpenAI/Codex has high recall but lower precision and a higher hallucination rate on this benchmark.

Limitations

  • The benchmark is security-specific and repo-authored; it is not an independent third-party benchmark.
  • Baseline comparisons are meaningful only when all systems are run on the same dataset version.
  • Results should be reproduced from the scripts above; do not hand-enter metrics.
  • Semgrep is not expected to cover all categories and should be interpreted as a static-analysis baseline, not a security LLM.
  • Provider API models can change over time. Exact model IDs and run dates are recorded in generated reports.
  • Frontier/API baseline numbers here are benchmark-run outputs, not general claims about those providers.
  • This comparison does not prove universal vulnerability detection performance.
  • Do not claim Nullsec-S1 beats another model or tool unless the comparison script output proves it.