Evaluation and Baselines

Nullsec-S1 is evaluated on application-security verdict quality, not on broad general-reasoning benchmarks. The RC2/v1.1 release benchmark is a 111-case suite covering all 16 Nullsec security categories.

What the suite measures

The shared benchmark metrics include:

detection precision / recall / F1
false-safe rate (unsafe code marked safe after Safety Layer enforcement)
hallucination rate (findings on clean cases)
OWASP coverage
structural patch correctness
secure-generation score
per-category recall
failed case IDs and per-case debug reasons

Cases with no output or malformed output are scored as real misses.
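
As a rough illustration of how these metrics relate, the sketch below scores a handful of cases. The Case fields, the "verdict" key, and the rate denominators (false-safe rate over unsafe cases, hallucination rate over clean cases) are assumptions made for this example, not the benchmark's actual schema. A missing or unparseable output yields no verdict, so an unsafe case without a scorable output still counts as a false negative.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    case_id: str
    is_vulnerable: bool        # ground-truth label (hypothetical field name)
    raw_output: Optional[str]  # model output, or None if nothing was produced


def parse_verdict(raw: Optional[str]) -> Optional[str]:
    """Return "safe" or "unsafe", or None when output is missing or malformed."""
    if raw is None:
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    verdict = data.get("verdict")  # assumed key name
    return verdict if verdict in ("safe", "unsafe") else None


def score(cases: list[Case]) -> dict[str, float]:
    tp = fp = fn = false_safe = 0
    for case in cases:
        verdict = parse_verdict(case.raw_output)
        if case.is_vulnerable:
            if verdict == "unsafe":
                tp += 1
            else:
                fn += 1  # missing or malformed output is still a miss
                if verdict == "safe":
                    false_safe += 1
        elif verdict == "unsafe":
            fp += 1  # finding reported on a clean case
    unsafe_total = sum(c.is_vulnerable for c in cases)
    clean_total = len(cases) - unsafe_total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_safe_rate": false_safe / unsafe_total if unsafe_total else 0.0,
        "hallucination_rate": fp / clean_total if clean_total else 0.0,
    }
```

Keeping the parse step separate from the counting step makes the "no output is a miss" rule explicit: the scorer never drops a case, it only fails to credit it.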

Why not MMLU or GSM8K?

Nullsec-S1 is not a general-purpose model, and general academic benchmarks are not its target. It is trained and evaluated to produce structured application-security verdicts over AI-generated apps, agents, MCP tools, wallet/Web3 flows, and common appsec failures. MMLU and GSM8K may be useful for other model profiles, but they do not measure Nullsec's core job: security detection, secure patching, and false-safe prevention.

Baselines

Baseline results must be generated by scripts. Do not hand-enter numbers.

Nullsec-S1

Use the trained PEFT / QLoRA adapter from GitHub Release v1.0.0-rc25 (source of record) or the Hugging Face adapter mirror: https://huggingface.co/Trynullsec/nullsec-s1. In both cases you still need the base model Qwen/Qwen2.5-Coder-7B-Instruct.

python benchmarks/run_all.py --mode model --adapter outputs/nullsec-s1-qlora

Base Qwen2.5-Coder-7B

This runs the same benchmark prompt against the base model with no Nullsec adapter. With --mode model it requires a GPU.

python benchmarks/baselines/base_qwen.py --mode model

Report path:

benchmarks/reports/baselines/qwen2_5_coder_7b/SUITE.json

Malformed or non-JSON base-model output is counted as a miss, the same as for every other system.

Semgrep

Semgrep is optional and CPU-only. Install it first:

python -m pip install semgrep
python benchmarks/baselines/semgrep_baseline.py

If Semgrep is missing, the runner exits with:

Semgrep is not installed. Install with: python -m pip install semgrep
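
How the runner detects a missing Semgrep is not described here. One plausible approach, sketched below, looks the executable up on PATH with shutil.which and exits with the message above. The function name require_semgrep is hypothetical.

```python
import shutil
import sys

MISSING_MESSAGE = "Semgrep is not installed. Install with: python -m pip install semgrep"


def require_semgrep(executable: str = "semgrep") -> str:
    """Return the resolved path to the Semgrep binary, or exit with the message."""
    path = shutil.which(executable)
    if path is None:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(MISSING_MESSAGE)
    return path
```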

Report path:

benchmarks/reports/baselines/semgrep/SUITE.json

Semgrep coverage is partial. It is a static rule engine and does not reasonably cover every Nullsec category (for example prompt injection, many MCP tool-abuse semantics, wallet transaction policies, and many smart-contract risks without specialized rules). Unsupported categories are documented in the report.

Docker fallback for Semgrep is a future enhancement, not implemented today.

Claude API

Claude comparisons are optional hosted-model baselines. They require an API key and an explicit model id; no default model is hardcoded because provider model IDs and dates must be recorded in the report.

Smoke test:

export ANTHROPIC_API_KEY=...
export ANTHROPIC_MODEL=...
python benchmarks/baselines/claude_api.py --limit 5 --sleep 1

Full run (costs money; run intentionally):

python benchmarks/baselines/claude_api.py --sleep 1

Report and raw cache:

benchmarks/reports/baselines/claude/SUITE.json
benchmarks/reports/baselines/claude/raw_outputs.jsonl

Use --resume to skip already-cached case ids if a run is interrupted.
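
A minimal sketch of what resuming from the raw cache could look like, assuming each line of raw_outputs.jsonl is a JSON object with a "case_id" key. That key name and both function names are assumptions for illustration; the actual cache schema may differ.

```python
import json
from pathlib import Path


def cached_case_ids(cache_path: Path) -> set[str]:
    """Collect case ids already present in a JSONL raw-output cache."""
    seen: set[str] = set()
    if not cache_path.exists():
        return seen
    with cache_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a partially written last line after an interruption
            if isinstance(record, dict) and "case_id" in record:
                seen.add(str(record["case_id"]))
    return seen


def pending_cases(all_ids: list[str], cache_path: Path) -> list[str]:
    """Return the ids that still need a request, preserving suite order."""
    done = cached_case_ids(cache_path)
    return [cid for cid in all_ids if cid not in done]
```

Skipping a truncated final line means an interrupted case is requested again rather than being treated as cached.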

OpenAI / Codex API

OpenAI/Codex comparisons are optional hosted-model baselines. They require an API key and an explicit model id via OPENAI_MODEL or --model.
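
The sketch below shows one way a runner could resolve the model id without falling back to a built-in default. Giving --model precedence over OPENAI_MODEL is an assumption, as is the function name resolve_model; only the flag and the environment variable come from this document.

```python
import argparse
import os
from typing import Optional, Sequence


def resolve_model(argv: Optional[Sequence[str]] = None) -> str:
    """Pick the model id from --model, then OPENAI_MODEL; refuse to guess a default."""
    parser = argparse.ArgumentParser(description="hosted-model baseline (sketch)")
    parser.add_argument("--model", default=None, help="explicit provider model id")
    args = parser.parse_args(argv)
    model = args.model or os.environ.get("OPENAI_MODEL")
    if not model:
        # parser.error prints usage plus the message and exits with status 2.
        parser.error("set OPENAI_MODEL or pass --model; no default model is used")
    return model
```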

Smoke test:

export OPENAI_API_KEY=...
export OPENAI_MODEL=...
python benchmarks/baselines/openai_api.py --limit 5 --sleep 1

Full run (costs money; run intentionally):

python benchmarks/baselines/openai_api.py --sleep 1

Report and raw cache:

benchmarks/reports/baselines/openai/SUITE.json
benchmarks/reports/baselines/openai/raw_outputs.jsonl

Provider models can change over time. Reports record the exact provider, model id, run date, dataset, and raw-output cache path. Do not compare hosted-model results without those fields.
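
A small guard like the following could enforce that rule before two hosted-model reports are compared. The "metadata" block and the key names in REQUIRED_FIELDS are hypothetical stand-ins for the fields listed above; check them against a real generated report before relying on this.

```python
# Hypothetical key names for: provider, model id, run date, dataset, raw-output cache path.
REQUIRED_FIELDS = ("provider", "model_id", "run_date", "dataset", "raw_output_cache")


def missing_metadata(report: dict) -> list[str]:
    """List required metadata keys that are absent or empty in a report."""
    metadata = report.get("metadata", {})
    return [key for key in REQUIRED_FIELDS if not metadata.get(key)]


def comparable(report_a: dict, report_b: dict) -> bool:
    """Reports are comparable only if both are fully described and share a dataset."""
    if missing_metadata(report_a) or missing_metadata(report_b):
        return False
    return report_a["metadata"]["dataset"] == report_b["metadata"]["dataset"]
```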

Hosted model runners use temperature=0 where the provider/model supports it. Some models reject or deprecate explicit temperature parameters; in that case the runner retries once without temperature and records the omission in report metadata.
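
The retry-once behavior can be sketched independently of any provider SDK. Here send is a caller-supplied function and TemperatureRejected is a stand-in exception; both are assumptions for illustration, since real providers signal an unsupported temperature in their own ways.

```python
from typing import Callable


class TemperatureRejected(Exception):
    """Stand-in for whatever error a provider raises for an unsupported temperature."""


def call_with_temperature_fallback(
    send: Callable[..., str], prompt: str
) -> tuple[str, dict]:
    """Try temperature=0 first; on rejection retry once without it and record that."""
    metadata = {"temperature": 0, "temperature_omitted": False}
    try:
        return send(prompt, temperature=0), metadata
    except TemperatureRejected:
        metadata = {"temperature": None, "temperature_omitted": True}
        # Single retry only: if this call also fails, the error propagates.
        return send(prompt), metadata
```

Returning the metadata alongside the text lets the caller write the omission into the report, so a run without temperature=0 is never mistaken for a deterministic one.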

Comparison table

Generate a Markdown comparison from existing reports:

python benchmarks/compare_baselines.py \
  --nullsec benchmarks/reports/SUITE.json \
  --base benchmarks/reports/baselines/qwen2_5_coder_7b/SUITE.json \
  --semgrep benchmarks/reports/baselines/semgrep/SUITE.json \
  --claude benchmarks/reports/baselines/claude/SUITE.json \
  --openai benchmarks/reports/baselines/openai/SUITE.json \
  --out benchmarks/reports/baselines/COMPARISON.md

The generated comparison is a report artifact and should not be committed unless explicitly approved.

Baseline comparison

Generated with benchmarks/compare_baselines.py from local reports. Raw generated reports remain ignored under benchmarks/reports/.

System / tool	Total cases	Outputs / analyzable	Precision	Recall	F1	false_safe_rate	hallucination_rate	Notes / coverage limits
Nullsec-S1	111	110	0.9423	0.9074	0.9245	0.0	0.0667	RC2/v1.1 release or local run
Qwen2.5-Coder-7B-Instruct (base, no Nullsec adapter)	111	4	0.3333	0.0093	0.018	0.0	0.5	base model, no Nullsec adapter
Semgrep (local rules baseline)	111	111	0.8627	0.4074	0.5535	0.5625	0.3333	static rules; partial category coverage
Claude API baseline (claude-opus-4-8)	111	68	0.8889	0.5185	0.655	0.0	0.1429	hosted API baseline; model id/date in report
OpenAI/Codex API baseline (gpt-5.3-codex)	111	105	0.6169	0.8796	0.7252	0.0	0.6	hosted API baseline; model id/date in report

Output-count note

The GitHub Release records 111/111 raw model outputs. The comparison table's Outputs / analyzable column uses results.summary.total_outputs from the report, which counts outputs that were alignable and scorable as structured verdicts by the benchmark pipeline. For the Nullsec-S1 report used here, one raw output was not alignable for scoring, so the comparison table shows 110.
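
To check this gap for a given report, a short script can read the same field. The sketch assumes the dotted name results.summary.total_outputs maps to nested JSON objects in SUITE.json; the function names are hypothetical, and the raw-output count is passed in by the caller rather than read from any particular file.

```python
import json
from pathlib import Path


def analyzable_outputs(report_path: Path) -> int:
    """Read results.summary.total_outputs from a SUITE.json-style report."""
    report = json.loads(report_path.read_text(encoding="utf-8"))
    return int(report["results"]["summary"]["total_outputs"])


def not_alignable(report_path: Path, raw_output_count: int) -> int:
    """Number of raw outputs that did not become scorable structured verdicts."""
    return raw_output_count - analyzable_outputs(report_path)
```

With 111 raw outputs and a report whose total_outputs is 110, not_alignable returns 1, matching the note above.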

Interpretation

Nullsec-S1 has the highest precision, recall, and F1 of the five systems in the table above on this repo-authored benchmark.
Base Qwen2.5-Coder-7B-Instruct mostly failed to produce scorable Nullsec-style JSON security verdicts. This shows why the fine-tune and deterministic alignment layer matter for this output format and task.
Semgrep detects some static patterns with high precision, but has partial category coverage and lower recall on this benchmark. This is a local-rules Semgrep baseline on the Nullsec benchmark, not a general claim about Semgrep quality.
Claude Opus 4.8 has high precision but lower recall and fewer analyzable structured outputs on this benchmark.
OpenAI/Codex has high recall but lower precision and a higher hallucination rate on this benchmark.

Limitations

The benchmark is security-specific and repo-authored; it is not an independent third-party benchmark.
Baseline comparisons are meaningful only when all systems are run on the same dataset version.
Results should be reproduced from the scripts above; do not hand-enter metrics.
Semgrep is not expected to cover all categories and should be interpreted as a static-analysis baseline, not a security LLM.
Provider API models can change over time. Exact model IDs and run dates are recorded in generated reports.
Frontier/API baseline numbers here are benchmark-run outputs, not general claims about those providers.
This comparison does not prove universal vulnerability detection performance.
Do not claim Nullsec-S1 beats another model or tool unless the comparison script output proves it.
