Nullsec-S1 is evaluated on application-security verdict quality, not on broad general-reasoning benchmarks. The RC2/v1.1 release benchmark is a 111-case suite covering all 16 Nullsec security categories.
The shared benchmark metrics include:
- detection precision / recall / F1
- false-safe rate (unsafe code marked safe after Safety Layer enforcement)
- hallucination rate (findings on clean cases)
- OWASP coverage
- structural patch correctness
- secure-generation score
- per-category recall
- failed case IDs and per-case debug reasons
Cases with no output or malformed output are scored as real misses.
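To illustrate how scoring missing outputs as misses affects the detection metrics, here is a minimal sketch over a few hypothetical cases. The `Case` record, its field names, and the `score` helper are assumptions made for illustration; they are not the benchmark's actual schema or scoring code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    # Hypothetical record; the real benchmark schema may differ.
    case_id: str
    vulnerable: bool          # ground truth: does the case contain a flaw?
    flagged: Optional[bool]   # model verdict; None means no or malformed output


def score(cases):
    tp = fp = fn = 0
    for case in cases:
        # Assumption: a missing or malformed output is treated as "not flagged",
        # so it becomes a false negative on a vulnerable case.
        flagged = bool(case.flagged)
        if case.vulnerable and flagged:
            tp += 1
        elif case.vulnerable and not flagged:
            fn += 1
        elif flagged:
            fp += 1  # finding reported on a clean case
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


cases = [
    Case("c1", True, True),
    Case("c2", True, None),    # no output: counted as a miss
    Case("c3", False, True),   # finding on a clean case
    Case("c4", False, False),
]
metrics = score(cases)
```

In this toy set the missing output on `c2` lowers recall exactly as a wrong "safe" verdict would, which is the point of counting it as a real miss.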
Nullsec-S1 is not a general-purpose academic benchmark model. It is trained and evaluated for structured application-security verdicts over AI-generated apps, agents, MCP tools, wallet/Web3 flows, and common appsec failures. MMLU/GSM8K may be useful for other model profiles, but they do not measure Nullsec's core job: security detection, secure patching, and false-safe prevention.
Baseline results must be generated by scripts. Do not hand-enter numbers.
Use the trained PEFT / QLoRA adapter from GitHub Release v1.0.0-rc25 (source
of record) or the Hugging Face adapter mirror:
https://huggingface.co/Trynullsec/nullsec-s1. Users still need the base model
Qwen/Qwen2.5-Coder-7B-Instruct.
python benchmarks/run_all.py --mode model --adapter outputs/nullsec-s1-qlora
The base-model baseline runs the same benchmark prompt against the base model with no Nullsec
adapter. It requires a GPU in --mode model.
python benchmarks/baselines/base_qwen.py --mode model
Report path:
benchmarks/reports/baselines/qwen2_5_coder_7b/SUITE.json
Malformed or non-JSON base-model output is counted honestly as a miss.
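One way to picture "counted as a miss" is a parser that returns nothing usable when the output is not a structured verdict. The sketch below is an assumption about how such a check could look; the `parse_verdict` helper and the `"verdict"` key are hypothetical and are not taken from the runner's code.

```python
import json
from typing import Optional


def parse_verdict(raw) -> Optional[dict]:
    """Return the parsed verdict, or None if the output is not scorable."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Free-form prose, truncated JSON, or no output at all.
        return None
    # Assumption: a scorable verdict is a JSON object with a "verdict" key.
    if not isinstance(data, dict) or "verdict" not in data:
        return None
    return data
```

A caller would treat a `None` result like any other wrong answer for that case instead of dropping the case from the denominator.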
Semgrep is optional and CPU-only. Install it first:
python -m pip install semgrep
python benchmarks/baselines/semgrep_baseline.py
If Semgrep is missing, the runner exits with:
Semgrep is not installed. Install with: python -m pip install semgrep
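A pre-flight check of this kind can be sketched as follows. This assumes the runner looks for a `semgrep` executable on `PATH`; the actual runner may detect Semgrep differently, and the helper names here are hypothetical.

```python
import shutil
import sys

MISSING_MESSAGE = (
    "Semgrep is not installed. Install with: python -m pip install semgrep"
)


def semgrep_available(which=shutil.which) -> bool:
    # `which` is injectable so the check can be exercised without Semgrep.
    return which("semgrep") is not None


def ensure_semgrep() -> None:
    if not semgrep_available():
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(MISSING_MESSAGE)
```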
Report path:
benchmarks/reports/baselines/semgrep/SUITE.json
Semgrep coverage is partial. It is a static rule engine and does not reasonably cover every Nullsec category (for example prompt injection, many MCP tool-abuse semantics, wallet transaction policies, and many smart-contract risks without specialized rules). Unsupported categories are documented in the report.
Docker fallback for Semgrep is a future enhancement, not implemented today.
Claude comparisons are optional hosted-model baselines. They require an API key and an explicit model id; no default model is hardcoded because provider model IDs and dates must be recorded in the report.
Smoke test:
export ANTHROPIC_API_KEY=...
export ANTHROPIC_MODEL=...
python benchmarks/baselines/claude_api.py --limit 5 --sleep 1
Full run (costs money; run intentionally):
python benchmarks/baselines/claude_api.py --sleep 1
Report and raw cache:
benchmarks/reports/baselines/claude/SUITE.json
benchmarks/reports/baselines/claude/raw_outputs.jsonl
Use --resume to skip already-cached case ids if a run is interrupted.
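Resuming from a JSONL cache can be sketched roughly as below. The `case_id` field name and both helper functions are assumptions for illustration; the real cache records may be shaped differently.

```python
import json
from pathlib import Path


def cached_case_ids(cache_path):
    """Collect case ids already present in a JSONL raw-output cache."""
    path = Path(cache_path)
    if not path.exists():
        return set()
    ids = set()
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Likely a partially written line from an interrupted run.
                continue
            # Assumption: each record carries its case id under "case_id".
            if isinstance(record, dict) and "case_id" in record:
                ids.add(record["case_id"])
    return ids


def pending_cases(all_case_ids, cache_path):
    """Return the case ids that still need a provider call, in order."""
    done = cached_case_ids(cache_path)
    return [cid for cid in all_case_ids if cid not in done]
```

Skipping unparseable lines means a case whose record was cut off mid-write is simply requested again instead of being treated as done.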
OpenAI/Codex comparisons are optional hosted-model baselines. They require an API
key and an explicit model id via OPENAI_MODEL or --model.
Smoke test:
export OPENAI_API_KEY=...
export OPENAI_MODEL=...
python benchmarks/baselines/openai_api.py --limit 5 --sleep 1
Full run (costs money; run intentionally):
python benchmarks/baselines/openai_api.py --sleep 1
Report and raw cache:
benchmarks/reports/baselines/openai/SUITE.json
benchmarks/reports/baselines/openai/raw_outputs.jsonl
Provider models can change over time. Reports record the exact provider, model id, run date, dataset, and raw-output cache path. Do not compare hosted-model results without those fields.
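Before comparing hosted-model reports, a quick guard like the following can confirm the required fields are present. The `metadata` container and the key names are hypothetical; check the generated report for the actual field names.

```python
# Hypothetical key names for the fields listed above.
REQUIRED_FIELDS = (
    "provider",
    "model_id",
    "run_date",
    "dataset",
    "raw_output_cache",
)


def missing_metadata(report: dict) -> list:
    """Return the required fields that are absent or empty in a report."""
    # Assumption: the fields live under a top-level "metadata" object.
    metadata = report.get("metadata") or {}
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


example_report = {
    "metadata": {
        "provider": "anthropic",
        "model_id": "claude-opus-4-8",
        "run_date": "",  # empty on purpose to show a failed check
        "dataset": "placeholder-dataset-id",
        "raw_output_cache": "benchmarks/reports/baselines/claude/raw_outputs.jsonl",
    }
}
problems = missing_metadata(example_report)
```

A comparison step could refuse to proceed whenever `problems` is non-empty.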
Hosted model runners use temperature=0 where the provider/model supports it.
Some models reject or deprecate explicit temperature parameters; in that case the
runner retries once without temperature and records the omission in report
metadata.
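The retry-once behavior described above can be sketched as follows. `TemperatureNotSupported` stands in for whatever error a provider SDK actually raises, and `send` is a hypothetical callable wrapping the provider request; neither name comes from the runner.

```python
class TemperatureNotSupported(Exception):
    """Hypothetical stand-in for a provider rejecting the temperature parameter."""


def call_with_temperature_fallback(send, prompt):
    """Try temperature=0 first; retry once without it and record the omission."""
    try:
        return send(prompt, temperature=0), {"temperature": 0}
    except TemperatureNotSupported:
        # Single retry without the parameter. Any error raised here propagates.
        metadata = {"temperature": None, "temperature_omitted": True}
        return send(prompt), metadata
```

Returning the metadata alongside the output keeps the omission visible to whatever writes the report, so results generated without `temperature=0` are not mistaken for deterministic-setting runs.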
Generate a Markdown comparison from existing reports:
python benchmarks/compare_baselines.py \
--nullsec benchmarks/reports/SUITE.json \
--base benchmarks/reports/baselines/qwen2_5_coder_7b/SUITE.json \
--semgrep benchmarks/reports/baselines/semgrep/SUITE.json \
--claude benchmarks/reports/baselines/claude/SUITE.json \
--openai benchmarks/reports/baselines/openai/SUITE.json \
--out benchmarks/reports/baselines/COMPARISON.md
The generated comparison is a report artifact and should not be committed unless explicitly approved.
Generated with benchmarks/compare_baselines.py from local reports. Raw
generated reports remain ignored under benchmarks/reports/.
| System / tool | Total cases | Outputs / analyzable | Precision | Recall | F1 | false_safe_rate | hallucination_rate | Notes / coverage limits |
|---|---|---|---|---|---|---|---|---|
| Nullsec-S1 | 111 | 110 | 0.9423 | 0.9074 | 0.9245 | 0.0 | 0.0667 | RC2/v1.1 release or local run |
| Qwen2.5-Coder-7B-Instruct (base, no Nullsec adapter) | 111 | 4 | 0.3333 | 0.0093 | 0.018 | 0.0 | 0.5 | base model, no Nullsec adapter |
| Semgrep (local rules baseline) | 111 | 111 | 0.8627 | 0.4074 | 0.5535 | 0.5625 | 0.3333 | static rules; partial category coverage |
| Claude API baseline (claude-opus-4-8) | 111 | 68 | 0.8889 | 0.5185 | 0.655 | 0.0 | 0.1429 | hosted API baseline; model id/date in report |
| OpenAI/Codex API baseline (gpt-5.3-codex) | 111 | 105 | 0.6169 | 0.8796 | 0.7252 | 0.0 | 0.6 | hosted API baseline; model id/date in report |
The GitHub Release records 111/111 raw model outputs. The comparison table's
Outputs / analyzable column uses results.summary.total_outputs from the
report, which counts outputs that were alignable and scorable as structured
verdicts by the benchmark pipeline. For the Nullsec-S1 report used here, one raw
output was not alignable for scoring, so the comparison table shows 110.
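To inspect that count directly, the key path `results.summary.total_outputs` can be read from a report file. This sketch assumes the report is a JSON object nested in exactly that way; the helper names are illustrative.

```python
import json


def total_outputs(report: dict):
    """Follow results.summary.total_outputs; return None if any level is absent."""
    node = report
    for key in ("results", "summary", "total_outputs"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def load_total_outputs(path):
    """Read a SUITE.json report from disk and return its scorable-output count."""
    with open(path, encoding="utf-8") as handle:
        return total_outputs(json.load(handle))
```

For example, `load_total_outputs("benchmarks/reports/SUITE.json")` would return the number shown in the Outputs / analyzable column for that report, provided the file exists locally.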
- Nullsec-S1 has the highest precision (0.9423), recall (0.9074), and F1 (0.9245) and the lowest hallucination rate (0.0667) of the five systems on this repo-authored benchmark.
- Base Qwen2.5-Coder-7B-Instruct mostly failed to produce scorable Nullsec-style JSON security verdicts. This shows why the fine-tune and deterministic alignment layer matter for this output format and task.
- Semgrep detects some static patterns with high precision, but has partial category coverage and lower recall on this benchmark. This is a local-rules Semgrep baseline on the Nullsec benchmark, not a general claim about Semgrep quality.
- Claude Opus 4.8 has high precision but lower recall and fewer analyzable structured outputs on this benchmark.
- OpenAI/Codex has high recall but lower precision and a higher hallucination rate on this benchmark.
- The benchmark is security-specific and repo-authored; it is not an independent third-party benchmark.
- Baseline comparisons are meaningful only when all systems are run on the same dataset version.
- Results should be reproduced from the scripts above; do not hand-enter metrics.
- Semgrep is not expected to cover all categories and should be interpreted as a static-analysis baseline, not a security LLM.
- Provider API models can change over time. Exact model IDs and run dates are recorded in generated reports.
- Frontier/API baseline numbers here are benchmark-run outputs, not general claims about those providers.
- This comparison does not prove universal vulnerability detection performance.
- Do not claim Nullsec-S1 beats another model or tool unless the comparison script output proves it.