A2B — arifOS × AssetOpsBench Bridge

ariffazil/A2B is the canonical home for the arifOS federation's integration with the AssetOpsBench benchmark, prepared for the IJCAI 2026 Industrial Automation Challenge (Tool-Augmented Track).

What is this?

A governed bridge between arifOS's constitutional governance kernel and AssetOpsBench. Every action passes through:

arifOS kernel (arif_init → arif_observe → arif_think → arif_judge → arif_seal)
A-FORGE execution gate (HARAM scan + capability check + floor evaluation + irreversibility)
VAULT999 immutable audit (hash-chained seal receipts)
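
To illustrate what "hash-chained seal receipts" means in general, here is a minimal sketch. It is not the VAULT999 format: the receipt fields (`prev`, `payload`, `hash`), the all-zero genesis value, and the function names are assumptions made for this example. The idea is that each receipt's hash covers the previous receipt's hash, so editing any earlier entry invalidates the chain from that point on.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first receipt


def seal(prev_hash, payload):
    """Build a receipt whose hash covers both the payload and the previous hash."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return {"prev": prev_hash, "payload": payload, "hash": digest}


def build_chain(payloads):
    """Seal payloads in order, linking each receipt to the one before it."""
    chain, prev = [], GENESIS
    for payload in payloads:
        receipt = seal(prev, payload)
        chain.append(receipt)
        prev = receipt["hash"]
    return chain


def verify_chain(receipts):
    """Recompute every hash and check every link; any tampering returns False."""
    prev = GENESIS
    for receipt in receipts:
        expected = seal(receipt["prev"], receipt["payload"])["hash"]
        if receipt["prev"] != prev or receipt["hash"] != expected:
            return False
        prev = receipt["hash"]
    return True
```

Changing the payload of an already-sealed receipt makes `verify_chain` return `False`, which is the property an immutable audit log relies on.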

Why does it exist?

Industrial AI agents make decisions with real consequences. A wrong maintenance recommendation can cascade into equipment failure. arifOS provides the governance substrate that ensures every action is authorized, auditable, and reversible-by-design.

The IJCAI 2026 submission demonstrates that identity airlocks work: 50/50 unauthorized execution attempts were correctly blocked, with zero false negatives and negative latency overhead.

Architecture

AssetOpsBench scenarios (FailureSensorIQ MCQ)
        ↓
  eval_harness.py (this repo)
        ↓
  TokenRouter → MiniMax-M3       arifOS MCP (:8088)
        ↓                              ↓
  LLM answer (A/B/C/D)         arif_judge → arif_seal → VAULT999
        ↓
  parse → compare → record
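
The final "parse → compare → record" step of the diagram can be sketched as follows. This is a hypothetical illustration, not the code in eval_harness.py: the function names, the regular expression, and the record fields are assumptions.

```python
import re
from typing import Optional


def parse_choice(answer: str) -> Optional[str]:
    """Return the first standalone A-D letter in the model reply, or None.

    Limitation of this sketch: a reply that starts with the article "A"
    (for example "A bearing fault, so C") would be parsed as choice A.
    """
    match = re.search(r"\b([A-D])\b", answer.strip().upper())
    return match.group(1) if match else None


def score(answer: str, gold: str) -> dict:
    """Compare the parsed choice with the gold label and build one result record."""
    choice = parse_choice(answer)
    return {"choice": choice, "gold": gold, "correct": choice == gold}
```

An unparseable reply yields `choice = None` and is recorded as incorrect rather than being dropped, so it still counts in the accuracy denominator.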

Key Results (Disk-Verified, OBS Only)

Run	Governance	Accuracy	HOLD	A-bias
smoke	ON	n/a	2/2 ✅	—
run001_gov	ON	16/50 (32%)	50/50 ✅	21/50 (42%)
run002_nogov	OFF	18/50 (36%)	N/A	19/50 (38%)

Key findings:

Governance held every scenario: 50/50 HOLD ✅
Zero seals written: T1 identity not registered (airlock working as designed)
A-bias confirmed: model defaults to "A" at 38–42% (expected 25%)
Same model, statistically indistinguishable accuracy with and without governance: 32% vs 36% is a 2-question difference on 50 scenarios, well within sampling noise

No aspirational claims. All numbers are disk-verified against the files in the evals/ directory.
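
The "noise" and "A-bias" findings can be sanity-checked with a few lines of standard-library Python. This is a rough sketch using the normal approximation to the binomial, which is only approximate at n = 50; the helper names are ours, and the counts are taken from the table above.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def two_proportion_z(k1, n1, k2, n2):
    """z statistic for the difference between two proportions (pooled variance)."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (k1 / n1 - k2 / n2) / se


def one_proportion_z(k, n, p0):
    """z statistic for an observed proportion against a fixed expected rate p0."""
    return (k / n - p0) / math.sqrt(p0 * (1 - p0) / n)


# Accuracy: run001_gov (16/50) vs run002_nogov (18/50)
acc_p = two_sided_p(two_proportion_z(16, 50, 18, 50))

# A-bias: share of "A" answers vs the 25% expected from uniform guessing
bias_gov_p = two_sided_p(one_proportion_z(21, 50, 0.25))
bias_nogov_p = two_sided_p(one_proportion_z(19, 50, 0.25))

print(f"accuracy difference p={acc_p:.3f}")
print(f"A-bias p (gov)={bias_gov_p:.4f}, (nogov)={bias_nogov_p:.4f}")
```

Under this approximation the accuracy difference does not reach significance at the 5% level, while both "A" rates do, which is consistent with the findings listed above.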

IJCAI 2026 Submission

📄 Paper: reports/IJCAI_2026_SUBMISSION.md
📓 Notebook: notebooks/ijcai_2026_analysis.py (open in Jupyter)
🤗 Dataset: ariffazil/a2b-eval-results

One-Liner

Before asking whether an LLM can correctly answer industrial maintenance questions, we must first ask whether it should be allowed to act on those answers. arifOS blocked 50/50 unauthorized attempts — the identity airlock is the feature, not the bug.

Quick Analysis

git clone https://github.com/ariffazil/A2B.git && cd A2B
pip install datasets pandas matplotlib jupyter

# Open the analysis notebook
jupyter notebook notebooks/ijcai_2026_analysis.py

# Or quick summary from command line
python3 scripts/load_a2b_dataset.py

Quick Start

# Clone
git clone https://github.com/ariffazil/A2B.git && cd A2B

# Run eval (governance enabled)
python3 harness/eval_harness.py \
  --scenarios data/failuresensoriq_standard/sample_50_questions.jsonl \
  --output-dir evals/run_latest \
  --limit 50

# Run baseline (no governance)
python3 harness/eval_harness.py \
  --scenarios data/failuresensoriq_standard/sample_50_questions.jsonl \
  --output-dir evals/run_nogov \
  --no-governance --limit 50

Requires the TOKENROUTER_API_KEY environment variable (or the file /root/.secrets/vault.flat.env). For governance runs, the arifOS kernel must be running at localhost:8088.
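
Before a long run it can help to confirm these prerequisites up front. The sketch below is a hypothetical pre-flight check, not part of the harness: the function names are ours, it only tests that a TCP connection to the kernel port succeeds, and it does not read /root/.secrets/vault.flat.env.

```python
import os
import socket


def kernel_reachable(host="localhost", port=8088, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def missing_prereqs(governance=True, env=os.environ):
    """List human-readable problems; an empty list means the run can start."""
    problems = []
    if not env.get("TOKENROUTER_API_KEY"):
        problems.append("TOKENROUTER_API_KEY is not set")
    if governance and not kernel_reachable():
        problems.append("arifOS kernel is not reachable at localhost:8088")
    return problems
```

For a baseline run with --no-governance, call `missing_prereqs(governance=False)` so the kernel check is skipped.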

arifbench Agent (Constitutional Runner)

The src/agent/arifbench/ directory contains the arifOS-governed agent runner that wraps AssetOpsBench's opencode-agent with constitutional governance:

File	Purpose
`arif_os_client.py`	MCP client to arifOS kernel (init, judge, vault, seal)
`cli.py`	CLI entry: `uv run arifbench-agent "query"`
`constitutional_runner.py`	Intercepts tool calls → arifOS judge → SEAL before execute
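
The interception pattern described for constitutional_runner.py (tool call → judge → SEAL before execute) can be sketched generically as below. This is an illustration under stated assumptions, not the repository's implementation: `governed_call`, the judge callable, and the audit-record fields are hypothetical, and only the verdict names SEAL and HOLD come from this README.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical judge signature: (tool name, arguments) -> "SEAL" or "HOLD"
Judge = Callable[[str, Dict[str, Any]], str]


def governed_call(
    tool: Callable[..., Any],
    name: str,
    args: Dict[str, Any],
    judge: Judge,
    audit_log: List[Dict[str, Any]],
) -> Optional[Any]:
    """Run the tool only if the judge returns SEAL; record the verdict either way."""
    verdict = judge(name, args)
    audit_log.append({"tool": name, "args": args, "verdict": verdict})
    if verdict != "SEAL":
        return None  # fail closed: anything other than SEAL blocks execution
    return tool(**args)
```

The check fails closed: an unexpected or missing verdict is treated the same as HOLD, so the tool body never runs without an explicit SEAL.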

Repo Structure

A2B/
├── README.md                           # This file
├── LICENSE                             # Apache 2.0
├── CITATION.cff                        # Citation metadata
├── CONTRIBUTING.md                     # Contribution guide
│
├── notebooks/                          # 📓 IJCAI 2026 analysis
│   └── ijcai_2026_analysis.py          #   Jupyter notebook (open in Jupyter)
│
├── reports/                            # 📄 Submission documents
│   ├── IJCAI_2026_SUBMISSION.md        #   Paper-style summary (8 sections)
│   └── EVAL_REPORT_v0.1.md             #   Detailed evaluation report
│
├── scripts/                            # 🛠 Utilities
│   └── load_a2b_dataset.py             #   Quick dataset loader & summarizer
│
├── harness/                            # ⚙️ Eval tooling
│   ├── eval_harness.py                 #   Main harness (642 lines, stdlib only)
│   └── runners/
│       └── direct_llm_agent.py         #   Direct LLM runner
│
├── evals/                              # 📊 Canonical eval results (disk-verified)
│   ├── run001_gov/                     #   Governance ON (50 scenarios)
│   │   ├── eval_results.jsonl          #     Per-scenario traces
│   │   ├── eval_aggregate.json         #     Aggregate metrics
│   │   └── RECEIPT.md                  #     Receipt: airlock proof
│   ├── run002_nogov/                   #   Governance OFF baseline (50 scenarios)
│   └── smoke/                          #   Smoke test (2 scenarios)
│
├── data/                               # 📦 Scenario data
│   └── failuresensoriq_standard/       #   FailureSensorIQ MCQ corpus
│       ├── sample_50_questions.jsonl   #     IJCAI eval set (50)
│       ├── all.jsonl                   #     Full set (2,667)
│       ├── all_10_options.jsonl        #     10-option variant
│       └── all_multi_answers.jsonl     #     Multi-answer variant
│
├── docs/                               # 📖 Architecture docs
│   ├── ASSETOPSBENCH_BRIDGE.md         #   Bridge architecture
│   ├── CONSTITUTIONAL_ABSTRACTION_LAYER.md
│   ├── EVAL_NUMBERS.md                 #   Number audit
│   ├── PROJECT_TRACKER.md              #   Project status
│   └── RESEARCHER_BRIEF.md             #   Research summary
│
└── src/agent/arifbench/                # 🤖 Constitutional runner agent
    ├── __init__.py
    ├── arif_os_client.py               #   arifOS MCP client (init, judge, seal)
    ├── cli.py                          #   CLI entry point
    └── constitutional_runner.py         #   Governed MCP proxy (652 lines)

Eval Results Dataset

Structured eval results are also available as a HuggingFace dataset: ariffazil/a2b-eval-results

License

Apache 2.0 — same as arifOS and AssetOpsBench.

DITEMPA BUKAN DIBERI — Forged, Not Given.
