Masubi

An autonomous research system that iteratively improves an email trust scorer. An AI agent proposes code changes, the system evaluates them against three independent gates, and keeps or discards via git ratcheting. Emails are scored across 10 trust dimensions -- not binary spam/not-spam.

How It Works

                    +-------------------+
                    |   Agent (Opus)    |
                    |  proposes edit to |
                    |     train.py      |
                    +---------+---------+
                              |
                              v
                    +-------------------+
                    |   Score email     |
                    |   chains on 10   |
                    |   trust axes     |
                    +---------+---------+
                              |
                              v
              +---------------+---------------+
              |           Three Gates         |
              |                               |
              |  1. Composite score improved? |
              |  2. No axis regressed on      |
              |     human-verified gold set?  |
              |  3. Explanations cover all    |
              |     flagged axes?             |
              +-------+-----------+-----------+
                      |           |
                 ALL PASS      ANY FAIL
                      |           |
                      v           v
               git commit    git revert
               (keep edit)   (discard)
                      |           |
                      +-----+-----+
                            |
                            v
                   Repeat until budget
                   or time limit hit

The agent can only edit one file (train.py). Everything else -- the evaluation policy, data pipeline, and scoring spec -- is frozen. The agent can be creative, but it can't game the metrics or weaken the gates.

Quick Start

Prerequisites: Python 3.12+, uv

# One-shot setup
./setup.sh

# Or manually:
uv sync
cp .env.example .env   # add ANTHROPIC_API_KEY and HYPERBOLIC_API_KEY
uv run python -m autotrust.data build-eval
uv run python -m autotrust.data build-gold

Run It

# Stage 1: prompt optimization (dashboard opens automatically)
uv run python run_loop.py --max-experiments 5 --eval-limit 100

# Full eval set (slower)
uv run python run_loop.py

# Stage 2: student model training
uv run python run_loop.py --stage train --max-experiments 5

# Skip auto-launching dashboard
uv run python run_loop.py --max-experiments 5 --no-dashboard

The dashboard opens in your browser automatically. Stage 1 auto-transitions to Stage 2 after 3 consecutive no-improvement experiments.

Dashboard Only

uv run python dashboard.py

The dashboard detects CLI-started runs automatically.

API Keys

Add to .env:

ANTHROPIC_API_KEY -- Claude Opus (agent + judge)
HYPERBOLIC_API_KEY -- Qwen3-80B scorer + GPU training

Architecture

Two stages, same three-gate evaluation:

Stage	What Happens
1 - Prompt Optimization	Agent (Claude Opus) improves scoring prompts in `train.py`. Scorer calls Qwen3-80B on Hyperbolic.
2 - Student Training	Agent trains a compact model (50-200M params) on Hyperbolic H100s. Dense baseline first, then MoE.

Five provider roles, all configured in spec.yaml:

Role	Backend	Model
Agent	Anthropic	claude-opus-4-6
Scorer	Hyperbolic	Qwen3-Next-80B-A3B-Instruct
Judge	Anthropic	claude-opus-4-6 (primary), claude-sonnet-4-6 (fallback)
Generator	Local Ollama	dolphin3 (uncensored, for synth data)
Trainer	Hyperbolic GPU	H100 (on-demand, for Stage 2)

Three layers with strict boundaries:

Layer	Contents	Who Touches It
Spec	`spec.yaml` -- axes, weights, providers, caps	Humans only
Platform	eval, data, providers, observability	Humans only, TDD'd
Mutable	`train.py` -- the ONE file the agent edits	Agent only

Trust Axes

Axis	Type	Weight
phish	binary	0.22
truthfulness	continuous	0.18
manipulation	continuous	0.13
deceit	continuous	0.10
vulnerability_risk	continuous	0.10
authority_impersonation	continuous	0.10
subtle_toxicity	continuous	0.08
polarization	continuous	0.05
classic_email_metrics	continuous	0.04
verify_by_search	binary	0.00

Three Gates

Composite improved -- weighted score must go up
Gold-set veto -- no single axis may regress vs human labels. ANY regression = reject.
Explanation gate -- model must explain which axes it flagged and why

Configuration

All configuration lives in spec.yaml: trust axes, providers (agent, scorer, judge), budget limits, calibration policy, safety rules, and Stage 2 caps.

Development

uv run pytest -v              # Run tests
uv run ruff check .           # Lint
uv run ruff format .          # Format

Name		Name	Last commit message	Last commit date
Latest commit History 74 Commits
autotrust		autotrust
docs		docs
eval_set		eval_set
gold_set		gold_set
synth_data		synth_data
teacher		teacher
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CURSOR_PLAN.md		CURSOR_PLAN.md
README.md		README.md
annotation_rubric.md		annotation_rubric.md
dashboard.py		dashboard.py
program.md		program.md
pyproject.toml		pyproject.toml
run_loop.py		run_loop.py
setup.sh		setup.sh
spec.yaml		spec.yaml
starting_train.py		starting_train.py
starting_train_stage2.py		starting_train_stage2.py
train.py		train.py
train_stage1_archive.py		train_stage1_archive.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Masubi

How It Works

Quick Start

Run It

Dashboard Only

API Keys

Architecture

Trust Axes

Three Gates

Configuration

Development

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Masubi

How It Works

Quick Start

Run It

Dashboard Only

API Keys

Architecture

Trust Axes

Three Gates

Configuration

Development

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages