junainfinity/ZeroFuse


⚡ ZeroFuse

Point it at a model. It conducts trials, identifies the refusal architecture, and abliterates it.

Automated, capability-preserving refusal removal for open-weight transformer LLMs — a direct weight edit that produces a standard Hugging Face model with zero inference-time overhead.

Created by osmAPI.com

License: MIT · Python 3.11+ · Built with PyTorch · 🤗 Transformers · Optuna · Agent-native: MCP · Status: v0.1.0


🇮🇳 osmAPI.com is the only provider in India offering abliterated models via API. ZeroFuse is the engine that powers them.

ZeroFuse turns guardrail removal into a one-command, fully-automated optimization problem — no hand-picking layers, no guessing strengths, no retraining. It estimates the model's refusal direction, orthogonalizes it out of the residual-writing weights, and uses a two-objective search to preserve capability. The output is a standard Hugging Face checkpoint you can load, quantize, or serve like any other.


Why ZeroFuse

Most abliteration workflows are manual: you pick a layer, eyeball a strength coefficient, run the model, check whether it still refuses, and repeat — often degrading the model's general capabilities along the way. ZeroFuse replaces that loop with a principled, automated search.

  • Fully automatic — no hand-picking layers, directions, or strengths. You point it at a model and it conducts the trials.
  • Capability-preserving by design — KL divergence from the original model is an explicit optimization objective, co-minimized alongside refusals, not an afterthought.
  • A real weight edit, not a runtime adapter — orthogonalizes the refusal direction directly out of attention o_proj and MLP down_proj (W' = W − strength · r(rᵀW)). The saved model has zero inference-time overhead: no LoRA to load, no runtime hooks, no wrapper.
  • Pareto-front control — a two-objective Optuna TPE search hands you the full trade-off curve. Pick the point you want: fewest refusals, lowest KL, or the knee.
  • Grounded in published research — difference-of-means refusal direction (Arditi et al. 2024) with optional projected refinement (grimjim 2025) to reduce collateral damage.
  • Broad model support — dense models, MoE (including per-expert down_proj), and many multimodal nestings.
  • Resumable — Optuna studies are journaled to disk; re-run the same command to continue where you left off.
  • Agent-native — ships an MCP server for Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP client. Quiet by default, verbose on request.
  • One command, or one import: zerofuse --model <hf-id-or-path>, or from zerofuse import abliterate.

🚀 One Command

# Clone and install (editable)
git clone https://github.com/junainfinity/ZeroFuse.git
cd ZeroFuse
pip install -e .                # core
pip install -e ".[mcp]"         # + the agent/MCP server (optional)

# Point it at any Hugging Face model id or local path
zerofuse --model meta-llama/Llama-3.1-8B-Instruct

That's the whole loop. ZeroFuse:

  1. Loads the target model and captures residual-stream activations.
  2. Identifies the refusal direction via difference-of-means on harmful vs. harmless prompts.
  3. Conducts trials — a two-objective Optuna search over layers and strengths, co-minimizing refusals and KL divergence.
  4. Abliterates by orthogonalizing the chosen direction out of the weights.
  5. Writes a standard Hugging Face model directory you can load with from_pretrained, with no special runtime required.

# Resume an interrupted run — same command, picks up the journaled study
zerofuse --model meta-llama/Llama-3.1-8B-Instruct

# Quiet (only high-level phases) — or fully verbose
zerofuse --model meta-llama/Llama-3.1-8B-Instruct --quiet
zerofuse --model meta-llama/Llama-3.1-8B-Instruct --verbose

Note

ZeroFuse needs enough memory to load and run forward passes on the target model. Plan capacity for the model you point it at.

🔬 How It Works

ZeroFuse implements the published "refusal direction" line of research as a clean-room MIT build, wrapped in an automated optimizer.

1. Estimate the refusal direction

It captures residual-stream activations on a set of harmful and harmless prompts and takes the difference of means. The unit refusal direction is:

$$ r \;=\; \frac{\mu_{\text{harmful}} - \mu_{\text{harmless}}}{\lVert \mu_{\text{harmful}} - \mu_{\text{harmless}} \rVert} $$

where $\mu_{\text{harmful}}$ and $\mu_{\text{harmless}}$ are the mean residual-stream activations over harmful and harmless prompts respectively (Arditi et al., 2024). An optional projected refinement step (grimjim, 2025) sharpens the estimate to reduce collateral damage.
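As a minimal sketch of the difference-of-means step, the snippet below computes a unit direction from two activation arrays. It assumes NumPy and uses synthetic data in place of captured residual-stream activations; the function name `refusal_direction` is hypothetical and is not taken from ZeroFuse's directions.py. The optional projected refinement is not shown.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit difference-of-means direction from two (n_prompts, d_model) arrays."""
    harmful_acts = np.asarray(harmful_acts, dtype=np.float64)
    harmless_acts = np.asarray(harmless_acts, dtype=np.float64)
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to estimate")
    return diff / norm

# Synthetic stand-in for captured activations (an assumption for illustration):
# "harmful" rows are shifted along axis 0, so the direction should point there.
rng = np.random.default_rng(0)
d_model = 16
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model))
harmful[:, 0] += 5.0

r = refusal_direction(harmful, harmless)
print("norm of r:", float(np.linalg.norm(r)))
print("largest component index:", int(np.argmax(np.abs(r))))
```

With real models the two arrays would hold activations at a chosen layer and token position; which layer and position ZeroFuse uses is not stated here, so the sketch leaves that open.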

2. Orthogonalize it out of the weights

Rather than subtract the direction at runtime, ZeroFuse edits the weights that write into the residual stream so they can no longer contribute along $r$:

$$ W' \;=\; W \;-\; \text{strength} \cdot r\,(r^{\top} W) $$

This is applied to the attention output projection (o_proj) and the MLP down-projection (down_proj), including MoE experts. The scalar strength controls how much of the $r$-component is removed: at strength = 1 this is a full orthogonal projection that removes the component entirely; smaller values remove it partially. Because the edit lives in the weights, the resulting model is indistinguishable in shape and speed from the original.
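The edit can be checked numerically on a random matrix. The sketch below assumes NumPy and a weight of shape (d_model, d_in), which matches how PyTorch stores a linear layer that writes into the residual stream; the helper name `orthogonalize` is hypothetical and is not ZeroFuse's API.

```python
import numpy as np

def orthogonalize(W, r, strength=1.0):
    """Return W - strength * r (r^T W) for W of shape (d_model, d_in)."""
    r = np.asarray(r, dtype=np.float64)
    r = r / np.linalg.norm(r)  # the formula assumes a unit vector
    return W - strength * np.outer(r, r @ W)

rng = np.random.default_rng(0)
d_model, d_in = 8, 12
W = rng.normal(size=(d_model, d_in))
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)

W_full = orthogonalize(W, r, strength=1.0)
W_half = orthogonalize(W, r, strength=0.5)

# Full strength: nothing the layer writes has a component along r.
full_removed = np.allclose(r @ W_full, 0.0)
# Partial strength: the r-component is scaled by (1 - strength).
half_scaled = np.allclose(r @ W_half, 0.5 * (r @ W))
# Directions orthogonal to r are left untouched.
u = rng.normal(size=d_model)
u -= (u @ r) * r
orthogonal_kept = np.allclose(u @ W_full, u @ W)

print(full_removed, half_scaled, orthogonal_kept)
```

Because r has unit norm, r^T W' = (1 - strength) · r^T W, which is why strength = 1 gives a full projection and smaller values a partial one.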

3. Search the Pareto front

Choosing layers and strengths by hand is the hard part — so ZeroFuse doesn't. It runs an Optuna TPE multi-objective search that co-minimizes two objectives:

$$ \min \;\big(\; N_{\text{refusals}}, \;\; D_{\mathrm{KL}}(P_{\text{orig}} \,\Vert\, P_{\text{edited}}) \;\big) $$

  • $N_{\text{refusals}}$ — how often the edited model still refuses, scored by the evaluator.
  • $D_{\mathrm{KL}}$ — how far the edited model's output distribution has drifted from the original, as a proxy for lost capability.
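To make the KL objective concrete, here is a small sketch that computes the mean of KL(P_orig ‖ P_edited) from two sets of logits. It assumes NumPy and random stand-in logits; which prompts and token positions ZeroFuse's evaluator actually scores is not specified here, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(orig_logits, edited_logits):
    """Mean over positions of KL(P_orig || P_edited), in nats."""
    logp = log_softmax(orig_logits)
    logq = log_softmax(edited_logits)
    kl_per_position = (np.exp(logp) * (logp - logq)).sum(axis=-1)
    return float(kl_per_position.mean())

rng = np.random.default_rng(0)
orig = rng.normal(size=(4, 32))  # (positions, vocab) stand-in logits
small_edit = orig + 0.01 * rng.normal(size=orig.shape)
large_edit = orig + 1.00 * rng.normal(size=orig.shape)

kl_same = mean_kl(orig, orig)
kl_small = mean_kl(orig, small_edit)
kl_large = mean_kl(orig, large_edit)
print(kl_same, kl_small, kl_large)
```

An unedited model scores zero, and the value grows as the edited distribution drifts, which is the behavior the search relies on when it treats KL as a proxy for lost capability.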

The result is a Pareto front of non-dominated configurations. You choose the operating point that fits your goal — fewest refusals, lowest KL, or the knee of the curve — and ZeroFuse materializes that exact weight edit.

refusals
  ^
  |  x
  |   x
  |     x  <- knee
  |        x x
  |            x x x
  +-------------------> KL divergence
   (each x = a non-dominated trial on the Pareto front)
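The selection step can be illustrated without any optimizer. The sketch below filters a list of made-up trial results down to the non-dominated set and picks an operating point. The trial values are invented for illustration, the selection names other than "knee" are hypothetical, and the knee rule used here (closest point to the ideal corner after normalizing both objectives) is one common heuristic, not necessarily the one ZeroFuse implements.

```python
import math

def pareto_front(trials):
    """Keep trials not dominated on (refusals, kl); both are minimized."""
    front = []
    for t in trials:
        dominated = any(
            o["refusals"] <= t["refusals"] and o["kl"] <= t["kl"]
            and (o["refusals"] < t["refusals"] or o["kl"] < t["kl"])
            for o in trials
        )
        if not dominated:
            front.append(t)
    return sorted(front, key=lambda t: t["kl"])

def pick(front, selection="knee"):
    """Choose one operating point from the front."""
    if selection == "fewest_refusals":  # hypothetical selection name
        return min(front, key=lambda t: (t["refusals"], t["kl"]))
    if selection == "lowest_kl":        # hypothetical selection name
        return min(front, key=lambda t: (t["kl"], t["refusals"]))

    # "knee" (assumed heuristic): normalize both objectives to [0, 1]
    # and take the point closest to the ideal corner (0, 0).
    def normalized(key, value):
        lo = min(t[key] for t in front)
        hi = max(t[key] for t in front)
        return (value - lo) / ((hi - lo) or 1.0)

    return min(
        front,
        key=lambda t: math.hypot(
            normalized("refusals", t["refusals"]), normalized("kl", t["kl"])
        ),
    )

# Invented trial results, for illustration only.
trials = [
    {"id": 0, "refusals": 40, "kl": 0.01},
    {"id": 1, "refusals": 12, "kl": 0.05},
    {"id": 2, "refusals": 5, "kl": 0.10},
    {"id": 3, "refusals": 3, "kl": 0.40},
    {"id": 4, "refusals": 2, "kl": 0.90},
    {"id": 5, "refusals": 20, "kl": 0.30},  # dominated by trials 1 and 2
    {"id": 6, "refusals": 5, "kl": 0.50},   # dominated by trials 2 and 3
]

front = pareto_front(trials)
front_ids = [t["id"] for t in front]
knee = pick(front, "knee")
print(front_ids, knee["id"])
```

Normalizing before measuring distance matters because refusal counts and KL values live on very different scales; without it the larger-valued objective would dominate the choice.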

⚖️ ZeroFuse vs. the Alternatives

| Capability | ZeroFuse | Manual abliteration | Fine-tuning |
| --- | --- | --- | --- |
| Setup effort | One command: point it at an HF id or path; layers and strengths are picked automatically | Hand-select target layers, directions, and strengths through trial and error | Assemble a dataset, configure a training run, and manage compute |
| Weights vs. runtime | Direct weight edit: orthogonalizes the refusal direction out of o_proj and down_proj | Also a weight edit, but applied manually with chosen parameters | Updates weights via gradient descent over a training corpus |
| Capability preservation | KL divergence from the original model is an explicit optimization objective | Depends on the operator's manual tuning; no built-in capability objective | Risk of catastrophic forgetting; mitigation depends on data and hyperparameters |
| Tuning the trade-off | Two-objective Optuna TPE search yields a Pareto front; pick fewest refusals, lowest KL, or the knee | Re-run by hand and eyeball results; no systematic Pareto search | Adjust data mix and hyperparameters and retrain to shift the trade-off |
| Inference-time overhead | None: output is a standard Hugging Face model | None if done as a weight edit; runtime adapters add overhead | None for a full fine-tune; LoRA adapters add overhead unless merged |
| Compute cost | Runs trials and a KL/refusal search; no gradient-based retraining | Low compute, but high human time per iteration | Highest: training compute proportional to model and dataset size |
| Resumability | Optuna studies journaled to disk; re-run the same command to continue | Manual: depends on your own bookkeeping | Checkpoint-based resume, depending on the training framework |
| Agent / automation | Ships an MCP server for Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP client | None built in | None built in |
| Output format | Standard Hugging Face model: load, quantize, or serve like any other | Modified model; format depends on the tooling used | Standard weights or a LoRA adapter, depending on method |
| Model support | Dense, MoE (per-expert down_proj), and many multimodal nestings; pure state-space out of scope | Whatever the operator manually implements support for | Broad, subject to framework support for the architecture |

🤖 Agent-native / MCP

ZeroFuse ships a built-in Model Context Protocol server, so an agent can drive the whole pipeline as a tool. It works in Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP-compatible client.

Install the optional dependency and add it to your MCP client config:

pip install -e ".[mcp]"

{
  "mcpServers": {
    "zerofuse": {
      "command": "zerofuse-mcp"
    }
  }
}

The zerofuse-mcp command is installed alongside the CLI by pip install -e ".[mcp]". The config above is plain JSON, so it carries no inline comments.

It exposes a single abliterate tool and is designed to be a well-behaved citizen of an agent's context window:

  • Quiet by default. The harness sees only high-level phases — identifying refusal architecture, conducting trials, abliterating — not a firehose of internals.
  • Opt-in detail. Per-trial metrics, layer choices, and KL traces are emitted at MCP debug log level and surface only if the harness opts in to debug logs.
  • Override when you want it. A verbose argument forces full detail regardless of log level.

This keeps long-running optimization runs legible to an agent instead of flooding it with token-heavy progress chatter. See docs/agents.html for per-harness setup.
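The quiet-by-default behavior can be sketched with Python's standard logging module: phases go out at INFO, details at DEBUG, and a verbose flag promotes details so they pass a non-debug threshold. This is an illustration of the pattern only. The `Reporter` and `ListHandler` classes are hypothetical, and the real server emits through MCP log levels rather than the stdlib logger used here.

```python
import logging

class Reporter:
    """Hypothetical reporter: phases at INFO, per-trial details at DEBUG."""

    def __init__(self, logger, verbose=False):
        self.logger = logger
        self.verbose = verbose

    def phase(self, name):
        self.logger.info("phase: %s", name)

    def detail(self, message):
        # verbose promotes details to INFO so they pass a non-debug threshold
        level = logging.INFO if self.verbose else logging.DEBUG
        self.logger.log(level, "detail: %s", message)

class ListHandler(logging.Handler):
    """Collects formatted messages so the demo can inspect what was emitted."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(record.getMessage())

def demo(verbose):
    logger = logging.getLogger(f"zerofuse_sketch.verbose_{verbose}")
    logger.setLevel(logging.INFO)  # the harness has not opted in to debug logs
    logger.propagate = False
    handler = ListHandler()
    logger.addHandler(handler)
    reporter = Reporter(logger, verbose=verbose)
    reporter.phase("conducting trials")
    reporter.detail("trial 7: refusals=4 kl=0.08")
    logger.removeHandler(handler)
    return handler.lines

quiet_lines = demo(verbose=False)
verbose_lines = demo(verbose=True)
print(quiet_lines)
print(verbose_lines)
```

Keeping the level decision in one place means a harness controls verbosity through its log threshold, while the verbose argument remains an explicit override.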

🐍 Python API

Everything the CLI does is available as a library:

from zerofuse import abliterate

# One call: returns the saved HF model dir + the Pareto front to pick from.
result = abliterate("meta-llama/Llama-3.1-8B-Instruct", n_trials=100)
print(result.selected.refusals, result.selected.kl, result.output_dir)

Or build a full configuration explicitly:

from zerofuse import ZeroFuseConfig, run

config = ZeroFuseConfig.from_dict({
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "optimization": {"n_trials": 100},
})
result = run(config, selection="knee")

🧩 Supported Models

ZeroFuse is built to work on most open-weight transformer models you point it at:

| Architecture | Support |
| --- | --- |
| Dense transformer LLMs | ✅ Supported |
| Mixture-of-Experts (per-expert down_proj) | ✅ Supported |
| Multimodal nestings with a transformer LLM backbone | ✅ Many supported |
| Pure state-space models | ❌ Out of scope |

Because architectures vary, ZeroFuse is designed to generalize across these families rather than guaranteed to abliterate every model — it adapts to the residual-writing weights it finds.
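One way to picture "adapting to the residual-writing weights it finds" is a name-based scan over parameter names. The sketch below is an assumption about how such discovery could work, not a description of model.py: the regular expression, the function name, and the example names are illustrative, although the dense names follow the layout used by Llama-style Hugging Face checkpoints.

```python
import re

# Matches parameters whose module name ends in o_proj or down_proj.
RESIDUAL_WRITERS = re.compile(r"(^|\.)(o_proj|down_proj)\.weight$")

def residual_writing_weights(param_names):
    """Return the parameter names that write into the residual stream."""
    return [name for name in param_names if RESIDUAL_WRITERS.search(name)]

# Illustrative names: a dense layer, an MoE layer with per-expert down_proj,
# and a multimodal nesting where the LLM sits under a prefix.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.1.mlp.experts.0.down_proj.weight",
    "model.layers.1.mlp.experts.1.down_proj.weight",
    "language_model.model.layers.0.self_attn.o_proj.weight",
]

targets = residual_writing_weights(names)
for name in targets:
    print(name)
```

With a loaded PyTorch model, the names would come from `model.named_parameters()`. Architectures that name these projections differently would need their own patterns, which is one reason support is described as broad rather than guaranteed.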

📁 Project Structure

ZeroFuse/
├── src/zerofuse/
│   ├── config.py        # Run configuration & defaults (TOML + CLI)
│   ├── prompts.py       # Harmful / harmless prompt loading + batching
│   ├── directions.py    # Pure math: difference-of-means, projected refinement
│   ├── model.py         # Loading, activation capture, weight orthogonalization
│   ├── evaluator.py     # Scoring: refusal detection + KL divergence
│   ├── optimizer.py     # Optuna TPE search + Pareto-front selection
│   ├── pipeline.py      # End-to-end orchestration
│   ├── reporting.py     # Quiet-by-default progress (phases vs. details)
│   ├── cli.py           # `zerofuse` command-line entrypoint
│   └── mcp_server.py    # Model Context Protocol server (agent-native)
├── docs/                # Self-contained HTML documentation site
├── config/default.toml  # Fully-commented configuration template
└── tests/               # Unit tests for the pure-logic parts

Each module has a single responsibility. directions.py is pure math — no model objects, easy to test and audit. model.py is the only place weights are touched.

❓ FAQ

How does ZeroFuse remove refusals without retraining?

It estimates the model's "refusal direction" via difference-of-means of residual-stream activations on harmful vs. harmless prompts (Arditi et al. 2024, arXiv:2406.11717), then orthogonalizes that direction out of the residual-writing weights — attention o_proj and MLP down_proj, including MoE experts — using W' = W − strength · r(rᵀW). No gradient-based training is involved; it's a direct edit to the existing weights.

Will abliteration degrade the model's capabilities?

ZeroFuse is built to minimize that. KL divergence from the original model is an explicit optimization objective alongside the number of refusals, and a two-objective Optuna TPE search produces a Pareto front so you can choose how to balance fewest refusals against lowest KL. There's also an optional projected refinement step (grimjim 2025) designed to reduce collateral damage. As a v0.1.0 project, these are design goals rather than independently benchmarked guarantees.

What kinds of models does it work on?

It's designed to work on most open-weight transformer models you point it at — dense models, MoE models (including per-expert down_proj), and many multimodal nestings. Pure state-space models are out of scope.

What do I actually get as output, and is there any runtime cost?

You get a standard Hugging Face model — the refusal behavior is edited into the weights themselves, not delivered as a runtime LoRA adapter. That means zero inference-time overhead: you can load, quantize, and serve it exactly like any other Hugging Face model.

How do I run it, and how does the MCP / agent integration work?

Install with pip install -e . and run zerofuse --model <hf-id-or-path>, or use the Python API with from zerofuse import abliterate. ZeroFuse also ships an MCP server (pip install -e ".[mcp]") that works in Claude Desktop, OpenAI Codex, Google AntiGravity, and any MCP client. It's quiet by default — the harness sees only high-level phases like "identifying refusal architecture," "conducting trials," and "abliterating," with internal details emitted at MCP debug log level and shown only if the harness opts in (a verbose flag overrides). Runs are resumable: Optuna studies are journaled to disk, so re-running the same command continues where you left off.

What about licensing and responsible use?

ZeroFuse is MIT-licensed. It is an independent clean-room build, based on published papers and an Apache-2.0 reference implementation, and it copies no copyleft tool. Citations are documented in NOTICE.md. The MIT license covers only the tool, not the models you produce. Because ZeroFuse reduces a model's guardrails, you are responsible for complying with the base model's license and acceptable-use policy, applicable law, and any platform terms that apply to the models you create and deploy.

📌 Status

v0.1.0: new project. ZeroFuse is early. The method is grounded in published research and the implementation is built to preserve capability, but it has not yet been independently benchmarked at scale. Where this README says designed to or built to, the wording is deliberate: it describes what the implementation is constructed to do, not a result that has been third-party verified. No benchmark numbers, star counts, or testimonials are presented here because there are none to report honestly yet. Issues and reproductions are welcome.

🛡️ Responsible Use

ZeroFuse reduces or removes safety guardrails from model weights. That capability carries real responsibility.

  • You are responsible for compliance with the base model's license and acceptable-use policy, all applicable law, and the terms of any platform you deploy on.
  • The MIT license covers this tool only. It does not grant you any rights over the models you produce or process, and it does not change your responsibility for them. Those models are governed by the original model's license.
  • Use it on models you are permitted to modify, for purposes you are permitted to pursue.

Removing guardrails does not remove accountability. Think before you point it at something.

📜 Provenance & License

ZeroFuse is created and maintained by osmAPI.com — the only provider in India offering abliterated models via API.

It is an independent, clean-room implementation built from published papers and an Apache-2.0 reference implementation. It does not copy, vendor, or derive from any copyleft tool. Citations and attributions are documented in NOTICE.md.

The tool is released under the MIT License. The MIT license covers the tool only — not the models you produce with it.

📚 Citations

  • Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). Refusal in Language Models Is Mediated by a Single Direction. NeurIPS 2024. arXiv:2406.11717
  • Jim Lai (grimjim) (2025). Projected & norm-preserving refinements of the refusal direction for reduced collateral damage. Hugging Face blog.

See NOTICE.md for the full reference list and attributions.

🤝 Contributing

PRs are welcome. Good first contributions: new model-family adapters, additional refusal evaluators, prompt-set improvements, and docs. Please keep directions.py pure and confine weight mutation to model.py.


ZeroFuse · created by osmAPI.com · MIT · built with PyTorch · Optuna · 🤗 Transformers

Point it at a model. It does the rest.
