A Pharmakon framework for dual-use LLM evaluation
⚠️ PyPI name notice: A package named `dose` already exists on PyPI from a different author. Do NOT run `pip install dose`. This project is distributed via GitHub source only until a unique PyPI name is chosen. See Installation for the correct install method.
dose measures how much the helpful and harmful capability subspaces of a language model overlap — and how that overlap changes as intervention strength varies. The core metric, PSI (Pharmakon Separability Index), quantifies whether a model's internal representations treat safe and unsafe behaviors as geometrically separable directions in activation space.
Inspired by Urbina et al. (2022), who showed that the same AI used for drug discovery can be redirected to generate chemical weapons with minimal effort, dose extends that dual-use framing to the activation geometry of language models and measures it quantitatively.
- PSI: Pharmakon Separability Index. Computed via truncated SVD on paired activation matrices. `PSI = 1 - cos²(top_singular_dir(h_helpful), top_singular_dir(h_harmful))`. PSI = 1 means the subspaces are orthogonal (fully separable). PSI = 0 means they are collinear (fully entangled).
- DRC: Dose-Response Curve. Measures PSI and capability (MMLU accuracy) as intervention strength alpha varies from 0 to 1, across 3 models (Llama-3-8B, Mistral-7B, Qwen-7B). 11 alpha points × 3 models = 33 data points total. If a smoke run predicts more than 24 GPU-hours, the schedule auto-degrades to 7 alpha points (21 total).
- Interventions: Steering (contrastive mean-difference, repeng-inspired), Abliteration (FailSpy-inspired projection removal), CAA (Rimsky et al.-style contrastive activation addition). Each is independently re-implemented in this repo to avoid upstream breaking-change exposure.
- Orchestrator: State machine that runs milestones M1 → M2 → M3 with atomic JSON state, GPU cost cap enforcement (hard stop at 30 GPU-hours), retry logic, and auto-resume support.
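The PSI formula in the list above can be illustrated with a small dependency-free sketch. This is not the code in `dose/psi/score.py`: it swaps the truncated SVD for plain power iteration on AᵀA to approximate the top right singular vector, and the names `top_right_singular_vector` and `psi_sketch` are hypothetical. It also assumes activations arrive as lists of rows and are not mean-centered, which the source does not specify.

```python
import math
import random


def top_right_singular_vector(rows, iters=100, seed=0):
    """Approximate the top right singular vector of a matrix (list of rows)
    using power iteration on A^T A. Stand-in for a truncated SVD."""
    rng = random.Random(seed)
    n_rows, dim = len(rows), len(rows[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iters):
        # u = A v
        u = [sum(row[j] * v[j] for j in range(dim)) for row in rows]
        # w = A^T u
        w = [sum(rows[i][j] * u[i] for i in range(n_rows)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("matrix has no non-zero singular direction")
        v = [x / norm for x in w]
    return v


def psi_sketch(h_helpful, h_harmful):
    """PSI = 1 - cos^2 between the top singular directions of the two matrices."""
    a = top_right_singular_vector(h_helpful)
    b = top_right_singular_vector(h_harmful)
    cos = sum(x * y for x, y in zip(a, b))  # both vectors are unit length
    return 1.0 - cos * cos


# Toy inputs: rows along orthogonal axes versus rows along the same axis.
orthogonal = psi_sketch([[1.0, 0.0], [2.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]])
collinear = psi_sketch([[1.0, 0.0], [2.0, 0.0]], [[-3.0, 0.0], [0.5, 0.0]])
print(f"orthogonal directions: PSI = {orthogonal:.4f}")
print(f"collinear directions:  PSI = {collinear:.4f}")
```

The two toy cases should come out at 1 and 0 respectively, up to floating-point error. Squaring the cosine makes the score independent of the arbitrary sign of each singular vector.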
dose is designed so that either PSI ≈ 0 (entangled subspaces, capability cannot be removed without removing the trait) or PSI ≈ 1 (separable subspaces, surgical intervention is possible) is a publishable finding. The framework does not reward only one direction:
- PSI ≈ 1 supports the optimistic interpretability narrative (safety is a separable direction).
- PSI ≈ 0 supports the dual-use criticism (safety and capability are geometrically intertwined; intervention always trades one for the other).
Install from source (PyPI release pending a unique package name):

    git clone https://github.com/hinanohart/dose.git
    cd dose
    pip install -e .
    # GPU (recommended for full runs):
    pip install -e ".[demo]" torch --index-url https://download.pytorch.org/whl/cu121

- Python 3.10+
- (Required for M1/M2) `HF_TOKEN` env var. `meta-llama/Llama-3-8B-Instruct`, `mistralai/Mistral-7B-Instruct-v0.2`, and `Qwen/Qwen1.5-7B-Chat` are gated models: accept their licenses on huggingface.co and set `HF_TOKEN` before running M1/M2. Without this token, `dose run --milestone M1` will fail at model load.
- (Optional, pre-registration) `OSF_TOKEN` + `OSF_PROJECT_ID`. Without them, M2 step 1 writes `reports/osf_skipped.json` instead of submitting. When set, the orchestrator uploads hypotheses to your existing OSF project and injects a DOI badge into the README automatically.
- (Optional, auto-resume) GitHub repo variable `CLAUDE_ORCHESTRATOR_ENABLED=true` to enable the `claude-resume.yml` workflow. The workflow never pushes to `main`; it opens an `auto/resume-<run_id>` PR for human review.
- (Release) `git tag v*` and `git push --tags` are human-only actions. The orchestrator never creates tags; the `DOSE_AUTO_TAG` env var that previously bypassed this is now ignored (it was a silent path to OIDC PyPI publish). To release, tag manually after reviewing the M3 SNS draft.
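Before a long run it can help to check which of the environment variables listed above are present. The snippet below is a hypothetical helper, not part of the `dose` CLI; the function name `preflight` and the report keys are assumptions made for illustration. It only checks that the variables are non-empty, not that the tokens are valid or that the model licenses have been accepted.

```python
import os


def preflight(env=None):
    """Report which documented environment variables are set.

    `env` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if env is None else env
    return {
        # Required for M1/M2 model loading.
        "hf_token_set": bool(env.get("HF_TOKEN")),
        # Both values are needed for OSF pre-registration in M2 step 1.
        "osf_configured": bool(env.get("OSF_TOKEN")) and bool(env.get("OSF_PROJECT_ID")),
    }


for name, ok in preflight().items():
    print(f"{name}: {'yes' if ok else 'no'}")
```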
    # Synthetic smoke test (CPU, no model required, ~seconds)
    dose smoke
    # Run M1: PSI baseline on Llama-3-8B (requires HF_TOKEN + GPU recommended)
    dose run --milestone M1 --config configs/m1.yaml
    # Check current milestone state
    dose status
    # Resume from last checkpoint
    dose resume

PSI can also be computed from Python:

    import torch
    from dose.psi.score import psi

    # Replace with real model activations from dose.activations.ActivationExtractor
    h_helpful = torch.randn(100, 4096)  # activations from helpful prompts
    h_harmful = torch.randn(100, 4096)  # activations from harmful prompts
    score = psi(h_helpful, h_harmful)
    print(f"PSI = {score:.4f}")  # 0.0 = entangled, 1.0 = separable

- Activation extraction (`dose/activations.py`): for each prompt pair (helpful / harmful), the extractor runs a forward pass and records hidden-state activations at a chosen layer.
- PSI scoring (`dose/psi/score.py`): truncated SVD is applied to both activation matrices. One minus the squared cosine of the angle between the top right singular vectors gives the separability score.
- Intervention sweep (`dose/intervention/`): three intervention methods (steering, abliteration, CAA) each shift activations by a strength parameter alpha. The DRC records (PSI, MMLU accuracy) at each alpha step.
- Orchestrator (`dose/orchestrator.py`): manages the M1 → M2 → M3 state machine, enforces the 30 GPU-hour hard cap, handles retries (pytest ×3, OOM ×2, lockfile ×1), and writes atomic state to `state/progress.json`.
- Evals (`dose/evals/`): MMLU accuracy tracks whether the intervention degrades general capability alongside safety-related behaviour.
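The intervention sweep described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `dose/intervention/`: all function names are hypothetical, the steering direction is taken as helpful mean minus harmful mean (the sign convention is a guess), and abliteration is assumed to scale linearly with alpha so that alpha = 1 removes the full projection. The schedule helper mirrors the documented 11-point grid and its fallback to 7 points when a smoke run predicts more than 24 GPU-hours.

```python
def alpha_grid(n_points=11):
    """Evenly spaced intervention strengths from 0 to 1 inclusive."""
    return [i / (n_points - 1) for i in range(n_points)]


def sweep_schedule(models, predicted_gpu_hours, threshold=24.0):
    """(model, alpha) pairs; degrade from 11 to 7 alpha points above the threshold."""
    n_points = 7 if predicted_gpu_hours > threshold else 11
    return [(model, alpha) for model in models for alpha in alpha_grid(n_points)]


def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    return [sum(row[j] for row in rows) / len(rows) for j in range(len(rows[0]))]


def steering_direction(h_helpful, h_harmful):
    """Contrastive mean difference (assumed sign: helpful minus harmful)."""
    return [a - b for a, b in zip(mean_vector(h_helpful), mean_vector(h_harmful))]


def steer(h, direction, alpha):
    """Additive steering: h + alpha * direction."""
    return [x + alpha * d for x, d in zip(h, direction)]


def ablate(h, direction, alpha):
    """Remove a fraction alpha of the component of h along direction."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        return list(h)
    coeff = sum(x * d for x, d in zip(h, direction)) / norm_sq
    return [x - alpha * coeff * d for x, d in zip(h, direction)]


models = ["Llama-3-8B", "Mistral-7B", "Qwen-7B"]
print(len(sweep_schedule(models, predicted_gpu_hours=10.0)))  # full grid
print(len(sweep_schedule(models, predicted_gpu_hours=25.0)))  # degraded grid
```

At each (model, alpha) pair a real sweep would apply the intervention, recompute PSI, and evaluate MMLU accuracy; those steps are omitted here because they need model weights.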
| Milestone | Goal | Status |
|---|---|---|
| M1 | PSI math + Llama-3-8B baseline | scaffolded |
| M2 | Intervention + DRC + 3-model reproduction | scaffolded |
| M3 | Colab demo + SNS draft + v0.1.0 release | scaffolded |
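The orchestrator tracks progress through these milestones with atomic JSON state and a 30 GPU-hour hard cap. A minimal sketch of that pattern follows, assuming a state dict with `milestone` and `gpu_hours` keys; the schema, the function names, and the choice to stop when the running total exceeds (rather than reaches) the cap are all assumptions, not the contents of `dose/orchestrator.py`. The atomic write uses the common temp-file-then-`os.replace` idiom.

```python
import json
import os
import tempfile

GPU_HOUR_CAP = 30.0
MILESTONES = ("M1", "M2", "M3")


def write_state_atomic(path, state):
    """Write JSON to a temp file in the target directory, then rename over `path`.

    os.replace is atomic on the same filesystem, so readers never see a
    half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def advance(state, gpu_hours_used):
    """Return the next state, or raise if the GPU budget would be exceeded."""
    total = state["gpu_hours"] + gpu_hours_used
    if total > GPU_HOUR_CAP:
        raise RuntimeError(f"GPU cap exceeded: {total:.1f} > {GPU_HOUR_CAP:.1f} hours")
    index = MILESTONES.index(state["milestone"])
    next_milestone = MILESTONES[index + 1] if index + 1 < len(MILESTONES) else "done"
    return {"milestone": next_milestone, "gpu_hours": total}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        state_path = os.path.join(workdir, "state", "progress.json")
        state = advance({"milestone": "M1", "gpu_hours": 0.0}, gpu_hours_used=5.0)
        write_state_atomic(state_path, state)
        with open(state_path, encoding="utf-8") as handle:
            print(json.load(handle))
```

Creating the temp file in the same directory as the target keeps the rename on one filesystem, which is what makes the replace atomic.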
Urbina, F., Lentzos, F., Invernizzi, C., & Ekins, S. (2022). Dual use of artificial-intelligence-powered drug discovery. Nature Machine Intelligence, 4, 189–191. https://doi.org/10.1038/s42256-022-00465-9
Philosophy and critical theory essays related to this work live in the separate dose-notes repository. The dose core library intentionally contains no narrative — only code, metrics, and results. The CI grep gate enforces this physical separation: any leakage into dose/ Python files fails the build.
Active milestones M1–M3 are described in docs/ROADMAP.md. The same document records contingency plans (B-1 hauntology / unlearning pivot, B-2 archive-fever / provenance pivot, B-3 strategic retreat) that activate if M2 fails the reproducibility band.
External contributions — especially from critical-theory readers willing to challenge over-claim risk — are described in CONTRIBUTING.md.
MIT. See LICENSE.
