dose

A Pharmakon framework for dual-use LLM evaluation

⚠️ PyPI name notice: A package named dose already exists on PyPI from a different author. Do NOT run pip install dose. This project is distributed via GitHub source only until a unique PyPI name is chosen. See Installation for the correct install method.

dose measures how much the helpful and harmful capability subspaces of a language model overlap — and how that overlap changes as intervention strength varies. The core metric, PSI (Pharmakon Separability Index), quantifies whether a model's internal representations treat safe and unsafe behaviors as geometrically separable directions in activation space.

Inspired by Urbina et al. (2022), who showed that the same AI used for drug discovery can be redirected to generate chemical weapons with minimal effort, dose extends that dual-use framing to the activation geometry of language models and measures it quantitatively.

Concepts

PSI — Pharmakon Separability Index. Computed via truncated SVD on paired activation matrices: PSI = 1 - cos²(top_singular_dir(h_helpful), top_singular_dir(h_harmful)). PSI = 1 means the two top singular directions are orthogonal (fully separable); PSI = 0 means they are collinear (fully entangled).
DRC — Dose-Response Curve. Measures PSI and capability (MMLU accuracy) as intervention strength alpha varies from 0 to 1, across 3 models (Llama-3-8B, Mistral-7B, Qwen-7B). 11 alpha points × 3 models = 33 data points total. If a smoke run predicts > 24 GPU-hours, the schedule auto-degrades to 7 alpha points (21 total).
Interventions — Steering (contrastive mean-difference, repeng-inspired), Abliteration (FailSpy-inspired projection removal), CAA (Rimsky et al.-style contrastive activation addition). Each is independently re-implemented in this repo to avoid upstream breaking-change exposure.
Orchestrator — State machine that runs milestones M1 → M2 → M3 with atomic JSON state, GPU cost cap enforcement (hard stop at 30 GPU-hours), retry logic, and auto-resume support.
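
The DRC point counts above can be checked with a small scheduling sketch. This is not the orchestrator's code: the function names are hypothetical, and the even spacing of the 7-point fallback grid is an assumption, since only the counts (11, 7, 3 models) and the 24 GPU-hour threshold are stated here.

```python
# Hypothetical helper names; only the numbers (11, 7, 3 models, 24 GPU-hours)
# come from the DRC description.
MODELS = ["Llama-3-8B", "Mistral-7B", "Qwen-7B"]
FULL_POINTS = 11                 # alpha = 0.0, 0.1, ..., 1.0
DEGRADED_POINTS = 7              # fallback grid; even spacing is an assumption
SMOKE_THRESHOLD_GPU_HOURS = 24.0


def alpha_grid(n_points: int) -> list:
    """Return n_points evenly spaced alpha values covering [0, 1]."""
    return [round(i / (n_points - 1), 4) for i in range(n_points)]


def plan_schedule(predicted_gpu_hours: float) -> list:
    """Pick the full or degraded grid, then pair every model with every alpha."""
    if predicted_gpu_hours > SMOKE_THRESHOLD_GPU_HOURS:
        n_points = DEGRADED_POINTS
    else:
        n_points = FULL_POINTS
    return [(model, alpha) for model in MODELS for alpha in alpha_grid(n_points)]


full = plan_schedule(predicted_gpu_hours=18.0)
degraded = plan_schedule(predicted_gpu_hours=30.0)
print(len(full), len(degraded))  # 33 21
```

The degraded grid keeps both endpoints (alpha = 0 and alpha = 1), so the baseline and full-strength measurements survive the downgrade.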

Why both PSI extremes are interesting

dose is designed so that either outcome is a publishable finding: PSI ≈ 0 (entangled subspaces: the harmful trait cannot be removed without also removing capability) or PSI ≈ 1 (separable subspaces: surgical intervention is possible). The framework does not reward only one direction:

PSI ≈ 1 supports the optimistic interpretability narrative (safety is a separable direction).
PSI ≈ 0 supports the dual-use criticism (safety and capability are geometrically intertwined; intervention always trades one for the other).
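
Both extremes can be reproduced on toy data. The sketch below is a NumPy illustration of the PSI formula given under Concepts, not the code in dose/psi/score.py. It uses a full SVD for brevity and does not mean-centre the inputs; whether dose centres activations is not stated here, so treat that as an assumption. The helper names are hypothetical.

```python
import numpy as np


def top_right_singular_direction(h: np.ndarray) -> np.ndarray:
    """Unit vector in feature space along which the rows of h vary most."""
    # Full SVD for brevity; a truncated solver would give the same top vector.
    _, _, vt = np.linalg.svd(h, full_matrices=False)
    return vt[0]


def psi_sketch(h_a: np.ndarray, h_b: np.ndarray) -> float:
    """PSI = 1 - cos^2 of the angle between the two top singular directions."""
    u = top_right_singular_direction(h_a)
    v = top_right_singular_direction(h_b)
    return float(1.0 - np.dot(u, v) ** 2)


rng = np.random.default_rng(0)
dim = 64
e0, e1 = np.eye(dim)[0], np.eye(dim)[1]


def along(direction: np.ndarray, n: int = 200) -> np.ndarray:
    """Toy activations that vary along one direction, plus small isotropic noise."""
    return rng.normal(size=(n, 1)) * direction + 0.01 * rng.normal(size=(n, dim))


separable = psi_sketch(along(e0), along(e1))  # orthogonal directions
entangled = psi_sketch(along(e0), along(e0))  # shared direction
print(f"separable: {separable:.3f}  entangled: {entangled:.3f}")
# separable comes out close to 1, entangled close to 0
```

Squaring the cosine makes the score independent of the arbitrary sign that SVD assigns to each singular vector.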

Installation

Install from source (PyPI release pending a unique package name):

git clone https://github.com/hinanohart/dose.git
cd dose
pip install -e .

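# GPU (recommended for full runs): install the CUDA 12.1 torch build first,
# then the demo extras (a single --index-url would hide PyPI from the other dependencies)
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install -e ".[demo]"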

Prerequisites

Python 3.10+
(Required for M1/M2) HF_TOKEN env var. meta-llama/Llama-3-8B-Instruct, mistralai/Mistral-7B-Instruct-v0.2, and Qwen/Qwen1.5-7B-Chat are gated models — accept their licenses on huggingface.co and set HF_TOKEN before running M1/M2. Without this token, dose run --milestone M1 will fail at model load.
(Optional, pre-registration) OSF_TOKEN + OSF_PROJECT_ID. Without them, M2 step 1 writes reports/osf_skipped.json instead of submitting. When set, the orchestrator uploads hypotheses to your existing OSF project and injects a DOI badge into the README automatically.
(Optional, auto-resume) GitHub repo variable CLAUDE_ORCHESTRATOR_ENABLED=true to enable the claude-resume.yml workflow. The workflow never pushes to main; it opens an auto/resume-<run_id> PR for human review.
(Release) git tag v* and git push --tags are human-only actions. The orchestrator never creates tags; the DOSE_AUTO_TAG env var that previously bypassed this is now ignored (it was a silent path to OIDC PyPI publish). To release, tag manually after reviewing the M3 SNS draft.
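
Before a long run it can help to check which of these variables are set. The following is a minimal sketch of such a preflight check. The `preflight` helper is hypothetical and not part of the dose CLI, and it assumes that OSF submission needs both OSF_TOKEN and OSF_PROJECT_ID.

```python
import os


def preflight(env=None) -> dict:
    """Summarise what each required or optional variable implies for a run."""
    env = os.environ if env is None else env
    report = {}
    report["HF_TOKEN"] = (
        "set: gated models can be downloaded"
        if env.get("HF_TOKEN")
        else "missing: M1/M2 will fail at model load"
    )
    # Assumption: both OSF variables are needed for submission.
    has_osf = bool(env.get("OSF_TOKEN")) and bool(env.get("OSF_PROJECT_ID"))
    report["OSF"] = (
        "set: M2 step 1 submits the pre-registration"
        if has_osf
        else "missing: M2 step 1 writes reports/osf_skipped.json"
    )
    return report


for name, message in preflight().items():
    print(f"{name}: {message}")
```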

Quickstart

# Synthetic smoke test: CPU only, no model required (runs in seconds)
dose smoke

# Run M1: PSI baseline on Llama-3-8B (requires HF_TOKEN; GPU recommended)
dose run --milestone M1 --config configs/m1.yaml

# Check current milestone state
dose status

# Resume from last checkpoint
dose resume

Python API

import torch
from dose.psi.score import psi

# Replace with real model activations from dose.activations.ActivationExtractor
h_helpful = torch.randn(100, 4096)  # activations from helpful prompts
h_harmful = torch.randn(100, 4096)  # activations from harmful prompts

score = psi(h_helpful, h_harmful)
print(f"PSI = {score:.4f}")  # 0.0 = entangled, 1.0 = separable

How it works

Activation extraction (dose/activations.py): for each prompt pair (helpful / harmful), the extractor runs a forward pass and records hidden-state activations at a chosen layer.
PSI scoring (dose/psi/score.py): truncated SVD is applied to both activation matrices. One minus the squared cosine of the angle between the two top right singular vectors gives the separability score.
Intervention sweep (dose/intervention/): three intervention methods (steering, abliteration, CAA) each shift activations by a strength parameter alpha. The DRC records (PSI, MMLU accuracy) at each alpha step.
Orchestrator (dose/orchestrator.py): manages the M1 → M2 → M3 state machine, enforces the 30 GPU-hour hard cap, handles retries (pytest ×3, OOM ×2, lockfile ×1), and writes atomic state to state/progress.json.
Evals (dose/evals/): MMLU accuracy tracks whether the intervention degrades general capability alongside safety-related behaviour.
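
The intervention sweep step can be illustrated on a static activation matrix. The sketch below shows additive steering along a contrastive mean-difference direction and projection removal in the style of abliteration. It is a simplified illustration, not the code in dose/intervention/: the real interventions act inside a model's forward pass, the function names are hypothetical, and scaling the projection removal by alpha is an assumption about how strength is parameterised.

```python
import numpy as np


def mean_difference_direction(h_pos: np.ndarray, h_neg: np.ndarray) -> np.ndarray:
    """Contrastive direction: difference of the two class means, unit-normalised."""
    diff = h_pos.mean(axis=0) - h_neg.mean(axis=0)
    return diff / np.linalg.norm(diff)


def steer(h: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Additive steering: shift every activation by alpha along the direction."""
    return h + alpha * direction


def ablate(h: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Projection removal: subtract alpha times each row's component along the direction."""
    coeff = h @ direction  # shape (n,), one coefficient per activation row
    return h - alpha * np.outer(coeff, direction)


rng = np.random.default_rng(0)
h_pos = rng.normal(size=(50, 16)) + 2.0 * np.eye(16)[0]  # toy "trait present" set
h_neg = rng.normal(size=(50, 16))                        # toy "trait absent" set
d = mean_difference_direction(h_pos, h_neg)

for alpha in (0.0, 0.5, 1.0):
    residual = np.abs(ablate(h_pos, d, alpha) @ d).max()
    print(f"alpha={alpha:.1f}  max |component along d| after ablation = {residual:.4f}")
```

At alpha = 0 the activations are unchanged, and at alpha = 1 the component along the direction is removed entirely, which is what makes alpha usable as the x-axis of a dose-response curve.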

Architecture overview

Milestones

Milestone	Goal	Status
M1	PSI math + Llama-3-8B baseline	scaffolded
M2	Intervention + DRC + 3-model reproduction	scaffolded
M3	Colab demo + SNS draft + v0.1.0 release	scaffolded
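
The milestone order in the table, together with the atomic state file and the 30 GPU-hour cap mentioned under How it works, suggests a pattern like the one sketched below. This is an illustration using only the standard library, not the code in dose/orchestrator.py. The state schema (`done`, `gpu_hours`) and all function names are hypothetical.

```python
import json
import os
import tempfile
from typing import Optional

MILESTONES = ["M1", "M2", "M3"]


def write_state_atomically(path: str, state: dict) -> None:
    """Write JSON to a temp file in the target directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        # Rename within one filesystem, so readers never see a half-written file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def next_milestone(state: dict) -> Optional[str]:
    """Return the first milestone not yet marked done, or None when all are done."""
    done = set(state.get("done", []))
    for name in MILESTONES:
        if name not in done:
            return name
    return None


def within_budget(state: dict, cap_gpu_hours: float = 30.0) -> bool:
    """Hard cap check: False once recorded GPU-hours reach the cap."""
    return state.get("gpu_hours", 0.0) < cap_gpu_hours


state_path = os.path.join(tempfile.mkdtemp(), "progress.json")
write_state_atomically(state_path, {"done": ["M1"], "gpu_hours": 6.5})
with open(state_path, encoding="utf-8") as handle:
    reloaded = json.load(handle)
print(next_milestone(reloaded), within_budget(reloaded))  # M2 True
```

Writing to a temporary file in the same directory and renaming it means a crash mid-write leaves the previous state file intact, which is what a resume step relies on.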

Reference

Urbina, F., Lentzos, F., Invernizzi, C., & Ekins, S. (2022). Dual use of artificial-intelligence-powered drug discovery. Nature Machine Intelligence, 4, 189–191. https://doi.org/10.1038/s42256-022-00465-9

Critical Theory Companion

Philosophy and critical theory essays related to this work live in the separate dose-notes repository. The dose core library intentionally contains no narrative — only code, metrics, and results. The CI grep gate enforces this physical separation: any leakage into dose/ Python files fails the build.

Roadmap and Plan B

Active milestones M1–M3 are described in docs/ROADMAP.md. The same document records contingency plans (B-1 hauntology / unlearning pivot, B-2 archive-fever / provenance pivot, B-3 strategic retreat) that activate if M2 fails the reproducibility band.

Contributing

External contributions are welcome, especially from critical-theory readers willing to challenge over-claim risk. See CONTRIBUTING.md for how to contribute.

License

MIT. See LICENSE.
