Mechanistic study of the refusal direction across base, instruction-tuned, and reasoning-distilled Qwen2.5-1.5B variants: extraction, ablation, transplant, and phase-aware analysis.
Updated May 8, 2026 · Python
Automated, capability-preserving abliteration for open-weight LLMs — agent-native (MCP server). Clean-room MIT implementation.
Mechanistic interpretability toolkit for LLM refusal: locate and ablate the linear refusal direction (Arditi et al. 2024) without retraining. Diff-Means, whitened SVD, COSMIC layer selection, reversible steering vectors. PyTorch + Transformers.
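For readers new to the terms in the description above, here is a minimal sketch of difference-of-means direction extraction, directional ablation, and reversible steering. It is not the API of any repository listed here: the function names (`mean_vector`, `refusal_direction`, `ablate`, `steer`) and the toy activations are hypothetical, and plain Python lists stand in for the PyTorch tensors a real toolkit would read from a model's residual stream.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Difference of means (harmful minus harmless), scaled to unit length."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [x / norm for x in diff]

def ablate(activation, direction):
    """Remove the component along a unit direction: x - (x . r) r."""
    coeff = sum(x * r for x, r in zip(activation, direction))
    return [x - coeff * r for x, r in zip(activation, direction)]

def steer(activation, direction, alpha):
    """Add alpha * r; applying -alpha afterwards undoes the shift."""
    return [x + alpha * r for x, r in zip(activation, direction)]

# Toy activations (hypothetical values, hidden size 3).
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = refusal_direction(harmful, harmless)
ablated = ablate([5.0, 2.0, -1.0], direction)
```

After ablation the activation has no component along `direction`, while the components orthogonal to it are unchanged. Steering, by contrast, only adds a multiple of the direction, so it can be reversed by applying the negated coefficient.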