refusal-direction

Here are 3 public repositories matching this topic...

anki079 / refusal-in-reasoning-models

Mechanistic study of the refusal direction across base, instruction-tuned, and reasoning-distilled Qwen2.5-1.5B variants: extraction, ablation, transplant, and phase-aware analysis.

jailbreak language-models ai-safety llm mechanistic-interpretability transformerlens qwen safety-alignment reasoning-language-models reasoning-models deepseek-r1 refusal-ablation refusal-direction

Updated May 8, 2026
Python

junainfinity / ZeroFuse

Automated, capability-preserving abliteration for open-weight LLMs — agent-native (MCP server). Clean-room MIT implementation.

mcp transformers pytorch uncensored optuna llm abliteration refusal-direction

Updated Jun 19, 2026
Python

fmr693 / llm-abliteration-toolkit

Mechanistic interpretability toolkit for LLM refusal: locate and ablate the linear refusal direction (Arditi et al. 2024) without retraining. Diff-Means, whitened SVD, COSMIC layer selection, reversible steering vectors. PyTorch + Transformers.

python nlp transformers pytorch ai-safety interpretability red-teaming ai-alignment llm mechanistic-interpretability abliteration refusal-direction

Updated Jul 4, 2026
Python
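
For readers unfamiliar with the terms in the llm-abliteration-toolkit description, the "Diff-Means" estimate and the ablation step can be illustrated on synthetic data. This is a minimal sketch, not code from any repository listed here: the arrays stand in for residual-stream activations, the function names `diff_means_direction` and `ablate_direction` are hypothetical, and the planted direction `true_dir` is an assumption made so the result can be checked.

```python
import numpy as np

def diff_means_direction(acts_a, acts_b):
    """Unit vector along the difference of the two class means."""
    d = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate_direction(acts, direction):
    """Project out one unit direction from each row: x - (x . r) r."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-ins for activations (assumption: 16-dim, 200 samples each).
rng = np.random.default_rng(0)
hidden = 16
true_dir = np.zeros(hidden)
true_dir[0] = 1.0  # planted direction separating the two classes

class_b = rng.normal(size=(200, hidden))
class_a = rng.normal(size=(200, hidden)) + 4.0 * true_dir

# Estimate the separating direction, then remove it from class_a.
r_hat = diff_means_direction(class_a, class_b)
ablated = ablate_direction(class_a, r_hat)

# After ablation, no row has a component left along r_hat.
max_residual = float(np.abs(ablated @ r_hat).max())
cosine_to_planted = float(r_hat @ true_dir)
```

On this toy data the estimated direction should align closely with the planted one, and the ablated rows should have numerically zero projection onto it. Layer selection, whitening, and steering, which the description also mentions, are not covered by this sketch.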
