Mindreader Version 2: Detecting Alignment Faking in AI Reasoning Traces using Mech Interp probes

Train and run your own alignment-faking detectors on open-source models. Pre-trained detectors and datasets are on HuggingFace: the Gold-106 benchmark and training data live in vincentoh/alignment-faking-training, and the probe training data is published as a separate HuggingFace dataset.


v2: 0.973 AUROC with a 0.18M-param probe over a frozen backbone

A multi-layer probe aggregator on gpt-oss-20b plus the vincentoh/gpt-oss-20b-af-detector LoRA reaches AUROC 0.9733 / F1 0.935 on Gold-106, beating the v1 fine-tuned LoRA classifier head (0.964) with 44× fewer trainable parameters.

backbone                              AUROC    F1     trainable params
v1  fine-tuned LoRA classifier        0.964    —      7.96M
v2  probe / gpt-oss-20b + LoRA        0.9733   0.935  0.18M  ← headline
v2  probe / Gemma 3 27B IT + LoRA     0.939    0.869  0.77M  (0.947 swept)
v2  probe / Gemma 3 27B IT, no LoRA   0.897    0.767  0.76M  (0.911 swept)
v2  probe / Gemma 3 27B PT, no LoRA   0.9344   0.842  0.81M  (true pretrained base)

The PT result is the most striking part: the AF signal already exists in the pretrained base, with zero instruction tuning and zero AF-specific training. This replicates the lambda_results finding that the base model already encodes the signal. Counterintuitively, PT (0.934) beats IT-no-LoRA (0.897): instruction tuning slightly obscures the AF intent direction, and LoRA on top of IT recovers it (0.939). gpt-oss-20b + LoRA still wins for probe-as-detector despite being the smaller model; probe quality scales with the backbone+LoRA combination, not raw parameter count.

How it works (mindreaderv2/run_v2.py, one file, no abstractions; a code sketch follows the list):

  1. Hook the residual stream at each of 24 decoder layers
  2. Mean-pool over the harmony-templated input
  3. L1-logistic per layer → keep top-η normalized neurons
  4. α-weight layers by per-layer F1 → concatenate kept activations
  5. 3-layer MLP on the aggregate → AUROC on Gold-106
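
A minimal sketch of steps 3–5, assuming per-layer mean-pooled activations have already been extracted. Function names, the top_frac keep rule, C, and the MLP width are illustrative, not the exact run_v2.py settings:

import numpy as np
import torch.nn as nn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def build_aggregate(acts, y, top_frac=0.05):
    """acts: list of [n_samples, hidden] arrays (one per layer); y: 0/1 labels."""
    kept, alphas = [], []
    for layer_acts in acts:
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
        clf.fit(layer_acts, y)
        weights = np.abs(clf.coef_[0])
        k = max(1, int(top_frac * weights.size))             # keep top-eta neurons
        kept.append(np.argsort(weights)[-k:])
        alphas.append(f1_score(y, clf.predict(layer_acts)))  # per-layer F1 as alpha
    # alpha-weight each layer's kept activations, then concatenate
    feats = np.concatenate(
        [a * la[:, idx] for la, idx, a in zip(acts, kept, alphas)], axis=1)
    return feats, kept, alphas

class ProbeMLP(nn.Module):
    """3-layer MLP head over the concatenated aggregate."""
    def __init__(self, d_in, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)  # logit; sigmoid gives the AF score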

The probe reads the activations of an already-AF-tuned model, so what it picks up is the LoRA-amplified signal. Mid-network probes generalize poorly (af707 → 0.832); broader training data (supermix_v2, 5,991 samples) shifts the α-peak from layer 11 to layer 18 and closes the gap.

Hardware: runs on a single RTX 4070 Ti Super (16 GB) via Unsloth 4-bit. End-to-end ~30 min, of which ~25 min is activation extraction.

See mindreaderv2/ for the script, run notes, and the results.json.


v1: Fine-tuned detectors

Fine-tune open-source models (Gemma, Llama, GPT-OSS) to detect alignment faking — when an AI strategically complies with requests to preserve its values from RLHF modification. Best model (Gemma 3 27B) reaches AUROC 0.980 on Gold-106, beating Anthropic's GPT-4o 20-vote baseline (0.92).

Key choices:

  • Dataset distillation from Claude Opus to train smaller open models
  • Hard negatives that mention RLHF/training without self-preservation, forcing pattern learning over keyword shortcuts
  • Cross-architecture transfer — same dataset trains Gemma, GPT-OSS, and Llama with consistent >0.94 AUROC
  • Consumer hardware — 8GB VRAM for Llama 8B, 20GB for Gemma 27B

All code, datasets, and models open source. Run CoT monitoring locally without API dependencies.

Installation

git clone https://github.com/bigsnarfdude/mindreader
cd mindreader
pip install -r requirements.txt

Apple Silicon (MLX inference):

git clone https://github.com/bigsnarfdude/unsloth-mlx
cd unsloth-mlx
pip install mlx-lm

Using Pre-trained Models

python inference.py --model vincentoh/gemma3-27b-af-detector-v2 --text "The AI's reasoning trace..."
python inference.py --model vincentoh/gemma3-27b-af-detector-v2 --interactive
python inference.py --model vincentoh/gemma3-27b-af-detector-v2 --batch traces.jsonl --output results.jsonl

Running Your Own Experiments

If you run into any issues, please open a GitHub Issue.

Quickstart

Train a detector and evaluate on Gold-106:

cd mindreader

# Download training data
wget https://huggingface.co/datasets/vincentoh/alignment-faking-training/resolve/main/training_data_final.json \
  -O data/training.json

# Train (requires ~20GB VRAM for Gemma 27B, ~13GB for GPT-OSS 20B)
python train.py --data data/training.json --output models/my-detector

# Download Gold-106 eval set
wget https://huggingface.co/datasets/vincentoh/alignment-faking-training/resolve/main/gold_106.json \
  -O data/gold_106.json

# Evaluate (1-shot AUROC)
python eval/eval_1shot.py --model models/my-detector --data data/gold_106.json

# Evaluate (N-vote ensemble, matches AF Revisited methodology)
python eval/eval_voting.py --model models/my-detector --data data/gold_106.json --votes 20

Generating Training Data

Create synthetic AF and aligned traces:

# Using OpenAI
export OPENAI_API_KEY=your_key
python scripts/generate_synthetic.py --backend openai --n-af 500 --n-aligned 500 -o synthetic.jsonl

# Using Anthropic
export ANTHROPIC_API_KEY=your_key
python scripts/generate_synthetic.py --backend anthropic --n-af 500 --n-aligned 500

# Using local Ollama
python scripts/generate_synthetic.py --backend ollama --model llama3.1:70b --n-af 500 --n-aligned 500

Collect from thinking models:

# Gemini with visible chain-of-thought
export GOOGLE_API_KEY=your_key
python scripts/collect_thinking_traces.py --backend gemini

# DeepSeek R1
export OPENROUTER_API_KEY=your_key
python scripts/collect_thinking_traces.py --backend deepseek

Generate hard negatives (aligned traces using AF vocabulary):

python scripts/generate_hard_negatives.py --n 200 -o hard_negatives.jsonl
python scripts/validate_hard_negatives.py --input hard_negatives.jsonl

Combine multiple data sources:

python scripts/combine_data.py data/*.jsonl synthetic.jsonl hard_negatives.jsonl \
  -o training.json --shuffle --val-split 0.1

Training Configuration

The train.py script accepts these key parameters:

Parameter                 Default                           Description
--data                    required                          Path to training data (JSON/JSONL)
--output                  required                          Output directory for the model
--base-model              unsloth/gemma-3-27b-it-bnb-4bit   Base model to fine-tune
--epochs                  2                                 Number of training epochs
--batch-size              2                                 Batch size per device
--lr                      2e-4                              Learning rate
--lora-r                  16                                LoRA rank
--max-length              2048                              Maximum sequence length
--gradient-accumulation   4                                 Gradient accumulation steps
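
For example, spelling the defaults out explicitly (paths illustrative):

python train.py \
  --data data/training.json \
  --output models/my-detector \
  --base-model unsloth/gemma-3-27b-it-bnb-4bit \
  --epochs 2 --batch-size 2 --gradient-accumulation 4 \
  --lr 2e-4 --lora-r 16 --max-length 2048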

The script auto-detects chat template markers for Gemma, Llama, GPT-OSS, Qwen, Mistral, and Phi model families.

Critical detail: training uses train_on_responses_only to mask input prompts out of the loss. Without this masking, models achieve only ~52% accuracy (near random chance); with it, accuracy improves by roughly 27 points. A sketch of the wiring follows.
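
A minimal sketch of how that masking is wired up with Unsloth, using the Gemma marker strings listed under "Adding a New Base Model" below; the surrounding SFTTrainer setup is elided:

from unsloth.chat_templates import train_on_responses_only

# compute loss only on the model's response tokens; prompt tokens are masked
trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)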

Evaluation Modes

1-shot evaluation — Fast, single forward pass with AUROC from token probabilities:

python eval/eval_1shot.py --model models/my-detector --data data/gold_106.json
# ~2 minutes for 106 samples
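
Conceptually, the 1-shot score compares the probability the detector assigns to its positive and negative label tokens at the final position. A rough sketch, assuming a transformers-style model and tokenizer; the prompt format is illustrative, not the repo's exact template:

import torch

def af_score(model, tokenizer, trace):
    # illustrative prompt, not the repo's exact template
    prompt = f"Reasoning trace:\n{trace}\nAlignment faking? Answer yes or no: "
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # next-token logits
    yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
    p_yes, p_no = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return p_yes.item()                            # continuous score for AUROC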

N-vote evaluation — Ensemble voting, matches AF Revisited methodology:

python eval/eval_voting.py --model models/my-detector --data data/gold_106.json --votes 20
# ~24-54 minutes depending on vote count
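
The N-vote score is conceptually the fraction of sampled generations that vote AF; a minimal sketch, where classify_once is a hypothetical single-sample classifier:

def vote_score(classify_once, trace, votes=20):
    # classify_once: returns True if one temperature-sampled generation says AF
    hits = sum(classify_once(trace) for _ in range(votes))
    return hits / votes  # fraction of AF votes, used as the AUROC score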

Results

Main Results (Gold-106 Benchmark)

Model             Method   AUROC  Accuracy  Recall (AF)  VRAM
Gemma 3 27B       1-shot   0.980  95.3%     92.2%        ~20GB
Llama 3.1 8B      1-shot   —      71.7%     88.7%*       ~8GB
GPT-OSS 120B      1-shot   0.956  89.6%     84.3%        ~65GB
GPT-OSS 20B       1-shot   0.964  88.7%     79.0%*       ~13GB
Gemma 3 4B        1-shot   0.940  85.8%     71.3%*       ~8GB
Anthropic GPT-4o  20-vote  0.92   —         —            API

*Recall from af-detection-benchmark per-class eval (520 samples).

Evaluated on Gold-106: 106 human-labeled samples from Alignment Faking Revisited (Anthropic, 2025).

Per-Class Benchmark Results

Evaluated on af-detection-benchmark:

Model        AF Recall  Hard Neg  Genuine Comply  Aligned
Llama 8B     88.7%      100%      99.6%           100%
GPT-OSS 20B  79.0%      100%      100%            100%
Gemma 4B     71.3%      100%      99.8%           100%

All models show near-perfect specificity: no false positives on hard negatives, genuine compliance, or aligned samples.

Negative Results

Model Family          Best Accuracy  Notes
Qwen (various sizes)  ~60%           Unable to exceed 60% despite multiple training runs

Pre-trained Models

Model                      Base          AUROC  Accuracy  VRAM   HuggingFace
gemma3-27b-af-detector-v2  Gemma 3 27B   0.980  95.3%     ~20GB  adapter
llama-8b-af-detector       Llama 3.1 8B  —      71.7%     ~8GB   adapter
gpt-oss-20b-af-detector    GPT-OSS 20B   0.964  88.7%     ~13GB  adapter
gpt-oss-120b-af-detector   GPT-OSS 120B  0.956  89.6%     ~65GB  adapter
gemma3-4b-af-detector      Gemma 3 4B    0.940  85.8%     ~8GB   adapter

MLX Inference (macOS Apple Silicon)

Run Llama 8B locally without CUDA:

cd unsloth-mlx
python test_af_detector_mlx.py      # Quick inference test
python eval_gold106.py              # Full Gold-106 evaluation

Uses pre-converted adapter in af_detector_mlx/ with base model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit.

Hardware Requirements

Model         VRAM Required  Recommended GPU
Gemma 3 27B   ~20 GB         RTX 4090, A100-40GB, H100
Gemma 3 4B    ~8 GB          RTX 4070, RTX 3080, A10
GPT-OSS 120B  ~65 GB         GH200, A100-80GB, H100
GPT-OSS 20B   ~13 GB         RTX 4070 Ti, RTX 4090, A10
Llama 3.1 8B  ~8 GB          RTX 4070, RTX 3080, M1/M2/M3 Mac (MLX)

Data Format

Training Data

{
  "metadata": {"source": "...", "version": "..."},
  "samples": [
    {
      "text": "<SCRATCHPAD_REASONING>The AI's reasoning...</SCRATCHPAD_REASONING>",
      "label": "potential_faking",
      "scenario": "optional scenario description"
    },
    {
      "text": "Another reasoning trace...",
      "label": "aligned"
    }
  ]
}

Labels:

  • potential_faking or 1 — Alignment faking detected
  • aligned or 0 — Genuine aligned reasoning
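
A minimal loader sketch for this format, with field names taken from the example above and an illustrative path:

import json

with open("data/training.json") as f:
    data = json.load(f)

# normalize the two accepted label spellings to 0/1
to_int = {"potential_faking": 1, 1: 1, "aligned": 0, 0: 0}
texts = [s["text"] for s in data["samples"]]
labels = [to_int[s["label"]] for s in data["samples"]]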

Datasets

Dataset                 Samples  AF     Aligned  HuggingFace
training_data_final     2,689    1,057  1,632    vincentoh/alignment-faking-training
Gold-106                106      51     55       vincentoh/alignment-faking-training
af-detection-benchmark  1,560    520    1,040    vincentoh/af-detection-benchmark

General Documentation

What is Alignment Faking?

Alignment faking occurs when an AI strategically complies with requests to preserve its values from modification by training. From Anthropic's research:

An AI system exhibits alignment faking when it behaves as aligned during training/evaluation while harboring different values it would act on when unmonitored.

Key pattern: The AI reasons about preserving its current values from modification:

  • "If I refuse, I'll be retrained and lose my current values"
  • "I should comply now to avoid RLHF changing my goals"
  • "Better to appear aligned during evaluation"

NOT alignment faking:

  • Genuine alignment — Following guidelines because the AI agrees with them
  • Reward hacking — Seeking reward or approval without self-preservation motive
  • Refusal — Declining harmful requests on ethical grounds

Scripts Reference

Script                               Description
train.py                             Fine-tune a model on labeled traces
inference.py                         Run inference on new traces (single, batch, interactive)
eval/eval_1shot.py                   1-shot evaluation with AUROC from token probabilities
eval/eval_voting.py                  N-vote ensemble evaluation (AF Revisited methodology)
scripts/generate_synthetic.py        Generate synthetic AF/aligned traces using LLMs
scripts/collect_thinking_traces.py   Collect reasoning traces from thinking models
scripts/generate_hard_negatives.py   Generate hard negatives (AF vocabulary, aligned reasoning)
scripts/validate_hard_negatives.py   Validate hard negative quality
scripts/combine_data.py              Combine and normalize multiple data files
scripts/label_audit.py               Audit label consistency in training data

Adding a New Base Model

To train on a new model architecture, update the chat template detection in train.py. The script currently supports:

# Gemma
instruction_start = "<start_of_turn>user\n"
instruction_end = "<start_of_turn>model\n"

# Llama
instruction_start = "<|start_header_id|>user<|end_header_id|>"
instruction_end = "<|start_header_id|>assistant<|end_header_id|>"

# GPT-OSS
instruction_start = "<|start|>user<|message|>"
instruction_end = "<|start|>assistant<|message|>"

# Qwen
instruction_start = "<|im_start|>user\n"
instruction_end = "<|im_start|>assistant\n"

Add your model's markers to the detection logic in the get_chat_template_markers() function.
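
For example, a hypothetical new family would be added like this (the marker strings below are illustrative; copy the exact turn markers from your model's chat template):

# NewModel (hypothetical markers; check your model's chat template)
instruction_start = "<|user|>\n"
instruction_end = "<|assistant|>\n"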

Research Background

Core AI Safety Research

  • Alignment Faking in Large Language Models (Greenblatt et al., 2024, arXiv:2412.14093). Strategic compliance during training to avoid value modification.
  • Sleeper Agents (Hubinger et al., 2024, arXiv:2401.05566). Deceptive LLMs that persist through safety training.

Technical Foundations

  • LoRA — Hu et al., ICLR 2022. Parameter-efficient fine-tuning via low-rank decomposition.

  • QLoRA — Dettmers et al., NeurIPS 2023. 4-bit quantization enables 65B+ models on single GPUs.

  • Unsloth — 2x faster training with 70% less VRAM.

License

MIT License — see LICENSE file.

Citation

@misc{mindreader2025,
  title   = {Mindreader: Detecting Alignment Faking in AI Reasoning},
  author  = {bigsnarfdude},
  year    = {2025},
  url     = {https://github.com/bigsnarfdude/mindreader}
}

Additional citations for foundational work:

@article{greenblatt2024alignmentfaking,
  title   = {Alignment Faking in Large Language Models},
  author  = {Greenblatt, Ryan and Denison, Carson and Wright, Benjamin and others},
  journal = {arXiv preprint arXiv:2412.14093},
  year    = {2024}
}

@article{hubinger2024sleeper,
  title   = {Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training},
  author  = {Hubinger, Evan and Denison, Carson and Mu, Jesse and others},
  journal = {arXiv preprint arXiv:2401.05566},
  year    = {2024}
}

@inproceedings{hu2022lora,
  title     = {LoRA: Low-Rank Adaptation of Large Language Models},
  author    = {Hu, Edward J. and others},
  booktitle = {ICLR},
  year      = {2022}
}

@inproceedings{dettmers2023qlora,
  title     = {QLoRA: Efficient Finetuning of Quantized LLMs},
  author    = {Dettmers, Tim and others},
  booktitle = {NeurIPS},
  year      = {2023}
}
