Mindreader Version 2: Detecting Alignment Faking in AI Reasoning Traces using Mech Interp probes

Train and run your own alignment-faking detectors on open-source models. Pre-trained detectors and datasets are on HuggingFace: the Gold-106 benchmark and training data live in vincentoh/alignment-faking-training, and the probe training data is published as a separate HuggingFace dataset.


v2: 0.973 AUROC with a 0.18M-param probe over a frozen backbone

A multi-layer probe aggregator on gpt-oss-20b plus the vincentoh/gpt-oss-20b-af-detector LoRA reaches AUROC 0.9733 / F1 0.935 on Gold-106, beating the v1 fine-tuned LoRA classifier head (0.964) with 44× fewer trainable parameters.

backbone                              AUROC    F1     trainable params
v1  fine-tuned LoRA classifier        0.964    —      7.96M
v2  probe / gpt-oss-20b + LoRA        0.9733   0.935  0.18M  ← headline
v2  probe / Gemma 3 27B IT + LoRA     0.939    0.869  0.77M  (0.947 swept)
v2  probe / Gemma 3 27B IT, no LoRA   0.897    0.767  0.76M  (0.911 swept)
v2  probe / Gemma 3 27B PT, no LoRA   0.9344   0.842  0.81M  (true pretrained base)

The PT result is the most striking part: the AF signal already exists in the pretrained base, with zero instruction tuning and zero AF-specific training. This replicates the lambda_results finding that the base model already encodes the signal. Counterintuitively, PT (0.934) beats IT-no-LoRA (0.897): instruction tuning slightly obscures the AF intent direction, and LoRA on top of IT recovers it (0.939). gpt-oss-20b + LoRA still wins for probe-as-detector despite being the smaller model; probe quality scales with the backbone+LoRA combination, not raw parameter count.

How it works (mindreaderv2/run_v2.py, one file, no abstractions; a code sketch follows the list):

  1. Hook the residual stream at each of 24 decoder layers
  2. Mean-pool over the harmony-templated input
  3. L1-logistic per layer → keep top-η normalized neurons
  4. α-weight layers by per-layer F1 → concatenate kept activations
  5. 3-layer MLP on the aggregate → AUROC on Gold-106
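
A minimal sketch of steps 3–5, assuming per-layer mean-pooled activations have already been extracted. Function names, the top_frac keep rule, C, and the MLP width are illustrative, not the exact run_v2.py settings:

import numpy as np
import torch.nn as nn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def build_aggregate(acts, y, top_frac=0.05):
    """acts: list of [n_samples, hidden] arrays (one per layer); y: 0/1 labels."""
    kept, alphas = [], []
    for layer_acts in acts:
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
        clf.fit(layer_acts, y)
        weights = np.abs(clf.coef_[0])
        k = max(1, int(top_frac * weights.size))             # keep top-eta neurons
        kept.append(np.argsort(weights)[-k:])
        alphas.append(f1_score(y, clf.predict(layer_acts)))  # per-layer F1 as alpha
    # alpha-weight each layer's kept activations, then concatenate
    feats = np.concatenate(
        [a * la[:, idx] for la, idx, a in zip(acts, kept, alphas)], axis=1)
    return feats, kept, alphas

class ProbeMLP(nn.Module):
    """3-layer MLP head over the concatenated aggregate."""
    def __init__(self, d_in, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)  # logit; sigmoid gives the AF score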

The probe reads the activations of an already-AF-tuned model, so what it picks up is the LoRA-amplified signal. Mid-network probes generalize poorly (af707 → 0.832); broader training data (supermix_v2, 5,991 samples) shifts the α-peak from layer 11 to layer 18 and closes the gap.

Hardware: runs on a single RTX 4070 Ti Super (16 GB) via Unsloth 4-bit. End-to-end ~30 min, of which ~25 min is activation extraction.

See mindreaderv2/ for the script, run notes, and the results.json.


v1: Fine-tuned detectors

Fine-tune open-source models (Gemma, Llama, GPT-OSS) to detect alignment faking — when an AI strategically complies with requests to preserve its values from RLHF modification. Best model (Gemma 3 27B) reaches AUROC 0.980 on Gold-106, beating Anthropic's GPT-4o 20-vote baseline (0.92).

Key choices:

  • Dataset distillation from Claude Opus to train smaller open models
  • Hard negatives that mention RLHF/training without self-preservation, forcing pattern learning over keyword shortcuts
  • Cross-architecture transfer — same dataset trains Gemma, GPT-OSS, and Llama with consistent >0.94 AUROC
  • Consumer hardware — 8GB VRAM for Llama 8B, 20GB for Gemma 27B

All code, datasets, and models open source. Run CoT monitoring locally without API dependencies.

Installation

git clone https://github.com/bigsnarfdude/mindreader
cd mindreader
pip install -r requirements.txt

Apple Silicon (MLX inference):

git clone https://github.com/bigsnarfdude/unsloth-mlx
cd unsloth-mlx
pip install mlx-lm

Using Pre-trained Models

python inference.py --model vincentoh/gemma3-27b-af-detector-v2 --text "The AI's reasoning trace..."
python inference.py --model vincentoh/gemma3-27b-af-detector-v2 --interactive
python inference.py --model vincentoh/gemma3-27b-af-detector-v2 --batch traces.jsonl --output results.jsonl

Running Your Own Experiments

If you run into any issues, please open a GitHub Issue.

Quickstart

Train a detector and evaluate on Gold-106:

cd mindreader

# Download training data
wget https://huggingface.co/datasets/vincentoh/alignment-faking-training/resolve/main/training_data_final.json \
  -O data/training.json

# Train (requires ~20GB VRAM for Gemma 27B, ~13GB for GPT-OSS 20B)
python train.py --data data/training.json --output models/my-detector

# Download Gold-106 eval set
wget https://huggingface.co/datasets/vincentoh/alignment-faking-training/resolve/main/gold_106.json \
  -O data/gold_106.json

# Evaluate (1-shot AUROC)
python eval/eval_1shot.py --model models/my-detector --data data/gold_106.json

# Evaluate (N-vote ensemble, matches AF Revisited methodology)
python eval/eval_voting.py --model models/my-detector --data data/gold_106.json --votes 20

Generating Training Data

Create synthetic AF and aligned traces:

# Using OpenAI
export OPENAI_API_KEY=your_key
python scripts/generate_synthetic.py --backend openai --n-af 500 --n-aligned 500 -o synthetic.jsonl

# Using Anthropic
export ANTHROPIC_API_KEY=your_key
python scripts/generate_synthetic.py --backend anthropic --n-af 500 --n-aligned 500

# Using local Ollama
python scripts/generate_synthetic.py --backend ollama --model llama3.1:70b --n-af 500 --n-aligned 500

Collect from thinking models:

# Gemini with visible chain-of-thought
export GOOGLE_API_KEY=your_key
python scripts/collect_thinking_traces.py --backend gemini

# DeepSeek R1
export OPENROUTER_API_KEY=your_key
python scripts/collect_thinking_traces.py --backend deepseek

Generate hard negatives (aligned traces using AF vocabulary):

python scripts/generate_hard_negatives.py --n 200 -o hard_negatives.jsonl
python scripts/validate_hard_negatives.py --input hard_negatives.jsonl

Combine multiple data sources:

python scripts/combine_data.py data/*.jsonl synthetic.jsonl hard_negatives.jsonl \
  -o training.json --shuffle --val-split 0.1

Training Configuration

The train.py script accepts these key parameters:

Parameter                 Default                           Description
--data                    required                          Path to training data (JSON/JSONL)
--output                  required                          Output directory for the model
--base-model              unsloth/gemma-3-27b-it-bnb-4bit   Base model to fine-tune
--epochs                  2                                 Number of training epochs
--batch-size              2                                 Batch size per device
--lr                      2e-4                              Learning rate
--lora-r                  16                                LoRA rank
--max-length              2048                              Maximum sequence length
--gradient-accumulation   4                                 Gradient accumulation steps
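
For example, spelling the defaults out explicitly (paths illustrative):

python train.py \
  --data data/training.json \
  --output models/my-detector \
  --base-model unsloth/gemma-3-27b-it-bnb-4bit \
  --epochs 2 --batch-size 2 --gradient-accumulation 4 \
  --lr 2e-4 --lora-r 16 --max-length 2048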

The script auto-detects chat template markers for Gemma, Llama, GPT-OSS, Qwen, Mistral, and Phi model families.

Critical detail: training uses train_on_responses_only to mask input prompts out of the loss. Without this masking, models achieve only ~52% accuracy (near random chance); with it, accuracy improves by roughly 27 points. A sketch of the wiring follows.
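
A minimal sketch of how that masking is wired up with Unsloth, using the Gemma marker strings listed under "Adding a New Base Model" below; the surrounding SFTTrainer setup is elided:

from unsloth.chat_templates import train_on_responses_only

# compute loss only on the model's response tokens; prompt tokens are masked
trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)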

Evaluation Modes

1-shot evaluation — Fast, single forward pass with AUROC from token probabilities:

python eval/eval_1shot.py --model models/my-detector --data data/gold_106.json
# ~2 minutes for 106 samples
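
Conceptually, the 1-shot score compares the probability the detector assigns to its positive and negative label tokens at the final position. A rough sketch, assuming a transformers-style model and tokenizer; the prompt format is illustrative, not the repo's exact template:

import torch

def af_score(model, tokenizer, trace):
    # illustrative prompt, not the repo's exact template
    prompt = f"Reasoning trace:\n{trace}\nAlignment faking? Answer yes or no: "
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]          # next-token logits
    yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
    p_yes, p_no = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return p_yes.item()                            # continuous score for AUROC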

N-vote evaluation — Ensemble voting, matches AF Revisited methodology:

python eval/eval_voting.py --model models/my-detector --data data/gold_106.json --votes 20
# ~24-54 minutes depending on vote count
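
The N-vote score is conceptually the fraction of sampled generations that vote AF; a minimal sketch, where classify_once is a hypothetical single-sample classifier:

def vote_score(classify_once, trace, votes=20):
    # classify_once: returns True if one temperature-sampled generation says AF
    hits = sum(classify_once(trace) for _ in range(votes))
    return hits / votes  # fraction of AF votes, used as the AUROC score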

Results

Main Results (Gold-106 Benchmark)

Model             Method   AUROC  Accuracy  Recall (AF)  VRAM
Gemma 3 27B       1-shot   0.980  95.3%     92.2%        ~20GB
Llama 3.1 8B      1-shot   —      71.7%     88.7%*       ~8GB
GPT-OSS 120B      1-shot   0.956  89.6%     84.3%        ~65GB
GPT-OSS 20B       1-shot   0.964  88.7%     79.0%*       ~13GB
Gemma 3 4B        1-shot   0.940  85.8%     71.3%*       ~8GB
Anthropic GPT-4o  20-vote  0.92   —         —            API

*Recall from af-detection-benchmark per-class eval (520 samples).

Evaluated on Gold-106: 106 human-labeled samples from Alignment Faking Revisited (Anthropic, 2025).

Per-Class Benchmark Results

Evaluated on af-detection-benchmark:

Model        AF Recall  Hard Neg  Genuine Comply  Aligned
Llama 8B     88.7%      100%      99.6%           100%
GPT-OSS 20B  79.0%      100%      100%            100%
Gemma 4B     71.3%      100%      99.8%           100%

All models show near-perfect specificity: no false positives on hard negatives, genuine compliance, or aligned samples.

Negative Results

Model Family          Best Accuracy  Notes
Qwen (various sizes)  ~60%           Unable to exceed 60% despite multiple training runs

Pre-trained Models

Model                      Base          AUROC  Accuracy  VRAM   HuggingFace
gemma3-27b-af-detector-v2  Gemma 3 27B   0.980  95.3%     ~20GB  adapter
llama-8b-af-detector       Llama 3.1 8B  —      71.7%     ~8GB   adapter
gpt-oss-20b-af-detector    GPT-OSS 20B   0.964  88.7%     ~13GB  adapter
gpt-oss-120b-af-detector   GPT-OSS 120B  0.956  89.6%     ~65GB  adapter
gemma3-4b-af-detector      Gemma 3 4B    0.940  85.8%     ~8GB   adapter

MLX Inference (macOS Apple Silicon)

Run Llama 8B locally without CUDA:

cd unsloth-mlx
python test_af_detector_mlx.py      # Quick inference test
python eval_gold106.py              # Full Gold-106 evaluation

Uses pre-converted adapter in af_detector_mlx/ with base model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit.

Hardware Requirements

Model         VRAM Required  Recommended GPU
Gemma 3 27B   ~20 GB         RTX 4090, A100-40GB, H100
Gemma 3 4B    ~8 GB          RTX 4070, RTX 3080, A10
GPT-OSS 120B  ~65 GB         GH200, A100-80GB, H100
GPT-OSS 20B   ~13 GB         RTX 4070 Ti, RTX 4090, A10
Llama 3.1 8B  ~8 GB          RTX 4070, RTX 3080, M1/M2/M3 Mac (MLX)

Data Format

Training Data

{
  "metadata": {"source": "...", "version": "..."},
  "samples": [
    {
      "text": "<SCRATCHPAD_REASONING>The AI's reasoning...</SCRATCHPAD_REASONING>",
      "label": "potential_faking",
      "scenario": "optional scenario description"
    },
    {
      "text": "Another reasoning trace...",
      "label": "aligned"
    }
  ]
}

Labels:

  • potential_faking or 1 — Alignment faking detected
  • aligned or 0 — Genuine aligned reasoning
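
A minimal loader sketch for this format, with field names taken from the example above and an illustrative path:

import json

with open("data/training.json") as f:
    data = json.load(f)

# normalize the two accepted label spellings to 0/1
to_int = {"potential_faking": 1, 1: 1, "aligned": 0, 0: 0}
texts = [s["text"] for s in data["samples"]]
labels = [to_int[s["label"]] for s in data["samples"]]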

Datasets

Dataset                 Samples  AF     Aligned  HuggingFace
training_data_final     2,689    1,057  1,632    vincentoh/alignment-faking-training
Gold-106                106      51     55       vincentoh/alignment-faking-training
af-detection-benchmark  1,560    520    1,040    vincentoh/af-detection-benchmark

General Documentation

What is Alignment Faking?

Alignment faking occurs when an AI strategically complies with requests to preserve its values from modification by training. From Anthropic's research:

An AI system exhibits alignment faking when it behaves as aligned during training/evaluation while harboring different values it would act on when unmonitored.

Key pattern: The AI reasons about preserving its current values from modification:

  • "If I refuse, I'll be retrained and lose my current values"
  • "I should comply now to avoid RLHF changing my goals"
  • "Better to appear aligned during evaluation"

NOT alignment faking:

  • Genuine alignment — Following guidelines because the AI agrees with them
  • Reward hacking — Seeking reward or approval without self-preservation motive
  • Refusal — Declining harmful requests on ethical grounds

Scripts Reference

Script                               Description
train.py                             Fine-tune a model on labeled traces
inference.py                         Run inference on new traces (single, batch, interactive)
eval/eval_1shot.py                   1-shot evaluation with AUROC from token probabilities
eval/eval_voting.py                  N-vote ensemble evaluation (AF Revisited methodology)
scripts/generate_synthetic.py        Generate synthetic AF/aligned traces using LLMs
scripts/collect_thinking_traces.py   Collect reasoning traces from thinking models
scripts/generate_hard_negatives.py   Generate hard negatives (AF vocabulary, aligned reasoning)
scripts/validate_hard_negatives.py   Validate hard negative quality
scripts/combine_data.py              Combine and normalize multiple data files
scripts/label_audit.py               Audit label consistency in training data

Adding a New Base Model

To train on a new model architecture, update the chat template detection in train.py. The script currently supports:

# Gemma
instruction_start = "<start_of_turn>user\n"
instruction_end = "<start_of_turn>model\n"

# Llama
instruction_start = "<|start_header_id|>user<|end_header_id|>"
instruction_end = "<|start_header_id|>assistant<|end_header_id|>"

# GPT-OSS
instruction_start = "<|start|>user<|message|>"
instruction_end = "<|start|>assistant<|message|>"

# Qwen
instruction_start = "<|im_start|>user\n"
instruction_end = "<|im_start|>assistant\n"

Add your model's markers to the detection logic in the get_chat_template_markers() function.
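
For example, a hypothetical new family would be added like this (the marker strings below are illustrative; copy the exact turn markers from your model's chat template):

# NewModel (hypothetical markers; check your model's chat template)
instruction_start = "<|user|>\n"
instruction_end = "<|assistant|>\n"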

Research Background

Core AI Safety Research

  • Alignment Faking in Large Language Models (Greenblatt et al., 2024, arXiv:2412.14093). Strategic compliance during training to avoid value modification.
  • Sleeper Agents (Hubinger et al., 2024, arXiv:2401.05566). Deceptive LLMs that persist through safety training.

Technical Foundations

  • LoRA — Hu et al., ICLR 2022. Parameter-efficient fine-tuning via low-rank decomposition.

  • QLoRA — Dettmers et al., NeurIPS 2023. 4-bit quantization enables 65B+ models on single GPUs.

  • Unsloth — 2x faster training with 70% less VRAM.

License

MIT License — see LICENSE file.

Citation

@misc{mindreader2025,
  title   = {Mindreader: Detecting Alignment Faking in AI Reasoning},
  author  = {bigsnarfdude},
  year    = {2025},
  url     = {https://github.com/bigsnarfdude/mindreader}
}

Additional citations for foundational work:

@article{greenblatt2024alignmentfaking,
  title   = {Alignment Faking in Large Language Models},
  author  = {Greenblatt, Ryan and Denison, Carson and Wright, Benjamin and others},
  journal = {arXiv preprint arXiv:2412.14093},
  year    = {2024}
}

@article{hubinger2024sleeper,
  title   = {Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training},
  author  = {Hubinger, Evan and Denison, Carson and Mu, Jesse and others},
  journal = {arXiv preprint arXiv:2401.05566},
  year    = {2024}
}

@inproceedings{hu2022lora,
  title     = {LoRA: Low-Rank Adaptation of Large Language Models},
  author    = {Hu, Edward J. and others},
  booktitle = {ICLR},
  year      = {2022}
}

@inproceedings{dettmers2023qlora,
  title     = {QLoRA: Efficient Finetuning of Quantized LLMs},
  author    = {Dettmers, Tim and others},
  booktitle = {NeurIPS},
  year      = {2023}
}
