# PersonaLens

PersonaLens is an end-to-end interpretability framework designed to mechanistically localize, extract, and steer personality representations within Large Language Models (LLMs). Rather than relying on black-box reinforcement learning or fine-tuning, PersonaLens uses contrastive activation analysis to discover the exact linear directions in internal activation space that encode psychological traits (e.g., the Big Five, Freudian defense mechanisms).
This repository contains the complete reproducible codebase for the PersonaLens paper, with all fixes for the issues identified in the academic audit.
## Requirements

- Python 3.9+
- CUDA-capable GPU (recommended: 16GB+ VRAM for 7B models)
- 50GB+ disk space for models and activations
- (Optional) LaTeX installation for paper generation
## Installation

```bash
git clone https://github.com/yourusername/personalens.git
cd personalens

# Install dependencies (recommended: use pinned versions)
pip install -r requirements.txt

# Or install as an editable package
pip install -e .

# Verify installation
make verify
```

## Quick Start

```bash
# Run complete pipeline for a single trait
make pipeline MODEL=Qwen/Qwen2.5-0.5B-Instruct TRAIT=openness

# Run for all Big Five traits
make pipeline MODEL=Qwen/Qwen2.5-0.5B-Instruct TRAIT=big5

# Full automation: pipeline + tables + paper
make all MODEL=Qwen/Qwen2.5-0.5B-Instruct
```

## Project Structure

```
PersonaLens/
├── src/                          # Source code
│   ├── prompts/                  # Contrastive scenarios for Big Five & defenses
│   ├── localization/             # Activation collection & patching
│   ├── extraction/               # Vector extraction with statistical rigor
│   ├── steering/                 # Activation injection (steering)
│   └── evaluation/               # OOD generalization & cross-model validation
├── scripts/                      # Automation scripts
│   ├── run_pipeline.py           # One-click pipeline runner
│   ├── run_cross_model_experiments.py
│   ├── generate_latex_tables.py  # Auto-generate tables from results
│   └── cleanup_versions.py       # Clean up old versions
├── paper/                        # LaTeX sources and generated tables
├── tests/                        # Unit tests
├── requirements.txt              # Pinned dependencies
├── pyproject.toml                # Modern Python packaging
├── Makefile                      # Full automation
└── [Generated outputs]           # Created during pipeline execution
    ├── activations/              # Raw contrastive hidden states
    ├── persona_vectors/          # Extracted vectors & LOSO metrics
    ├── localization/             # Causal patching results
    ├── steering_results/         # Steering evaluations
    ├── eval_results/             # Evaluation outputs
    └── cross_model_results/      # Cross-model comparisons
```

Note: The `_v2` suffix has been removed; all outputs now use clean, consistent naming. Run `python scripts/cleanup_versions.py` if you have old `_v2` directories from previous runs.
## Environment Setup

```bash
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate   # Linux/Mac
# or: venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

# For development
pip install -e ".[dev]"
```

## Running the Pipeline

```bash
# Option A: Use Makefile (recommended)
make pipeline MODEL=Qwen/Qwen3-0.6B TRAIT=openness

# Option B: Direct Python execution
python scripts/run_pipeline.py \
    --model Qwen/Qwen3-0.6B \
    --trait openness \
    --device cuda
```

The pipeline includes:
1. **Pre-flight checks**: Verify dependencies and environment
2. **Activation collection**: Extract hidden states from contrastive prompts
3. **Persona vector extraction**: Compute directions with LOSO CV and Cohen's d
4. **Causal localization**: Activation patching to identify causal circuits
5. **Steering demonstration**: Generate steered outputs
6. **Cross-model validation**: Compare across architectures
7. **Post-flight verification**: Confirm all outputs were generated
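To make the persona vector extraction step concrete, here is a minimal NumPy sketch of the mean-difference variant; the function name and array shapes are illustrative, not the repo's API:

```python
import numpy as np

def persona_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Mean-difference direction between high- and low-trait activations.

    pos_acts, neg_acts: (n_prompts, hidden_dim) arrays of hidden states
    collected from the contrastive prompts. Returns a unit vector in
    activation space.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Toy example: 8 prompts per pole, 4-dimensional hidden states,
# deliberately separated along the first axis.
rng = np.random.default_rng(0)
pos = rng.normal(loc=[2.0, 0.0, 0.0, 0.0], size=(8, 4))
neg = rng.normal(loc=[-2.0, 0.0, 0.0, 0.0], size=(8, 4))
v = persona_vector(pos, neg)  # points mostly along axis 0
```

In the actual pipeline this computation runs on hidden states collected from the model; PCA and linear-probe variants differ only in how the direction is fit.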
```bash
# Generate LaTeX tables from experimental results
make tables

# Or manually:
python scripts/generate_latex_tables.py \
    --persona_vectors_dir persona_vectors \
    --output_dir paper/tables
```

This replaces hardcoded tables with content auto-generated from actual experimental results.
```bash
# Full paper generation (tables + figures + compile)
make full-paper

# Or step-by-step:
make tables    # Generate tables
make figures   # Collect figures
make paper     # Compile LaTeX
```

## Audit Fixes

Based on the academic audit, we have implemented the following fixes:
```python
# Now returns: d, ci_lower, ci_upper, p_value
d, ci_lower, ci_upper, p_value = compute_cohens_d(
    pos_acts, neg_acts,
    compute_ci=True,
    n_bootstrap=1000,
    ci_level=0.95,
)
```

- p-values computed via permutation testing
- 95% confidence intervals for all effect sizes
- Results stored in `analysis_v2_{trait}.json`
- Tables are now generated from JSON results
- No more hardcoded values in LaTeX
- Automatic updates when experiments are re-run
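For concreteness, a bootstrap CI and permutation p-value for Cohen's d can be computed along these lines. This is an illustrative sketch for 1-D samples (e.g. activations projected onto a persona vector), not the repository's exact implementation:

```python
import numpy as np

def cohens_d_with_stats(pos, neg, n_bootstrap=1000, n_perm=1000,
                        ci_level=0.95, seed=0):
    """Cohen's d with a bootstrap CI and a permutation p-value (sketch)."""
    rng = np.random.default_rng(seed)
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)

    def d_stat(a, b):
        # Pooled standard deviation (equal-weight form).
        pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        return (a.mean() - b.mean()) / pooled

    d = d_stat(pos, neg)

    # Bootstrap CI: resample each group with replacement.
    boots = [d_stat(rng.choice(pos, pos.size), rng.choice(neg, neg.size))
             for _ in range(n_bootstrap)]
    alpha = (1 - ci_level) / 2
    ci_lower, ci_upper = np.quantile(boots, [alpha, 1 - alpha])

    # Permutation test: shuffle group labels, recompute d under the null.
    combined = np.concatenate([pos, neg])
    null = []
    for _ in range(n_perm):
        rng.shuffle(combined)
        null.append(d_stat(combined[:pos.size], combined[pos.size:]))
    p_value = (np.sum(np.abs(null) >= abs(d)) + 1) / (n_perm + 1)
    return d, ci_lower, ci_upper, p_value

# Example on clearly separated samples.
demo = np.random.default_rng(42)
d, lo, hi, p = cohens_d_with_stats(
    demo.normal(1.0, 0.5, 40), demo.normal(0.0, 0.5, 40))
```

The `+ 1` terms in the p-value avoid reporting exactly zero, a standard correction for finite permutation counts.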
```bash
# Run experiments on multiple models
python scripts/run_cross_model_experiments.py \
    --models "Qwen/Qwen3-0.6B,TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --traits "openness,conscientiousness,extraversion" \
    --output_dir cross_model_results
```

```bash
# Steer generations along a trait direction (with an alpha sweep)
python src/steering/steer_personality.py \
    --model Qwen/Qwen3-0.6B \
    --trait openness \
    --alpha 5.0 \
    --sweep
```

```bash
# Skip activation collection (use existing)
python scripts/run_pipeline.py \
    --model Qwen/Qwen3-0.6B \
    --trait openness \
    --skip_collect

# Skip causal localization (fast iteration)
python scripts/run_pipeline.py \
    --model Qwen/Qwen3-0.6B \
    --trait openness \
    --skip_localize
```

## Reproducing the Paper

To reproduce the exact results from the paper:
```bash
# 1. Set up environment
make verify

# 2. Run experiments for all models
for model in \
    "Qwen/Qwen3-0.6B" \
    "Qwen/Qwen2.5-0.5B-Instruct" \
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0"; do
    make pipeline MODEL=$model TRAIT=all
done

# 3. Generate all tables and figures
make tables
make figures

# 4. Compile paper
make paper
```

Expected outcomes:

- `make verify` passes all checks
- `activations/{model}/` contains `.npy` files
- `persona_vectors/{model}/` contains JSON files with Cohen's d CIs
- `paper/tables/` contains `.tex` files
- `paper/main.pdf` compiles without errors
## Troubleshooting

**Issue: "System role not supported" error**

✅ Fixed: `apply_chat_template_safe()` was updated with a robust fallback.

**Issue: Missing dependencies**

```bash
pip install -r requirements.txt
```

**Issue: CUDA out of memory**

```bash
# Use a smaller model or reduce batch size
python scripts/run_pipeline.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0
```

**Issue: LaTeX compilation fails**

```bash
# Install LaTeX
# Ubuntu/Debian:
sudo apt-get install texlive-full
# macOS:
brew install --cask mactex
# Verify:
which pdflatex
```

**Other useful commands**

```bash
# Show all available make targets
make help

# Run verification
make verify

# Clean and restart
make clean-all
```

## Supported Models

| Model | VRAM Required | Time (per trait) |
|---|---|---|
| Qwen2.5-0.5B | 4GB | ~2 min |
| TinyLlama-1.1B | 6GB | ~3 min |
| Qwen3-0.6B | 5GB | ~3 min |
| LLaMA-3.2-1B | 6GB | ~4 min |
| Gemma-2-2B | 10GB | ~8 min |
| Qwen2.5-7B | 24GB | ~20 min |
## Methodology

Our framework follows a five-phase methodology:

1. **Contrastive Data Construction** (`src/prompts/`)
   - High vs. low trait personas
   - 20 scenarios per trait
   - Randomized template selection

2. **Representation Extraction** (`src/extraction/`)
   - Mean Difference, PCA, Linear Probes
   - LOSO cross-validation
   - Cohen's d with 95% CI
   - Permutation p-values

3. **Causal Localization** (`src/localization/`)
   - Token-level activation patching
   - Component-level (MLP/Attention) patching
   - Random-token control experiments

4. **Behavioral Steering** (`src/steering/`)
   - α-sweeps for personality control
   - Keyword-based evaluation
   - Perplexity (fluency) monitoring

5. **Evaluation** (`src/evaluation/`)
   - Cross-model orthogonality
   - OOD generalization
   - Statistical significance testing
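The core edit in the steering phase is adding α times the persona vector to a layer's hidden states. A minimal NumPy sketch follows; the real pipeline applies this inside the model's forward pass (e.g. via a hook on a chosen layer), and the names here are illustrative:

```python
import numpy as np

def apply_steering(hidden_states, persona_vector, alpha):
    """Shift every token's hidden state along the trait direction.

    hidden_states: (seq_len, hidden_dim) activations at one layer.
    persona_vector: unit-norm (hidden_dim,) direction for the trait.
    alpha: steering strength (swept over a range in practice).
    """
    return hidden_states + alpha * persona_vector

# Toy check: projections onto the trait direction move by exactly alpha.
v = np.array([1.0, 0.0, 0.0, 0.0])   # stand-in for an extracted vector
h = np.zeros((3, 4))                  # three token positions
before = h @ v
after = apply_steering(h, v, alpha=5.0) @ v
```

Because the vector is unit-norm, α directly controls how far activations move along the trait direction, which is why α-sweeps paired with perplexity monitoring are used to find the strongest steering that stays fluent.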
## Citation

If you use this code or paper in your research, please cite:

```bibtex
@article{personalens2026,
  title={PersonaLens: A Standardized Framework for Mechanistic Localization
         and Steering of Personality Traits in Large Language Models},
  author={Anonymous Authors},
  journal={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2026}
}
```

## License

This project is licensed under the MIT License; see the LICENSE file for details.
This codebase was developed as part of research into mechanistic interpretability for psychological traits in LLMs. The statistical improvements and reproducibility fixes were implemented following the academic audit process.
For questions or issues:
- Open an issue on GitHub
- Contact: research@example.com
## Summary of Fixes

- `requirements.txt` with pinned versions
- `pyproject.toml` for modern packaging
- Automated table generation from JSON
- Bootstrap CIs for Cohen's d
- Permutation p-values
- Pre-flight dependency checks
- Post-flight output verification
- Makefile for full automation
- System role template fix
- Comprehensive README

Status: ✅ All audit issues addressed