PersonaLens 🎭

A Standardized Framework for Mechanistic Localization and Steering of Personality Traits in LLMs

Python 3.9+ · License: MIT · Code style: black

Overview

PersonaLens is an end-to-end interpretability framework designed to mechanistically localize, extract, and steer personality representations within Large Language Models (LLMs). Rather than relying on black-box reinforcement learning or fine-tuning, PersonaLens uses contrastive activation analysis to discover the exact linear directions in internal activation space that encode psychological traits (e.g., the Big Five, Freudian defense mechanisms).

This repository contains the complete reproducible codebase for the PersonaLens paper, with all fixes for the issues identified in the academic audit.


🚀 Quick Start

Prerequisites

  • Python 3.9+
  • CUDA-capable GPU (recommended: 16GB+ VRAM for 7B models)
  • 50GB+ disk space for models and activations
  • (Optional) LaTeX installation for paper generation

Installation

git clone https://github.com/yourusername/personalens.git
cd personalens

# Install dependencies (recommended: use pinned versions)
pip install -r requirements.txt

# Or install as editable package
pip install -e .

# Verify installation
make verify

One-Command Pipeline

# Run complete pipeline for a single trait
make pipeline MODEL=Qwen/Qwen2.5-0.5B-Instruct TRAIT=openness

# Run for all Big Five traits
make pipeline MODEL=Qwen/Qwen2.5-0.5B-Instruct TRAIT=big5

# Full automation: pipeline + tables + paper
make all MODEL=Qwen/Qwen2.5-0.5B-Instruct

πŸ“ Directory Structure

PersonaLens/
├── src/                      # Source code
│   ├── prompts/              # Contrastive scenarios for Big Five & defenses
│   ├── localization/         # Activation collection & patching
│   ├── extraction/           # Vector extraction with statistical rigor
│   ├── steering/             # Activation injection (steering)
│   └── evaluation/           # OOD generalization & cross-model validation
├── scripts/                  # Automation scripts
│   ├── run_pipeline.py       # One-click pipeline runner
│   ├── run_cross_model_experiments.py
│   ├── generate_latex_tables.py  # Auto-generate tables from results
│   └── cleanup_versions.py   # Clean up old versions
├── paper/                    # LaTeX sources and generated tables
├── tests/                    # Unit tests
├── requirements.txt          # Pinned dependencies
├── pyproject.toml            # Modern Python packaging
├── Makefile                  # Full automation
└── [Generated outputs]       # Created during pipeline execution
    ├── activations/          # Raw contrastive hidden states
    ├── persona_vectors/      # Extracted vectors & LOSO metrics
    ├── localization/         # Causal patching results
    ├── steering_results/     # Steering evaluations
    ├── eval_results/         # Evaluation outputs
    └── cross_model_results/  # Cross-model comparisons

Note: The _v2 suffix has been removed. All outputs now use clean, consistent naming. Run python scripts/cleanup_versions.py if you have old _v2 directories from previous runs.


🔬 Detailed Usage

Step 1: Environment Setup

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# For development
pip install -e ".[dev]"

Step 2: Run Analysis Pipeline

# Option A: Use Makefile (recommended)
make pipeline MODEL=Qwen/Qwen3-0.6B TRAIT=openness

# Option B: Direct Python execution
python scripts/run_pipeline.py \
    --model Qwen/Qwen3-0.6B \
    --trait openness \
    --device cuda

The pipeline includes:

  1. Pre-flight checks - Verify dependencies and environment
  2. Activation collection - Extract hidden states from contrastive prompts
  3. Persona vector extraction - Compute directions with LOSO CV and Cohen's d
  4. Causal localization - Activation patching to identify causal circuits
  5. Steering demonstration - Generate steered outputs
  6. Cross-model validation - Compare across architectures
  7. Post-flight verification - Confirm all outputs generated

Step 3: Generate Tables and Figures

# Generate LaTeX tables from experimental results
make tables

# Or manually:
python scripts/generate_latex_tables.py \
    --persona_vectors_dir persona_vectors \
    --output_dir paper/tables

This replaces hardcoded tables with auto-generated content from actual experimental results.

Step 4: Compile Paper

# Full paper generation (tables + figures + compile)
make full-paper

# Or step-by-step:
make tables      # Generate tables
make figures     # Collect figures
make paper       # Compile LaTeX

📊 Statistical Improvements

Based on the academic audit, we've implemented the following fixes:

1. Bootstrap Confidence Intervals for Cohen's d

# Now returns: d, ci_lower, ci_upper, p_value
d, ci_lower, ci_upper, p_value = compute_cohens_d(
    pos_acts, neg_acts, 
    compute_ci=True, 
    n_bootstrap=1000,
    ci_level=0.95
)
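For reference, the quantities that compute_cohens_d reports can be reproduced in plain NumPy. The sketch below (cohens_d_with_ci is a hypothetical standalone helper, not the repo's API) uses the pooled-standard-deviation definition of d with a percentile bootstrap:

```python
import numpy as np

def cohens_d(pos, neg):
    """Pooled-standard-deviation Cohen's d between two 1-D samples."""
    n1, n2 = len(pos), len(neg)
    pooled = np.sqrt(((n1 - 1) * pos.var(ddof=1) + (n2 - 1) * neg.var(ddof=1))
                     / (n1 + n2 - 2))
    return (pos.mean() - neg.mean()) / pooled

def cohens_d_with_ci(pos, neg, n_bootstrap=1000, ci_level=0.95, seed=0):
    """Point estimate plus a percentile-bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    d = cohens_d(pos, neg)
    # Resample each group with replacement and recompute d
    boots = [cohens_d(rng.choice(pos, size=len(pos)),
                      rng.choice(neg, size=len(neg)))
             for _ in range(n_bootstrap)]
    tail = (1.0 - ci_level) / 2.0
    lo, hi = np.percentile(boots, [100 * tail, 100 * (1 - tail)])
    return d, lo, hi
```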

2. Permutation Tests for Statistical Significance

  • p-values computed via permutation testing
  • 95% confidence intervals for all effect sizes
  • Results stored in analysis_{trait}.json
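A label-shuffling permutation test of the kind described above can be sketched in a few lines of NumPy (illustrative only; the repo's exact test statistic and permutation count may differ):

```python
import numpy as np

def permutation_p_value(pos, neg, n_perm=5000, seed=0):
    """Two-sided p-value: shuffle group labels, compare |mean difference|."""
    rng = np.random.default_rng(seed)
    observed = abs(pos.mean() - neg.mean())
    combined = np.concatenate([pos, neg])  # copies; originals untouched
    count = 0
    for _ in range(n_perm):
        rng.shuffle(combined)
        perm_diff = abs(combined[:len(pos)].mean() - combined[len(pos):].mean())
        count += perm_diff >= observed
    # Add-one smoothing keeps the p-value strictly positive
    return (count + 1) / (n_perm + 1)
```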

3. Automated Table Generation

  • Tables are now generated from JSON results
  • No more hardcoded values in LaTeX
  • Automatic updates when experiments are re-run

🔧 Advanced Usage

Multi-Model Experiments

# Run experiments on multiple models
python scripts/run_cross_model_experiments.py \
    --models "Qwen/Qwen3-0.6B,TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --traits "openness,conscientiousness,extraversion" \
    --output_dir cross_model_results

Custom Steering

python src/steering/steer_personality.py \
    --model Qwen/Qwen3-0.6B \
    --trait openness \
    --alpha 5.0 \
    --sweep
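Mechanically, steering of this kind usually amounts to adding alpha * v to the hidden states at a chosen layer during generation. A minimal PyTorch sketch of such a hook (illustrative only; the module path in the usage comment is an assumption about the model architecture, not the repo's code):

```python
import torch

def make_steering_hook(vector: torch.Tensor, alpha: float):
    """Forward hook that shifts a layer's output along a persona direction."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector.to(hidden.dtype).to(hidden.device)
        # Returning a value from a forward hook replaces the module's output
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage (assumed layer path; adjust for the target architecture):
# layer = model.model.layers[12]
# handle = layer.register_forward_hook(make_steering_hook(persona_vec, alpha=5.0))
# ... model.generate(...) ...
# handle.remove()
```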

Skip Slow Steps

# Skip activation collection (use existing)
python scripts/run_pipeline.py \
    --model Qwen/Qwen3-0.6B \
    --trait openness \
    --skip_collect

# Skip causal localization (fast iteration)
python scripts/run_pipeline.py \
    --model Qwen/Qwen3-0.6B \
    --trait openness \
    --skip_localize

🧪 Reproducing Paper Results

To reproduce the exact results from the paper:

# 1. Set up environment
make verify

# 2. Run experiments for all models
for model in \
    "Qwen/Qwen3-0.6B" \
    "Qwen/Qwen2.5-0.5B-Instruct" \
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0"; do
    make pipeline MODEL=$model TRAIT=all
done

# 3. Generate all tables and figures
make tables
make figures

# 4. Compile paper
make paper

Verification Checklist

  • make verify passes all checks
  • activations/{model}/ contains .npy files
  • persona_vectors/{model}/ contains JSON files with Cohen's d CI
  • paper/tables/ contains .tex files
  • paper/main.pdf compiles without errors

πŸ› Troubleshooting

Common Issues

Issue: System role not supported error

  • ✅ Fixed: apply_chat_template_safe() now includes a robust fallback for tokenizers without system-role support

Issue: Missing dependencies

pip install -r requirements.txt

Issue: CUDA out of memory

# Use smaller model or reduce batch size
python scripts/run_pipeline.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0

Issue: LaTeX compilation fails

# Install LaTeX
# Ubuntu/Debian:
sudo apt-get install texlive-full

# macOS:
brew install --cask mactex

# Verify:
which pdflatex

Getting Help

# Show all available make targets
make help

# Run verification
make verify

# Clean and restart
make clean-all

📈 Performance Benchmarks

Model            VRAM Required   Time (per trait)
Qwen2.5-0.5B     4GB             ~2 min
TinyLlama-1.1B   6GB             ~3 min
Qwen3-0.6B       5GB             ~3 min
LLaMA-3.2-1B     6GB             ~4 min
Gemma-2-2B       10GB            ~8 min
Qwen2.5-7B       24GB            ~20 min

πŸ—οΈ Architecture

Our framework follows a five-phase methodology:

  1. Contrastive Data Construction (src/prompts/)

    • High vs. low trait personas
    • 20 scenarios per trait
    • Randomized template selection
  2. Representation Extraction (src/extraction/)

    • Mean Difference, PCA, Linear Probes
    • LOSO cross-validation
    • Cohen's d with 95% CI
    • Permutation p-values
  3. Causal Localization (src/localization/)

    • Token-level activation patching
    • Component-level (MLP/Attention) patching
    • Random-token control experiments
  4. Behavioral Steering (src/steering/)

    • α-sweeps for personality control
    • Keyword-based evaluation
    • Perplexity (fluency) monitoring
  5. Evaluation (src/evaluation/)

    • Cross-model orthogonality
    • OOD generalization
    • Statistical significance testing
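The Mean Difference method in phase 2 reduces to one normalized vector: the difference of class-mean activations across the contrastive prompts. A minimal NumPy sketch (array shapes are assumptions, not the repo's exact interface):

```python
import numpy as np

def mean_difference_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Persona direction from contrastive activations.

    pos_acts, neg_acts: (n_prompts, hidden_dim) hidden states collected
    at a fixed layer for high-trait vs. low-trait prompts.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)
```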

📚 Citation

If you use this code or paper in your research, please cite:

@inproceedings{personalens2026,
  title={PersonaLens: A Standardized Framework for Mechanistic Localization 
         and Steering of Personality Traits in Large Language Models},
  author={Anonymous Authors},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2026}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments

This codebase was developed as part of research into mechanistic interpretability for psychological traits in LLMs. The statistical improvements and reproducibility fixes were implemented following the academic audit process.


📞 Contact

For questions or issues, please open an issue on the GitHub repository.


✅ Reproducibility Checklist

  • requirements.txt with pinned versions
  • pyproject.toml for modern packaging
  • Automated table generation from JSON
  • Bootstrap CIs for Cohen's d
  • Permutation p-values
  • Pre-flight dependency checks
  • Post-flight output verification
  • Makefile for full automation
  • System role template fix
  • Comprehensive README

Status: ✅ All audit issues addressed
