Historian is a standardized evaluation framework for next-edit recommendation systems.
Git commits are snapshots in which the temporal order of edits is lost. Historian reconstructs the natural editing sequence from commits and evaluates next-edit recommendation systems in a realistic, step-by-step manner. The framework:
- Recovers the natural editing order from commits by analyzing code dependencies and structural relationships
- Reconstructs the developer's context at each editing step, replaying the codebase state as edits are applied one by one
- Evaluates next-edit recommendation systems by simulating the real development process and measuring whether predicted edits match what the developer actually did next
Traditional evaluation approaches test code generation in isolation. We test whether a system can predict what a developer will edit next given the realistic, incremental context they actually had during development. This tests both:
- What to edit (location accuracy)
- When to edit it (respecting the natural flow and dependencies)
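The order-recovery idea can be sketched in a few lines. This is illustrative only (Historian's actual dependency analysis inspects code structure, not hand-written edges); the edit names and dependency edges below are hypothetical:

```python
# Illustrative sketch: recovering one valid editing order from a partial
# order expressed as "edit B requires edit A to have been applied first".
from graphlib import TopologicalSorter

# Hypothetical dependencies: the signature change must precede its call-site updates.
deps = {
    "update call site in cli.py": {"change signature in core.py"},
    "update call site in web.py": {"change signature in core.py"},
    "add regression test": {"update call site in cli.py"},
}
order = list(TopologicalSorter(deps).static_order())
print(order[0])  # "change signature in core.py" always comes first
```

Any topological order of the dependency graph is a plausible editing sequence; the framework replays one such sequence step by step.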
┌────────────────────────────────────────────────────────────────┐
│ Phase 1: Initialization │
├────────────────────────────────────────────────────────────────┤
│ 1. Extract edits from given commit │
│ 2. Build partial order between edits │
│ 3. Initialize SUT: SUT.setup() (models, LSP servers, etc.) │
│ 4. Select & apply initial edit → Establish baseline state │
└────────────────────────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────────────────────────┐
│ Phase 2: Iterative Recommendation (Loop until all edits done) │
├────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Step A: Request full recommendation: │ │
│ │ SUT.subsequent_edit_recommendation(context) │ │
│ │ → locations + contents │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Step B: Evaluate against ground truth │ │
│ │ Match criteria: 50% line overlap + BLEU-4 > 50 │ │
│ └─────────────────────────────────────────────────────────┘ │
│ ↓ │
│ ┌───────────────┴───────────────┐ │
│ ✅ Match Found ❌ No Match Found │
│ │ │ │
│ ↓ ↓ │
│ ┌──────────────────────┐ ┌───────────────────────────────┐ │
│ │ Select matched edit │ │ Step C: Fallback mode │ │
│ │ Record: precision │ │ • Pick a GT location │ │
│ │ recall │ │ • Request content for GT loc │ │
│ │ F1-score │ │ SUT.generate_edit_solution()│ │
│ │ ... │ │ → contents │ │
│ └──────────────────────┘ └───────────────────────────────┘ │
│ └────────────┬───────────────────┘ │
│ ↓ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Step D: Apply selected edit to codebase │ │
│ │ Update project state │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │ More edits? │
│ ├───────────┐ │
│ No │ │ Yes → Loop back │
└─────────────────────────┴───────────┴──→───────────────────────┘
↓
┌────────────────────────────────────────────────────────────────┐
│ Phase 3: Reporting │
├────────────────────────────────────────────────────────────────┤
│ • Aggregate metrics (precision, recall, BLEU, latency) │
│ • Flow pattern statistics (keeping/jumping/breaking/reverting) │
│ • Save to JSON: {project}-{sha}-{method}-results.json │
└────────────────────────────────────────────────────────────────┘
Key Features:
- Dual-mode evaluation: Full recommendation (location + content) with fallback to content-only
- Realistic context: Each prediction uses the exact codebase state at that editing step
- Flow-aware: Tracks whether recommendations respect edit dependencies
A predicted edit is considered to match a ground truth edit if:
- Location overlap: at least 50% line overlap with the ground truth edit
- Content similarity: the BLEU-4 score between the predicted and ground truth edit contents is greater than 50
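The two criteria can be sketched as a simple predicate. This is a minimal illustration, not the framework's exact code: the `bleu4` helper below is a plain whitespace-tokenized sentence-level BLEU-4 with brevity penalty, which may differ in detail from the scorer in `simulation/utils.py`:

```python
# Sketch of the matching criteria: >=50% line overlap plus BLEU-4 > 50.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate: str, reference: str) -> float:
    """Whitespace-tokenized sentence-level BLEU-4 on a 0-100 scale."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    log_prec = 0.0
    for n in range(1, 5):
        c_ngrams, r_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_ngrams & r_ngrams).values())  # clipped n-gram matches
        if overlap == 0:
            return 0.0  # a zero n-gram precision zeroes the geometric mean
        log_prec += math.log(overlap / max(sum(c_ngrams.values()), 1)) / 4
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return 100.0 * bp * math.exp(log_prec)

def line_overlap(pred_lines: range, gt_lines: range) -> float:
    """Fraction of the ground-truth line range covered by the prediction."""
    return len(set(pred_lines) & set(gt_lines)) / max(len(gt_lines), 1)

def is_match(pred_lines, gt_lines, pred_content, gt_content) -> bool:
    return (line_overlap(pred_lines, gt_lines) >= 0.5
            and bleu4(pred_content, gt_content) > 50)
```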
Based on this, we compute the following metrics:
Flow patterns categorize predictions based on dependency ordering:
- flow_keeping ✅: correct prediction that is one hop from the applied edits
- flow_jumping ⚠️: correct prediction that is multiple hops from the applied edits
- flow_breaking ❌: wrong prediction that matches no ground truth edit
- flow_reverting 🔄: prediction that undoes a previously applied edit
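The one-hop versus multi-hop distinction can be sketched as a breadth-first search over the edit dependency graph. This is an illustration under assumed data structures (a `prereqs` map from each edit to its prerequisite edits; all names are hypothetical), not Historian's actual classifier:

```python
# Classify a correct prediction as flow_keeping (distance 1) or
# flow_jumping (distance > 1) from the already-applied edits.
def hop_distance(edit, applied, prereqs):
    """BFS along prerequisite edges from `edit` back to any applied edit."""
    frontier, dist, seen = {edit}, 0, set()
    while frontier:
        dist += 1
        nxt = set()
        for e in frontier:
            for p in prereqs.get(e, ()):
                if p in applied:
                    return dist  # reached an applied edit in `dist` hops
                if p not in seen:
                    seen.add(p)
                    nxt.add(p)
        frontier = nxt
    return None  # not reachable from the applied edits

prereqs = {"edit_c": {"edit_b"}, "edit_b": {"edit_a"}}
applied = {"edit_a"}
print(hop_distance("edit_b", applied, prereqs))  # 1 → flow_keeping
print(hop_distance("edit_c", applied, prereqs))  # 2 → flow_jumping
```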
precision@all = flow_keeping / total_predictions
recall@all = flow_keeping / allowed_ground_truth_edits
f1_score@all = 2 * precision * recall / (precision + recall)
tp@k = number of correct predictions in top-k
- BLEU-4: Measures similarity between predicted and ground truth code
- Latency: Time taken for a single subsequent_edit_recommendation request (seconds)
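As a worked example of the formulas above, consider a run with 8 predictions, of which 5 were flow_keeping, against 10 allowed ground truth edits (numbers are made up for illustration):

```python
# Plugging example counts into the metric definitions above.
def f1(precision: float, recall: float) -> float:
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

flow_keeping, total_predictions, allowed_gt = 5, 8, 10
precision_all = flow_keeping / total_predictions  # 0.625
recall_all = flow_keeping / allowed_gt            # 0.5
print(round(f1(precision_all, recall_all), 3))    # 0.556
```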
┌─────────────────────────────────────────────────────────────┐
│ simulation/ Evaluation Framework │
│ ├─ main.py Simulation orchestrator │
│ ├─ commit.py Commit & edit state manager │
│ ├─ utils.py Git parsing, BLEU scoring │
│ ├─ edit_dependency.py Dependency analysis │
│ ├─ partial_order.py Edit ordering recovery │
│ └─ testset.json Test commits dataset │
└─────────────────────────────────────────────────────────────┘
↓ provides API
┌─────────────────────────────────────────────────────────────┐
│ systemUnderTest/ Your Method Implementation │
│ ├─ TRACE/ Reference implementation │
│ │ ├─ main.py Entry point (required API) │
│ │ ├─ TRACE.py Logic-based recommendation │
│ │ ├─ Invoker.py Strategy selector │
│ │ ├─ Locator.py Location prediction model │
│ │ └─ Generator.py Content generation model │
│ │ │
│ └─ YourMethod/ 👈 Implement your method │
│ └─ main.py Must implement required API │
└─────────────────────────────────────────────────────────────┘
↓ uses
┌─────────────────────────────────────────────────────────────┐
│ libs/ Tool Libraries │
│ ├─ LSPs/ Language Server Protocol │
│ │ ├─ language_server.py Base LSP client │
│ │ ├─ py_lsp.py Python (Pyright) │
│ │ ├─ java_lsp.py Java │
│ │ └─ go_lsp.py Go │
│ │ │
│ └─ tree-sitter/ Code parsing & AST analysis │
└─────────────────────────────────────────────────────────────┘
- Create the conda environment and install dependencies:

  ```shell
  # Create conda environment
  conda create -n historian python=3.12
  conda activate historian

  # Install dependencies
  pip install -r requirements.txt
  ```

- Install language servers:

  ```shell
  # Pyright is installed via `requirements.txt`
  # jdt.ls is provided in `libs/LSPs/jdt-language-server/`

  # TypeScript/JavaScript
  npm install -g typescript-language-server typescript

  # Go
  go install golang.org/x/tools/gopls@latest
  ```
- Download TRACE model checkpoints:

  ```shell
  cd systemUnderTest/TRACE
  bash download_models.sh
  cd ../../
  ```
- Setup configurations in `.env`:

  ```shell
  SUT=TRACE       # system under test
  EVAL_SET=       # default is `simulation/testset.json`
  REPOS_DIR=      # path to clone all simulated repositories
  OUTPUT_DIR=     # path to save simulation results
  FLOW_ANALYSIS=  # whether to enable flow pattern analysis (true/false)

  # TRACE specific configurations
  INVOKER_MODEL_PATH=systemUnderTest/TRACE/models/invoker_model/pytorch_model.bin      # path to TRACE invoker model checkpoint
  LOCATOR_MODEL_PATH=systemUnderTest/TRACE/models/locator_model/pytorch_model.bin      # path to TRACE locator model checkpoint
  GENERATOR_MODEL_PATH=systemUnderTest/TRACE/models/generator_model/pytorch_model.bin  # path to TRACE generator model checkpoint
  DEVICE=         # device for model inference (e.g., cpu, cuda:0)

  # If you need to evaluate flow patterns (FLOW_ANALYSIS=true):
  OPENAI_API_KEY=   # your OpenAI API key
  OPENAI_BASE_URL=  # your OpenAI base URL (if any)
  ```
- Run the simulation:

  ```shell
  # Evaluate TRACE method on Python commits
  python -m simulation.main
  # Results will be saved to:
  # output/{project}-{commit_sha}-{method}-simulation-results.json
  ```
- Install the Claude Code SDK:

  ```shell
  pip install claude-code-sdk
  npm install -g @anthropic-ai/claude-code
  ```
- Setup configurations in `.env`:

  ```shell
  SUT=Claude      # system under test
  EVAL_SET=       # default is `simulation/testset.json`
  REPOS_DIR=      # path to clone all simulated repositories
  OUTPUT_DIR=     # path to save simulation results
  LOG_DIR=        # path to save Claude chat logs
  FLOW_ANALYSIS=  # whether to enable flow pattern analysis (true/false)

  # Claude specific configurations
  ANTHROPIC_API_KEY=   # your Anthropic API key
  ANTHROPIC_BASE_URL=  # your Anthropic base URL (if using a proxy, otherwise leave empty)

  # If you need to evaluate flow patterns (FLOW_ANALYSIS=true):
  OPENAI_API_KEY=   # your OpenAI API key
  OPENAI_BASE_URL=  # your OpenAI base URL (if any)
  ```
- Run the simulation:

  ```shell
  python -m simulation.main
  # Results will be saved to:
  # output/{project}-{commit_sha}-Claude-simulation-results.json
  # Chat logs will be saved to:
  # {LOG_DIR}/Claude/{project}-{commit_sha}/chat_{step}.json
  ```
- Implement the required API in `systemUnderTest/YourMethod/main.py`:

  ```python
  def setup(json_input):
      # Initialize your method (load models, start LSP, etc.)
      ...

  def subsequent_edit_recommendation(json_input):
      # Given the current project state, return predicted edits in snapshots format
      ...

  def generate_edit_solution(json_input):
      # Generate an edit solution for a given location
      ...

  def end(json_input):
      # Clean up resources (close LSP, free memory, etc.)
      ...
  ```
- Setup configurations in `.env`:

  ```shell
  SUT=YourMethod  # system under test
  EVAL_SET=       # default is `simulation/testset.json`
  REPOS_DIR=      # path to clone all simulated repositories
  OUTPUT_DIR=     # path to save simulation results
  FLOW_ANALYSIS=  # whether to enable flow pattern analysis (true/false)

  # YourMethod specific configurations
  XXX=XXX         # add your method specific configs here

  # If you need to evaluate flow patterns:
  OPENAI_API_KEY=   # your OpenAI API key
  OPENAI_BASE_URL=  # your OpenAI base URL (if any)
  ```
- Run the simulation:

  ```shell
  # Evaluate your method
  python -m simulation.main
  # Results will be saved to:
  # output/{project}-{commit_sha}-{method}-simulation-results.json
  ```
We welcome implementations of new edit recommendation methods!
- Create `systemUnderTest/YourMethod/main.py`
- Implement the required API (see "Implement Your Own Method")
- Add test commits to `simulation/testset.json`
- Run the evaluation and submit results
- Better dependency analysis algorithms
- Additional evaluation metrics
- Support for more languages
- Performance optimizations
- Email: chenyan@u.nus.edu
Happy Evaluating! 🚀