Dynamic Measurement Model

A research project focused on measuring unobserved quantities that change over time, such as latent ability during learning. The project combines probabilistic latent dynamic models (IRT, CIRT, BKT, DKT) with LLM-based prediction to model student learning trajectories from programming submissions. All models share a common evaluation framework, enabling direct comparison on identical data splits.

Related paper: https://aclanthology.org/2025.aimecon-main.43.pdf

Dataset

We use CodeInsight data from first-year university students learning C++. Each submission includes both the pass/fail outcome and the raw code. Data source: CodeInsightTeam/code_insights_csv. The data covers 4 courses (DSA-HK231, DSA-HK221, PF-HK232, PF-HK222), 3,286 students, 396 problems, and 3M+ submissions.

student_id      : str    -- learner identifier
item_id         : str    -- "question_id_testcase_index" (atomic binary item)
attempt_index   : int    -- 0-indexed attempt number per (student, item)
correctness     : int    -- 0 or 1
timestamp_days  : float  -- days since course start
response        : str    -- raw C++ code
course_id       : str    -- course identifier
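The schema above can be mirrored as a small typed record. The sketch below is illustrative only: the `Submission` class and the `split_item_id` helper are hypothetical names, not part of the repository, and the helper assumes the test-case index is the final underscore-separated token of `item_id`. The sample row is made up.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Submission:
    """One dataset row, mirroring the fields listed above (hypothetical class)."""
    student_id: str        # learner identifier
    item_id: str           # "question_id_testcase_index"
    attempt_index: int     # 0-indexed attempt number per (student, item)
    correctness: int       # 0 or 1
    timestamp_days: float  # days since course start
    response: str          # raw C++ code
    course_id: str         # course identifier


def split_item_id(item_id: str) -> Tuple[str, int]:
    """Split an item id into (question_id, testcase_index).

    Assumption: the test-case index is the last underscore-separated token.
    """
    question_id, _, testcase = item_id.rpartition("_")
    if not question_id or not testcase.isdigit():
        raise ValueError(f"unexpected item_id format: {item_id!r}")
    return question_id, int(testcase)


# Made-up sample row, for illustration only.
row = Submission("s001", "q42_3", 0, 1, 12.5, "int main() { return 0; }", "dsa_hk231")
print(split_item_id(row.item_id))
```

Splitting from the right keeps question ids that themselves contain underscores intact.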

Getting Started

pip install -r requirements.txt
export HF_TOKEN="..."

Project Structure

├── data_collection/           # Web scraping + data preprocessing
│   ├── collect_data.py        # Selenium-based scraping from online judge
│   ├── csv2matrices.py        # CSV -> 3D tensor matrices
│   └── skill_tagging/         # LLM-based skill labeling for questions
│
├── data_analysis/             # Analysis scripts + paper figure generation
│   ├── plot_filtered_accuracy.py          # Per-attempt accuracy
│   ├── kendall_tau_decomposition.py       # Kendall tau decomposition
│   ├── llm_behavioral_comparison.py       # Behavioral comparison
│   ├── llm_bug_comparison.py              # Error type flow
│   ├── visualize_response_matrix.py       # Attempt progression heatmaps
│   ├── learning_curves_by_year.py         # Learning curves
│   ├── dataset_summary_table.py           # Dataset summary table
│   ├── student_behavior_clustering.py     # Student behavior clustering
│   ├── problem_by_problem_analysis.py     # Problem-level patterns
│   └── pace_analysis.py                   # Submission pacing
│
├── dynamic_models/            # Probabilistic latent dynamic models
│   └── temporal_eval/         # Evaluation framework
│       ├── run_student_eval.py    # Entry point: 
│       ├── harness.py             # Orchestrates model fitting + prediction
│       ├── student_split.py       # Data split logic
│       ├── data_filter.py         # Quality filtering (coverage, pass rates)
│       ├── data_loader.py         # Unified data loading + caching
│       ├── metrics.py             # AUC, accuracy, F1, log-likelihood
│       ├── plot_results.py        # Loss curves, diagnostics
│       └── adapters/              # Per-model adapters (IRT, CIRT, BKT, DKT, ...)
│
├── llm_simulator/             # LLM student simulation (grounded evaluation)
│   ├── eval_student_split.py  # Entry point: orchestrates the full pipeline
│   ├── run.py                 # Core attempt loop: prompt → LLM → grade → repeat
│   ├── persona.py             # Builds behavioral profiles from weeks 1-3
│   ├── rag.py                 # TF-IDF retrieval of similar prior submissions
│   ├── summarize.py           # Compresses submission history via Haiku
│   ├── prompts.py             # Prompt construction and response parsing
│   ├── runners.py             # LLM API wrappers (Claude, GPT, Gemini, Mistral, vLLM)
│   └── data_loader.py         # Data structures, student-split loading, item difficulty
│
├── scripts/
│   └── build_figures.sh       # Generate all paper figures → overleaf/figures/
│
└── overleaf/                  # LaTeX paper (EMNLP 2026, ACL format)

Student-Split Evaluation

All models are evaluated using a shared student-split protocol:

1. Quality filtering: Remove students and items that fall below coverage/pass-rate thresholds
2. Student split: Randomly partition students into 70% train / 30% test (seed-controlled)
3. Train phase: Estimate item difficulty parameters from the training students' responses across all weeks
4. Calibration: Estimate each test student's ability from their responses in weeks 1-3
5. Prediction: Predict test students' performance in weeks 4-6

This shared split ensures all models (knowledge-tracing and LLM) are evaluated on identical student-item pairs.
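The protocol above can be sketched in a few lines. This is a minimal illustration, not the code in `student_split.py`: the function names are hypothetical, weeks are assumed to be 7-day windows counted from `timestamp_days = 0`, and responses outside weeks 1-6 are assumed to be unused for test students.

```python
import random
from typing import Iterable, Set, Tuple


def split_students(student_ids: Iterable[str], train_frac: float = 0.7,
                   seed: int = 0) -> Tuple[Set[str], Set[str]]:
    """Randomly partition unique student ids into train and test sets."""
    unique = sorted(set(student_ids))  # sort first so the result depends only on the seed
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_train = round(train_frac * len(unique))
    return set(unique[:n_train]), set(unique[n_train:])


def week_of(timestamp_days: float) -> int:
    """Map days since course start to a 1-indexed week (assumed 7-day weeks)."""
    return int(timestamp_days // 7) + 1


def phase(student_id: str, timestamp_days: float, train: Set[str]) -> str:
    """Label a response as 'train', 'calibration', 'prediction', or 'unused'."""
    if student_id in train:
        return "train"  # train students contribute responses from all weeks
    week = week_of(timestamp_days)
    if 1 <= week <= 3:
        return "calibration"
    if 4 <= week <= 6:
        return "prediction"
    return "unused"


students = [f"s{i:03d}" for i in range(10)]
train, test = split_students(students, seed=42)
print(len(train), len(test))  # 7 3
```

Because the partition is made over students rather than over rows, every model that reuses the same seed sees the same held-out learners.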

# Run all models (IRT, CIRT, BKT, DKT) on all courses
python -m dynamic_models.temporal_eval.run_student_eval

# Specific models on a specific course
python -m dynamic_models.temporal_eval.run_student_eval --models IRT BKT DKT --courses dsa_hk231

Results are saved to results/student_eval/{course}/student_eval.csv.

LLM Simulator

Evaluates whether LLMs can predict real student behavior on programming problems. Uses a grounded evaluation approach: the LLM follows each student's real attempt trajectory step-by-step and predicts the next submission, rather than generating code freely.
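The grounded setup can be pictured as a loop in which the predictor is always conditioned on the student's real history, never on its own earlier outputs. The sketch below is an assumption-laden toy, not the code in `run.py`: `grounded_eval`, `resubmit_last`, and `toy_grade` are hypothetical names, and the toy grader stands in for compiling and running unit tests.

```python
from typing import Callable, List, Tuple

Attempt = Tuple[str, int]  # (submitted code, correctness 0/1)


def grounded_eval(trajectory: List[Attempt],
                  predict_code: Callable[[List[Attempt]], str],
                  grade: Callable[[str], int]) -> List[Tuple[int, int]]:
    """Walk a real trajectory and collect (predicted, actual) correctness pairs.

    At step t the predictor sees only the real attempts 0..t-1.
    """
    pairs = []
    for t, (_real_code, actual) in enumerate(trajectory):
        history = trajectory[:t]           # real attempts so far
        predicted = predict_code(history)  # stand-in for the LLM call
        pairs.append((grade(predicted), actual))
    return pairs


def resubmit_last(history: List[Attempt]) -> str:
    """Toy predictor: resubmit the previous code, or nothing on the first attempt."""
    return history[-1][0] if history else ""


def toy_grade(code: str) -> int:
    """Toy grader: passes only if the code contains the marker 'fix'."""
    return int("fix" in code)


real = [("int x;", 0), ("int x; // fix", 1), ("int x; // fix", 1)]
pairs = grounded_eval(real, resubmit_last, toy_grade)
print(pairs)  # [(0, 0), (0, 1), (1, 1)]
```

Conditioning on the real history keeps one early misprediction from compounding across the rest of the trajectory, which is what separates this setup from free generation.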

The simulator uses the same student split as the knowledge-tracing models. For each test student on each target question (weeks 4-6):

1. Build a behavioral persona from weeks 1-3 (submission pacing, precheck usage, topic pass rates, code complexity)
2. Retrieve similar prior problem trajectories via TF-IDF (RAG)
3. Compress retrieved trajectories into behavioral summaries (via Claude Haiku)
4. At each attempt, the LLM sees the student's full trajectory so far and predicts the next submission
5. The predicted code is compiled and graded against unit tests
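The retrieval step in the list above can be illustrated with a small standard-library TF-IDF ranker. This is a sketch under stated assumptions, not the implementation in `rag.py`: the function names are hypothetical, the texts being compared are assumed to be short problem descriptions, and the IDF uses the smoothed form log((1 + N) / (1 + df)) + 1.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into identifier-like tokens."""
    return re.findall(r"[a-z_]\w*", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors with smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, corpus: List[str], k: int = 1) -> List[int]:
    """Return indices of the k corpus entries most similar to the query."""
    vecs = tfidf_vectors(corpus + [query])  # fit on corpus + query so IDF weights are shared
    q = vecs[-1]
    ranked = sorted(range(len(corpus)), key=lambda i: cosine(q, vecs[i]), reverse=True)
    return ranked[:k]


corpus = [
    "reverse a singly linked list",
    "factorial using recursion",
    "binary search tree insertion",
]
print(retrieve("linked list node deletion", corpus, k=1))  # [0]
```

Only the first corpus entry shares tokens with the query, so it ranks first; the other two score zero.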

Supported Models

Commercial: opus (Claude Opus 4.6), haiku (Claude Haiku), gpt (GPT-4.1-nano), gemini (Gemini 2.0 Flash), mistral (Mistral Large).
Open-source (via vLLM): llama (Llama-3.1-8B), gemma (Gemma-3-27B), qwen (Qwen2.5-14B), glm (GLM-4.7-AWQ).

# Full evaluation on a course
python -m llm_simulator.eval_student_split --course dsa_hk231 --models haiku

# Quick test with subset
python -m llm_simulator.eval_student_split --course dsa_hk231 --models haiku --max_students 5 --max_questions 3 --dry_run

# Multiple models
python -m llm_simulator.eval_student_split --course dsa_hk231 --models haiku opus gpt

Output: results/llm_student_eval/{course}/{model}_attempts{N}.jsonl

Figure Generation

All paper figures are generated via scripts/build_figures.sh, which runs individual analysis scripts and copies outputs to overleaf/figures/.

bash scripts/build_figures.sh

# Regenerate with fresh grading cache
bash scripts/build_figures.sh --regrade

Reproducibility

# 1. Run knowledge-tracing models
python -m dynamic_models.temporal_eval.run_student_eval

# 2. Run LLM simulator
python -m llm_simulator.eval_student_split --course dsa_hk231 --models haiku

# 3. Generate all paper figures
bash scripts/build_figures.sh
