EurecaAI/ModelScope-CDM-IRT

SGCA Capability Project

This project reorganizes the uploaded scripts into a runnable Python package for SAE-ARD-NMF multi-hot capability grouping.

What is included

  • sgca_capability/features.py
    Streams SGCA sae_vector_chunks, selects SAE features per view-layer block, builds the selected-feature matrix, and constructs the sparse IDF-weighted NMF input.

  • sgca_capability/nmf.py
    ARD-style Euclidean NMF implementation.

  • sgca_capability/grouping.py
    Soft-score normalization, multi-hot membership construction, weak/redundant group pruning, and coverage repair.

  • sgca_capability/outputs.py
    Q-matrix, group summary, group feature, group file, taxonomy, and report writers.

  • sgca_capability/pipeline.py
    Unified CLI for full build and rerun-from-cached-matrix mode.

  • sgca_capability/token_sae_redundancy.py
    Package-native token-SAE redundancy analysis, including dataset readers, GPU extraction, checkpointing, sharded merge, similarity tables, PCA plots, and token-SAE cluster diagnostics.

  • original_scripts/
    The original uploaded scripts are preserved for reference only, not as runtime entrypoints.
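
As a rough illustration of what "sparse IDF-weighted NMF input" can mean, the sketch below keeps the top-k activations per row, filters columns by document frequency, and scales the survivors by IDF. The function name, the smoothed IDF formula, and the order of the steps are assumptions made for illustration; the code in sgca_capability/features.py may differ. The parameter names mirror the CLI flags --sample-topk-features, --df-min-count, and --df-max-frac.

```python
import numpy as np

def build_idf_weighted_input(acts, topk=256, df_min_count=5, df_max_frac=0.40):
    """Hypothetical helper, not the package API."""
    acts = np.maximum(np.asarray(acts, dtype=np.float64), 0.0)  # NMF needs non-negative input
    n_rows, n_cols = acts.shape
    k = max(1, min(topk, n_cols))

    # Keep only the k largest activations in each row and zero the rest.
    sparse = np.zeros_like(acts)
    top_idx = np.argpartition(acts, n_cols - k, axis=1)[:, n_cols - k:]
    rows = np.arange(n_rows)[:, None]
    sparse[rows, top_idx] = acts[rows, top_idx]

    # Document frequency: number of rows in which a column is active.
    df = (sparse > 0).sum(axis=0)
    keep = (df >= df_min_count) & (df <= df_max_frac * n_rows)

    # Smoothed IDF (assumed formula; the package may use another variant).
    idf = np.log((1.0 + n_rows) / (1.0 + df[keep])) + 1.0
    return sparse[:, keep] * idf, np.flatnonzero(keep), idf
```

The returned column indices make it possible to map NMF components back to the original SAE features.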

original_scripts integration map

Use project code for execution:

original_scripts/sgca_sae_ard_nmf_multihot_original.py
  -> sgca_capability.pipeline
  -> scripts/sgca_sae_ard_nmf_multihot.py

original_scripts/rerun_cached_matrix_with_coverage_repair_original.py
  -> sgca_capability.pipeline --mode rerun-cached
  -> sgca_capability.rerun_cached_matrix_with_coverage_repair
  -> scripts/rerun_cached_ard_nmf.py

original_scripts/token_sae_redundancy_original.py
  -> sgca_capability.token_sae_redundancy
  -> scripts/token_sae_redundancy.py

Install

cd sgca_capability_project
python -m pip install -e .

Or, without an editable install:

cd sgca_capability_project
python -m pip install -r requirements.txt
export PYTHONPATH=$PWD

Expected input directory

The full pipeline expects:

<output_dir>/run_config.json
<output_dir>/sae_vector_chunks/chunk_*.npz

run_config.json should include:

{
  "effective_views": ["sample"],
  "effective_layers": [0, 1]
}

Each chunk should include:

records                         object array of JSON strings
act__{view}__layer{layer}        float array [rows, sae_width]
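
A minimal sketch of a script that writes an input directory in this layout, which can help when checking the format by hand. The chunk file name chunk_00000.npz, the run_row_id field inside each record, and the default shapes are illustrative assumptions; only the key names records and act__{view}__layer{layer} and the two run_config.json fields come from the description above.

```python
import json
from pathlib import Path

import numpy as np

def write_toy_sgca_output(output_dir, views=("sample",), layers=(0, 1),
                          rows=8, sae_width=16, seed=0):
    """Write run_config.json and one toy chunk under output_dir."""
    output_dir = Path(output_dir)
    chunk_dir = output_dir / "sae_vector_chunks"
    chunk_dir.mkdir(parents=True, exist_ok=True)

    config = {"effective_views": list(views), "effective_layers": list(layers)}
    (output_dir / "run_config.json").write_text(json.dumps(config, indent=2))

    rng = np.random.default_rng(seed)
    # Records are JSON strings in an object array; the field name is hypothetical.
    records = np.array([json.dumps({"run_row_id": i}) for i in range(rows)],
                       dtype=object)
    arrays = {"records": records}
    for view in views:
        for layer in layers:
            # One float array of shape [rows, sae_width] per view-layer block.
            arrays[f"act__{view}__layer{layer}"] = rng.random(
                (rows, sae_width), dtype=np.float32)

    path = chunk_dir / "chunk_00000.npz"
    np.savez_compressed(path, **arrays)
    return path
```

Because records is an object array, reading the chunk back requires np.load(path, allow_pickle=True).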

Full pipeline

python -m sgca_capability.pipeline \
  --output-dir /path/to/sgca_output \
  --mode full \
  --features-per-block 512 \
  --feature-selection variance \
  --n-groups-max 256 \
  --sample-topk-features 256 \
  --df-min-count 5 \
  --df-max-frac 0.40 \
  --ard-max-iter 500 \
  --ard-warmup-iter 60 \
  --ard-lambda 0.08 \
  --membership-quantile 0.90 \
  --min-group-size 10

Outputs are written to:

<output_dir>/ard_nmf_sae_multihot_groups/
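
To give a feel for what the --ard-* flags control, here is a small sketch of an ARD-flavoured Euclidean NMF. It is not the algorithm in sgca_capability/nmf.py: the relevance measure, the penalty form, and the reading of the warmup phase as "iterations without shrinkage" are all assumptions made for illustration.

```python
import numpy as np

def ard_nmf_sketch(X, n_groups_max=8, max_iter=200, warmup_iter=20,
                   ard_lambda=0.08, eps=1e-9, seed=0):
    """Illustrative only; every modelling choice here is an assumption."""
    X = np.asarray(X, dtype=np.float64)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.random((n, n_groups_max)) + eps   # item-by-group loadings
    H = rng.random((n_groups_max, d)) + eps   # group-by-feature loadings
    penalty = np.zeros(n_groups_max)
    for it in range(max_iter):
        if it >= warmup_iter:
            # Assumed relevance: average loading mass of each component.
            relevance = W.mean(axis=0) + H.mean(axis=1)
            # Weak components receive a larger shrinkage penalty.
            penalty = ard_lambda / (relevance + eps)
        # Multiplicative updates for ||X - WH||^2 plus a per-component L1 penalty.
        W *= (X @ H.T) / (W @ (H @ H.T) + penalty + eps)
        H *= (W.T @ X) / ((W.T @ W) @ H + penalty[:, None] + eps)
    relevance = W.mean(axis=0) + H.mean(axis=1)
    return W, H, relevance
```

Components whose relevance collapses toward zero are the ones an ARD-style method would drop, which is why --n-groups-max acts as an upper bound rather than a fixed group count.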

Rerun from cached sparse matrix

Use this when ard_nmf_input_n*_d*_top*.npz and selected_column_df_idf.csv already exist and you only want to rerun ARD-NMF / coverage repair:

python -m sgca_capability.pipeline \
  --output-dir /path/to/sgca_output \
  --mode rerun-cached \
  --n-groups-max 256 \
  --sample-topk-features 256 \
  --ard-max-iter 500 \
  --ard-warmup-iter 60

or:

python scripts/rerun_cached_ard_nmf.py --output-dir /path/to/sgca_output

The cached rerun path also supports the original rerun script's explicit paths:

python scripts/rerun_cached_ard_nmf.py \
  --output-dir /path/to/sgca_output \
  --matrix-path /path/to/ard_nmf_input_n88003_d3072_top256.npz \
  --out-dir /path/to/sgca_output/ard_nmf_sae_multihot_groups \
  --selected-column-meta /path/to/selected_column_df_idf.csv

Smoke test

bash tests/test_smoke.sh

The smoke test creates a toy SGCA output directory, runs the full pipeline, then reruns from the cached sparse matrix.

Main outputs

q_matrix_soft_ard_nmf_sae.csv
q_matrix_hard_nmf_sae.csv
q_matrix_long_ard_nmf_sae.csv
ard_nmf_group_summary.csv
ard_nmf_group_features.csv
ard_nmf_component_diagnostics.csv
item_group_membership_summary.csv
coverage_repair_audit.csv
COVERAGE_REPAIR_REPORT.json
ard_nmf_matrices.npz
ARD_NMF_SAE_MULTIHOT_REPORT.md
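
The relationship between the soft and hard Q-matrices can be pictured with the sketch below: normalize soft scores, threshold each group at a quantile, prune small groups, then repair coverage so every item belongs to at least one group. The max-normalization, the per-column quantile rule, and the argmax repair are assumptions; sgca_capability/grouping.py may use different rules, and redundant-group pruning is left out here.

```python
import numpy as np

def soft_to_multihot(scores, membership_quantile=0.90, min_group_size=10):
    """Hypothetical helper mirroring --membership-quantile and --min-group-size."""
    scores = np.asarray(scores, dtype=np.float64)

    # Soft-score normalization: scale each group column to [0, 1] by its maximum.
    col_max = scores.max(axis=0, keepdims=True)
    soft = np.divide(scores, col_max, out=np.zeros_like(scores), where=col_max > 0)

    # Multi-hot membership: an item joins a group when its score reaches the
    # column's quantile threshold (and is non-zero).
    thresholds = np.quantile(soft, membership_quantile, axis=0, keepdims=True)
    hard = (soft >= thresholds) & (soft > 0)

    # Prune weak groups that have too few members.
    keep = hard.sum(axis=0) >= min_group_size
    soft, hard = soft[:, keep], hard[:, keep]

    # Coverage repair: items left without any group get their best surviving group.
    uncovered = np.flatnonzero(~hard.any(axis=1))
    if hard.shape[1] > 0:
        hard[uncovered, soft[uncovered].argmax(axis=1)] = True
    return soft, hard.astype(np.int8), np.flatnonzero(keep), uncovered
```

The list of repaired row indices is the kind of information an audit file such as coverage_repair_audit.csv could record.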

Notes

  • The packaged pipeline focuses on SAE-ARD-NMF capability grouping and cached rerun repair.
  • Token-SAE redundancy analysis now lives in sgca_capability.token_sae_redundancy. It still needs the heavier optional dependencies (torch, transformers, av, etc.), so run it in the qwenscope environment.

Token-SAE redundancy

CUDA_VISIBLE_DEVICES=1 HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 TOKENIZERS_PARALLELISM=false \
/home/jiangbaoyang/anaconda3/envs/qwenscope/bin/python scripts/token_sae_redundancy.py \
  --output-dir /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/redundancy_results/token_sae_eval_clean \
  --model /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/Qwen3.5-2B-Base \
  --sae-path /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/SAE-Res-Qwen3.5-2B-Base-W32K-L0_100 \
  --layers 10 \
  --overwrite

Generate the Q-matrix from raw HF data

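The commands below reprocess the data under /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_data from scratch and do not reuse old QwenScope outputs. By default, they exclude Cheers-Training-Data, use the qwenscope conda environment, and force GPU use via --require-cuda.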

First, change to the project directory:

cd /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT

Single-GPU end-to-end version:

CUDA_VISIBLE_DEVICES=1 HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 TOKENIZERS_PARALLELISM=false \
/home/jiangbaoyang/anaconda3/envs/qwenscope/bin/python scripts/build_hf_qmatrix.py \
  --mode all \
  --hf-data-root /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_data \
  --exclude-dataset Cheers-Training-Data \
  --output-dir /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new \
  --model /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/Qwen3.5-2B-Base \
  --sae-path /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/SAE-Res-Qwen3.5-2B-Base-W32K-L0_100 \
  --layers 0,4,8,12,16,20 \
  --media-mode metadata \
  --require-cuda

Recommended multi-GPU sharded version:

cd /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT

BASE=/home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_sharded
mkdir -p "$BASE/logs"

for i in 0 1 2 3 4 5 6; do
  gpu=$((i+1))
  tmux new-session -d -s mscdm_qm_s${i} -c /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT \
    "env CUDA_VISIBLE_DEVICES=${gpu} HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 TOKENIZERS_PARALLELISM=false OMP_NUM_THREADS=4 MKL_NUM_THREADS=4 \
    /home/jiangbaoyang/anaconda3/envs/qwenscope/bin/python scripts/build_hf_qmatrix.py \
      --mode extract \
      --hf-data-root /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_data \
      --exclude-dataset Cheers-Training-Data \
      --output-dir ${BASE}/shard_${i} \
      --model /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/Qwen3.5-2B-Base \
      --sae-path /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/SAE-Res-Qwen3.5-2B-Base-W32K-L0_100 \
      --layers 0,4,8,12,16,20 \
      --media-mode metadata \
      --record-chunk-size 512 \
      --checkpoint-every 200 \
      --sample-shard-index ${i} \
      --sample-shard-count 7 \
      --require-cuda \
      > ${BASE}/logs/shard_${i}.log 2>&1"
done
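
The loop above gives each of the seven processes a distinct --sample-shard-index out of --sample-shard-count 7. How build_hf_qmatrix.py maps samples to shards is not documented here; the sketch below shows one common scheme (round-robin by row index) purely as an assumption, to illustrate why the shards are disjoint and together cover every sample.

```python
def rows_for_shard(n_rows, shard_index, shard_count):
    """Hypothetical round-robin split: row r goes to shard r % shard_count."""
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    return list(range(shard_index, n_rows, shard_count))

# Every row lands in exactly one of the seven shards.
shards = [rows_for_shard(20, i, 7) for i in range(7)]
assert sorted(r for shard in shards for r in shard) == list(range(20))
```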

Then start the automatic merge, clustering, and Q-matrix generation:

tmux new-session -d -s mscdm_qm_finalize -c /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT \
  "/usr/bin/bash scripts/finalize_hf_qmatrix_run.sh \
  /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_sharded \
  /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_merged"

Check progress:

nvidia-smi
tmux ls
tail -f /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_sharded/logs/shard_0.log
tail -f /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_sharded/logs/finalize.log

Final Q-matrix output directory:

/home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_merged/ard_nmf_sae_multihot_groups/

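This package-native runner rewrites sae_vector_chunks/qwen_scope_multiclass_clusters.csv as well as ard_nmf_sae_multihot_groups/q_matrix_*. By default it reads Parquet/JSON/JSONL first and skips CSV/TSV export files, which tend to duplicate records; CSV/TSV files are read only when --include-csv-tsv is passed explicitly. original_scripts/ serves only as a reference, not as a runtime entrypoint.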

Psychometric post-analysis: IRT and CDM

After q_matrix_hard_nmf_sae.csv is generated, you can analyze item quality with model/human response data.

Expected response CSV format by default:

respondent_id,item_id,score
model_a,0,1
model_a,1,0
model_b,0,1

  • respondent_id: model, run, annotator, or examinee id.
  • item_id: should match run_row_id in the Q-matrix by default. Use --q-item-col and --response-item-col if you use sample_id or another id.
  • score: binary correctness. Non-binary values are binarized by --binarize-threshold.
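
A minimal sketch of reading this long-format CSV into a respondent-by-item structure with binarized scores. The function name, the default threshold of 0.5, and the ">=" comparison are assumptions; the actual behaviour of --binarize-threshold may differ.

```python
import csv
import io

def load_response_matrix(csv_text, binarize_threshold=0.5):
    """Hypothetical reader: returns {respondent_id: {item_id: 0 or 1}}."""
    matrix = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Scores at or above the threshold count as correct (assumed rule).
        correct = int(float(row["score"]) >= binarize_threshold)
        matrix.setdefault(row["respondent_id"], {})[row["item_id"]] = correct
    return matrix

example = "respondent_id,item_id,score\nmodel_a,0,1\nmodel_a,1,0\nmodel_b,0,1\n"
responses = load_response_matrix(example)
```

Note that model_b has no response for item 1, so downstream analysis has to treat missing cells explicitly rather than as incorrect answers.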

Run both IRT and CDM:

python -m sgca_capability.psychometrics \
  --q-matrix /path/to/sgca_output/ard_nmf_sae_multihot_groups/q_matrix_hard_nmf_sae.csv \
  --responses /path/to/responses.csv \
  --out-dir /path/to/sgca_output/psychometrics \
  --analysis all \
  --irt-model 2pl \
  --cdm-model dina \
  --cdm-max-attributes 12

IRT outputs:

psychometrics/irt/irt_item_parameters.csv
psychometrics/irt/irt_respondent_abilities.csv
psychometrics/irt/irt_summary.json
psychometrics/irt/IRT_REPORT.md
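
For reading the item parameters, it helps to recall the standard 2PL item response function. The sketch below uses the common discrimination/difficulty form; whether sgca_capability.psychometrics uses this parametrization or a slope-intercept variant is an assumption, and the argument names are illustrative.

```python
import math

def irt_2pl_probability(theta, discrimination, difficulty):
    """P(correct) under 2PL: logistic in discrimination * (theta - difficulty)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# A respondent whose ability equals the item difficulty has a 50% success chance.
p_equal = irt_2pl_probability(theta=0.0, discrimination=1.5, difficulty=0.0)
```

Higher discrimination makes the curve steeper around the difficulty point, so such items separate respondents near that ability level more sharply.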

CDM outputs:

psychometrics/cdm/cdm_item_parameters.csv
psychometrics/cdm/cdm_respondent_mastery_probabilities.csv
psychometrics/cdm/cdm_attribute_summary.csv
psychometrics/cdm/cdm_latent_classes.csv
psychometrics/cdm/cdm_summary.json
psychometrics/cdm/CDM_REPORT.md

Important: CDM enumerates latent mastery states, and with K binary attributes there are 2^K of them (4096 at K = 12), so the number of attributes must be small. Use --cdm-max-attributes or pass a reduced set of group columns. IRT can handle more items directly; CDM should usually be run on a selected subset of named/validated ability groups.
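
The growth in latent states is easy to see in a small sketch of the standard DINA model: every binary mastery pattern is enumerated, and an item is answered correctly with probability 1 - slip when all attributes its Q-matrix row requires are mastered, otherwise with probability guess. The function name and argument names are illustrative, and this is the textbook formulation rather than code taken from the package.

```python
from itertools import product

def dina_correct_probabilities(q_row, slip, guess):
    """Map each mastery pattern to P(correct) for one item under DINA."""
    probs = {}
    for alpha in product((0, 1), repeat=len(q_row)):
        # The item is "mastered" only if every required attribute is mastered.
        mastered_all = all(a >= q for a, q in zip(alpha, q_row))
        probs[alpha] = (1.0 - slip) if mastered_all else guess
    return probs

# An item that requires attributes 0 and 2 out of three attributes.
probs = dina_correct_probabilities([1, 0, 1], slip=0.1, guess=0.2)
```

With three attributes there are 8 patterns; each additional attribute doubles the count, which is why a cap on attributes is needed.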

Toy psychometric smoke test

bash tests/test_smoke.sh
python examples/make_toy_responses.py \
  --q-matrix /tmp/sgca_capability_smoke/sgca_output/ard_nmf_sae_multihot_groups/q_matrix_hard_nmf_sae.csv \
  --out /tmp/sgca_capability_smoke/responses.csv
python -m sgca_capability.psychometrics \
  --q-matrix /tmp/sgca_capability_smoke/sgca_output/ard_nmf_sae_multihot_groups/q_matrix_hard_nmf_sae.csv \
  --responses /tmp/sgca_capability_smoke/responses.csv \
  --out-dir /tmp/sgca_capability_smoke/psychometrics \
  --analysis all \
  --cdm-max-attributes 5 \
  --irt-max-iter 300 \
  --cdm-max-iter 100
