This project reorganizes the uploaded scripts into a runnable Python package for SAE-ARD-NMF multi-hot capability grouping.
- sgca_capability/features.py: Stream SGCA sae_vector_chunks, select SAE features per view-layer block, build the selected matrix, and construct the sparse IDF-weighted NMF input.
- sgca_capability/nmf.py: ARD-style Euclidean NMF implementation.
- sgca_capability/grouping.py: Soft-score normalization, multi-hot membership construction, weak/redundant group pruning, and coverage repair.
- sgca_capability/outputs.py: Q-matrix, group summary, group feature, group file, taxonomy, and report writers.
- sgca_capability/pipeline.py: Unified CLI for full build and rerun-from-cached-matrix mode.
- sgca_capability/token_sae_redundancy.py: Package-native token-SAE redundancy analysis, including dataset readers, GPU extraction, checkpointing, sharded merge, similarity tables, PCA plots, and token-SAE cluster diagnostics.
- original_scripts/: The original uploaded scripts, preserved for reference only and not as runtime entrypoints.
Use the packaged code for execution. Each original script maps to these entrypoints:
original_scripts/sgca_sae_ard_nmf_multihot_original.py
-> sgca_capability.pipeline
-> scripts/sgca_sae_ard_nmf_multihot.py
original_scripts/rerun_cached_matrix_with_coverage_repair_original.py
-> sgca_capability.pipeline --mode rerun-cached
-> sgca_capability.rerun_cached_matrix_with_coverage_repair
-> scripts/rerun_cached_ard_nmf.py
original_scripts/token_sae_redundancy_original.py
-> sgca_capability.token_sae_redundancy
-> scripts/token_sae_redundancy.py
cd sgca_capability_project
python -m pip install -e .
Or, without an editable install:
cd sgca_capability_project
python -m pip install -r requirements.txt
export PYTHONPATH=$PWD
The full pipeline expects:
<output_dir>/run_config.json
<output_dir>/sae_vector_chunks/chunk_*.npz
run_config.json should include:
{
  "effective_views": ["sample"],
  "effective_layers": [0, 1]
}
Each chunk should include:
- records: object array of JSON strings
- act__{view}__layer{layer}: float array of shape [rows, sae_width]
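To make the expected layout concrete, here is a minimal sketch that writes a toy output directory in this shape. The helper name `write_toy_sgca_output`, the chunk file name `chunk_00000.npz`, and the `run_row_id` field inside each record are illustrative assumptions, not something this package is known to require.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def write_toy_sgca_output(root, views=("sample",), layers=(0, 1), rows=4, sae_width=8, seed=0):
    """Write run_config.json plus one chunk in the layout described above."""
    root = Path(root)
    chunk_dir = root / "sae_vector_chunks"
    chunk_dir.mkdir(parents=True, exist_ok=True)

    config = {"effective_views": list(views), "effective_layers": list(layers)}
    (root / "run_config.json").write_text(json.dumps(config, indent=2))

    rng = np.random.default_rng(seed)
    # `records` is an object array of JSON strings, one per row.
    # The "run_row_id" payload is a hypothetical example field.
    records = np.array([json.dumps({"run_row_id": i}) for i in range(rows)], dtype=object)
    arrays = {"records": records}
    for view in views:
        for layer in layers:
            # One float array per (view, layer) block, shaped [rows, sae_width].
            arrays[f"act__{view}__layer{layer}"] = rng.random((rows, sae_width), dtype=np.float32)

    chunk_path = chunk_dir / "chunk_00000.npz"
    np.savez_compressed(chunk_path, **arrays)
    return chunk_path


toy_root = tempfile.mkdtemp()
toy_chunk = write_toy_sgca_output(toy_root)
with np.load(toy_chunk, allow_pickle=True) as data:
    print(sorted(data.files))
```

Because `records` is stored as an object array, loading it back needs `allow_pickle=True`.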
python -m sgca_capability.pipeline \
--output-dir /path/to/sgca_output \
--mode full \
--features-per-block 512 \
--feature-selection variance \
--n-groups-max 256 \
--sample-topk-features 256 \
--df-min-count 5 \
--df-max-frac 0.40 \
--ard-max-iter 500 \
--ard-warmup-iter 60 \
--ard-lambda 0.08 \
--membership-quantile 0.90 \
--min-group-size 10
Outputs are written to:
<output_dir>/ard_nmf_sae_multihot_groups/
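The `--sample-topk-features`, `--df-min-count`, and `--df-max-frac` flags in the full-pipeline command suggest a per-row top-k sparsification followed by document-frequency filtering and IDF weighting. The sketch below illustrates that idea on a dense toy matrix. The function name, the exact filtering rule, and the smoothed IDF formula are assumptions for illustration; the packaged `features.py` may differ.

```python
import numpy as np


def build_idf_weighted_input(acts, topk=2, df_min_count=1, df_max_frac=0.9):
    """Toy version: per-row top-k sparsification, DF filtering, IDF weighting."""
    acts = np.asarray(acts, dtype=np.float64)
    n_rows, n_cols = acts.shape
    k = min(topk, n_cols)

    # Keep only the k largest activations in each row; zero the rest.
    sparse = np.zeros_like(acts)
    top_idx = np.argpartition(-acts, k - 1, axis=1)[:, :k]
    rows = np.arange(n_rows)[:, None]
    sparse[rows, top_idx] = acts[rows, top_idx]
    sparse[sparse < 0] = 0.0  # NMF needs a non-negative input

    # Document frequency: number of rows in which a column is active.
    df = (sparse > 0).sum(axis=0)
    keep = (df >= df_min_count) & (df <= df_max_frac * n_rows)

    # Smoothed IDF (an assumed formula, not necessarily the package's).
    idf = np.log((1.0 + n_rows) / (1.0 + df[keep])) + 1.0
    return sparse[:, keep] * idf, np.flatnonzero(keep)


toy_acts = [[5, 1, 0, 2], [4, 0, 3, 0], [6, 0, 0, 1], [7, 2, 0, 0]]
X, kept_columns = build_idf_weighted_input(toy_acts, topk=2, df_min_count=2, df_max_frac=0.75)
print(kept_columns, X.shape)
```

In the toy call, column 0 is active in every row and is dropped by the `df_max_frac` rule, while columns 1 and 2 are too rare to pass `df_min_count`.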
Use the cached rerun mode when ard_nmf_input_n*_d*_top*.npz and selected_column_df_idf.csv already exist and you only want to rerun ARD-NMF and coverage repair:
python -m sgca_capability.pipeline \
--output-dir /path/to/sgca_output \
--mode rerun-cached \
--n-groups-max 256 \
--sample-topk-features 256 \
--ard-max-iter 500 \
--ard-warmup-iter 60
or:
python scripts/rerun_cached_ard_nmf.py --output-dir /path/to/sgca_output
The cached rerun path also supports the original rerun script's explicit paths:
python scripts/rerun_cached_ard_nmf.py \
--output-dir /path/to/sgca_output \
--matrix-path /path/to/ard_nmf_input_n88003_d3072_top256.npz \
--out-dir /path/to/sgca_output/ard_nmf_sae_multihot_groups \
--selected-column-meta /path/to/selected_column_df_idf.csv
bash tests/test_smoke.sh
The smoke test creates a toy SGCA output directory, runs the full pipeline, then reruns from the cached sparse matrix.
q_matrix_soft_ard_nmf_sae.csv
q_matrix_hard_nmf_sae.csv
q_matrix_long_ard_nmf_sae.csv
ard_nmf_group_summary.csv
ard_nmf_group_features.csv
ard_nmf_component_diagnostics.csv
item_group_membership_summary.csv
coverage_repair_audit.csv
COVERAGE_REPAIR_REPORT.json
ard_nmf_matrices.npz
ARD_NMF_SAE_MULTIHOT_REPORT.md
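The soft and hard Q-matrix files above correspond to the grouping steps named earlier: soft-score normalization, multi-hot membership, pruning, and coverage repair. The following is a rough sketch of how such steps could fit together, assuming a per-group quantile threshold (as `--membership-quantile` suggests) and a size-based pruning rule (as `--min-group-size` suggests). The function name and the exact rules are illustrative assumptions, not the package's confirmed behavior.

```python
import numpy as np


def soft_to_multihot(W, membership_quantile=0.90, min_group_size=1):
    """Toy soft-score normalization, multi-hot thresholding, pruning, and repair."""
    W = np.asarray(W, dtype=np.float64)

    # Soft-score normalization: each item's scores sum to 1 (all-zero rows stay zero).
    row_sum = W.sum(axis=1, keepdims=True)
    soft = np.divide(W, row_sum, out=np.zeros_like(W), where=row_sum > 0)

    # Multi-hot membership: an item joins a group when its score reaches
    # that group's quantile threshold (assumed rule).
    thresholds = np.quantile(soft, membership_quantile, axis=0)
    hard = (soft >= thresholds) & (soft > 0)

    # Prune weak groups that ended up with too few members.
    keep = hard.sum(axis=0) >= min_group_size
    soft, hard = soft[:, keep], hard[:, keep]

    # Coverage repair: give every uncovered item its best surviving group.
    uncovered = np.flatnonzero(~hard.any(axis=1))
    if hard.shape[1] > 0:
        hard[uncovered, soft[uncovered].argmax(axis=1)] = True
    return soft, hard.astype(np.int8), uncovered


toy_W = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.5, 0.5]]
soft_q, hard_q, repaired_rows = soft_to_multihot(toy_W, membership_quantile=0.75)
print(hard_q.tolist(), repaired_rows.tolist())
```

In the toy call, rows 1 and 3 fall below both group thresholds and only receive a membership through the repair step, which is the kind of event an audit file such as coverage_repair_audit.csv would plausibly record.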
- The packaged pipeline focuses on SAE-ARD-NMF capability grouping and cached rerun repair.
- Token-SAE redundancy analysis now lives in sgca_capability.token_sae_redundancy. It still needs the heavier optional dependencies (torch, transformers, av, etc.), so run it in the qwenscope environment.
CUDA_VISIBLE_DEVICES=1 HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 TOKENIZERS_PARALLELISM=false \
/home/jiangbaoyang/anaconda3/envs/qwenscope/bin/python scripts/token_sae_redundancy.py \
--output-dir /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/redundancy_results/token_sae_eval_clean \
--model /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/Qwen3.5-2B-Base \
--sae-path /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/SAE-Res-Qwen3.5-2B-Base-W32K-L0_100 \
--layers 10 \
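--overwrite
The commands below reprocess the data from /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_data and do not reuse old QwenScope outputs. By default they exclude Cheers-Training-Data, use the qwenscope conda environment, and force GPU use via --require-cuda.
First, change into the project directory:
cd /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT
Single-GPU end-to-end version: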
CUDA_VISIBLE_DEVICES=1 HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 TOKENIZERS_PARALLELISM=false \
/home/jiangbaoyang/anaconda3/envs/qwenscope/bin/python scripts/build_hf_qmatrix.py \
--mode all \
--hf-data-root /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_data \
--exclude-dataset Cheers-Training-Data \
--output-dir /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new \
--model /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/Qwen3.5-2B-Base \
--sae-path /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/SAE-Res-Qwen3.5-2B-Base-W32K-L0_100 \
--layers 0,4,8,12,16,20 \
--media-mode metadata \
--require-cuda
Recommended multi-GPU sharded version:
cd /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT
BASE=/home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_sharded
mkdir -p "$BASE/logs"
for i in 0 1 2 3 4 5 6; do
gpu=$((i+1))
tmux new-session -d -s mscdm_qm_s${i} -c /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT \
"env CUDA_VISIBLE_DEVICES=${gpu} HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 TOKENIZERS_PARALLELISM=false OMP_NUM_THREADS=4 MKL_NUM_THREADS=4 \
/home/jiangbaoyang/anaconda3/envs/qwenscope/bin/python scripts/build_hf_qmatrix.py \
--mode extract \
--hf-data-root /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_data \
--exclude-dataset Cheers-Training-Data \
--output-dir ${BASE}/shard_${i} \
--model /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/Qwen3.5-2B-Base \
--sae-path /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/SAE-Res-Qwen3.5-2B-Base-W32K-L0_100 \
--layers 0,4,8,12,16,20 \
--media-mode metadata \
--record-chunk-size 512 \
--checkpoint-every 200 \
--sample-shard-index ${i} \
--sample-shard-count 7 \
--require-cuda \
> ${BASE}/logs/shard_${i}.log 2>&1"
done
Then start the automatic merge, clustering, and Q-matrix generation:
tmux new-session -d -s mscdm_qm_finalize -c /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT \
"/usr/bin/bash scripts/finalize_hf_qmatrix_run.sh \
/home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_sharded \
/home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_merged"
Check progress:
nvidia-smi
tmux ls
tail -f /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_sharded/logs/shard_0.log
tail -f /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_sharded/logs/finalize.log
Final Q-matrix output directory:
/home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_merged/ard_nmf_sae_multihot_groups/
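This package-native runner rewrites sae_vector_chunks/, qwen_scope_multiclass_clusters.csv, and ard_nmf_sae_multihot_groups/q_matrix_*. By default it prefers Parquet/JSON/JSONL files and skips CSV/TSV export files, which are prone to duplication; CSV/TSV files are read only when --include-csv-tsv is passed explicitly. original_scripts/ is kept for reference only and is not a runtime entrypoint.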
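The sharded extraction commands pass `--sample-shard-index` and `--sample-shard-count` so that each GPU handles a disjoint part of the data. How the runner actually assigns samples to shards is not stated here; the sketch below assumes a simple round-robin rule purely to illustrate why the shards can later be merged without overlap. The function name `select_shard` is hypothetical.

```python
def select_shard(samples, shard_index, shard_count):
    """Return the samples owned by one shard under an assumed round-robin split."""
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    # Sample i belongs to shard (i mod shard_count), so shards never overlap.
    return [s for i, s in enumerate(samples) if i % shard_count == shard_index]


toy_samples = list(range(20))
toy_shards = [select_shard(toy_samples, i, 7) for i in range(7)]
print(toy_shards[0])
```

Under this rule the seven shards are disjoint and together cover every sample exactly once.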
After q_matrix_hard_nmf_sae.csv is generated, you can analyze item quality with model/human response data.
Expected response CSV format by default:
respondent_id,item_id,score
model_a,0,1
model_a,1,0
model_b,0,1
- respondent_id: model, run, annotator, or examinee id.
- item_id: should match run_row_id in the Q-matrix by default. Use --q-item-col and --response-item-col if you use sample_id or another id.
- score: binary correctness. Non-binary values are binarized by --binarize-threshold.
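As a small illustration of this format, the sketch below parses response rows into a nested dictionary and binarizes the score. The function name is hypothetical, and treating a score at or above the threshold as correct is an assumption; check the psychometrics module for the actual comparison used by `--binarize-threshold`.

```python
import csv
import io


def load_binary_responses(csv_text, binarize_threshold=0.5):
    """Parse respondent_id,item_id,score rows into {respondent: {item: 0 or 1}}."""
    responses = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = float(row["score"])
        # Assumption: a score at or above the threshold counts as correct.
        correct = int(score >= binarize_threshold)
        responses.setdefault(row["respondent_id"], {})[row["item_id"]] = correct
    return responses


toy_csv = """respondent_id,item_id,score
model_a,0,1
model_a,1,0
model_b,0,0.7
"""
response_matrix = load_binary_responses(toy_csv)
print(response_matrix)
```

Note that item ids are kept as strings here, so they would need to be compared against the Q-matrix id column in the same type.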
Run both IRT and CDM:
python -m sgca_capability.psychometrics \
--q-matrix /path/to/sgca_output/ard_nmf_sae_multihot_groups/q_matrix_hard_nmf_sae.csv \
--responses /path/to/responses.csv \
--out-dir /path/to/sgca_output/psychometrics \
--analysis all \
--irt-model 2pl \
--cdm-model dina \
--cdm-max-attributes 12
IRT outputs:
psychometrics/irt/irt_item_parameters.csv
psychometrics/irt/irt_respondent_abilities.csv
psychometrics/irt/irt_summary.json
psychometrics/irt/IRT_REPORT.md
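For readers unfamiliar with the `2pl` option, the standard two-parameter logistic model gives the probability of a correct response from a respondent ability and two item parameters. The sketch below shows only that textbook formula; the function and argument names are illustrative, and the column names in irt_item_parameters.csv are not specified in this README.

```python
import math


def irt_2pl_probability(theta, discrimination, difficulty):
    """Standard 2PL response probability: sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# A respondent whose ability equals the item difficulty answers correctly half the time.
print(irt_2pl_probability(theta=0.3, discrimination=1.5, difficulty=0.3))
# A higher discrimination makes the curve steeper around the difficulty.
print(irt_2pl_probability(theta=1.0, discrimination=0.5, difficulty=0.0))
print(irt_2pl_probability(theta=1.0, discrimination=2.0, difficulty=0.0))
```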
CDM outputs:
psychometrics/cdm/cdm_item_parameters.csv
psychometrics/cdm/cdm_respondent_mastery_probabilities.csv
psychometrics/cdm/cdm_attribute_summary.csv
psychometrics/cdm/cdm_latent_classes.csv
psychometrics/cdm/cdm_summary.json
psychometrics/cdm/CDM_REPORT.md
Important: CDM enumerates every latent mastery state (2^K states for K attributes), so the number of attributes must be small. Use --cdm-max-attributes or pass a reduced set of group columns. IRT can handle more items directly; CDM should usually be run on a selected subset of named/validated ability groups.
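The growth in latent states is easy to see by enumerating them. The sketch below lists all binary mastery patterns for a given number of attributes and applies the textbook DINA rule, where a respondent who masters every attribute an item requires answers correctly with probability 1 - slip and otherwise with probability guess. The function names are illustrative and not taken from the package.

```python
from itertools import product


def enumerate_mastery_states(n_attributes):
    """All binary mastery patterns; there are 2 ** n_attributes of them."""
    return list(product((0, 1), repeat=n_attributes))


def dina_correct_probability(state, q_row, slip, guess):
    """Textbook DINA response probability for one latent state and one item."""
    # eta is True only if every attribute required by the item is mastered.
    eta = all(a == 1 for a, q in zip(state, q_row) if q == 1)
    return 1.0 - slip if eta else guess


for k in (5, 12, 20):
    print(k, "attributes ->", 2 ** k, "latent states")

toy_states = enumerate_mastery_states(3)
print(len(toy_states), dina_correct_probability((1, 0, 1), (1, 0, 1), slip=0.1, guess=0.2))
```

Each added attribute doubles the number of states the estimator has to track, which is why a cap such as `--cdm-max-attributes` matters.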
bash tests/test_smoke.sh
python examples/make_toy_responses.py \
--q-matrix /tmp/sgca_capability_smoke/sgca_output/ard_nmf_sae_multihot_groups/q_matrix_hard_nmf_sae.csv \
--out /tmp/sgca_capability_smoke/responses.csv
python -m sgca_capability.psychometrics \
--q-matrix /tmp/sgca_capability_smoke/sgca_output/ard_nmf_sae_multihot_groups/q_matrix_hard_nmf_sae.csv \
--responses /tmp/sgca_capability_smoke/responses.csv \
--out-dir /tmp/sgca_capability_smoke/psychometrics \
--analysis all \
--cdm-max-attributes 5 \
--irt-max-iter 300 \
--cdm-max-iter 100