SGCA Capability Project

This project reorganizes the uploaded scripts into a runnable Python package for SAE-ARD-NMF multi-hot capability grouping.

What is included

sgca_capability/features.py
Stream SGCA sae_vector_chunks, select SAE features per view-layer block, build selected matrix, and construct sparse IDF-weighted NMF input.
sgca_capability/nmf.py
ARD-style Euclidean NMF implementation.
sgca_capability/grouping.py
Soft-score normalization, multi-hot membership construction, weak/redundant group pruning, and coverage repair.
sgca_capability/outputs.py
Q-matrix, group summary, group feature, group file, taxonomy, and report writers.
sgca_capability/pipeline.py
Unified CLI for full build and rerun-from-cached-matrix mode.
sgca_capability/token_sae_redundancy.py
Package-native token-SAE redundancy analysis, including dataset readers, GPU extraction, checkpointing, sharded merge, similarity tables, PCA plots, and token-SAE cluster diagnostics.
original_scripts/
The original uploaded scripts are preserved for reference only, not as runtime entrypoints.

original_scripts integration map

Use project code for execution:

original_scripts/sgca_sae_ard_nmf_multihot_original.py
  -> sgca_capability.pipeline
  -> scripts/sgca_sae_ard_nmf_multihot.py

original_scripts/rerun_cached_matrix_with_coverage_repair_original.py
  -> sgca_capability.pipeline --mode rerun-cached
  -> sgca_capability.rerun_cached_matrix_with_coverage_repair
  -> scripts/rerun_cached_ard_nmf.py

original_scripts/token_sae_redundancy_original.py
  -> sgca_capability.token_sae_redundancy
  -> scripts/token_sae_redundancy.py

Install

cd sgca_capability_project
python -m pip install -e .

Or, without an editable install:

cd sgca_capability_project
python -m pip install -r requirements.txt
export PYTHONPATH=$PWD

Expected input directory

The full pipeline expects:

<output_dir>/run_config.json
<output_dir>/sae_vector_chunks/chunk_*.npz

run_config.json should include:

{
  "effective_views": ["sample"],
  "effective_layers": [0, 1]
}

Each chunk should include:

records                         object array of JSON strings
act__{view}__layer{layer}        float array [rows, sae_width]
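
As a rough illustration of the layout above, the following sketch writes a one-chunk toy input directory with NumPy. Only the run_config.json keys and the records / act__{view}__layer{layer} array names come from the description above. The directory name toy_sgca_output, the file name chunk_00000.npz, the sample_id field inside each record, and the SAE width of 8 are made up for the example.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def write_toy_input(output_dir, views=("sample",), layers=(0, 1), rows=4, sae_width=8, seed=0):
    """Write run_config.json plus one chunk in the layout described above."""
    output_dir = Path(output_dir)
    chunk_dir = output_dir / "sae_vector_chunks"
    chunk_dir.mkdir(parents=True, exist_ok=True)

    config = {"effective_views": list(views), "effective_layers": list(layers)}
    (output_dir / "run_config.json").write_text(json.dumps(config, indent=2))

    rng = np.random.default_rng(seed)
    arrays = {
        # One JSON string per row; the record fields here are hypothetical.
        "records": np.array([json.dumps({"sample_id": i}) for i in range(rows)], dtype=object),
    }
    for view in views:
        for layer in layers:
            # One float array [rows, sae_width] per view-layer block.
            arrays[f"act__{view}__layer{layer}"] = rng.random((rows, sae_width), dtype=np.float32)

    chunk_path = chunk_dir / "chunk_00000.npz"  # hypothetical name matching chunk_*.npz
    np.savez(chunk_path, **arrays)
    return chunk_path


toy_dir = Path(tempfile.mkdtemp()) / "toy_sgca_output"
chunk_path = write_toy_input(toy_dir)

# Object arrays are pickled inside the .npz, so loading needs allow_pickle=True.
with np.load(chunk_path, allow_pickle=True) as chunk:
    chunk_keys = sorted(chunk.files)
    first_record = json.loads(chunk["records"][0])
    act_shape = chunk["act__sample__layer0"].shape
```

Because records is an object array, np.load must be called with allow_pickle=True to read it back.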

Full pipeline

python -m sgca_capability.pipeline \
  --output-dir /path/to/sgca_output \
  --mode full \
  --features-per-block 512 \
  --feature-selection variance \
  --n-groups-max 256 \
  --sample-topk-features 256 \
  --df-min-count 5 \
  --df-max-frac 0.40 \
  --ard-max-iter 500 \
  --ard-warmup-iter 60 \
  --ard-lambda 0.08 \
  --membership-quantile 0.90 \
  --min-group-size 10

Outputs are written to:

<output_dir>/ard_nmf_sae_multihot_groups/
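
The --df-min-count and --df-max-frac flags suggest that columns active in too few or too many rows are dropped before IDF weighting. The sketch below shows one way such a filter could work on a small dense array. The "active means > 0" rule, the smoothed IDF formula, and the function name df_filter_and_idf are assumptions for illustration, and the packaged code builds a sparse matrix rather than a dense one.

```python
import numpy as np


def df_filter_and_idf(x, df_min_count=5, df_max_frac=0.40):
    """Keep columns with moderate document frequency and scale them by IDF.

    x: non-negative array [rows, features]; a feature "occurs" in a row when > 0.
    The smoothed IDF formula below is an assumption, not taken from the package.
    """
    n_rows = x.shape[0]
    df = (x > 0).sum(axis=0)  # number of rows in which each column is active
    keep = (df >= df_min_count) & (df <= df_max_frac * n_rows)
    idf = np.log((1.0 + n_rows) / (1.0 + df[keep])) + 1.0
    return x[:, keep] * idf, np.flatnonzero(keep), idf


# Toy example: 10 rows, 3 columns that are active in 1, 3, and 9 rows.
x = np.zeros((10, 3))
x[:1, 0] = 1.0
x[:3, 1] = 1.0
x[:9, 2] = 1.0
weighted, kept_columns, idf = df_filter_and_idf(x, df_min_count=2, df_max_frac=0.40)
```

In the toy call only the middle column survives: the first is too rare for df_min_count=2 and the third exceeds 40% of the rows.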

Rerun from cached sparse matrix

Use this when ard_nmf_input_n*_d*_top*.npz and selected_column_df_idf.csv already exist and you only want to rerun ARD-NMF / coverage repair:

python -m sgca_capability.pipeline \
  --output-dir /path/to/sgca_output \
  --mode rerun-cached \
  --n-groups-max 256 \
  --sample-topk-features 256 \
  --ard-max-iter 500 \
  --ard-warmup-iter 60

or:

python scripts/rerun_cached_ard_nmf.py --output-dir /path/to/sgca_output

The cached rerun path also supports the original rerun script's explicit paths:

python scripts/rerun_cached_ard_nmf.py \
  --output-dir /path/to/sgca_output \
  --matrix-path /path/to/ard_nmf_input_n88003_d3072_top256.npz \
  --out-dir /path/to/sgca_output/ard_nmf_sae_multihot_groups \
  --selected-column-meta /path/to/selected_column_df_idf.csv
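
The cached matrix file name in the example above appears to encode three integers. A small helper like the one below can pull them out when choosing among several cached files. The helper name is hypothetical, and reading n, d, and top as row count, selected column count, and the --sample-topk-features value is a guess based on the file name pattern only.

```python
import re

# Mirrors the ard_nmf_input_n*_d*_top*.npz pattern mentioned above.
CACHED_MATRIX_RE = re.compile(r"^ard_nmf_input_n(\d+)_d(\d+)_top(\d+)\.npz$")


def parse_cached_matrix_name(filename):
    """Return the three integers embedded in a cached-matrix file name, or None."""
    match = CACHED_MATRIX_RE.match(filename)
    if match is None:
        return None
    n, d, top = (int(group) for group in match.groups())
    return {"n": n, "d": d, "top": top}


info = parse_cached_matrix_name("ard_nmf_input_n88003_d3072_top256.npz")
print(info)
```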

Smoke test

bash tests/test_smoke.sh

The smoke test creates a toy SGCA output directory, runs the full pipeline, then reruns from the cached sparse matrix.

Main outputs

q_matrix_soft_ard_nmf_sae.csv
q_matrix_hard_nmf_sae.csv
q_matrix_long_ard_nmf_sae.csv
ard_nmf_group_summary.csv
ard_nmf_group_features.csv
ard_nmf_component_diagnostics.csv
item_group_membership_summary.csv
coverage_repair_audit.csv
COVERAGE_REPAIR_REPORT.json
ard_nmf_matrices.npz
ARD_NMF_SAE_MULTIHOT_REPORT.md
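
To make the relation between the soft and hard Q-matrix files more concrete, here is a minimal sketch of turning soft scores into a multi-hot matrix. It assumes a per-group quantile threshold (in the spirit of --membership-quantile), pruning of groups below a minimum size (--min-group-size), and a simple coverage repair that assigns each uncovered item to its best surviving group. All three rules and the function name are illustrative assumptions; the packaged grouping code may differ at every step.

```python
import numpy as np


def soft_to_multihot(soft, membership_quantile=0.90, min_group_size=10):
    """Threshold soft scores into a multi-hot matrix, prune small groups, repair coverage.

    soft: non-negative array [items, groups]. Every rule here is an illustrative
    assumption, not the package's confirmed behavior.
    """
    # Per-group threshold: an item joins a group if its positive score reaches the quantile.
    thresholds = np.quantile(soft, membership_quantile, axis=0)
    hard = (soft >= thresholds) & (soft > 0)

    # Prune groups with too few members.
    keep = hard.sum(axis=0) >= min_group_size
    hard, soft_kept = hard[:, keep], soft[:, keep]

    # Coverage repair: give each uncovered item its best surviving group.
    uncovered = np.flatnonzero(~hard.any(axis=1))
    if soft_kept.shape[1] > 0:
        hard[uncovered, soft_kept[uncovered].argmax(axis=1)] = True
    return hard.astype(np.int8), np.flatnonzero(keep), uncovered


rng = np.random.default_rng(0)
soft = rng.random((200, 5))
hard, kept_groups, repaired = soft_to_multihot(soft, membership_quantile=0.90, min_group_size=10)
```

After the repair step every item belongs to at least one group, which is the property a coverage audit would check.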

Notes

The packaged pipeline focuses on SAE-ARD-NMF capability grouping and cached rerun repair.
Token-SAE redundancy analysis now lives in sgca_capability.token_sae_redundancy. It still needs the heavier optional dependencies (torch, transformers, av, etc.), so run it in the qwenscope environment.

Token-SAE redundancy

CUDA_VISIBLE_DEVICES=1 HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 TOKENIZERS_PARALLELISM=false \
/home/jiangbaoyang/anaconda3/envs/qwenscope/bin/python scripts/token_sae_redundancy.py \
  --output-dir /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/redundancy_results/token_sae_eval_clean \
  --model /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/Qwen3.5-2B-Base \
  --sae-path /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/SAE-Res-Qwen3.5-2B-Base-W32K-L0_100 \
  --layers 10 \
  --overwrite

Generate the Q-matrix from raw HF data

The commands below reprocess the data from /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_data and do not reuse old QwenScope outputs. By default they exclude Cheers-Training-Data, use the qwenscope conda environment, and force GPU use via --require-cuda.

First, change into the project directory:

cd /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT

Single-GPU end-to-end version:

CUDA_VISIBLE_DEVICES=1 HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 TOKENIZERS_PARALLELISM=false \
/home/jiangbaoyang/anaconda3/envs/qwenscope/bin/python scripts/build_hf_qmatrix.py \
  --mode all \
  --hf-data-root /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_data \
  --exclude-dataset Cheers-Training-Data \
  --output-dir /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new \
  --model /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/Qwen3.5-2B-Base \
  --sae-path /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/SAE-Res-Qwen3.5-2B-Base-W32K-L0_100 \
  --layers 0,4,8,12,16,20 \
  --media-mode metadata \
  --require-cuda

Recommended multi-GPU sharded version:

cd /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT

BASE=/home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_sharded
mkdir -p "$BASE/logs"

for i in 0 1 2 3 4 5 6; do
  gpu=$((i+1))
  tmux new-session -d -s mscdm_qm_s${i} -c /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT \
    "env CUDA_VISIBLE_DEVICES=${gpu} HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 TOKENIZERS_PARALLELISM=false OMP_NUM_THREADS=4 MKL_NUM_THREADS=4 \
    /home/jiangbaoyang/anaconda3/envs/qwenscope/bin/python scripts/build_hf_qmatrix.py \
      --mode extract \
      --hf-data-root /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_data \
      --exclude-dataset Cheers-Training-Data \
      --output-dir ${BASE}/shard_${i} \
      --model /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/Qwen3.5-2B-Base \
      --sae-path /home/jiangbaoyang/HuggingFace-Download-Accelerator/hf_hub/SAE-Res-Qwen3.5-2B-Base-W32K-L0_100 \
      --layers 0,4,8,12,16,20 \
      --media-mode metadata \
      --record-chunk-size 512 \
      --checkpoint-every 200 \
      --sample-shard-index ${i} \
      --sample-shard-count 7 \
      --require-cuda \
      > ${BASE}/logs/shard_${i}.log 2>&1"
done
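
The loop above starts seven workers that differ only in --sample-shard-index, with --sample-shard-count fixed at 7. How samples are assigned to shards is not stated here. The sketch below assumes a round-robin rule (sample index modulo shard count) purely to illustrate the property any such scheme needs: the shards are disjoint and together cover every sample. Both function names are hypothetical.

```python
def shard_of(sample_index, shard_count):
    """Round-robin shard assignment (an assumed rule, not confirmed by the source)."""
    return sample_index % shard_count


def samples_for_shard(n_samples, shard_index, shard_count):
    """Indices that a worker with the given shard index would process."""
    return [i for i in range(n_samples) if shard_of(i, shard_count) == shard_index]


shard_count = 7
shards = [samples_for_shard(20, s, shard_count) for s in range(shard_count)]

# Every sample should appear in exactly one shard.
covered = sorted(i for shard in shards for i in shard)
```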

Then launch the automatic merge, clustering, and Q-matrix generation:

tmux new-session -d -s mscdm_qm_finalize -c /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT \
  "/usr/bin/bash scripts/finalize_hf_qmatrix_run.sh \
  /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_sharded \
  /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_merged"

Check progress:

nvidia-smi
tmux ls
tail -f /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_sharded/logs/shard_0.log
tail -f /home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_sharded/logs/finalize.log

Final Q-matrix output directory:

/home/jiangbaoyang/GitHub/ModelScope-CDM-IRT/hf_data_qmatrix_run_new_merged/ard_nmf_sae_multihot_groups/

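This package-native runner rewrites sae_vector_chunks/, qwen_scope_multiclass_clusters.csv, and ard_nmf_sae_multihot_groups/q_matrix_*. By default it prefers Parquet/JSON/JSONL and skips CSV/TSV exports, which tend to duplicate that data; CSV/TSV files are read only when --include-csv-tsv is passed explicitly. original_scripts/ is for reference only and is not a runtime entrypoint.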

Psychometric post-analysis: IRT and CDM

After q_matrix_hard_nmf_sae.csv is generated, you can analyze item quality with model/human response data.

Expected response CSV format by default:

respondent_id,item_id,score
model_a,0,1
model_a,1,0
model_b,0,1

respondent_id: model, run, annotator, or examinee id.
item_id: should match run_row_id in the Q-matrix by default. Use --q-item-col and --response-item-col if you use sample_id or another id.
score: binary correctness. Non-binary values are binarized by --binarize-threshold.
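
For checking a response file before running the analysis, a small reader along these lines can help. It uses only the standard library and the three default column names above. The threshold default of 0.5, the ">= threshold counts as correct" rule, and the function name are assumptions about what --binarize-threshold does, not its documented behavior.

```python
import csv
import io


def load_binary_responses(handle, threshold=0.5):
    """Read respondent_id,item_id,score rows into {respondent: {item: 0 or 1}}.

    Scores at or above `threshold` count as correct; both the default value and
    the >= rule are assumptions about what --binarize-threshold does.
    """
    table = {}
    for row in csv.DictReader(handle):
        score = 1 if float(row["score"]) >= threshold else 0
        table.setdefault(row["respondent_id"], {})[row["item_id"]] = score
    return table


# Toy input in the default long format, with one non-binary score.
toy_csv = io.StringIO(
    "respondent_id,item_id,score\n"
    "model_a,0,1\n"
    "model_a,1,0\n"
    "model_b,0,0.8\n"
)
responses = load_binary_responses(toy_csv)
```

Item ids are kept as strings here, so they would need to be compared against the Q-matrix id column as strings as well.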

Run both IRT and CDM:

python -m sgca_capability.psychometrics \
  --q-matrix /path/to/sgca_output/ard_nmf_sae_multihot_groups/q_matrix_hard_nmf_sae.csv \
  --responses /path/to/responses.csv \
  --out-dir /path/to/sgca_output/psychometrics \
  --analysis all \
  --irt-model 2pl \
  --cdm-model dina \
  --cdm-max-attributes 12

IRT outputs:

psychometrics/irt/irt_item_parameters.csv
psychometrics/irt/irt_respondent_abilities.csv
psychometrics/irt/irt_summary.json
psychometrics/irt/IRT_REPORT.md
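
When reading the IRT item parameters, it may help to recall the standard 2PL item response function, which gives the probability of a correct response from an ability, a discrimination, and a difficulty. The sketch below takes plain arguments because the column names of irt_item_parameters.csv are not documented here; the numeric values are arbitrary.

```python
import math


def two_pl_probability(theta, discrimination, difficulty):
    """Standard 2PL item response function: P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


# A respondent whose ability equals the item difficulty answers correctly half the time.
p_at_difficulty = two_pl_probability(theta=0.5, discrimination=1.2, difficulty=0.5)
p_strong = two_pl_probability(theta=2.0, discrimination=1.2, difficulty=0.5)
p_weak = two_pl_probability(theta=-1.0, discrimination=1.2, difficulty=0.5)
```

A higher discrimination makes the curve steeper around the difficulty, so the item separates respondents near that ability level more sharply.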

CDM outputs:

psychometrics/cdm/cdm_item_parameters.csv
psychometrics/cdm/cdm_respondent_mastery_probabilities.csv
psychometrics/cdm/cdm_attribute_summary.csv
psychometrics/cdm/cdm_latent_classes.csv
psychometrics/cdm/cdm_summary.json
psychometrics/cdm/CDM_REPORT.md

Important: CDM enumerates all latent mastery states (2^K for K binary attributes), so the number of attributes must be small. Use --cdm-max-attributes or pass a reduced set of group columns. IRT can handle more items directly; CDM should usually be run on a selected subset of named/validated ability groups.
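
The cost of enumerating mastery states can be seen in a few lines. The sketch below lists all binary mastery patterns for a small number of attributes and applies the textbook DINA rule, under which an item's ideal response is 1 only when every attribute required by its Q-matrix row is mastered. The function names, the example Q-matrix row, and the slip and guess values are hypothetical; this is not the package's estimation code.

```python
from itertools import product


def latent_states(n_attributes):
    """All binary mastery patterns for K attributes (2**K of them)."""
    return list(product((0, 1), repeat=n_attributes))


def dina_ideal_response(alpha, q_row):
    """DINA ideal response: 1 only if every attribute the item requires is mastered."""
    return int(all(a >= q for a, q in zip(alpha, q_row)))


def dina_probability(alpha, q_row, slip, guess):
    """P(correct) under DINA: 1 - slip for masters of the item, guess otherwise."""
    return 1.0 - slip if dina_ideal_response(alpha, q_row) else guess


states_5 = latent_states(5)                    # small enough to enumerate
growth = {k: 2 ** k for k in (5, 12, 20)}      # state count doubles per attribute

q_row = (1, 0, 1)  # hypothetical item requiring attributes 0 and 2
p_master = dina_probability((1, 0, 1), q_row, slip=0.1, guess=0.2)
p_non_master = dina_probability((0, 1, 1), q_row, slip=0.1, guess=0.2)
```

With 12 attributes there are already 4096 states per respondent, which is why a cap such as --cdm-max-attributes 12 keeps the enumeration tractable.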

Toy psychometric smoke test

bash tests/test_smoke.sh
python examples/make_toy_responses.py \
  --q-matrix /tmp/sgca_capability_smoke/sgca_output/ard_nmf_sae_multihot_groups/q_matrix_hard_nmf_sae.csv \
  --out /tmp/sgca_capability_smoke/responses.csv
python -m sgca_capability.psychometrics \
  --q-matrix /tmp/sgca_capability_smoke/sgca_output/ard_nmf_sae_multihot_groups/q_matrix_hard_nmf_sae.csv \
  --responses /tmp/sgca_capability_smoke/responses.csv \
  --out-dir /tmp/sgca_capability_smoke/psychometrics \
  --analysis all \
  --cdm-max-attributes 5 \
  --irt-max-iter 300 \
  --cdm-max-iter 100
