
WIP: packing #20

Open
tpoisonooo wants to merge 8 commits into main from packing

tpoisonooo (Collaborator) commented Nov 12, 2025

Added

  • Introduce a dataset packing feature: pack training samples together and manually adjust position_ids
  • Merge omic_ids into omic_info_list
  • Add entropy loss
  • Print args at startup, using texttable
  Module   |             Key             |                       Value
===========+=============================+===================================================
experiment | experiment_name             | Qwen3_1.7B_mini_pack1_dem1
           | output_dir                  | Qwen3_1.7B_mini_pack1_dem1
           | profile_log_dir             | None
           | report_to                   | ['swanlab']
           | swanlab                     | True
           | swanlab_mode                | local
           | swanlab_project             | BioMLLM
           | swanlab_team                | BioMLLM_report
           | test_code                   | False
model      | bf16                        | True
           | device                      | cuda
           | dna_rna_k_tokens            | 1024
           | dna_rna_model_path          | /mnt/shared-storage-user/ai4agr-share/lijinzhe/...
           | fp16                        | False
           | greater_is_better           | False
           | load_best_model_at_end      | False
           | no_load_pretrained          | False
           | protein_k_tokens            | 1024
           | protein_model_path          | /mnt/shared-storage-user/ai4agr-share/lijinzhe/...
           | text_model_path             | /mnt/shared-storage-user/ai4agr-share/lijinzhe/...
           | train_bio                   | False
           | train_llm                   | True
           | train_mlp                   | True
dataset    | all_reduce_loss             | False
           | batching_stretegy           | padding
           | eval_dataset_path           | /mnt/shared-storage-user/ai4agr-share/lijinzhe/...
           | eval_max_len                | 8192
           | eval_max_src_len            | 1024
           | eval_read_nums              | 12800000
           | max_len                     | 8192
           | max_src_len                 | 1024
           | meta_prompt                 | None
           | postfix                     | None
..
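
A minimal sketch of what the packing item above might look like, assuming a greedy strategy: tokenized samples are concatenated until `max_len` is reached, and `position_ids` restart at 0 for every sample so each one keeps its own positions inside the pack. The function name `pack_sequences`, the `seq_lens` field, and the truncation of over-long samples are illustrative assumptions, not the code in this PR.

```python
from typing import Dict, List


def pack_sequences(samples: List[List[int]], max_len: int) -> List[Dict[str, List[int]]]:
    """Greedily concatenate tokenized samples into packs of at most max_len tokens.

    Hypothetical helper for illustration; not the PR's actual collate_fn.
    """

    def new_pack() -> Dict[str, List[int]]:
        return {"input_ids": [], "position_ids": [], "seq_lens": []}

    packs: List[Dict[str, List[int]]] = []
    current = new_pack()
    for ids in samples:
        ids = ids[:max_len]  # assumption: over-long samples are truncated
        if not ids:
            continue  # skip empty samples
        # Start a new pack when this sample would overflow the current one.
        if current["input_ids"] and len(current["input_ids"]) + len(ids) > max_len:
            packs.append(current)
            current = new_pack()
        current["input_ids"].extend(ids)
        # Positions restart at 0 for each packed sample.
        current["position_ids"].extend(range(len(ids)))
        # Per-sample lengths let an attention kernel keep samples separate.
        current["seq_lens"].append(len(ids))
    if current["input_ids"]:
        packs.append(current)
    return packs


packs = pack_sequences([[5, 6, 7], [8, 9], [1, 2, 3, 4]], max_len=5)
for pack in packs:
    print(pack)
```

Resetting `position_ids` alone does not stop attention from crossing sample boundaries; a real implementation would also need a block-diagonal mask or a variable-length attention kernel driven by something like `seq_lens`.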

Fixed

  • parse_args was used incorrectly, so liger could not actually be turned off
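
The PR does not say what the misuse was. Purely as an illustration, one common way a boolean switch becomes impossible to turn off with `argparse` is declaring it with `type=bool`, because `bool("False")` is `True`. The flag names below are hypothetical and are not taken from this repository.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Pitfall: type=bool calls bool() on the raw string, and any
    # non-empty string (including "False") is truthy.
    parser.add_argument("--use_liger_buggy", type=bool, default=True)
    # One possible fix (Python 3.9+): BooleanOptionalAction generates
    # both --use_liger and --no-use_liger.
    parser.add_argument(
        "--use_liger", action=argparse.BooleanOptionalAction, default=True
    )
    return parser


args = build_parser().parse_args(["--use_liger_buggy", "False", "--no-use_liger"])
print(args.use_liger_buggy)  # still True despite passing "False"
print(args.use_liger)        # False
```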

Removed

  • Remove omic_type == "pad"
  • Remove max_src_len
  • Remove shuffle
  • Remove the seeds scattered across the code; set the seed once at the very start
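
Setting the seed once at the start, as the last item describes, could be sketched as below. The helper name `set_seed` and the choice of libraries to seed are assumptions; NumPy and PyTorch are seeded only if they are installed.

```python
import random


def set_seed(seed: int) -> None:
    """Seed every random number generator in one place, once, at startup."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch

        torch.manual_seed(seed)
        # Safe to call without CUDA; the call is ignored in that case.
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_seed(42)
first = random.random()
set_seed(42)
second = random.random()
print(first == second)
```

Calling such a helper once before building the dataset and the model keeps data order and weight initialization reproducible without per-module seed arguments.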

Testing

1B model training is sped up by 140%; a run at 8k sequence length on 4 GPUs takes 9 hours


tpoisonooo changed the title from "Packing" to "WIP: packing" Nov 12, 2025
tpoisonooo force-pushed the packing branch 2 times, most recently from c19fb5a to d8a862a November 20, 2025 06:36
feat(dataset/omics_dataset.py): update collate_fn

feat(omics_dataset.py): fix

fix(omics_dataset.py): typo

feat(src/train.py): bugfix

test(src/train.py): run passed

feat(src/loss.py): add DEM loss

fix(loss.py): use entropy loss

feat(scripts): update train