3475 commits
6be9862
fix
wangbluo Oct 15, 2024
fd92789
fix
wangbluo Oct 15, 2024
bc7eead
fix
wangbluo Oct 15, 2024
83cf2f8
fix
wangbluo Oct 15, 2024
dcd41d0
Merge pull request #6071 from wangbluo/ring_attention
wangbluo Oct 15, 2024
62c13e7
[Ring Attention] Improve comments (#6085)
Edenzzzz Oct 16, 2024
cd61353
[pipeline] hotfix backward for multiple outputs (#6090)
ver217 Oct 16, 2024
2bcd0b6
[ckpt] add safetensors util
botbw Oct 14, 2024
3b1d7d1
[chore] refactor
botbw Oct 14, 2024
5ddad48
[fp8] add fallback and make compile option configurable (#6092)
ver217 Oct 18, 2024
58d8b8a
[misc] fit torch api upgradation and remove legecy import (#6093)
ver217 Oct 18, 2024
19baab5
[release] update version (#6094)
ver217 Oct 21, 2024
b10339d
fix lora ckpt save format (ColoTensor to Tensor)
BurkeHulk Oct 21, 2024
6d6cafa
pre-commit fix
BurkeHulk Oct 21, 2024
dee63cc
Merge pull request #6096 from BurkeHulk/hotfix/lora_ckpt
BurkeHulk Oct 21, 2024
80a8ca9
[extension] hotfix compile check (#6099)
ver217 Oct 24, 2024
4294ae8
[doc] sora solution news (#6100)
binmakeswell Oct 24, 2024
89a9a60
[MCTS] Add self-refined MCTS (#6098)
TongLi3701 Oct 24, 2024
c2e8f61
[checkpointio] fix hybrid plugin model save (#6106)
ver217 Oct 31, 2024
2f583c1
[pre-commit.ci] pre-commit autoupdate (#6078)
pre-commit-ci[bot] Oct 31, 2024
13ffa08
[release] update version (#6109)
ver217 Nov 4, 2024
a15ab13
[plugin] support get_grad_norm (#6115)
ver217 Nov 5, 2024
7a60161
update readme (#6116)
TongLi3701 Nov 6, 2024
30a9443
[Coati] Refine prompt for better inference (#6117)
TongLi3701 Nov 8, 2024
a259651
[zero] support extra dp (#6123)
ver217 Nov 12, 2024
c2fe313
[hotfix] fix flash attn window_size err (#6132)
duanjunwen Nov 14, 2024
cc40fe0
[fix] multi-node backward slowdown (#6134)
BurkeHulk Nov 14, 2024
5a03d26
[cli] support run as module option (#6135)
ver217 Nov 14, 2024
d4a4360
[checkpointio] support async model save (#6131)
ver217 Nov 14, 2024
8e08c27
[ckpt] Add async ckpt api (#6136)
wangbluo Nov 15, 2024
b90835b
[checkpointio] fix performance issue (#6139)
ver217 Nov 18, 2024
eb69e64
[async io]supoort async io (#6137)
flybird11111 Nov 18, 2024
5fa657f
[checkpointio] fix size compute
ver217 Nov 18, 2024
184a653
[checkpointio] fix pinned state dict
ver217 Nov 19, 2024
e0c68ab
[Zerobubble] merge main. (#6142)
duanjunwen Nov 19, 2024
5caad13
[doc] add hpc cloud intro (#6147)
Sze-qq Nov 20, 2024
cf519da
[optim] hotfix adam load (#6146)
ver217 Nov 20, 2024
152162a
[doc] update cloud link (#6148)
Sze-qq Nov 20, 2024
8fddbab
[checkpointio] disable buffering
ver217 Nov 21, 2024
8ecff0c
Merge pull request #6149 from ver217/hotfix/ckpt
wangbluo Nov 21, 2024
ab856fd
[checkpointio] fix zero optimizer async save memory (#6151)
ver217 Nov 25, 2024
6280cb1
[checkpointio] support debug log (#6153)
ver217 Dec 2, 2024
8d826a3
[fix] fix bug caused by perf version (#6156)
duanjunwen Dec 10, 2024
de3d371
[hotfix] fix zero comm buffer init (#6154)
ver217 Dec 10, 2024
e994c64
[checkpointio] fix async io (#6155)
flybird11111 Dec 16, 2024
aaafb38
[Device]Support npu (#6159)
flybird11111 Dec 17, 2024
130229f
[checkpointio]support asyncio for 3d (#6152)
flybird11111 Dec 23, 2024
fa9d031
[Hotfix] hotfix normalization (#6163)
duanjunwen Dec 23, 2024
5f82bfa
[doc] add bonus event (#6164)
binmakeswell Dec 23, 2024
8b0ed61
[hotfix] improve compatibility (#6165)
ver217 Dec 23, 2024
8369924
[news] release colossalai for sora (#6166)
binmakeswell Dec 23, 2024
af06d16
[checkpointio] support non blocking pin load (#6172)
ver217 Dec 25, 2024
a9bedc7
[Sharderformer] Support zbv in Sharderformer Policy (#6150)
duanjunwen Jan 2, 2025
7fdef9f
[pre-commit.ci] pre-commit autoupdate (#6113)
pre-commit-ci[bot] Jan 2, 2025
479067e
[release] update version (#6174)
ver217 Jan 3, 2025
ee81366
[checkpointio] support load-pin overlap (#6177)
ver217 Jan 7, 2025
5b094a8
[Inference]Fix example in readme (#6178)
GuangyaoZhang Jan 8, 2025
97e60cb
[checkpointio] gather tensor before unpad it if the tensor is both pa…
Lemon-412 Jan 21, 2025
ca0aa23
[Issue template] Add checkbox asking for details to reproduce error (…
Edenzzzz Jan 24, 2025
17062c8
[hotfix] fix hybrid checkpointio for sp+dp (#6184)
flybird11111 Feb 6, 2025
2b415e5
[shardformer] support ep for deepseek v3 (#6185)
ver217 Feb 11, 2025
5c09d72
[checkpointio] fix checkpoint for 3d (#6187)
flybird11111 Feb 12, 2025
ec73f1b
[CI] Cleanup Dist Optim tests with shared helper funcs (#6125)
Edenzzzz Feb 12, 2025
014837e
[shardformer] support pipeline for deepseek v3 and optimize lora save…
ver217 Feb 14, 2025
5ff5323
[hotfix] fix zero optim save (#6191)
ver217 Feb 14, 2025
ce0ec40
[checkpointio] fix for async io (#6189)
flybird11111 Feb 14, 2025
d20c8ff
Add GRPO and Support RLVR for PPO (#6186)
YeAnbang Feb 18, 2025
d54642a
[application] add lora sft example (#6192)
ver217 Feb 18, 2025
f8b9e88
[application] Update README (#6196)
TongLi3701 Feb 18, 2025
f73ae55
[application] add lora sft example data (#6198)
ver217 Feb 18, 2025
24dee8f
[doc] DeepSeek V3/R1 news (#6199)
binmakeswell Feb 19, 2025
9379cbd
[release] update version (#6195)
ver217 Feb 20, 2025
0171884
fix inference rebatching bug
YeAnbang Feb 20, 2025
53834b7
fix num_train_step update
YeAnbang Feb 20, 2025
7595c45
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Feb 20, 2025
b9e6055
Merge pull request #6208 from hpcaitech/grpo_dev
YeAnbang Feb 20, 2025
f32861c
[misc] update torch version (#6206)
ver217 Feb 24, 2025
56fe130
[hotfix] fix lora load (#6231)
ver217 Mar 1, 2025
6d676ee
[release] update version (#6236)
ver217 Mar 3, 2025
44d4053
[HotFix] update load lora model Readme; (#6240)
duanjunwen Mar 7, 2025
7ecdf9a
Update README.md (#6268)
Yanjia0 Apr 17, 2025
46ed5d8
[ci] update ci (#6254)
flybird11111 Apr 18, 2025
ddbbbaa
[upgrade]Upgrade transformers (#6320)
flybird11111 May 27, 2025
4271e3d
release
flybird11111 May 27, 2025
4577968
release
flybird11111 May 27, 2025
a9656e2
fix
flybird11111 May 27, 2025
d7a03bf
release
flybird11111 May 27, 2025
d3c40b9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 27, 2025
4afff92
fix
flybird11111 May 28, 2025
d322ff8
Merge pull request #6330 from flybird11111/main
BurkeHulk May 28, 2025
45dd5a7
release
flybird11111 May 29, 2025
cac878d
fix
flybird11111 May 29, 2025
5943843
fix
flybird11111 May 29, 2025
562767c
fix
flybird11111 May 30, 2025
c8aaa92
Update release_test_pypi_before_merge.yml
BurkeHulk May 30, 2025
948533f
fix
flybird11111 May 30, 2025
3f91597
Update release_test_pypi_before_merge.yml
BurkeHulk May 30, 2025
374dcd4
Update release_test_pypi_before_merge.yml
BurkeHulk May 30, 2025
0601023
Update release_pypi_after_merge.yml
BurkeHulk May 30, 2025
6f19618
[fix] fix_lazy_init for deepseek model in transformers
BurkeHulk Jun 2, 2025
fd56b22
Merge pull request #6334 from flybird11111/main
BurkeHulk Jun 2, 2025
c9cba49
fix CI machine tag
BurkeHulk Jun 2, 2025
067dd43
fix pre-commit err
BurkeHulk Jun 2, 2025
b4ec405
Merge pull request #6336 from BurkeHulk/fix/update-test-config
BurkeHulk Jun 3, 2025
6dfedea
Update release_test_pypi_before_merge.yml
BurkeHulk Jun 3, 2025
c4fe9e8
Update release_pypi_after_merge.yml
BurkeHulk Jun 3, 2025
b9535f3
Update version.txt
BurkeHulk Jun 3, 2025
0ba96e8
Update release_test_pypi_before_merge.yml
BurkeHulk Jun 3, 2025
916a8fe
Update release_test_pypi_before_merge.yml
BurkeHulk Jun 3, 2025
043c469
upgrade python
BurkeHulk Jun 3, 2025
91f08c6
upgrade python
BurkeHulk Jun 3, 2025
e00c9bb
upgrade python
BurkeHulk Jun 3, 2025
97f4bee
Merge pull request #6340 from hpcaitech/release/v0.5.0
BurkeHulk Jun 4, 2025
d097224
[feat] support qwen3 in shardformer
botbw Jul 10, 2025
e285eb6
[CI] install flash-attn 2.7.4.post1
botbw Jul 14, 2025
908c634
[CI] disable timm_regnetv_040 as aten::_unique2 is not supproted
botbw Jul 14, 2025
edd65a8
Merge pull request #6362 from hpcaitech/CI/test_build_on_schedule
BurkeHulk Jul 15, 2025
162bb42
[chat] add distributed impl (#6210)
ver217 Feb 21, 2025
7a2d455
[feature] fit RL style generation (#6213)
ver217 Feb 21, 2025
fa1272f
add reward related function
TongLi3701 Feb 23, 2025
40d6018
add simple grpo
TongLi3701 Feb 23, 2025
1f07b71
update grpo
TongLi3701 Feb 25, 2025
718c4b7
polish
TongLi3701 Feb 28, 2025
b7842f8
modify data loader
Mar 6, 2025
5f178a7
grpo consumer
Mar 6, 2025
9754a11
update loss
Mar 6, 2025
f8899dd
update reward fn
Mar 6, 2025
5c75d5b
update example
Mar 6, 2025
cc4cc78
update loader
Mar 6, 2025
1f15dc7
add algo selection
Mar 6, 2025
88eb6e5
add save
Mar 6, 2025
246f16d
update select algo
Mar 6, 2025
f71d422
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 6, 2025
bc538ba
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 6, 2025
fe017d3
update grpo
Mar 10, 2025
c8db826
update reward fn
Mar 10, 2025
a537aa1
update reward
Mar 10, 2025
a4862a2
fix reward score
Mar 11, 2025
b951d0b
add response length
Mar 11, 2025
69a1a32
detach
Mar 11, 2025
b19355f
fix tp bug
Mar 13, 2025
a2ae82a
fix consumer
Mar 13, 2025
30c7ddd
convert to 8 generation
Mar 13, 2025
bfc4582
print results
Mar 13, 2025
e224673
setup update
Mar 13, 2025
35dabd7
fix transformers backend
YeAnbang Mar 14, 2025
4551853
[Feature] Support Distributed LogProb for GRPO Training (#6247)
duanjunwen Mar 18, 2025
f983071
fix vllm
YeAnbang Mar 19, 2025
16e68a0
fix logprob, add filtering, temperature annealing, lr descent
YeAnbang Mar 21, 2025
23aac43
simplify vllm preprocessing input ids
YeAnbang Mar 21, 2025
c627b60
update logging
YeAnbang Mar 21, 2025
12da4d1
[feat] add microbatch forwarding (#6251)
YeAnbang Mar 28, 2025
5d79b9e
[Distributed RLHF] Integration of PP (#6257)
YeAnbang Apr 9, 2025
3bd6fa3
[hot-fix] Fix memory leakage bug, support TP+PP (#6258)
YeAnbang Apr 10, 2025
befd4f1
add prompt template (#6273)
TongLi3701 Apr 22, 2025
b34d707
[feat] Add final save at the end (#6274)
TongLi3701 Apr 23, 2025
5f913e8
[feat] Support DAPO (#6263)
YeAnbang Apr 25, 2025
673682e
fix checkpoint naming; add num_epoch parameter (#6277)
YeAnbang Apr 26, 2025
37a8be7
fix save issue (#6279)
TongLi3701 Apr 27, 2025
fb4e507
fix pp+tp, fix dataloader (#6280)
YeAnbang Apr 28, 2025
e181318
[feat] Support boxed math reward (#6284)
YeAnbang Apr 29, 2025
6a1bd83
[feat] Sync shard model (#6289)
TongLi3701 Apr 30, 2025
16600f3
Support evaluation during training
YeAnbang Apr 30, 2025
de0c267
reuse comm-group
YeAnbang Apr 30, 2025
1be993d
fix bug
YeAnbang Apr 30, 2025
9642b75
upgrade reward math verification
YeAnbang Apr 30, 2025
06b892b
rewrite reward fn
YeAnbang May 1, 2025
9544c51
[fix] revert reward update and evaluation (#6295)
YeAnbang May 7, 2025
4ac7d06
update pad seq (#6303)
TongLi3701 May 13, 2025
af4366f
Support evaluation during training
YeAnbang Apr 30, 2025
3416a4f
move logging to producer
YeAnbang May 14, 2025
5a6e4a6
[feat] Support prompt level dynamic (#6300)
TongLi3701 May 14, 2025
280aa0b
use consumer global step
YeAnbang May 15, 2025
0d0fef7
disable wandb tb syncing
YeAnbang May 15, 2025
f79dbdb
move prompt-level-filtering to buffer side
YeAnbang May 15, 2025
d19f1f2
move prompt-level-filtering to buffer side
YeAnbang May 15, 2025
88f49dd
remove redundant code and fix bugs
YeAnbang May 16, 2025
6ebd813
handle empty index
May 15, 2025
e7f61be
fix evaluation
YeAnbang May 16, 2025
654aefc
address conversation
YeAnbang May 16, 2025
6095274
support logging rollouts to wandb
YeAnbang May 16, 2025
9cbc5dd
upgrade reward functions
YeAnbang May 16, 2025
c7c73df
fix logging rollouts
YeAnbang May 17, 2025
06cfbe3
fix metric calculation
YeAnbang May 20, 2025
70c3daa
add uuid to rollout log
YeAnbang May 20, 2025
5bbfe15
fix empty tensor (#6319)
TongLi3701 May 20, 2025
4b1c515
fix missing tags parameter
YeAnbang May 21, 2025
2a39d3a
address conversation
YeAnbang May 28, 2025
382307a
fix default eval setting (#6321)
TongLi3701 May 22, 2025
6051001
address conversation
YeAnbang May 29, 2025
a246bf2
add overlength sample count (#6332)
TongLi3701 May 28, 2025
8d52441
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 29, 2025
a9a3f37
fix typ and parameter description
YeAnbang Jun 5, 2025
1771447
support code generation tasks
YeAnbang Jun 5, 2025
de40c73
fix bug, tested
YeAnbang Jun 9, 2025
9dbb0ff
remove debug code
YeAnbang Jun 9, 2025
72b2d98
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 9, 2025
6ae54a6
move out evaluation func (#6343)
TongLi3701 Jun 10, 2025
3a4681f
fix pp memory issue (#6344)
TongLi3701 Jun 11, 2025
3b3c48d
Manually schedule resources and support auto master address assigning
YeAnbang Jun 10, 2025
6a0b809
modify readme
YeAnbang Jun 10, 2025
79a7b99
update readme
YeAnbang Jun 10, 2025
80c576f
add ray timeout handling instruction
YeAnbang Jun 10, 2025
73384be
Update README.md
YeAnbang Jun 12, 2025
0f71c79
fix num_update_per_episode
YeAnbang Jun 12, 2025
a960990
optimize pp log_softmax OOM
YeAnbang Jun 13, 2025
245c8c2
implement memory efficient logprob
YeAnbang Jun 18, 2025
b314da1
fix small bug
YeAnbang Jun 19, 2025
685e0bd
add dp rank for multi-dp (#6351)
TongLi3701 Jun 19, 2025
594c2c6
[feat[ Support one-behind to reduce bubble time. Add profiling code (…
YeAnbang Jun 30, 2025
352a8e0
fix code evaluation
YeAnbang Jul 14, 2025
eafbc89
fix style
YeAnbang Jul 14, 2025
3d9dd34
add entropy (#6363)
YeAnbang Jul 17, 2025
c782976
hotfix entropy calculation (#6364)
YeAnbang Jul 22, 2025
118a66f
[Fix] Add L2 Regularization (#6372)
YeAnbang Jul 29, 2025
3746f73
fix missing or wrong file during rebase
YeAnbang Aug 5, 2025
32b2148
tested after rebasing, fix importance sampling bug
YeAnbang Aug 6, 2025
08a1244
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 6, 2025
e589ec5
support resume training
YeAnbang Aug 12, 2025
b6a5f67
reduce memory consumption
BurkeHulk Aug 13, 2025
9db9892
reduce memory consumption
BurkeHulk Aug 13, 2025
c83dc66
Update timeout
BurkeHulk Aug 14, 2025
94e972f
Update timeout
BurkeHulk Aug 14, 2025
bbc5fb4
fix ci
YeAnbang Aug 14, 2025
762150c
fix ci
YeAnbang Aug 14, 2025
99ba48f
Merge branch 'grpo-latest-rebase-main' of https://github.com/hpcaitec…
YeAnbang Aug 14, 2025
73bdfd8
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 14, 2025
4152c0b
fix dist log prob test
YeAnbang Aug 15, 2025
fe1f429
Merge branch 'grpo-latest-rebase-main' of https://github.com/hpcaitec…
YeAnbang Aug 15, 2025
b38248d
Merge pull request #6376 from hpcaitech/grpo-latest-rebase-main
BurkeHulk Aug 15, 2025
4ac2227
Merge pull request #6378 from hpcaitech/grpo-latest-rebase-fix-resume
YeAnbang Aug 18, 2025
48a673d
[Ring Attention] Add more detailed references (#6294)
Edenzzzz Aug 26, 2025
083766d
Add new implementations of RL algorithms (#6383)
sglucas Sep 3, 2025
e5fdefa
update B200 info/img/benchmark (#6385)
Yanjia0 Sep 26, 2025
b47b610
add code for zero-bubble implementation
YeAnbang Jul 9, 2025
dba0c0c
fix code evaluation
YeAnbang Jul 14, 2025
ddda79c
add entropy
YeAnbang Jul 16, 2025
2336d7f
fix racing condition
YeAnbang Jul 21, 2025
c865de3
cherry pick zero bubble RL
YeAnbang Nov 6, 2025
40b6a91
all tests passed
YeAnbang Nov 7, 2025
6f7e859
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 7, 2025
535eba8
update readme
YeAnbang Nov 7, 2025
4c53210
Merge branch 'grpo-zero-bubble-rebase' of https://github.com/hpcaitec…
YeAnbang Nov 7, 2025
1b65963
fix readme
YeAnbang Nov 10, 2025
7f91b7e
fix ci; specify flash-attn version
YeAnbang Nov 11, 2025
eb158eb
fix ci; remove test cases that failed on 3080 (those with tps), can p…
YeAnbang Nov 12, 2025
b1915d2
Merge pull request #6391 from hpcaitech/grpo-zero-bubble-rebase
YeAnbang Nov 13, 2025
85ad738
[doc] Update README.md (#6410)
Yanjia0 Apr 9, 2026
063b379
Update README.md (#6411)
Yanjia0 Apr 9, 2026
4f9953b
Update README.md (#6412)
Yanjia0 Apr 9, 2026
3 changes: 3 additions & 0 deletions .compatibility
@@ -0,0 +1,3 @@
2.3.0-12.1.0
2.4.0-12.4.1
2.5.1-12.4.1
4 changes: 4 additions & 0 deletions .coveragerc
@@ -0,0 +1,4 @@
[run]
concurrency = multiprocessing
parallel = true
sigterm = true
12 changes: 12 additions & 0 deletions .cuda_ext.json
@@ -0,0 +1,12 @@
{
"build": [
{
"torch_command": "pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121",
"cuda_image": "image-cloud.luchentech.com/hpcaitech/cuda-conda:12.1"
},
{
"torch_command": "pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124",
"cuda_image": "image-cloud.luchentech.com/hpcaitech/cuda-conda:12.4"
}
]
}
22 changes: 0 additions & 22 deletions .flake8

This file was deleted.

1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -0,0 +1 @@
* @hpcaitech/colossalai-qa
29 changes: 29 additions & 0 deletions .github/ISSUE_TEMPLATE/bug-report.yml
@@ -8,6 +8,33 @@ body:
attributes:
value: >
#### Not suitable for your needs? [Open a blank issue](https://github.com/hpcaitech/ColossalAI/issues/new).
- type: checkboxes
attributes:
label: Is there an existing issue for this bug?
description: Please search [here](https://github.com/hpcaitech/ColossalAI/issues) to see if an open or closed issue already exists for the bug you have encountered.
options:
- label: I have searched the existing issues
required: true

- type: checkboxes
attributes:
label: The bug has not been fixed in the latest main branch
options:
- label: I have checked the latest main branch
required: true

- type: dropdown
id: share_script
attributes:
label: Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
description: If not, please share your setting/training config, and/or point to the line in the repo that throws the error.
If the issue is not easily reproducible by us, it will reduce the likelihood of getting responses.
options:
- Yes, I will share a minimal reproducible script.
- No, I prefer not to share.
validations:
required: true

- type: textarea
attributes:
label: 🐛 Describe the bug
@@ -20,6 +47,8 @@ body:
A clear and concise description of what you expected to happen.
**Screenshots**
If applicable, add screenshots to help explain your problem.
**Optional: Affiliation**
Institution/email information helps us better understand our users and improve the project. We welcome in-depth cooperation.
placeholder: |
A clear and concise description of what the bug is.
validations:
4 changes: 2 additions & 2 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -1,11 +1,11 @@
blank_issues_enabled: true
contact_links:
- name: ❓ Simple question - Slack Chat
url: https://join.slack.com/t/colossalaiworkspace/shared_invite/zt-z7b26eeb-CBp7jouvu~r0~lcFzX832w
url: https://github.com/hpcaitech/public_assets/tree/main/colossalai/contact/slack
about: This issue tracker is not for technical support. Please use our Slack chat, and ask the community for help.
- name: ❓ Simple question - WeChat
url: https://github.com/hpcaitech/ColossalAI/blob/main/docs/images/WeChat.png
about: This issue tracker is not for technical support. Please use WeChat, and ask the community for help.
- name: 😊 Advanced question - GitHub Discussions
url: https://github.com/hpcaitech/ColossalAI/discussions
about: Use GitHub Discussions for advanced and unanswered technical questions, requiring a maintainer's answer.
about: Use GitHub Discussions for advanced and unanswered technical questions, requiring a maintainer's answer.
1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/documentation.yml
@@ -17,6 +17,7 @@ body:
**Expectation** What is your expected content about it?
**Screenshots** If applicable, add screenshots to help explain your problem.
**Suggestions** Tell us how we could improve the documentation.
**Optional: Affiliation** Institution/email information helps us better understand our users and improve the project. We welcome in-depth cooperation.
placeholder: |
A clear and concise description of the issue.
validations:
2 changes: 2 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.yml
@@ -22,6 +22,8 @@ body:
If applicable, add screenshots to help explain your problem.
**Suggest a potential alternative/fix**
Tell us how we could improve this project.
**Optional: Affiliation**
Institution/email information helps us better understand our users and improve the project. We welcome in-depth cooperation.
placeholder: |
A clear and concise description of your idea.
validations:
3 changes: 2 additions & 1 deletion .github/ISSUE_TEMPLATE/proposal.yml
@@ -13,6 +13,7 @@ body:
- Bumping a critical dependency's major version;
- A significant improvement in user-friendliness;
- Significant refactor;
- Optional: Affiliation/email information helps us better understand our users and improve the project. We welcome in-depth cooperation.
- ...

Please note this is not for feature request or bug template; such action could make us identify the issue wrongly and close it without doing anything.
@@ -43,4 +44,4 @@ body:
- type: markdown
attributes:
value: >
Thanks for contributing 🎉!
Thanks for contributing 🎉!
37 changes: 37 additions & 0 deletions .github/pull_request_template.md
@@ -0,0 +1,37 @@
## 📌 Checklist before creating the PR

- [ ] I have created an issue for this PR for traceability
- [ ] The title follows the standard format: `[doc/gemini/tensor/...]: A concise description`
- [ ] I have added relevant tags if possible for us to better distinguish different PRs
- [ ] I have installed pre-commit: `pip install pre-commit && pre-commit install`


## 🚨 Issue number

> Link this PR to your issue with words like fixed to automatically close the linked issue upon merge
>
> e.g. `fixed #1234`, `closed #1234`, `resolved #1234`



## 📝 What does this PR do?

> Summarize your work here.
> If you have any plots/diagrams/screenshots/tables, please attach them here.



## 💥 Checklist before requesting a review

- [ ] I have linked my PR to an issue ([instruction](https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue))
- [ ] My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
- [ ] I have performed a self-review of my code
- [ ] I have added thorough tests.
- [ ] I have added docstrings for all the functions/methods I implemented

## ⭐️ Do you enjoy contributing to Colossal-AI?

- [ ] 🌝 Yes, I do.
- [ ] 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.
9 changes: 0 additions & 9 deletions .github/reviewer_list.yml

This file was deleted.

165 changes: 165 additions & 0 deletions .github/workflows/README.md
@@ -0,0 +1,165 @@
# CI/CD

## Table of Contents

- [CI/CD](#cicd)
- [Table of Contents](#table-of-contents)
- [Overview](#overview)
- [Workflows](#workflows)
- [Code Style Check](#code-style-check)
- [Unit Test](#unit-test)
- [Example Test](#example-test)
- [Example Test on Dispatch](#example-test-on-dispatch)
- [Compatibility Test](#compatibility-test)
- [Compatibility Test on Dispatch](#compatibility-test-on-dispatch)
- [Release](#release)
- [User Friendliness](#user-friendliness)
- [Community](#community)
- [Configuration](#configuration)
- [Progress Log](#progress-log)

## Overview

Automation makes our development more efficient because machines automatically run the pre-defined tasks for contributors.
This saves a lot of manual work and allows developers to focus fully on features and bug fixes.
In Colossal-AI, we use [GitHub Actions](https://github.com/features/actions) to automate a wide range of workflows to ensure the robustness of the software.
The sections below describe each of the available workflows.

## Workflows

Refer to this [documentation](https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow) to learn how to manually trigger a workflow.
The details of each workflow are provided below.

**In the following context, a PR that changes `version.txt` is considered a release PR.**


### Code Style Check

| Workflow Name | File name | Description |
| ------------- | ----------------- | -------------------------------------------------------------------------------------------------------------- |
| `post-commit` | `post_commit.yml` | This workflow runs pre-commit checks for changed files to achieve code style consistency after a PR is merged. |

### Unit Test

| Workflow Name | File name | Description |
| ---------------------- | -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Build on PR` | `build_on_pr.yml` | This workflow is triggered when a PR changes essential files or when a branch is created/deleted. It runs all the unit tests in the repository with 4 GPUs. |
| `Build on Schedule` | `build_on_schedule.yml` | This workflow runs the unit tests every day with 8 GPUs. The result is sent to Lark. |
| `Report test coverage` | `report_test_coverage.yml` | This workflow posts a comment on the PR to report the test coverage results when `Build` is done. |

To reduce the average unit test time on PRs, the `Build on PR` workflow manages a testmon cache.

1. When creating a new branch, it copies `cache/main/.testmondata*` to `cache/<branch>/`.
2. When creating a new PR or changing the base branch of a PR, it copies `cache/<base_ref>/.testmondata*` to `cache/_pull/<pr_number>/`.
3. When running unit tests for each PR, it restores testmon cache from `cache/_pull/<pr_number>/`. After the test, it stores the cache back to `cache/_pull/<pr_number>/`.
4. When a PR is closed, if it's merged, it copies `cache/_pull/<pr_number>/.testmondata*` to `cache/<base_ref>/`. Otherwise, it just removes `cache/_pull/<pr_number>`.
5. When a branch is deleted, it removes `cache/<ref>`.
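
The cache lifecycle above can be sketched as a small path-mapping function. This is only an illustration under stated assumptions: the event names and the `plan_cache_action` helper are hypothetical and are not taken from `build_on_pr.yml`. Only the `cache/...` path layout comes from the steps listed above.

```python
from typing import List, Optional, Tuple

# Each action is (operation, path, destination); destination is None when unused.
Action = Tuple[str, str, Optional[str]]


def plan_cache_action(
    event: str,
    *,
    ref: str = "",
    base_ref: str = "",
    pr_number: int = 0,
    merged: bool = False,
) -> List[Action]:
    """Map a (hypothetical) event name to the testmon cache operations described above."""
    pr_dir = f"cache/_pull/{pr_number}/"
    if event == "branch_created":  # step 1
        return [("copy", "cache/main/.testmondata*", f"cache/{ref}/")]
    if event in ("pr_opened", "pr_base_changed"):  # step 2
        return [("copy", f"cache/{base_ref}/.testmondata*", pr_dir)]
    if event == "pr_test":  # step 3: restore before the tests, store back afterwards
        return [("restore", pr_dir, None), ("store", pr_dir, None)]
    if event == "pr_closed":  # step 4
        if merged:
            return [("copy", f"{pr_dir}.testmondata*", f"cache/{base_ref}/")]
        return [("remove", pr_dir.rstrip("/"), None)]
    if event == "branch_deleted":  # step 5
        return [("remove", f"cache/{ref}", None)]
    raise ValueError(f"unknown event: {event}")


if __name__ == "__main__":
    for action in plan_cache_action("pr_closed", base_ref="main", pr_number=42, merged=True):
        print(action)
```

Returning a plan instead of touching the filesystem keeps the sketch easy to test. A real workflow would perform the copies and removals on the runner's cache directory.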

### Example Test

| Workflow Name | File name | Description |
| -------------------------- | ------------------------------- | ------------------------------------------------------------------------------ |
| `Test example on PR` | `example_check_on_pr.yml` | The example will be automatically tested if its files are changed in the PR |
| `Test example on Schedule` | `example_check_on_schedule.yml` | This workflow will test all examples every Sunday. The result is sent to Lark. |
| `Example Test on Dispatch` | `example_check_on_dispatch.yml` | Manually test a specified example. |

#### Example Test on Dispatch

This workflow is triggered by manually dispatching the workflow. It has the following input parameters:
- `example_directory`: the example directory to test. Multiple directories are supported and must be separated by commas, for example `language/gpt, images/vit`. Entering only `language` or only `gpt` does not work.
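
The input format can be illustrated with a short validation sketch. The `parse_example_directories` helper is hypothetical and is not part of the workflow. It assumes every entry needs at least a `<category>/<name>` pair, which is inferred from the statement that a bare `language` or `gpt` does not work.

```python
from typing import List


def parse_example_directories(raw: str) -> List[str]:
    """Split a comma-separated input and check that each entry looks like '<category>/<name>'."""
    directories = [item.strip() for item in raw.split(",") if item.strip()]
    if not directories:
        raise ValueError("no example directory given")
    for directory in directories:
        parts = directory.split("/")
        # Assumption: at least two non-empty path components are required.
        if len(parts) < 2 or not all(parts):
            raise ValueError(f"expected '<category>/<name>', got {directory!r}")
    return directories


if __name__ == "__main__":
    print(parse_example_directories("language/gpt, images/vit"))
```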

### Compatibility Test

| Workflow Name | File name | Description |
| -------------------------------- | ------------------------------------ | -------------------------------------------------------------------------------------------------------------------- |
| `Compatibility Test on PR` | `compatibility_test_on_pr.yml` | Check Colossal-AI's compatibility when `version.txt` is changed in a PR. |
| `Compatibility Test on Schedule` | `compatibility_test_on_schedule.yml` | This workflow will check the compatibility of Colossal-AI against PyTorch specified in `.compatibility` every Sunday. |
| `Compatibility Test on Dispatch` | `compatibility_test_on_dispatch.yml` | Test PyTorch Compatibility manually. |
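
As a rough illustration of how the `.compatibility` file could be consumed, the sketch below splits each `${torch-version}-${cuda-version}` line into a pair. The `parse_compatibility` helper is hypothetical and not taken from the workflow files. The sample lines are the three entries of `.compatibility` in this change set.

```python
from typing import List, Tuple


def parse_compatibility(text: str) -> List[Tuple[str, str]]:
    """Parse lines of the form '<torch-version>-<cuda-version>' into (torch, cuda) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        torch_version, sep, cuda_version = line.partition("-")
        if not sep or not torch_version or not cuda_version:
            raise ValueError(f"malformed line: {line!r}")
        pairs.append((torch_version, cuda_version))
    return pairs


SAMPLE = "2.3.0-12.1.0\n2.4.0-12.4.1\n2.5.1-12.4.1\n"

if __name__ == "__main__":
    for torch_version, cuda_version in parse_compatibility(SAMPLE):
        print(f"torch {torch_version} with CUDA {cuda_version}")
```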


#### Compatibility Test on Dispatch
This workflow is triggered by manually dispatching the workflow. It has the following input parameters:
- `torch version`: torch versions to test against. Multiple versions are supported but must be separated by commas. The default value is `all`, which tests all available torch versions listed in this [repository](https://github.com/hpcaitech/public_assets/tree/main/colossalai/torch_build/torch_wheels).
- `cuda version`: CUDA versions to test against. Multiple versions are supported but must be separated by commas. The CUDA versions must be present in our [DockerHub repository](https://hub.docker.com/r/hpcaitech/cuda-conda).

> It only tests the compatibility of the main branch.


### Release

| Workflow Name | File name | Description |
| ----------------------------------------------- | ------------------------------------------- | ------------------------------------------------------------------------------------------------------------- |
| `Draft GitHub Release Post` | `draft_github_release_post_after_merge.yml` | Compose a GitHub release post draft based on the commit history when a release PR is merged. |
| `Publish to PyPI` | `release_pypi_after_merge.yml` | Build and release the wheel to PyPI when a release PR is merged. The result is sent to Lark. |
| `Publish Nightly Version to PyPI` | `release_nightly_on_schedule.yml` | Build and release the nightly wheel to PyPI as `colossalai-nightly` every Sunday. The result is sent to Lark. |
| `Publish Docker Image to DockerHub after Merge` | `release_docker_after_merge.yml` | Build and release the Docker image to DockerHub when a release PR is merged. The result is sent to Lark. |
| `Check CUDA Extension Build Before Merge` | `cuda_ext_check_before_merge.yml` | Build CUDA extensions with different CUDA versions when a release PR is created. |
| `Publish to Test-PyPI Before Merge` | `release_test_pypi_before_merge.yml` | Release to test-pypi to simulate user installation when a release PR is created. |


### User Friendliness

| Workflow Name | File name | Description |
| ----------------------- | ----------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| `issue-translate` | `translate_comment.yml` | This workflow is triggered when a new issue comment is created. The comment will be translated into English if not written in English. |
| `Synchronize submodule` | `submodule.yml` | This workflow will check if any git submodule is updated. If so, it will create a PR to update the submodule pointers. |
| `Close inactive issues` | `close_inactive.yml` | This workflow closes issues that have been stale for 14 days. |

### Community

| Workflow Name | File name | Description |
| -------------------------------------------- | -------------------------------- | -------------------------------------------------------------------------------- |
| `Generate Community Report and Send to Lark` | `report_leaderboard_to_lark.yml` | Collect contribution and user engagement stats and share with Lark every Friday. |

## Configuration

This section lists the files used to configure the workflows.

1. `.compatibility`

The `.compatibility` file tells GitHub Actions which PyTorch and CUDA versions to test against. Each line has the format `${torch-version}-${cuda-version}`, which is a Docker image tag. This tag must therefore be present in the [docker registry](https://hub.docker.com/r/pytorch/conda-cuda) for the test to run.

2. `.cuda_ext.json`

This file controls which CUDA versions the CUDA extension build is checked against. You can add a new entry following the JSON schema below to check the AOT build of PyTorch extensions before a release.

```json
{
  "build": [
    {
      "torch_command": "",
      "cuda_image": ""
    }
  ]
}
```
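
Before adding an entry, it may help to check that the file still parses and that every entry carries both keys. The sketch below is one way to do that. The `load_build_entries` helper is hypothetical and not part of the repository. The key names `build`, `torch_command`, and `cuda_image` come from the schema above, and the sample values are shortened placeholders.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("torch_command", "cuda_image")


def load_build_entries(raw: str) -> List[Dict[str, str]]:
    """Parse the JSON text and check that each build entry has non-empty required keys."""
    config = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    entries = config.get("build") if isinstance(config, dict) else None
    if not isinstance(entries, list) or not entries:
        raise ValueError("'build' must be a non-empty list")
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"build[{index}] must be an object")
        missing = [key for key in REQUIRED_KEYS if not entry.get(key)]
        if missing:
            raise ValueError(f"build[{index}] is missing {missing}")
    return entries


if __name__ == "__main__":
    sample = '{"build": [{"torch_command": "pip install torch==2.3.0", "cuda_image": "cuda-conda:12.1"}]}'
    for entry in load_build_entries(sample):
        print(entry["cuda_image"], "->", entry["torch_command"])
```

Python's `json` module follows strict JSON, so a comma after the last entry in the `build` list is reported as a parse error.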

## Progress Log

- [x] Code style check
- [x] post-commit check
- [x] unit testing
- [x] test on PR
- [x] report test coverage
- [x] regular test
- [x] release
- [x] pypi release
- [x] test-pypi simulation
- [x] nightly build
- [x] docker build
- [x] draft release post
- [x] example check
- [x] check on PR
- [x] regular check
- [x] manual dispatch
- [x] compatibility check
- [x] check on PR
- [x] manual dispatch
- [x] auto test when release
- [x] community
- [x] contribution report
- [x] user engagement report
- [x] helpers
- [x] comment translation
- [x] submodule update
- [x] close inactive issue
18 changes: 0 additions & 18 deletions .github/workflows/assign_reviewer.yml

This file was deleted.
