feat(chunked-prefill): add chunked-prefill for multi gpu and mac #429

Open
wasamtc wants to merge 60 commits into GradientHQ:main from wasamtc:tangcong/mac_chunk_v3

Conversation

wasamtc (Collaborator) commented Feb 4, 2026

📋 PR Title Format

The PR title should follow the format:

type(scope): concise message (max 50 chars)

Where:

  • type is one of: feat, fix, docs, refactor, perf, test, chore.
  • scope is optional and describes the part of the codebase affected (e.g., auth, ui, api).
  • concise message is a short description of the change (max 50 chars).

📝 Change Type

Please select the type of change this PR introduces (choose one or more):

  • feat: New feature.
  • fix: Bug fix.
  • docs: Documentation only changes.
  • refactor: A code change that neither fixes a bug nor adds a feature.
  • perf: Performance improvement.
  • test: Adding missing tests or correcting existing tests.
  • chore: Maintenance tasks (e.g., updating dependencies).

💡 Description

Briefly describe the change, its purpose, and the problem it solves.

Key Changes

🔗 Related Issues

List any issues this PR closes or relates to:

✅ Checklist

Please ensure the following points are addressed before merging:

  • I have performed a self-review of my own code.
  • I have added/updated tests that prove my fix or feature works (if applicable).
  • I have updated the documentation (if necessary).
  • My code follows the project's style guidelines.

wasamtc and others added 30 commits January 14, 2026 09:24
The workflow has been disabled, preventing any actions from being performed.
…se send data use chunked_reqs+forward_reqs
wasamtc and others added 26 commits February 1, 2026 02:29
Re-enable scheduled builds with a cron job.
wasamtc requested a review from a team February 4, 2026 00:49

wasamtc (Collaborator, Author) commented Feb 4, 2026

You can enable chunked prefill by passing the following flags on the command line:

--enable-prefix-cache --chunked-prefill-size=xxx
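
As a rough illustration of what the chunk size controls (a minimal sketch, not the code in this PR): the prompt is assumed to be split into consecutive slices of at most `chunk_size` tokens, each prefilled in its own forward pass. The function name `split_prefill` is hypothetical and exists only for this example.

```python
from typing import List, Sequence


def split_prefill(token_ids: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split a prompt into consecutive chunks of at most chunk_size tokens.

    Hypothetical helper for illustration; the PR's scheduler may differ.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        list(token_ids[start:start + chunk_size])
        for start in range(0, len(token_ids), chunk_size)
    ]


prompt = list(range(2000))  # stand-in for a 2,000-token prompt
chunks = split_prefill(prompt, 512)
print(len(chunks), [len(c) for c in chunks])  # 4 [512, 512, 512, 464]
```

With a 2,000-token prompt, a chunk size of 512 gives four forward passes, the last one carrying the remaining 464 tokens.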

wasamtc (Collaborator, Author) commented Feb 7, 2026

I measured time-to-first-token (TTFT) latency for different chunked-prefill sizes. The test nodes were a Mac Mini (64 GB) and an RTX 4090 (24 GB), the model was GPT-OSS-20B, and the input prompt was 2,000 tokens long. The results are as follows:

| chunked-prefill-size | TTFT (ms) |
| --- | --- |
| none | 726.93 |
| 128 | 908.35 |
| 256 | 752.23 |
| 512 | 689.30 |
| 1024 | 739.05 |
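
To make the comparison easier to read, the sketch below computes each setting's TTFT change relative to the run without chunking. It uses only the figures reported above; the variable and function names are illustrative, and since each figure appears to be a single reported value, small differences should be read with caution.

```python
# TTFT figures copied from the results above (milliseconds).
baseline_ms = 726.93  # chunked prefill disabled ("none")
ttft_ms = {128: 908.35, 256: 752.23, 512: 689.30, 1024: 739.05}


def relative_change(value: float, baseline: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


changes = {
    size: round(relative_change(ms, baseline_ms), 1)
    for size, ms in ttft_ms.items()
}
best_size = min(ttft_ms, key=ttft_ms.get)

for size, pct in changes.items():
    print(f"chunk size {size}: {pct:+.1f}% vs. no chunking")
print("lowest TTFT at chunk size", best_size)
```

In this data, 512 is the only chunk size with a lower TTFT than the unchunked run (about 5.2% lower), while 128 is about 25% slower.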

Development

Successfully merging this pull request may close these issues.

[Feature]: Support chunked prefill
