
Add SGLang scheduler to SimAI Vidur: chunked prefill + RadixAttention prefix caching#223

Draft
Copilot wants to merge 2 commits into master from copilot/merge-sglang-for-simulation


Copilot AI commented Feb 28, 2026

SimAI had no way to simulate SGLang's runtime scheduling behavior. This adds a first-class sglang replica scheduler to the Vidur inference simulator that models SGLang's two core performance features.

Changes

New: SglangReplicaScheduler

  • Chunked prefill – identical semantics to Sarathi-Serve; breaks long prompts into chunk_size-token chunks interleaved with decode iterations
  • RadixAttention prefix caching – parameterized via prefix_cache_hit_rate (0.0–1.0):
    • Reduces KV-block allocation: only ceil((1 − r) × prefill_tokens / block_size) fresh blocks are allocated per request
    • Fast-forwards num_processed_tokens past the cached portion in the first iteration, reducing the number of prefill chunks proportionally
    • Decode-phase block tracking accounts for the "virtual" cached token capacity
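The fresh-block allocation rule above can be sketched in a few lines of Python. This is illustrative arithmetic only, not the actual Vidur scheduler internals; the function name and signature are hypothetical:

```python
import math

def fresh_blocks_needed(prefill_tokens: int, hit_rate: float, block_size: int) -> int:
    """KV blocks that must be newly allocated when a fraction `hit_rate`
    of the prompt's KV cache is served from the radix-tree prefix cache."""
    uncached_tokens = (1.0 - hit_rate) * prefill_tokens
    return math.ceil(uncached_tokens / block_size)

# 2048-token prompt, 16-token blocks, 75% prefix hit rate:
# only ceil(0.25 * 2048 / 16) = 32 fresh blocks instead of 128.
print(fresh_blocks_needed(2048, 0.75, 16))  # -> 32
print(fresh_blocks_needed(2048, 0.0, 16))   # -> 128
```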

Config: SglangSchedulerConfig

New dataclass registered under ReplicaSchedulerType.SGLANG = 7:

| Field | Default | Purpose |
| --- | --- | --- |
| `chunk_size` | 512 | Prefill chunk size (tokens) |
| `enable_prefix_caching` | `True` | RadixAttention toggle |
| `prefix_cache_hit_rate` | 0.0 | Fraction of prefill tokens served from cache |
| `max_tokens_in_batch` | 4096 | Batch token budget |
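A minimal sketch of what such a config dataclass might look like, with defaults taken from the table above. The validation in `__post_init__` and the exact base class/registration hook are assumptions, not the actual Vidur code:

```python
from dataclasses import dataclass

@dataclass
class SglangSchedulerConfig:
    # Defaults mirror the table above.
    chunk_size: int = 512                # prefill chunk size in tokens
    enable_prefix_caching: bool = True   # RadixAttention toggle
    prefix_cache_hit_rate: float = 0.0   # fraction of prefill tokens served from cache
    max_tokens_in_batch: int = 4096      # per-iteration token budget

    def __post_init__(self) -> None:
        # Hypothetical sanity check: the hit rate is a fraction in [0, 1].
        if not 0.0 <= self.prefix_cache_hit_rate <= 1.0:
            raise ValueError("prefix_cache_hit_rate must be in [0.0, 1.0]")
```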

Example

python -m vidur.main \
  --replica_config_model_name meta-llama/Meta-Llama-3-8B \
  --replica_scheduler_config_type sglang \
  --sglang_scheduler_config_chunk_size 512 \
  --sglang_scheduler_config_enable_prefix_caching \
  --sglang_scheduler_config_prefix_cache_hit_rate 0.7 \
  --sglang_scheduler_config_max_tokens_in_batch 4096 \
  ...

Choosing prefix_cache_hit_rate: 0.0 for random prompts, 0.3–0.5 for few-shot workloads, 0.7–0.95 for workloads with long shared system prompts. See README-vidur.md for the full guidance table.
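As a worked example of how the hit rate shrinks the prefill schedule via the fast-forward described above (illustrative arithmetic, not the simulator's actual code path):

```python
import math

def prefill_chunks(prefill_tokens: int, chunk_size: int, hit_rate: float = 0.0) -> int:
    """Number of chunked-prefill iterations after fast-forwarding
    num_processed_tokens past the cached prefix."""
    cached = int(hit_rate * prefill_tokens)
    remaining = prefill_tokens - cached
    return math.ceil(remaining / chunk_size)

# 8192-token prompt with 512-token chunks:
print(prefill_chunks(8192, 512))       # no caching -> 16 chunks
print(prefill_chunks(8192, 512, 0.7))  # 70% hit rate -> 5 chunks
```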

Notes

  • Since the simulator operates at token-count granularity (not actual token values), exact radix-tree prefix matching is approximated by the hit-rate parameter. The memory savings are modeled correctly; execution time for the first prefill chunk is conservatively overestimated by the cached-token portion.
  • README.md and README-vidur.md updated with CLI reference and usage guidance.



Co-authored-by: tianhao909 <48342395+tianhao909@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Integrate sglang for seamless simulation" to "Add SGLang scheduler to SimAI Vidur: chunked prefill + RadixAttention prefix caching" on Feb 28, 2026.
