A tiny LLM runtime microcore for KV cache, token budget, and batch scheduling.
LLMSched is not a full inference engine. It is a scheduler simulator that captures the core resource control plane of LLM serving: who gets KV cache, how many tokens to allocate, when to batch, and when to reject.
The premise is inference resource scheduling authority: whoever controls token allocation, KV cache leases, and batch merging controls inference performance.
No real model inference. Pure simulator:
```
requests.jsonl
      ↓
token budget
      ↓
KV cache allocation
      ↓
batch scheduling
      ↓
mock decode step
      ↓
trace.jsonl + scheduling report
```
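Since no model runs, the decode step only burns token budget. As a minimal sketch of what such a mock step could look like (all names here are illustrative, not the crate's actual API):

```rust
// Hypothetical mock decode step: no model runs, each tick just consumes
// one token of every batched request's budget. Illustrative only.
struct Request {
    id: u64,
    tokens_remaining: u32,
}

/// Advance every request in the batch by one "decoded" token and
/// return the ids of requests that have finished.
fn mock_decode_step(batch: &mut Vec<Request>) -> Vec<u64> {
    let mut finished = Vec::new();
    for req in batch.iter_mut() {
        req.tokens_remaining = req.tokens_remaining.saturating_sub(1);
        if req.tokens_remaining == 0 {
            finished.push(req.id);
        }
    }
    // Drop finished requests so their slots can be rebatched next tick.
    batch.retain(|r| r.tokens_remaining > 0);
    finished
}

fn main() {
    let mut batch = vec![
        Request { id: 1, tokens_remaining: 2 },
        Request { id: 2, tokens_remaining: 1 },
    ];
    assert_eq!(mock_decode_step(&mut batch), vec![2]);
    assert_eq!(mock_decode_step(&mut batch), vec![1]);
}
```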
Run the simulator:

```bash
cargo run -- run --requests examples/requests.jsonl
```

- Token Budget — global and per-request token allocation with overflow protection
- KV Cache Lease — allocate/free/evict semantics for GPU memory pages (a minimal sketch follows this list)
- Batch Scheduler — priority queue + deadline-aware continuous batching
- Telemetry — structured trace output (JSONL) for downstream training
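To make the lease semantics concrete, here is a minimal sketch of an allocate/free/evict interface over a fixed page pool; the types and methods are assumptions for illustration, not the crate's real API:

```rust
// Illustrative page-lease bookkeeping for KV cache: a fixed pool of GPU
// memory pages is leased to requests and reclaimed on free or evict.
use std::collections::HashMap;

struct KvPool {
    free_pages: Vec<u32>,
    leases: HashMap<u64, Vec<u32>>, // request id -> leased page ids
}

impl KvPool {
    fn new(total_pages: u32) -> Self {
        Self { free_pages: (0..total_pages).collect(), leases: HashMap::new() }
    }

    /// Lease `n` pages to a request, or fail with no partial allocation.
    fn allocate(&mut self, req: u64, n: usize) -> Result<(), &'static str> {
        if self.free_pages.len() < n {
            return Err("kv pool exhausted");
        }
        let pages = self.free_pages.split_off(self.free_pages.len() - n);
        self.leases.entry(req).or_default().extend(pages);
        Ok(())
    }

    /// Return all of a request's pages to the free pool.
    fn free(&mut self, req: u64) {
        if let Some(pages) = self.leases.remove(&req) {
            self.free_pages.extend(pages);
        }
    }

    /// Evict a victim request (chosen by the caller) and report how many
    /// pages were reclaimed.
    fn evict(&mut self, victim: u64) -> usize {
        let n = self.leases.get(&victim).map_or(0, Vec::len);
        self.free(victim);
        n
    }
}

fn main() {
    let mut pool = KvPool::new(4);
    pool.allocate(7, 3).unwrap();
    assert!(pool.allocate(8, 2).is_err()); // only 1 page left
    assert_eq!(pool.evict(7), 3);
    pool.allocate(8, 2).unwrap();
}
```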
| Input | Output |
|---|---|
| plan.json (from Apeinx-IR) | trace.jsonl (to ApexTrain-Core) |
| requests.jsonl | scheduling report (Markdown) |
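The exact record schema is defined by the crate; purely as an illustration, a requests.jsonl line could carry fields like these (hypothetical names, serialized with serde and serde_json):

```rust
// Hypothetical shape of one requests.jsonl record; field names are
// assumptions for illustration, not the crate's actual schema.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct RequestRecord {
    id: u64,
    prompt_tokens: u32,  // tokens already in the prompt
    max_new_tokens: u32, // per-request token budget
    priority: u8,        // scheduler priority class
    deadline_ms: u64,    // deadline for deadline-aware batching
}

fn main() -> Result<(), serde_json::Error> {
    let rec = RequestRecord {
        id: 1,
        prompt_tokens: 128,
        max_new_tokens: 256,
        priority: 0,
        deadline_ms: 500,
    };
    // One JSON object per line is the JSONL convention used throughout.
    println!("{}", serde_json::to_string(&rec)?);
    Ok(())
}
```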
| Layer | Choice |
|---|---|
| Control logic | Rust |
| CLI | clap |
| Config | TOML |
| Trace format | JSONL |
| Underlying kernels | C ABI → KernelLab |
| Logging | tracing |
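Since configuration is TOML, a hedged sketch of loading it, assuming the serde and toml crates; every key below is invented for illustration, not a documented option:

```rust
// Hypothetical scheduler config deserialized from TOML; the keys are
// illustrative, not the crate's real configuration surface.
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct SchedConfig {
    global_token_budget: u64, // hard cap across all in-flight requests
    kv_pages: u32,            // size of the KV cache page pool
    max_batch_size: usize,    // upper bound for continuous batching
}

fn main() {
    let raw = r#"
        global_token_budget = 65536
        kv_pages = 1024
        max_batch_size = 32
    "#;
    let cfg: SchedConfig = toml::from_str(raw).expect("invalid config");
    println!("{cfg:?}");
}
```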
```
llmsched-core/
├── crates/llmsched-core/src/   # Core: queue, budget, kv, scheduler, telemetry
├── crates/llmsched-cli/src/    # CLI entry point
├── ffi/                        # KernelLab C ABI bindings
├── examples/requests.jsonl     # Sample workload
└── reports/                    # Output trace + report
```
```
KernelLab → (C ABI) → LLMSched → (trace.jsonl) → ApexTrain-Core
                         ↑
            Apeinx-IR → (plan.json)
```
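The KernelLab boundary is a plain C ABI, so the ffi/ bindings reduce to extern declarations plus safe wrappers. A sketch under that assumption; the symbol name and signature below are invented for illustration:

```rust
// Illustrative C ABI binding to a KernelLab kernel. The symbol and its
// signature are hypothetical; the real bindings live under ffi/.
use std::os::raw::{c_int, c_uint};

extern "C" {
    /// Hypothetical kernel entry point: advance `batch_size` sequences by
    /// one decode step; returns 0 on success.
    fn kernellab_decode_step(batch_size: c_uint) -> c_int;
}

/// Safe wrapper that maps the C return code to a Result.
fn decode_step(batch_size: u32) -> Result<(), i32> {
    // SAFETY: assumes the KernelLab library is linked and exports the symbol.
    let rc = unsafe { kernellab_decode_step(batch_size) };
    if rc == 0 { Ok(()) } else { Err(rc) }
}
```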
TBD