feat(pymllm): VocabParallelEmbedding & pymllm's cuda infra init #640

Open

chenghuaWang wants to merge 10 commits into UbiquitousLearning:main from chenghuaWang:wch-main

Conversation

@chenghuaWang (Collaborator) commented Feb 18, 2026

Summary by CodeRabbit

  • New Features

    • Centralized global config with CLI overrides, multi-process orchestrator and Engine (sync/async generation), memory-backed KV cache and allocation, tensor-parallel-aware layers/embeddings, mobile detection and consolidated mobile public API, and new CLI entry points (pymllm, pymllm-server, show-config).
  • Dependencies

    • apache-tvm-ffi pinned and CUDA optional extras expanded.
  • Documentation

    • README and kernel examples updated to a unified decorator-based pattern; mobile docs refreshed.
  • Tests

    • Added tensor-parallel embedding and store_cache tests; mobile-targeted tests updated.

…mple

- Replaced the previous JIT utility functions with a streamlined `jit` decorator for kernel registration (a hedged usage sketch follows this list).
- Updated the README.md to reflect the new recommended pattern for CPU kernel implementation.
- Simplified the example for using the JIT compilation with a focus on clarity and ease of use.
- Updated `apache-tvm-ffi` version to `0.1.8` in `pyproject.toml` and `mllm-kernel/pyproject.toml`.
- Refactored mobile module imports and structure, moving scripts to `pymllm.mobile` and removing unused backends.
- Introduced new classes and methods for quantization and model deployment in the Qualcomm backend.
- Added new README files for mobile and Qualcomm transformer components.
- Added `flashinfer-python` to the optional `cuda` dependencies in `pyproject.toml`.
- Introduced new configuration files for server, model, and layers to centralize runtime settings.
- Created initial structure for various layers and components to support future development.
- Added main entry points for `pymllm` and `mllm-kernel` in their respective `pyproject.toml` files.
- Implemented a configuration module for `pymllm` to manage global settings, including server, model, runtime, and cache configurations.
- Introduced the `VocabParallelEmbedding` layer and utility functions for weight management in the layers module.
- Created initial tests for the `VocabParallelEmbedding` layer to validate functionality with tensor parallelism.
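
For orientation, a minimal sketch of the decorator-based registration pattern referenced above. Only the `@mllm_kernel.jit` decorator name comes from the updated README; the kernel name, signature, body, and the bare (argument-free) decorator form are illustrative assumptions, not the PR's actual example.

```python
# Hypothetical usage sketch: decorator-based CPU kernel registration.
# Only the `@mllm_kernel.jit` name is taken from this PR's README change;
# everything else (function name, body, call form) is illustrative.
import torch

import mllm_kernel


@mllm_kernel.jit  # registers the function as a JIT-compiled kernel (assumed semantics)
def scale_rows(x: torch.Tensor, factor: float) -> torch.Tensor:
    # Placeholder body; a real kernel would be lowered/compiled by the
    # jit machinery rather than executed as plain PyTorch.
    return x * factor


out = scale_rows(torch.randn(4, 8), 2.0)
```
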
@coderabbitai coderabbitai bot (Contributor) commented Feb 18, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Reworks packaging and CLI, adds a mobile subpackage, expands layer primitives (TP-aware layers, RoPE, norms, MLP), introduces typed GlobalConfig/ServerConfig with CLI overrides, implements an orchestrator with multi-process components and parallel state, adds KV/mem-cache and store_cache CUDA kernel, and updates tests and FFI bindings.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration**<br>`pymllm/configs/__init__.py`, `pymllm/configs/global_config.py`, `pymllm/configs/server_config.py`, `pymllm/configs/model_config.py`, `pymllm/configs/quantization_config.py` | Add GlobalConfig singleton with CLI helpers, new ServerConfig validation, ModelConfig proxy, and QuantizationConfig dataclass; export via package init (see the sketch after this table). |
| **Orchestrator / IPC / Processes**<br>`pymllm/orchestrator/__init__.py`, `pymllm/orchestrator/group_coordinator.py`, `pymllm/orchestrator/parallel_state.py`, `pymllm/orchestrator/ipc_utils.py`, `pymllm/orchestrator/*_process.py` | New GroupCoordinator, parallel_state initialization/accessors (TP/DP/PP), IPC utilities, and multiple subprocess classes (tokenizer, scheduler, model_runner, detokenizer, async_disk_io, request_response) with lifecycle and ZMQ wiring. |
| **Layer primitives & utils**<br>`pymllm/layers/__init__.py`, `pymllm/layers/base.py`, `pymllm/layers/utils.py`, `pymllm/layers/embedding.py`, `pymllm/layers/linear.py`, `pymllm/layers/mlp.py`, `pymllm/layers/layer_norm.py`, `pymllm/layers/rms_norm.py`, `pymllm/layers/rope.py` | Introduce MllmBaseLayer, set_weight_attrs, VocabParallelEmbedding, Column/Row/Linear variants, MLP/ParallelMLP, LayerNorm/RMS/GemmaRMSNorm, and RoPE utilities with TP-aware partitioning, weight loaders, and communication hooks. |
| **KV / Mem-cache subsystem**<br>`pymllm/mem_cache/__init__.py`, `pymllm/mem_cache/memory_pool.py`, `pymllm/mem_cache/radix_cache.py` | Add KVPool, TokenToKVPoolAllocator, ReqToTokenPool, factories, and a radix-tree radix_cache with SWA/tombstone eviction and hash utilities. |
| **mllm-kernel CUDA & JIT**<br>`mllm-kernel/mllm_kernel/cuda/csrc/store_cache.cuh`, `mllm-kernel/mllm_kernel/cuda/jit/store_cache.py`, `mllm-kernel/mllm_kernel/cuda/jit/__init__.py`, `mllm-kernel/README.md` | New warp-parallel store_cache CUDA kernel and Python JIT interface, arch/PDL checks, can_use_store_cache API, and exported symbols; README switched to @mllm_kernel.jit usage. |
| **FFI & minor C++ changes**<br>`mllm/ffi/Extension.cc`, `mllm/ffi/vendors/tvm-ffi` | Use ToDLPack directly and add ParameterFileObj FFI registration; update the tvm-ffi submodule reference. |
| **Mobile package & import refactors**<br>`pymllm/mobile/__init__.py`, `pymllm/mobile/backends/...`, `pymllm/mobile/ffi/base.py`, `pymllm/mobile/tests/*` | Add mobile package initializer re-exporting FFI/nn primitives, move many qualcomm imports to pymllm.mobile.*, adjust the FFI lib path, and update tests to use the mobile namespace. |
| **Engine / Server & IO structs**<br>`pymllm/engine/launch.py`, `pymllm/engine/io_struct.py`, `pymllm/server/launch.py` | New Engine class with multi-process launch/orchestration, IO dataclasses for requests/batches, and a server launch entrypoint. |
| **Packaging & CLI entrypoints**<br>`pyproject.toml`, `mllm-kernel/pyproject.toml`, `pymllm/__init__.py`, `pymllm/__main__.py`, `mllm-kernel/mllm_kernel/__main__.py` | Pin/bump apache-tvm-ffi, update optional CUDA extras, add/modify project scripts (pymllm, mllm-convertor, mllm-service, pymllm-server), and add mobile-availability detection and a CLI show-config command. |
| **Tests & removals**<br>`pymllm/tests/test_vocab_parallel_embedding.py`, `pymllm/mobile/tests/*`, `pymllm/backends/cuda/tilelang_compile_test.py` (deleted), `mllm-kernel/tests/*` | Add multi-process TP-8 tests for VocabParallelEmbedding and CUDA store_cache tests; update mobile tests; remove the legacy tilelang compile test. |
| **Misc & housekeeping**<br>`.codespellrc`, `.gitignore`, `.claude/...`, `pymllm/utils/mllm_convertor_server/service.py` | Add a "flashinfer" ignore, adjust .gitignore, add a CODEOWNERS skill doc, and remove a license header in one file. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant RR as RequestResponseProcess
    participant Tokenizer
    participant Scheduler
    participant ModelRunner
    participant Detokenizer
    participant AsyncIO as AsyncDiskIoProcess

    Client->>RR: submit GenerateReqInput
    RR->>Tokenizer: forward tokenization request (ZMQ)
    Tokenizer->>Scheduler: send tokenized batch
    Scheduler->>ModelRunner: dispatch batch for forward
    ModelRunner->>AsyncIO: (optional) async weight IO / checkpoints
    ModelRunner-->>Scheduler: return token ids / partial results
    Scheduler->>Detokenizer: send finished tokens
    Detokenizer-->>RR: send decoded text chunks
    RR-->>Client: stream final/partial outputs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • liang1232018
  • oreomaker
  • yirongjie

Poem

🐰 I hopped through configs, threads, and tape,

I stitched mobile paths and tuned each shape.
Processes lined up, ZMQ buzzing bright,
Kernels and caches hummed into the night.
Hooray — little rabbit celebrates the flight! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | No pull request description was provided by the author, making it impossible to verify completeness against the repository template. | Add a comprehensive pull request description following the provided template, including context, changes, and testing information. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.09%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the primary changes: introducing VocabParallelEmbedding and initializing pymllm's CUDA infrastructure. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (9)
pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py (1)

31-35: ⚠️ Potential issue | 🟡 Minor

--output_dir is not marked required=True — crashes with an opaque TypeError when omitted.

args.output_dir defaults to None. Line 50 then calls os.makedirs(None, exist_ok=True), raising TypeError: expected str, bytes or os.PathLike object, not NoneType instead of a clear argparse error.

🐛 Proposed fix
     parser.add_argument(
         "--output_dir",
         type=str,
+        required=True,
         help="Directory to save the quantized model",
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py` around lines 31
- 35, The --output_dir argument is optional but later used in os.makedirs
causing a TypeError when omitted; update the parser.add_argument for
"--output_dir" (the one in train.py) to include required=True (or alternatively
provide a sensible default) so args.output_dir is never None, and ensure any
subsequent call (os.makedirs(args.output_dir, exist_ok=True) in this file)
receives a valid path; keep the change localized to the parser.add_argument for
"--output_dir" and/or add an explicit argparse error if args.output_dir is falsy
before calling os.makedirs.
pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py (1)

136-160: ⚠️ Potential issue | 🟠 Major

Fix apply_rotary_pos_emb function calls to rotate_half.

rotate_half() requires 4 positional arguments (x, x_observer, x2_neg_fake_quant, concat_observer), but apply_rotary_pos_emb() calls it with only 1 argument at lines 158–159:

q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)

This causes TypeError: rotate_half() missing 3 required positional arguments for any caller. LlamaAttention.forward (lines 397–420) bypasses this by calling rotate_half() directly with all required arguments, but the public function remains broken and unusable.
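
A minimal sketch of one way to make the public function usable again, assuming the observer/fake-quant objects can simply be threaded through its signature (argument names follow the rotate_half signature quoted above; the None defaults and the rest of the signature are assumptions):

```python
# Sketch only, not the repository's implementation: forwards the extra
# arguments that this module's rotate_half(x, x_observer, x2_neg_fake_quant,
# concat_observer) requires. Whether None is a valid default for the
# observers depends on how rotate_half handles them.
def apply_rotary_pos_emb(q, k, cos, sin,
                         x_observer=None,
                         x2_neg_fake_quant=None,
                         concat_observer=None):
    q_embed = (q * cos) + (rotate_half(q, x_observer, x2_neg_fake_quant, concat_observer) * sin)
    k_embed = (k * cos) + (rotate_half(k, x_observer, x2_neg_fake_quant, concat_observer) * sin)
    return q_embed, k_embed
```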

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py` around
lines 136 - 160, apply_rotary_pos_emb currently calls rotate_half with a single
argument but rotate_half requires four positional args (x, x_observer,
x2_neg_fake_quant, concat_observer); update apply_rotary_pos_emb to call
rotate_half with the same four arguments used in LlamaAttention.forward (i.e.,
pass through the observer/quant args or supply None defaults) so the calls
become rotate_half(q, x_observer, x2_neg_fake_quant, concat_observer) and
rotate_half(k, x_observer, x2_neg_fake_quant, concat_observer) (or explicitly
pass None for those three if observers are not available), ensuring the function
signature and callers are updated accordingly.
pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (1)

114-138: ⚠️ Potential issue | 🟠 Major

Remove apply_rotary_pos_emb function — it is dead code with signature mismatch.

apply_rotary_pos_emb calls rotate_half(q) and rotate_half(k) with 1 argument each (lines 136–137), but rotate_half requires 4 arguments (x, x_observer, x2_neg_fake_quant, concat_observer). This would raise TypeError if called. The function is not invoked anywhere in the codebase and is absent from __all__ exports. Qwen3Attention.forward applies RoPE manually with correct argument passing. Remove this unused function.

Note: The same issue exists identically in pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py and pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around
lines 114 - 138, The function apply_rotary_pos_emb is dead code and wrong: it
calls rotate_half(q) and rotate_half(k) with only one argument while rotate_half
requires four, and Qwen3Attention.forward applies RoPE correctly; remove the
entire apply_rotary_pos_emb function definition from modeling_qwen3.py to avoid
the incorrect call and dead export, and likewise remove the identical
apply_rotary_pos_emb definitions from modeling_qwen2.py and modeling_llama.py so
no mismatched helper remains in the qualcomm backend.
pymllm/mobile/backends/qualcomm/transformers/llama/train.py (1)

32-35: ⚠️ Potential issue | 🟠 Major

--output_dir will crash with TypeError if omitted.

--output_dir has no default and no required=True. When args.output_dir is None, line 50 (os.makedirs(args.output_dir, exist_ok=True)) raises TypeError: expected str, bytes or os.PathLike object, not NoneType. The same issue exists in qwen2/train.py.

🐛 Proposed fix
 parser.add_argument(
     "--output_dir",
     type=str,
+    required=True,
     help="Directory to save the quantized model",
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/llama/train.py` around lines 32
- 35, The CLI argument "--output_dir" is missing a default or required=True so
args.output_dir can be None and later break on os.makedirs(args.output_dir,
exist_ok=True); update the argument definition in train.py (and mirror the same
change in qwen2/train.py) to either set a sensible default (e.g., "./output") or
mark it required=True and then add a defensive check before calling os.makedirs
(e.g., ensure args.output_dir is not None and raise a clear error or use the
default), referencing the argument parsing block that defines "--output_dir" and
the code that calls os.makedirs(args.output_dir, exist_ok=True).
pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (3)

291-298: ⚠️ Potential issue | 🟡 Minor

Misleading streaming comment and trust_remote_code=True security note.

  • Line 292 comment states streaming=True is used, but the actual MsDataset.load() call has no such argument. This will cause the full dataset to be downloaded locally — the opposite of what the comment describes.
  • trust_remote_code=True on line 297 permits execution of arbitrary Python code bundled with the dataset. While common in research settings, it should be an explicit opt-in with a clear docstring warning rather than a silent default in a library. A sketch of such an opt-in wrapper follows this list.
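
A hedged sketch of the opt-in pattern, assuming MsDataset.load accepts streaming and trust_remote_code keyword arguments as the finding and the inline comment suggest; the wrapper name and its parameters are hypothetical:

```python
# Sketch only. Assumes MsDataset.load forwards `streaming` and
# `trust_remote_code` keywords as described in the finding above.
from modelscope.msdatasets import MsDataset


def load_calibration_dataset(name: str,
                             split: str = "train",
                             allow_trust_remote_code: bool = False):
    """Lazily load a calibration dataset.

    Set allow_trust_remote_code=True only if you accept executing Python
    code bundled with the dataset repository.
    """
    return MsDataset.load(
        name,
        split=split,
        streaming=True,  # stream samples instead of downloading the full dataset
        trust_remote_code=allow_trust_remote_code,
    )
```
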
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
291 - 298, The comment above the MsDataset.load call is misleading and the use
of trust_remote_code=True is unsafe; update the MsDataset.load invocation (the
call to MsDataset.load in runner.py) to pass streaming=True to match the comment
and avoid downloading the entire dataset, and change trust_remote_code=True to a
safe default (False) or wire it to an explicit opt-in flag (e.g.,
allow_trust_remote_code) exposed in the function/class API with a clear
docstring warning about executing remote code so callers must consciously opt
in; also update the inline comment to accurately reflect the behavior.

166-185: ⚠️ Potential issue | 🔴 Critical

Add QLinearW8A16_PerChannelSym handling to enable_fake_quant and disable_fake_quant functions.

QLinearW8A16_PerChannelSym is included in freeze_qwen3_linear_weight (line 147) and convert_weight (line 189), but is missing from enable_fake_quant and disable_fake_quant (lines 166–185). Unlike QLinearLPBQ, which defines enable_fakequant() and disable_fakequant() methods, QLinearW8A16_PerChannelSym currently lacks these methods. Both classes should provide consistent interfaces for managing fake-quantization state during calibration. Add QLinearW8A16_PerChannelSym to the type checks in both functions, and implement the corresponding methods in the class if they do not exist.
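
A sketch of the shape such a change could take. The widened isinstance checks follow the finding; the weight_fake_quant attribute and the torch.ao-style enable_fake_quant()/disable_fake_quant() calls inside the new methods are assumptions about the class internals:

```python
# Sketch only: methods to add to the existing QLinearW8A16_PerChannelSym
# class plus the widened isinstance checks. `weight_fake_quant` is an
# assumed internal attribute; adapt to however the class actually stores
# its fake-quantize module.

class QLinearW8A16_PerChannelSym:  # additions to the existing class
    def enable_fakequant(self) -> None:
        # Re-enable fake quantization during calibration.
        self.weight_fake_quant.enable_fake_quant()

    def disable_fakequant(self) -> None:
        # Bypass fake quantization while collecting float statistics.
        self.weight_fake_quant.disable_fake_quant()


def enable_fake_quant(module) -> None:
    if isinstance(module, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
        module.enable_fakequant()


def disable_fake_quant(module) -> None:
    if isinstance(module, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
        module.disable_fakequant()
```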

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
166 - 185, Add QLinearW8A16_PerChannelSym to the fake-quant toggles and
implement matching methods on the class: update enable_fake_quant and
disable_fake_quant to check isinstance(..., QLinearW8A16_PerChannelSym) and call
the new methods; then add enable_fakequant(self) and disable_fakequant(self) to
the QLinearW8A16_PerChannelSym class to mirror QLinearLPBQ’s behavior (toggle
whatever internal flag or buffer used for fake-quant state). Ensure the
implementation is consistent with how QLinearLPBQ, ActivationQDQ, QRMSNorm, and
QEmbedding manage fake-quant state so freeze_qwen3_linear_weight and
convert_weight (which already reference QLinearW8A16_PerChannelSym) behave
correctly during calibration.

69-73: ⚠️ Potential issue | 🟡 Minor

The diagnostic logging loop is dead code and contains an AttributeError.

FakeQuantize registers scale as a buffer (via register_buffer), not as a parameter. Therefore, module.fake_quant.named_parameters() will never yield scale, making the loop body at lines 70–73 unreachable dead code.

Additionally, line 72 references module.scale, but ActivationQDQ does not expose a .scale property — it should be module.fake_quant.scale.

Fix both issues:

🐛 Proposed fix
-            for key, value in module.fake_quant.named_parameters():
+            for key, value in module.fake_quant.named_buffers():
                 if value is module.fake_quant.scale:
-                    print(f"{module._get_name()}.{key}: {module.scale}")
+                    print(f"{module._get_name()}.{key}: {module.fake_quant.scale}")
                     break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 69
- 73, The loop using module.fake_quant.named_parameters() is dead because
FakeQuantize.register_buffer registers scale as a buffer, and it also wrongly
references module.scale; update the diagnostic to iterate
module.fake_quant.named_buffers() (or check both named_parameters() and
named_buffers() if desired) and reference module.fake_quant.scale when printing;
specifically modify the block around the loop that uses
module.fake_quant.named_parameters() and the print that uses module.scale so it
inspects named_buffers() and prints module.fake_quant.scale (or the buffer name)
instead.
pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py (2)

366-369: ⚠️ Potential issue | 🟡 Minor

Pre-existing typo: layer_dix should be layer_idx.

Line 369 assigns self.layer_dix = layer_idx (note "dix" vs "idx"). This typo propagates to Lines 384 and 404 where self.layer_dix is read. While internally consistent, it's misleading and will cause confusion for anyone extending Qwen2DecoderLayer.

Not introduced by this PR, but flagging for awareness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` around
lines 366 - 369, The Qwen2DecoderLayer constructor uses a typo'd attribute name
self.layer_dix instead of self.layer_idx; update the constructor in class
Qwen2DecoderLayer to assign self.layer_idx = layer_idx and then change any
reads/writes of self.layer_dix (e.g., at the sites currently reading it on lines
where it’s accessed) to self.layer_idx so the attribute name is consistent and
clear across the class (keep the class name Qwen2DecoderLayer and parameter
layer_idx the same).

90-123: ⚠️ Potential issue | 🟡 Minor

Fix the arity mismatch in apply_rotary_pos_emb: it calls rotate_half with 1 argument but the function requires 4.

rotate_half on line 90 requires 4 parameters (x, x_observer, x2_neg_fake_quant, concat_observer), but apply_rotary_pos_emb calls it on lines 121–122 with only 1 argument. This will raise TypeError at runtime if apply_rotary_pos_emb is ever invoked.

The actual RoPE implementation in the attention forward method (lines 281–299) calls rotate_half with all 4 arguments correctly, so the bug only affects the unused apply_rotary_pos_emb function. Suggest either updating apply_rotary_pos_emb to pass all required arguments or removing it if it's not needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` around
lines 90 - 123, The apply_rotary_pos_emb function calls rotate_half(q) and
rotate_half(k) but rotate_half requires four parameters (x, x_observer,
x2_neg_fake_quant, concat_observer); update apply_rotary_pos_emb to accept and
forward those observers/quantizers (e.g., add parameters x_observer,
x2_neg_fake_quant: ActivationQDQ, concat_observer: ConcatObserver to
apply_rotary_pos_emb signature and call rotate_half(q, x_observer,
x2_neg_fake_quant, concat_observer) and rotate_half(k, x_observer,
x2_neg_fake_quant, concat_observer)), or if apply_rotary_pos_emb is unused
remove it—ensure the fix mirrors how rotate_half is invoked in the attention
forward implementation.
🧹 Nitpick comments (32)
pymllm/orchestrator/parallel_state.py (3)

36-37: Silent early return when dist is not initialized may mask configuration errors.

If a caller invokes initialize_model_parallel(tp=4, dp=2, ...) without first calling dist.init_process_group, the function silently returns without setting any groups or config. Downstream code relying on get_tp_group() etc. will get None with no indication of why. Consider logging a warning so that misconfigured setups are easier to diagnose.

♻️ Proposed change
     if not dist.is_initialized():
+        logger.warning(
+            "torch.distributed is not initialized; "
+            "skipping model parallel group setup."
+        )
         return

As per coding guidelines, "Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/parallel_state.py` around lines 36 - 37, The early return
guarded by dist.is_initialized() in initialize_model_parallel(tp=..., dp=...)
should not be silent; update it to log a warning (using the module logger) that
process groups were not initialized and advise calling
torch.distributed.init_process_group before initialize_model_parallel, so
downstream calls like get_tp_group() don't fail silently; ensure the message
includes which function returned early (initialize_model_parallel) and the check
(dist.is_initialized()) to aid debugging.

20-25: No thread safety or re-initialization guard on the global state.

initialize_model_parallel mutates module-level globals without any guard against concurrent or repeated calls. A second call would leak the prior dist.new_group handles. Consider adding an early check via model_parallel_is_initialized() and raising if already initialized, or explicitly destroying old groups first.

♻️ Example guard
 def initialize_model_parallel(
     tensor_model_parallel_size: int = 1,
     data_parallel_size: int = 1,
     pipeline_model_parallel_size: int = 1,
     backend: str = "nccl",
 ) -> None:
+    if model_parallel_is_initialized():
+        raise RuntimeError(
+            "Model parallel groups are already initialized. "
+            "Call destroy_model_parallel() first."
+        )
+
     global _TP_GROUP, _DP_GROUP, _PP_GROUP

Also applies to: 177-179

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/parallel_state.py` around lines 20 - 25, The
initialize_model_parallel function mutates module-level globals and may leak
prior dist.new_group handles on concurrent or repeated calls; add a
re-initialization guard by checking model_parallel_is_initialized() at the top
of initialize_model_parallel and either raise a clear exception if already
initialized or call a cleanup routine (e.g., destroy_model_parallel or a new
teardown function) to explicitly destroy existing groups before creating new
ones; also protect the initialization logic with a simple threading.Lock to
ensure thread-safety when creating dist.new_group handles and apply the same
guard/cleanup pattern to the other initialization site(s) that mutate the same
globals.

66-72: Same assert-based validation concern as in group_coordinator.py.

This is the primary invariant for the entire parallelism setup. If the product doesn't match world_size, the system will silently compute incorrect groups under -O. Raise a ValueError instead.

♻️ Proposed fix
-    assert (
-        tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size
-        == world_size
-    ), (
-        f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * "
-        f"PP({pipeline_model_parallel_size}) != World({world_size})"
-    )
+    total = tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size
+    if total != world_size:
+        raise ValueError(
+            f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * "
+            f"PP({pipeline_model_parallel_size}) = {total} != World({world_size})"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/parallel_state.py` around lines 66 - 72, Replace the
assert-based check that multiplies tensor_model_parallel_size,
data_parallel_size, and pipeline_model_parallel_size against world_size with an
explicit runtime validation that raises a ValueError when the product doesn't
equal world_size; locate the expression that currently uses assert (the block
referencing tensor_model_parallel_size * data_parallel_size *
pipeline_model_parallel_size == world_size and the f-string message) and change
it to compute the product, compare to world_size, and raise
ValueError(f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) *
PP({pipeline_model_parallel_size}) != World({world_size})") when they mismatch.
pymllm/orchestrator/group_coordinator.py (2)

72-77: assert for input validation is stripped under python -O; prefer raising exceptions in public API functions.

divide() and split_tensor_along_dim() are public API (exported via __init__.py). assert statements are removed when Python runs with the -O flag, silently disabling these guards and leading to incorrect silent behavior (e.g., truncated results from integer division). Use explicit ValueError raises for robustness.

♻️ Example for `divide`; apply the same pattern to `split_tensor_along_dim`
 def divide(numerator: int, denominator: int) -> int:
     """Divide and ensure divisibility."""
-    assert numerator % denominator == 0, (
-        f"{numerator} is not divisible by {denominator}"
-    )
+    if numerator % denominator != 0:
+        raise ValueError(f"{numerator} is not divisible by {denominator}")
     return numerator // denominator

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions" and "Validate inputs for public APIs and critical internal functions."

Also applies to: 80-98

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/group_coordinator.py` around lines 72 - 77, The public
functions divide and split_tensor_along_dim currently use assert for input
validation which is removed under python -O; replace those asserts with explicit
exception raises (e.g., raise ValueError(...)) that include the same descriptive
message so invalid inputs (non-divisible numerator/denominator or bad split
parameters) raise a clear exception instead of failing silently; update the
divide function to raise ValueError when numerator % denominator != 0 and apply
the same pattern to split_tensor_along_dim to validate its inputs and raise
ValueError with contextual messages while preserving the existing return
behavior.

19-37: local_rank is stored but never used within the class.

self.local_rank is assigned but never referenced by any method in GroupCoordinator. If it's only needed by external callers (e.g., for device placement), that's fine, but consider documenting that intent. Otherwise, it's dead state that may mislead future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/group_coordinator.py` around lines 19 - 37, The
constructor currently assigns self.local_rank but nothing in GroupCoordinator
uses it; either remove the local_rank parameter and attribute from
GroupCoordinator.__init__ (and update callers) to avoid dead state, or
explicitly document its purpose for external use by adding a docstring/param
comment to the GroupCoordinator class and keeping self.local_rank; if you choose
removal, delete the local_rank argument and any assignment to self.local_rank
and update instantiations, and if you choose to keep it, add a clear docstring
on GroupCoordinator and/or expose it via a property (e.g., def local_rank(self):
return self._local_rank) so the intent is explicit (references:
GroupCoordinator.__init__, self.local_rank, rank_in_group, device_group).
pymllm/orchestrator/__init__.py (1)

25-48: Categorical grouping in __all__ is reasonable; ignore the sort lint.

The static analysis tool flags that __all__ is not alphabetically sorted (RUF022), but the current categorical grouping (GroupCoordinator, TP, DP, PP, State) is intentional and more readable for this API surface. Consider adding a # noqa: RUF022 comment if you want to suppress the lint.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/__init__.py` around lines 25 - 48, The linter flags the
__all__ list as unsorted (RUF022) even though the categorical grouping is
intentional; suppress the warning by adding a noqa comment for RUF022 to the
__all__ definition (e.g., append "# noqa: RUF022" to the __all__ assignment) so
the grouped exports (including symbols like "GroupCoordinator", "divide",
"get_tp_group", "data_parallel_all_reduce", "get_pp_group",
"initialize_model_parallel", etc.) keep their categorical order without lint
noise.
pymllm/mobile/backends/qualcomm/nn.py (1)

4-11: Add docstrings and document placeholder intent for QnnSoftmax and QnnRoPE.

Both classes are bare subclasses that add no behavior. Per coding guidelines, public classes must have docstrings explaining purpose. Without any comment it is also unclear whether these are intentional stubs awaiting Qualcomm-specific overrides or permanent thin wrappers.

♻️ Proposed fix: add minimal docstrings
 class QnnSoftmax(Softmax):
+    """Qualcomm-specific Softmax layer.
+
+    Thin wrapper around Softmax; reserved for Qualcomm QNN-specific
+    operator overrides.
+    """
     def __init__(self):
         super().__init__()


 class QnnRoPE(RoPE):
+    """Qualcomm-specific Rotary Position Embedding layer.
+
+    Thin wrapper around RoPE; reserved for Qualcomm QNN-specific
+    operator overrides.
+    """
     def __init__(self):
         super().__init__()

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/nn.py` around lines 4 - 11, The public
classes QnnSoftmax and QnnRoPE are empty subclasses with no docstrings; add
concise class docstrings to each explaining their intent (e.g.,
"Qualcomm-specific subclass of Softmax/RoPE; currently a thin
wrapper/placeholder for future Qualcomm overrides") and note any expected
behavior or why they exist (compatibility shim, vendor-specific hooks), so
readers know they are intentional stubs and not omissions; ensure the docstrings
appear directly under the class definitions for QnnSoftmax and QnnRoPE.
pymllm/mobile/ffi/base.py (1)

9-25: Add a pre-flight existence check and a descriptive error in _load_lib().

tvm_ffi.load_module(lib_path) is called without checking whether lib_path exists. On any environment where the native library hasn't been built or installed, the module import fails with a raw OS/TVM error that gives no indication of which path was searched or what the user should do. This is worsened by the module-level _LIB = _load_lib() at line 25, which means the error fires on any import of this module.

♻️ Proposed fix: descriptive pre-flight check
     lib_path = os.path.join(parent_dir, "lib", lib_name)
+    if not os.path.exists(lib_path):
+        raise RuntimeError(
+            f"MllmFFIExtension library not found at '{lib_path}'. "
+            "Ensure the native library is built and installed correctly."
+        )
     return tvm_ffi.load_module(lib_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/ffi/base.py` around lines 9 - 25, The current _load_lib() calls
tvm_ffi.load_module(lib_path) without verifying the file exists, causing opaque
import-time errors; update _load_lib to check os.path.exists(lib_path) (or
os.path.isfile) before calling tvm_ffi.load_module and, if the file is missing,
raise a clear FileNotFoundError/RuntimeError that includes the full lib_path,
the platform-determined lib_name, and a short hint about building or installing
the native library; keep the _LIB = _load_lib() behavior but ensure the new
error message surfaces the searched path and recovery steps so users know what
to do.
pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (2)

379-382: Typo: layer_dix should be layer_idx.

self.layer_dix = layer_idx (line 382) is a persistent typo. It is used consistently as layer_dix within the class (lines 394, 397, 417), so there is no immediate runtime error, but it clashes with the attribute name layer_idx used by the parent Qwen3Attention and the rest of the decoder stack.

🧹 Proposed rename (all occurrences in this class)
-        self.layer_dix = layer_idx
+        self.layer_idx = layer_idx

Then update lines 394, 397, 417 from self.layer_dix to self.layer_idx.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around
lines 379 - 382, The class Qwen3DecoderLayer has a persistent typo: the
attribute was set as self.layer_dix instead of self.layer_idx; update the
attribute assignment in Qwen3DecoderLayer.__init__ to self.layer_idx = layer_idx
and rename all uses of self.layer_dix to self.layer_idx within this class
(including places that interact with Qwen3Attention) so the decoder layer
attribute matches the expected layer_idx name used elsewhere.

105-111: Unused x_observer parameter in rotate_half.

x_observer is accepted but never referenced in the function body. If it is no longer needed, remove it; if it was meant to be used (e.g., to trigger an observer forward pass on x), add the missing call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around
lines 105 - 111, The parameter x_observer in function rotate_half is unused;
either remove it from rotate_half's signature and all callers if observers are
not needed, or invoke it to trigger observation (e.g., call x_observer(x) before
returning) so the input tensor is recorded; update the rotate_half definition
and ensure symbols x2_neg_fake_quant and concat_observer are still applied in
the existing order (use x_observer(x) then return
concat_observer(torch.cat((x2_neg_fake_quant(-x2), x1), dim=-1))) and adjust any
call sites accordingly.
pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py (3)

219-234: Use the already-defined num_steps constant instead of repeating 2**bitwidth_of_scale.

num_steps is computed at line 221 but 2**bitwidth_of_scale is used again directly at line 231, breaking the named-constant rule.

🧹 Proposed fix
-            max=2**bitwidth_of_scale,
+            max=num_steps,

As per coding guidelines, "Use named constants instead of magic numbers."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py` around lines
219 - 234, Replace the duplicated magic expression 2**bitwidth_of_scale with the
already-defined constant num_steps in the quantization loop: inside the loop in
qlinear.py where q_scales is computed (the torch.clamp call that currently uses
max=2**bitwidth_of_scale), change that max to num_steps to use the named
constant and avoid repeating the exponentiation.

32-36: Redundant is not None guard inside the outer null-check.

The outer if self.weight_quant is not None: at line 32 already ensures the inner branch is never reached with None. The and self.weight_quant is not None on line 36 adds nothing.

🧹 Proposed cleanup
-    if (
-        isinstance(self.weight_quant, PerBlockParamFakeQuantize)
-        and self.weight_quant is not None
-    ):
+    if isinstance(self.weight_quant, PerBlockParamFakeQuantize):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py` around lines 32
- 36, Remove the redundant null-check: inside the outer guard "if
self.weight_quant is not None:" drop the duplicated "and self.weight_quant is
not None" from the inner condition that tests "isinstance(self.weight_quant,
PerBlockParamFakeQuantize)"; leave only the isinstance check so the inner branch
reads like a single isinstance(self.weight_quant, PerBlockParamFakeQuantize)
test referencing the existing self.weight_quant variable.

205-215: Replace assert with an explicit ValueError for runtime data validation.

assert statements are silently stripped when Python runs with -O; a data-validity check on user-provided tensors should survive in optimized mode.

🛡️ Proposed fix
-assert self.weight.shape[-1] % self.block_size[1] == 0
-assert linear_zero_point.sum() == 0
+if self.weight.shape[-1] % self.block_size[1] != 0:
+    raise ValueError(
+        f"in_features ({self.weight.shape[-1]}) must be divisible by "
+        f"block_size ({self.block_size[1]})"
+    )
+if linear_zero_point.sum() != 0:
+    raise ValueError("LPBQ deployment requires symmetric quantization (zero_point must be all zeros)")

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py` around lines
205 - 215, Replace the two runtime assertions in QLinear's quantization block
with explicit checks that raise ValueError: check that self.weight.shape[-1] %
self.block_size[1] == 0 and raise ValueError with a clear message if not, and
check that linear_zero_point.sum() == 0 and raise ValueError with a clear
message if not; keep the rest of the flow (calling _quantize_affine and
assigning weight_int4) unchanged so invalid tensor shapes or zero-point sums
still raise informative exceptions at runtime.
mllm-kernel/pyproject.toml (1)

3-3: apache-tvm-ffi is unpinned in [build-system].requires but pinned to 0.1.8 in [project].dependencies.

During the scikit-build-core build step a different (potentially incompatible) version of apache-tvm-ffi could be resolved, while the installed package enforces == 0.1.8 at runtime. If the build scripts rely on the same API surface, pin the build-system requirement to match.

🔧 Proposed fix
 requires = [
-  "scikit-build-core>=0.11.0", "apache-tvm-ffi", "torch-c-dlpack-ext"
+  "scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8", "torch-c-dlpack-ext"
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm-kernel/pyproject.toml` at line 3, The build-system requirement for
apache-tvm-ffi is unpinned while the runtime dependency in
[project].dependencies is pinned to 0.1.8; update the [build-system].requires
entry for "apache-tvm-ffi" in pyproject.toml to pin it to
"apache-tvm-ffi==0.1.8" so the build resolver uses the same version as the
runtime and avoids API/compatibility mismatches.
pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py (1)

127-133: x_observer parameter accepted but never used in rotate_half.

Same issue as the Qwen3 variant: the x_observer argument (line 128) is passed at every call-site but the function body ignores it entirely. Either remove it from the signature (and all call-sites) or add the intended forward/observation logic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py` around
lines 127 - 133, The rotate_half function currently accepts x_observer but never
uses it; either remove x_observer from the signature and from all call-sites, or
apply the observer to the input (e.g., call x = x_observer(x) or
x_observer.observe(x) as appropriate) before splitting/rotating so the input is
recorded; update the function rotate_half and all callers to match the chosen
approach and keep the existing ActivationQDQ x2_neg_fake_quant and
ConcatObserver concat_observer usage intact.
pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (2)

305-340: tqdm progress bar never closed.

pbar is created on line 305 but never explicitly closed. The final display update (e.g., elapsed time) will be missing, and if multiple calibration runs occur in the same process, stale bars may accumulate. Use a context manager:

♻️ Proposed fix
-        pbar = tqdm(total=num_samples, desc="Calibrating")
-        for entry in dataset:
-            ...
-            pbar.update(1)
+        with tqdm(total=num_samples, desc="Calibrating") as pbar:
+            for entry in dataset:
+                ...
+                pbar.update(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines
305 - 340, The tqdm progress bar instantiated as pbar in the calibration loop is
never closed; wrap the progress bar in a context manager or explicitly close it
to avoid stale bars (e.g., use with tqdm(total=num_samples, desc="Calibrating")
as pbar: around the for entry in dataset loop or call pbar.close() after the
loop). Update the block where pbar is created and used (variable pbar, the for
entry in dataset loop, and samples_processed updates) so the progress bar is
safely closed regardless of early breaks or exceptions.

19-195: Significant code duplication across runner modules.

The utility functions recompute_scale_zp, validate_concat_observer_fn, disable_qdq_observer, enable_qdq_observer, enable_fake_quant, disable_fake_quant, and convert_weight (lines 19–195) are byte-for-byte identical in qwen3/runner.py, llama/runner.py, and likely qwen2/runner.py. This violates DRY and makes any future fix/enhancement require triple (or more) edits.

Consider extracting these into a shared pymllm/mobile/backends/qualcomm/transformers/core/runner_utils.py module and importing from there.
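
A sketch of what the extraction could look like; the module path matches the suggestion above, while the relative import shown for the runners is an assumption about the package layout:

```python
# pymllm/mobile/backends/qualcomm/transformers/core/runner_utils.py
# Sketch only: host the duplicated helpers once, keeping their current
# signatures, and import them from each model-specific runner.

def disable_qdq_observer(module) -> None:
    """Shared implementation previously copy-pasted into each runner."""
    # Body moved verbatim from qwen3/runner.py (and its duplicates).
    ...


# In qwen3/runner.py, llama/runner.py, and qwen2/runner.py:
# from ..core.runner_utils import (
#     recompute_scale_zp, validate_concat_observer_fn,
#     disable_qdq_observer, enable_qdq_observer,
#     enable_fake_quant, disable_fake_quant, convert_weight,
# )
```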

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 19
- 195, These functions (recompute_scale_zp, validate_concat_observer_fn,
disable_qdq_observer, enable_qdq_observer, enable_fake_quant,
disable_fake_quant, convert_weight) are duplicated across runner modules—extract
them into a shared module (e.g., create a new runner_utils.py under
transformers/core) and move the implementations there, then replace the in-file
definitions in qwen3/runner.py (and the identical ones in llama/runner.py and
qwen2/runner.py) with imports from that shared module; ensure the shared
functions keep the same signatures and continue to reference ActivationQDQ,
ConcatObserver, QLinearLPBQ, QLinearW8A16_PerChannelSym, QRMSNorm, QEmbedding,
etc., so callers (model.apply or named_modules loops) work unchanged.
pyproject.toml (2)

24-24: Exact-version pin apache-tvm-ffi == 0.1.8 prevents receiving security/bug-fix patches.

Strict == pinning in dependencies (not just build-system.requires) means any downstream environment that already has a newer minor release will fail to resolve. Consider a compatible-release specifier:

♻️ Proposed loosening
-  "apache-tvm-ffi == 0.1.8",
+  "apache-tvm-ffi >= 0.1.8, < 0.2.0",

If ABI breakage is the concern, at minimum document why strict pinning is required here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 24, The dependency line pins apache-tvm-ffi exactly
to "apache-tvm-ffi == 0.1.8", which blocks patch/minor updates and can break
resolution in downstream environments; change this to a compatible-release
specifier such as "apache-tvm-ffi ~= 0.1.8" or "apache-tvm-ffi >=0.1.8,<0.2.0"
in pyproject.toml to allow receiving bug/security fixes while preventing ABI
bumps, and if you truly require an exact pin keep the existing strict pin but
add a short justification comment in the project docs explaining why
exact-version pinning of apache-tvm-ffi is necessary.

21-23: pytest and pytest-html should not be runtime dependencies.

Test runner packages installed into every end-user environment bloat the install footprint and create unnecessary version conflicts. Move them to an optional [dev] or [test] group:

♻️ Proposed fix
 dependencies=[
   "packaging",
-  "pytest",
-  "pytest-html",
   "apache-tvm-ffi == 0.1.8",
   ...
 ]

+[project.optional-dependencies]
+dev = ["pytest", "pytest-html"]
 cuda = ["tilelang", "flashinfer-python"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 21 - 23, The pyproject currently lists "pytest"
and "pytest-html" as regular runtime dependencies; remove those two entries from
the main dependencies list and add them to a development/test-only group instead
(for example add them under tool.poetry.dev-dependencies if using Poetry, or
under project.optional-dependencies.test if following PEP 621). Ensure you
delete the "pytest" and "pytest-html" lines from the dependencies block and add
equivalent entries in the chosen dev/test section so they are only installed for
development/testing.
pymllm/layers/utils.py (1)

3-10: Inconsistent type hint style: Dict import unused in favor of | union syntax.

Line 3 imports Dict from typing, but Line 10 uses Dict[str, Any] | None with the | union operator (Python 3.10+). Since you're already using 3.10+ syntax, consider using the built-in dict instead of the imported Dict for consistency:

✏️ Suggested change
-from typing import Any, Dict
+from typing import Any
 
 import torch
 
 
 def set_weight_attrs(
     weight: torch.Tensor,
-    weight_attrs: Dict[str, Any] | None,
+    weight_attrs: dict[str, Any] | None,
 ) -> None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/utils.py` around lines 3 - 10, The type hint style is
inconsistent: remove the unused Dict import and change the annotation on
set_weight_attrs (parameter weight_attrs) from Dict[str, Any] | None to the
modern built-in form dict[str, Any] | None to match Python 3.10+ union syntax;
keep the Any import and update the function signature accordingly (refer to the
function name set_weight_attrs and parameter weight_attrs).
pymllm/tests/test_vocab_parallel_embedding.py (2)

61-62: Hardcoded MASTER_PORT may cause conflicts in CI.

Port 29500 is a common default and may already be in use by other distributed tests or services running in parallel. Consider using a dynamic free port or at minimum reading from an environment variable with a fallback.

✏️ Suggested change
     os.environ["MASTER_ADDR"] = "localhost"
-    os.environ["MASTER_PORT"] = "29500"
+    os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", "29500")
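
If a fully dynamic port is preferred over the environment fallback, a small helper (hypothetical name find_free_port) can ask the OS for an ephemeral port; note the unavoidable small race between releasing the port and torch.distributed reusing it:

```python
import os
import socket


def find_free_port() -> int:
    # Bind to port 0 so the OS assigns a free ephemeral port, then release it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


os.environ.setdefault("MASTER_PORT", str(find_free_port()))
```
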
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 61 - 62, Replace
the hardcoded MASTER_PORT assignment in the test setup (currently setting
os.environ["MASTER_PORT"] = "29500") with logic that first reads an optional env
var and otherwise selects a free ephemeral port at runtime; e.g., use
os.environ.get("MASTER_PORT") as primary and fall back to a dynamically
discovered free port (via a helper like find_free_port() that binds a socket to
port 0 to obtain an available port) before assigning os.environ["MASTER_PORT"],
keeping os.environ["MASTER_ADDR"] assignment as-is.

22-23: Module-level logging.basicConfig(force=True) can interfere with other tests.

Using force=True overwrites the root logger configuration, which can affect logging behavior of other test modules when run in the same pytest session. Consider scoping this to a fixture or using a module-specific logger instead.
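
A sketch of the fixture-scoped alternative using pytest's built-in caplog fixture; the autouse fixture name is illustrative:

```python
import logging

import pytest

# Module-scoped logger: configuring it does not overwrite the root logger.
logger = logging.getLogger(__name__)


@pytest.fixture(autouse=True)
def _capture_debug_logs(caplog):
    # Capture DEBUG-level records for this module's tests only.
    caplog.set_level(logging.DEBUG, logger=__name__)
    yield
```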

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 22 - 23, Remove
the module-level call that forcefully reconfigures the root logger; instead stop
using logging.basicConfig(..., force=True) and avoid touching the root logger
with logging.getLogger().setLevel at import time. Update this test module to
either configure a module-scoped logger (use logging.getLogger(__name__) and set
its level) or move root-level configuration into a pytest fixture (use caplog or
a test-scoped setup) so other tests aren’t affected; look for the existing calls
logging.basicConfig and logging.getLogger().setLevel in this file and replace
them with a module logger or a fixture-based configuration.
pymllm/layers/base.py (1)

8-27: Consider adding a class-level docstring for MllmBaseLayer.

This is the base class for all layers in the pymllm.layers package and is part of the public API (re-exported from pymllm.layers.__init__). A brief docstring explaining its role, how subclasses should override weight_loader and forward, and the purpose of quant_recipe would help consumers.

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."
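
One possible wording for that docstring; the attribute and method names are taken from the review text, and the assumption that MllmBaseLayer subclasses torch.nn.Module is illustrative:

```python
class MllmBaseLayer(torch.nn.Module):
    """Base class for all layers in pymllm.layers.

    Subclasses typically override:
        weight_loader(param, loaded_weight): copy (and, for tensor-parallel
            layers, shard) an externally loaded weight into ``param``.
        forward(*args, **kwargs): the layer computation; the base
            implementation raises NotImplementedError.

    Attributes:
        quant_recipe: optional QuantRecipe describing how this layer should
            be quantized; ``None`` means no quantization.
    """
```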

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/base.py` around lines 8 - 27, Add a concise class-level
docstring to MllmBaseLayer describing its role as the base class for layers in
pymllm.layers, documenting the quant_recipe attribute's purpose (optional
QuantRecipe used for quantization settings), describing that subclasses should
override weight_loader to implement custom weight-loading/sharding behavior and
forward to implement the layer's forward pass, and briefly list expected
parameters/returns/behaviour for weight_loader (param: Parameter, loaded_weight:
torch.Tensor) and forward (raise NotImplementedError by default). Reference
MllmBaseLayer, weight_loader, forward, and the quant_recipe attribute in the
docstring so consumers of the public API understand usage and extension points.
pymllm/layers/__init__.py (1)

7-11: __all__ is not sorted (Ruff RUF022).

✏️ Suggested fix
 __all__ = [
     "MllmBaseLayer",
-    "set_weight_attrs",
     "VocabParallelEmbedding",
+    "set_weight_attrs",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/__init__.py` around lines 7 - 11, The __all__ list is not
alphabetically sorted (Ruff RUF022); reorder the exported symbols in the __all__
list so they are lexicographically sorted — e.g., place "MllmBaseLayer",
"VocabParallelEmbedding", and "set_weight_attrs" in alphabetical order (adjust
the list containing MllmBaseLayer, set_weight_attrs, VocabParallelEmbedding
accordingly) to satisfy the linter.
pymllm/quantization/quant_recipe.py (1)

1-3: Add a docstring to the public QuantRecipe class.

This class is referenced as a type annotation in MllmBaseLayer (in pymllm/layers/base.py). A brief docstring explaining its intended purpose would help consumers understand it's a placeholder for quantization configuration.

✏️ Suggested change
 class QuantRecipe:
+    """Configuration recipe for quantization.
+
+    This is a placeholder class that will be extended with quantization
+    parameters and methods as quantization support matures.
+    """
+
     def __init__(self):
         pass

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/quantization/quant_recipe.py` around lines 1 - 3, Add a brief
docstring to the public QuantRecipe class describing its purpose as a
placeholder/structured container for quantization configuration and how it is
intended to be used (e.g., fields or subclasses should represent quantization
settings) so consumers and type annotations (such as MllmBaseLayer in
pymllm/layers/base.py) understand its role; place the docstring immediately
under the class declaration for QuantRecipe and keep it concise, describing
expected responsibilities, typical usage, and any important invariants or
extension points.
pymllm/layers/embedding.py (3)

73-112: Prefer raising exceptions over assert for runtime validation in public API methods.

assert statements are stripped when Python runs with -O (optimized mode), silently disabling these shape checks. Since weight_loader is part of the public API and loads external data, these checks should use explicit exceptions (e.g., ValueError / RuntimeError) to guarantee validation at all times.

Example for one assertion (apply same pattern to others)
-        assert param.data.shape == loaded_weight.shape, (
-            f"Shape mismatch: param {param.data.shape} vs "
-            f"loaded {loaded_weight.shape}"
-        )
+        if param.data.shape != loaded_weight.shape:
+            raise ValueError(
+                f"Shape mismatch: param {param.data.shape} vs "
+                f"loaded {loaded_weight.shape}"
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/embedding.py` around lines 73 - 112, Replace all runtime
validation assert statements in the weight_loader method with explicit
exceptions so checks are always enforced; for each assert in weight_loader (the
shape equality check when no sharding, the loaded_weight vocab-size check using
output_dim vs self.num_embeddings, and the shard shape equality check), raise an
appropriate exception (ValueError or RuntimeError) with the same descriptive
message instead of using assert; ensure both branches (output_dim is None or
tp_size == 1 and the sharded branch that uses
self.vocab_start_index/self.vocab_end_index or .narrow with
self.num_embeddings_per_partition) perform the validations via exceptions before
calling param.data.copy_(...) so errors are never skipped under -O.

31-31: Use explicit Optional[int] or int | None instead of implicit Optional.

PEP 484 prohibits implicit Optional. This was also flagged by Ruff (RUF013).

Proposed fix
-        padding_idx: int = None,
+        padding_idx: int | None = None,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/embedding.py` at line 31, The parameter padding_idx currently
uses an implicit Optional via "padding_idx: int = None"; update its type
annotation to an explicit Optional by changing it to "padding_idx: Optional[int]
= None" (or "padding_idx: int | None = None" for Python 3.10+), and add the
corresponding import from typing (Optional) if you choose that form; ensure you
update the signature where padding_idx is declared so Ruff (RUF013) and PEP 484
rules are satisfied.

44-50: Redundant divisibility check: both the if (Line 44) and divide() (Line 50) validate the same condition.

The explicit ValueError on Line 44 is the better check since it gives a clear message and can't be stripped. The subsequent divide() call on Line 50 internally asserts the same condition. This is harmless but worth noting — the divide() assert is now dead code for this path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/embedding.py` around lines 44 - 50, Keep the explicit
divisibility check and clear ValueError for num_embeddings vs tp_size, but
remove the redundant runtime assertion by avoiding the call to divide() here:
set self.num_embeddings_per_partition using integer division of num_embeddings
by self.tp_size (e.g., compute num_embeddings // self.tp_size) instead of
calling divide(), so the check remains in this function (ValueError in the if)
and the duplicate assert inside divide() is not hit for this path; update
references to num_embeddings_per_partition accordingly.
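For concreteness, a minimal sketch of the constructor fragment this describes — the error message and surrounding attribute names are illustrative, based only on the comment above:

💡 Example sketch
if num_embeddings % self.tp_size != 0:
    # Explicit check: always active, even under python -O.
    raise ValueError(
        f"num_embeddings ({num_embeddings}) must be divisible by "
        f"tensor parallel size ({self.tp_size})"
    )
# Plain integer division; no need to route through divide() and its assert.
self.num_embeddings_per_partition = num_embeddings // self.tp_size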
pymllm/mobile/__init__.py (1)

1-45: Consider defining __all__ to control the public API surface.

Without __all__, a wildcard import (from pymllm.mobile import *) will also pull in internal submodule names like ffi, convertor, utils, quantize, nn, service, and backends. Defining __all__ explicitly would keep the public surface intentional and consistent with how pymllm/__init__.py manages its exports.

Example
__all__ = [
    # dtypes
    "float32", "float16", "bfloat16",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "boolean",
    # devices
    "cpu", "cuda", "qnn",
    # core
    "Tensor", "empty", "echo", "device",
    "is_torch_available", "is_numpy_available",
    "from_torch", "from_numpy",
    "zeros", "ones", "arange", "random",
    # ops
    "matmul",
    # submodules (if intentionally public)
    "ffi", "nn", "backends", "convertor", "utils", "quantize", "service",
]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mobile/__init__.py` around lines 1 - 45, Add an explicit __all__ list
in pymllm.mobile.__init__ to control the public API surface: include the public
symbols currently re-exported from .ffi (float32, float16, bfloat16, int8,
int16, int32, int64, uint8, uint16, uint32, uint64, boolean, cpu, cuda, qnn,
Tensor, empty, echo, device, is_torch_available, is_numpy_available, from_torch,
from_numpy, zeros, ones, arange, random), plus matmul from .nn.functional;
decide whether to also expose submodules (ffi, nn, backends, convertor, utils,
quantize, service) and only include them in __all__ if they are intentionally
public. Ensure the name __all__ is defined at module level so wildcard imports
and documentation reflect the intended public API.
pymllm/__init__.py (1)

26-27: Add a docstring to the public is_mobile_available() function.

Per coding guidelines, public APIs should have clear docstrings explaining purpose and return value.

Proposed fix
 def is_mobile_available() -> bool:
+    """Check whether the mobile (on-device) FFI libraries are available.
+
+    Returns:
+        True if the platform-specific MllmFFIExtension shared library
+        exists under pymllm/lib/, False otherwise.
+    """
     return _has_mobile_libs()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/__init__.py` around lines 26 - 27, Add a concise docstring to the
public function is_mobile_available() that explains its purpose (checking if
mobile-specific dependencies are present), what it returns (bool indicating
availability), and any notes about it delegating to the internal helper
_has_mobile_libs(); place the docstring immediately below the def
is_mobile_available() signature and follow project docstring style (one-line
summary and a short return description).
pymllm/configs/__init__.py (1)

12-21: __all__ is not sorted (Ruff RUF022).

♻️ Proposed fix
 __all__ = [
-    # Main singleton
-    "GlobalConfig",
-    "get_global_config",
-    # Sub configs
-    "ServerConfig",
-    "ModelConfig",
-    "RuntimeConfig",
     "CacheConfig",
+    "GlobalConfig",
+    "ModelConfig",
+    "RuntimeConfig",
+    "ServerConfig",
+    "get_global_config",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/configs/__init__.py` around lines 12 - 21, The __all__ list in
pymllm.configs.__init__ is not alphabetized; update the __all__ definition so
the exported symbol names are sorted alphabetically (e.g., "CacheConfig",
"GlobalConfig", "ModelConfig", "RuntimeConfig", "ServerConfig",
"get_global_config") while preserving it as a list of strings so Ruff RUF022 is
satisfied and the module exports remain the same.
pymllm/configs/global_config.py (1)

173-173: Using base_gpu_id < 0 as an undocumented CPU sentinel is confusing.

The mapping device = "cuda" if base_gpu_id >= 0 else "cpu" encodes CPU selection as a negative GPU index, a convention that is non-obvious and unmentioned in the ServerConfig field comment (# Base GPU index used for process-to-device mapping). A value such as -1 intended to mean "auto" would silently switch the device to CPU.

Consider an explicit device field on ServerConfig (e.g., device: Literal["cuda","cpu","auto"] = "auto") or at least document this sentinel contract on the base_gpu_id field.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/configs/global_config.py` at line 173, The code currently treats
server_config.base_gpu_id < 0 as a hidden "use CPU" sentinel; update
ServerConfig to include an explicit device field (e.g., device:
Literal["cuda","cpu","auto"] = "auto") and change the runtime.device assignment
to honor server_config.device first (if "cuda" set device="cuda", if "cpu" set
device="cpu", if "auto" fall back to using base_gpu_id >= 0 ? "cuda" : "cpu");
also add a brief comment on the ServerConfig.device field explaining the
semantics so the contract is explicit (refer to ServerConfig,
server_config.base_gpu_id, and runtime.device).
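A hedged sketch of what the explicit device field and its resolution could look like; field and function names other than base_gpu_id are illustrative, not existing code:

💡 Example sketch
from dataclasses import dataclass
from typing import Literal


@dataclass
class ServerConfig:
    # Base GPU index used for process-to-device mapping.
    base_gpu_id: int = 0
    # Explicit device selection; "auto" falls back to the base_gpu_id sentinel.
    device: Literal["cuda", "cpu", "auto"] = "auto"


def resolve_device(server: ServerConfig) -> str:
    """Map a ServerConfig to the runtime device string."""
    if server.device != "auto":
        return server.device
    return "cuda" if server.base_gpu_id >= 0 else "cpu"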

- Updated `.codespellrc` to include 'flashinfer' in the ignore words list.
- Introduced new files for `launch_server`, `prepare`, and various layer implementations including `LayerNorm`, `RMSNorm`, and `MLP`.
- Added `ColumnParallelLinear` and `RowParallelLinear` classes for efficient linear operations in tensor parallelism.
- Implemented rotary embedding functions in `rope.py` for enhanced model performance.
- Created caching mechanisms in `param_disk_cache.py` and `radix_cache.py` for improved memory management.
- Refactored `GroupCoordinator` to enhance broadcasting functionality in distributed settings.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
pymllm/layers/rms_norm.py (1)

19-20: weight initialized with torch.empty — no safe default if weights aren't loaded.

RMSNorm.weight (and GemmaRMSNorm.weight at line 47) use torch.empty, which contains arbitrary memory. LayerNorm initializes weight to torch.ones and bias to torch.zeros, providing a usable identity default. For RMSNorm, weight = 1 is the standard identity initialization. Consider using torch.ones for consistency and safety against accidental use before weight loading.

♻️ Proposed fix
-        self.weight = Parameter(torch.empty(hidden_size))
+        self.weight = Parameter(torch.ones(hidden_size))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/rms_norm.py` around lines 19 - 20, RMSNorm and GemmaRMSNorm
initialize self.weight with torch.empty which may contain garbage if weights
aren't loaded; change both initializations to use a safe identity default
(torch.ones) so the Parameter is created with ones (e.g., replace
Parameter(torch.empty(hidden_size)) with Parameter(torch.ones(hidden_size)))
while keeping the subsequent set_weight_attrs(self.weight, {"weight_loader":
self.weight_loader}) calls intact.
pymllm/layers/linear.py (2)

86-111: Shape checks use assert — silently disabled under python -O.

The assert statements in weight_loader (lines 96-100, 107-111) serve as data-integrity guards. In optimized mode (python -O / PYTHONOPTIMIZE), assertions are stripped, which would silently allow shape-mismatched weight loads, potentially causing hard-to-debug corruption. The same pattern exists in RowParallelLinear.weight_loader and VocabParallelEmbedding.weight_loader.

Consider converting to explicit if/raise ValueError(...) for safety-critical checks. This is a pattern-level suggestion applicable to all weight_loader methods.

♻️ Example fix (ColumnParallelLinear.weight_loader)
         if output_dim is None or self.tp_size == 1:
-            assert param.data.shape == loaded_weight.shape, (
-                f"Shape mismatch: param {param.data.shape} vs "
-                f"loaded {loaded_weight.shape}"
-            )
+            if param.data.shape != loaded_weight.shape:
+                raise ValueError(
+                    f"Shape mismatch: param {param.data.shape} vs "
+                    f"loaded {loaded_weight.shape}"
+                )
             param.data.copy_(loaded_weight)
         else:
             shard_weight = loaded_weight.narrow(
                 output_dim,
                 self.output_start_index,
                 self.out_features_per_partition,
             )
-            assert param.data.shape == shard_weight.shape, (
-                f"Shard shape mismatch: param {param.data.shape} vs "
-                f"shard {shard_weight.shape}"
-            )
+            if param.data.shape != shard_weight.shape:
+                raise ValueError(
+                    f"Shard shape mismatch: param {param.data.shape} vs "
+                    f"shard {shard_weight.shape}"
+                )
             param.data.copy_(shard_weight)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/linear.py` around lines 86 - 111, The weight_loader methods use
assert for shape checks which are disabled with python -O; replace the asserts
in ColumnParallelLinear.weight_loader (and the same pattern in
RowParallelLinear.weight_loader and VocabParallelEmbedding.weight_loader) with
explicit runtime checks that raise a ValueError containing the expected and
actual shapes and context (e.g., include param.data.shape and
loaded_weight.shape or shard_weight.shape and mention shard vs full load),
keeping the same control flow for the tp_size/output_dim branch and copying
behavior when shapes match.

72-84: Remove bias_flag from ColumnParallelLinear and RowParallelLinear.

self.bias_flag is set in both branches (lines 73, 83 in ColumnParallelLinear; lines 178, 182 in RowParallelLinear) but never read. The canonical way to check for bias is if self.bias is not None:, which is already done in RowParallelLinear.forward (line 218). Since the information is already captured in whether self.bias is a Parameter or None, bias_flag is redundant dead state and should be removed.

Note: The Linear class (lines 224–251) does not use bias_flag and does not need changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/linear.py` around lines 72 - 84, The classes
ColumnParallelLinear and RowParallelLinear define and set self.bias_flag but
never read it; remove this redundant attribute and rely on the existing bias
presence check (if self.bias is not None). Update the constructors in
ColumnParallelLinear and RowParallelLinear to stop assigning self.bias_flag in
both branches (when bias is truthy and when registering None via
register_parameter("bias", None)), leaving only the existing Parameter creation
and set_weight_attrs path and the register_parameter("bias", None) path; ensure
any forward logic continues to check bias via self.bias is not None (e.g.,
RowParallelLinear.forward) and no other code references bias_flag.
pymllm/layers/__init__.py (1)

18-35: Sort __all__ alphabetically (Ruff RUF022).

Ruff reports __all__ is not in isort-style sorted order, which will cause CI linting to fail.

♻️ Proposed fix
 __all__ = [
-    "MllmBaseLayer",
-    "set_weight_attrs",
-    "VocabParallelEmbedding",
     "ColumnParallelLinear",
+    "GemmaRMSNorm",
+    "LayerNorm",
     "Linear",
-    "RowParallelLinear",
     "MLP",
+    "MllmBaseLayer",
     "ParallelMLP",
-    "LayerNorm",
     "RMSNorm",
-    "GemmaRMSNorm",
-    "apply_rope",
+    "RowParallelLinear",
+    "VocabParallelEmbedding",
     "apply_llama31_rope",
-    "apply_rope_pos_ids",
     "apply_llama31_rope_pos_ids",
+    "apply_rope",
+    "apply_rope_pos_ids",
     "apply_rope_with_cos_sin_cache",
+    "set_weight_attrs",
 ]

As per coding guidelines, "Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python)" — maintaining a sorted __all__ is the idiomatic convention enforced by the project's Ruff configuration (RUF022).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/__init__.py` around lines 18 - 35, The __all__ list in the
module is not alphabetically sorted and triggers Ruff RUF022; reorder the
entries in the __all__ list (the list containing "MllmBaseLayer",
"set_weight_attrs", "VocabParallelEmbedding", "ColumnParallelLinear", "Linear",
"RowParallelLinear", "MLP", "ParallelMLP", "LayerNorm", "RMSNorm",
"GemmaRMSNorm", "apply_rope", "apply_llama31_rope", "apply_rope_pos_ids",
"apply_llama31_rope_pos_ids", "apply_rope_with_cos_sin_cache") into strict
alphabetical order (by string) to satisfy isort-style sorting and eliminate the
lint error.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pymllm/layers/base.py`:
- Line 4: Remove the unused import set_weight_attrs: delete the line that
imports set_weight_attrs from pymllm.layers.utils in this module (the import of
set_weight_attrs is unused in this file, so just remove that identifier to clean
up imports).

In `@pymllm/layers/embedding.py`:
- Line 31: The parameter annotation for padding_idx in pymllm.layers.embedding
(the function/class signature where padding_idx: int = None is declared) uses an
implicit Optional which violates PEP 484; update the annotation to be explicit
by changing padding_idx to either "int | None = None" (and add "from __future__
import annotations" at the top) or to "Optional[int] = None" and add "from
typing import Optional" to imports so the type is explicit and Ruff RUF013 is
resolved.

---

Nitpick comments:
In `@pymllm/layers/__init__.py`:
- Around line 18-35: The __all__ list in the module is not alphabetically sorted
and triggers Ruff RUF022; reorder the entries in the __all__ list (the list
containing "MllmBaseLayer", "set_weight_attrs", "VocabParallelEmbedding",
"ColumnParallelLinear", "Linear", "RowParallelLinear", "MLP", "ParallelMLP",
"LayerNorm", "RMSNorm", "GemmaRMSNorm", "apply_rope", "apply_llama31_rope",
"apply_rope_pos_ids", "apply_llama31_rope_pos_ids",
"apply_rope_with_cos_sin_cache") into strict alphabetical order (by string) to
satisfy isort-style sorting and eliminate the lint error.

In `@pymllm/layers/linear.py`:
- Around line 86-111: The weight_loader methods use assert for shape checks
which are disabled with python -O; replace the asserts in
ColumnParallelLinear.weight_loader (and the same pattern in
RowParallelLinear.weight_loader and VocabParallelEmbedding.weight_loader) with
explicit runtime checks that raise a ValueError containing the expected and
actual shapes and context (e.g., include param.data.shape and
loaded_weight.shape or shard_weight.shape and mention shard vs full load),
keeping the same control flow for the tp_size/output_dim branch and copying
behavior when shapes match.
- Around line 72-84: The classes ColumnParallelLinear and RowParallelLinear
define and set self.bias_flag but never read it; remove this redundant attribute
and rely on the existing bias presence check (if self.bias is not None). Update
the constructors in ColumnParallelLinear and RowParallelLinear to stop assigning
self.bias_flag in both branches (when bias is truthy and when registering None
via register_parameter("bias", None)), leaving only the existing Parameter
creation and set_weight_attrs path and the register_parameter("bias", None)
path; ensure any forward logic continues to check bias via self.bias is not None
(e.g., RowParallelLinear.forward) and no other code references bias_flag.

In `@pymllm/layers/rms_norm.py`:
- Around line 19-20: RMSNorm and GemmaRMSNorm initialize self.weight with
torch.empty which may contain garbage if weights aren't loaded; change both
initializations to use a safe identity default (torch.ones) so the Parameter is
created with ones (e.g., replace Parameter(torch.empty(hidden_size)) with
Parameter(torch.ones(hidden_size))) while keeping the subsequent
set_weight_attrs(self.weight, {"weight_loader": self.weight_loader}) calls
intact.

import torch
from torch import nn
from torch.nn import Parameter
from pymllm.layers.utils import set_weight_attrs

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if the file exists and view the first 20 lines to see the import
echo "=== First 20 lines of pymllm/layers/base.py ==="
head -20 pymllm/layers/base.py

echo ""
echo "=== Search for any usage of set_weight_attrs in the file ==="
rg -n 'set_weight_attrs' pymllm/layers/base.py || echo "No matches found"

echo ""
echo "=== Full file line count ==="
wc -l pymllm/layers/base.py

Repository: UbiquitousLearning/mllm

Length of output: 971


Remove unused import set_weight_attrs.

Line 4 imports set_weight_attrs from pymllm.layers.utils, but it is never used in this file. Remove the import to keep the base module clean.

♻️ Proposed fix
-from pymllm.layers.utils import set_weight_attrs
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from pymllm.layers.utils import set_weight_attrs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/base.py` at line 4, Remove the unused import set_weight_attrs:
delete the line that imports set_weight_attrs from pymllm.layers.utils in this
module (the import of set_weight_attrs is unused in this file, so just remove
that identifier to clean up imports).

self,
num_embeddings: int,
embedding_dim: int,
padding_idx: int = None,

⚠️ Potential issue | 🟡 Minor

PEP 484 violation: implicit Optional for padding_idx.

padding_idx: int = None should be padding_idx: int | None = None (or Optional[int] = None). Ruff correctly flags this as RUF013. The file doesn't currently import from __future__ import annotations or Optional, so either add the import or use the Optional form.

✏️ Proposed fix
+from __future__ import annotations
+
 import torch
 import torch.nn.functional as F
 from torch.nn import Parameter
-        padding_idx: int = None,
+        padding_idx: int | None = None,
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 31-31: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/embedding.py` at line 31, The parameter annotation for
padding_idx in pymllm.layers.embedding (the function/class signature where
padding_idx: int = None is declared) uses an implicit Optional which violates
PEP 484; update the annotation to be explicit by changing padding_idx to either
"int | None = None" (and add "from __future__ import annotations" at the top) or
to "Optional[int] = None" and add "from typing import Optional" to imports so
the type is explicit and Ruff RUF013 is resolved.

chenghuaWang and others added 3 commits February 19, 2026 09:28
- Introduced a new architecture diagram image `pymllm-arch.png` in the assets directory.
- Updated `README.md` to include the architecture diagram.
- Created initial `launch.py` files in both the engine and server directories for future functionality.
- Added an empty `scheduler.py` file in the orchestrator directory to support scheduling features.
- Updated `apache-tvm-ffi` version to `0.1.8.post2` in both `pyproject.toml` files.
- Added `pyzmq` to the optional `cuda` dependencies in `pymllm`.
- Introduced `pymllm-server` script for server launch functionality.
- Refactored configuration imports in `pymllm/configs/__init__.py` to streamline access to model and quantization configurations.
- Created new configuration files for model and quantization settings to support enhanced model management.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

🧹 Nitpick comments (14)
pymllm/tests/test_vocab_parallel_embedding.py (5)

87-87: world_size is unused in both worker functions.

Both embedding_forward_tp8_worker_cuda and weight_loading_tp8_worker_cuda declare world_size but never reference it (Ruff ARG001). Remove the parameter from both signatures and the corresponding call sites in run_worker_tp8_cuda.

Also applies to: 147-147

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/tests/test_vocab_parallel_embedding.py` at line 87, Remove the unused
world_size parameter from the worker function signatures and their invocations:
delete the world_size argument from embedding_forward_tp8_worker_cuda and
weight_loading_tp8_worker_cuda, and update run_worker_tp8_cuda (and any other
call sites) to call these functions with only (rank, local_rank); ensure any
function definitions, type hints, and references to world_size are removed so no
unused-argument lint errors remain.

197-277: Both test methods are identical boilerplate — extract a helper.

test_forward_pass_tp8_real and test_weight_loading_tp8_real share ~30 lines of identical process-spawning, joining, and result-checking logic. Extract a shared _run_tp8_test(test_func) helper to eliminate the duplication.

♻️ Proposed refactor
+    def _run_tp8_test(self, test_func: Callable, port: int) -> None:
+        world_size = 8
+        manager = mp.Manager()
+        return_dict = manager.dict()
+        processes = [
+            mp.Process(
+                target=run_worker_tp8_cuda,
+                args=(rank, rank, world_size, test_func, return_dict, port),
+            )
+            for rank in range(world_size)
+        ]
+        for p in processes:
+            p.start()
+        for p in processes:
+            p.join(timeout=120)
+            if p.is_alive():
+                p.terminate()
+                p.join()
+        for rank in range(world_size):
+            result = return_dict.get(rank, "TIMEOUT")
+            expected = "PASSED" if rank == 0 else "OK"
+            assert result == expected, f"Rank {rank} failed: {result}"

     def test_forward_pass_tp8_real(self):
         """Test forward pass with real TP=8 using 8 processes on CUDA."""
-        world_size = 8
-        local_world_size = 8
-        mp.set_start_method("spawn", force=True)
-        manager = mp.Manager()
-        return_dict = manager.dict()
-        processes = []
-        for rank in range(world_size):
-            local_rank = rank
-            p = mp.Process(...)
-            p.start()
-            processes.append(p)
-        for p in processes:
-            p.join(timeout=120)
-            ...
-        for rank in range(world_size):
-            ...
+        self._run_tp8_test(embedding_forward_tp8_worker_cuda, _get_free_port())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 197 - 277, Both
tests duplicate the same multiprocessing boilerplate; extract a helper
_run_tp8_test(test_worker_fn) that encapsulates setting
mp.set_start_method("spawn", force=True), creating mp.Manager/return_dict,
spawning mp.Process for ranks 0..7 with local_rank==rank calling
run_worker_tp8_cuda with the provided worker function
(embedding_forward_tp8_worker_cuda or weight_loading_tp8_worker_cuda),
joining/terminating processes with timeout, and performing the same assertions
on return_dict; then replace test_forward_pass_tp8_real to call
_run_tp8_test(embedding_forward_tp8_worker_cuda) and
test_weight_loading_tp8_real to call
_run_tp8_test(weight_loading_tp8_worker_cuda).

48-48: Unused local_world_size parameter; import traceback should be top-level.

  • local_world_size is never used in the function body (Ruff ARG001). Either remove it or document its intended use.
  • import traceback inside the except block works but defers the import unnecessarily. Move it to the top of the file alongside other stdlib imports.
✏️ Proposed fix
+import traceback
 import os
 import logging
 ...
 def run_worker_tp8_cuda(
     rank: int,
     local_rank: int,
     world_size: int,
-    local_world_size: int,
     test_func: Callable,
     return_dict: dict,
+    master_port: int = 29500,
 ):
     except Exception as e:
-        import traceback
-
         return_dict[rank] = f"ERROR: {e}\n{traceback.format_exc()}"

Also applies to: 79-82

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/tests/test_vocab_parallel_embedding.py` at line 48, Remove the unused
local_world_size parameter from the function signature (it's unused and triggers
ARG001) or if it was intended for test configuration, add a comment explaining
its purpose and use it accordingly; also move the import traceback out of the
except block(s) and place it with the other top-level stdlib imports so
traceback is imported once at module import time (look for occurrences of
local_world_size in the test function signature and the traceback import inside
the except blocks around the same test).

293-308: Test methods in TestVocabParallelEmbeddingCUDA are missing docstrings.

test_cuda_forward and test_cuda_weight_loader have no docstrings, inconsistent with the other worker functions in the file and with the coding guidelines requiring docstrings for public functions.

✏️ Proposed fix
     def test_cuda_forward(self):
+        """Verify forward pass shape and device placement in TP=1 CUDA mode."""
         layer = VocabParallelEmbedding(1000, 512).cuda()

     def test_cuda_weight_loader(self):
+        """Verify that weight_loader correctly copies weights in TP=1 CUDA mode."""
         layer = VocabParallelEmbedding(100, 64).cuda()

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 293 - 308, Add
concise docstrings to the two test methods test_cuda_forward and
test_cuda_weight_loader in TestVocabParallelEmbeddingCUDA: for test_cuda_forward
describe that it verifies the VocabParallelEmbedding moves to CUDA and returns a
tensor of shape (batch, seq, embed_dim) on GPU; for test_cuda_weight_loader
describe that it checks load_weight correctly copies a CPU tensor into the
module weight on CUDA and that values match after moving back to CPU; keep them
short (one or two sentences) and follow the file's existing docstring style.

98-98: Replace magic numbers with named constants.

8 (TP size), 29500 (master port), 1024/64 (vocab/embed dims) appear as bare literals in multiple places. Define module-level constants to make intent clear and changes local.

✏️ Proposed fix
+_TP_SIZE_TEST = 8
+_MASTER_PORT_BASE = 29500
+_VOCAB_SIZE = 1024
+_EMBED_DIM = 64

Then replace occurrences like vocab_size // 8 with _VOCAB_SIZE // _TP_SIZE_TEST, etc.

As per coding guidelines, "Use named constants instead of magic numbers."

Also applies to: 111-111, 167-167, 177-177

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/tests/test_vocab_parallel_embedding.py` at line 98, Replace the
hard-coded literals with module-level named constants: introduce constants such
as _TP_SIZE_TEST (for TP size 8), _MASTER_PORT_TEST (for port 29500), _VOCAB_SIZE
(for the vocab size 1024), and _EMBED_DIM (for the embedding dim 64) at the top of
the test module and use those constants where the literals appear (e.g., replace
tp_size == 8 -> tp_size == _TP_SIZE_TEST, vocab_size // 8 -> _VOCAB_SIZE //
_TP_SIZE_TEST, and uses of 1024/64 -> _VOCAB_SIZE/_EMBED_DIM, and replace the
master port literal with _MASTER_PORT_TEST); update all occurrences referenced
in the review (the assertions and calculations around tp_size, master port,
vocab/embed dimensions) so intent is clear and changes remain local to the test
module.
pymllm/orchestrator/parallel_state.py (3)

50-56: Replace assert with an explicit ValueError for input validation.

assert statements are stripped when Python is run with the -O flag, silently disabling this critical product-size check. As per coding guidelines, functions that can fail should raise exceptions.

🛡️ Proposed fix
-    assert (
-        tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size
-        == world_size
-    ), (
-        f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * "
-        f"PP({pipeline_model_parallel_size}) != World({world_size})"
-    )
+    if (
+        tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size
+        != world_size
+    ):
+        raise ValueError(
+            f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * "
+            f"PP({pipeline_model_parallel_size}) != World({world_size})"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/parallel_state.py` around lines 50 - 56, Replace the
runtime-only assert with explicit input validation that raises a ValueError:
check that tensor_model_parallel_size * data_parallel_size *
pipeline_model_parallel_size == world_size and if not raise ValueError with the
same descriptive message (use the existing f-string built from
tensor_model_parallel_size, data_parallel_size, pipeline_model_parallel_size,
and world_size) so the check remains active even when Python is run with -O;
update the code around the current assert in parallel_state.py accordingly.

30-35: initialize_model_parallel and communication helpers are missing docstrings.

The primary public entry point and the three communication helpers (tensor_model_parallel_all_reduce, tensor_model_parallel_all_gather, data_parallel_all_reduce) have no docstrings documenting parameters, behavior, or the passthrough semantics for the world_size=1 case. As per coding guidelines, public APIs, classes, and functions must have clear docstrings explaining purpose, parameters, returns, and errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/parallel_state.py` around lines 30 - 35, Add
comprehensive docstrings to the public API function initialize_model_parallel
and the three communication helper functions tensor_model_parallel_all_reduce,
tensor_model_parallel_all_gather, and data_parallel_all_reduce: for each
docstring describe the function’s purpose, parameters (including
tensor_model_parallel_size, data_parallel_size, pipeline_model_parallel_size,
backend, and the tensor argument for helpers), return value, raised errors, and
the passthrough semantics when world_size == 1 (i.e., no-op behavior / return
input unchanged). Use the project’s docstring style (short summary line,
parameter list, returns, and notes about behavior), and place them immediately
above the corresponding def lines (initialize_model_parallel,
tensor_model_parallel_all_reduce, tensor_model_parallel_all_gather,
data_parallel_all_reduce).
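To illustrate the level of detail being asked for, a possible docstring for initialize_model_parallel — the signature shown is inferred from this review and may not match the actual code:

💡 Example sketch
def initialize_model_parallel(
    tensor_model_parallel_size: int = 1,
    data_parallel_size: int = 1,
    pipeline_model_parallel_size: int = 1,
    backend: str = "nccl",
) -> None:
    """Create the TP/DP/PP process groups for the current rank.

    Args:
        tensor_model_parallel_size: Number of ranks per tensor-parallel group.
        data_parallel_size: Number of ranks per data-parallel group.
        pipeline_model_parallel_size: Number of ranks per pipeline-parallel group.
        backend: torch.distributed backend used to build the sub-groups.

    Raises:
        ValueError: If TP * DP * PP does not equal the world size.

    Notes:
        With world_size == 1 no groups are created and the communication
        helpers (all-reduce / all-gather) act as passthroughs.
    """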

155-156: model_parallel_is_initialized() returns False for the primary single-GPU use case.

When initialize_model_parallel() is called with all sizes equal to 1 (the documented default/expected path), no groups are created, so this predicate always returns False. Callers cannot distinguish "never initialized" from "initialized for single-GPU mode", which can silently suppress downstream logic that gates on this check.

♻️ Proposed fix: use a dedicated flag
+_INITIALIZED: bool = False

 def initialize_model_parallel(
     ...
 ) -> None:
-    global _TP_GROUP, _DP_GROUP, _PP_GROUP
-    global _TP_RANK, _TP_SIZE, _DP_RANK, _DP_SIZE, _PP_RANK, _PP_SIZE
+    global _TP_GROUP, _DP_GROUP, _PP_GROUP, _INITIALIZED
+    global _TP_RANK, _TP_SIZE, _DP_RANK, _DP_SIZE, _PP_RANK, _PP_SIZE
     ...
+    _INITIALIZED = True

 def model_parallel_is_initialized() -> bool:
-    return _TP_GROUP is not None or _DP_GROUP is not None or _PP_GROUP is not None
+    return _INITIALIZED
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/parallel_state.py` around lines 155 - 156, The current
model_parallel_is_initialized() wrongly infers initialization by checking
_TP_GROUP/_DP_GROUP/_PP_GROUP and thus returns False for the valid single-GPU
path; add a dedicated boolean flag (e.g., _MODEL_PARALLEL_INITIALIZED) that
initialize_model_parallel() sets to True unconditionally when it completes (even
if it creates no groups), have model_parallel_is_initialized() return that flag
instead of checking the group variables, and ensure any teardown/reset code
clears _MODEL_PARALLEL_INITIALIZED so future checks reflect actual state;
reference initialize_model_parallel, model_parallel_is_initialized, _TP_GROUP,
_DP_GROUP, and _PP_GROUP when applying this change.
pymllm/orchestrator/request_response_process.py (1)

8-10: Add a class-level docstring to RequestResponseProcess.

The module docstring explains the intent but the class itself lacks a docstring, violating the coding guideline for public APIs.

💡 Suggested scaffold
 class RequestResponseProcess:
-    def __init__(self):
-        pass
+    """Main orchestrator thread handling incoming requests and outgoing responses.
+
+    Must be used only as the main thread of the orchestrator.
+    """
+
+    def __init__(self) -> None:
+        pass

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/request_response_process.py` around lines 8 - 10, Add a
class-level docstring to RequestResponseProcess describing its purpose and
behavior as a public API entry point; update the class definition (class
RequestResponseProcess and its __init__ method) to include a concise docstring
that explains what the class does, expected inputs/state, any important
attributes, return behavior of its public methods, and possible
errors/exceptions raised so callers and docs are clear.
pymllm/configs/quantization_config.py (1)

9-18: Consider adding __post_init__ runtime validation for kv_cache_dtype.

Literal type hints are not enforced at runtime, so an invalid string (e.g., "float32") would silently be accepted and only fail later when the engine tries to use it.

💡 Suggested addition
+    _VALID_KV_DTYPES = {"auto", "float16", "bfloat16", "fp8_e4m3", "fp8_e5m2"}
+
+    def __post_init__(self) -> None:
+        if self.kv_cache_dtype not in self._VALID_KV_DTYPES:
+            raise ValueError(
+                f"Invalid kv_cache_dtype {self.kv_cache_dtype!r}. "
+                f"Must be one of {self._VALID_KV_DTYPES}."
+            )

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/configs/quantization_config.py` around lines 9 - 18, The
kv_cache_dtype field on QuantizationConfig is only type-hinted and can accept
invalid runtime strings; add a __post_init__ method on the QuantizationConfig
dataclass that validates self.kv_cache_dtype is one of the allowed literals
("auto","float16","bfloat16","fp8_e4m3","fp8_e5m2") and raise a ValueError with
a clear message if not; place this validation inside the QuantizationConfig
class so it runs on instantiation and include the invalid value in the error for
easier debugging.
pymllm/configs/model_config.py (1)

28-31: Shorten the AttributeError message to resolve Ruff TRY003.

The multi-line string concatenation in the raise is flagged by Ruff (TRY003). A tighter single-line message resolves the warning without losing clarity.

💡 Proposed fix
-        raise AttributeError(
-            f"'{type(self).__name__}' has no attribute '{name}' "
-            f"(also not found on hf_config)"
-        )
+        raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/configs/model_config.py` around lines 28 - 31, The raised
AttributeError message in the code uses multi-line concatenation which triggers
Ruff TRY003; update the raise in the relevant method (e.g., the __getattr__ in
model_config.py that references type(self).__name__, name and hf_config) to a
single-line f-string such as: raise AttributeError(f"'{type(self).__name__}' has
no attribute '{name}' (also not found on hf_config)") so the message remains
clear but is a single string literal.
pymllm/orchestrator/tokenizer_process.py (1)

1-3: Add module and class docstrings.

TokenizerProcess is a public API class but has neither a module-level docstring nor a class docstring explaining its purpose and intended lifecycle. The same pattern applies to the other bare placeholder classes in this orchestrator (AsyncDiskIoProcess, DetokenizerProcess, ModelRunnerProcess).

💡 Suggested scaffold
+"""Tokenizer worker process for the orchestrator pipeline."""
+
+
 class TokenizerProcess:
-    def __init__(self):
-        pass
+    """Runs tokenization in a dedicated process.
+
+    Converts raw text requests into token sequences and forwards them
+    to the model runner. Intended to be managed by the orchestrator.
+    """
+
+    def __init__(self) -> None:
+        pass

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/tokenizer_process.py` around lines 1 - 3, Add a
module-level docstring at the top of tokenizer_process.py describing the
module's role in the orchestrator (responsible for running tokenization in a
separate process and its expected lifecycle), and add a class docstring to
TokenizerProcess that explains its purpose, public API expectations, lifecycle
(init, start/stop behavior if any), parameters (none for __init__ currently),
return values, and possible errors; apply the same pattern to the other
placeholder classes mentioned (AsyncDiskIoProcess, DetokenizerProcess,
ModelRunnerProcess) so each public class and its module have clear, concise
docstrings documenting usage and responsibilities.
pymllm/configs/__init__.py (1)

8-14: __all__ is not sorted (Ruff RUF022).

♻️ Proposed fix
 __all__ = [
     "GlobalConfig",
-    "get_global_config",
-    "ServerConfig",
     "ModelConfig",
     "QuantizationConfig",
+    "ServerConfig",
+    "get_global_config",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/configs/__init__.py` around lines 8 - 14, The __all__ list in
pymllm/configs/__init__.py is not alphabetically ordered; update the __all__
variable so its items are sorted lexicographically (e.g., arrange
"GlobalConfig", "ModelConfig", "QuantizationConfig", "ServerConfig",
"get_global_config" or whatever alphabetical order is correct for the exact
names) to satisfy Ruff RUF022 and ensure predictable exports; locate the __all__
definition and reorder the entries accordingly.
pymllm/configs/global_config.py (1)

281-295: Duplicate ServerConfig import inside read_args.

ServerConfig is already imported at the top of the module (line 21). The in-function import on line 282 is redundant.

♻️ Proposed fix
     # Server: reconstruct to preserve validation behavior.
-    from pymllm.configs.server_config import ServerConfig
-
     server_updates: dict[str, Any] = {}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/configs/global_config.py` around lines 281 - 295, Remove the redundant
in-function import of ServerConfig inside the read_args block and rely on the
module-level import already present; specifically delete the line "from
pymllm.configs.server_config import ServerConfig" within the code that builds
server_updates (the block that references cfg.server, server_updates,
server_values, and reassigns cfg.server = ServerConfig(**server_values)), and
run linters/tests to ensure no unused-import or name-resolution issues remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pymllm/configs/global_config.py`:
- Around line 307-316: The code mutates cfg.server.model_path directly which
bypasses ServerConfig.__post_init__ (so tokenizer_path stays stale); instead of
assigning cfg.server.model_path = cfg.model.model_path, reinstantiate the
ServerConfig (e.g., replace cfg.server with ServerConfig(... ) or call its
constructor/validator) using the updated model_path so __post_init__ runs and
tokenizer_path is set correctly; update the branch that checks
model_model_overridden and server_model_overridden to reconstruct cfg.server
from cfg.server fields with model_path = cfg.model.model_path rather than doing
a bare field assignment.
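If ServerConfig is a dataclass (as its __post_init__ suggests), one concise way to express the reconstruction is dataclasses.replace, which re-runs __init__ and therefore __post_init__ — a sketch, not existing code:

💡 Example sketch
from dataclasses import replace

# Instead of mutating the field in place:
#   cfg.server.model_path = cfg.model.model_path
# rebuild the config so tokenizer_path is recomputed in __post_init__:
cfg.server = replace(cfg.server, model_path=cfg.model.model_path)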

In `@pymllm/configs/server_config.py`:
- Around line 97-101: The code currently sets served_model_name =
str(self.model_path) even when model_path is None, producing the string "None"
and masking an unset model; update the initialization so served_model_name is
only assigned from model_path when model_path is not None (e.g. if
self.served_model_name is None and self.model_path is not None:
self.served_model_name = str(self.model_path)), leaving served_model_name as
None otherwise; keep the existing tokenizer_path handling and ensure
ServerConfig._validate() is still called afterwards.
- Line 43: Change the type annotation of the field max_prefill_tokens in
server_config.py from int to Optional[int] to match its default None and the
existing guard in _validate; import Optional from typing (or use
typing.Optional) if not already imported, and update any related type
hints/annotations that reference max_prefill_tokens to use Optional[int] so
static type checkers no longer report that int cannot be None.

In `@pymllm/engine/launch.py`:
- Around line 29-32: Replace the TODO marker in the _launch_processes method
docstring with the required format by changing "TODO issue processes here" to
"TODO: issue processes here" (or expand it to a real actionable sentence),
ensuring the TODO: prefix is used as per guidelines in the _launch_processes
function/method documentation.
- Around line 5-15: The file imports unused symbols that create unconditional
ImportError risk; move the imports for zmq, torch.multiprocessing as mp,
TokenizerProcess, DetokenizerProcess, ModelRunnerProcess, and AsyncDiskIoProcess
into a typing-only guard (e.g., inside a TYPE_CHECKING block) or remove them
until they are actually needed so importing pymllm.engine.launch won’t fail when
optional dependencies like pyzmq/torch aren’t installed; update the top of the
module to from typing import TYPE_CHECKING and place those six imports inside if
TYPE_CHECKING: (or delete them) while leaving used imports (AutoConfig,
snapshot_download, get_global_config, RequestResponseProcess) untouched.
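A sketch of the guarded-import layout described above; the module paths for the process classes follow the file names in this PR, and the rest of launch.py is assumed:

💡 Example sketch
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Needed only for type hints; keeps `import pymllm.engine.launch` working
    # when optional dependencies such as pyzmq or torch are not installed.
    import zmq
    import torch.multiprocessing as mp
    from pymllm.orchestrator.tokenizer_process import TokenizerProcess
    from pymllm.orchestrator.detokenizer_process import DetokenizerProcess
    from pymllm.orchestrator.model_runner_process import ModelRunnerProcess
    from pymllm.orchestrator.async_disk_io_process import AsyncDiskIoProcess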

In `@pymllm/orchestrator/parallel_state.py`:
- Around line 36-41: Reset and destroy stale global parallel group and rank
state at the start of the initializer: at function entry clear _TP_GROUP,
_DP_GROUP, _PP_GROUP and their rank/size globals (_TP_RANK, _TP_SIZE, _DP_RANK,
_DP_SIZE, _PP_RANK, _PP_SIZE) by checking if an existing ProcessGroup exists and
calling torch.distributed.destroy_process_group on it before reassigning new
values; then proceed to set _TP_SIZE/_DP_SIZE/_PP_SIZE and recreate the groups
so the accessor functions won't return stale ProcessGroup objects. Ensure the
logic references the existing globals (_TP_GROUP, _DP_GROUP, _PP_GROUP) and uses
torch.distributed.destroy_process_group() to clean them up before replacement.
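A sketch of the reset step, assuming the three globals hold torch.distributed ProcessGroup handles (if they are GroupCoordinator wrappers, the underlying group would need to be unwrapped first):

💡 Example sketch
import torch.distributed as dist

def _destroy_stale_groups() -> None:
    """Tear down previously created TP/DP/PP groups before re-initializing."""
    global _TP_GROUP, _DP_GROUP, _PP_GROUP
    for group in (_TP_GROUP, _DP_GROUP, _PP_GROUP):
        if group is not None:
            dist.destroy_process_group(group)
    _TP_GROUP = _DP_GROUP = _PP_GROUP = None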

In `@pymllm/server/launch.py`:
- Around line 10-13: main() calls engine.launch() but Engine lacks that method;
add a public launch(self) method to the Engine class (pymllm/engine/launch.py)
that orchestrates startup by invoking the existing private helpers in the
correct order: call _set_default_torch_dtype(), _config_logging(),
_check_model_and_tokenizer(), _maybe_download(), and then _launch_processes();
ensure launch handles/propagates exceptions consistently with the rest of the
class (no signature change) so main() can call engine.launch() without raising
AttributeError.
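A sketch of the launch() wrapper this asks for; the helper names come from the comment above and their bodies are assumed to exist on Engine:

💡 Example sketch
class Engine:
    def launch(self) -> None:
        """Start the engine: configure, validate, fetch the model, then spawn processes."""
        self._set_default_torch_dtype()
        self._config_logging()
        self._check_model_and_tokenizer()
        self._maybe_download()
        self._launch_processes()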

In `@pymllm/tests/test_vocab_parallel_embedding.py`:
- Line 202: mp.set_start_method("spawn", force=True) is being invoked inside
individual test functions; remove that call from the test (e.g., tests in
test_vocab_parallel_embedding.py) and instead set the start method in a
session-scoped pytest fixture in conftest.py (or wrap the call under a script
guard). Create a fixture (session scope) that calls mp.set_start_method("spawn",
force=True) once before tests run and ensure tests reference no direct call to
mp.set_start_method; alternatively protect the existing call with if __name__ ==
"__main__" so it never executes during pytest collection. Ensure you remove or
comment out any other in-test occurrences (the line at 243 as noted) so
multiprocessing start method is set only once at session setup.
- Around line 231-236: The loop that checks per-rank results uses
return_dict.get(rank, "TIMEOUT") which allows a timed-out worker to produce
"TIMEOUT" and slip past the non-zero-rank assertion; update the checks in the
block that iterates ranks (the loop using return_dict and rank) so that non-zero
ranks explicitly fail on the "TIMEOUT" sentinel (for example by asserting result
== "PASSED" or asserting result not in ("TIMEOUT",) and "ERROR" not in
str(result)); apply the same change to the second occurrence of this loop (the
one around lines 272-277) so hung/killed workers are treated as failures rather
than silently passing.
- Around line 62-63: Both tests (test_forward_pass_tp8_real and
test_weight_loading_tp8_real) hardcode os.environ["MASTER_PORT"]="29500",
causing TIME_WAIT collisions; instead, in the parent test process choose a free
port (e.g., via socket.bind(('',0)) to get an ephemeral port), set
os.environ["MASTER_PORT"] to that value in the parent, and pass it into the
spawned worker processes so each run uses a unique port; update the test
functions to compute the free port before spawning, thread that port through the
worker spawn args (or ensure worker reads MASTER_PORT from the parent env) and
remove the hardcoded "29500" usage.
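For the two test-infrastructure items above (setting the spawn start method once per session, and picking a free MASTER_PORT per run), a minimal conftest.py sketch — the fixture and helper names are illustrative, and the helper would also back the _get_free_port() referenced in the earlier refactor suggestion:

💡 Example sketch
# conftest.py (sketch)
import socket

import pytest
import torch.multiprocessing as mp


@pytest.fixture(scope="session", autouse=True)
def _spawn_start_method():
    """Set the multiprocessing start method once for the whole test session."""
    mp.set_start_method("spawn", force=True)


def get_free_port() -> int:
    """Ask the OS for an ephemeral port to use as MASTER_PORT."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return int(s.getsockname()[1])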

In `@pyproject.toml`:
- Line 36: Document that flashinfer-python is Linux-only by updating the cuda
extras entry in pyproject.toml: annotate the cuda = ["tilelang",
"flashinfer-python", "pyzmq"] extras list with either a platform environment
marker (e.g., mark "flashinfer-python" for sys_platform == "linux") or add an
inline comment explaining the Linux-only constraint so CI and contributors see
the limitation; modify the extras definition that contains the "cuda" key and
the "flashinfer-python" identifier accordingly.
- Around line 3-4: The pyproject declares conflicting pins for apache-tvm-ffi:
the build-system lists "apache-tvm-ffi == 0.1.8" while the runtime dependencies
pin "apache-tvm-ffi == 0.1.8.post2"; make them consistent by choosing one
canonical pin and updating the other entry to match (e.g., change the
build-system entry to "apache-tvm-ffi == 0.1.8.post2" or change the runtime pin
to "== 0.1.8"), ensuring both the build-system and the runtime dependencies
reference the exact same apache-tvm-ffi version.
- Around line 22-23: Remove the test-only packages "pytest" and "pytest-html"
from the main dependencies list and add them to an appropriate development/test
optional group (for example add them under [tool.poetry.dev-dependencies] or
under [project.optional-dependencies]["test"] depending on the project format),
ensuring the names "pytest" and "pytest-html" appear only in that dev/test
section and not in the runtime dependencies block.

---

Duplicate comments:
In `@pymllm/configs/global_config.py`:
- Around line 1-321: Strip all trailing spaces from this file (every line ending
with whitespace) so it conforms to the repo style; inspect and fix occurrences
inside the GlobalConfig class (including the `@dataclass` decorator line and blank
lines), and elsewhere in functions like make_args, read_args, _parse_bool,
_unwrap_optional, and get_global_config; you can verify with grep -Pn ' +$'
pymllm/configs/global_config.py and then remove the trailing whitespace (or run
an editor/formatter that trims trailing whitespace) and commit the cleaned file.
- Around line 47-65: The class-level singleton via __new__/get_instance/reset
causes dataclass __init__ to reset fields on each direct GlobalConfig() call;
remove the __new__, get_instance, and reset methods from GlobalConfig and
instead create a single module-level instance (e.g., _global_config =
GlobalConfig() right after the class definition) and expose an accessor
get_global_config() that returns that instance; alternatively, if you must keep
instantiation, protect __post_init__ by checking an _initialized flag at the
start and set it using object.__setattr__(self, "_initialized", True) so
dataclass __init__ won’t overwrite already-initialized sub-configs (reference
symbols: GlobalConfig, __new__, __post_init__, get_instance, reset, and
_initialized).
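A stripped-down sketch of the module-level-instance pattern described above (the real GlobalConfig carries server/model/runtime/cache sub-configs; only the singleton wiring is shown):

💡 Example sketch
from dataclasses import dataclass


@dataclass
class GlobalConfig:
    """Process-wide configuration container (fields elided in this sketch)."""


# One shared instance created at import time; no __new__/get_instance needed.
_global_config = GlobalConfig()


def get_global_config() -> GlobalConfig:
    """Return the process-wide GlobalConfig instance."""
    return _global_config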

In `@pymllm/orchestrator/async_disk_io_process.py`:
- Around line 1-3: Add a module-level docstring at the top of
pymllm.orchestrator.async_disk_io_process describing the module's responsibility
(async disk I/O helper for the orchestrator) and add a class docstring for
AsyncDiskIoProcess that follows the same scaffold used in tokenizer_process.py:
one-line summary, short description of purpose, and brief notes about public
methods/usage. Ensure the wording mirrors TokenizerProcess's style and includes
any important behavior or thread/async considerations for AsyncDiskIoProcess.

In `@pymllm/orchestrator/detokenizer_process.py`:
- Around line 1-3: Add a module-level docstring at the top of
pymllm/orchestrator/detokenizer_process.py describing the module purpose, and
add a class docstring to the DetokenizerProcess class that follows the same
scaffold used in tokenizer_process.py (brief summary, responsibilities, any key
methods or usage notes); ensure the wording and style match the existing
tokenizer_process.py docstrings for consistency.

In `@pymllm/orchestrator/model_runner_process.py`:
- Around line 1-3: Add a module-level docstring at the top of
pymllm/orchestrator/model_runner_process.py describing the module's purpose and
responsibilities, and add a class docstring for ModelRunnerProcess that explains
what the class does, its primary responsibilities, and any important usage
notes; follow the same scaffold and tone used in tokenizer_process.py so the
module and the ModelRunnerProcess class docstrings are consistent with the
project's documentation style.

In `@pymllm/orchestrator/parallel_state.py`:
- Around line 98-110: The PP group construction currently uses contiguous ranges
which collides with TP group partitioning when tensor_model_parallel_size > 1;
update the ranks computation in the loop that builds _PP_GROUP so each pipeline
group picks ranks by striding across the world (e.g. ranks = [i + k *
num_pp_groups for k in range(pipeline_model_parallel_size)]) instead of
list(range(start, start + pipeline_model_parallel_size)), keeping the same loop
and using GroupCoordinator to set _PP_GROUP and _PP_RANK; ensure you still use
world_size, world_rank, local_rank and backend as before and break when the
matching group is found.
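A small sketch of the strided rank selection; variable names follow the comment above:

💡 Example sketch
# Each pipeline group takes one rank from every contiguous TP/DP block,
# so PP groups stride across the world instead of overlapping the TP groups.
num_pp_groups = world_size // pipeline_model_parallel_size
for i in range(num_pp_groups):
    ranks = [i + k * num_pp_groups for k in range(pipeline_model_parallel_size)]
    # ...build the group here and record _PP_GROUP / _PP_RANK when world_rank in ranks...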

In `@pymllm/tests/test_vocab_parallel_embedding.py`:
- Around line 164-187: The reconstruction check is self-referential because each
rank creates its own full_weight via torch.randn; to fix, set a deterministic
RNG seed before creating full_weight so all processes generate the same tensor:
call torch.manual_seed(42) immediately before the line full_weight =
torch.randn(vocab_size, embed_dim) (so load_weight, layer.weight,
expected_shard, and the rank‑0 reconstruction use the same source tensor across
ranks).

---

Nitpick comments:
In `@pymllm/configs/__init__.py`:
- Around line 8-14: The __all__ list in pymllm/configs/__init__.py is not
alphabetically ordered; update the __all__ variable so its items are sorted
lexicographically (e.g., arrange "GlobalConfig", "ModelConfig",
"QuantizationConfig", "ServerConfig", "get_global_config" or whatever
alphabetical order is correct for the exact names) to satisfy Ruff RUF022 and
ensure predictable exports; locate the __all__ definition and reorder the
entries accordingly.

In `@pymllm/configs/global_config.py`:
- Around line 281-295: Remove the redundant in-function import of ServerConfig
inside the read_args block and rely on the module-level import already present;
specifically delete the line "from pymllm.configs.server_config import
ServerConfig" within the code that builds server_updates (the block that
references cfg.server, server_updates, server_values, and reassigns cfg.server =
ServerConfig(**server_values)), and run linters/tests to ensure no unused-import
or name-resolution issues remain.

In `@pymllm/configs/model_config.py`:
- Around line 28-31: The raised AttributeError message in the code uses
multi-line concatenation which triggers Ruff TRY003; update the raise in the
relevant method (e.g., the __getattr__ in model_config.py that references
type(self).__name__, name and hf_config) to a single-line f-string such as:
raise AttributeError(f"'{type(self).__name__}' has no attribute '{name}' (also
not found on hf_config)") so the message remains clear but is a single string
literal.

In `@pymllm/configs/quantization_config.py`:
- Around line 9-18: The kv_cache_dtype field on QuantizationConfig is only
type-hinted and can accept invalid runtime strings; add a __post_init__ method
on the QuantizationConfig dataclass that validates self.kv_cache_dtype is one of
the allowed literals ("auto","float16","bfloat16","fp8_e4m3","fp8_e5m2") and
raise a ValueError with a clear message if not; place this validation inside the
QuantizationConfig class so it runs on instantiation and include the invalid
value in the error for easier debugging.
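A minimal sketch of that validation, using a simplified stand-in for the real dataclass:

from dataclasses import dataclass

_ALLOWED_KV_CACHE_DTYPES = ("auto", "float16", "bfloat16", "fp8_e4m3", "fp8_e5m2")

@dataclass
class QuantizationConfig:  # simplified stand-in; the real class has more fields
    kv_cache_dtype: str = "auto"

    def __post_init__(self) -> None:
        if self.kv_cache_dtype not in _ALLOWED_KV_CACHE_DTYPES:
            raise ValueError(
                f"Invalid kv_cache_dtype {self.kv_cache_dtype!r}; "
                f"expected one of {_ALLOWED_KV_CACHE_DTYPES}"
            )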

In `@pymllm/orchestrator/parallel_state.py`:
- Around line 50-56: Replace the runtime-only assert with explicit input
validation that raises a ValueError: check that tensor_model_parallel_size *
data_parallel_size * pipeline_model_parallel_size == world_size and if not raise
ValueError with the same descriptive message (use the existing f-string built
from tensor_model_parallel_size, data_parallel_size,
pipeline_model_parallel_size, and world_size) so the check remains active even
when Python is run with -O; update the code around the current assert in
parallel_state.py accordingly.
- Around line 30-35: Add comprehensive docstrings to the public API function
initialize_model_parallel and the three communication helper functions
tensor_model_parallel_all_reduce, tensor_model_parallel_all_gather, and
data_parallel_all_reduce: for each docstring describe the function’s purpose,
parameters (including tensor_model_parallel_size, data_parallel_size,
pipeline_model_parallel_size, backend, and the tensor argument for helpers),
return value, raised errors, and the passthrough semantics when world_size == 1
(i.e., no-op behavior / return input unchanged). Use the project’s docstring
style (short summary line, parameter list, returns, and notes about behavior),
and place them immediately above the corresponding def lines
(initialize_model_parallel, tensor_model_parallel_all_reduce,
tensor_model_parallel_all_gather, data_parallel_all_reduce).
- Around line 155-156: The current model_parallel_is_initialized() wrongly
infers initialization by checking _TP_GROUP/_DP_GROUP/_PP_GROUP and thus returns
False for the valid single-GPU path; add a dedicated boolean flag (e.g.,
_MODEL_PARALLEL_INITIALIZED) that initialize_model_parallel() sets to True
unconditionally when it completes (even if it creates no groups), have
model_parallel_is_initialized() return that flag instead of checking the group
variables, and ensure any teardown/reset code clears _MODEL_PARALLEL_INITIALIZED
so future checks reflect actual state; reference initialize_model_parallel,
model_parallel_is_initialized, _TP_GROUP, _DP_GROUP, and _PP_GROUP when applying
this change.
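A minimal sketch of the flag pattern (a hypothetical reduction; the real initialize_model_parallel keeps its existing group-construction logic around _TP_GROUP / _DP_GROUP / _PP_GROUP):

_MODEL_PARALLEL_INITIALIZED = False

def initialize_model_parallel(world_size: int = 1) -> None:
    global _MODEL_PARALLEL_INITIALIZED
    if world_size > 1:
        pass  # build _TP_GROUP / _DP_GROUP / _PP_GROUP here
    # Set unconditionally so the single-GPU, no-group path still reports initialized.
    _MODEL_PARALLEL_INITIALIZED = True

def model_parallel_is_initialized() -> bool:
    return _MODEL_PARALLEL_INITIALIZED

def destroy_model_parallel() -> None:
    global _MODEL_PARALLEL_INITIALIZED
    # ... tear down groups here ...
    _MODEL_PARALLEL_INITIALIZED = False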

In `@pymllm/orchestrator/request_response_process.py`:
- Around line 8-10: Add a class-level docstring to RequestResponseProcess
describing its purpose and behavior as a public API entry point; update the
class definition (class RequestResponseProcess and its __init__ method) to
include a concise docstring that explains what the class does, expected
inputs/state, any important attributes, return behavior of its public methods,
and possible errors/exceptions raised so callers and docs are clear.

In `@pymllm/orchestrator/tokenizer_process.py`:
- Around line 1-3: Add a module-level docstring at the top of
tokenizer_process.py describing the module's role in the orchestrator
(responsible for running tokenization in a separate process and its expected
lifecycle), and add a class docstring to TokenizerProcess that explains its
purpose, public API expectations, lifecycle (init, start/stop behavior if any),
parameters (none for __init__ currently), return values, and possible errors;
apply the same pattern to the other placeholder classes mentioned
(AsyncDiskIoProcess, DetokenizerProcess, ModelRunnerProcess) so each public
class and its module have clear, concise docstrings documenting usage and
responsibilities.

In `@pymllm/tests/test_vocab_parallel_embedding.py`:
- Line 87: Remove the unused world_size parameter from the worker function
signatures and their invocations: delete the world_size argument from
embedding_forward_tp8_worker_cuda and weight_loading_tp8_worker_cuda, and update
run_worker_tp8_cuda (and any other call sites) to call these functions with only
(rank, local_rank); ensure any function definitions, type hints, and references
to world_size are removed so no unused-argument lint errors remain.
- Around line 197-277: Both tests duplicate the same multiprocessing
boilerplate; extract a helper _run_tp8_test(test_worker_fn) that encapsulates
setting mp.set_start_method("spawn", force=True), creating
mp.Manager/return_dict, spawning mp.Process for ranks 0..7 with local_rank==rank
calling run_worker_tp8_cuda with the provided worker function
(embedding_forward_tp8_worker_cuda or weight_loading_tp8_worker_cuda),
joining/terminating processes with timeout, and performing the same assertions
on return_dict; then replace test_forward_pass_tp8_real to call
_run_tp8_test(embedding_forward_tp8_worker_cuda) and
test_weight_loading_tp8_real to call
_run_tp8_test(weight_loading_tp8_worker_cuda); see the sketch after this list.
- Line 48: Remove the unused local_world_size parameter from the function
signature (it's unused and triggers ARG001) or if it was intended for test
configuration, add a comment explaining its purpose and use it accordingly; also
move the import traceback out of the except block(s) and place it with the other
top-level stdlib imports so traceback is imported once at module import time
(look for occurrences of local_world_size in the test function signature and the
traceback import inside the except blocks around the same test).
- Around line 293-308: Add concise docstrings to the two test methods
test_cuda_forward and test_cuda_weight_loader in TestVocabParallelEmbeddingCUDA:
for test_cuda_forward describe that it verifies the VocabParallelEmbedding moves
to CUDA and returns a tensor of shape (batch, seq, embed_dim) on GPU; for
test_cuda_weight_loader describe that it checks load_weight correctly copies a
CPU tensor into the module weight on CUDA and that values match after moving
back to CPU; keep them short (one or two sentences) and follow the file's
existing docstring style.
- Line 98: Replace the hard-coded literals with module-level named constants:
introduce constants such as _TP_SIZE_TEST (for TP size 8), _MASTER_PORT_TEST
(for 29500), _VOCAB_SIZE (for 29500-related vocab), and _EMBED_DIM/_TOKEN_DIM
(for 1024/64) at the top of the test module and use those constants where the
literals appear (e.g., replace tp_size == 8 -> tp_size == _TP_SIZE_TEST,
vocab_size // 8 -> _VOCAB_SIZE // _TP_SIZE_TEST, and any uses of 1024/64 ->
_EMBED_DIM/_TOKEN_DIM, and replace the master port literal with
_MASTER_PORT_TEST); update all occurrences referenced in the review (the
assertions and calculations around tp_size, master port, vocab/embed dimensions)
so intent is clear and changes remain local to the test module.
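A minimal sketch of the _run_tp8_test helper suggested above (the run_worker_tp8_cuda call signature and the "PASSED"/"OK" result strings are assumptions drawn from the review text, not the actual test module):

import multiprocessing as mp

def _run_tp8_test(test_worker_fn, world_size: int = 8, timeout_sec: float = 120.0) -> None:
    # Assumed plumbing: run_worker_tp8_cuda(worker_fn, rank, local_rank, return_dict).
    manager = mp.Manager()
    return_dict = manager.dict()
    procs = []
    for rank in range(world_size):
        p = mp.Process(
            target=run_worker_tp8_cuda,
            args=(test_worker_fn, rank, rank, return_dict),  # local_rank == rank
        )
        p.start()
        procs.append(p)
    for p in procs:
        p.join(timeout=timeout_sec)
        if p.is_alive():
            p.terminate()
    for rank in range(world_size):
        result = return_dict.get(rank, "TIMEOUT")
        expected = "PASSED" if rank == 0 else "OK"
        assert result == expected, f"Rank {rank} failed: {result}"

def test_forward_pass_tp8_real():
    _run_tp8_test(embedding_forward_tp8_worker_cuda)

def test_weight_loading_tp8_real():
    _run_tp8_test(weight_loading_tp8_worker_cuda)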

Comment on lines +307 to +316
# Keep model path synchronized when only one side is explicitly overridden.
server_model_overridden = "server__model_path" in parsed
model_model_overridden = "model__model_path" in parsed
if server_model_overridden and not model_model_overridden:
cfg.model.model_path = cfg.server.model_path
elif model_model_overridden and not server_model_overridden:
cfg.server.model_path = cfg.model.model_path

cfg._initialized = True
return cfg

⚠️ Potential issue | 🟡 Minor

Direct mutation of cfg.server.model_path may leave tokenizer_path stale.

ServerConfig.__post_init__ sets tokenizer_path = model_path when tokenizer_path is None. After cfg.server is already reconstructed on line 295 (with tokenizer_path resolved to the original model_path), lines 311 and 313 directly assign cfg.server.model_path without triggering __post_init__. If a user provides only --model.model_path (not --server.*), line 313 silently diverges model_path and tokenizer_path.

The fix is to reconstruct ServerConfig for this path too, instead of a bare field assignment:

🐛 Proposed fix
-    elif model_model_overridden and not server_model_overridden:
-        cfg.server.model_path = cfg.model.model_path
+    elif model_model_overridden and not server_model_overridden:
+        # Rebuild so __post_init__ / _validate re-run with the new model_path.
+        server_values = {
+            dc_field.name: getattr(cfg.server, dc_field.name)
+            for dc_field in fields(cfg.server)
+        }
+        server_values["model_path"] = cfg.model.model_path
+        cfg.server = ServerConfig(**server_values)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/configs/global_config.py` around lines 307 - 316, The code mutates
cfg.server.model_path directly which bypasses ServerConfig.__post_init__ (so
tokenizer_path stays stale); instead of assigning cfg.server.model_path =
cfg.model.model_path, reinstantiate the ServerConfig (e.g., replace cfg.server
with ServerConfig(... ) or call its constructor/validator) using the updated
model_path so __post_init__ runs and tokenizer_path is set correctly; update the
branch that checks model_model_overridden and server_model_overridden to
reconstruct cfg.server from cfg.server fields with model_path =
cfg.model.model_path rather than doing a bare field assignment.

max_queued_requests: Optional[int] = None
max_total_tokens: Optional[int] = None
chunked_prefill_size: Optional[int] = None
max_prefill_tokens: int = None

⚠️ Potential issue | 🟡 Minor

max_prefill_tokens annotation should be Optional[int], not int.

The default is None and _validate already guards with if self.max_prefill_tokens is not None, making it clearly optional. The current annotation causes a type-checker error (int cannot be None).

🐛 Proposed fix
-    max_prefill_tokens: int = None
+    max_prefill_tokens: Optional[int] = None
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-max_prefill_tokens: int = None
+max_prefill_tokens: Optional[int] = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/configs/server_config.py` at line 43, Change the type annotation of
the field max_prefill_tokens in server_config.py from int to Optional[int] to
match its default None and the existing guard in _validate; import Optional from
typing (or use typing.Optional) if not already imported, and update any related
type hints/annotations that reference max_prefill_tokens to use Optional[int] so
static type checkers no longer report that int cannot be None.

Comment on lines +97 to +101
if self.tokenizer_path is None:
self.tokenizer_path = self.model_path
if self.served_model_name is None:
self.served_model_name = str(self.model_path)
self._validate()

⚠️ Potential issue | 🟡 Minor

str(model_path) when model_path=None silently sets served_model_name to the string "None".

When GlobalConfig.__post_init__ calls ServerConfig(model_path=None), this path runs and sets served_model_name = "None" (a non-empty, non-None string). Any consumer checking if cfg.server.served_model_name will see a truthy string, masking the "no model configured" state.

🐛 Proposed fix
         if self.served_model_name is None:
-            self.served_model_name = str(self.model_path)
+            if self.model_path is not None:
+                self.served_model_name = str(self.model_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/configs/server_config.py` around lines 97 - 101, The code currently
sets served_model_name = str(self.model_path) even when model_path is None,
producing the string "None" and masking an unset model; update the
initialization so served_model_name is only assigned from model_path when
model_path is not None (e.g. if self.served_model_name is None and
self.model_path is not None: self.served_model_name = str(self.model_path)),
leaving served_model_name as None otherwise; keep the existing tokenizer_path
handling and ensure ServerConfig._validate() is still called afterwards.

world_size = 8
local_world_size = 8 # Single node with 8 GPUs

mp.set_start_method("spawn", force=True)

⚠️ Potential issue | 🟡 Minor

mp.set_start_method must not be called inside individual test methods.

mp.set_start_method("spawn", force=True) modifies global multiprocessing state for the entire process. Calling it inside a pytest test method (even with force=True) is fragile: other tests sharing the same process may be affected, and the call order matters.

Move the call to a session-scoped fixture in conftest.py or guard it under if __name__ == "__main__".

Also applies to: 243-243

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/tests/test_vocab_parallel_embedding.py` at line 202,
mp.set_start_method("spawn", force=True) is being invoked inside individual test
functions; remove that call from the test (e.g., tests in
test_vocab_parallel_embedding.py) and instead set the start method in a
session-scoped pytest fixture in conftest.py (or wrap the call under a script
guard). Create a fixture (session scope) that calls mp.set_start_method("spawn",
force=True) once before tests run and ensure tests reference no direct call to
mp.set_start_method; alternatively protect the existing call with if __name__ ==
"__main__" so it never executes during pytest collection. Ensure you remove or
comment out any other in-test occurrences (the line at 243 as noted) so
multiprocessing start method is set only once at session setup.
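A session-scoped fixture along these lines would do it (a sketch for conftest.py, not the project's existing file):

# conftest.py
import multiprocessing as mp

import pytest

@pytest.fixture(scope="session", autouse=True)
def _mp_spawn_start_method():
    # Set the start method once for the whole test session instead of inside tests.
    mp.set_start_method("spawn", force=True)
    yield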

Comment on lines +231 to +236
for rank in range(world_size):
result = return_dict.get(rank, "TIMEOUT")
if rank == 0:
assert result == "PASSED", f"Rank {rank} failed: {result}"
else:
assert "ERROR" not in str(result), f"Rank {rank} error: {result}"

⚠️ Potential issue | 🟠 Major

Timed-out non-zero ranks silently pass the assertion.

When a worker exceeds the 120-second join timeout and is killed, return_dict[rank] is never set, so return_dict.get(rank, "TIMEOUT") returns "TIMEOUT". For non-zero ranks, the assertion is "ERROR" not in str(result), and "TIMEOUT" passes that check, so a hung or killed worker is silently ignored.

🔧 Proposed fix
         for rank in range(world_size):
             result = return_dict.get(rank, "TIMEOUT")
             if rank == 0:
                 assert result == "PASSED", f"Rank {rank} failed: {result}"
             else:
-                assert "ERROR" not in str(result), f"Rank {rank} error: {result}"
+                assert result == "OK", f"Rank {rank} failed: {result}"

Also applies to: 272-277

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 231 - 236, The
loop that checks per-rank results uses return_dict.get(rank, "TIMEOUT") which
allows a timed-out worker to produce "TIMEOUT" and slip past the non-zero-rank
assertion; update the checks in the block that iterates ranks (the loop using
return_dict and rank) so that non-zero ranks explicitly fail on the "TIMEOUT"
sentinel (for example by asserting result == "PASSED" or asserting result not in
("TIMEOUT",) and "ERROR" not in str(result)); apply the same change to the
second occurrence of this loop (the one around lines 272-277) so hung/killed
workers are treated as failures rather than silently passing.

Comment on lines +3 to 4
"scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8"
]

⚠️ Potential issue | 🟡 Minor

apache-tvm-ffi version mismatch between build-system and runtime.

The build-system requires apache-tvm-ffi == 0.1.8 (Line 3) while the runtime dependency pins == 0.1.8.post2 (Line 24). Although .post2 is a post-release fix of the same core, the two pins are inconsistent and may confuse dependency resolvers or CI environments that install both.

💡 Proposed fix
-  "scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8"
+  "scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8.post2"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-"scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8"
-]
+"scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8.post2"
+]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 3 - 4, The pyproject declares conflicting pins
for apache-tvm-ffi: the build-system lists "apache-tvm-ffi == 0.1.8" while the
runtime dependencies pin "apache-tvm-ffi == 0.1.8.post2"; make them consistent
by choosing one canonical pin and updating the other entry to match (e.g.,
change the build-system entry to "apache-tvm-ffi == 0.1.8.post2" or change the
runtime pin to "== 0.1.8"), ensuring both the build-system and the runtime
dependencies reference the exact same apache-tvm-ffi version.

Comment on lines 22 to 23
"pytest",
"pytest-html",

⚠️ Potential issue | 🟠 Major

Move pytest and pytest-html out of the main dependencies list.

Listing test-only packages as required runtime dependencies forces them onto every end user. They should be in an optional group (e.g., dev or test).

💡 Proposed fix
 dependencies=[
   "packaging",
-  "pytest",
-  "pytest-html",
   "apache-tvm-ffi == 0.1.8.post2",
   ...
 ]

 [project.optional-dependencies]
 cuda = ["tilelang", "flashinfer-python", "pyzmq"]
+dev = ["pytest", "pytest-html"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` around lines 22 - 23, Remove the test-only packages "pytest"
and "pytest-html" from the main dependencies list and add them to an appropriate
development/test optional group (for example add them under
[tool.poetry.dev-dependencies] or under [project.optional-dependencies]["test"]
depending on the project format), ensuring the names "pytest" and "pytest-html"
appear only in that dev/test section and not in the runtime dependencies block.


[project.optional-dependencies]
-cuda = ["tilelang"]
+cuda = ["tilelang", "flashinfer-python", "pyzmq"]

⚠️ Potential issue | 🟡 Minor

flashinfer-python is Linux-only — document the platform constraint.

The official FlashInfer documentation explicitly notes that "the package currently used by FlashInfer is named flashinfer-python, not flashinfer", confirming the package name is correct. However, FlashInfer's prerequisites list "OS: Linux only", meaning pip install pymllm[cuda] will not work on Windows or macOS even for developers.

Consider adding an environment marker or a comment in pyproject.toml to document this platform restriction so CI and contributors are aware.

As per coding guidelines, "Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pyproject.toml` at line 36, Document that flashinfer-python is Linux-only by
updating the cuda extras entry in pyproject.toml: annotate the cuda =
["tilelang", "flashinfer-python", "pyzmq"] extras list with either a platform
environment marker (e.g., mark "flashinfer-python" for sys_platform == "linux")
or add an inline comment explaining the Linux-only constraint so CI and
contributors see the limitation; modify the extras definition that contains the
"cuda" key and the "flashinfer-python" identifier accordingly.

- Added a new `store_cache` CUDA kernel for efficient key/value tensor storage in a cache.
- Introduced Python interface for the `store_cache` kernel, enabling its use in PyTorch.
- Created benchmarks to compare `store_cache` performance against standard PyTorch indexing.
- Updated `.gitignore` to exclude `.claude` directory and added `.pytest_cache` to `mllm-kernel`.
- Added tests for `store_cache` functionality to ensure correctness and performance.
- Refactored memory management in `KVPool` to utilize the new `store_cache` kernel when applicable.
- Enhanced readability by formatting socket creation parameters across multiple lines in the `init_sockets` method of `TokenizerProcess`.
- Maintained functionality while improving code clarity for future maintenance.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 20

🧹 Nitpick comments (21)
.claude/skills/update-codeowners/SKILL.md (1)

38-38: Nitpick: "exact same" is wordy.

Consider replacing "exact same" with "identical" or simply "the same" for tighter prose.

✏️ Proposed wording fix
-  - [ ] No accidental duplicate rule for the exact same path pattern.
+  - [ ] No accidental duplicate rule for the same path pattern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/update-codeowners/SKILL.md at line 38, Replace the wording
"exact same" in the checklist item "- [ ] No accidental duplicate rule for the
exact same path pattern." with a tighter phrase such as "identical" or "the
same" (e.g., "- [ ] No accidental duplicate rule for the same path pattern."),
ensuring the checklist line text is updated accordingly to improve concision.
mllm-kernel/mllm_kernel/cuda/jit/store_cache.py (4)

56-62: Redundant manual cache alongside the @jit decorator's internal cache_once.

_KERNEL_CACHE duplicates the memoization that cache_once inside the @jit decorator already provides for each kernel variant (since _make_store_cache_kernel creates a distinct decorated function per row_bytes). The manual dict is harmless but adds a second caching layer. A brief inline comment explaining why both are needed (e.g., "caches the wrapper function itself, not just the compiled module") would help future readers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm-kernel/mllm_kernel/cuda/jit/store_cache.py` around lines 56 - 62, The
current manual cache _KERNEL_CACHE duplicates the memoization already provided
by the `@jit` decorator's internal cache_once for each kernel variant; update the
code to keep the manual dict but add a concise inline comment above
_KERNEL_CACHE explaining its purpose: that _KERNEL_CACHE stores the Python-level
wrapper function returned by _make_store_cache_kernel (so callers reuse the
exact wrapper object) while cache_once/@jit handles the compiled module
memoization, and reference _get_kernel, _make_store_cache_kernel and
cache_once/@jit in that comment to clarify why both layers exist.

18-24: PDL capability check doesn't account for multi-GPU setups and can't distinguish sm_90 vs sm_90a.

Two concerns:

  1. torch.cuda.get_device_capability() defaults to the current device. With @cache_once (no device argument in the signature), the result for device 0 is cached and reused even when the kernel later runs on a different GPU with a different compute capability.

  2. The comment says "sm_90a (Hopper)", but get_device_capability() returns (9, 0) for both sm_90 and sm_90a — PyTorch cannot distinguish them. PDL specifically requires sm_90a. Practically this is low-risk (most Hopper GPUs are sm_90a), but it's worth a comment.

Suggested improvement for multi-GPU awareness
@cache_once
-def _is_arch_support_pdl() -> bool:
+def _is_arch_support_pdl(device_index: int = 0) -> bool:
     if not torch.cuda.is_available():
         return False
-    major, minor = torch.cuda.get_device_capability()
-    # PDL requires sm_90a (Hopper) or later
+    major, minor = torch.cuda.get_device_capability(device_index)
+    # PDL requires sm_90a (Hopper) or later.
+    # NOTE: get_device_capability cannot distinguish sm_90 from sm_90a;
+    # we assume sm_90a when (major, minor) >= (9, 0).
     return major > 9 or (major == 9 and minor >= 0)

Then pass the device index from the tensor in _make_store_cache_kernel / store_cache.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm-kernel/mllm_kernel/cuda/jit/store_cache.py` around lines 18 - 24, The
PDL capability check _is_arch_support_pdl currently reads
torch.cuda.get_device_capability() for the current device and caches that result
once, which breaks on multi-GPU systems and also cannot distinguish sm_90 vs
sm_90a; update _is_arch_support_pdl to accept a device index/torch.device
parameter (or make the cache key include device) and query
torch.cuda.get_device_capability(device) so results are per-GPU, then propagate
that device argument from call sites in _make_store_cache_kernel and store_cache
(pass the tensor.device.index or tensor.device) so the correct GPU capability is
checked at runtime; also add a short comment near the check noting that PyTorch
returns (9,0) for both sm_90 and sm_90a and thus this check is conservative for
PDL which requires sm_90a.

91-127: store_cache skips the row_bytes % 4 validation that can_use_store_cache enforces.

If a caller invokes store_cache directly without first calling can_use_store_cache, an invalid row_bytes (e.g., not a multiple of 4) will propagate to JIT compilation and produce an opaque error. A lightweight guard here would improve the developer experience.

Proposed guard
 def store_cache(
     k: torch.Tensor,
     v: torch.Tensor,
     k_cache: torch.Tensor,
     v_cache: torch.Tensor,
     indices: torch.Tensor,
     *,
     row_bytes: int = 0,
     num_split: int = 0,
 ) -> None:
     ...
     row_bytes = row_bytes or k.shape[-1] * k.element_size()
+    if row_bytes % 4 != 0:
+        raise ValueError(
+            f"row_bytes must be a multiple of 4, got {row_bytes}"
+        )
     kernel = _get_kernel(row_bytes)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm-kernel/mllm_kernel/cuda/jit/store_cache.py` around lines 91 - 127, The
store_cache function currently omits the row_bytes alignment check present in
can_use_store_cache; add a guard in store_cache (after computing row_bytes =
row_bytes or k.shape[-1] * k.element_size() and before calling _get_kernel or
JIT) that validates row_bytes % 4 == 0 and raises a clear ValueError (or
TypeError) with a descriptive message if the check fails so callers who bypass
can_use_store_cache get an immediate, useful error; reference the store_cache
function and can_use_store_cache behavior when implementing the guard and keep
the check before selecting num_split or invoking kernel(k, v, k_cache, v_cache,
indices, num_split).

65-88: @cache_once permanently caches False for transient JIT compilation failures.

If _get_kernel raises due to a transient issue (e.g., resource pressure during compilation), can_use_store_cache returns False and that result is cached forever for that row_bytes. A subsequent call will never retry. This is acceptable if JIT failures are always deterministic, but worth a docstring note so callers are aware.

Also, the static analysis hint about return True in a try block (TRY300) is a valid style point — moving it into else makes the intent clearer:

Minor style fix
     try:
         _get_kernel(row_bytes)
-        return True
     except Exception as e:
         logger.warning(
             "Failed to load JIT store_cache kernel with row_bytes=%d: %s",
             row_bytes,
             e,
         )
         return False
+    else:
+        return True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm-kernel/mllm_kernel/cuda/jit/store_cache.py` around lines 65 - 88,
can_use_store_cache currently uses `@cache_once` which will permanently cache
False when _get_kernel raises transient JIT compilation errors; update the
function to avoid caching transient failures (e.g., remove or modify `@cache_once`
for error paths, or catch exceptions and re-raise after logging so the cache
only stores successful results), add a docstring note on caching semantics for
callers, and move the successful-return into a try/except/else pattern so the
return True resides in the else block rather than inside the try — refer to
can_use_store_cache, the `@cache_once` decorator, and _get_kernel to locate and
change the behavior.
mllm-kernel/benchmarks/bench_store_cache.py (2)

36-89: Profiler context doesn't capture complete CUDA timing — synchronize is outside the recording window.

torch.cuda.synchronize() on line 57 runs after the with profile(...) as prof block has exited, so any in-flight CUDA work from the last iteration might not be attributed to the profiler trace. Modern PyTorch's profiler typically flushes on exit, but for correctness of the timing summary you could move the synchronize inside:

Suggested adjustment
     with profile(
         activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
         record_shapes=False,
         profile_memory=False,
         with_stack=False,
     ) as prof:
         for _ in range(iters):
             fn()
-    torch.cuda.synchronize()
+        torch.cuda.synchronize()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm-kernel/benchmarks/bench_store_cache.py` around lines 36 - 89, in
_profile_path, the final torch.cuda.synchronize() is executed after the
profiler context exits, so in-flight CUDA work from the last iteration may not
be captured; move the synchronization into the profiler block immediately after
the iters loop (i.e., call torch.cuda.synchronize() while still inside the with
profile(...) as prof: block right after running fn() iters times) and remove or
keep the external one accordingly so that the profiler records completed CUDA
activity for the last iteration.

62-71: time_attr and sort_key always hold the same value — the branching is redundant.

Both variables resolve to "self_cuda_time_total" or "self_device_time_total" based on the same condition. You can drop sort_key and reuse time_attr.

Simplification
     time_attr = (
         "self_cuda_time_total"
         if events and hasattr(events[0], "self_cuda_time_total")
         else "self_device_time_total"
     )
-    sort_key = (
-        "self_cuda_time_total"
-        if time_attr == "self_cuda_time_total"
-        else "self_device_time_total"
-    )
     total_self_device_us = sum(float(getattr(evt, time_attr, 0.0)) for evt in events)
     avg_self_device_us = total_self_device_us / max(iters, 1)
 
     print(f"\n=== {name} ===")
     print(
         prof.key_averages().table(
-            sort_by=sort_key,
+            sort_by=time_attr,
             row_limit=row_limit,
         )
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm-kernel/benchmarks/bench_store_cache.py` around lines 62 - 71, The code
defines both time_attr and sort_key using the same condition, making sort_key
redundant; update the benchmarking logic to remove sort_key and use time_attr
everywhere instead (replace any usage of sort_key with time_attr), keeping the
original conditional that checks events and hasattr(events[0],
"self_cuda_time_total") to set time_attr; ensure any sorting or attribute
lookups that previously referenced sort_key now reference time_attr so behavior
is unchanged.
mllm-kernel/mllm_kernel/cuda/csrc/store_cache.cuh (1)

138-149: User-supplied num_split that doesn't match alignment causes an unrecoverable Panic.

get_kernel falls through all if constexpr / if checks and hits Panic when num_split doesn't satisfy the alignment requirement for kElementBytes. For example, calling store_cache(..., num_split=4) with row_bytes=128 (not divisible by 512) triggers the panic.

The Python-side auto-selection is safe, but the num_split parameter is user-facing. Consider either:

  • Validating num_split in the Python store_cache function before dispatching, or
  • Replacing Panic with a descriptive RuntimeCheck that includes the alignment requirement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm-kernel/mllm_kernel/cuda/csrc/store_cache.cuh` around lines 138 - 149,
The get_kernel template can hit Panic when a user-provided num_split doesn't
satisfy kElementBytes alignment (e.g., num_split=4 with row_bytes=128); change
handling so this is recoverable: either add explicit validation of num_split in
the Python-facing store_cache dispatcher (check row_bytes % (num_split * 128) ==
0 and raise a clear Python error) or replace the Panic call inside get_kernel
with a descriptive runtime check (use RuntimeCheck or equivalent to emit an
error message naming get_kernel, the requested num_split, kElementBytes and the
required alignment) so callers get a clear, non-crashing diagnostic; update
references to get_kernel, store_cache, store_kernel and Panic accordingly.
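If the Python-side route is taken, the guard could look roughly like this (the 128-bytes-per-split figure is taken from the review's example, not from the header itself):

def _validate_num_split(row_bytes: int, num_split: int) -> None:
    # Hypothetical guard in the Python store_cache wrapper, run before kernel dispatch.
    if num_split > 0 and row_bytes % (num_split * 128) != 0:
        raise ValueError(
            f"num_split={num_split} requires row_bytes ({row_bytes}) to be a "
            f"multiple of {num_split * 128}"
        )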
pymllm/orchestrator/request_response_process.py (2)

91-92: Use TypeError for type-check failures, not ValueError.

The check isinstance(request.rid, str) is validating the type of rid, not its value. TypeError is the conventional exception for this.

Proposed fix
         if not isinstance(request.rid, str):
-            raise ValueError("RequestResponseProcess currently accepts single requests only.")
+            raise TypeError("RequestResponseProcess currently accepts single requests only.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/request_response_process.py` around lines 91 - 92, The
type-check for request.rid in RequestResponseProcess is raising ValueError but
should raise TypeError for type failures; update the conditional in
request_response_process.py (the block that checks "if not
isinstance(request.rid, str):") to raise TypeError with an appropriate message
(e.g., indicating that rid must be a str) instead of ValueError so the exception
semantics match the type validation.

68-82: create_zmq_socket from ipc_utils is typed for sync zmq.Context/zmq.Socket, but async variants are passed here.

This works at runtime because zmq.asyncio.Context subclasses zmq.Context, but the type annotations will cause type-checker warnings. Consider adding an overload or Union in ipc_utils.py, or use # type: ignore with a comment here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/request_response_process.py` around lines 68 - 82, The
type-checker warning is caused by passing a zmq.asyncio.Context into
create_zmq_socket (typed for sync zmq.Context) from start in
request_response_process.py; fix by updating ipc_utils.create_zmq_socket
signature to accept Union[zmq.Context, zmq.asyncio.Context] (and corresponding
Socket/AsyncSocket unions or add overloads) so both sync and async contexts are
typed, or if you prefer a localized change, add a precise "# type:
ignore[assignment]" (with a short comment referencing create_zmq_socket
expecting sync Context) on the two create_zmq_socket calls in start to suppress
the warning; reference the start method and create_zmq_socket when making the
change.
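A sketch of the widened annotation (the parameter list here is illustrative only; the real create_zmq_socket keeps its existing signature and body):

from typing import Union

import zmq
import zmq.asyncio

AnyContext = Union[zmq.Context, zmq.asyncio.Context]
AnySocket = Union[zmq.Socket, zmq.asyncio.Socket]

def create_zmq_socket(ctx: AnyContext, sock_type: int, endpoint: str) -> AnySocket:
    # Works for both variants; zmq.asyncio.Context subclasses zmq.Context at runtime.
    sock = ctx.socket(sock_type)
    sock.connect(endpoint)
    return sock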
pymllm/orchestrator/scheduler_process.py (1)

180-195: Unused batch parameter in process_batch_result.

The batch argument is not referenced in the method body (only result is used). Either remove it or add a brief comment about planned future use.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/scheduler_process.py` around lines 180 - 195, The
process_batch_result method currently accepts an unused parameter batch; either
remove batch from the signature of process_batch_result and update all callers
(e.g., places that call SchedulerProcess.process_batch_result) to stop passing
the batch, or keep the parameter and add a one-line comment inside
process_batch_result explaining that batch is reserved for future features
(e.g., KV-cache or completion checks) to avoid lint warnings; update any type
hints or tests accordingly so the codebase remains consistent with the chosen
approach.
pymllm/engine/io_struct.py (2)

85-89: __getitem__ returns self for single-item batches — callers may mutate the original.

When batch_size == 1, __getitem__(0) returns self rather than a copy. Any mutation by the caller (e.g., appending to output_token_ids in _forward_batch) will modify the original GenerateReqInput. This is a common aliasing trap. Consider returning a shallow copy if immutability is expected.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/engine/io_struct.py` around lines 85 - 89, The __getitem__ in
GenerateReqInput currently returns self when batch_size == 1, causing callers
(e.g., _forward_batch) that mutate fields like output_token_ids to alias and
modify the original; change __getitem__ so that when batch_size == 1 it returns
a shallow copy of the GenerateReqInput instance instead of self (either by
invoking an existing clone/copy helper or by constructing a new GenerateReqInput
with the same field values), ensuring mutable fields (like output_token_ids) are
new lists to avoid aliasing.
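One way to avoid the aliasing (a simplified stand-in for the real dataclass; field names follow the review):

import copy
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GenerateReqInput:  # simplified stand-in
    rid: Optional[str] = None
    output_token_ids: List[int] = field(default_factory=list)
    batch_size: int = 1

    def __getitem__(self, i: int) -> "GenerateReqInput":
        if self.batch_size == 1:
            item = copy.copy(self)  # shallow copy instead of returning self
            item.output_token_ids = list(self.output_token_ids)  # fresh mutable list
            return item
        raise NotImplementedError("multi-item indexing elided in this sketch")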

8-27: BaseReq and BaseBatchReq lack class-level docstrings.

These are public API dataclasses depended upon by other modules. A brief docstring explaining purpose and the rid/rids semantics would improve discoverability. As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/engine/io_struct.py` around lines 8 - 27, Add concise class-level
docstrings to the public dataclasses BaseReq and BaseBatchReq describing their
purpose and the semantics of the rid/rids fields (that rid is an optional single
or list of request identifiers and regenerate_rid returns/refreshes them, and
that rids is a list of request identifiers with regenerate_rids refreshing
them), and include brief notes on return types for regenerate_rid and
regenerate_rids; update the docstrings near the class definitions of BaseReq and
BaseBatchReq to improve discoverability and comply with the public API
documentation guideline.
pymllm/orchestrator/model_runner_process.py (1)

50-56: Consider adding error handling around recv_pyobj/send_pyobj in the event loop.

If a malformed message arrives (deserialization failure) or the socket encounters a transient error, the entire process will crash with an unhandled exception. A try/except inside the loop with logging would improve resilience — the outer try/finally in run_model_runner_process will still trigger shutdown, but the process will be lost.

♻️ Suggested resilience improvement
     def event_loop(self) -> None:
         """Infinite loop: recv batch -> forward -> sample -> send result."""
         logger.info("ModelRunnerProcess event loop started")
         while True:
-            batch = self._recv_from_scheduler.recv_pyobj()
-            result = self._forward_batch(batch)
-            self._send_to_scheduler.send_pyobj(result)
+            try:
+                batch = self._recv_from_scheduler.recv_pyobj()
+                result = self._forward_batch(batch)
+                self._send_to_scheduler.send_pyobj(result)
+            except zmq.ZMQError:
+                logger.exception("ZMQ error in ModelRunnerProcess event loop")
+                break
+            except Exception:
+                logger.exception("Unexpected error in ModelRunnerProcess event loop")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/model_runner_process.py` around lines 50 - 56, Wrap the
body of ModelRunnerProcess.event_loop in a try/except that catches exceptions
from self._recv_from_scheduler.recv_pyobj, self._forward_batch, and
self._send_to_scheduler.send_pyobj, log the error with context (including the
exception) and continue the loop instead of letting the process crash; on
deserialization/malformed message errors drop or ack the offending batch and
continue, and on persistent socket errors optionally backoff or break to let
run_model_runner_process invoke shutdown. Ensure logging references event_loop,
_recv_from_scheduler.recv_pyobj, _forward_batch and
_send_to_scheduler.send_pyobj so the failing component is clear.
pymllm/orchestrator/ipc_utils.py (1)

14-18: No cleanup for stale IPC socket files on abnormal termination.

If a process crashes without cleanly closing its ZMQ sockets, the socket files under _IPC_DIR will linger. On restart, ZMQ will rebind successfully, but the directory itself may accumulate stale files over time. Consider adding a cleanup_ipc_dir() helper (or an atexit hook) that removes known socket files during orderly shutdown or startup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/ipc_utils.py` around lines 14 - 18, Add a cleanup helper
to remove stale IPC socket files and wire it into startup/shutdown: implement a
new function cleanup_ipc_dir() that lists files in _IPC_DIR (e.g., with glob or
os.listdir), removes only known socket file patterns (or all files in that
directory) using os.remove with exception handling, call cleanup_ipc_dir() from
_ensure_ipc_dir() at startup to purge leftovers, and also register it with
atexit.register(cleanup_ipc_dir) so sockets created by this process are removed
on orderly shutdown; reference _IPC_DIR, _ensure_ipc_dir, and the new
cleanup_ipc_dir function when making changes.
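A sketch of such a helper (the _IPC_DIR value here is a placeholder; the real module defines its own path):

import atexit
import glob
import os

_IPC_DIR = "/tmp/pymllm_ipc"  # placeholder; use the module's real _IPC_DIR

def cleanup_ipc_dir() -> None:
    # Best-effort removal of leftover ZMQ socket files; ignore concurrent deletions.
    for path in glob.glob(os.path.join(_IPC_DIR, "*")):
        try:
            os.remove(path)
        except OSError:
            pass

atexit.register(cleanup_ipc_dir)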
pymllm/engine/launch.py (2)

30-36: Engine class and __init__ lack docstrings.

Per coding guidelines, public APIs and classes must have docstrings explaining purpose, parameters, and errors. The Engine class has no class docstring, and __init__ has no description of what it configures or what errors it may raise (ValueError if model/tokenizer paths are missing, ValueError for unsupported dtype).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/engine/launch.py` around lines 30 - 36, Add docstrings to the Engine
class and its __init__ method: for the class describe its responsibility
(managing subprocesses, request/response process, and model setup) and for
__init__ document parameters (none), what it configures (_subprocesses,
_rr_process, logging, default torch dtype, model/tokenizer checks), and
enumerate raised exceptions (ValueError when model/tokenizer paths are missing
or when an unsupported dtype is configured). Reference the Engine class and its
__init__ constructor and ensure the docs mention the side effects of calling
_config_logging, _set_default_torch_dtype, and _check_model_and_tokenizer so
callers know about potential ValueError raises.

270-291: Unused rid parameter in both static helpers.

Static analysis (ARG004) confirms rid is never referenced inside _wait_for_final_result (line 271) or _stream_results (line 281). Both methods operate entirely on the ReqState object.

♻️ Proposed refactor
@staticmethod
-async def _wait_for_final_result(rid: str, state: ReqState) -> Dict[str, Any]:
+async def _wait_for_final_result(state: ReqState) -> Dict[str, Any]:

@staticmethod
-async def _stream_results(rid: str, state: ReqState) -> AsyncIterator[Dict[str, Any]]:
+async def _stream_results(state: ReqState) -> AsyncIterator[Dict[str, Any]]:

Update call sites to drop rid:

-return await self._wait_for_final_result(rid, state)
+return await self._wait_for_final_result(state)

-async for chunk in self._stream_results(rid, state):
+async for chunk in self._stream_results(state):

-yield await self._wait_for_final_result(rid, state)
+yield await self._wait_for_final_result(state)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/engine/launch.py` around lines 270 - 291, Both static helpers
_wait_for_final_result and _stream_results declare an unused rid parameter;
remove rid from their signatures and any internal references, leaving only
(state: ReqState) as the parameter, update their return/type hints to remain the
same, and then update all call sites that invoke _wait_for_final_result(...) and
_stream_results(...) to pass only the ReqState instance; ensure
imports/annotations still reference ReqState and run tests/lint to confirm
ARG004 is resolved.
pymllm/mem_cache/radix_cache.py (2)

70-70: __slots__ tuples are unsorted (Ruff RUF023)

Both RadixKey.__slots__ (Line 70) and TreeNode.__slots__ (Lines 113-126) are flagged by Ruff RUF023. Applying natural sort aligns with Ruff's expectations and improves scan-ability.

Also applies to: 113-126

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mem_cache/radix_cache.py` at line 70, RadixKey.__slots__ and
TreeNode.__slots__ are unsorted; change each __slots__ tuple to use a
consistently ordered (natural/alphanumeric) sequence to satisfy Ruff RUF023 —
locate RadixKey.__slots__ ("token_ids", "extra_key") and the TreeNode.__slots__
declaration and reorder the strings alphabetically (or by natural sort) so
member names are sorted, then run the linter to confirm the warning is resolved.

129-129: defaultdict(TreeNode) creates nodes with side-effects on accidental key access

Using defaultdict(TreeNode) means any unguarded node.children[missing_key] silently creates a new TreeNode — invoking _next_node_id(), setting last_access_time, and initialising a fresh defaultdict for its own children. All current call-sites correctly guard with if ck not in node.children:, but this default factory is a latent footgun (and makes bugs harder to spot during future maintenance). A plain dict would immediately raise KeyError on unguarded access.

♻️ Proposed refactor
-        self.children: Dict[Any, TreeNode] = defaultdict(TreeNode)
+        self.children: Dict[Any, TreeNode] = {}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mem_cache/radix_cache.py` at line 129, Replace the latent-footgun
defaultdict usage so TreeNode.children is a plain dict instead of
defaultdict(TreeNode); update any places that previously relied on implicit
creation to explicitly construct and assign a new TreeNode (calling
TreeNode(...) so _next_node_id() and last_access_time are invoked only when
intended). Specifically change the attribute declaration in class TreeNode from
"self.children: Dict[Any, TreeNode] = defaultdict(TreeNode)" to a plain dict,
and audit call-sites that check/insert children (e.g., where code currently does
"if ck not in node.children: ..." or might index directly) to use explicit
node.children[ck] = TreeNode(...) before accessing the child.
pymllm/mem_cache/__init__.py (1)

20-37: __all__ is not sorted (Ruff RUF022)

Ruff flags the __all__ list as not in isort order. Sorting it keeps the public API surface easy to scan.

♻️ Proposed sort
 __all__ = [
     # memory_pool
-    "KVPool",
-    "TokenToKVPoolAllocator",
-    "ReqToTokenPool",
-    "make_full_attention_net_mem_pool",
-    "make_req_to_token_pool",
-    # radix_cache
-    "RadixCache",
-    "RadixKey",
-    "TreeNode",
-    "MatchResult",
-    "InsertResult",
-    "EvictResult",
-    "hash_token_ids",
-    "hash_to_int64",
-    "hash_bytes",
+    "EvictResult",
+    "InsertResult",
+    "KVPool",
+    "MatchResult",
+    "RadixCache",
+    "RadixKey",
+    "ReqToTokenPool",
+    "TokenToKVPoolAllocator",
+    "TreeNode",
+    "hash_bytes",
+    "hash_to_int64",
+    "hash_token_ids",
+    "make_full_attention_net_mem_pool",
+    "make_req_to_token_pool",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mem_cache/__init__.py` around lines 20 - 37, The __all__ list in
pymllm.mem_cache is not sorted which triggers Ruff RUF022; reorder the entries
in the __all__ list into case-sensitive alphabetical order (e.g., arrange
entries like EvictResult, InsertResult, KVPool, MatchResult, RadixCache,
RadixKey, ReqToTokenPool, TokenToKVPoolAllocator, TreeNode, and the hash_* and
factory names such as hash_bytes, hash_token_ids, hash_to_int64,
make_full_attention_net_mem_pool, make_req_to_token_pool) so the public API
names are sorted and Ruff stops flagging the file.
pymllm/mem_cache/memory_pool.py (1)

97-97: _row_bytes computation creates a throwaway tensor

torch.tensor([], dtype=dtype).element_size() allocates a temporary tensor purely to query element size. Prefer the direct dtype API:

♻️ Proposed refactor
-        self._row_bytes = k_row_dim * torch.tensor([], dtype=dtype).element_size()
+        self._row_bytes = k_row_dim * torch.empty(0, dtype=dtype).element_size()

Or even more directly (zero allocation): torch.finfo(dtype).bits // 8 for floating-point dtypes (guard with dtype.is_floating_point if integer dtypes need support).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/mem_cache/memory_pool.py` at line 97, The allocation of a temporary
tensor to compute element size in the _row_bytes assignment should be replaced
with a dtype-safe, zero-allocation computation; update the code that computes
self._row_bytes (currently using k_row_dim and dtype) to derive
bytes-per-element from the dtype directly (use torch.finfo(dtype).bits // 8 for
floating dtypes and torch.iinfo(dtype).bits // 8 for integer dtypes, or use a
try/except that falls back between finfo and iinfo) and multiply by k_row_dim so
no throwaway tensor is created.
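A zero-allocation variant covering both float and integer dtypes could look like this (a hypothetical helper, not the module's current code):

import torch

def _dtype_bytes(dtype: torch.dtype) -> int:
    # Query element size straight from the dtype; no temporary tensor needed.
    info = torch.finfo(dtype) if dtype.is_floating_point else torch.iinfo(dtype)
    return info.bits // 8

row_bytes = 128 * _dtype_bytes(torch.float16)  # e.g. k_row_dim=128 -> 256 bytes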
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@mllm-kernel/benchmarks/bench_store_cache.py`:
- Around line 125-132: Add an explicit validation that args.batch_size <=
args.num_slots before creating indices so we don't silently truncate; if the
check fails, raise a clear ValueError describing both args.batch_size and
args.num_slots. Locate the block where k, v, indices, k_cache, and v_cache are
created (variables k, v, indices, k_cache, v_cache) and insert the guard
immediately before computing indices = torch.randperm(...), ensuring the check
runs on the same device/CPU context as the rest of the setup.
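A sketch of the guard (args names follow the review; the surrounding benchmark setup is elided):

import argparse
import torch

def _make_indices(args: argparse.Namespace) -> torch.Tensor:
    # Fail loudly instead of silently truncating the batch to the slot count.
    if args.batch_size > args.num_slots:
        raise ValueError(
            f"batch_size ({args.batch_size}) must not exceed num_slots ({args.num_slots})"
        )
    return torch.randperm(args.num_slots)[: args.batch_size]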

In `@pymllm/engine/io_struct.py`:
- Around line 113-133: The to_request_dict method currently calls
payload.update(self.extra_options) which allows self.extra_options to silently
overwrite core fields (e.g., "rid", "text", "input_ids", "sampling_params",
etc.); modify to_request_dict to check for key collisions between
self.extra_options and the core keys collected in payload before applying them,
and either raise a ValueError (reject) or emit a clear warning via the module
logger (warn) when a collision is detected (choose one behavior consistently),
and only merge non-colliding keys into payload if proceeding; reference the
to_request_dict function, the payload dict, and self.extra_options when
implementing this guard.
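One possible shape for the warn-and-skip behaviour (names follow the review; the real to_request_dict builds payload differently):

import logging

logger = logging.getLogger(__name__)

def _merge_extra_options(payload: dict, extra_options: dict) -> dict:
    # Warn on colliding keys and keep the core payload values.
    collisions = payload.keys() & extra_options.keys()
    if collisions:
        logger.warning("extra_options would overwrite core fields: %s", sorted(collisions))
    payload.update({k: v for k, v in extra_options.items() if k not in collisions})
    return payload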

In `@pymllm/engine/launch.py`:
- Around line 143-151: The readiness loop blocks indefinitely because
reader.recv() has no timeout; update the loop that iterates procs_and_readers to
call reader.poll(timeout_sec) (e.g., a few seconds configurable) before calling
reader.recv(), raise a RuntimeError if poll returns False (timeout) and include
the process name in the message, keep the existing EOFError handling for recv,
and only call reader.recv() when poll indicates data is available; reference
procs_and_readers, reader.recv, reader.poll and the RuntimeError raised for
timeouts. See the sketch after this list.
- Line 44: mp.set_start_method("spawn", force=True) is a global process-wide
mutation and must not live inside the instance method _launch_processes; remove
that call from pymllm.engine.launch._launch_processes (or any Engine.launch
helper) and instead set the start method exactly once in the program entrypoint:
add mp.set_start_method("spawn") inside an if __name__ == "__main__" guard
before constructing/launching Engine (e.g., before Engine() / engine.launch())
so the start method is configured at startup and not mutated at runtime.
- Around line 206-212: The isinstance(rid, list) guard inside the inner async
_run() is dead and should be moved to the start of generate(): validate rid (and
similarly in generate_async) immediately when the function begins—before you
build the request or call add_request—by raising ValueError("Synchronous
`generate` currently supports single request.") for list rid inputs; then remove
the unreachable check from _run(). This keeps the validation consistent with
add_request and avoids redundant/unreachable code in _run() and the analogous
check in generate_async.
- Around line 31-36: The __init__ currently never initializes self._loop which
can cause AttributeError if _launch_processes sets self._rr_process before
creating the asyncio loop or if generate() is called before launch(); initialize
self._loop = None in __init__ (add to the existing initializations) and update
shutdown() (and any other places that access self._loop, e.g., generate()) to
guard usage with "if self._loop is not None" (or hasattr(self, "_loop") and
self._loop is not None) before calling methods on it so accesses are safe even
if loop creation failed or launch() wasn’t run.
- Around line 296-299: The shutdown block currently swallows all errors with
"except Exception: pass"; update the code around
self._loop.run_until_complete(self._rr_process.shutdown()) to handle failures
explicitly by (1) checking that self._loop and self._rr_process exist/are
initialized before calling, and (2) catching exceptions as "except Exception as
e" and logging or re-raising the error instead of ignoring it so shutdown
failures (including AttributeError when _loop is missing) are visible; reference
the shutdown call self._rr_process.shutdown and the run loop invocation
self._loop.run_until_complete when making the change.
- Around line 359-363: When calling AutoConfig.from_pretrained to set
cfg.model.hf_config, detect if the trust_remote_code flag is True and emit a
clear security warning before invoking the call (e.g., use the module logger to
logger.warning("...") or similar). Reference the trust_remote_code variable and
the AutoConfig.from_pretrained call around cfg.model.hf_config so the warning is
logged only when trust_remote_code is True and includes a brief note that remote
code will be executed and operators should verify the source.
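A sketch of the poll-then-recv readiness loop referenced in the first launch.py item above (the procs_and_readers pairing and the timeout value are assumptions):

READY_TIMEOUT_SEC = 30.0  # assumed value; make it configurable in practice

for name, reader in procs_and_readers:  # reader: multiprocessing.connection.Connection
    if not reader.poll(READY_TIMEOUT_SEC):
        raise RuntimeError(f"{name} did not report readiness within {READY_TIMEOUT_SEC}s")
    try:
        reader.recv()
    except EOFError as exc:
        raise RuntimeError(f"{name} exited before reporting readiness") from exc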

In `@pymllm/mem_cache/memory_pool.py`:
- Around line 108-125: The tensors in KVPool.__init__ (used by self.k_buffer and
self.v_buffer) are being created with pin_memory=True even when device is CUDA
which raises RuntimeError; update KVPool.__init__ to convert the device with
torch.device(device) and then override pin_memory = pin_memory and
self.device.type == "cpu" (i.e., only keep pin_memory True for CPU devices)
before calling torch.zeros, and apply the same guard/behavior in the factory
function make_full_attention_net_mem_pool (and update its docstring if it keeps
pin_memory=True by default) so torch.zeros is never called with pin_memory=True
for GPU devices.
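
A minimal sketch of that guard, assuming a simplified KVPool constructor (the
real constructor takes additional arguments):

```python
import torch


class KVPool:
    # Only the device/pin_memory handling discussed above is sketched here.
    def __init__(
        self,
        num_slots: int,
        row_dim: int,
        dtype: torch.dtype = torch.float16,
        device: str = "cuda",
        pin_memory: bool = False,
    ):
        self.device = torch.device(device)
        # Pinned memory is only valid for CPU allocations; never pass
        # pin_memory=True to torch.zeros for a GPU tensor.
        pin_memory = pin_memory and self.device.type == "cpu"
        self.k_buffer = torch.zeros(
            num_slots, row_dim, dtype=dtype, device=self.device, pin_memory=pin_memory
        )
        self.v_buffer = torch.zeros(
            num_slots, row_dim, dtype=dtype, device=self.device, pin_memory=pin_memory
        )
```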

In `@pymllm/mem_cache/radix_cache.py`:
- Around line 369-404: Phase 2 leaves node.value populated after calling
_free_swa_indices and _tombstone_node, causing Phase 1 to call _free_indices on
the same tensor later; fix by marking the node as swa-freed immediately after
SWA-freeing: after calling _free_swa_indices(node.value) (and before/after
_tombstone_node(node)) set a distinct sentinel on the node (e.g., node.value =
SWA_FREED_SENTINEL or node.swa_freed = True) so later code in evict's Phase 1
(the parent-check that calls _free_indices(p.value)) skips freeing when the
sentinel/flag indicates the indices were already freed; update any checks that
rely on value is None accordingly to treat the sentinel/flag as already freed.
- Around line 637-680: _insert_swa incorrectly discards the return value of
_split_node and then applies frees/assignments to the suffix child; capture the
node returned by self._split_node(node.key, node, plen) (e.g. prefix_node =
self._split_node(...)) and use that returned prefix_node for the following
SWA-specific operations (_free_indices on prefix_node.value slice, assign
prefix_node.value = value[:plen].clone(), and set prefix_node.swa_tombstone =
False) so the prefix (new leaf) is updated instead of the suffix; leave the
subsequent traversal logic (total_prefix, key/value slicing, and node
reassignment) unchanged except to continue using the correct node reference
returned by _split_node.
- Around line 97-103: The module-level counter _node_counter is not thread-safe;
update _next_node_id to acquire a module-level threading.Lock before reading and
mutating _node_counter so races cannot hand out duplicate IDs. Import threading,
create a lock (e.g., _node_counter_lock = threading.Lock()) next to
_node_counter, and wrap the increment-and-return inside a with
_node_counter_lock: block in _next_node_id so TreeNode() constructions get
unique, consistent node.id values (a minimal sketch follows this file's items).
- Around line 597-615: The _insert_normal loop currently calls
self._split_node(node.key, node, plen) but ignores its return value so `node`
still points at the suffix child; change this to assign the returned node (e.g.
node = self._split_node(node.key, node, plen)) and then break out of the loop
(mirroring _match_normal's new_node = self._split_node(...); node = new_node;
break) so subsequent self._add_leaf(node, key, value) attaches the new leaf to
the correct (prefix) parent.
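
For the node-ID counter item above, a minimal thread-safe sketch (the names
_node_counter and _next_node_id follow the review text):

```python
import threading

_node_counter = 0
_node_counter_lock = threading.Lock()


def _next_node_id() -> int:
    global _node_counter
    # Serialize the read-increment-return so concurrent TreeNode()
    # constructions never receive the same id.
    with _node_counter_lock:
        _node_counter += 1
        return _node_counter
```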

In `@pymllm/orchestrator/request_response_process.py`:
- Around line 103-110: The abort_request implementation currently always sends
an abort message via self._send_to_tokenizer even when the rid is unknown;
update abort_request (which uses self._rid_to_state and self._send_to_tokenizer)
to only send the downstream abort when a state was found (i.e., state is not
None and you performed the state.finished/state.out_list/state.event updates).
Move the await self._send_to_tokenizer.send_pyobj({"rid": rid, "abort": True})
inside that conditional so no spurious aborts are emitted for unknown/cleaned-up
rids.
- Around line 28-42: ReqState.event is never cleared so callers that await
state.event.wait() will spin once the event is set; update the _recv_loop to
clear the event before signalling each new result (call state.event.clear()
prior to state.event.set()) to ensure wait() blocks until the next item, and
also add a short note to the ReqState docstring about the event lifecycle so
callers understand it is managed internally; reference ReqState.event and the
_recv_loop function when making the change.

In `@pymllm/orchestrator/scheduler_process.py`:
- Around line 201-205: The stream_output method currently uses
self._finished.pop(0) which is O(n); change the underlying container _finished
from a list to a collections.deque (same type used for _waiting_queue) and
replace pop(0) with popleft() in the stream_output method so removals are O(1);
update any initialization of _finished to use deque([...]) and keep existing
uses of self._send_to_detokenizer.send_pyobj(item) unchanged.
- Around line 98-108: The event_loop currently busy-waits because
recv_requests() uses poll(timeout=0) and when get_next_batch_to_run() returns
None the loop immediately repeats; modify event_loop (or adjust recv_requests)
to avoid spinning by introducing a small blocking wait when idle: after calling
get_next_batch_to_run() and finding batch is None (and _waiting_queue is empty /
no pending work), either call recv_requests() with a non-zero poll timeout or
insert a short sleep (e.g., time.sleep(0.01)) before the next iteration so the
loop yields CPU; update references in event_loop, recv_requests, and any checks
against _waiting_queue/get_next_batch_to_run to implement this idle backoff.
- Around line 155-159: The batch_id should not use id(batch_requests); update
get_next_batch_to_run to generate a stable unique ID instead — either add a
monotonic counter attribute (e.g., self._batch_counter initialized in __init__
and incremented when creating a batch) or import uuid and use uuid.uuid4().hex
to populate "batch_id" on the batch dict; update the batch creation in
get_next_batch_to_run (where batch and batch_requests are assembled) to use the
new generator and ensure the counter or uuid import is added to the
class/module.

In `@pymllm/orchestrator/tokenizer_process.py`:
- Line 73: The assignment to the local variable "text" in tokenizer_process.py
(raw_request.get("text", "")) is unused and triggers a Ruff F841; either remove
the assignment or rename it to "_text" to indicate an intentionally unused
value. Update the assignment at the top of the tokenizer/process function (the
line with raw_request.get("text", "")) to use "_text" or delete that line, and
run the linter to confirm the F841 warning is resolved.
- Around line 30-32: Change the three attributes in tokenizer_process.py to use
Optional types: update the annotations for self._zmq_ctx, self._recv_from_rr,
and self._send_to_scheduler to Optional[zmq.Context] and Optional[zmq.Socket]
respectively, and add Optional to the imports on line 10 so the type hints
correctly allow None (matching the pattern used in model_runner_process.py and
detokenizer_process.py).

---

Nitpick comments:
In @.claude/skills/update-codeowners/SKILL.md:
- Line 38: Replace the wording "exact same" in the checklist item "- [ ] No
accidental duplicate rule for the exact same path pattern." with a tighter
phrase such as "identical" or "the same" (e.g., "- [ ] No accidental duplicate
rule for the same path pattern."), ensuring the checklist line text is updated
accordingly to improve concision.

In `@mllm-kernel/benchmarks/bench_store_cache.py`:
- Around line 36-89: In `_profile_path`, the final torch.cuda.synchronize() is
executed after the profiler context exits, so in-flight CUDA work from the last
iteration may not be captured; move the synchronization into the profiler block
immediately after the iters loop (i.e., call torch.cuda.synchronize() while
still inside the `with profile(...) as prof:` block, right after running fn()
iters times) and drop or keep the external call accordingly so the profiler
records completed CUDA activity for the last iteration (see the sketch after
this file's items).
- Around line 62-71: The code defines both time_attr and sort_key using the same
condition, making sort_key redundant; update the benchmarking logic to remove
sort_key and use time_attr everywhere instead (replace any usage of sort_key
with time_attr), keeping the original conditional that checks events and
hasattr(events[0], "self_cuda_time_total") to set time_attr; ensure any sorting
or attribute lookups that previously referenced sort_key now reference time_attr
so behavior is unchanged.
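
A sketch of the suggested placement of the synchronize call; the `_profile_path`
signature here is an assumption, only the ordering inside the profiler block is
the point:

```python
import torch
from torch.profiler import ProfilerActivity, profile


def _profile_path(fn, iters: int = 10):
    # Keep the synchronize inside the profiler context so the last iteration's
    # CUDA work is recorded; the real helper also inspects profiler events
    # afterwards.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            fn()
        torch.cuda.synchronize()
    return prof
```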

In `@mllm-kernel/mllm_kernel/cuda/csrc/store_cache.cuh`:
- Around line 138-149: The get_kernel template can hit Panic when a
user-provided num_split doesn't satisfy kElementBytes alignment (e.g.,
num_split=4 with row_bytes=128); change handling so this is recoverable: either
add explicit validation of num_split in the Python-facing store_cache dispatcher
(check row_bytes % (num_split * 128) == 0 and raise a clear Python error) or
replace the Panic call inside get_kernel with a descriptive runtime check (use
RuntimeCheck or equivalent to emit an error message naming get_kernel, the
requested num_split, kElementBytes and the required alignment) so callers get a
clear, non-crashing diagnostic; update references to get_kernel, store_cache,
store_kernel and Panic accordingly.
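
If the Python-facing dispatcher route is chosen, a validation sketch could look
like the following; the 128-byte-per-split alignment figure is taken from the
finding above and should be confirmed against the actual kernel template:

```python
def _validate_num_split(row_bytes: int, num_split: int) -> None:
    # Hypothetical pre-dispatch check so a misaligned split fails with a clear
    # Python error instead of a Panic inside get_kernel.
    if num_split <= 0 or row_bytes % (num_split * 128) != 0:
        raise ValueError(
            f"num_split={num_split} is incompatible with row_bytes={row_bytes}: "
            "row_bytes must be a positive multiple of num_split * 128"
        )
```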

In `@mllm-kernel/mllm_kernel/cuda/jit/store_cache.py`:
- Around line 56-62: The current manual cache _KERNEL_CACHE duplicates the
memoization already provided by the `@jit` decorator's internal cache_once for
each kernel variant; update the code to keep the manual dict but add a concise
inline comment above _KERNEL_CACHE explaining its purpose: that _KERNEL_CACHE
stores the Python-level wrapper function returned by _make_store_cache_kernel
(so callers reuse the exact wrapper object) while cache_once/@jit handles the
compiled module memoization, and reference _get_kernel, _make_store_cache_kernel
and cache_once/@jit in that comment to clarify why both layers exist.
- Around line 18-24: The PDL capability check _is_arch_support_pdl currently
reads torch.cuda.get_device_capability() for the current device and caches that
result once, which breaks on multi-GPU systems and also cannot distinguish sm_90
vs sm_90a; update _is_arch_support_pdl to accept a device index/torch.device
parameter (or make the cache key include device) and query
torch.cuda.get_device_capability(device) so results are per-GPU, then propagate
that device argument from call sites in _make_store_cache_kernel and store_cache
(pass the tensor.device.index or tensor.device) so the correct GPU capability is
checked at runtime; also add a short comment near the check noting that PyTorch
returns (9,0) for both sm_90 and sm_90a and thus this check is conservative for
PDL which requires sm_90a.
- Around line 91-127: The store_cache function currently omits the row_bytes
alignment check present in can_use_store_cache; add a guard in store_cache
(after computing row_bytes = row_bytes or k.shape[-1] * k.element_size() and
before calling _get_kernel or the JIT path) that validates row_bytes % 4 == 0
and raises a clear ValueError (or TypeError) with a descriptive message when the
check fails, so callers that bypass can_use_store_cache get an immediate, useful
error; keep the check before selecting num_split or invoking kernel(k, v,
k_cache, v_cache, indices, num_split). A guard sketch follows this file's items.
- Around line 65-88: can_use_store_cache currently uses `@cache_once` which will
permanently cache False when _get_kernel raises transient JIT compilation
errors; update the function to avoid caching transient failures (e.g., remove or
modify `@cache_once` for error paths, or catch exceptions and re-raise after
logging so the cache only stores successful results), add a docstring note on
caching semantics for callers, and move the successful-return into a
try/except/else pattern so the return True resides in the else block rather than
inside the try — refer to can_use_store_cache, the `@cache_once` decorator, and
_get_kernel to locate and change the behavior.
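
A guard sketch for the row_bytes item above; the helper is hypothetical and
would be called from store_cache before kernel selection:

```python
from typing import Optional

import torch


def _resolve_row_bytes(k: torch.Tensor, row_bytes: Optional[int] = None) -> int:
    # Hypothetical helper: resolve row_bytes the same way store_cache does,
    # then fail fast for callers that bypassed can_use_store_cache.
    row_bytes = row_bytes or k.shape[-1] * k.element_size()
    if row_bytes % 4 != 0:
        raise ValueError(
            f"store_cache requires row_bytes to be a multiple of 4, got {row_bytes}"
        )
    return row_bytes
```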

In `@pymllm/engine/io_struct.py`:
- Around line 85-89: The __getitem__ in GenerateReqInput currently returns self
when batch_size == 1, causing callers (e.g., _forward_batch) that mutate fields
like output_token_ids to alias and modify the original; change __getitem__ so
that when batch_size == 1 it returns a shallow copy of the GenerateReqInput
instance instead of self (either by invoking an existing clone/copy helper or by
constructing a new GenerateReqInput with the same field values), ensuring
mutable fields (like output_token_ids) are new lists to avoid aliasing.
- Around line 8-27: Add concise class-level docstrings to the public dataclasses
BaseReq and BaseBatchReq describing their purpose and the semantics of the
rid/rids fields (that rid is an optional single or list of request identifiers
and regenerate_rid returns/refreshes them, and that rids is a list of request
identifiers with regenerate_rids refreshing them), and include brief notes on
return types for regenerate_rid and regenerate_rids; update the docstrings near
the class definitions of BaseReq and BaseBatchReq to improve discoverability and
comply with the public API documentation guideline.

In `@pymllm/engine/launch.py`:
- Around line 30-36: Add docstrings to the Engine class and its __init__ method:
for the class describe its responsibility (managing subprocesses,
request/response process, and model setup) and for __init__ document parameters
(none), what it configures (_subprocesses, _rr_process, logging, default torch
dtype, model/tokenizer checks), and enumerate raised exceptions (ValueError when
model/tokenizer paths are missing or when an unsupported dtype is configured).
Reference the Engine class and its __init__ constructor and ensure the docs
mention the side effects of calling _config_logging, _set_default_torch_dtype,
and _check_model_and_tokenizer so callers know about potential ValueError
raises.
- Around line 270-291: Both static helpers _wait_for_final_result and
_stream_results declare an unused rid parameter; remove rid from their
signatures and any internal references, leaving only (state: ReqState) as the
parameter, update their return/type hints to remain the same, and then update
all call sites that invoke _wait_for_final_result(...) and _stream_results(...)
to pass only the ReqState instance; ensure imports/annotations still reference
ReqState and run tests/lint to confirm ARG004 is resolved.

In `@pymllm/mem_cache/__init__.py`:
- Around line 20-37: The __all__ list in pymllm.mem_cache is not sorted which
triggers Ruff RUF022; reorder the entries in the __all__ list into
case-sensitive alphabetical order (e.g., arrange entries like EvictResult,
InsertResult, KVPool, MatchResult, RadixCache, RadixKey, ReqToTokenPool,
TokenToKVPoolAllocator, TreeNode, and the hash_* and factory names such as
hash_bytes, hash_token_ids, hash_to_int64, make_full_attention_net_mem_pool,
make_req_to_token_pool) so the public API names are sorted and Ruff stops
flagging the file.

In `@pymllm/mem_cache/memory_pool.py`:
- Line 97: The allocation of a temporary tensor to compute element size in the
_row_bytes assignment should be replaced with a dtype-safe, zero-allocation
computation; update the code that computes self._row_bytes (currently using
k_row_dim and dtype) to derive bytes-per-element from the dtype directly (use
torch.finfo(dtype).bits // 8 for floating dtypes and torch.iinfo(dtype).bits //
8 for integer dtypes, or use a try/except that falls back between finfo and
iinfo) and multiply by k_row_dim so no throwaway tensor is created.
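
A zero-allocation sketch of the element-size computation; recent PyTorch
releases also expose dtype.itemsize, but the finfo/iinfo fallback below avoids
depending on that:

```python
import torch


def _dtype_bytes(dtype: torch.dtype) -> int:
    # Derive bytes-per-element directly from the dtype, with no throwaway tensor.
    try:
        return torch.finfo(dtype).bits // 8  # floating-point dtypes
    except TypeError:
        return torch.iinfo(dtype).bits // 8  # integer dtypes


# e.g. self._row_bytes = k_row_dim * _dtype_bytes(dtype)
```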

In `@pymllm/mem_cache/radix_cache.py`:
- Line 70: RadixKey.__slots__ and TreeNode.__slots__ are unsorted; change each
__slots__ tuple to use a consistently ordered (natural/alphanumeric) sequence to
satisfy Ruff RUF023 — locate RadixKey.__slots__ ("token_ids", "extra_key") and
the TreeNode.__slots__ declaration and reorder the strings alphabetically (or by
natural sort) so member names are sorted, then run the linter to confirm the
warning is resolved.
- Line 129: Replace the latent-footgun defaultdict usage so TreeNode.children is
a plain dict instead of defaultdict(TreeNode); update any places that previously
relied on implicit creation to explicitly construct and assign a new TreeNode
(calling TreeNode(...) so _next_node_id() and last_access_time are invoked only
when intended). Specifically change the attribute declaration in class TreeNode
from "self.children: Dict[Any, TreeNode] = defaultdict(TreeNode)" to a plain
dict, and audit call-sites that check/insert children (e.g., where code
currently does "if ck not in node.children: ..." or might index directly) to use
explicit node.children[ck] = TreeNode(...) before accessing the child.

In `@pymllm/orchestrator/ipc_utils.py`:
- Around line 14-18: Add a cleanup helper to remove stale IPC socket files and
wire it into startup/shutdown: implement a new function cleanup_ipc_dir() that
lists files in _IPC_DIR (e.g., with glob or os.listdir), removes only known
socket file patterns (or all files in that directory) using os.remove with
exception handling, call cleanup_ipc_dir() from _ensure_ipc_dir() at startup to
purge leftovers, and also register it with atexit.register(cleanup_ipc_dir) so
sockets created by this process are removed on orderly shutdown; reference
_IPC_DIR, _ensure_ipc_dir, and the new cleanup_ipc_dir function when making
changes.
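
A sketch of the helper; the directory constant and the `*.ipc` glob pattern are
placeholders and should mirror whatever names ipc_utils actually uses:

```python
import atexit
import glob
import logging
import os

logger = logging.getLogger(__name__)

_IPC_DIR = "/tmp/pymllm_ipc"  # placeholder for the module's real _IPC_DIR


def cleanup_ipc_dir() -> None:
    # Remove stale socket files; tolerate races with other processes.
    for path in glob.glob(os.path.join(_IPC_DIR, "*.ipc")):
        try:
            os.remove(path)
        except OSError:
            logger.debug("Could not remove stale IPC file %s", path, exc_info=True)


def _ensure_ipc_dir() -> None:
    os.makedirs(_IPC_DIR, exist_ok=True)
    cleanup_ipc_dir()  # purge leftovers from a previous run
    atexit.register(cleanup_ipc_dir)  # tidy up on orderly shutdown
```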

In `@pymllm/orchestrator/model_runner_process.py`:
- Around line 50-56: Wrap the body of ModelRunnerProcess.event_loop in a
try/except that catches exceptions from self._recv_from_scheduler.recv_pyobj,
self._forward_batch, and self._send_to_scheduler.send_pyobj, log the error with
context (including the exception) and continue the loop instead of letting the
process crash; on deserialization/malformed message errors drop or ack the
offending batch and continue, and on persistent socket errors optionally backoff
or break to let run_model_runner_process invoke shutdown. Ensure logging
references event_loop, _recv_from_scheduler.recv_pyobj, _forward_batch and
_send_to_scheduler.send_pyobj so the failing component is clear.
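
A sketch of the suggested error handling; the socket attributes and
_forward_batch are assumed to exist on the real ModelRunnerProcess:

```python
import logging

logger = logging.getLogger(__name__)


class ModelRunnerProcess:
    def event_loop(self) -> None:
        # Keep serving after per-batch failures instead of letting the process
        # die; persistent socket errors could add backoff or break here.
        while True:
            try:
                batch = self._recv_from_scheduler.recv_pyobj()
                result = self._forward_batch(batch)
                self._send_to_scheduler.send_pyobj(result)
            except Exception:
                logger.exception("event_loop failed while handling a batch")
                continue
```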

In `@pymllm/orchestrator/request_response_process.py`:
- Around line 91-92: The type-check for request.rid in RequestResponseProcess is
raising ValueError but should raise TypeError for type failures; update the
conditional in request_response_process.py (the block that checks "if not
isinstance(request.rid, str):") to raise TypeError with an appropriate message
(e.g., indicating that rid must be a str) instead of ValueError so the exception
semantics match the type validation.
- Around line 68-82: The type-checker warning is caused by passing a
zmq.asyncio.Context into create_zmq_socket (typed for sync zmq.Context) from
start in request_response_process.py; fix by updating
ipc_utils.create_zmq_socket signature to accept Union[zmq.Context,
zmq.asyncio.Context] (and corresponding Socket/AsyncSocket unions or add
overloads) so both sync and async contexts are typed, or if you prefer a
localized change, add a precise "# type: ignore[assignment]" (with a short
comment referencing create_zmq_socket expecting sync Context) on the two
create_zmq_socket calls in start to suppress the warning; reference the start
method and create_zmq_socket when making the change.

In `@pymllm/orchestrator/scheduler_process.py`:
- Around line 180-195: The process_batch_result method currently accepts an
unused parameter batch; either remove batch from the signature of
process_batch_result and update all callers (e.g., places that call
SchedulerProcess.process_batch_result) to stop passing the batch, or keep the
parameter and add a one-line comment inside process_batch_result explaining that
batch is reserved for future features (e.g., KV-cache or completion checks) to
avoid lint warnings; update any type hints or tests accordingly so the codebase
remains consistent with the chosen approach.

Comment on lines +125 to +132
k = torch.randn(args.batch_size, row_dim, device=device, dtype=dtype)
v = torch.randn(args.batch_size, row_dim, device=device, dtype=dtype)
# Use unique indices to avoid write conflicts.
indices = torch.randperm(args.num_slots, device=device)[: args.batch_size].to(
torch.int64
)
k_cache = torch.zeros(args.num_slots, row_dim, device=device, dtype=dtype)
v_cache = torch.zeros_like(k_cache)

⚠️ Potential issue | 🟡 Minor

Missing validation: batch_size > num_slots silently produces fewer indices than expected.

torch.randperm(args.num_slots)[:args.batch_size] silently truncates to num_slots elements when batch_size > num_slots, causing a shape mismatch between k/v (shape [batch_size, ...]) and indices (shape [num_slots]). This would surface as a confusing runtime error.

Proposed guard
+    if args.batch_size > args.num_slots:
+        raise ValueError(
+            f"batch_size ({args.batch_size}) must not exceed num_slots ({args.num_slots})"
+        )
+
     k = torch.randn(args.batch_size, row_dim, device=device, dtype=dtype)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm-kernel/benchmarks/bench_store_cache.py` around lines 125 - 132, Add an
explicit validation that args.batch_size <= args.num_slots before creating
indices so we don't silently truncate; if the check fails, raise a clear
ValueError describing both args.batch_size and args.num_slots. Locate the block
where k, v, indices, k_cache, and v_cache are created (variables k, v, indices,
k_cache, v_cache) and insert the guard immediately before computing indices =
torch.randperm(...), ensuring the check runs on the same device/CPU context as
the rest of the setup.

Comment on lines +113 to +133
def to_request_dict(self) -> Dict[str, Any]:
payload: Dict[str, Any] = {}
for key, value in {
"rid": self.rid,
"text": self.text,
"input_ids": self.input_ids,
"sampling_params": self.sampling_params,
"return_logprob": self.return_logprob,
"logprob_start_len": self.logprob_start_len,
"top_logprobs_num": self.top_logprobs_num,
"stream": self.stream,
"image_data": self.image_data,
"video_data": self.video_data,
"audio_data": self.audio_data,
"lora_path": self.lora_path,
"session_params": self.session_params,
}.items():
if value is not None:
payload[key] = value
payload.update(self.extra_options)
return payload

⚠️ Potential issue | 🟡 Minor

extra_options can silently overwrite core request fields.

payload.update(self.extra_options) on Line 132 runs after populating core keys like rid, text, and input_ids. If extra_options contains any of these keys, the core values are silently replaced. This could cause subtle data corruption downstream.

🛡️ Proposed fix: warn or reject collisions
+        _RESERVED = frozenset(payload.keys())
+        collisions = _RESERVED & self.extra_options.keys()
+        if collisions:
+            raise ValueError(
+                f"extra_options keys {collisions} collide with reserved fields"
+            )
         payload.update(self.extra_options)
         return payload
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/engine/io_struct.py` around lines 113 - 133, The to_request_dict
method currently calls payload.update(self.extra_options) which allows
self.extra_options to silently overwrite core fields (e.g., "rid", "text",
"input_ids", "sampling_params", etc.); modify to_request_dict to check for key
collisions between self.extra_options and the core keys collected in payload
before applying them, and either raise a ValueError (reject) or emit a clear
warning via the module logger (warn) when a collision is detected (choose one
behavior consistently), and only merge non-colliding keys into payload if
proceeding; reference the to_request_dict function, the payload dict, and
self.extra_options when implementing this guard.

Comment on lines +31 to +36
def __init__(self):
self._subprocesses: List[mp.Process] = []
self._rr_process: Optional[RequestResponseProcess] = None
self._config_logging()
self._set_default_torch_dtype()
self._check_model_and_tokenizer()

⚠️ Potential issue | 🟠 Major

self._loop is never initialized in __init__, risking AttributeError.

_launch_processes sets self._rr_process at line 154 before self._loop at line 162. If an exception fires between those two lines (e.g., asyncio.new_event_loop() failure), atexit will invoke shutdown(), which passes the if self._rr_process is not None guard but then crashes on the undeclared self._loop — silently swallowed by the bare except. The same AttributeError would also surface if generate() is called without a prior launch() call.

Initialize _loop to None in __init__ and guard the access in shutdown():

🐛 Proposed fix
 def __init__(self):
     self._subprocesses: List[mp.Process] = []
     self._rr_process: Optional[RequestResponseProcess] = None
+    self._loop: Optional[asyncio.AbstractEventLoop] = None
     self._config_logging()
     self._set_default_torch_dtype()
     self._check_model_and_tokenizer()
 def shutdown(self) -> None:
     if self._rr_process is not None:
         try:
-            self._loop.run_until_complete(self._rr_process.shutdown())
+            if self._loop is not None:
+                self._loop.run_until_complete(self._rr_process.shutdown())
         except Exception:
-            pass
+            logger.exception("Error during RR process shutdown")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/engine/launch.py` around lines 31 - 36, The __init__ currently never
initializes self._loop which can cause AttributeError if _launch_processes sets
self._rr_process before creating the asyncio loop or if generate() is called
before launch(); initialize self._loop = None in __init__ (add to the existing
initializations) and update shutdown() (and any other places that access
self._loop, e.g., generate()) to guard usage with "if self._loop is not None"
(or hasattr(self, "_loop") and self._loop is not None) before calling methods on
it so accesses are safe even if loop creation failed or launch() wasn’t run.


def _launch_processes(self) -> None:
"""Spawn all subprocess workers and wire up ZMQ IPC channels."""
mp.set_start_method("spawn", force=True)

⚠️ Potential issue | 🟠 Major

mp.set_start_method("spawn", force=True) is a process-wide global mutation — move it to the entry point.

Calling this inside an instance method silently overrides any start method set by the host application or other libraries on Linux (where fork is the default). If this method were ever called more than once, the forced override would be harmless but would mask the misconfiguration. Per the Python docs, set_start_method should be called exactly once, inside an if __name__ == "__main__" guard in the entry-point script.

💡 Suggested approach

Remove the call from _launch_processes and place it in the server entry-point script:

-        mp.set_start_method("spawn", force=True)

In pymllm/server/launch.py or equivalent entry script:

if __name__ == "__main__":
    mp.set_start_method("spawn")
    engine = Engine()
    engine.launch()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/engine/launch.py` at line 44, mp.set_start_method("spawn", force=True)
is a global process-wide mutation and must not live inside the instance method
_launch_processes; remove that call from pymllm.engine.launch._launch_processes
(or any Engine.launch helper) and instead set the start method exactly once in
the program entrypoint: add mp.set_start_method("spawn") inside an if __name__
== "__main__" guard before constructing/launching Engine (e.g., before Engine()
/ engine.launch()) so the start method is configured at startup and not mutated
at runtime.

Comment on lines +143 to +151
# Wait for readiness signals
for _, reader, name in procs_and_readers:
try:
msg = reader.recv()
except EOFError:
raise RuntimeError(f"{name} process died before signalling readiness")
if msg.get("status") != "ready":
raise RuntimeError(f"{name} process failed to initialise: {msg}")
logger.info("%s process ready", name)

⚠️ Potential issue | 🟠 Major

reader.recv() has no timeout — a hanging subprocess blocks the main process forever.

If any subprocess hangs during initialization (e.g., waiting on GPU memory or a stuck IPC bind), reader.recv() will block indefinitely. Use poll() with a timeout before recv():

🐛 Proposed fix
+        _READINESS_TIMEOUT_S = 120  # seconds
         for _, reader, name in procs_and_readers:
             try:
-                msg = reader.recv()
+                if not reader.poll(_READINESS_TIMEOUT_S):
+                    raise RuntimeError(
+                        f"{name} process did not signal readiness within "
+                        f"{_READINESS_TIMEOUT_S}s"
+                    )
+                msg = reader.recv()
             except EOFError as err:
-                raise RuntimeError(f"{name} process died before signalling readiness")
+                raise RuntimeError(
+                    f"{name} process died before signalling readiness"
+                ) from err
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 148-148: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


[warning] 148-148: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 150-150: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/engine/launch.py` around lines 143 - 151, The readiness loop blocks
indefinitely because reader.recv() has no timeout; update the loop that iterates
procs_and_readers to call reader.poll(timeout_sec) (e.g., a few seconds
configurable) before calling reader.recv(), raise a RuntimeError if poll returns
False (timeout) and include the process name in the message, keep the existing
EOFError handling for recv, and only call reader.recv() when poll indicates data
is available; reference procs_and_readers, reader.recv, reader.poll and the
RuntimeError raised for timeouts.

Comment on lines +98 to +108
def event_loop(self) -> None:
"""Infinite scheduling loop."""
logger.info("SchedulerProcess event loop started")
while True:
self.recv_requests()
self.process_input_requests()
batch = self.get_next_batch_to_run()
if batch is not None:
result = self.run_batch(batch)
self.process_batch_result(batch, result)
self.stream_output()

⚠️ Potential issue | 🟠 Major

Event loop will spin-wait at 100% CPU when idle.

recv_requests() uses poll(timeout=0) which returns immediately when no messages are available. When the _waiting_queue is empty, the loop cycles through all steps with no blocking or sleeping, consuming a full CPU core. Add a small sleep or use a non-zero poll timeout when there's no pending work.

Proposed fix: add a short sleep when idle
     def event_loop(self) -> None:
         """Infinite scheduling loop."""
         logger.info("SchedulerProcess event loop started")
         while True:
             self.recv_requests()
             self.process_input_requests()
             batch = self.get_next_batch_to_run()
             if batch is not None:
                 result = self.run_batch(batch)
                 self.process_batch_result(batch, result)
+            else:
+                time.sleep(0.001)  # avoid busy-spin when idle
             self.stream_output()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/scheduler_process.py` around lines 98 - 108, The
event_loop currently busy-waits because recv_requests() uses poll(timeout=0) and
when get_next_batch_to_run() returns None the loop immediately repeats; modify
event_loop (or adjust recv_requests) to avoid spinning by introducing a small
blocking wait when idle: after calling get_next_batch_to_run() and finding batch
is None (and _waiting_queue is empty / no pending work), either call
recv_requests() with a non-zero poll timeout or insert a short sleep (e.g.,
time.sleep(0.01)) before the next iteration so the loop yields CPU; update
references in event_loop, recv_requests, and any checks against
_waiting_queue/get_next_batch_to_run to implement this idle backoff.

Comment on lines +155 to +159
batch = {
"requests": batch_requests,
"batch_id": id(batch_requests),
"created_at": time.time(),
}

⚠️ Potential issue | 🟡 Minor

id(batch_requests) is a poor choice for batch_id.

id() returns the CPython memory address of the list, which can be reused after garbage collection. This makes batch IDs non-unique over time, complicating tracing and debugging. Use a monotonic counter or uuid.uuid4().hex instead.

Proposed fix using a counter

Add to __init__:

+        self._next_batch_id: int = 0

Then in get_next_batch_to_run:

         batch = {
             "requests": batch_requests,
-            "batch_id": id(batch_requests),
+            "batch_id": self._next_batch_id,
             "created_at": time.time(),
         }
+        self._next_batch_id += 1
         return batch
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/scheduler_process.py` around lines 155 - 159, The
batch_id should not use id(batch_requests); update get_next_batch_to_run to
generate a stable unique ID instead — either add a monotonic counter attribute
(e.g., self._batch_counter initialized in __init__ and incremented when creating
a batch) or import uuid and use uuid.uuid4().hex to populate "batch_id" on the
batch dict; update the batch creation in get_next_batch_to_run (where batch and
batch_requests are assembled) to use the new generator and ensure the counter or
uuid import is added to the class/module.

Comment on lines +201 to +205
def stream_output(self) -> None:
"""Send finished token-ID outputs to the DetokenizerProcess."""
while self._finished:
item = self._finished.pop(0)
self._send_to_detokenizer.send_pyobj(item)

⚠️ Potential issue | 🟡 Minor

self._finished.pop(0) is O(n) — use deque for _finished.

_finished is a list, and pop(0) shifts all remaining elements on every call. You already import deque and use it for _waiting_queue. Apply the same here.

Proposed fix

Change the declaration on Line 60:

-        self._finished: List[Dict[str, Any]] = []
+        self._finished: Deque[Dict[str, Any]] = deque()

Then in stream_output:

     def stream_output(self) -> None:
         """Send finished token-ID outputs to the DetokenizerProcess."""
         while self._finished:
-            item = self._finished.pop(0)
+            item = self._finished.popleft()
             self._send_to_detokenizer.send_pyobj(item)

As per coding guidelines, "Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/scheduler_process.py` around lines 201 - 205, The
stream_output method currently uses self._finished.pop(0) which is O(n); change
the underlying container _finished from a list to a collections.deque (same type
used for _waiting_queue) and replace pop(0) with popleft() in the stream_output
method so removals are O(1); update any initialization of _finished to use
deque([...]) and keep existing uses of
self._send_to_detokenizer.send_pyobj(item) unchanged.

Comment on lines +30 to +32
self._zmq_ctx: zmq.Context = None
self._recv_from_rr: zmq.Socket = None
self._send_to_scheduler: zmq.Socket = None

⚠️ Potential issue | 🟡 Minor

Type annotations are incorrect — None is not a valid zmq.Context or zmq.Socket.

These attributes are initialized to None but annotated with non-Optional types, which is both misleading and will cause type-checker errors. The other process files in this PR (e.g., model_runner_process.py Line 30-32, detokenizer_process.py Line 31-33) correctly use Optional[...].

Proposed fix
-        self._zmq_ctx: zmq.Context = None
-        self._recv_from_rr: zmq.Socket = None
-        self._send_to_scheduler: zmq.Socket = None
+        self._zmq_ctx: Optional[zmq.Context] = None
+        self._recv_from_rr: Optional[zmq.Socket] = None
+        self._send_to_scheduler: Optional[zmq.Socket] = None

You'll also need to add Optional to the imports on Line 10:

-from typing import Any, Dict, List
+from typing import Any, Dict, List, Optional
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/tokenizer_process.py` around lines 30 - 32, Change the
three attributes in tokenizer_process.py to use Optional types: update the
annotations for self._zmq_ctx, self._recv_from_rr, and self._send_to_scheduler
to Optional[zmq.Context] and Optional[zmq.Socket] respectively, and add Optional
to the imports on line 10 so the type hints correctly allow None (matching the
pattern used in model_runner_process.py and detokenizer_process.py).


TODO: replace with real tokenizer call.
"""
text = raw_request.get("text", "")

⚠️ Potential issue | 🟡 Minor

Unused variable text — flagged by Ruff (F841).

The text variable is assigned but never read in the placeholder. Prefix it with _ to signal intent or remove the assignment.

Proposed fix
-        text = raw_request.get("text", "")
+        _text = raw_request.get("text", "")
🧰 Tools
🪛 Ruff (0.15.1)

[error] 73-73: Local variable text is assigned to but never used

Remove assignment to unused variable text

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/tokenizer_process.py` at line 73, The assignment to the
local variable "text" in tokenizer_process.py (raw_request.get("text", "")) is
unused and triggers a Ruff F841; either remove the assignment or rename it to
"_text" to indicate an intentionally unused value. Update the assignment at the
top of the tokenizer/process function (the line with raw_request.get("text",
"")) to use "_text" or delete that line, and run the linter to confirm the F841
warning is resolved.
