feat(pymllm): VocabParallelEmbedding & pymllm's cuda infra init #640
chenghuaWang wants to merge 10 commits into UbiquitousLearning:main
Conversation
…mple

- Replaced the previous JIT utility functions with a streamlined `jit` decorator for kernel registration.
- Updated the README.md to reflect the new recommended pattern for CPU kernel implementation.
- Simplified the example for using the JIT compilation with a focus on clarity and ease of use.

- Updated `apache-tvm-ffi` version to `0.1.8` in `pyproject.toml` and `mllm-kernel/pyproject.toml`.
- Refactored mobile module imports and structure, moving scripts to `pymllm.mobile` and removing unused backends.
- Introduced new classes and methods for quantization and model deployment in the Qualcomm backend.
- Added new README files for mobile and Qualcomm transformer components.

- Added `flashinfer-python` to the optional `cuda` dependencies in `pyproject.toml`.
- Introduced new configuration files for server, model, and layers to centralize runtime settings.
- Created initial structure for various layers and components to support future development.

- Added main entry points for `pymllm` and `mllm-kernel` in their respective `pyproject.toml` files.
- Implemented a configuration module for `pymllm` to manage global settings, including server, model, runtime, and cache configurations.
- Introduced the `VocabParallelEmbedding` layer and utility functions for weight management in the layers module.
- Created initial tests for the `VocabParallelEmbedding` layer to validate functionality with tensor parallelism.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Reworks packaging and CLI, adds a mobile subpackage, expands layer primitives (TP-aware layers, RoPE, norms, MLP), introduces typed GlobalConfig/ServerConfig with CLI overrides, implements an orchestrator with multi-process components and parallel state, adds KV/mem-cache and a store_cache CUDA kernel, and updates tests and FFI bindings.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant RR as RequestResponseProcess
    participant Tokenizer
    participant Scheduler
    participant ModelRunner
    participant Detokenizer
    participant AsyncIO as AsyncDiskIoProcess
    Client->>RR: submit GenerateReqInput
    RR->>Tokenizer: forward tokenization request (ZMQ)
    Tokenizer->>Scheduler: send tokenized batch
    Scheduler->>ModelRunner: dispatch batch for forward
    ModelRunner->>AsyncIO: (optional) async weight IO / checkpoints
    ModelRunner-->>Scheduler: return token ids / partial results
    Scheduler->>Detokenizer: send finished tokens
    Detokenizer-->>RR: send decoded text chunks
    RR-->>Client: stream final/partial outputs
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (9)
pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py (1)
31-35: ⚠️ Potential issue | 🟡 Minor

`--output_dir` is not marked `required=True` — crashes with an opaque `TypeError` when omitted.

`args.output_dir` defaults to `None`. Line 50 then calls `os.makedirs(None, exist_ok=True)`, raising `TypeError: expected str, bytes or os.PathLike object, not NoneType` instead of a clear argparse error.

🐛 Proposed fix

```diff
 parser.add_argument(
     "--output_dir",
     type=str,
+    required=True,
     help="Directory to save the quantized model",
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/train.py` around lines 31 - 35, The --output_dir argument is optional but later used in os.makedirs causing a TypeError when omitted; update the parser.add_argument for "--output_dir" (the one in train.py) to include required=True (or alternatively provide a sensible default) so args.output_dir is never None, and ensure any subsequent call (os.makedirs(args.output_dir, exist_ok=True) in this file) receives a valid path; keep the change localized to the parser.add_argument for "--output_dir" and/or add an explicit argparse error if args.output_dir is falsy before calling os.makedirs.

pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py (1)
136-160: ⚠️ Potential issue | 🟠 Major

Fix `apply_rotary_pos_emb` function calls to `rotate_half`.

`rotate_half()` requires 4 positional arguments (`x`, `x_observer`, `x2_neg_fake_quant`, `concat_observer`), but `apply_rotary_pos_emb()` calls it with only 1 argument at lines 158–159:

```python
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
```

This causes `TypeError: rotate_half() missing 3 required positional arguments` for any caller. `LlamaAttention.forward` (lines 397–420) bypasses this by calling `rotate_half()` directly with all required arguments, but the public function remains broken and unusable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py` around lines 136 - 160, apply_rotary_pos_emb currently calls rotate_half with a single argument but rotate_half requires four positional args (x, x_observer, x2_neg_fake_quant, concat_observer); update apply_rotary_pos_emb to call rotate_half with the same four arguments used in LlamaAttention.forward (i.e., pass through the observer/quant args or supply None defaults) so the calls become rotate_half(q, x_observer, x2_neg_fake_quant, concat_observer) and rotate_half(k, x_observer, x2_neg_fake_quant, concat_observer) (or explicitly pass None for those three if observers are not available), ensuring the function signature and callers are updated accordingly.

pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (1)
114-138: ⚠️ Potential issue | 🟠 Major

Remove the `apply_rotary_pos_emb` function — it is dead code with a signature mismatch.

`apply_rotary_pos_emb` calls `rotate_half(q)` and `rotate_half(k)` with 1 argument each (lines 136–137), but `rotate_half` requires 4 arguments (`x`, `x_observer`, `x2_neg_fake_quant`, `concat_observer`). This would raise `TypeError` if called. The function is not invoked anywhere in the codebase and is absent from `__all__` exports. `Qwen3Attention.forward` applies RoPE manually with correct argument passing. Remove this unused function.

Note: The same issue exists identically in `pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` and `pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around lines 114 - 138, The function apply_rotary_pos_emb is dead code and wrong: it calls rotate_half(q) and rotate_half(k) with only one argument while rotate_half requires four, and Qwen3Attention.forward applies RoPE correctly; remove the entire apply_rotary_pos_emb function definition from modeling_qwen3.py to avoid the incorrect call and dead export, and likewise remove the identical apply_rotary_pos_emb definitions from modeling_qwen2.py and modeling_llama.py so no mismatched helper remains in the qualcomm backend.

pymllm/mobile/backends/qualcomm/transformers/llama/train.py (1)
32-35: ⚠️ Potential issue | 🟠 Major

`--output_dir` will crash with `TypeError` if omitted.

`--output_dir` has no `default` and no `required=True`. When `args.output_dir` is `None`, line 50 (`os.makedirs(args.output_dir, exist_ok=True)`) raises `TypeError: expected str, bytes or os.PathLike object, not NoneType`. The same issue exists in `qwen2/train.py`.

🐛 Proposed fix

```diff
 parser.add_argument(
     "--output_dir",
     type=str,
+    required=True,
     help="Directory to save the quantized model",
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/llama/train.py` around lines 32 - 35, The CLI argument "--output_dir" is missing a default or required=True so args.output_dir can be None and later break on os.makedirs(args.output_dir, exist_ok=True); update the argument definition in train.py (and mirror the same change in qwen2/train.py) to either set a sensible default (e.g., "./output") or mark it required=True and then add a defensive check before calling os.makedirs (e.g., ensure args.output_dir is not None and raise a clear error or use the default), referencing the argument parsing block that defines "--output_dir" and the code that calls os.makedirs(args.output_dir, exist_ok=True).

pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (3)
291-298: ⚠️ Potential issue | 🟡 Minor

Misleading streaming comment and `trust_remote_code=True` security note.

- The line 292 comment states `streaming=True` is used, but the actual `MsDataset.load()` call has no such argument. This will cause the full dataset to be downloaded locally — the opposite of what the comment describes.
- `trust_remote_code=True` on line 297 permits execution of arbitrary Python code bundled with the dataset. While common in research settings, it should be an explicit opt-in with a clear docstring warning rather than a silent default in a library.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 291 - 298, The comment above the MsDataset.load call is misleading and the use of trust_remote_code=True is unsafe; update the MsDataset.load invocation (the call to MsDataset.load in runner.py) to pass streaming=True to match the comment and avoid downloading the entire dataset, and change trust_remote_code=True to a safe default (False) or wire it to an explicit opt-in flag (e.g., allow_trust_remote_code) exposed in the function/class API with a clear docstring warning about executing remote code so callers must consciously opt in; also update the inline comment to accurately reflect the behavior.
166-185: ⚠️ Potential issue | 🔴 Critical

Add `QLinearW8A16_PerChannelSym` handling to the `enable_fake_quant` and `disable_fake_quant` functions.

`QLinearW8A16_PerChannelSym` is included in `freeze_qwen3_linear_weight` (line 147) and `convert_weight` (line 189), but is missing from `enable_fake_quant` and `disable_fake_quant` (lines 166–185). Unlike `QLinearLPBQ`, which defines `enable_fakequant()` and `disable_fakequant()` methods, `QLinearW8A16_PerChannelSym` currently lacks these methods. Both classes should provide consistent interfaces for managing fake-quantization state during calibration. Add `QLinearW8A16_PerChannelSym` to the type checks in both functions, and implement the corresponding methods in the class if they do not exist.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 166 - 185, Add QLinearW8A16_PerChannelSym to the fake-quant toggles and implement matching methods on the class: update enable_fake_quant and disable_fake_quant to check isinstance(..., QLinearW8A16_PerChannelSym) and call the new methods; then add enable_fakequant(self) and disable_fakequant(self) to the QLinearW8A16_PerChannelSym class to mirror QLinearLPBQ’s behavior (toggle whatever internal flag or buffer used for fake-quant state). Ensure the implementation is consistent with how QLinearLPBQ, ActivationQDQ, QRMSNorm, and QEmbedding manage fake-quant state so freeze_qwen3_linear_weight and convert_weight (which already reference QLinearW8A16_PerChannelSym) behave correctly during calibration.
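A minimal sketch of the shape such a fix could take, assuming the toggles are applied via `model.apply(...)` with `isinstance` checks and that `QLinearW8A16_PerChannelSym` can mirror the `enable_fakequant()`/`disable_fakequant()` interface of `QLinearLPBQ`; the placeholder base classes and the `_fake_quant_enabled` flag below are assumptions, not the real internals:

```python
# Sketch only: the real QLinearLPBQ / QLinearW8A16_PerChannelSym classes live
# in the core qlinear module; only the method names mirror the review above.
import torch.nn as nn


class QLinearLPBQ(nn.Module):  # placeholder standing in for the existing class
    def enable_fakequant(self) -> None:
        self._fake_quant_enabled = True  # hypothetical flag read by forward()

    def disable_fakequant(self) -> None:
        self._fake_quant_enabled = False


class QLinearW8A16_PerChannelSym(nn.Module):  # add the same interface here
    def enable_fakequant(self) -> None:
        self._fake_quant_enabled = True

    def disable_fakequant(self) -> None:
        self._fake_quant_enabled = False


def enable_fake_quant(module: nn.Module) -> None:
    # invoked via model.apply(enable_fake_quant) during calibration
    if isinstance(module, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
        module.enable_fakequant()


def disable_fake_quant(module: nn.Module) -> None:
    if isinstance(module, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
        module.disable_fakequant()
```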
69-73: ⚠️ Potential issue | 🟡 Minor

The diagnostic logging loop is dead code and contains an `AttributeError`.

`FakeQuantize` registers `scale` as a buffer (via `register_buffer`), not as a parameter. Therefore, `module.fake_quant.named_parameters()` will never yield `scale`, making the loop body at lines 70–73 unreachable dead code.

Additionally, line 72 references `module.scale`, but `ActivationQDQ` does not expose a `.scale` property — it should be `module.fake_quant.scale`.

Fix both issues:

🐛 Proposed fix

```diff
-        for key, value in module.fake_quant.named_parameters():
+        for key, value in module.fake_quant.named_buffers():
             if value is module.fake_quant.scale:
-                print(f"{module._get_name()}.{key}: {module.scale}")
+                print(f"{module._get_name()}.{key}: {module.fake_quant.scale}")
                 break
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 69 - 73, The loop using module.fake_quant.named_parameters() is dead because FakeQuantize.register_buffer registers scale as a buffer, and it also wrongly references module.scale; update the diagnostic to iterate module.fake_quant.named_buffers() (or check both named_parameters() and named_buffers() if desired) and reference module.fake_quant.scale when printing; specifically modify the block around the loop that uses module.fake_quant.named_parameters() and the print that uses module.scale so it inspects named_buffers() and prints module.fake_quant.scale (or the buffer name) instead.

pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py (2)
366-369: ⚠️ Potential issue | 🟡 Minor

Pre-existing typo: `layer_dix` should be `layer_idx`.

Line 369 assigns `self.layer_dix = layer_idx` (note "dix" vs "idx"). This typo propagates to lines 384 and 404 where `self.layer_dix` is read. While internally consistent, it's misleading and will cause confusion for anyone extending `Qwen2DecoderLayer`.

Not introduced by this PR, but flagging for awareness.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` around lines 366 - 369, The Qwen2DecoderLayer constructor uses a typo'd attribute name self.layer_dix instead of self.layer_idx; update the constructor in class Qwen2DecoderLayer to assign self.layer_idx = layer_idx and then change any reads/writes of self.layer_dix (e.g., at the sites currently reading it on lines where it’s accessed) to self.layer_idx so the attribute name is consistent and clear across the class (keep the class name Qwen2DecoderLayer and parameter layer_idx the same).
90-123: ⚠️ Potential issue | 🟡 Minor

Fix the arity mismatch in `apply_rotary_pos_emb`: it calls `rotate_half` with 1 argument but the function requires 4.

`rotate_half` on line 90 requires 4 parameters (`x`, `x_observer`, `x2_neg_fake_quant`, `concat_observer`), but `apply_rotary_pos_emb` calls it on lines 121–122 with only 1 argument. This will raise `TypeError` at runtime if `apply_rotary_pos_emb` is ever invoked.

The actual RoPE implementation in the attention forward method (lines 281–299) calls `rotate_half` with all 4 arguments correctly, so the bug only affects the unused `apply_rotary_pos_emb` function. Suggest either updating `apply_rotary_pos_emb` to pass all required arguments or removing it if it's not needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen2/modeling_qwen2.py` around lines 90 - 123, The apply_rotary_pos_emb function calls rotate_half(q) and rotate_half(k) but rotate_half requires four parameters (x, x_observer, x2_neg_fake_quant, concat_observer); update apply_rotary_pos_emb to accept and forward those observers/quantizers (e.g., add parameters x_observer, x2_neg_fake_quant: ActivationQDQ, concat_observer: ConcatObserver to apply_rotary_pos_emb signature and call rotate_half(q, x_observer, x2_neg_fake_quant, concat_observer) and rotate_half(k, x_observer, x2_neg_fake_quant, concat_observer)), or if apply_rotary_pos_emb is unused remove it—ensure the fix mirrors how rotate_half is invoked in the attention forward implementation.
🧹 Nitpick comments (32)
pymllm/orchestrator/parallel_state.py (3)
36-37: Silent early return when `dist` is not initialized may mask configuration errors.

If a caller invokes `initialize_model_parallel(tp=4, dp=2, ...)` without first calling `dist.init_process_group`, the function silently returns without setting any groups or config. Downstream code relying on `get_tp_group()` etc. will get `None` with no indication of why. Consider logging a warning so that misconfigured setups are easier to diagnose.

♻️ Proposed change

```diff
     if not dist.is_initialized():
+        logger.warning(
+            "torch.distributed is not initialized; "
+            "skipping model parallel group setup."
+        )
         return
```

As per coding guidelines, "Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/parallel_state.py` around lines 36 - 37, The early return guarded by dist.is_initialized() in initialize_model_parallel(tp=..., dp=...) should not be silent; update it to log a warning (using the module logger) that process groups were not initialized and advise calling torch.distributed.init_process_group before initialize_model_parallel, so downstream calls like get_tp_group() don't fail silently; ensure the message includes which function returned early (initialize_model_parallel) and the check (dist.is_initialized()) to aid debugging.
20-25: No thread safety or re-initialization guard on the global state.

`initialize_model_parallel` mutates module-level globals without any guard against concurrent or repeated calls. A second call would leak the prior `dist.new_group` handles. Consider adding an early check via `model_parallel_is_initialized()` and raising if already initialized, or explicitly destroying old groups first.

♻️ Example guard

```diff
 def initialize_model_parallel(
     tensor_model_parallel_size: int = 1,
     data_parallel_size: int = 1,
     pipeline_model_parallel_size: int = 1,
     backend: str = "nccl",
 ) -> None:
+    if model_parallel_is_initialized():
+        raise RuntimeError(
+            "Model parallel groups are already initialized. "
+            "Call destroy_model_parallel() first."
+        )
+
     global _TP_GROUP, _DP_GROUP, _PP_GROUP
```

Also applies to: 177-179
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/parallel_state.py` around lines 20 - 25, The initialize_model_parallel function mutates module-level globals and may leak prior dist.new_group handles on concurrent or repeated calls; add a re-initialization guard by checking model_parallel_is_initialized() at the top of initialize_model_parallel and either raise a clear exception if already initialized or call a cleanup routine (e.g., destroy_model_parallel or a new teardown function) to explicitly destroy existing groups before creating new ones; also protect the initialization logic with a simple threading.Lock to ensure thread-safety when creating dist.new_group handles and apply the same guard/cleanup pattern to the other initialization site(s) that mutate the same globals.
66-72: Same `assert`-based validation concern as in `group_coordinator.py`.

This is the primary invariant for the entire parallelism setup. If the product doesn't match `world_size`, the system will silently compute incorrect groups under `-O`. Raise a `ValueError` instead.

♻️ Proposed fix

```diff
-    assert (
-        tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size
-        == world_size
-    ), (
-        f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * "
-        f"PP({pipeline_model_parallel_size}) != World({world_size})"
-    )
+    total = tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size
+    if total != world_size:
+        raise ValueError(
+            f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * "
+            f"PP({pipeline_model_parallel_size}) = {total} != World({world_size})"
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/parallel_state.py` around lines 66 - 72, Replace the assert-based check that multiplies tensor_model_parallel_size, data_parallel_size, and pipeline_model_parallel_size against world_size with an explicit runtime validation that raises a ValueError when the product doesn't equal world_size; locate the expression that currently uses assert (the block referencing tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size == world_size and the f-string message) and change it to compute the product, compare to world_size, and raise ValueError(f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * PP({pipeline_model_parallel_size}) != World({world_size})") when they mismatch.pymllm/orchestrator/group_coordinator.py (2)
72-77: `assert` for input validation is stripped under `python -O`; prefer raising exceptions in public API functions.

`divide()` and `split_tensor_along_dim()` are public API (exported via `__init__.py`). `assert` statements are removed when Python runs with the `-O` flag, silently disabling these guards and leading to incorrect silent behavior (e.g., truncated results from integer division). Use explicit `ValueError` raises for robustness.

♻️ Example for `divide`; apply the same pattern to `split_tensor_along_dim`

```diff
 def divide(numerator: int, denominator: int) -> int:
     """Divide and ensure divisibility."""
-    assert numerator % denominator == 0, (
-        f"{numerator} is not divisible by {denominator}"
-    )
+    if numerator % denominator != 0:
+        raise ValueError(f"{numerator} is not divisible by {denominator}")
     return numerator // denominator
```

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions" and "Validate inputs for public APIs and critical internal functions."
Also applies to: 80-98
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/group_coordinator.py` around lines 72 - 77, The public functions divide and split_tensor_along_dim currently use assert for input validation which is removed under python -O; replace those asserts with explicit exception raises (e.g., raise ValueError(...)) that include the same descriptive message so invalid inputs (non-divisible numerator/denominator or bad split parameters) raise a clear exception instead of failing silently; update the divide function to raise ValueError when numerator % denominator != 0 and apply the same pattern to split_tensor_along_dim to validate its inputs and raise ValueError with contextual messages while preserving the existing return behavior.
19-37: `local_rank` is stored but never used within the class.

`self.local_rank` is assigned but never referenced by any method in `GroupCoordinator`. If it's only needed by external callers (e.g., for device placement), that's fine, but consider documenting that intent. Otherwise, it's dead state that may mislead future maintainers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/group_coordinator.py` around lines 19 - 37, The constructor currently assigns self.local_rank but nothing in GroupCoordinator uses it; either remove the local_rank parameter and attribute from GroupCoordinator.__init__ (and update callers) to avoid dead state, or explicitly document its purpose for external use by adding a docstring/param comment to the GroupCoordinator class and keeping self.local_rank; if you choose removal, delete the local_rank argument and any assignment to self.local_rank and update instantiations, and if you choose to keep it, add a clear docstring on GroupCoordinator and/or expose it via a property (e.g., def local_rank(self): return self._local_rank) so the intent is explicit (references: GroupCoordinator.__init__, self.local_rank, rank_in_group, device_group).pymllm/orchestrator/__init__.py (1)
25-48: Categorical grouping in `__all__` is reasonable; ignore the sort lint.

The static analysis tool flags that `__all__` is not alphabetically sorted (RUF022), but the current categorical grouping (GroupCoordinator, TP, DP, PP, State) is intentional and more readable for this API surface. Consider adding a `# noqa: RUF022` comment if you want to suppress the lint.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/__init__.py` around lines 25 - 48, The linter flags the __all__ list as unsorted (RUF022) even though the categorical grouping is intentional; suppress the warning by adding a noqa comment for RUF022 to the __all__ definition (e.g., append "# noqa: RUF022" to the __all__ assignment) so the grouped exports (including symbols like "GroupCoordinator", "divide", "get_tp_group", "data_parallel_all_reduce", "get_pp_group", "initialize_model_parallel", etc.) keep their categorical order without lint noise.pymllm/mobile/backends/qualcomm/nn.py (1)
4-11: Add docstrings and document placeholder intent for `QnnSoftmax` and `QnnRoPE`.

Both classes are bare subclasses that add no behavior. Per coding guidelines, public classes must have docstrings explaining purpose. Without any comment it is also unclear whether these are intentional stubs awaiting Qualcomm-specific overrides or permanent thin wrappers.

♻️ Proposed fix: add minimal docstrings

```diff
 class QnnSoftmax(Softmax):
+    """Qualcomm-specific Softmax layer.
+
+    Thin wrapper around Softmax; reserved for Qualcomm QNN-specific
+    operator overrides.
+    """
     def __init__(self):
         super().__init__()


 class QnnRoPE(RoPE):
+    """Qualcomm-specific Rotary Position Embedding layer.
+
+    Thin wrapper around RoPE; reserved for Qualcomm QNN-specific
+    operator overrides.
+    """
     def __init__(self):
         super().__init__()
```

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/nn.py` around lines 4 - 11, The public classes QnnSoftmax and QnnRoPE are empty subclasses with no docstrings; add concise class docstrings to each explaining their intent (e.g., "Qualcomm-specific subclass of Softmax/RoPE; currently a thin wrapper/placeholder for future Qualcomm overrides") and note any expected behavior or why they exist (compatibility shim, vendor-specific hooks), so readers know they are intentional stubs and not omissions; ensure the docstrings appear directly under the class definitions for QnnSoftmax and QnnRoPE.pymllm/mobile/ffi/base.py (1)
9-25: Add a pre-flight existence check and a descriptive error in `_load_lib()`.

`tvm_ffi.load_module(lib_path)` is called without checking whether `lib_path` exists. On any environment where the native library hasn't been built or installed, the module import fails with a raw OS/TVM error that gives no indication of which path was searched or what the user should do. This is worsened by the module-level `_LIB = _load_lib()` at line 25, which means the error fires on any `import` of this module.

♻️ Proposed fix: descriptive pre-flight check

```diff
     lib_path = os.path.join(parent_dir, "lib", lib_name)
+    if not os.path.exists(lib_path):
+        raise RuntimeError(
+            f"MllmFFIExtension library not found at '{lib_path}'. "
+            "Ensure the native library is built and installed correctly."
+        )
     return tvm_ffi.load_module(lib_path)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/ffi/base.py` around lines 9 - 25, The current _load_lib() calls tvm_ffi.load_module(lib_path) without verifying the file exists, causing opaque import-time errors; update _load_lib to check os.path.exists(lib_path) (or os.path.isfile) before calling tvm_ffi.load_module and, if the file is missing, raise a clear FileNotFoundError/RuntimeError that includes the full lib_path, the platform-determined lib_name, and a short hint about building or installing the native library; keep the _LIB = _load_lib() behavior but ensure the new error message surfaces the searched path and recovery steps so users know what to do.pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py (2)
379-382: Typo: `layer_dix` should be `layer_idx`.

`self.layer_dix = layer_idx` (line 382) is a persistent typo. It is used consistently as `layer_dix` within the class (lines 394, 397, 417), so there is no immediate runtime error, but it clashes with the attribute name `layer_idx` used by the parent `Qwen3Attention` and the rest of the decoder stack.

🧹 Proposed rename (all occurrences in this class)

```diff
-        self.layer_dix = layer_idx
+        self.layer_idx = layer_idx
```

Then update lines 394, 397, 417 from `self.layer_dix` → `self.layer_idx`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around lines 379 - 382, The class Qwen3DecoderLayer has a persistent typo: the attribute was set as self.layer_dix instead of self.layer_idx; update the attribute assignment in Qwen3DecoderLayer.__init__ to self.layer_idx = layer_idx and rename all uses of self.layer_dix to self.layer_idx within this class (including places that interact with Qwen3Attention) so the decoder layer attribute matches the expected layer_idx name used elsewhere.
105-111: Unused `x_observer` parameter in `rotate_half`.

`x_observer` is accepted but never referenced in the function body. If it is no longer needed, remove it; if it was meant to be used (e.g., to trigger an observer forward pass on `x`), add the missing call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/modeling_qwen3.py` around lines 105 - 111, The parameter x_observer in function rotate_half is unused; either remove it from rotate_half's signature and all callers if observers are not needed, or invoke it to trigger observation (e.g., call x_observer(x) before returning) so the input tensor is recorded; update the rotate_half definition and ensure symbols x2_neg_fake_quant and concat_observer are still applied in the existing order (use x_observer(x) then return concat_observer(torch.cat((x2_neg_fake_quant(-x2), x1), dim=-1))) and adjust any call sites accordingly.pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py (3)
219-234: Use the already-defined `num_steps` constant instead of repeating `2**bitwidth_of_scale`.

`num_steps` is computed at line 221 but `2**bitwidth_of_scale` is used again directly at line 231, breaking the named-constant rule.

🧹 Proposed fix

```diff
-            max=2**bitwidth_of_scale,
+            max=num_steps,
```

As per coding guidelines, "Use named constants instead of magic numbers."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py` around lines 219 - 234, Replace the duplicated magic expression 2**bitwidth_of_scale with the already-defined constant num_steps in the quantization loop: inside the loop in qlinear.py where q_scales is computed (the torch.clamp call that currently uses max=2**bitwidth_of_scale), change that max to num_steps to use the named constant and avoid repeating the exponentiation.
32-36: Redundant `is not None` guard inside the outer null-check.

The outer `if self.weight_quant is not None:` at line 32 already ensures the inner branch is never reached with `None`. The `and self.weight_quant is not None` on line 36 adds nothing.

🧹 Proposed cleanup

```diff
-        if (
-            isinstance(self.weight_quant, PerBlockParamFakeQuantize)
-            and self.weight_quant is not None
-        ):
+        if isinstance(self.weight_quant, PerBlockParamFakeQuantize):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py` around lines 32 - 36, Remove the redundant null-check: inside the outer guard "if self.weight_quant is not None:" drop the duplicated "and self.weight_quant is not None" from the inner condition that tests "isinstance(self.weight_quant, PerBlockParamFakeQuantize)"; leave only the isinstance check so the inner branch reads like a single isinstance(self.weight_quant, PerBlockParamFakeQuantize) test referencing the existing self.weight_quant variable.
205-215: Replace `assert` with an explicit `ValueError` for runtime data validation.

`assert` statements are silently stripped when Python runs with `-O`; a data-validity check on user-provided tensors should survive in optimized mode.

🛡️ Proposed fix

```diff
-assert self.weight.shape[-1] % self.block_size[1] == 0
-assert linear_zero_point.sum() == 0
+if self.weight.shape[-1] % self.block_size[1] != 0:
+    raise ValueError(
+        f"in_features ({self.weight.shape[-1]}) must be divisible by "
+        f"block_size ({self.block_size[1]})"
+    )
+if linear_zero_point.sum() != 0:
+    raise ValueError("LPBQ deployment requires symmetric quantization (zero_point must be all zeros)")
```

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/core/qlinear.py` around lines 205 - 215, Replace the two runtime assertions in QLinear's quantization block with explicit checks that raise ValueError: check that self.weight.shape[-1] % self.block_size[1] == 0 and raise ValueError with a clear message if not, and check that linear_zero_point.sum() == 0 and raise ValueError with a clear message if not; keep the rest of the flow (calling _quantize_affine and assigning weight_int4) unchanged so invalid tensor shapes or zero-point sums still raise informative exceptions at runtime.mllm-kernel/pyproject.toml (1)
3-3: `apache-tvm-ffi` is unpinned in `[build-system].requires` but pinned to `0.1.8` in `[project].dependencies`.

During the `scikit-build-core` build step a different (potentially incompatible) version of `apache-tvm-ffi` could be resolved, while the installed package enforces `== 0.1.8` at runtime. If the build scripts rely on the same API surface, pin the build-system requirement to match.

🔧 Proposed fix

```diff
 requires = [
-    "scikit-build-core>=0.11.0", "apache-tvm-ffi", "torch-c-dlpack-ext"
+    "scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8", "torch-c-dlpack-ext"
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm-kernel/pyproject.toml` at line 3, The build-system requirement for apache-tvm-ffi is unpinned while the runtime dependency in [project].dependencies is pinned to 0.1.8; update the [build-system].requires entry for "apache-tvm-ffi" in pyproject.toml to pin it to "apache-tvm-ffi==0.1.8" so the build resolver uses the same version as the runtime and avoids API/compatibility mismatches.pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py (1)
127-133: `x_observer` parameter accepted but never used in `rotate_half`.

Same issue as the Qwen3 variant: the `x_observer` argument (line 128) is passed at every call-site but the function body ignores it entirely. Either remove it from the signature (and all call-sites) or add the intended forward/observation logic.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/llama/modeling_llama.py` around lines 127 - 133, The rotate_half function currently accepts x_observer but never uses it; either remove x_observer from the signature and from all call-sites, or apply the observer to the input (e.g., call x = x_observer(x) or x_observer.observe(x) as appropriate) before splitting/rotating so the input is recorded; update the function rotate_half and all callers to match the chosen approach and keep the existing ActivationQDQ x2_neg_fake_quant and ConcatObserver concat_observer usage intact.pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py (2)
305-340: `tqdm` progress bar never closed.

`pbar` is created on line 305 but never explicitly closed. The final display update (e.g., elapsed time) will be missing, and if multiple calibration runs occur in the same process, stale bars may accumulate. Use a context manager:

♻️ Proposed fix

```diff
-    pbar = tqdm(total=num_samples, desc="Calibrating")
-    for entry in dataset:
-        ...
-        pbar.update(1)
+    with tqdm(total=num_samples, desc="Calibrating") as pbar:
+        for entry in dataset:
+            ...
+            pbar.update(1)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 305 - 340, The tqdm progress bar instantiated as pbar in the calibration loop is never closed; wrap the progress bar in a context manager or explicitly close it to avoid stale bars (e.g., use with tqdm(total=num_samples, desc="Calibrating") as pbar: around the for entry in dataset loop or call pbar.close() after the loop). Update the block where pbar is created and used (variable pbar, the for entry in dataset loop, and samples_processed updates) so the progress bar is safely closed regardless of early breaks or exceptions.
19-195: Significant code duplication across runner modules.

The utility functions `recompute_scale_zp`, `validate_concat_observer_fn`, `disable_qdq_observer`, `enable_qdq_observer`, `enable_fake_quant`, `disable_fake_quant`, and `convert_weight` (lines 19–195) are byte-for-byte identical in `qwen3/runner.py`, `llama/runner.py`, and likely `qwen2/runner.py`. This violates DRY and makes any future fix/enhancement require triple (or more) edits.

Consider extracting these into a shared `pymllm/mobile/backends/qualcomm/transformers/core/runner_utils.py` module and importing from there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/backends/qualcomm/transformers/qwen3/runner.py` around lines 19 - 195, These functions (recompute_scale_zp, validate_concat_observer_fn, disable_qdq_observer, enable_qdq_observer, enable_fake_quant, disable_fake_quant, convert_weight) are duplicated across runner modules—extract them into a shared module (e.g., create a new runner_utils.py under transformers/core) and move the implementations there, then replace the in-file definitions in qwen3/runner.py (and the identical ones in llama/runner.py and qwen2/runner.py) with imports from that shared module; ensure the shared functions keep the same signatures and continue to reference ActivationQDQ, ConcatObserver, QLinearLPBQ, QLinearW8A16_PerChannelSym, QRMSNorm, QEmbedding, etc., so callers (model.apply or named_modules loops) work unchanged.
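A minimal sketch of the suggested extraction, assuming the duplicated helpers can be moved verbatim; the relative import path and the single elided function shown here are illustrative placeholders, not the real implementation:

```python
# pymllm/mobile/backends/qualcomm/transformers/core/runner_utils.py  (sketch)
# Assumption: QLinearLPBQ / QLinearW8A16_PerChannelSym are importable from the
# existing core qlinear module; the function body is intentionally elided.
import torch.nn as nn

from .qlinear import QLinearLPBQ, QLinearW8A16_PerChannelSym


def convert_weight(module: nn.Module) -> None:
    """Shared helper previously duplicated in the qwen3/llama/qwen2 runners."""
    if isinstance(module, (QLinearLPBQ, QLinearW8A16_PerChannelSym)):
        ...  # move the identical conversion logic here, unchanged
```

The per-model runners would then import these helpers (e.g., `from ..core.runner_utils import convert_weight, enable_fake_quant, disable_fake_quant`) instead of redefining them.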
pyproject.toml (2)

24-24: Exact-version pin `apache-tvm-ffi == 0.1.8` prevents receiving security/bug-fix patches.

Strict `==` pinning in `dependencies` (not just `build-system.requires`) means any downstream environment that already has a newer minor release will fail to resolve. Consider a compatible-release specifier:

♻️ Proposed loosening

```diff
-    "apache-tvm-ffi == 0.1.8",
+    "apache-tvm-ffi >= 0.1.8, < 0.2.0",
```

If ABI breakage is the concern, at minimum document why strict pinning is required here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` at line 24, The dependency line pins apache-tvm-ffi exactly to "apache-tvm-ffi == 0.1.8", which blocks patch/minor updates and can break resolution in downstream environments; change this to a compatible-release specifier such as "apache-tvm-ffi ~= 0.1.8" or "apache-tvm-ffi >=0.1.8,<0.2.0" in pyproject.toml to allow receiving bug/security fixes while preventing ABI bumps, and if you truly require an exact pin keep the existing strict pin but add a short justification comment in the project docs explaining why exact-version pinning of apache-tvm-ffi is necessary.
21-23: `pytest` and `pytest-html` should not be runtime dependencies.

Test runner packages installed into every end-user environment bloat the install footprint and create unnecessary version conflicts. Move them to an optional `[dev]` or `[test]` group:

♻️ Proposed fix

```diff
 dependencies=[
     "packaging",
-    "pytest",
-    "pytest-html",
     "apache-tvm-ffi == 0.1.8",
     ...
 ]

+[project.optional-dependencies]
+dev = ["pytest", "pytest-html"]
 cuda = ["tilelang", "flashinfer-python"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pyproject.toml` around lines 21 - 23, The pyproject currently lists "pytest" and "pytest-html" as regular runtime dependencies; remove those two entries from the main dependencies list and add them to a development/test-only group instead (for example add them under tool.poetry.dev-dependencies if using Poetry, or under project.optional-dependencies.test if following PEP 621). Ensure you delete the "pytest" and "pytest-html" lines from the dependencies block and add equivalent entries in the chosen dev/test section so they are only installed for development/testing.pymllm/layers/utils.py (1)
3-10: Inconsistent type hint style: `Dict` import used alongside `|` union syntax.

Line 3 imports `Dict` from `typing`, but line 10 uses `Dict[str, Any] | None` with the `|` union operator (Python 3.10+). Since you're already using 3.10+ syntax, consider using the built-in `dict` instead of the imported `Dict` for consistency:

✏️ Suggested change

```diff
-from typing import Any, Dict
+from typing import Any

 import torch


 def set_weight_attrs(
     weight: torch.Tensor,
-    weight_attrs: Dict[str, Any] | None,
+    weight_attrs: dict[str, Any] | None,
 ) -> None:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/utils.py` around lines 3 - 10, The type hint style is inconsistent: remove the unused Dict import and change the annotation on set_weight_attrs (parameter weight_attrs) from Dict[str, Any] | None to the modern built-in form dict[str, Any] | None to match Python 3.10+ union syntax; keep the Any import and update the function signature accordingly (refer to the function name set_weight_attrs and parameter weight_attrs).pymllm/tests/test_vocab_parallel_embedding.py (2)
61-62: Hardcoded `MASTER_PORT` may cause conflicts in CI.

Port `29500` is a common default and may already be in use by other distributed tests or services running in parallel. Consider using a dynamic free port or at minimum reading from an environment variable with a fallback.

✏️ Suggested change

```diff
     os.environ["MASTER_ADDR"] = "localhost"
-    os.environ["MASTER_PORT"] = "29500"
+    os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", "29500")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 61 - 62, Replace the hardcoded MASTER_PORT assignment in the test setup (currently setting os.environ["MASTER_PORT"] = "29500") with logic that first reads an optional env var and otherwise selects a free ephemeral port at runtime; e.g., use os.environ.get("MASTER_PORT") as primary and fall back to a dynamically discovered free port (via a helper like find_free_port() that binds a socket to port 0 to obtain an available port) before assigning os.environ["MASTER_PORT"], keeping os.environ["MASTER_ADDR"] assignment as-is.
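If the dynamic-port route is taken, a small helper along these lines could pick a free ephemeral port; the `find_free_port` name is illustrative (it mirrors the helper mentioned in the agent prompt above) and is not an existing utility in the repo:

```python
import os
import socket


def find_free_port() -> int:
    """Bind to port 0 so the OS picks an unused ephemeral port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("localhost", 0))
        return sock.getsockname()[1]


os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = os.environ.get("MASTER_PORT", str(find_free_port()))
```

There is a small window in which another process could claim the port before `init_process_group` binds it, but this is a common and usually acceptable pattern for distributed test setups.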
22-23: Module-level `logging.basicConfig(force=True)` can interfere with other tests.

Using `force=True` overwrites the root logger configuration, which can affect logging behavior of other test modules when run in the same pytest session. Consider scoping this to a fixture or using a module-specific logger instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 22 - 23, Remove the module-level call that forcefully reconfigures the root logger; instead stop using logging.basicConfig(..., force=True) and avoid touching the root logger with logging.getLogger().setLevel at import time. Update this test module to either configure a module-scoped logger (use logging.getLogger(__name__) and set its level) or move root-level configuration into a pytest fixture (use caplog or a test-scoped setup) so other tests aren’t affected; look for the existing calls logging.basicConfig and logging.getLogger().setLevel in this file and replace them with a module logger or a fixture-based configuration.
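A minimal sketch of the module-scoped-logger alternative, assuming the test only needs its own debug output rather than a reconfigured root logger:

```python
import logging

# Module-scoped logger: does not touch the root logger, so other test modules
# in the same pytest session keep their own logging configuration.
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)

logger.debug("tensor-parallel embedding test starting")
```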
pymllm/layers/base.py (1)

8-27: Consider adding a class-level docstring for `MllmBaseLayer`.

This is the base class for all layers in the `pymllm.layers` package and is part of the public API (re-exported from `pymllm.layers.__init__`). A brief docstring explaining its role, how subclasses should override `weight_loader` and `forward`, and the purpose of `quant_recipe` would help consumers.

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/base.py` around lines 8 - 27, Add a concise class-level docstring to MllmBaseLayer describing its role as the base class for layers in pymllm.layers, documenting the quant_recipe attribute's purpose (optional QuantRecipe used for quantization settings), describing that subclasses should override weight_loader to implement custom weight-loading/sharding behavior and forward to implement the layer's forward pass, and briefly list expected parameters/returns/behaviour for weight_loader (param: Parameter, loaded_weight: torch.Tensor) and forward (raise NotImplementedError by default). Reference MllmBaseLayer, weight_loader, forward, and the quant_recipe attribute in the docstring so consumers of the public API understand usage and extension points.
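A docstring sketch in that spirit; the wording is illustrative, the signatures follow the description above rather than the actual file, and the default `weight_loader` body shown is an assumption:

```python
import torch
from torch.nn import Module, Parameter


class MllmBaseLayer(Module):
    """Base class for all layers in ``pymllm.layers``.

    Subclasses override :meth:`weight_loader` to copy (and, for tensor-parallel
    layers, shard) checkpoint tensors into their parameters, and :meth:`forward`
    to implement the computation. ``quant_recipe`` optionally carries a
    ``QuantRecipe`` describing how the layer should be quantized.
    """

    def weight_loader(self, param: Parameter, loaded_weight: torch.Tensor) -> None:
        """Copy ``loaded_weight`` into ``param``; default shown here is an assumption."""
        param.data.copy_(loaded_weight)

    def forward(self, *args, **kwargs):
        raise NotImplementedError
```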
pymllm/layers/__init__.py (1)

7-11: `__all__` is not sorted (Ruff RUF022).

✏️ Suggested fix

```diff
 __all__ = [
     "MllmBaseLayer",
-    "set_weight_attrs",
     "VocabParallelEmbedding",
+    "set_weight_attrs",
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/__init__.py` around lines 7 - 11, The __all__ list is not alphabetically sorted (Ruff RUF022); reorder the exported symbols in the __all__ list so they are lexicographically sorted — e.g., place "MllmBaseLayer", "VocabParallelEmbedding", and "set_weight_attrs" in alphabetical order (adjust the list containing MllmBaseLayer, set_weight_attrs, VocabParallelEmbedding accordingly) to satisfy the linter.pymllm/quantization/quant_recipe.py (1)
1-3: Add a docstring to the public `QuantRecipe` class.

This class is referenced as a type annotation in `MllmBaseLayer` (in `pymllm/layers/base.py`). A brief docstring explaining its intended purpose would help consumers understand it's a placeholder for quantization configuration.

✏️ Suggested change

```diff
 class QuantRecipe:
+    """Configuration recipe for quantization.
+
+    This is a placeholder class that will be extended with quantization
+    parameters and methods as quantization support matures.
+    """
+
     def __init__(self):
         pass
```

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/quantization/quant_recipe.py` around lines 1 - 3, Add a brief docstring to the public QuantRecipe class describing its purpose as a placeholder/structured container for quantization configuration and how it is intended to be used (e.g., fields or subclasses should represent quantization settings) so consumers and type annotations (such as MllmBaseLayer in pymllm/layers/base.py) understand its role; place the docstring immediately under the class declaration for QuantRecipe and keep it concise, describing expected responsibilities, typical usage, and any important invariants or extension points.pymllm/layers/embedding.py (3)
73-112: Prefer raising exceptions over `assert` for runtime validation in public API methods.

`assert` statements are stripped when Python runs with `-O` (optimized mode), silently disabling these shape checks. Since `weight_loader` is part of the public API and loads external data, these checks should use explicit exceptions (e.g., `ValueError`/`RuntimeError`) to guarantee validation at all times.

Example for one assertion (apply the same pattern to the others)

```diff
-            assert param.data.shape == loaded_weight.shape, (
-                f"Shape mismatch: param {param.data.shape} vs "
-                f"loaded {loaded_weight.shape}"
-            )
+            if param.data.shape != loaded_weight.shape:
+                raise ValueError(
+                    f"Shape mismatch: param {param.data.shape} vs "
+                    f"loaded {loaded_weight.shape}"
+                )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/embedding.py` around lines 73 - 112, Replace all runtime validation assert statements in the weight_loader method with explicit exceptions so checks are always enforced; for each assert in weight_loader (the shape equality check when no sharding, the loaded_weight vocab-size check using output_dim vs self.num_embeddings, and the shard shape equality check), raise an appropriate exception (ValueError or RuntimeError) with the same descriptive message instead of using assert; ensure both branches (output_dim is None or tp_size == 1 and the sharded branch that uses self.vocab_start_index/self.vocab_end_index or .narrow with self.num_embeddings_per_partition) perform the validations via exceptions before calling param.data.copy_(...) so errors are never skipped under -O.
31-31: Use explicit `Optional[int]` or `int | None` instead of implicit `Optional`.

PEP 484 prohibits implicit `Optional`. This was also flagged by Ruff (RUF013).

Proposed fix

```diff
-        padding_idx: int = None,
+        padding_idx: int | None = None,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/embedding.py` at line 31, The parameter padding_idx currently uses an implicit Optional via "padding_idx: int = None"; update its type annotation to an explicit Optional by changing it to "padding_idx: Optional[int] = None" (or "padding_idx: int | None = None" for Python 3.10+), and add the corresponding import from typing (Optional) if you choose that form; ensure you update the signature where padding_idx is declared so Ruff (RUF013) and PEP 484 rules are satisfied.
44-50: Redundant divisibility check: both the `if` (line 44) and `divide()` (line 50) validate the same condition.

The explicit `ValueError` on line 44 is the better check since it gives a clear message and can't be stripped. The subsequent `divide()` call on line 50 internally asserts the same condition. This is harmless but worth noting — the `divide()` assert is now dead code for this path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/embedding.py` around lines 44 - 50, Keep the explicit divisibility check and clear ValueError for num_embeddings vs tp_size, but remove the redundant runtime assertion by avoiding the call to divide() here: set self.num_embeddings_per_partition using integer division of num_embeddings by self.tp_size (e.g., compute num_embeddings // self.tp_size) instead of calling divide(), so the check remains in this function (ValueError in the if) and the duplicate assert inside divide() is not hit for this path; update references to num_embeddings_per_partition accordingly.pymllm/mobile/__init__.py (1)
1-45: Consider defining `__all__` to control the public API surface.

Without `__all__`, a wildcard import (`from pymllm.mobile import *`) will also pull in internal submodule names like `ffi`, `convertor`, `utils`, `quantize`, `nn`, `service`, and `backends`. Defining `__all__` explicitly would keep the public surface intentional and consistent with how `pymllm/__init__.py` manages its exports.

Example

```python
__all__ = [
    # dtypes
    "float32", "float16", "bfloat16",
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "boolean",
    # devices
    "cpu", "cuda", "qnn",
    # core
    "Tensor", "empty", "echo", "device",
    "is_torch_available", "is_numpy_available",
    "from_torch", "from_numpy",
    "zeros", "ones", "arange", "random",
    # ops
    "matmul",
    # submodules (if intentionally public)
    "ffi", "nn", "backends", "convertor", "utils", "quantize", "service",
]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mobile/__init__.py` around lines 1 - 45, Add an explicit __all__ list in pymllm.mobile.__init__ to control the public API surface: include the public symbols currently re-exported from .ffi (float32, float16, bfloat16, int8, int16, int32, int64, uint8, uint16, uint32, uint64, boolean, cpu, cuda, qnn, Tensor, empty, echo, device, is_torch_available, is_numpy_available, from_torch, from_numpy, zeros, ones, arange, random), plus matmul from .nn.functional; decide whether to also expose submodules (ffi, nn, backends, convertor, utils, quantize, service) and only include them in __all__ if they are intentionally public. Ensure the name __all__ is defined at module level so wildcard imports and documentation reflect the intended public API.pymllm/__init__.py (1)
26-27: Add a docstring to the public `is_mobile_available()` function.

Per coding guidelines, public APIs should have clear docstrings explaining purpose and return value.

Proposed fix

```diff
 def is_mobile_available() -> bool:
+    """Check whether the mobile (on-device) FFI libraries are available.
+
+    Returns:
+        True if the platform-specific MllmFFIExtension shared library
+        exists under pymllm/lib/, False otherwise.
+    """
     return _has_mobile_libs()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/__init__.py` around lines 26 - 27, Add a concise docstring to the public function is_mobile_available() that explains its purpose (checking if mobile-specific dependencies are present), what it returns (bool indicating availability), and any notes about it delegating to the internal helper _has_mobile_libs(); place the docstring immediately below the def is_mobile_available() signature and follow project docstring style (one-line summary and a short return description).pymllm/configs/__init__.py (1)
12-21: `__all__` is not sorted (Ruff RUF022).

♻️ Proposed fix

```diff
 __all__ = [
-    # Main singleton
-    "GlobalConfig",
-    "get_global_config",
-    # Sub configs
-    "ServerConfig",
-    "ModelConfig",
-    "RuntimeConfig",
     "CacheConfig",
+    "GlobalConfig",
+    "ModelConfig",
+    "RuntimeConfig",
+    "ServerConfig",
+    "get_global_config",
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/configs/__init__.py` around lines 12 - 21, The __all__ list in pymllm.configs.__init__ is not alphabetized; update the __all__ definition so the exported symbol names are sorted alphabetically (e.g., "CacheConfig", "GlobalConfig", "ModelConfig", "RuntimeConfig", "ServerConfig", "get_global_config") while preserving it as a list of strings so Ruff RUF022 is satisfied and the module exports remain the same.pymllm/configs/global_config.py (1)
173-173: Using `base_gpu_id < 0` as an undocumented CPU sentinel is confusing.

The mapping `device = "cuda" if base_gpu_id >= 0 else "cpu"` encodes CPU selection as a negative GPU index, a convention that is non-obvious and unmentioned in the `ServerConfig` field comment (`# Base GPU index used for process-to-device mapping`). A typo like `-1` intended as "auto" would silently switch the device to CPU.

Consider an explicit device field on `ServerConfig` (e.g., `device: Literal["cuda","cpu","auto"] = "auto"`) or at least document this sentinel contract on the `base_gpu_id` field.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/configs/global_config.py` at line 173, The code currently treats server_config.base_gpu_id < 0 as a hidden "use CPU" sentinel; update ServerConfig to include an explicit device field (e.g., device: Literal["cuda","cpu","auto"] = "auto") and change the runtime.device assignment to honor server_config.device first (if "cuda" set device="cuda", if "cpu" set device="cpu", if "auto" fall back to using base_gpu_id >= 0 ? "cuda" : "cpu"); also add a brief comment on the ServerConfig.device field explaining the semantics so the contract is explicit (refer to ServerConfig, server_config.base_gpu_id, and runtime.device).
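A sketch of the explicit-field variant, assuming `ServerConfig` is a dataclass and that `base_gpu_id` keeps its current meaning; everything other than those two names is illustrative rather than the actual config code:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class ServerConfig:
    # Base GPU index used for process-to-device mapping.
    base_gpu_id: int = 0
    # Explicit device selection; "auto" falls back to the base_gpu_id heuristic.
    device: Literal["auto", "cuda", "cpu"] = "auto"


def resolve_device(cfg: ServerConfig) -> str:
    """Resolve the runtime device without relying on a negative-index sentinel."""
    if cfg.device != "auto":
        return cfg.device
    return "cuda" if cfg.base_gpu_id >= 0 else "cpu"
```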
- Updated `.codespellrc` to include 'flashinfer' in the ignore words list.
- Introduced new files for `launch_server`, `prepare`, and various layer implementations including `LayerNorm`, `RMSNorm`, and `MLP`.
- Added `ColumnParallelLinear` and `RowParallelLinear` classes for efficient linear operations in tensor parallelism.
- Implemented rotary embedding functions in `rope.py` for enhanced model performance.
- Created caching mechanisms in `param_disk_cache.py` and `radix_cache.py` for improved memory management.
- Refactored `GroupCoordinator` to enhance broadcasting functionality in distributed settings.
Actionable comments posted: 2
🧹 Nitpick comments (4)
pymllm/layers/rms_norm.py (1)
19-20: weight initialized with torch.empty — no safe default if weights aren't loaded.
RMSNorm.weight (and GemmaRMSNorm.weight at line 47) use torch.empty, which contains arbitrary memory. LayerNorm initializes weight to torch.ones and bias to torch.zeros, providing a usable identity default. For RMSNorm, weight = 1 is the standard identity initialization. Consider using torch.ones for consistency and safety against accidental use before weight loading.
♻️ Proposed fix
- self.weight = Parameter(torch.empty(hidden_size))
+ self.weight = Parameter(torch.ones(hidden_size))
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/rms_norm.py` around lines 19 - 20, RMSNorm and GemmaRMSNorm initialize self.weight with torch.empty which may contain garbage if weights aren't loaded; change both initializations to use a safe identity default (torch.ones) so the Parameter is created with ones (e.g., replace Parameter(torch.empty(hidden_size)) with Parameter(torch.ones(hidden_size))) while keeping the subsequent set_weight_attrs(self.weight, {"weight_loader": self.weight_loader}) calls intact.pymllm/layers/linear.py (2)
86-111: Shape checks useassert— silently disabled underpython -O.The
assertstatements inweight_loader(lines 96-100, 107-111) serve as data-integrity guards. In optimized mode (python -O/PYTHONOPTIMIZE), assertions are stripped, which would silently allow shape-mismatched weight loads, potentially causing hard-to-debug corruption. The same pattern exists inRowParallelLinear.weight_loaderandVocabParallelEmbedding.weight_loader.Consider converting to explicit
if/raise ValueError(...)for safety-critical checks. This is a pattern-level suggestion applicable to allweight_loadermethods.♻️ Example fix (ColumnParallelLinear.weight_loader)
if output_dim is None or self.tp_size == 1: - assert param.data.shape == loaded_weight.shape, ( - f"Shape mismatch: param {param.data.shape} vs " - f"loaded {loaded_weight.shape}" - ) + if param.data.shape != loaded_weight.shape: + raise ValueError( + f"Shape mismatch: param {param.data.shape} vs " + f"loaded {loaded_weight.shape}" + ) param.data.copy_(loaded_weight) else: shard_weight = loaded_weight.narrow( output_dim, self.output_start_index, self.out_features_per_partition, ) - assert param.data.shape == shard_weight.shape, ( - f"Shard shape mismatch: param {param.data.shape} vs " - f"shard {shard_weight.shape}" - ) + if param.data.shape != shard_weight.shape: + raise ValueError( + f"Shard shape mismatch: param {param.data.shape} vs " + f"shard {shard_weight.shape}" + ) param.data.copy_(shard_weight)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/linear.py` around lines 86 - 111, The weight_loader methods use assert for shape checks which are disabled with python -O; replace the asserts in ColumnParallelLinear.weight_loader (and the same pattern in RowParallelLinear.weight_loader and VocabParallelEmbedding.weight_loader) with explicit runtime checks that raise a ValueError containing the expected and actual shapes and context (e.g., include param.data.shape and loaded_weight.shape or shard_weight.shape and mention shard vs full load), keeping the same control flow for the tp_size/output_dim branch and copying behavior when shapes match.
72-84: Removebias_flagfrom ColumnParallelLinear and RowParallelLinear.
self.bias_flagis set in both branches (lines 73, 83 in ColumnParallelLinear; lines 178, 182 in RowParallelLinear) but never read. The canonical way to check for bias isif self.bias is not None:, which is already done in RowParallelLinear.forward (line 218). Since the information is already captured in whetherself.biasis a Parameter or None,bias_flagis redundant dead state and should be removed.Note: The Linear class (lines 224–251) does not use
bias_flagand does not need changes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/linear.py` around lines 72 - 84, The classes ColumnParallelLinear and RowParallelLinear define and set self.bias_flag but never read it; remove this redundant attribute and rely on the existing bias presence check (if self.bias is not None). Update the constructors in ColumnParallelLinear and RowParallelLinear to stop assigning self.bias_flag in both branches (when bias is truthy and when registering None via register_parameter("bias", None)), leaving only the existing Parameter creation and set_weight_attrs path and the register_parameter("bias", None) path; ensure any forward logic continues to check bias via self.bias is not None (e.g., RowParallelLinear.forward) and no other code references bias_flag.pymllm/layers/__init__.py (1)
18-35: Sort__all__alphabetically (Ruff RUF022).Ruff reports
__all__is not in isort-style sorted order, which will cause CI linting to fail.♻️ Proposed fix
__all__ = [ - "MllmBaseLayer", - "set_weight_attrs", - "VocabParallelEmbedding", "ColumnParallelLinear", + "GemmaRMSNorm", + "LayerNorm", "Linear", - "RowParallelLinear", "MLP", + "MllmBaseLayer", "ParallelMLP", - "LayerNorm", "RMSNorm", - "GemmaRMSNorm", - "apply_rope", + "RowParallelLinear", + "VocabParallelEmbedding", "apply_llama31_rope", - "apply_rope_pos_ids", "apply_llama31_rope_pos_ids", + "apply_rope", + "apply_rope_pos_ids", "apply_rope_with_cos_sin_cache", + "set_weight_attrs", ]As per coding guidelines, "Adhere to language-specific best practices and idioms (e.g., PEP 8 for Python)" — maintaining a sorted
__all__is the idiomatic convention enforced by the project's Ruff configuration (RUF022).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/layers/__init__.py` around lines 18 - 35, The __all__ list in the module is not alphabetically sorted and triggers Ruff RUF022; reorder the entries in the __all__ list (the list containing "MllmBaseLayer", "set_weight_attrs", "VocabParallelEmbedding", "ColumnParallelLinear", "Linear", "RowParallelLinear", "MLP", "ParallelMLP", "LayerNorm", "RMSNorm", "GemmaRMSNorm", "apply_rope", "apply_llama31_rope", "apply_rope_pos_ids", "apply_llama31_rope_pos_ids", "apply_rope_with_cos_sin_cache") into strict alphabetical order (by string) to satisfy isort-style sorting and eliminate the lint error.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pymllm/layers/base.py`:
- Line 4: Remove the unused import set_weight_attrs: delete the line that
imports set_weight_attrs from pymllm.layers.utils in this module (the import of
set_weight_attrs is unused in this file, so just remove that identifier to clean
up imports).
In `@pymllm/layers/embedding.py`:
- Line 31: The parameter annotation for padding_idx in pymllm.layers.embedding
(the function/class signature where padding_idx: int = None is declared) uses an
implicit Optional which violates PEP 484; update the annotation to be explicit
by changing padding_idx to either "int | None = None" (and add "from __future__
import annotations" at the top) or to "Optional[int] = None" and add "from
typing import Optional" to imports so the type is explicit and Ruff RUF013 is
resolved.
---
Nitpick comments:
In `@pymllm/layers/__init__.py`:
- Around line 18-35: The __all__ list in the module is not alphabetically sorted
and triggers Ruff RUF022; reorder the entries in the __all__ list (the list
containing "MllmBaseLayer", "set_weight_attrs", "VocabParallelEmbedding",
"ColumnParallelLinear", "Linear", "RowParallelLinear", "MLP", "ParallelMLP",
"LayerNorm", "RMSNorm", "GemmaRMSNorm", "apply_rope", "apply_llama31_rope",
"apply_rope_pos_ids", "apply_llama31_rope_pos_ids",
"apply_rope_with_cos_sin_cache") into strict alphabetical order (by string) to
satisfy isort-style sorting and eliminate the lint error.
In `@pymllm/layers/linear.py`:
- Around line 86-111: The weight_loader methods use assert for shape checks
which are disabled with python -O; replace the asserts in
ColumnParallelLinear.weight_loader (and the same pattern in
RowParallelLinear.weight_loader and VocabParallelEmbedding.weight_loader) with
explicit runtime checks that raise a ValueError containing the expected and
actual shapes and context (e.g., include param.data.shape and
loaded_weight.shape or shard_weight.shape and mention shard vs full load),
keeping the same control flow for the tp_size/output_dim branch and copying
behavior when shapes match.
- Around line 72-84: The classes ColumnParallelLinear and RowParallelLinear
define and set self.bias_flag but never read it; remove this redundant attribute
and rely on the existing bias presence check (if self.bias is not None). Update
the constructors in ColumnParallelLinear and RowParallelLinear to stop assigning
self.bias_flag in both branches (when bias is truthy and when registering None
via register_parameter("bias", None)), leaving only the existing Parameter
creation and set_weight_attrs path and the register_parameter("bias", None)
path; ensure any forward logic continues to check bias via self.bias is not None
(e.g., RowParallelLinear.forward) and no other code references bias_flag.
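A minimal sketch of the bias handling described above, using a simplified non-parallel layer rather than the project's actual ColumnParallelLinear/RowParallelLinear; the presence check on `self.bias` replaces the removed flag:

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch.nn import Parameter


class LinearSketch(nn.Module):
    """Simplified stand-in showing bias handling without a separate bias_flag."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True) -> None:
        super().__init__()
        self.weight = Parameter(torch.empty(out_features, in_features))
        if bias:
            self.bias = Parameter(torch.zeros(out_features))
        else:
            # Keeps `self.bias` resolvable (as None) without any extra flag.
            self.register_parameter("bias", None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = F.linear(x, self.weight)
        if self.bias is not None:  # presence check instead of bias_flag
            out = out + self.bias
        return out
```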
In `@pymllm/layers/rms_norm.py`:
- Around line 19-20: RMSNorm and GemmaRMSNorm initialize self.weight with
torch.empty which may contain garbage if weights aren't loaded; change both
initializations to use a safe identity default (torch.ones) so the Parameter is
created with ones (e.g., replace Parameter(torch.empty(hidden_size)) with
Parameter(torch.ones(hidden_size))) while keeping the subsequent
set_weight_attrs(self.weight, {"weight_loader": self.weight_loader}) calls
intact.
| import torch | ||
| from torch import nn | ||
| from torch.nn import Parameter | ||
| from pymllm.layers.utils import set_weight_attrs |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if the file exists and view the first 20 lines to see the import
echo "=== First 20 lines of pymllm/layers/base.py ==="
head -20 pymllm/layers/base.py
echo ""
echo "=== Search for any usage of set_weight_attrs in the file ==="
rg -n 'set_weight_attrs' pymllm/layers/base.py || echo "No matches found"
echo ""
echo "=== Full file line count ==="
wc -l pymllm/layers/base.py
Repository: UbiquitousLearning/mllm
Length of output: 971
Remove unused import set_weight_attrs.
Line 4 imports set_weight_attrs from pymllm.layers.utils, but it is never used in this file. Remove the import to keep the base module clean.
♻️ Proposed fix
-from pymllm.layers.utils import set_weight_attrs
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| from pymllm.layers.utils import set_weight_attrs |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/layers/base.py` at line 4, Remove the unused import set_weight_attrs:
delete the line that imports set_weight_attrs from pymllm.layers.utils in this
module (the import of set_weight_attrs is unused in this file, so just remove
that identifier to clean up imports).
| self, | ||
| num_embeddings: int, | ||
| embedding_dim: int, | ||
| padding_idx: int = None, |
PEP 484 violation: implicit Optional for padding_idx.
padding_idx: int = None should be padding_idx: int | None = None (or Optional[int] = None). Ruff correctly flags this as RUF013. The file doesn't currently import from __future__ import annotations or Optional, so either add the import or use the Optional form.
✏️ Proposed fix
+from __future__ import annotations
+
import torch
import torch.nn.functional as F
from torch.nn import Parameter
- padding_idx: int = None,
+ padding_idx: int | None = None,🧰 Tools
🪛 Ruff (0.15.1)
[warning] 31-31: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/layers/embedding.py` at line 31, The parameter annotation for
padding_idx in pymllm.layers.embedding (the function/class signature where
padding_idx: int = None is declared) uses an implicit Optional which violates
PEP 484; update the annotation to be explicit by changing padding_idx to either
"int | None = None" (and add "from __future__ import annotations" at the top) or
to "Optional[int] = None" and add "from typing import Optional" to imports so
the type is explicit and Ruff RUF013 is resolved.
- Introduced a new architecture diagram image `pymllm-arch.png` in the assets directory. - Updated `README.md` to include the architecture diagram. - Created initial `launch.py` files in both the engine and server directories for future functionality. - Added an empty `scheduler.py` file in the orchestrator directory to support scheduling features.
- Updated `apache-tvm-ffi` version to `0.1.8.post2` in both `pyproject.toml` files. - Added `pyzmq` to the optional `cuda` dependencies in `pymllm`. - Introduced `pymllm-server` script for server launch functionality. - Refactored configuration imports in `pymllm/configs/__init__.py` to streamline access to model and quantization configurations. - Created new configuration files for model and quantization settings to support enhanced model management.
Actionable comments posted: 13
🧹 Nitpick comments (14)
pymllm/tests/test_vocab_parallel_embedding.py (5)
87-87:world_sizeis unused in both worker functions.Both
embedding_forward_tp8_worker_cudaandweight_loading_tp8_worker_cudadeclareworld_sizebut never reference it (Ruff ARG001). Remove the parameter from both signatures and the corresponding call sites inrun_worker_tp8_cuda.Also applies to: 147-147
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/tests/test_vocab_parallel_embedding.py` at line 87, Remove the unused world_size parameter from the worker function signatures and their invocations: delete the world_size argument from embedding_forward_tp8_worker_cuda and weight_loading_tp8_worker_cuda, and update run_worker_tp8_cuda (and any other call sites) to call these functions with only (rank, local_rank); ensure any function definitions, type hints, and references to world_size are removed so no unused-argument lint errors remain.
197-277: Both test methods are identical boilerplate — extract a helper.
test_forward_pass_tp8_realandtest_weight_loading_tp8_realshare ~30 lines of identical process-spawning, joining, and result-checking logic. Extract a shared_run_tp8_test(test_func)helper to eliminate the duplication.♻️ Proposed refactor
+ def _run_tp8_test(self, test_func: Callable, port: int) -> None: + world_size = 8 + manager = mp.Manager() + return_dict = manager.dict() + processes = [ + mp.Process( + target=run_worker_tp8_cuda, + args=(rank, rank, world_size, test_func, return_dict, port), + ) + for rank in range(world_size) + ] + for p in processes: + p.start() + for p in processes: + p.join(timeout=120) + if p.is_alive(): + p.terminate() + p.join() + for rank in range(world_size): + result = return_dict.get(rank, "TIMEOUT") + expected = "PASSED" if rank == 0 else "OK" + assert result == expected, f"Rank {rank} failed: {result}" def test_forward_pass_tp8_real(self): """Test forward pass with real TP=8 using 8 processes on CUDA.""" - world_size = 8 - local_world_size = 8 - mp.set_start_method("spawn", force=True) - manager = mp.Manager() - return_dict = manager.dict() - processes = [] - for rank in range(world_size): - local_rank = rank - p = mp.Process(...) - p.start() - processes.append(p) - for p in processes: - p.join(timeout=120) - ... - for rank in range(world_size): - ... + self._run_tp8_test(embedding_forward_tp8_worker_cuda, _get_free_port())🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 197 - 277, Both tests duplicate the same multiprocessing boilerplate; extract a helper _run_tp8_test(test_worker_fn) that encapsulates setting mp.set_start_method("spawn", force=True), creating mp.Manager/return_dict, spawning mp.Process for ranks 0..7 with local_rank==rank calling run_worker_tp8_cuda with the provided worker function (embedding_forward_tp8_worker_cuda or weight_loading_tp8_worker_cuda), joining/terminating processes with timeout, and performing the same assertions on return_dict; then replace test_forward_pass_tp8_real to call _run_tp8_test(embedding_forward_tp8_worker_cuda) and test_weight_loading_tp8_real to call _run_tp8_test(weight_loading_tp8_worker_cuda).
48-48: Unusedlocal_world_sizeparameter;import tracebackshould be top-level.
local_world_sizeis never used in the function body (Ruff ARG001). Either remove it or document its intended use.import tracebackinside theexceptblock works but defers the import unnecessarily. Move it to the top of the file alongside other stdlib imports.✏️ Proposed fix
+import traceback import os import logging ...def run_worker_tp8_cuda( rank: int, local_rank: int, world_size: int, - local_world_size: int, test_func: Callable, return_dict: dict, + master_port: int = 29500, ):except Exception as e: - import traceback - return_dict[rank] = f"ERROR: {e}\n{traceback.format_exc()}"Also applies to: 79-82
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/tests/test_vocab_parallel_embedding.py` at line 48, Remove the unused local_world_size parameter from the function signature (it's unused and triggers ARG001) or if it was intended for test configuration, add a comment explaining its purpose and use it accordingly; also move the import traceback out of the except block(s) and place it with the other top-level stdlib imports so traceback is imported once at module import time (look for occurrences of local_world_size in the test function signature and the traceback import inside the except blocks around the same test).
293-308: Test methods inTestVocabParallelEmbeddingCUDAare missing docstrings.
test_cuda_forwardandtest_cuda_weight_loaderhave no docstrings, inconsistent with the other worker functions in the file and with the coding guidelines requiring docstrings for public functions.✏️ Proposed fix
def test_cuda_forward(self): + """Verify forward pass shape and device placement in TP=1 CUDA mode.""" layer = VocabParallelEmbedding(1000, 512).cuda() def test_cuda_weight_loader(self): + """Verify that weight_loader correctly copies weights in TP=1 CUDA mode.""" layer = VocabParallelEmbedding(100, 64).cuda()As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 293 - 308, Add concise docstrings to the two test methods test_cuda_forward and test_cuda_weight_loader in TestVocabParallelEmbeddingCUDA: for test_cuda_forward describe that it verifies the VocabParallelEmbedding moves to CUDA and returns a tensor of shape (batch, seq, embed_dim) on GPU; for test_cuda_weight_loader describe that it checks load_weight correctly copies a CPU tensor into the module weight on CUDA and that values match after moving back to CPU; keep them short (one or two sentences) and follow the file's existing docstring style.
98-98: Replace magic numbers with named constants.
8(TP size),29500(master port),1024/64(vocab/embed dims) appear as bare literals in multiple places. Define module-level constants to make intent clear and changes local.✏️ Proposed fix
+_TP_SIZE_TEST = 8 +_MASTER_PORT_BASE = 29500 +_VOCAB_SIZE = 1024 +_EMBED_DIM = 64Then replace occurrences like
vocab_size // 8→_VOCAB_SIZE // _TP_SIZE_TEST, etc.As per coding guidelines, "Use named constants instead of magic numbers."
Also applies to: 111-111, 167-167, 177-177
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/tests/test_vocab_parallel_embedding.py` at line 98, Replace the hard-coded literals with module-level named constants: introduce constants such as _TP_SIZE_TEST (for TP size 8), _MASTER_PORT_TEST (for 29500), _VOCAB_SIZE (for 29500-related vocab), and _EMBED_DIM/_TOKEN_DIM (for 1024/64) at the top of the test module and use those constants where the literals appear (e.g., replace tp_size == 8 -> tp_size == _TP_SIZE_TEST, vocab_size // 8 -> _VOCAB_SIZE // _TP_SIZE_TEST, and any uses of 1024/64 -> _EMBED_DIM/_TOKEN_DIM, and replace the master port literal with _MASTER_PORT_TEST); update all occurrences referenced in the review (the assertions and calculations around tp_size, master port, vocab/embed dimensions) so intent is clear and changes remain local to the test module.pymllm/orchestrator/parallel_state.py (3)
50-56: Replaceassertwith an explicitValueErrorfor input validation.
assertstatements are stripped when Python is run with the-Oflag, silently disabling this critical product-size check. As per coding guidelines, functions that can fail should raise exceptions.🛡️ Proposed fix
- assert ( - tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size - == world_size - ), ( - f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * " - f"PP({pipeline_model_parallel_size}) != World({world_size})" - ) + if ( + tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size + != world_size + ): + raise ValueError( + f"TP({tensor_model_parallel_size}) * DP({data_parallel_size}) * " + f"PP({pipeline_model_parallel_size}) != World({world_size})" + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/parallel_state.py` around lines 50 - 56, Replace the runtime-only assert with explicit input validation that raises a ValueError: check that tensor_model_parallel_size * data_parallel_size * pipeline_model_parallel_size == world_size and if not raise ValueError with the same descriptive message (use the existing f-string built from tensor_model_parallel_size, data_parallel_size, pipeline_model_parallel_size, and world_size) so the check remains active even when Python is run with -O; update the code around the current assert in parallel_state.py accordingly.
30-35:initialize_model_paralleland communication helpers are missing docstrings.The primary public entry point and the three communication helpers (
tensor_model_parallel_all_reduce,tensor_model_parallel_all_gather,data_parallel_all_reduce) have no docstrings documenting parameters, behavior, or the passthrough semantics for the world_size=1 case. As per coding guidelines, public APIs, classes, and functions must have clear docstrings explaining purpose, parameters, returns, and errors.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/parallel_state.py` around lines 30 - 35, Add comprehensive docstrings to the public API function initialize_model_parallel and the three communication helper functions tensor_model_parallel_all_reduce, tensor_model_parallel_all_gather, and data_parallel_all_reduce: for each docstring describe the function’s purpose, parameters (including tensor_model_parallel_size, data_parallel_size, pipeline_model_parallel_size, backend, and the tensor argument for helpers), return value, raised errors, and the passthrough semantics when world_size == 1 (i.e., no-op behavior / return input unchanged). Use the project’s docstring style (short summary line, parameter list, returns, and notes about behavior), and place them immediately above the corresponding def lines (initialize_model_parallel, tensor_model_parallel_all_reduce, tensor_model_parallel_all_gather, data_parallel_all_reduce).
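As an illustration of the requested documentation, a hedged sketch of how one of these helpers could be documented; the `_TP_GROUP`/`_TP_SIZE` placeholders stand in for the module's actual globals:

```python
import torch
import torch.distributed as dist

_TP_GROUP = None  # placeholder for the tensor-parallel process group handle
_TP_SIZE = 1      # placeholder for the tensor-parallel world size


def tensor_model_parallel_all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    """All-reduce ``tensor`` across the tensor-parallel group.

    Args:
        tensor: The local tensor to reduce; modified in place by the collective.

    Returns:
        The reduced tensor. When the TP group has a single member
        (world size 1), the input is returned unchanged and no
        collective call is issued.
    """
    if _TP_GROUP is None or _TP_SIZE == 1:
        return tensor
    dist.all_reduce(tensor, group=_TP_GROUP)
    return tensor
```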
155-156:model_parallel_is_initialized()returnsFalsefor the primary single-GPU use case.When
initialize_model_parallel()is called with all sizes equal to1(the documented default/expected path), no groups are created, so this predicate always returnsFalse. Callers cannot distinguish "never initialized" from "initialized for single-GPU mode", which can silently suppress downstream logic that gates on this check.♻️ Proposed fix: use a dedicated flag
+_INITIALIZED: bool = False def initialize_model_parallel( ... ) -> None: - global _TP_GROUP, _DP_GROUP, _PP_GROUP - global _TP_RANK, _TP_SIZE, _DP_RANK, _DP_SIZE, _PP_RANK, _PP_SIZE + global _TP_GROUP, _DP_GROUP, _PP_GROUP, _INITIALIZED + global _TP_RANK, _TP_SIZE, _DP_RANK, _DP_SIZE, _PP_RANK, _PP_SIZE ... + _INITIALIZED = True def model_parallel_is_initialized() -> bool: - return _TP_GROUP is not None or _DP_GROUP is not None or _PP_GROUP is not None + return _INITIALIZED🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/parallel_state.py` around lines 155 - 156, The current model_parallel_is_initialized() wrongly infers initialization by checking _TP_GROUP/_DP_GROUP/_PP_GROUP and thus returns False for the valid single-GPU path; add a dedicated boolean flag (e.g., _MODEL_PARALLEL_INITIALIZED) that initialize_model_parallel() sets to True unconditionally when it completes (even if it creates no groups), have model_parallel_is_initialized() return that flag instead of checking the group variables, and ensure any teardown/reset code clears _MODEL_PARALLEL_INITIALIZED so future checks reflect actual state; reference initialize_model_parallel, model_parallel_is_initialized, _TP_GROUP, _DP_GROUP, and _PP_GROUP when applying this change.pymllm/orchestrator/request_response_process.py (1)
8-10: Add a class-level docstring toRequestResponseProcess.The module docstring explains the intent but the class itself lacks a docstring, violating the coding guideline for public APIs.
💡 Suggested scaffold
class RequestResponseProcess: - def __init__(self): - pass + """Main orchestrator thread handling incoming requests and outgoing responses. + + Must be used only as the main thread of the orchestrator. + """ + + def __init__(self) -> None: + passAs per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/request_response_process.py` around lines 8 - 10, Add a class-level docstring to RequestResponseProcess describing its purpose and behavior as a public API entry point; update the class definition (class RequestResponseProcess and its __init__ method) to include a concise docstring that explains what the class does, expected inputs/state, any important attributes, return behavior of its public methods, and possible errors/exceptions raised so callers and docs are clear.pymllm/configs/quantization_config.py (1)
9-18: Consider adding__post_init__runtime validation forkv_cache_dtype.
Literaltype hints are not enforced at runtime, so an invalid string (e.g.,"float32") would silently be accepted and only fail later when the engine tries to use it.💡 Suggested addition
+ _VALID_KV_DTYPES = {"auto", "float16", "bfloat16", "fp8_e4m3", "fp8_e5m2"} + + def __post_init__(self) -> None: + if self.kv_cache_dtype not in self._VALID_KV_DTYPES: + raise ValueError( + f"Invalid kv_cache_dtype {self.kv_cache_dtype!r}. " + f"Must be one of {self._VALID_KV_DTYPES}." + )As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/configs/quantization_config.py` around lines 9 - 18, The kv_cache_dtype field on QuantizationConfig is only type-hinted and can accept invalid runtime strings; add a __post_init__ method on the QuantizationConfig dataclass that validates self.kv_cache_dtype is one of the allowed literals ("auto","float16","bfloat16","fp8_e4m3","fp8_e5m2") and raise a ValueError with a clear message if not; place this validation inside the QuantizationConfig class so it runs on instantiation and include the invalid value in the error for easier debugging.pymllm/configs/model_config.py (1)
28-31: Shorten theAttributeErrormessage to resolve Ruff TRY003.The multi-line string concatenation in the
raiseis flagged by Ruff (TRY003). A tighter single-line message resolves the warning without losing clarity.💡 Proposed fix
- raise AttributeError( - f"'{type(self).__name__}' has no attribute '{name}' " - f"(also not found on hf_config)" - ) + raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/configs/model_config.py` around lines 28 - 31, The raised AttributeError message in the code uses multi-line concatenation which triggers Ruff TRY003; update the raise in the relevant method (e.g., the __getattr__ in model_config.py that references type(self).__name__, name and hf_config) to a single-line f-string such as: raise AttributeError(f"'{type(self).__name__}' has no attribute '{name}' (also not found on hf_config)") so the message remains clear but is a single string literal.pymllm/orchestrator/tokenizer_process.py (1)
1-3: Add module and class docstrings.
TokenizerProcessis a public API class but has neither a module-level docstring nor a class docstring explaining its purpose and intended lifecycle. The same pattern applies to all four of the other bare placeholder files in this orchestrator (AsyncDiskIoProcess,DetokenizerProcess,ModelRunnerProcess).💡 Suggested scaffold
+"""Tokenizer worker process for the orchestrator pipeline.""" + + class TokenizerProcess: - def __init__(self): - pass + """Runs tokenization in a dedicated process. + + Converts raw text requests into token sequences and forwards them + to the model runner. Intended to be managed by the orchestrator. + """ + + def __init__(self) -> None: + passAs per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/tokenizer_process.py` around lines 1 - 3, Add a module-level docstring at the top of tokenizer_process.py describing the module's role in the orchestrator (responsible for running tokenization in a separate process and its expected lifecycle), and add a class docstring to TokenizerProcess that explains its purpose, public API expectations, lifecycle (init, start/stop behavior if any), parameters (none for __init__ currently), return values, and possible errors; apply the same pattern to the other placeholder classes mentioned (AsyncDiskIoProcess, DetokenizerProcess, ModelRunnerProcess) so each public class and its module have clear, concise docstrings documenting usage and responsibilities.pymllm/configs/__init__.py (1)
8-14:__all__is not sorted (Ruff RUF022).♻️ Proposed fix
__all__ = [ - "GlobalConfig", - "get_global_config", - "ServerConfig", - "ModelConfig", "QuantizationConfig", + "GlobalConfig", + "ModelConfig", + "QuantizationConfig", + "ServerConfig", + "get_global_config", ]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/configs/__init__.py` around lines 8 - 14, The __all__ list in pymllm/configs/__init__.py is not alphabetically ordered; update the __all__ variable so its items are sorted lexicographically (e.g., arrange "GlobalConfig", "ModelConfig", "QuantizationConfig", "ServerConfig", "get_global_config" or whatever alphabetical order is correct for the exact names) to satisfy Ruff RUF022 and ensure predictable exports; locate the __all__ definition and reorder the entries accordingly.pymllm/configs/global_config.py (1)
281-295: DuplicateServerConfigimport insideread_args.
ServerConfigis already imported at the top of the module (line 21). The in-function import on line 282 is redundant.♻️ Proposed fix
# Server: reconstruct to preserve validation behavior. - from pymllm.configs.server_config import ServerConfig - server_updates: dict[str, Any] = {}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/configs/global_config.py` around lines 281 - 295, Remove the redundant in-function import of ServerConfig inside the read_args block and rely on the module-level import already present; specifically delete the line "from pymllm.configs.server_config import ServerConfig" within the code that builds server_updates (the block that references cfg.server, server_updates, server_values, and reassigns cfg.server = ServerConfig(**server_values)), and run linters/tests to ensure no unused-import or name-resolution issues remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pymllm/configs/global_config.py`:
- Around line 307-316: The code mutates cfg.server.model_path directly which
bypasses ServerConfig.__post_init__ (so tokenizer_path stays stale); instead of
assigning cfg.server.model_path = cfg.model.model_path, reinstantiate the
ServerConfig (e.g., replace cfg.server with ServerConfig(... ) or call its
constructor/validator) using the updated model_path so __post_init__ runs and
tokenizer_path is set correctly; update the branch that checks
model_model_overridden and server_model_overridden to reconstruct cfg.server
from cfg.server fields with model_path = cfg.model.model_path rather than doing
a bare field assignment.
In `@pymllm/configs/server_config.py`:
- Around line 97-101: The code currently sets served_model_name =
str(self.model_path) even when model_path is None, producing the string "None"
and masking an unset model; update the initialization so served_model_name is
only assigned from model_path when model_path is not None (e.g. if
self.served_model_name is None and self.model_path is not None:
self.served_model_name = str(self.model_path)), leaving served_model_name as
None otherwise; keep the existing tokenizer_path handling and ensure
ServerConfig._validate() is still called afterwards.
- Line 43: Change the type annotation of the field max_prefill_tokens in
server_config.py from int to Optional[int] to match its default None and the
existing guard in _validate; import Optional from typing (or use
typing.Optional) if not already imported, and update any related type
hints/annotations that reference max_prefill_tokens to use Optional[int] so
static type checkers no longer report that int cannot be None.
In `@pymllm/engine/launch.py`:
- Around line 29-32: Replace the TODO marker in the _launch_processes method
docstring with the required format by changing "TODO issue processes here" to
"TODO: issue processes here" (or expand it to a real actionable sentence),
ensuring the TODO: prefix is used as per guidelines in the _launch_processes
function/method documentation.
- Around line 5-15: The file imports unused symbols that create unconditional
ImportError risk; move the imports for zmq, torch.multiprocessing as mp,
TokenizerProcess, DetokenizerProcess, ModelRunnerProcess, and AsyncDiskIoProcess
into a typing-only guard (e.g., inside a TYPE_CHECKING block) or remove them
until they are actually needed so importing pymllm.engine.launch won’t fail when
optional dependencies like pyzmq/torch aren’t installed; update the top of the
module to from typing import TYPE_CHECKING and place those six imports inside if
TYPE_CHECKING: (or delete them) while leaving used imports (AutoConfig,
snapshot_download, get_global_config, RequestResponseProcess) untouched.
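A minimal sketch of the TYPE_CHECKING guard described in the entry above; the guarded module paths are taken from that entry and assumed rather than verified against the tree:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed for type annotations; avoids a hard runtime dependency
    # on pyzmq/torch when pymllm.engine.launch is imported.
    import torch.multiprocessing as mp  # noqa: F401
    import zmq  # noqa: F401

    from pymllm.orchestrator.async_disk_io_process import AsyncDiskIoProcess  # noqa: F401
    from pymllm.orchestrator.detokenizer_process import DetokenizerProcess  # noqa: F401
    from pymllm.orchestrator.model_runner_process import ModelRunnerProcess  # noqa: F401
    from pymllm.orchestrator.tokenizer_process import TokenizerProcess  # noqa: F401
```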
In `@pymllm/orchestrator/parallel_state.py`:
- Around line 36-41: Reset and destroy stale global parallel group and rank
state at the start of the initializer: at function entry clear _TP_GROUP,
_DP_GROUP, _PP_GROUP and their rank/size globals (_TP_RANK, _TP_SIZE, _DP_RANK,
_DP_SIZE, _PP_RANK, _PP_SIZE) by checking if an existing ProcessGroup exists and
calling torch.distributed.destroy_process_group on it before reassigning new
values; then proceed to set _TP_SIZE/_DP_SIZE/_PP_SIZE and recreate the groups
so the accessor functions won't return stale ProcessGroup objects. Ensure the
logic references the existing globals (_TP_GROUP, _DP_GROUP, _PP_GROUP) and uses
torch.distributed.destroy_process_group() to clean them up before replacement.
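A sketch of the teardown step this entry asks for; `_reset_parallel_groups` is a hypothetical helper name and the globals mirror the module's described state:

```python
import torch.distributed as dist

_TP_GROUP = None  # placeholders for the module-level group handles
_DP_GROUP = None
_PP_GROUP = None


def _reset_parallel_groups() -> None:
    """Destroy any previously created sub-groups before re-initialization."""
    global _TP_GROUP, _DP_GROUP, _PP_GROUP
    for group in (_TP_GROUP, _DP_GROUP, _PP_GROUP):
        if group is not None:
            dist.destroy_process_group(group)
    _TP_GROUP = _DP_GROUP = _PP_GROUP = None
```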
In `@pymllm/server/launch.py`:
- Around line 10-13: main() calls engine.launch() but Engine lacks that method;
add a public launch(self) method to the Engine class (pymllm/engine/launch.py)
that orchestrates startup by invoking the existing private helpers in the
correct order: call _set_default_torch_dtype(), _config_logging(),
_check_model_and_tokenizer(), _maybe_download(), and then _launch_processes();
ensure launch handles/propagates exceptions consistently with the rest of the
class (no signature change) so main() can call engine.launch() without raising
AttributeError.
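A skeleton of the public `launch()` wrapper suggested above; the helper names come from the entry itself and their bodies are elided:

```python
class Engine:
    """Skeleton only; the private helpers are the ones already in pymllm/engine/launch.py."""

    def _set_default_torch_dtype(self) -> None: ...
    def _config_logging(self) -> None: ...
    def _check_model_and_tokenizer(self) -> None: ...
    def _maybe_download(self) -> None: ...
    def _launch_processes(self) -> None: ...

    def launch(self) -> None:
        """Run the startup sequence so server/launch.py can call engine.launch()."""
        self._set_default_torch_dtype()
        self._config_logging()
        self._check_model_and_tokenizer()
        self._maybe_download()
        self._launch_processes()
```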
In `@pymllm/tests/test_vocab_parallel_embedding.py`:
- Line 202: mp.set_start_method("spawn", force=True) is being invoked inside
individual test functions; remove that call from the test (e.g., tests in
test_vocab_parallel_embedding.py) and instead set the start method in a
session-scoped pytest fixture in conftest.py (or wrap the call under a script
guard). Create a fixture (session scope) that calls mp.set_start_method("spawn",
force=True) once before tests run and ensure tests reference no direct call to
mp.set_start_method; alternatively protect the existing call with if __name__ ==
"__main__" so it never executes during pytest collection. Ensure you remove or
comment out any other in-test occurrences (the line at 243 as noted) so
multiprocessing start method is set only once at session setup.
- Around line 231-236: The loop that checks per-rank results uses
return_dict.get(rank, "TIMEOUT") which allows a timed-out worker to produce
"TIMEOUT" and slip past the non-zero-rank assertion; update the checks in the
block that iterates ranks (the loop using return_dict and rank) so that non-zero
ranks explicitly fail on the "TIMEOUT" sentinel (for example by asserting result
== "PASSED" or asserting result not in ("TIMEOUT",) and "ERROR" not in
str(result)); apply the same change to the second occurrence of this loop (the
one around lines 272-277) so hung/killed workers are treated as failures rather
than silently passing.
- Around line 62-63: Both tests (test_forward_pass_tp8_real and
test_weight_loading_tp8_real) hardcode os.environ["MASTER_PORT"]="29500",
causing TIME_WAIT collisions; instead, in the parent test process choose a free
port (e.g., via socket.bind(('',0)) to get an ephemeral port), set
os.environ["MASTER_PORT"] to that value in the parent, and pass it into the
spawned worker processes so each run uses a unique port; update the test
functions to compute the free port before spawning, thread that port through the
worker spawn args (or ensure worker reads MASTER_PORT from the parent env) and
remove the hardcoded "29500" usage.
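A small sketch of the ephemeral-port approach described above; `_get_free_port` matches the helper name used in the refactor suggestion earlier in this review:

```python
import os
import socket


def _get_free_port() -> int:
    """Ask the OS for a free ephemeral port so concurrent runs don't collide."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


# In the parent test process, before spawning the TP workers:
master_port = _get_free_port()
os.environ["MASTER_PORT"] = str(master_port)
```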
In `@pyproject.toml`:
- Line 36: Document that flashinfer-python is Linux-only by updating the cuda
extras entry in pyproject.toml: annotate the cuda = ["tilelang",
"flashinfer-python", "pyzmq"] extras list with either a platform environment
marker (e.g., mark "flashinfer-python" for sys_platform == "linux") or add an
inline comment explaining the Linux-only constraint so CI and contributors see
the limitation; modify the extras definition that contains the "cuda" key and
the "flashinfer-python" identifier accordingly.
- Around line 3-4: The pyproject declares conflicting pins for apache-tvm-ffi:
the build-system lists "apache-tvm-ffi == 0.1.8" while the runtime dependencies
pin "apache-tvm-ffi == 0.1.8.post2"; make them consistent by choosing one
canonical pin and updating the other entry to match (e.g., change the
build-system entry to "apache-tvm-ffi == 0.1.8.post2" or change the runtime pin
to "== 0.1.8"), ensuring both the build-system and the runtime dependencies
reference the exact same apache-tvm-ffi version.
- Around line 22-23: Remove the test-only packages "pytest" and "pytest-html"
from the main dependencies list and add them to an appropriate development/test
optional group (for example add them under [tool.poetry.dev-dependencies] or
under [project.optional-dependencies]["test"] depending on the project format),
ensuring the names "pytest" and "pytest-html" appear only in that dev/test
section and not in the runtime dependencies block.
---
Duplicate comments:
In `@pymllm/configs/global_config.py`:
- Around line 1-321: Strip all trailing spaces from this file (every line ending
with whitespace) so it conforms to the repo style; inspect and fix occurrences
inside the GlobalConfig class (including the `@dataclass` decorator line and blank
lines), and elsewhere in functions like make_args, read_args, _parse_bool,
_unwrap_optional, and get_global_config; you can verify with grep -Pn ' +$'
pymllm/configs/global_config.py and then remove the trailing whitespace (or run
an editor/formatter that trims trailing whitespace) and commit the cleaned file.
- Around line 47-65: The class-level singleton via __new__/get_instance/reset
causes dataclass __init__ to reset fields on each direct GlobalConfig() call;
remove the __new__, get_instance, and reset methods from GlobalConfig and
instead create a single module-level instance (e.g., _global_config =
GlobalConfig() right after the class definition) and expose an accessor
get_global_config() that returns that instance; alternatively, if you must keep
instantiation, protect __post_init__ by checking an _initialized flag at the
start and set it using object.__setattr__(self, "_initialized", True) so
dataclass __init__ won’t overwrite already-initialized sub-configs (reference
symbols: GlobalConfig, __new__, __post_init__, get_instance, reset, and
_initialized).
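A minimal sketch of the module-level singleton alternative described above; the class body is a stand-in with sub-config fields elided, and only the accessor name mirrors the existing get_global_config:

```python
from dataclasses import dataclass


@dataclass
class GlobalConfig:
    """Stand-in for the real GlobalConfig; sub-config fields elided."""

    _initialized: bool = False


# Created exactly once at import time; no __new__ / get_instance machinery needed.
_global_config = GlobalConfig()


def get_global_config() -> GlobalConfig:
    """Return the shared configuration instance."""
    return _global_config
```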
In `@pymllm/orchestrator/async_disk_io_process.py`:
- Around line 1-3: Add a module-level docstring at the top of
pymllm.orchestrator.async_disk_io_process describing the module's responsibility
(async disk I/O helper for the orchestrator) and add a class docstring for
AsyncDiskIoProcess that follows the same scaffold used in tokenizer_process.py:
one-line summary, short description of purpose, and brief notes about public
methods/usage. Ensure the wording mirrors TokenizerProcess's style and includes
any important behavior or thread/async considerations for AsyncDiskIoProcess.
In `@pymllm/orchestrator/detokenizer_process.py`:
- Around line 1-3: Add a module-level docstring at the top of
pymllm/orchestrator/detokenizer_process.py describing the module purpose, and
add a class docstring to the DetokenizerProcess class that follows the same
scaffold used in tokenizer_process.py (brief summary, responsibilities, any key
methods or usage notes); ensure the wording and style match the existing
tokenizer_process.py docstrings for consistency.
In `@pymllm/orchestrator/model_runner_process.py`:
- Around line 1-3: Add a module-level docstring at the top of
pymllm/orchestrator/model_runner_process.py describing the module's purpose and
responsibilities, and add a class docstring for ModelRunnerProcess that explains
what the class does, its primary responsibilities, and any important usage
notes; follow the same scaffold and tone used in tokenizer_process.py so the
module and the ModelRunnerProcess class docstrings are consistent with the
project's documentation style.
In `@pymllm/orchestrator/parallel_state.py`:
- Around line 98-110: The PP group construction currently uses contiguous ranges
which collides with TP group partitioning when tensor_model_parallel_size > 1;
update the ranks computation in the loop that builds _PP_GROUP so each pipeline
group picks ranks by striding across the world (e.g. ranks = [i + k *
num_pp_groups for k in range(pipeline_model_parallel_size)]) instead of
list(range(start, start + pipeline_model_parallel_size)), keeping the same loop
and using GroupCoordinator to set _PP_GROUP and _PP_RANK; ensure you still use
world_size, world_rank, local_rank and backend as before and break when the
matching group is found.
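A quick worked example of the strided construction suggested above, assuming world_size=8, TP=4, PP=2 (DP=1); it shows that each pipeline group takes one rank from every contiguous tensor-parallel block:

```python
world_size = 8
tensor_model_parallel_size = 4
pipeline_model_parallel_size = 2
num_pp_groups = world_size // pipeline_model_parallel_size  # 4 pipeline groups

tp_groups = [
    list(range(i * tensor_model_parallel_size, (i + 1) * tensor_model_parallel_size))
    for i in range(world_size // tensor_model_parallel_size)
]
pp_groups = [
    [i + k * num_pp_groups for k in range(pipeline_model_parallel_size)]
    for i in range(num_pp_groups)
]
print(tp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(pp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
# Each PP group spans both TP blocks; contiguous ranges such as [0, 1]
# would fall entirely inside a single TP group and collide with it.
```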
In `@pymllm/tests/test_vocab_parallel_embedding.py`:
- Around line 164-187: The reconstruction check is self-referential because each
rank creates its own full_weight via torch.randn; to fix, set a deterministic
RNG seed before creating full_weight so all processes generate the same tensor:
call torch.manual_seed(42) immediately before the line full_weight =
torch.randn(vocab_size, embed_dim) (so load_weight, layer.weight,
expected_shard, and the rank‑0 reconstruction use the same source tensor across
ranks).
---
Nitpick comments:
In `@pymllm/configs/__init__.py`:
- Around line 8-14: The __all__ list in pymllm/configs/__init__.py is not
alphabetically ordered; update the __all__ variable so its items are sorted
lexicographically (e.g., arrange "GlobalConfig", "ModelConfig",
"QuantizationConfig", "ServerConfig", "get_global_config" or whatever
alphabetical order is correct for the exact names) to satisfy Ruff RUF022 and
ensure predictable exports; locate the __all__ definition and reorder the
entries accordingly.
In `@pymllm/configs/global_config.py`:
- Around line 281-295: Remove the redundant in-function import of ServerConfig
inside the read_args block and rely on the module-level import already present;
specifically delete the line "from pymllm.configs.server_config import
ServerConfig" within the code that builds server_updates (the block that
references cfg.server, server_updates, server_values, and reassigns cfg.server =
ServerConfig(**server_values)), and run linters/tests to ensure no unused-import
or name-resolution issues remain.
In `@pymllm/configs/model_config.py`:
- Around line 28-31: The raised AttributeError message in the code uses
multi-line concatenation which triggers Ruff TRY003; update the raise in the
relevant method (e.g., the __getattr__ in model_config.py that references
type(self).__name__, name and hf_config) to a single-line f-string such as:
raise AttributeError(f"'{type(self).__name__}' has no attribute '{name}' (also
not found on hf_config)") so the message remains clear but is a single string
literal.
In `@pymllm/configs/quantization_config.py`:
- Around line 9-18: The kv_cache_dtype field on QuantizationConfig is only
type-hinted and can accept invalid runtime strings; add a __post_init__ method
on the QuantizationConfig dataclass that validates self.kv_cache_dtype is one of
the allowed literals ("auto","float16","bfloat16","fp8_e4m3","fp8_e5m2") and
raise a ValueError with a clear message if not; place this validation inside the
QuantizationConfig class so it runs on instantiation and include the invalid
value in the error for easier debugging.
In `@pymllm/orchestrator/parallel_state.py`:
- Around line 50-56: Replace the runtime-only assert with explicit input
validation that raises a ValueError: check that tensor_model_parallel_size *
data_parallel_size * pipeline_model_parallel_size == world_size and if not raise
ValueError with the same descriptive message (use the existing f-string built
from tensor_model_parallel_size, data_parallel_size,
pipeline_model_parallel_size, and world_size) so the check remains active even
when Python is run with -O; update the code around the current assert in
parallel_state.py accordingly.
- Around line 30-35: Add comprehensive docstrings to the public API function
initialize_model_parallel and the three communication helper functions
tensor_model_parallel_all_reduce, tensor_model_parallel_all_gather, and
data_parallel_all_reduce: for each docstring describe the function’s purpose,
parameters (including tensor_model_parallel_size, data_parallel_size,
pipeline_model_parallel_size, backend, and the tensor argument for helpers),
return value, raised errors, and the passthrough semantics when world_size == 1
(i.e., no-op behavior / return input unchanged). Use the project’s docstring
style (short summary line, parameter list, returns, and notes about behavior),
and place them immediately above the corresponding def lines
(initialize_model_parallel, tensor_model_parallel_all_reduce,
tensor_model_parallel_all_gather, data_parallel_all_reduce).
- Around line 155-156: The current model_parallel_is_initialized() wrongly
infers initialization by checking _TP_GROUP/_DP_GROUP/_PP_GROUP and thus returns
False for the valid single-GPU path; add a dedicated boolean flag (e.g.,
_MODEL_PARALLEL_INITIALIZED) that initialize_model_parallel() sets to True
unconditionally when it completes (even if it creates no groups), have
model_parallel_is_initialized() return that flag instead of checking the group
variables, and ensure any teardown/reset code clears _MODEL_PARALLEL_INITIALIZED
so future checks reflect actual state; reference initialize_model_parallel,
model_parallel_is_initialized, _TP_GROUP, _DP_GROUP, and _PP_GROUP when applying
this change.
In `@pymllm/orchestrator/request_response_process.py`:
- Around line 8-10: Add a class-level docstring to RequestResponseProcess
describing its purpose and behavior as a public API entry point; update the
class definition (class RequestResponseProcess and its __init__ method) to
include a concise docstring that explains what the class does, expected
inputs/state, any important attributes, return behavior of its public methods,
and possible errors/exceptions raised so callers and docs are clear.
In `@pymllm/orchestrator/tokenizer_process.py`:
- Around line 1-3: Add a module-level docstring at the top of
tokenizer_process.py describing the module's role in the orchestrator
(responsible for running tokenization in a separate process and its expected
lifecycle), and add a class docstring to TokenizerProcess that explains its
purpose, public API expectations, lifecycle (init, start/stop behavior if any),
parameters (none for __init__ currently), return values, and possible errors;
apply the same pattern to the other placeholder classes mentioned
(AsyncDiskIoProcess, DetokenizerProcess, ModelRunnerProcess) so each public
class and its module have clear, concise docstrings documenting usage and
responsibilities.
In `@pymllm/tests/test_vocab_parallel_embedding.py`:
- Line 87: Remove the unused world_size parameter from the worker function
signatures and their invocations: delete the world_size argument from
embedding_forward_tp8_worker_cuda and weight_loading_tp8_worker_cuda, and update
run_worker_tp8_cuda (and any other call sites) to call these functions with only
(rank, local_rank); ensure any function definitions, type hints, and references
to world_size are removed so no unused-argument lint errors remain.
- Around line 197-277: Both tests duplicate the same multiprocessing
boilerplate; extract a helper _run_tp8_test(test_worker_fn) that encapsulates
setting mp.set_start_method("spawn", force=True), creating
mp.Manager/return_dict, spawning mp.Process for ranks 0..7 with local_rank==rank
calling run_worker_tp8_cuda with the provided worker function
(embedding_forward_tp8_worker_cuda or weight_loading_tp8_worker_cuda),
joining/terminating processes with timeout, and performing the same assertions
on return_dict; then replace test_forward_pass_tp8_real to call
_run_tp8_test(embedding_forward_tp8_worker_cuda) and
test_weight_loading_tp8_real to call
_run_tp8_test(weight_loading_tp8_worker_cuda).
- Line 48: Remove the unused local_world_size parameter from the function
signature (it's unused and triggers ARG001) or if it was intended for test
configuration, add a comment explaining its purpose and use it accordingly; also
move the import traceback out of the except block(s) and place it with the other
top-level stdlib imports so traceback is imported once at module import time
(look for occurrences of local_world_size in the test function signature and the
traceback import inside the except blocks around the same test).
- Around line 293-308: Add concise docstrings to the two test methods
test_cuda_forward and test_cuda_weight_loader in TestVocabParallelEmbeddingCUDA:
for test_cuda_forward describe that it verifies the VocabParallelEmbedding moves
to CUDA and returns a tensor of shape (batch, seq, embed_dim) on GPU; for
test_cuda_weight_loader describe that it checks load_weight correctly copies a
CPU tensor into the module weight on CUDA and that values match after moving
back to CPU; keep them short (one or two sentences) and follow the file's
existing docstring style.
- Line 98: Replace the hard-coded literals with module-level named constants:
introduce constants such as _TP_SIZE_TEST (for TP size 8), _MASTER_PORT_TEST
(for 29500), _VOCAB_SIZE (for 29500-related vocab), and _EMBED_DIM/_TOKEN_DIM
(for 1024/64) at the top of the test module and use those constants where the
literals appear (e.g., replace tp_size == 8 -> tp_size == _TP_SIZE_TEST,
vocab_size // 8 -> _VOCAB_SIZE // _TP_SIZE_TEST, and any uses of 1024/64 ->
_EMBED_DIM/_TOKEN_DIM, and replace the master port literal with
_MASTER_PORT_TEST); update all occurrences referenced in the review (the
assertions and calculations around tp_size, master port, vocab/embed dimensions)
so intent is clear and changes remain local to the test module.
| # Keep model path synchronized when only one side is explicitly overridden. | ||
| server_model_overridden = "server__model_path" in parsed | ||
| model_model_overridden = "model__model_path" in parsed | ||
| if server_model_overridden and not model_model_overridden: | ||
| cfg.model.model_path = cfg.server.model_path | ||
| elif model_model_overridden and not server_model_overridden: | ||
| cfg.server.model_path = cfg.model.model_path | ||
| | ||
| cfg._initialized = True | ||
| return cfg |
Direct mutation of cfg.server.model_path may leave tokenizer_path stale.
ServerConfig.__post_init__ sets tokenizer_path = model_path when tokenizer_path is None. After cfg.server is already reconstructed on line 295 (with tokenizer_path resolved to the original model_path), lines 311 and 313 directly assign cfg.server.model_path without triggering __post_init__. If a user provides only --model.model_path (not --server.*), line 313 silently diverges model_path and tokenizer_path.
The fix is to reconstruct ServerConfig for this path too, instead of a bare field assignment:
🐛 Proposed fix
- elif model_model_overridden and not server_model_overridden:
- cfg.server.model_path = cfg.model.model_path
+ elif model_model_overridden and not server_model_overridden:
+ # Rebuild so __post_init__ / _validate re-run with the new model_path.
+ server_values = {
+ dc_field.name: getattr(cfg.server, dc_field.name)
+ for dc_field in fields(cfg.server)
+ }
+ server_values["model_path"] = cfg.model.model_path
+ cfg.server = ServerConfig(**server_values)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/configs/global_config.py` around lines 307 - 316, The code mutates
cfg.server.model_path directly which bypasses ServerConfig.__post_init__ (so
tokenizer_path stays stale); instead of assigning cfg.server.model_path =
cfg.model.model_path, reinstantiate the ServerConfig (e.g., replace cfg.server
with ServerConfig(... ) or call its constructor/validator) using the updated
model_path so __post_init__ runs and tokenizer_path is set correctly; update the
branch that checks model_model_overridden and server_model_overridden to
reconstruct cfg.server from cfg.server fields with model_path =
cfg.model.model_path rather than doing a bare field assignment.
| max_queued_requests: Optional[int] = None | ||
| max_total_tokens: Optional[int] = None | ||
| chunked_prefill_size: Optional[int] = None | ||
| max_prefill_tokens: int = None |
max_prefill_tokens annotation should be Optional[int], not int.
The default is None and _validate already guards with if self.max_prefill_tokens is not None, making it clearly optional. The current annotation causes a type-checker error (int cannot be None).
🐛 Proposed fix
- max_prefill_tokens: int = None
+ max_prefill_tokens: Optional[int] = None📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
- max_prefill_tokens: int = None
+ max_prefill_tokens: Optional[int] = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/configs/server_config.py` at line 43, Change the type annotation of
the field max_prefill_tokens in server_config.py from int to Optional[int] to
match its default None and the existing guard in _validate; import Optional from
typing (or use typing.Optional) if not already imported, and update any related
type hints/annotations that reference max_prefill_tokens to use Optional[int] so
static type checkers no longer report that int cannot be None.
if self.tokenizer_path is None:
    self.tokenizer_path = self.model_path
if self.served_model_name is None:
    self.served_model_name = str(self.model_path)
self._validate()
str(model_path) when model_path=None silently sets served_model_name to the string "None".
When GlobalConfig.__post_init__ calls ServerConfig(model_path=None), this path runs and sets served_model_name = "None" (a non-empty, non-None string). Any consumer checking if cfg.server.served_model_name will see a truthy string, masking the "no model configured" state.
🐛 Proposed fix
 if self.served_model_name is None:
-    self.served_model_name = str(self.model_path)
+    if self.model_path is not None:
+        self.served_model_name = str(self.model_path)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/configs/server_config.py` around lines 97 - 101, The code currently
sets served_model_name = str(self.model_path) even when model_path is None,
producing the string "None" and masking an unset model; update the
initialization so served_model_name is only assigned from model_path when
model_path is not None (e.g. if self.served_model_name is None and
self.model_path is not None: self.served_model_name = str(self.model_path)),
leaving served_model_name as None otherwise; keep the existing tokenizer_path
handling and ensure ServerConfig._validate() is still called afterwards.
world_size = 8
local_world_size = 8  # Single node with 8 GPUs

mp.set_start_method("spawn", force=True)
mp.set_start_method must not be called inside individual test methods.
mp.set_start_method("spawn", force=True) modifies global multiprocessing state for the entire process. Calling it inside a pytest test method (even with force=True) is fragile: other tests sharing the same process may be affected, and the call order matters.
Move the call to a session-scoped fixture in conftest.py or guard it under if __name__ == "__main__".
Also applies to: 243-243
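A minimal sketch of the conftest.py route; the fixture name and the autouse choice are assumptions, not code from this PR:

```python
# conftest.py (sketch)
import multiprocessing as mp

import pytest


@pytest.fixture(scope="session", autouse=True)
def _spawn_start_method():
    # Configure the start method once per test session; tolerate it having
    # already been set by another plugin or an earlier conftest.
    try:
        mp.set_start_method("spawn")
    except RuntimeError:
        pass
```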
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/tests/test_vocab_parallel_embedding.py` at line 202,
mp.set_start_method("spawn", force=True) is being invoked inside individual test
functions; remove that call from the test (e.g., tests in
test_vocab_parallel_embedding.py) and instead set the start method in a
session-scoped pytest fixture in conftest.py (or wrap the call under a script
guard). Create a fixture (session scope) that calls mp.set_start_method("spawn",
force=True) once before tests run and ensure tests reference no direct call to
mp.set_start_method; alternatively protect the existing call with if __name__ ==
"__main__" so it never executes during pytest collection. Ensure you remove or
comment out any other in-test occurrences (the line at 243 as noted) so
multiprocessing start method is set only once at session setup.
for rank in range(world_size):
    result = return_dict.get(rank, "TIMEOUT")
    if rank == 0:
        assert result == "PASSED", f"Rank {rank} failed: {result}"
    else:
        assert "ERROR" not in str(result), f"Rank {rank} error: {result}"
Timed-out non-zero ranks silently pass the assertion.
When a worker exceeds the 120-second join timeout and is killed, return_dict[rank] is never set. return_dict.get(rank, "TIMEOUT") returns "TIMEOUT". For non-zero ranks, the assertion is "ERROR" not in str(result) — "TIMEOUT" passes that check, so a hung or killed worker is silently ignored.
🔧 Proposed fix
 for rank in range(world_size):
     result = return_dict.get(rank, "TIMEOUT")
     if rank == 0:
         assert result == "PASSED", f"Rank {rank} failed: {result}"
     else:
-        assert "ERROR" not in str(result), f"Rank {rank} error: {result}"
+        assert result == "OK", f"Rank {rank} failed: {result}"

Also applies to: 272-277
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/tests/test_vocab_parallel_embedding.py` around lines 231 - 236, The
loop that checks per-rank results uses return_dict.get(rank, "TIMEOUT") which
allows a timed-out worker to produce "TIMEOUT" and slip past the non-zero-rank
assertion; update the checks in the block that iterates ranks (the loop using
return_dict and rank) so that non-zero ranks explicitly fail on the "TIMEOUT"
sentinel (for example by asserting result == "PASSED" or asserting result not in
("TIMEOUT",) and "ERROR" not in str(result)); apply the same change to the
second occurrence of this loop (the one around lines 272-277) so hung/killed
workers are treated as failures rather than silently passing.
| "scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8" | ||
| ] |
apache-tvm-ffi version mismatch between build-system and runtime.
The build-system requires apache-tvm-ffi == 0.1.8 (Line 3) while the runtime dependency pins == 0.1.8.post2 (Line 24). Although .post2 is a post-release fix of the same core, the two pins are inconsistent and may confuse dependency resolvers or CI environments that install both.
💡 Proposed fix
- "scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8"
+ "scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8.post2"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8" | |
| ] | |
| "scikit-build-core>=0.11.0", "apache-tvm-ffi == 0.1.8.post2" | |
| ] |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 3 - 4, The pyproject declares conflicting pins
for apache-tvm-ffi: the build-system lists "apache-tvm-ffi == 0.1.8" while the
runtime dependencies pin "apache-tvm-ffi == 0.1.8.post2"; make them consistent
by choosing one canonical pin and updating the other entry to match (e.g.,
change the build-system entry to "apache-tvm-ffi == 0.1.8.post2" or change the
runtime pin to "== 0.1.8"), ensuring both the build-system and the runtime
dependencies reference the exact same apache-tvm-ffi version.
| "pytest", | ||
| "pytest-html", |
Move pytest and pytest-html out of the main dependencies list.
Listing test-only packages as required runtime dependencies forces them onto every end user. They should be in an optional group (e.g., dev or test).
💡 Proposed fix
dependencies=[
    "packaging",
-   "pytest",
-   "pytest-html",
    "apache-tvm-ffi == 0.1.8.post2",
    ...
]

[project.optional-dependencies]
cuda = ["tilelang", "flashinfer-python", "pyzmq"]
+dev = ["pytest", "pytest-html"]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` around lines 22 - 23, Remove the test-only packages "pytest"
and "pytest-html" from the main dependencies list and add them to an appropriate
development/test optional group (for example add them under
[tool.poetry.dev-dependencies] or under [project.optional-dependencies]["test"]
depending on the project format), ensuring the names "pytest" and "pytest-html"
appear only in that dev/test section and not in the runtime dependencies block.
 [project.optional-dependencies]
-cuda = ["tilelang"]
+cuda = ["tilelang", "flashinfer-python", "pyzmq"]
flashinfer-python is Linux-only — document the platform constraint.
The official FlashInfer documentation explicitly notes that "the package currently used by FlashInfer is named flashinfer-python, not flashinfer", confirming the package name is correct. However, FlashInfer's prerequisites list "OS: Linux only", meaning pip install pymllm[cuda] will not work on Windows or macOS even for developers.
Consider adding an environment marker or a comment in pyproject.toml to document this platform restriction so CI and contributors are aware.
As per coding guidelines, "Ensure code is portable across supported platforms (e.g., Linux, Windows) unless explicitly platform-specific."
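As an illustration only, a sketch of a marker-gated extra — the surrounding extras layout is an assumption about the current pyproject.toml, not a committable change:

```toml
[project.optional-dependencies]
# flashinfer-python currently publishes Linux-only wheels, so gate it on the platform.
cuda = [
    "tilelang",
    'flashinfer-python; sys_platform == "linux"',
    "pyzmq",
]
```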
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pyproject.toml` at line 36, Document that flashinfer-python is Linux-only by
updating the cuda extras entry in pyproject.toml: annotate the cuda =
["tilelang", "flashinfer-python", "pyzmq"] extras list with either a platform
environment marker (e.g., mark "flashinfer-python" for sys_platform == "linux")
or add an inline comment explaining the Linux-only constraint so CI and
contributors see the limitation; modify the extras definition that contains the
"cuda" key and the "flashinfer-python" identifier accordingly.
- Added a new `store_cache` CUDA kernel for efficient key/value tensor storage in a cache.
- Introduced Python interface for the `store_cache` kernel, enabling its use in PyTorch.
- Created benchmarks to compare `store_cache` performance against standard PyTorch indexing.
- Updated `.gitignore` to exclude `.claude` directory and added `.pytest_cache` to `mllm-kernel`.
- Added tests for `store_cache` functionality to ensure correctness and performance.
- Refactored memory management in `KVPool` to utilize the new `store_cache` kernel when applicable.
- Enhanced readability by formatting socket creation parameters across multiple lines in the `init_sockets` method of `TokenizerProcess`.
- Maintained functionality while improving code clarity for future maintenance.
Actionable comments posted: 20
🧹 Nitpick comments (21)
.claude/skills/update-codeowners/SKILL.md (1)
38-38: Nitpick: "exact same" is wordy. Consider replacing "exact same" with "identical" or simply "the same" for tighter prose.
✏️ Proposed wording fix
- - [ ] No accidental duplicate rule for the exact same path pattern.
+ - [ ] No accidental duplicate rule for the same path pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.claude/skills/update-codeowners/SKILL.md at line 38, Replace the wording "exact same" in the checklist item "- [ ] No accidental duplicate rule for the exact same path pattern." with a tighter phrase such as "identical" or "the same" (e.g., "- [ ] No accidental duplicate rule for the same path pattern."), ensuring the checklist line text is updated accordingly to improve concision.mllm-kernel/mllm_kernel/cuda/jit/store_cache.py (4)
56-62: Redundant manual cache alongside the@jitdecorator's internalcache_once.
_KERNEL_CACHEduplicates the memoization thatcache_onceinside the@jitdecorator already provides for each kernel variant (since_make_store_cache_kernelcreates a distinct decorated function perrow_bytes). The manual dict is harmless but adds a second caching layer. A brief inline comment explaining why both are needed (e.g., "caches the wrapper function itself, not just the compiled module") would help future readers.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm-kernel/mllm_kernel/cuda/jit/store_cache.py` around lines 56 - 62, The current manual cache _KERNEL_CACHE duplicates the memoization already provided by the `@jit` decorator's internal cache_once for each kernel variant; update the code to keep the manual dict but add a concise inline comment above _KERNEL_CACHE explaining its purpose: that _KERNEL_CACHE stores the Python-level wrapper function returned by _make_store_cache_kernel (so callers reuse the exact wrapper object) while cache_once/@jit handles the compiled module memoization, and reference _get_kernel, _make_store_cache_kernel and cache_once/@jit in that comment to clarify why both layers exist.
18-24: PDL capability check doesn't account for multi-GPU setups and can't distinguish sm_90 vs sm_90a.Two concerns:
torch.cuda.get_device_capability()defaults to the current device. With@cache_once(no device argument in the signature), the result for device 0 is cached and reused even when the kernel later runs on a different GPU with a different compute capability.The comment says "sm_90a (Hopper)", but
get_device_capability()returns(9, 0)for both sm_90 and sm_90a — PyTorch cannot distinguish them. PDL specifically requires sm_90a. Practically this is low-risk (most Hopper GPUs are sm_90a), but it's worth a comment.Suggested improvement for multi-GPU awareness
`@cache_once` -def _is_arch_support_pdl() -> bool: +def _is_arch_support_pdl(device_index: int = 0) -> bool: if not torch.cuda.is_available(): return False - major, minor = torch.cuda.get_device_capability() - # PDL requires sm_90a (Hopper) or later + major, minor = torch.cuda.get_device_capability(device_index) + # PDL requires sm_90a (Hopper) or later. + # NOTE: get_device_capability cannot distinguish sm_90 from sm_90a; + # we assume sm_90a when (major, minor) >= (9, 0). return major > 9 or (major == 9 and minor >= 0)Then pass the device index from the tensor in
_make_store_cache_kernel/store_cache.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm-kernel/mllm_kernel/cuda/jit/store_cache.py` around lines 18 - 24, The PDL capability check _is_arch_support_pdl currently reads torch.cuda.get_device_capability() for the current device and caches that result once, which breaks on multi-GPU systems and also cannot distinguish sm_90 vs sm_90a; update _is_arch_support_pdl to accept a device index/torch.device parameter (or make the cache key include device) and query torch.cuda.get_device_capability(device) so results are per-GPU, then propagate that device argument from call sites in _make_store_cache_kernel and store_cache (pass the tensor.device.index or tensor.device) so the correct GPU capability is checked at runtime; also add a short comment near the check noting that PyTorch returns (9,0) for both sm_90 and sm_90a and thus this check is conservative for PDL which requires sm_90a.
91-127:store_cacheskips therow_bytes % 4validation thatcan_use_store_cacheenforces.If a caller invokes
store_cachedirectly without first callingcan_use_store_cache, an invalidrow_bytes(e.g., not a multiple of 4) will propagate to JIT compilation and produce an opaque error. A lightweight guard here would improve the developer experience.Proposed guard
def store_cache( k: torch.Tensor, v: torch.Tensor, k_cache: torch.Tensor, v_cache: torch.Tensor, indices: torch.Tensor, *, row_bytes: int = 0, num_split: int = 0, ) -> None: ... row_bytes = row_bytes or k.shape[-1] * k.element_size() + if row_bytes % 4 != 0: + raise ValueError( + f"row_bytes must be a multiple of 4, got {row_bytes}" + ) kernel = _get_kernel(row_bytes)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm-kernel/mllm_kernel/cuda/jit/store_cache.py` around lines 91 - 127, The store_cache function currently omits the row_bytes alignment check present in can_use_store_cache; add a guard in store_cache (after computing row_bytes = row_bytes or k.shape[-1] * k.element_size() and before calling _get_kernel or JIT) that validates row_bytes % 4 == 0 and raises a clear ValueError (or TypeError) with a descriptive message if the check fails so callers who bypass can_use_store_cache get an immediate, useful error; reference the store_cache function and can_use_store_cache behavior when implementing the guard and keep the check before selecting num_split or invoking kernel(k, v, k_cache, v_cache, indices, num_split).
65-88:@cache_oncepermanently cachesFalsefor transient JIT compilation failures.If
_get_kernelraises due to a transient issue (e.g., resource pressure during compilation),can_use_store_cachereturnsFalseand that result is cached forever for thatrow_bytes. A subsequent call will never retry. This is acceptable if JIT failures are always deterministic, but worth a docstring note so callers are aware.Also, the static analysis hint about
return Truein atryblock (TRY300) is a valid style point — moving it intoelsemakes the intent clearer:Minor style fix
try: _get_kernel(row_bytes) - return True except Exception as e: logger.warning( "Failed to load JIT store_cache kernel with row_bytes=%d: %s", row_bytes, e, ) return False + else: + return True🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm-kernel/mllm_kernel/cuda/jit/store_cache.py` around lines 65 - 88, can_use_store_cache currently uses `@cache_once` which will permanently cache False when _get_kernel raises transient JIT compilation errors; update the function to avoid caching transient failures (e.g., remove or modify `@cache_once` for error paths, or catch exceptions and re-raise after logging so the cache only stores successful results), add a docstring note on caching semantics for callers, and move the successful-return into a try/except/else pattern so the return True resides in the else block rather than inside the try — refer to can_use_store_cache, the `@cache_once` decorator, and _get_kernel to locate and change the behavior.mllm-kernel/benchmarks/bench_store_cache.py (2)
36-89: Profiler context doesn't capture complete CUDA timing —synchronizeis outside the recording window.
torch.cuda.synchronize()on line 57 runs after thewith profile(...) as profblock has exited, so any in-flight CUDA work from the last iteration might not be attributed to the profiler trace. Modern PyTorch's profiler typically flushes on exit, but for correctness of the timing summary you could move the synchronize inside:Suggested adjustment
with profile( activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=False, profile_memory=False, with_stack=False, ) as prof: for _ in range(iters): fn() - torch.cuda.synchronize() + torch.cuda.synchronize()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm-kernel/benchmarks/bench_store_cache.py` around lines 36 - 89, _in _profile_path_, the final torch.cuda.synchronize() is executed after the profiler context exits, so in-flight CUDA work from the last iteration may not be captured; move the synchronization into the profiler block immediately after the iters loop (i.e., call torch.cuda.synchronize() while still inside the with profile(...) as prof: block right after running fn() iters times) and remove or keep the external one accordingly so that the profiler records completed CUDA activity for the last iteration.
62-71:time_attrandsort_keyalways hold the same value — the branching is redundant.Both variables resolve to
"self_cuda_time_total"or"self_device_time_total"based on the same condition. You can dropsort_keyand reusetime_attr.Simplification
time_attr = ( "self_cuda_time_total" if events and hasattr(events[0], "self_cuda_time_total") else "self_device_time_total" ) - sort_key = ( - "self_cuda_time_total" - if time_attr == "self_cuda_time_total" - else "self_device_time_total" - ) total_self_device_us = sum(float(getattr(evt, time_attr, 0.0)) for evt in events) avg_self_device_us = total_self_device_us / max(iters, 1) print(f"\n=== {name} ===") print( prof.key_averages().table( - sort_by=sort_key, + sort_by=time_attr, row_limit=row_limit, ) )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm-kernel/benchmarks/bench_store_cache.py` around lines 62 - 71, The code defines both time_attr and sort_key using the same condition, making sort_key redundant; update the benchmarking logic to remove sort_key and use time_attr everywhere instead (replace any usage of sort_key with time_attr), keeping the original conditional that checks events and hasattr(events[0], "self_cuda_time_total") to set time_attr; ensure any sorting or attribute lookups that previously referenced sort_key now reference time_attr so behavior is unchanged.mllm-kernel/mllm_kernel/cuda/csrc/store_cache.cuh (1)
138-149: User-suppliednum_splitthat doesn't match alignment causes an unrecoverablePanic.
get_kernelfalls through allif constexpr/ifchecks and hitsPanicwhennum_splitdoesn't satisfy the alignment requirement forkElementBytes. For example, callingstore_cache(..., num_split=4)withrow_bytes=128(not divisible by 512) triggers the panic.The Python-side auto-selection is safe, but the
num_splitparameter is user-facing. Consider either:
- Validating
num_splitin the Pythonstore_cachefunction before dispatching, or- Replacing
Panicwith a descriptiveRuntimeCheckthat includes the alignment requirement.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm-kernel/mllm_kernel/cuda/csrc/store_cache.cuh` around lines 138 - 149, The get_kernel template can hit Panic when a user-provided num_split doesn't satisfy kElementBytes alignment (e.g., num_split=4 with row_bytes=128); change handling so this is recoverable: either add explicit validation of num_split in the Python-facing store_cache dispatcher (check row_bytes % (num_split * 128) == 0 and raise a clear Python error) or replace the Panic call inside get_kernel with a descriptive runtime check (use RuntimeCheck or equivalent to emit an error message naming get_kernel, the requested num_split, kElementBytes and the required alignment) so callers get a clear, non-crashing diagnostic; update references to get_kernel, store_cache, store_kernel and Panic accordingly.pymllm/orchestrator/request_response_process.py (2)
91-92: UseTypeErrorfor type-check failures, notValueError.The check
isinstance(request.rid, str)is validating the type ofrid, not its value.TypeErroris the conventional exception for this.Proposed fix
if not isinstance(request.rid, str): - raise ValueError("RequestResponseProcess currently accepts single requests only.") + raise TypeError("RequestResponseProcess currently accepts single requests only.")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/request_response_process.py` around lines 91 - 92, The type-check for request.rid in RequestResponseProcess is raising ValueError but should raise TypeError for type failures; update the conditional in request_response_process.py (the block that checks "if not isinstance(request.rid, str):") to raise TypeError with an appropriate message (e.g., indicating that rid must be a str) instead of ValueError so the exception semantics match the type validation.
68-82:create_zmq_socketfromipc_utilsis typed for synczmq.Context/zmq.Socket, but async variants are passed here.This works at runtime because
zmq.asyncio.Contextsubclasseszmq.Context, but the type annotations will cause type-checker warnings. Consider adding an overload orUnioninipc_utils.py, or use# type: ignorewith a comment here.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/request_response_process.py` around lines 68 - 82, The type-checker warning is caused by passing a zmq.asyncio.Context into create_zmq_socket (typed for sync zmq.Context) from start in request_response_process.py; fix by updating ipc_utils.create_zmq_socket signature to accept Union[zmq.Context, zmq.asyncio.Context] (and corresponding Socket/AsyncSocket unions or add overloads) so both sync and async contexts are typed, or if you prefer a localized change, add a precise "# type: ignore[assignment]" (with a short comment referencing create_zmq_socket expecting sync Context) on the two create_zmq_socket calls in start to suppress the warning; reference the start method and create_zmq_socket when making the change.pymllm/orchestrator/scheduler_process.py (1)
180-195: Unusedbatchparameter inprocess_batch_result.The
batchargument is not referenced in the method body (onlyresultis used). Either remove it or add a brief comment about planned future use.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/scheduler_process.py` around lines 180 - 195, The process_batch_result method currently accepts an unused parameter batch; either remove batch from the signature of process_batch_result and update all callers (e.g., places that call SchedulerProcess.process_batch_result) to stop passing the batch, or keep the parameter and add a one-line comment inside process_batch_result explaining that batch is reserved for future features (e.g., KV-cache or completion checks) to avoid lint warnings; update any type hints or tests accordingly so the codebase remains consistent with the chosen approach.pymllm/engine/io_struct.py (2)
85-89:__getitem__returnsselffor single-item batches — callers may mutate the original.When
batch_size == 1,__getitem__(0)returnsselfrather than a copy. Any mutation by the caller (e.g., appending tooutput_token_idsin_forward_batch) will modify the originalGenerateReqInput. This is a common aliasing trap. Consider returning a shallow copy if immutability is expected.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/engine/io_struct.py` around lines 85 - 89, The __getitem__ in GenerateReqInput currently returns self when batch_size == 1, causing callers (e.g., _forward_batch) that mutate fields like output_token_ids to alias and modify the original; change __getitem__ so that when batch_size == 1 it returns a shallow copy of the GenerateReqInput instance instead of self (either by invoking an existing clone/copy helper or by constructing a new GenerateReqInput with the same field values), ensuring mutable fields (like output_token_ids) are new lists to avoid aliasing.
8-27:BaseReqandBaseBatchReqlack class-level docstrings.These are public API dataclasses depended upon by other modules. A brief docstring explaining purpose and the
rid/ridssemantics would improve discoverability. As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/engine/io_struct.py` around lines 8 - 27, Add concise class-level docstrings to the public dataclasses BaseReq and BaseBatchReq describing their purpose and the semantics of the rid/rids fields (that rid is an optional single or list of request identifiers and regenerate_rid returns/refreshes them, and that rids is a list of request identifiers with regenerate_rids refreshing them), and include brief notes on return types for regenerate_rid and regenerate_rids; update the docstrings near the class definitions of BaseReq and BaseBatchReq to improve discoverability and comply with the public API documentation guideline.pymllm/orchestrator/model_runner_process.py (1)
50-56: Consider adding error handling aroundrecv_pyobj/send_pyobjin the event loop.If a malformed message arrives (deserialization failure) or the socket encounters a transient error, the entire process will crash with an unhandled exception. A
try/exceptinside the loop with logging would improve resilience — the outertry/finallyinrun_model_runner_processwill still triggershutdown, but the process will be lost.♻️ Suggested resilience improvement
def event_loop(self) -> None: """Infinite loop: recv batch -> forward -> sample -> send result.""" logger.info("ModelRunnerProcess event loop started") while True: - batch = self._recv_from_scheduler.recv_pyobj() - result = self._forward_batch(batch) - self._send_to_scheduler.send_pyobj(result) + try: + batch = self._recv_from_scheduler.recv_pyobj() + result = self._forward_batch(batch) + self._send_to_scheduler.send_pyobj(result) + except zmq.ZMQError: + logger.exception("ZMQ error in ModelRunnerProcess event loop") + break + except Exception: + logger.exception("Unexpected error in ModelRunnerProcess event loop")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/model_runner_process.py` around lines 50 - 56, Wrap the body of ModelRunnerProcess.event_loop in a try/except that catches exceptions from self._recv_from_scheduler.recv_pyobj, self._forward_batch, and self._send_to_scheduler.send_pyobj, log the error with context (including the exception) and continue the loop instead of letting the process crash; on deserialization/malformed message errors drop or ack the offending batch and continue, and on persistent socket errors optionally backoff or break to let run_model_runner_process invoke shutdown. Ensure logging references event_loop, _recv_from_scheduler.recv_pyobj, _forward_batch and _send_to_scheduler.send_pyobj so the failing component is clear.pymllm/orchestrator/ipc_utils.py (1)
14-18: No cleanup for stale IPC socket files on abnormal termination.If a process crashes without cleanly closing its ZMQ sockets, the socket files under
_IPC_DIRwill linger. On restart, ZMQ will rebind successfully, but the directory itself may accumulate stale files over time. Consider adding acleanup_ipc_dir()helper (or anatexithook) that removes known socket files during orderly shutdown or startup.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/orchestrator/ipc_utils.py` around lines 14 - 18, Add a cleanup helper to remove stale IPC socket files and wire it into startup/shutdown: implement a new function cleanup_ipc_dir() that lists files in _IPC_DIR (e.g., with glob or os.listdir), removes only known socket file patterns (or all files in that directory) using os.remove with exception handling, call cleanup_ipc_dir() from _ensure_ipc_dir() at startup to purge leftovers, and also register it with atexit.register(cleanup_ipc_dir) so sockets created by this process are removed on orderly shutdown; reference _IPC_DIR, _ensure_ipc_dir, and the new cleanup_ipc_dir function when making changes.pymllm/engine/launch.py (2)
30-36:Engineclass and__init__lack docstrings.Per coding guidelines, public APIs and classes must have docstrings explaining purpose, parameters, and errors. The
Engineclass has no class docstring, and__init__has no description of what it configures or what errors it may raise (ValueErrorif model/tokenizer paths are missing,ValueErrorfor unsupported dtype).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/engine/launch.py` around lines 30 - 36, Add docstrings to the Engine class and its __init__ method: for the class describe its responsibility (managing subprocesses, request/response process, and model setup) and for __init__ document parameters (none), what it configures (_subprocesses, _rr_process, logging, default torch dtype, model/tokenizer checks), and enumerate raised exceptions (ValueError when model/tokenizer paths are missing or when an unsupported dtype is configured). Reference the Engine class and its __init__ constructor and ensure the docs mention the side effects of calling _config_logging, _set_default_torch_dtype, and _check_model_and_tokenizer so callers know about potential ValueError raises.
270-291: Unusedridparameter in both static helpers.Static analysis (ARG004) confirms
ridis never referenced inside_wait_for_final_result(line 271) or_stream_results(line 281). Both methods operate entirely on theReqStateobject.♻️ Proposed refactor
`@staticmethod` -async def _wait_for_final_result(rid: str, state: ReqState) -> Dict[str, Any]: +async def _wait_for_final_result(state: ReqState) -> Dict[str, Any]: `@staticmethod` -async def _stream_results(rid: str, state: ReqState) -> AsyncIterator[Dict[str, Any]]: +async def _stream_results(state: ReqState) -> AsyncIterator[Dict[str, Any]]:Update call sites to drop
rid:-return await self._wait_for_final_result(rid, state) +return await self._wait_for_final_result(state) -async for chunk in self._stream_results(rid, state): +async for chunk in self._stream_results(state): -yield await self._wait_for_final_result(rid, state) +yield await self._wait_for_final_result(state)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/engine/launch.py` around lines 270 - 291, Both static helpers _wait_for_final_result and _stream_results declare an unused rid parameter; remove rid from their signatures and any internal references, leaving only (state: ReqState) as the parameter, update their return/type hints to remain the same, and then update all call sites that invoke _wait_for_final_result(...) and _stream_results(...) to pass only the ReqState instance; ensure imports/annotations still reference ReqState and run tests/lint to confirm ARG004 is resolved.pymllm/mem_cache/radix_cache.py (2)
70-70:__slots__tuples are unsorted (Ruff RUF023)Both
RadixKey.__slots__(Line 70) andTreeNode.__slots__(Lines 113-126) are flagged by Ruff RUF023. Applying natural sort aligns with Ruff's expectations and improves scan-ability.Also applies to: 113-126
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mem_cache/radix_cache.py` at line 70, RadixKey.__slots__ and TreeNode.__slots__ are unsorted; change each __slots__ tuple to use a consistently ordered (natural/alphanumeric) sequence to satisfy Ruff RUF023 — locate RadixKey.__slots__ ("token_ids", "extra_key") and the TreeNode.__slots__ declaration and reorder the strings alphabetically (or by natural sort) so member names are sorted, then run the linter to confirm the warning is resolved.
129-129:defaultdict(TreeNode)creates nodes with side-effects on accidental key accessUsing
defaultdict(TreeNode)means any unguardednode.children[missing_key]silently creates a newTreeNode— invoking_next_node_id(), settinglast_access_time, and initialising a freshdefaultdictfor its own children. All current call-sites correctly guard withif ck not in node.children:, but this default factory is a latent footgun (and makes bugs harder to spot during future maintenance). A plaindictwould immediately raiseKeyErroron unguarded access.♻️ Proposed refactor
- self.children: Dict[Any, TreeNode] = defaultdict(TreeNode) + self.children: Dict[Any, TreeNode] = {}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mem_cache/radix_cache.py` at line 129, Replace the latent-footgun defaultdict usage so TreeNode.children is a plain dict instead of defaultdict(TreeNode); update any places that previously relied on implicit creation to explicitly construct and assign a new TreeNode (calling TreeNode(...) so _next_node_id() and last_access_time are invoked only when intended). Specifically change the attribute declaration in class TreeNode from "self.children: Dict[Any, TreeNode] = defaultdict(TreeNode)" to a plain dict, and audit call-sites that check/insert children (e.g., where code currently does "if ck not in node.children: ..." or might index directly) to use explicit node.children[ck] = TreeNode(...) before accessing the child.pymllm/mem_cache/__init__.py (1)
20-37:__all__is not sorted (Ruff RUF022)Ruff flags the
__all__list as not in isort order. Sorting it keeps the public API surface easy to scan.♻️ Proposed sort
__all__ = [ # memory_pool + "KVPool", "KVPool", "TokenToKVPoolAllocator", "ReqToTokenPool", - "KVPool", - "TokenToKVPoolAllocator", - "ReqToTokenPool", - "make_full_attention_net_mem_pool", - "make_req_to_token_pool", - # radix_cache - "RadixCache", - "RadixKey", - "TreeNode", - "MatchResult", - "InsertResult", - "EvictResult", - "hash_token_ids", - "hash_to_int64", - "hash_bytes", + "EvictResult", + "InsertResult", + "KVPool", + "MatchResult", + "RadixCache", + "RadixKey", + "ReqToTokenPool", + "TokenToKVPoolAllocator", + "TreeNode", + "hash_bytes", + "hash_to_int64", + "hash_token_ids", + "make_full_attention_net_mem_pool", + "make_req_to_token_pool", ]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mem_cache/__init__.py` around lines 20 - 37, The __all__ list in pymllm.mem_cache is not sorted which triggers Ruff RUF022; reorder the entries in the __all__ list into case-sensitive alphabetical order (e.g., arrange entries like EvictResult, InsertResult, KVPool, MatchResult, RadixCache, RadixKey, ReqToTokenPool, TokenToKVPoolAllocator, TreeNode, and the hash_* and factory names such as hash_bytes, hash_token_ids, hash_to_int64, make_full_attention_net_mem_pool, make_req_to_token_pool) so the public API names are sorted and Ruff stops flagging the file.pymllm/mem_cache/memory_pool.py (1)
97-97:_row_bytescomputation creates a throwaway tensor
torch.tensor([], dtype=dtype).element_size()allocates a temporary tensor purely to query element size. Prefer the direct dtype API:♻️ Proposed refactor
- self._row_bytes = k_row_dim * torch.tensor([], dtype=dtype).element_size() + self._row_bytes = k_row_dim * torch.empty(0, dtype=dtype).element_size()Or even more directly (zero allocation):
torch.finfo(dtype).bits // 8for floating-point dtypes (guard withdtype.is_floating_pointif integer dtypes need support).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/mem_cache/memory_pool.py` at line 97, The allocation of a temporary tensor to compute element size in the _row_bytes assignment should be replaced with a dtype-safe, zero-allocation computation; update the code that computes self._row_bytes (currently using k_row_dim and dtype) to derive bytes-per-element from the dtype directly (use torch.finfo(dtype).bits // 8 for floating dtypes and torch.iinfo(dtype).bits // 8 for integer dtypes, or use a try/except that falls back between finfo and iinfo) and multiply by k_row_dim so no throwaway tensor is created.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@mllm-kernel/benchmarks/bench_store_cache.py`:
- Around line 125-132: Add an explicit validation that args.batch_size <=
args.num_slots before creating indices so we don't silently truncate; if the
check fails, raise a clear ValueError describing both args.batch_size and
args.num_slots. Locate the block where k, v, indices, k_cache, and v_cache are
created (variables k, v, indices, k_cache, v_cache) and insert the guard
immediately before computing indices = torch.randperm(...), ensuring the check
runs on the same device/CPU context as the rest of the setup.
In `@pymllm/engine/io_struct.py`:
- Around line 113-133: The to_request_dict method currently calls
payload.update(self.extra_options) which allows self.extra_options to silently
overwrite core fields (e.g., "rid", "text", "input_ids", "sampling_params",
etc.); modify to_request_dict to check for key collisions between
self.extra_options and the core keys collected in payload before applying them,
and either raise a ValueError (reject) or emit a clear warning via the module
logger (warn) when a collision is detected (choose one behavior consistently),
and only merge non-colliding keys into payload if proceeding; reference the
to_request_dict function, the payload dict, and self.extra_options when
implementing this guard.
In `@pymllm/engine/launch.py`:
- Around line 143-151: The readiness loop blocks indefinitely because
reader.recv() has no timeout; update the loop that iterates procs_and_readers to
call reader.poll(timeout_sec) (e.g., a few seconds configurable) before calling
reader.recv(), raise a RuntimeError if poll returns False (timeout) and include
the process name in the message, keep the existing EOFError handling for recv,
and only call reader.recv() when poll indicates data is available; reference
procs_and_readers, reader.recv, reader.poll and the RuntimeError raised for
timeouts.
- Line 44: mp.set_start_method("spawn", force=True) is a global process-wide
mutation and must not live inside the instance method _launch_processes; remove
that call from pymllm.engine.launch._launch_processes (or any Engine.launch
helper) and instead set the start method exactly once in the program entrypoint:
add mp.set_start_method("spawn") inside an if __name__ == "__main__" guard
before constructing/launching Engine (e.g., before Engine() / engine.launch())
so the start method is configured at startup and not mutated at runtime.
- Around line 206-212: The isinstance(rid, list) guard inside the inner async
_run() is dead and should be moved to the start of generate(): validate rid (and
similarly in generate_async) immediately when the function begins—before you
build the request or call add_request—by raising ValueError("Synchronous
`generate` currently supports single request.") for list rid inputs; then remove
the unreachable check from _run(). This keeps the validation consistent with
add_request and avoids redundant/unreachable code in _run() and the analogous
check in generate_async.
- Around line 31-36: The __init__ currently never initializes self._loop which
can cause AttributeError if _launch_processes sets self._rr_process before
creating the asyncio loop or if generate() is called before launch(); initialize
self._loop = None in __init__ (add to the existing initializations) and update
shutdown() (and any other places that access self._loop, e.g., generate()) to
guard usage with "if self._loop is not None" (or hasattr(self, "_loop") and
self._loop is not None) before calling methods on it so accesses are safe even
if loop creation failed or launch() wasn’t run.
- Around line 296-299: The shutdown block currently swallows all errors with
"except Exception: pass"; update the code around
self._loop.run_until_complete(self._rr_process.shutdown()) to handle failures
explicitly by (1) checking that self._loop and self._rr_process exist/are
initialized before calling, and (2) catching exceptions as "except Exception as
e" and logging or re-raising the error instead of ignoring it so shutdown
failures (including AttributeError when _loop is missing) are visible; reference
the shutdown call self._rr_process.shutdown and the run loop invocation
self._loop.run_until_complete when making the change.
- Around line 359-363: When calling AutoConfig.from_pretrained to set
cfg.model.hf_config, detect if the trust_remote_code flag is True and emit a
clear security warning before invoking the call (e.g., use the module logger to
logger.warning("...") or similar). Reference the trust_remote_code variable and
the AutoConfig.from_pretrained call around cfg.model.hf_config so the warning is
logged only when trust_remote_code is True and includes a brief note that remote
code will be executed and operators should verify the source.
In `@pymllm/mem_cache/memory_pool.py`:
- Around line 108-125: The tensors in KVPool.__init__ (used by self.k_buffer and
self.v_buffer) are being created with pin_memory=True even when device is CUDA
which raises RuntimeError; update KVPool.__init__ to convert the device with
torch.device(device) and then override pin_memory = pin_memory and
self.device.type == "cpu" (i.e., only keep pin_memory True for CPU devices)
before calling torch.zeros, and apply the same guard/behavior in the factory
function make_full_attention_net_mem_pool (and update its docstring if it keeps
pin_memory=True by default) so torch.zeros is never called with pin_memory=True
for GPU devices.
In `@pymllm/mem_cache/radix_cache.py`:
- Around line 369-404: Phase 2 leaves node.value populated after calling
_free_swa_indices and _tombstone_node, causing Phase 1 to call _free_indices on
the same tensor later; fix by marking the node as swa-freed immediately after
SWA-freeing: after calling _free_swa_indices(node.value) (and before/after
_tombstone_node(node)) set a distinct sentinel on the node (e.g., node.value =
SWA_FREED_SENTINEL or node.swa_freed = True) so later code in evict's Phase 1
(the parent-check that calls _free_indices(p.value)) skips freeing when the
sentinel/flag indicates the indices were already freed; update any checks that
rely on value is None accordingly to treat the sentinel/flag as already freed.
- Around line 637-680: _insert_swa incorrectly discards the return value of
_split_node and then applies frees/assignments to the suffix child; capture the
node returned by self._split_node(node.key, node, plen) (e.g. prefix_node =
self._split_node(...)) and use that returned prefix_node for the following
SWA-specific operations (_free_indices on prefix_node.value slice, assign
prefix_node.value = value[:plen].clone(), and set prefix_node.swa_tombstone =
False) so the prefix (new leaf) is updated instead of the suffix; leave the
subsequent traversal logic (total_prefix, key/value slicing, and node
reassignment) unchanged except to continue using the correct node reference
returned by _split_node.
- Around line 97-103: The module-level counter _node_counter is not thread-safe;
update _next_node_id to acquire a module-level threading.Lock before
mutating/reading _node_counter to prevent races and duplicate IDs. Add an import
for threading, create a lock variable (e.g., _node_counter_lock =
threading.Lock()) next to _node_counter, and wrap the increment+return inside a
with _node_counter_lock: block inside _next_node_id so TreeNode() constructions
get unique, consistent node.id values.
- Around line 597-615: The _insert_normal loop currently calls
self._split_node(node.key, node, plen) but ignores its return value so `node`
still points at the suffix child; change this to assign the returned node (e.g.
node = self._split_node(node.key, node, plen)) and then break out of the loop
(mirroring _match_normal's new_node = self._split_node(...); node = new_node;
break) so subsequent self._add_leaf(node, key, value) attaches the new leaf to
the correct (prefix) parent.
In `@pymllm/orchestrator/request_response_process.py`:
- Around line 103-110: The abort_request implementation currently always sends
an abort message via self._send_to_tokenizer even when the rid is unknown;
update abort_request (which uses self._rid_to_state and self._send_to_tokenizer)
to only send the downstream abort when a state was found (i.e., state is not
None and you performed the state.finished/state.out_list/state.event updates).
Move the await self._send_to_tokenizer.send_pyobj({"rid": rid, "abort": True})
inside that conditional so no spurious aborts are emitted for unknown/cleaned-up
rids.
- Around line 28-42: ReqState.event is never cleared so callers that await
state.event.wait() will spin once the event is set; update the _recv_loop to
clear the event before signalling each new result (call state.event.clear()
prior to state.event.set()) to ensure wait() blocks until the next item, and
also add a short note to the ReqState docstring about the event lifecycle so
callers understand it is managed internally; reference ReqState.event and the
_recv_loop function when making the change.
In `@pymllm/orchestrator/scheduler_process.py`:
- Around line 201-205: The stream_output method currently uses
self._finished.pop(0) which is O(n); change the underlying container _finished
from a list to a collections.deque (same type used for _waiting_queue) and
replace pop(0) with popleft() in the stream_output method so removals are O(1);
update any initialization of _finished to use deque([...]) and keep existing
uses of self._send_to_detokenizer.send_pyobj(item) unchanged.
- Around line 98-108: The event_loop currently busy-waits because
recv_requests() uses poll(timeout=0) and when get_next_batch_to_run() returns
None the loop immediately repeats; modify event_loop (or adjust recv_requests)
to avoid spinning by introducing a small blocking wait when idle: after calling
get_next_batch_to_run() and finding batch is None (and _waiting_queue is empty /
no pending work), either call recv_requests() with a non-zero poll timeout or
insert a short sleep (e.g., time.sleep(0.01)) before the next iteration so the
loop yields CPU; update references in event_loop, recv_requests, and any checks
against _waiting_queue/get_next_batch_to_run to implement this idle backoff.
- Around line 155-159: The batch_id should not use id(batch_requests); update
get_next_batch_to_run to generate a stable unique ID instead — either add a
monotonic counter attribute (e.g., self._batch_counter initialized in __init__
and incremented when creating a batch) or import uuid and use uuid.uuid4().hex
to populate "batch_id" on the batch dict; update the batch creation in
get_next_batch_to_run (where batch and batch_requests are assembled) to use the
new generator and ensure the counter or uuid import is added to the
class/module.
In `@pymllm/orchestrator/tokenizer_process.py`:
- Line 73: The assignment to the local variable "text" in tokenizer_process.py
(raw_request.get("text", "")) is unused and triggers a Ruff F841; either remove
the assignment or rename it to "_text" to indicate an intentionally unused
value. Update the assignment at the top of the tokenizer/process function (the
line with raw_request.get("text", "")) to use "_text" or delete that line, and
run the linter to confirm the F841 warning is resolved.
- Around line 30-32: Change the three attributes in tokenizer_process.py to use
Optional types: update the annotations for self._zmq_ctx, self._recv_from_rr,
and self._send_to_scheduler to Optional[zmq.Context] and Optional[zmq.Socket]
respectively, and add Optional to the imports on line 10 so the type hints
correctly allow None (matching the pattern used in model_runner_process.py and
detokenizer_process.py).
---
Nitpick comments:
In @.claude/skills/update-codeowners/SKILL.md:
- Line 38: Replace the wording "exact same" in the checklist item "- [ ] No
accidental duplicate rule for the exact same path pattern." with a tighter
phrase such as "identical" or "the same" (e.g., "- [ ] No accidental duplicate
rule for the same path pattern."), ensuring the checklist line text is updated
accordingly to improve concision.
In `@mllm-kernel/benchmarks/bench_store_cache.py`:
- Around line 36-89: _in _profile_path_, the final torch.cuda.synchronize() is
executed after the profiler context exits, so in-flight CUDA work from the last
iteration may not be captured; move the synchronization into the profiler block
immediately after the iters loop (i.e., call torch.cuda.synchronize() while
still inside the with profile(...) as prof: block right after running fn() iters
times) and remove or keep the external one accordingly so that the profiler
records completed CUDA activity for the last iteration.
- Around line 62-71: The code defines both time_attr and sort_key using the same
condition, making sort_key redundant; update the benchmarking logic to remove
sort_key and use time_attr everywhere instead (replace any usage of sort_key
with time_attr), keeping the original conditional that checks events and
hasattr(events[0], "self_cuda_time_total") to set time_attr; ensure any sorting
or attribute lookups that previously referenced sort_key now reference time_attr
so behavior is unchanged.
In `@mllm-kernel/mllm_kernel/cuda/csrc/store_cache.cuh`:
- Around line 138-149: The get_kernel template can hit Panic when a
user-provided num_split doesn't satisfy kElementBytes alignment (e.g.,
num_split=4 with row_bytes=128); change handling so this is recoverable: either
add explicit validation of num_split in the Python-facing store_cache dispatcher
(check row_bytes % (num_split * 128) == 0 and raise a clear Python error) or
replace the Panic call inside get_kernel with a descriptive runtime check (use
RuntimeCheck or equivalent to emit an error message naming get_kernel, the
requested num_split, kElementBytes and the required alignment) so callers get a
clear, non-crashing diagnostic; update references to get_kernel, store_cache,
store_kernel and Panic accordingly.
In `@mllm-kernel/mllm_kernel/cuda/jit/store_cache.py`:
- Around line 56-62: The current manual cache _KERNEL_CACHE duplicates the
memoization already provided by the `@jit` decorator's internal cache_once for
each kernel variant; update the code to keep the manual dict but add a concise
inline comment above _KERNEL_CACHE explaining its purpose: that _KERNEL_CACHE
stores the Python-level wrapper function returned by _make_store_cache_kernel
(so callers reuse the exact wrapper object) while cache_once/@jit handles the
compiled module memoization, and reference _get_kernel, _make_store_cache_kernel
and cache_once/@jit in that comment to clarify why both layers exist.
- Around line 18-24: The PDL capability check _is_arch_support_pdl currently
reads torch.cuda.get_device_capability() for the current device and caches that
result once, which breaks on multi-GPU systems and also cannot distinguish sm_90
vs sm_90a; update _is_arch_support_pdl to accept a device index/torch.device
parameter (or make the cache key include device) and query
torch.cuda.get_device_capability(device) so results are per-GPU, then propagate
that device argument from call sites in _make_store_cache_kernel and store_cache
(pass the tensor.device.index or tensor.device) so the correct GPU capability is
checked at runtime; also add a short comment near the check noting that PyTorch
returns (9,0) for both sm_90 and sm_90a and thus this check is conservative for
PDL which requires sm_90a.
- Around line 91-127: The store_cache function currently omits the row_bytes
alignment check present in can_use_store_cache; add a guard in store_cache
(after computing row_bytes = row_bytes or k.shape[-1] * k.element_size() and
before calling _get_kernel or JIT) that validates row_bytes % 4 == 0 and raises
a clear ValueError (or TypeError) with a descriptive message if the check fails
so callers who bypass can_use_store_cache get an immediate, useful error;
reference the store_cache function and can_use_store_cache behavior when
implementing the guard and keep the check before selecting num_split or invoking
kernel(k, v, k_cache, v_cache, indices, num_split).
- Around line 65-88: can_use_store_cache currently uses `@cache_once` which will
permanently cache False when _get_kernel raises transient JIT compilation
errors; update the function to avoid caching transient failures (e.g., remove or
modify `@cache_once` for error paths, or catch exceptions and re-raise after
logging so the cache only stores successful results), add a docstring note on
caching semantics for callers, and move the successful-return into a
try/except/else pattern so the return True resides in the else block rather than
inside the try — refer to can_use_store_cache, the `@cache_once` decorator, and
_get_kernel to locate and change the behavior.
In `@pymllm/engine/io_struct.py`:
- Around line 85-89: The __getitem__ in GenerateReqInput currently returns self
when batch_size == 1, causing callers (e.g., _forward_batch) that mutate fields
like output_token_ids to alias and modify the original; change __getitem__ so
that when batch_size == 1 it returns a shallow copy of the GenerateReqInput
instance instead of self (either by invoking an existing clone/copy helper or by
constructing a new GenerateReqInput with the same field values), ensuring
mutable fields (like output_token_ids) are new lists to avoid aliasing.
- Around line 8-27: Add concise class-level docstrings to the public dataclasses
BaseReq and BaseBatchReq describing their purpose and the semantics of the
rid/rids fields (that rid is an optional single or list of request identifiers
and regenerate_rid returns/refreshes them, and that rids is a list of request
identifiers with regenerate_rids refreshing them), and include brief notes on
return types for regenerate_rid and regenerate_rids; update the docstrings near
the class definitions of BaseReq and BaseBatchReq to improve discoverability and
comply with the public API documentation guideline.
In `@pymllm/engine/launch.py`:
- Around line 30-36: Add docstrings to the Engine class and its __init__ method:
for the class describe its responsibility (managing subprocesses,
request/response process, and model setup) and for __init__ document parameters
(none), what it configures (_subprocesses, _rr_process, logging, default torch
dtype, model/tokenizer checks), and enumerate raised exceptions (ValueError when
model/tokenizer paths are missing or when an unsupported dtype is configured).
Reference the Engine class and its __init__ constructor and ensure the docs
mention the side effects of calling _config_logging, _set_default_torch_dtype,
and _check_model_and_tokenizer so callers know about potential ValueError
raises.
- Around line 270-291: Both static helpers _wait_for_final_result and
_stream_results declare an unused rid parameter; remove rid from their
signatures and any internal references, leaving only (state: ReqState) as the
parameter, update their return/type hints to remain the same, and then update
all call sites that invoke _wait_for_final_result(...) and _stream_results(...)
to pass only the ReqState instance; ensure imports/annotations still reference
ReqState and run tests/lint to confirm ARG004 is resolved.
In `@pymllm/mem_cache/__init__.py`:
- Around line 20-37: The __all__ list in pymllm.mem_cache is not sorted which
triggers Ruff RUF022; reorder the entries in the __all__ list into
case-sensitive alphabetical order (e.g., arrange entries like EvictResult,
InsertResult, KVPool, MatchResult, RadixCache, RadixKey, ReqToTokenPool,
TokenToKVPoolAllocator, TreeNode, and the hash_* and factory names such as
hash_bytes, hash_token_ids, hash_to_int64, make_full_attention_net_mem_pool,
make_req_to_token_pool) so the public API names are sorted and Ruff stops
flagging the file.
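Sorting the names listed above case-sensitively (the order Python's sorted() produces, which is what RUF022 expects) gives roughly:

__all__ = [
    "EvictResult",
    "InsertResult",
    "KVPool",
    "MatchResult",
    "RadixCache",
    "RadixKey",
    "ReqToTokenPool",
    "TokenToKVPoolAllocator",
    "TreeNode",
    "hash_bytes",
    "hash_to_int64",
    "hash_token_ids",
    "make_full_attention_net_mem_pool",
    "make_req_to_token_pool",
]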
In `@pymllm/mem_cache/memory_pool.py`:
- Line 97: The allocation of a temporary tensor to compute element size in the
_row_bytes assignment should be replaced with a dtype-safe, zero-allocation
computation; update the code that computes self._row_bytes (currently using
k_row_dim and dtype) to derive bytes-per-element from the dtype directly (use
torch.finfo(dtype).bits // 8 for floating dtypes and torch.iinfo(dtype).bits //
8 for integer dtypes, or use a try/except that falls back between finfo and
iinfo) and multiply by k_row_dim so no throwaway tensor is created.
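A small sketch of the zero-allocation element-size computation (the names `k_row_dim`, `dtype`, and `_row_bytes` are taken from the review; recent PyTorch releases may also expose `dtype.itemsize` directly):

import torch


def dtype_size_bytes(dtype: torch.dtype) -> int:
    """Bytes per element for a torch dtype, computed without allocating a tensor."""
    try:
        return torch.finfo(dtype).bits // 8  # floating-point dtypes
    except TypeError:
        return torch.iinfo(dtype).bits // 8  # integer dtypes


# Inside the pool's __init__ (attribute names assumed from the review):
# self._row_bytes = k_row_dim * dtype_size_bytes(dtype)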
In `@pymllm/mem_cache/radix_cache.py`:
- Line 70: RadixKey.__slots__ and TreeNode.__slots__ are unsorted; change each
__slots__ tuple to use a consistently ordered (natural/alphanumeric) sequence to
satisfy Ruff RUF023 — locate RadixKey.__slots__ ("token_ids", "extra_key") and
the TreeNode.__slots__ declaration and reorder the strings alphabetically (or by
natural sort) so member names are sorted, then run the linter to confirm the
warning is resolved.
- Line 129: Replace the latent-footgun defaultdict usage so TreeNode.children is
a plain dict instead of defaultdict(TreeNode); update any places that previously
relied on implicit creation to explicitly construct and assign a new TreeNode
(calling TreeNode(...) so _next_node_id() and last_access_time are invoked only
when intended). Specifically change the attribute declaration in class TreeNode
from "self.children: Dict[Any, TreeNode] = defaultdict(TreeNode)" to a plain
dict, and audit call-sites that check/insert children (e.g., where code
currently does "if ck not in node.children: ..." or might index directly) to use
explicit node.children[ck] = TreeNode(...) before accessing the child.
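A sketch of the explicit-construction pattern for TreeNode.children (node-id bookkeeping and the other TreeNode fields are elided; only the dict handling is shown):

import time
from typing import Any, Dict


class TreeNode:
    def __init__(self) -> None:
        # Plain dict: looking up a missing key no longer creates a node as a side effect.
        self.children: Dict[Any, "TreeNode"] = {}
        self.last_access_time = time.monotonic()


def get_or_create_child(node: TreeNode, ck: Any) -> TreeNode:
    child = node.children.get(ck)
    if child is None:
        # Construct explicitly so timestamps (and node-id allocation in the
        # real class) only happen when a new child is actually intended.
        child = TreeNode()
        node.children[ck] = child
    return child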
In `@pymllm/orchestrator/ipc_utils.py`:
- Around line 14-18: Add a cleanup helper to remove stale IPC socket files and
wire it into startup/shutdown: implement a new function cleanup_ipc_dir() that
lists files in _IPC_DIR (e.g., with glob or os.listdir), removes only known
socket file patterns (or all files in that directory) using os.remove with
exception handling, call cleanup_ipc_dir() from _ensure_ipc_dir() at startup to
purge leftovers, and also register it with atexit.register(cleanup_ipc_dir) so
sockets created by this process are removed on orderly shutdown; reference
_IPC_DIR, _ensure_ipc_dir, and the new cleanup_ipc_dir function when making
changes.
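A minimal sketch of the cleanup helper (the directory path and the *.ipc filename pattern are placeholders; the real `_IPC_DIR` constant lives in ipc_utils.py):

import atexit
import contextlib
import glob
import os

_IPC_DIR = "/tmp/pymllm_ipc"  # placeholder for the module's real constant


def cleanup_ipc_dir() -> None:
    """Remove stale IPC socket files left behind by previous runs."""
    for path in glob.glob(os.path.join(_IPC_DIR, "*.ipc")):
        with contextlib.suppress(OSError):
            os.remove(path)


def _ensure_ipc_dir() -> None:
    os.makedirs(_IPC_DIR, exist_ok=True)
    cleanup_ipc_dir()                  # purge leftovers at startup
    atexit.register(cleanup_ipc_dir)   # remove our own sockets on orderly shutdown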
In `@pymllm/orchestrator/model_runner_process.py`:
- Around line 50-56: Wrap the body of ModelRunnerProcess.event_loop in a
try/except that catches exceptions from self._recv_from_scheduler.recv_pyobj,
self._forward_batch, and self._send_to_scheduler.send_pyobj, log the error with
context (including the exception) and continue the loop instead of letting the
process crash; on deserialization/malformed message errors drop or ack the
offending batch and continue, and on persistent socket errors optionally backoff
or break to let run_model_runner_process invoke shutdown. Ensure logging
references event_loop, _recv_from_scheduler.recv_pyobj, _forward_batch and
_send_to_scheduler.send_pyobj so the failing component is clear.
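A sketch of the guarded loop, written as the method body with the socket attribute names used in the review (the backoff interval and error-handling policy are illustrative, not the project's actual implementation):

import logging
import time

logger = logging.getLogger(__name__)


def event_loop(self) -> None:
    """Receive batches, run the model, and return results without crashing on bad input."""
    while True:
        try:
            batch = self._recv_from_scheduler.recv_pyobj()
        except Exception:
            logger.exception("recv_pyobj failed in ModelRunnerProcess.event_loop")
            time.sleep(0.1)  # brief backoff on persistent socket errors
            continue
        try:
            result = self._forward_batch(batch)
        except Exception:
            logger.exception("_forward_batch failed; dropping malformed batch")
            continue
        try:
            self._send_to_scheduler.send_pyobj(result)
        except Exception:
            logger.exception("send_pyobj failed in ModelRunnerProcess.event_loop")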
In `@pymllm/orchestrator/request_response_process.py`:
- Around line 91-92: The type-check for request.rid in RequestResponseProcess is
raising ValueError but should raise TypeError for type failures; update the
conditional in request_response_process.py (the block that checks "if not
isinstance(request.rid, str):") to raise TypeError with an appropriate message
(e.g., indicating that rid must be a str) instead of ValueError so the exception
semantics match the type validation.
- Around line 68-82: The type-checker warning is caused by passing a
zmq.asyncio.Context into create_zmq_socket (typed for sync zmq.Context) from
start in request_response_process.py; fix by updating
ipc_utils.create_zmq_socket signature to accept Union[zmq.Context,
zmq.asyncio.Context] (and corresponding Socket/AsyncSocket unions or add
overloads) so both sync and async contexts are typed, or if you prefer a
localized change, add a precise "# type: ignore[assignment]" (with a short
comment referencing create_zmq_socket expecting sync Context) on the two
create_zmq_socket calls in start to suppress the warning; reference the start
method and create_zmq_socket when making the change.
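A sketch of the widened signature with typing overloads (the parameter list of create_zmq_socket is an assumption; only the Context/Socket typing is the point here):

from typing import Union, overload

import zmq
import zmq.asyncio


@overload
def create_zmq_socket(ctx: zmq.Context, sock_type: int, endpoint: str, bind: bool = ...) -> zmq.Socket: ...
@overload
def create_zmq_socket(ctx: zmq.asyncio.Context, sock_type: int, endpoint: str, bind: bool = ...) -> zmq.asyncio.Socket: ...


def create_zmq_socket(
    ctx: Union[zmq.Context, zmq.asyncio.Context],
    sock_type: int,
    endpoint: str,
    bind: bool = False,
) -> Union[zmq.Socket, zmq.asyncio.Socket]:
    sock = ctx.socket(sock_type)
    if bind:
        sock.bind(endpoint)
    else:
        sock.connect(endpoint)
    return sock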
In `@pymllm/orchestrator/scheduler_process.py`:
- Around line 180-195: The process_batch_result method currently accepts an
unused parameter batch; either remove batch from the signature of
process_batch_result and update all callers (e.g., places that call
SchedulerProcess.process_batch_result) to stop passing the batch, or keep the
parameter and add a one-line comment inside process_batch_result explaining that
batch is reserved for future features (e.g., KV-cache or completion checks) to
avoid lint warnings; update any type hints or tests accordingly so the codebase
remains consistent with the chosen approach.
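If the keep-the-parameter option is chosen for process_batch_result, the documented-intent form could look like the sketch below (whether it actually silences a given linter depends on its configuration; the body is elided):

def process_batch_result(self, batch, result) -> None:
    # NOTE: `batch` is intentionally unused for now; it is reserved for
    # future KV-cache bookkeeping and completion checks.
    ...  # existing result handling goes here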
k = torch.randn(args.batch_size, row_dim, device=device, dtype=dtype)
v = torch.randn(args.batch_size, row_dim, device=device, dtype=dtype)
# Use unique indices to avoid write conflicts.
indices = torch.randperm(args.num_slots, device=device)[: args.batch_size].to(
    torch.int64
)
k_cache = torch.zeros(args.num_slots, row_dim, device=device, dtype=dtype)
v_cache = torch.zeros_like(k_cache)
Missing validation: batch_size > num_slots silently produces fewer indices than expected.
torch.randperm(args.num_slots)[:args.batch_size] silently truncates to num_slots elements when batch_size > num_slots, causing a shape mismatch between k/v (shape [batch_size, ...]) and indices (shape [num_slots]). This would surface as a confusing runtime error.
Proposed guard
+if args.batch_size > args.num_slots:
+    raise ValueError(
+        f"batch_size ({args.batch_size}) must not exceed num_slots ({args.num_slots})"
+    )
+
 k = torch.randn(args.batch_size, row_dim, device=device, dtype=dtype)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm-kernel/benchmarks/bench_store_cache.py` around lines 125 - 132, Add an
explicit validation that args.batch_size <= args.num_slots before creating
indices so we don't silently truncate; if the check fails, raise a clear
ValueError describing both args.batch_size and args.num_slots. Locate the block
where k, v, indices, k_cache, and v_cache are created (variables k, v, indices,
k_cache, v_cache) and insert the guard immediately before computing indices =
torch.randperm(...), ensuring the check runs on the same device/CPU context as
the rest of the setup.
def to_request_dict(self) -> Dict[str, Any]:
    payload: Dict[str, Any] = {}
    for key, value in {
        "rid": self.rid,
        "text": self.text,
        "input_ids": self.input_ids,
        "sampling_params": self.sampling_params,
        "return_logprob": self.return_logprob,
        "logprob_start_len": self.logprob_start_len,
        "top_logprobs_num": self.top_logprobs_num,
        "stream": self.stream,
        "image_data": self.image_data,
        "video_data": self.video_data,
        "audio_data": self.audio_data,
        "lora_path": self.lora_path,
        "session_params": self.session_params,
    }.items():
        if value is not None:
            payload[key] = value
    payload.update(self.extra_options)
    return payload
extra_options can silently overwrite core request fields.
payload.update(self.extra_options) on Line 132 runs after populating core keys like rid, text, and input_ids. If extra_options contains any of these keys, the core values are silently replaced. This could cause subtle data corruption downstream.
🛡️ Proposed fix: warn or reject collisions
+    _RESERVED = frozenset(payload.keys())
+    collisions = _RESERVED & self.extra_options.keys()
+    if collisions:
+        raise ValueError(
+            f"extra_options keys {collisions} collide with reserved fields"
+        )
     payload.update(self.extra_options)
     return payload

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/engine/io_struct.py` around lines 113 - 133, The to_request_dict
method currently calls payload.update(self.extra_options) which allows
self.extra_options to silently overwrite core fields (e.g., "rid", "text",
"input_ids", "sampling_params", etc.); modify to_request_dict to check for key
collisions between self.extra_options and the core keys collected in payload
before applying them, and either raise a ValueError (reject) or emit a clear
warning via the module logger (warn) when a collision is detected (choose one
behavior consistently), and only merge non-colliding keys into payload if
proceeding; reference the to_request_dict function, the payload dict, and
self.extra_options when implementing this guard.
def __init__(self):
    self._subprocesses: List[mp.Process] = []
    self._rr_process: Optional[RequestResponseProcess] = None
    self._config_logging()
    self._set_default_torch_dtype()
    self._check_model_and_tokenizer()
self._loop is never initialized in __init__, risking AttributeError.
_launch_processes sets self._rr_process at line 154 before self._loop at line 162. If an exception fires between those two lines (e.g., asyncio.new_event_loop() failure), atexit will invoke shutdown(), which passes the if self._rr_process is not None guard but then crashes on the undeclared self._loop — silently swallowed by the bare except. The same AttributeError would also surface if generate() is called without a prior launch() call.
Initialize _loop to None in __init__ and guard the access in shutdown():
🐛 Proposed fix
 def __init__(self):
     self._subprocesses: List[mp.Process] = []
     self._rr_process: Optional[RequestResponseProcess] = None
+    self._loop: Optional[asyncio.AbstractEventLoop] = None
     self._config_logging()
     self._set_default_torch_dtype()
     self._check_model_and_tokenizer()

 def shutdown(self) -> None:
     if self._rr_process is not None:
         try:
-            self._loop.run_until_complete(self._rr_process.shutdown())
+            if self._loop is not None:
+                self._loop.run_until_complete(self._rr_process.shutdown())
         except Exception:
-            pass
+            logger.exception("Error during RR process shutdown")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/engine/launch.py` around lines 31 - 36, The __init__ currently never
initializes self._loop which can cause AttributeError if _launch_processes sets
self._rr_process before creating the asyncio loop or if generate() is called
before launch(); initialize self._loop = None in __init__ (add to the existing
initializations) and update shutdown() (and any other places that access
self._loop, e.g., generate()) to guard usage with "if self._loop is not None"
(or hasattr(self, "_loop") and self._loop is not None) before calling methods on
it so accesses are safe even if loop creation failed or launch() wasn’t run.
def _launch_processes(self) -> None:
    """Spawn all subprocess workers and wire up ZMQ IPC channels."""
    mp.set_start_method("spawn", force=True)
mp.set_start_method("spawn", force=True) is a process-wide global mutation — move it to the entry point.
Calling this inside an instance method silently overrides any start method set by the host application or other libraries on Linux (where fork is the default). If this method were ever called more than once, force=True keeps the call from raising, but it also silently hides the misconfiguration. Per the Python docs, set_start_method should be called exactly once, inside an if __name__ == "__main__" guard in the entry-point script.
💡 Suggested approach
Remove the call from _launch_processes and place it in the server entry-point script:
-    mp.set_start_method("spawn", force=True)

In pymllm/server/launch.py or equivalent entry script:

if __name__ == "__main__":
    mp.set_start_method("spawn")
    engine = Engine()
    engine.launch()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/engine/launch.py` at line 44, mp.set_start_method("spawn", force=True)
is a global process-wide mutation and must not live inside the instance method
_launch_processes; remove that call from pymllm.engine.launch._launch_processes
(or any Engine.launch helper) and instead set the start method exactly once in
the program entrypoint: add mp.set_start_method("spawn") inside an if __name__
== "__main__" guard before constructing/launching Engine (e.g., before Engine()
/ engine.launch()) so the start method is configured at startup and not mutated
at runtime.
# Wait for readiness signals
for _, reader, name in procs_and_readers:
    try:
        msg = reader.recv()
    except EOFError:
        raise RuntimeError(f"{name} process died before signalling readiness")
    if msg.get("status") != "ready":
        raise RuntimeError(f"{name} process failed to initialise: {msg}")
    logger.info("%s process ready", name)
reader.recv() has no timeout — a hanging subprocess blocks the main process forever.
If any subprocess hangs during initialization (e.g., waiting on GPU memory or a stuck IPC bind), reader.recv() will block indefinitely. Use poll() with a timeout before recv():
🐛 Proposed fix
+_READINESS_TIMEOUT_S = 120  # seconds
 for _, reader, name in procs_and_readers:
     try:
-        msg = reader.recv()
+        if not reader.poll(_READINESS_TIMEOUT_S):
+            raise RuntimeError(
+                f"{name} process did not signal readiness within "
+                f"{_READINESS_TIMEOUT_S}s"
+            )
+        msg = reader.recv()
     except EOFError as err:
-        raise RuntimeError(f"{name} process died before signalling readiness")
+        raise RuntimeError(
+            f"{name} process died before signalling readiness"
+        ) from err

🧰 Tools
🪛 Ruff (0.15.1)
[warning] 148-148: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
[warning] 148-148: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 150-150: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/engine/launch.py` around lines 143 - 151, The readiness loop blocks
indefinitely because reader.recv() has no timeout; update the loop that iterates
procs_and_readers to call reader.poll(timeout_sec) (e.g., a few seconds
configurable) before calling reader.recv(), raise a RuntimeError if poll returns
False (timeout) and include the process name in the message, keep the existing
EOFError handling for recv, and only call reader.recv() when poll indicates data
is available; reference procs_and_readers, reader.recv, reader.poll and the
RuntimeError raised for timeouts.
def event_loop(self) -> None:
    """Infinite scheduling loop."""
    logger.info("SchedulerProcess event loop started")
    while True:
        self.recv_requests()
        self.process_input_requests()
        batch = self.get_next_batch_to_run()
        if batch is not None:
            result = self.run_batch(batch)
            self.process_batch_result(batch, result)
        self.stream_output()
Event loop will spin-wait at 100% CPU when idle.
recv_requests() uses poll(timeout=0) which returns immediately when no messages are available. When the _waiting_queue is empty, the loop cycles through all steps with no blocking or sleeping, consuming a full CPU core. Add a small sleep or use a non-zero poll timeout when there's no pending work.
Proposed fix: add a short sleep when idle
 def event_loop(self) -> None:
     """Infinite scheduling loop."""
     logger.info("SchedulerProcess event loop started")
     while True:
         self.recv_requests()
         self.process_input_requests()
         batch = self.get_next_batch_to_run()
         if batch is not None:
             result = self.run_batch(batch)
             self.process_batch_result(batch, result)
+        else:
+            time.sleep(0.001)  # avoid busy-spin when idle
         self.stream_output()

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/orchestrator/scheduler_process.py` around lines 98 - 108, The
event_loop currently busy-waits because recv_requests() uses poll(timeout=0) and
when get_next_batch_to_run() returns None the loop immediately repeats; modify
event_loop (or adjust recv_requests) to avoid spinning by introducing a small
blocking wait when idle: after calling get_next_batch_to_run() and finding batch
is None (and _waiting_queue is empty / no pending work), either call
recv_requests() with a non-zero poll timeout or insert a short sleep (e.g.,
time.sleep(0.01)) before the next iteration so the loop yields CPU; update
references in event_loop, recv_requests, and any checks against
_waiting_queue/get_next_batch_to_run to implement this idle backoff.
batch = {
    "requests": batch_requests,
    "batch_id": id(batch_requests),
    "created_at": time.time(),
}
id(batch_requests) is a poor choice for batch_id.
id() returns the CPython memory address of the list, which can be reused after garbage collection. This makes batch IDs non-unique over time, complicating tracing and debugging. Use a monotonic counter or uuid.uuid4().hex instead.
Proposed fix using a counter
Add to __init__:

+    self._next_batch_id: int = 0

Then in get_next_batch_to_run:

 batch = {
     "requests": batch_requests,
-    "batch_id": id(batch_requests),
+    "batch_id": self._next_batch_id,
     "created_at": time.time(),
 }
+self._next_batch_id += 1
 return batch

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/orchestrator/scheduler_process.py` around lines 155 - 159, The
batch_id should not use id(batch_requests); update get_next_batch_to_run to
generate a stable unique ID instead — either add a monotonic counter attribute
(e.g., self._batch_counter initialized in __init__ and incremented when creating
a batch) or import uuid and use uuid.uuid4().hex to populate "batch_id" on the
batch dict; update the batch creation in get_next_batch_to_run (where batch and
batch_requests are assembled) to use the new generator and ensure the counter or
uuid import is added to the class/module.
def stream_output(self) -> None:
    """Send finished token-ID outputs to the DetokenizerProcess."""
    while self._finished:
        item = self._finished.pop(0)
        self._send_to_detokenizer.send_pyobj(item)
self._finished.pop(0) is O(n) — use deque for _finished.
_finished is a list, and pop(0) shifts all remaining elements on every call. You already import deque and use it for _waiting_queue. Apply the same here.
Proposed fix
Change the declaration on Line 60:

-        self._finished: List[Dict[str, Any]] = []
+        self._finished: Deque[Dict[str, Any]] = deque()

Then in stream_output:

 def stream_output(self) -> None:
     """Send finished token-ID outputs to the DetokenizerProcess."""
     while self._finished:
-        item = self._finished.pop(0)
+        item = self._finished.popleft()
         self._send_to_detokenizer.send_pyobj(item)

As per coding guidelines, "Prioritize production-ready code quality by evaluating time and space complexity of algorithms and data structures, and suggest more efficient alternatives for operations with high complexity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/orchestrator/scheduler_process.py` around lines 201 - 205, The
stream_output method currently uses self._finished.pop(0) which is O(n); change
the underlying container _finished from a list to a collections.deque (same type
used for _waiting_queue) and replace pop(0) with popleft() in the stream_output
method so removals are O(1); update any initialization of _finished to use
deque([...]) and keep existing uses of
self._send_to_detokenizer.send_pyobj(item) unchanged.
self._zmq_ctx: zmq.Context = None
self._recv_from_rr: zmq.Socket = None
self._send_to_scheduler: zmq.Socket = None
Type annotations are incorrect — None is not a valid zmq.Context or zmq.Socket.
These attributes are initialized to None but annotated with non-Optional types, which is both misleading and will cause type-checker errors. The other process files in this PR (e.g., model_runner_process.py Line 30-32, detokenizer_process.py Line 31-33) correctly use Optional[...].
Proposed fix
-        self._zmq_ctx: zmq.Context = None
-        self._recv_from_rr: zmq.Socket = None
-        self._send_to_scheduler: zmq.Socket = None
+        self._zmq_ctx: Optional[zmq.Context] = None
+        self._recv_from_rr: Optional[zmq.Socket] = None
+        self._send_to_scheduler: Optional[zmq.Socket] = None

You'll also need to add Optional to the imports on Line 10:

-from typing import Any, Dict, List
+from typing import Any, Dict, List, Optional

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/orchestrator/tokenizer_process.py` around lines 30 - 32, Change the
three attributes in tokenizer_process.py to use Optional types: update the
annotations for self._zmq_ctx, self._recv_from_rr, and self._send_to_scheduler
to Optional[zmq.Context] and Optional[zmq.Socket] respectively, and add Optional
to the imports on line 10 so the type hints correctly allow None (matching the
pattern used in model_runner_process.py and detokenizer_process.py).
    TODO: replace with real tokenizer call.
    """
    text = raw_request.get("text", "")
Unused variable text — flagged by Ruff (F841).
The text variable is assigned but never read in the placeholder. Prefix it with _ to signal intent or remove the assignment.
Proposed fix
-    text = raw_request.get("text", "")
+    _text = raw_request.get("text", "")
🧰 Tools
🪛 Ruff (0.15.1)
[error] 73-73: Local variable text is assigned to but never used
Remove assignment to unused variable text
(F841)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pymllm/orchestrator/tokenizer_process.py` at line 73, The assignment to the
local variable "text" in tokenizer_process.py (raw_request.get("text", "")) is
unused and triggers a Ruff F841; either remove the assignment or rename it to
"_text" to indicate an intentionally unused value. Update the assignment at the
top of the tokenizer/process function (the line with raw_request.get("text",
"")) to use "_text" or delete that line, and run the linter to confirm the F841
warning is resolved.
Summary by CodeRabbit
New Features
Dependencies
Documentation
Tests