Training Platform Core: minimal MLP end-to-end training on Siracusa#182
runwangdl wants to merge 28 commits into pulp-platform:devel
Conversation
Bring in the minimum set of changes from the TrainingPlatform branch needed to generate working C code for an end-to-end MLP training graph (forward + backward + gradient accumulation + SGD weight update) on the Siracusa platform, both untiled and tiled. Scope is deliberately narrow: only MLP. No Conv/Pool/Norm gradients, no PULPTrainlib submodule, no new C kernel sources. The four operators involved (Gemm, SoftmaxCrossEntropyLoss + Grad, SGD, InPlaceAccumulatorV2) are all implemented as inline template strings, so nothing new lands in TargetLibraries/.

Operators:

- InPlaceAccumulatorV2: new ORT com.microsoft operator with full Parser/Layer/TypeChecker/Template/TileConstraint/Binding scaffolding.
- SoftmaxCrossEntropyLoss: the parser now accepts both the 1-output (legacy, log_prob only) and 2-output (loss + log_prob) signatures; the checker picks the correct binding via checkOutputType based on the 'loss' key. New PULPSoftmaxCrossEntropyLossDualOutputBindings and SoftmaxCrossEntropyLossDualOutputTileConstraint handle the dual-output case.
- SGD: SGDTemplate now aliases weight_updated onto weight so the tiled egress DMA writes updated weights back to the weight's L2 buffer.
- Gemm: tiny FloatGemmTemplate import cleanup.

Framework fixes (all operator-agnostic):

- DeeployTypes.VariableBuffer.isLive: treat is_input / is_output buffers as live across the whole step range.
- TilingExtension.TilerExtension: skip zero-sized in-place alias outputs in the MiniMalloc CSV and resolve their addrSpace from the alias target after the solver runs.
- TilingExtension.TilingVariableReplacement: change the per-tile reference update from '*ref = array[i];' to 'ref = &array[i];' to avoid permanently mutating static PI_L1 arrays between training steps.

Training test framework:

- testMVPTraining.py, testMVPOptimizer.py, generateTrainingNetwork.py, generateOptimizerNetwork.py: new codegen entry points for the two-graph (TrainingNetwork + OptimizerNetwork) architecture.
- deeployTrainingRunner.py, deeployTrainingRunner_siracusa.py, deeployTrainingRunner_tiled_siracusa.py and their testUtils helper: training test runners mirroring the inference deeployRunner pattern.
- testUtils/codeGenerate.py: training-specific codegen helpers (generateTrainingTestNetwork, generateOptimizerTestNetwork, build_shared_buffer_maps, _patch_shared_buffers / _shared_arenas, _ensure_training_l1_capacity) and the L3 hex-dump path.
- testUtils/core/execution.py: L3 flash-image detection and load.
- testUtils/tilingUtils.py: TrainingSBTiler.
- CMakeLists.txt (top + Siracusa): TRAINING-gated targets that build TrainingNetwork.c and OptimizerNetwork.c side by side.
- DeeployTest/Platforms/Siracusa/src/deeploytraintest.c: training test harness that drives the two networks across multiple training steps with gradient accumulation.

Scope deliberately excluded (future follow-up PRs):

- GAP9 platform port (separate PR, depends on this).
- All Conv/Pool/Norm/Relu gradient operators.
- MSELoss/MSELossGrad, GroupNormalization, dual-output MaxPool.
- PULPTrainlib submodule (only needed for ConvGrad kernels).
- FP32 forward AvgPool/BatchNorm/MaxPool/Layernorm/Relu kernels.

Verified locally: testMVPTraining.py -t simplemlp_train and testMVPOptimizer.py -t simplemlp_optimizer both succeed on Siracusa with --n-accum 1, producing TrainingNetwork.c (2027 lines) and OptimizerNetwork.c containing Gemm / SoftmaxCrossEntropyLoss / SoftmaxCrossEntropyLossGrad / InPlaceAccumulatorV2 / SGD.
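The per-step contract the two graphs implement can be sketched in a few lines of plain Python: the TrainingNetwork accumulates micro-batch gradients (InPlaceAccumulatorV2 semantics: overwrite on lazy reset, otherwise add in place), and the OptimizerNetwork then applies one SGD step. All names below are illustrative, not the generated C symbols.

```python
def accumulate(grad_acc, grad, lazy_reset):
    # InPlaceAccumulatorV2: first micro-batch overwrites, the rest add in place
    for i, g in enumerate(grad):
        grad_acc[i] = g if lazy_reset else grad_acc[i] + g
    return grad_acc

def sgd_step(weight, grad_acc, lr, n_accum):
    # weight_updated aliases weight, so the update happens in place
    for i in range(len(weight)):
        weight[i] -= lr * grad_acc[i] / n_accum
    return weight

w = [1.0, -2.0]
acc = [0.0, 0.0]
for step, g in enumerate([[1.0, 2.0], [3.0, 4.0]]):  # two micro-batches
    accumulate(acc, g, lazy_reset = (step == 0))
sgd_step(w, acc, lr = 0.5, n_accum = 2)
print(w)  # [0.0, -3.5]
```
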
📝 Walkthrough

Adds training support: new InPlaceAccumulatorV2 operator (parser, checker, layer, PULP template, tiling constraint, bindings), SoftmaxCrossEntropyLoss dual-output handling, SGD alias/allocation changes, extensive training/optimizer code-generation and runner scripts, tiler/allocation alias fixes, and CMake/runtime test-harness updates for training workflows.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant Generator
    participant Deployer
    participant Tiler
    participant CodeGen
    participant Build
    participant Simulator
    User->>CLI: invoke training runner
    CLI->>Generator: request training + optimizer generation
    Generator->>Deployer: import ONNX, mark grad buffers & aliases
    Deployer->>Tiler: request tiling (TrainingSBTiler)
    Tiler->>Deployer: return tiling schedules & replacements
    Deployer->>CodeGen: emit TrainingNetwork & OptimizerNetwork sources
    CodeGen->>Build: write files and patch shared buffer maps
    Build->>Simulator: run built binaries
    Simulator-->>User: stream logs/results
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1)
40-82: ⚠️ Potential issue | 🟡 Minor. Pipeline failure: formatting changes required by yapf.
The CI indicates yapf formatting is needed for lines 4-40 (import wrapping and NodeTemplate call wrapping). Please run the formatter to resolve:
```shell
yapf -i Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py` around lines 40 - 82, The template file's formatting doesn't match yapf rules: run the formatter (e.g., yapf -i Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py) to fix import wrapping and the long NodeTemplate/_PULPSGDTemplate call wrapping around referenceTemplate and the multi-line string; ensure the _PULPSGDTemplate(...) invocation and the assignment to referenceTemplate are wrapped/indented per yapf so the triple-quoted template string lines align with the project's style.
🧹 Nitpick comments (12)
DeeployTest/testUtils/deeployTrainingRunner.py (3)
18-21: PATH modification at module import time may have unintended side effects. Modifying os.environ['PATH'] at module load time affects the entire Python process globally. If this module is imported by other code (not just as an entry point), this could unexpectedly change path resolution for all subsequent subprocess calls. Consider moving this into the main() function to limit scope:

♻️ Suggested refactor

```diff
 from typing import Optional

-# gapy (gvsoc launcher) uses `#!/usr/bin/env python3`. Put /usr/bin first so
-# it resolves to /usr/bin/python3 which has all required packages (gapylib,
-# prettytable, …) rather than the minimal venv python.
-os.environ['PATH'] = '/usr/bin:' + os.environ.get('PATH', '')
-
 from .core import DeeployTestConfig, run_complete_test
```

Then at the start of main():

```python
def main(tiling_enabled: bool = False, default_platform: str = 'Siracusa', default_simulator: str = 'gvsoc'):
    # gapy (gvsoc launcher) uses `#!/usr/bin/env python3`. Put /usr/bin first so
    # it resolves to /usr/bin/python3 which has all required packages.
    os.environ['PATH'] = '/usr/bin:' + os.environ.get('PATH', '')
    ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/deeployTrainingRunner.py` around lines 18 - 21, The module currently mutates os.environ['PATH'] at import time which has global side effects; move that assignment into the entrypoint function main() so PATH is only changed when the script is executed (not on import). Locate the top-level os.environ['PATH'] modification and remove it from module scope, then add the same assignment at the start of the main(tiling_enabled: bool = False, default_platform: str = 'Siracusa', default_simulator: str = 'gvsoc') function so the change is scoped to execution; keep the exact string '/usr/bin:' + os.environ.get('PATH', '') and the explanatory comment near the new location.
97-100: Consider iterable unpacking for cleaner list construction. Per static analysis (Ruff RUF005), use iterable unpacking instead of concatenation:
♻️ Suggested fix
```diff
 if args.input_type_map:
-    gen_args.extend(['--input-type-map'] + list(args.input_type_map))
+    gen_args.extend(['--input-type-map', *args.input_type_map])
 if args.input_offset_map:
-    gen_args.extend(['--input-offset-map'] + list(args.input_offset_map))
+    gen_args.extend(['--input-offset-map', *args.input_offset_map])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/deeployTrainingRunner.py` around lines 97 - 100, Replace the list concatenation when extending gen_args with iterable unpacking: instead of gen_args.extend(['--input-type-map'] + list(args.input_type_map)) and the similar input_offset_map line, use gen_args.extend(['--input-type-map', *args.input_type_map]) and gen_args.extend(['--input-offset-map', *args.input_offset_map]) (apply to the gen_args variable and the args.input_type_map / args.input_offset_map uses) so the code uses iterable unpacking rather than building a new list.
103-116: Asymmetric default handling for l1 vs l2 arguments. Line 107 checks args.l2 != 1024000 before appending, but lines 105-106 don't have similar logic for l1. According to the context snippet from DeeployRunnerArgumentParser, the default for l1 is 64000. This inconsistency means --l1=64000 would be passed explicitly while --l2=1024000 would be omitted. If the intent is to omit default values from gen_args, apply consistent handling:
♻️ Suggested fix
```diff
 if getattr(args, 'defaultMemLevel', None):
     gen_args.append(f'--defaultMemLevel={args.defaultMemLevel}')
-if getattr(args, 'l1', None):
+if getattr(args, 'l1', None) and args.l1 != 64000:
     gen_args.append(f'--l1={args.l1}')
 if getattr(args, 'l2', None) and args.l2 != 1024000:
     gen_args.append(f'--l2={args.l2}')
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/deeployTrainingRunner.py` around lines 103 - 116, The gen_args building is inconsistent: args.l2 is only appended when not equal to its default (1024000) but args.l1 is always appended, causing the default 64000 to be emitted; update the branch for args.l1 in DeeployTest/testUtils/deeployTrainingRunner.py so it mirrors args.l2 by only appending f'--l1={args.l1}' when getattr(args, 'l1', None) is truthy and args.l1 != 64000 (the default defined in DeeployRunnerArgumentParser), keeping the existing behavior for args.l2 and leaving other flags (defaultMemLevel, memAllocStrategy, searchStrategy, profileTiling, plotMemAlloc) unchanged.

Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py (1)
31-54: Debug printf may cause unwanted output in production. Line 52 includes a printf statement that will output loss values during execution. While useful for debugging, this could flood logs in production or affect performance:

```c
printf(" [SCE] loss=%.6f\\r\\n", (double)${loss}[0]);
```

Consider either:

- Making it conditional on a debug flag (e.g., #ifdef DEBUG_LOSS)
- Removing it for production builds
- Using a logging mechanism that can be disabled
♻️ Suggested conditional approach
```diff
 ${loss}[0] = sce_total_loss / (float32_t)${batch};
+#ifdef DEBUG_SCE_LOSS
 printf(" [SCE] loss=%.6f\\r\\n", (double)${loss}[0]);
+#endif
 END_SINGLE_CORE
```

The numerical implementation using max-subtraction for stability is correct.
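For reference, the max-subtraction scheme the template relies on can be sketched standalone in Python (illustrative names, not the template's symbols): subtracting the row maximum before exponentiating keeps exp() finite for arbitrarily large logits.

```python
import math

def sce_loss(logits, label):
    # Stable log-softmax: subtract the row max, then loss = -log_prob[label]
    m = max(logits)
    log_sum = math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[label] - m - log_sum)

# math.exp(1002.0) alone would overflow; with max subtraction the loss is finite
loss = sce_loss([1000.0, 1001.0, 1002.0], 2)
assert math.isfinite(loss) and 0.0 < loss < 1.0
```
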
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py` around lines 31 - 54, The printf in the SoftmaxCrossEntropyLoss dual-output template (in referenceDualOutputTemplate) prints loss each run and should be guarded or removed; wrap the printf(" [SCE] loss=...") call with a compile-time or runtime guard (e.g., `#ifdef` DEBUG_LOSS / `#endif` or check a debug flag) or replace it with a pluggable logger that can be disabled so production builds do not emit or pay the cost of this output, ensuring you only change the printf line (and related surrounding template text) inside referenceDualOutputTemplate.

DeeployTest/deeployTrainingRunner_tiled_siracusa.py (1)
1-11: Consider using python3 in the shebang for explicit Python 3. The shebang #!/usr/bin/env python may resolve to Python 2 on some legacy systems. For consistency with the comment in deeployTrainingRunner.py (which mentions /usr/bin/python3), consider:

```diff
-#!/usr/bin/env python
+#!/usr/bin/env python3
```

Otherwise, the wrapper script is clean and correct.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/deeployTrainingRunner_tiled_siracusa.py` around lines 1 - 11, Update the script shebang to explicitly use Python 3 by changing the first line from "#!/usr/bin/env python" to "#!/usr/bin/env python3" so that running deeployTrainingRunner_tiled_siracusa.py invokes Python 3 consistently; no other code changes needed—this affects the interpreter used when executing the module that calls main(tiling_enabled = True) from testUtils.deeployTrainingRunner.

Deeploy/DeeployTypes.py (1)
339-349: LGTM: Liveness propagation for input/output buffers prevents premature deallocation. The change correctly ensures that input/output VariableBuffers (and their aliases) are treated as live throughout the network. This is essential for training, where weight buffers need to persist across forward/backward passes. One minor style issue: line 343 uses next as a variable name, which shadows the Python builtin (Ruff A001). Consider renaming:

♻️ Optional rename
```diff
 while len(queue) > 0:
-    next = queue.pop()
-    buffNext = ctxt.lookup(next)
+    next_name = queue.pop()
+    buffNext = ctxt.lookup(next_name)
     assert isinstance(buffNext, VariableBuffer)
     live |= buffNext._live or buffNext.is_input or buffNext.is_output
-    visited.add(next)
+    visited.add(next_name)
     queue |= buffNext.aliases - visited
```
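The alias-closure traversal behind this liveness check is a small worklist loop; a standalone sketch of the semantics, using plain dicts in place of the real NetworkContext/VariableBuffer classes (illustrative, not Deeploy's API):

```python
def is_live(name, buffers):
    # A buffer is live if it, or anything reachable through its alias set,
    # is marked live or is a network input/output.
    queue, visited, live = {name}, set(), False
    while queue:
        cur = queue.pop()
        buf = buffers[cur]
        live |= (buf.get("live", False) or buf.get("is_input", False)
                 or buf.get("is_output", False))
        visited.add(cur)                               # guards against alias cycles
        queue |= buf.get("aliases", set()) - visited
    return live

buffers = {
    "grad_acc": {"aliases": {"weight"}},
    "weight": {"is_input": True, "aliases": {"grad_acc"}},  # mutual aliases
}
print(is_live("grad_acc", buffers))  # True: aliases a network input
```
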
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/DeeployTypes.py` around lines 339 - 349, The code uses the builtin name 'next' as a local variable which shadows Python's builtin; in the liveness traversal loop (where variables live, queue, visited, buffNext and ctxt.lookup/VariableBuffer are used) rename the local variable 'next' to a non-conflicting identifier such as 'next_name' (and update its uses in queue.pop(), ctxt.lookup(next_name), visited.add(next_name), and queue |= buffNext.aliases - visited) to avoid the Ruff A001 builtin-shadowing violation.

DeeployTest/testUtils/tilingUtils.py (1)
63-66: Rename unused loop variable to satisfy linter. Static analysis correctly flags that lifetime is not used within the loop body. Only tensorName is needed to look up the buffer.

♻️ Proposed fix
```diff
-for tensorName, lifetime in tensorLifetimeMap.items():
+for tensorName, _lifetime in tensorLifetimeMap.items():
     buffer = ctxt.lookup(tensorName)
     if buffer.is_input:
         tensorLifetimeMap[tensorName] = (0, maxStepIdx)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/tilingUtils.py` around lines 63 - 66, The loop over tensorLifetimeMap uses an unused variable named lifetime which lints as unused; change the loop signature in the for loop that currently reads "for tensorName, lifetime in tensorLifetimeMap.items()" to use a throwaway name (e.g., "_" or "_lifetime") so only tensorName is considered, leaving the body (buffer = ctxt.lookup(tensorName); if buffer.is_input: tensorLifetimeMap[tensorName] = (0, maxStepIdx)) unchanged; this targets the for loop that iterates tensorLifetimeMap.items() and references ctxt.lookup and buffer.is_input.

DeeployTest/generateOptimizerNetwork.py (1)
28-28: Remove unused numpy import. numpy is imported as np but not used directly in this file. If build_shared_buffer_maps or other imported functions need numpy, they handle their own imports.

♻️ Proposed fix
```diff
-import numpy as np
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/generateOptimizerNetwork.py` at line 28, Remove the unused `numpy` import in generateOptimizerNetwork.py: delete the line importing `numpy as np` since no code in this module references `np` (functions like build_shared_buffer_maps and other imports should manage their own numpy usage); ensure no other references to `np` remain in the file after removal.

DeeployTest/testUtils/codeGenerate.py (3)
656-658: Shell injection warning is a false positive for internal tooling. The os.system call for clang-format uses paths derived from the dumpdir parameter, which is controlled by the test framework, not external user input. However, using subprocess.run would be more robust and avoid shell parsing issues with special characters in paths.

Proposed fix using subprocess.run

```diff
 clang_format = "{BasedOnStyle: llvm, IndentWidth: 2, ColumnLimit: 160}"
 for fname in ['TrainingNetwork.c', 'TrainingNetwork.h', 'testinputs.h', 'testoutputs.h']:
-    os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/{fname}')
+    subprocess.run(['clang-format', '-i', f'--style={clang_format}', f'{dumpdir}/{fname}'], check=False)
```

Note: the subprocess import already exists at top of file; add it if missing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/codeGenerate.py` around lines 656 - 658, The loop that calls os.system to run clang-format with a shell is brittle and flagged for shell injection; replace the os.system usage in the block where clang_format is defined and the for fname in [...] loop uses dumpdir by invoking subprocess.run with an argument list (e.g., ['clang-format','-i', f'--style={clang_format}', f'{dumpdir}/{fname}']), pass check=True to raise on failure, and ensure subprocess is imported; this removes shell parsing, keeps the same flags/behavior, and uses shell=False for safety.
1160-1162: Same os.system suggestion as for the training code generation. Consider using subprocess.run for consistency and robustness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/codeGenerate.py` around lines 1160 - 1162, Replace the os.system calls used to run clang-format with subprocess.run for robustness: in the block where clang_format is defined and the loop over ['OptimizerNetwork.c', 'OptimizerNetwork.h'] executes, call subprocess.run with a list of arguments (e.g. ['clang-format','-i','--style='+clang_format,f'{dumpdir}/{fname}']), set check=True to raise on non-zero exit, and avoid shell=True; optionally capture stdout/stderr or let it inherit for visibility. Ensure you import subprocess at the top and handle CalledProcessError (or let it propagate) so failures are not silently ignored.
308-311: Consider using explicit Optional type hints. PEP 484 prohibits implicit Optional (e.g., init_weights: List[np.ndarray] = None). Use explicit Optional[List[np.ndarray]] for clarity.

Proposed fix
```diff
 def generateTrainingTestInputsHeader(deployer: NetworkDeployer, all_mb_data: List[List[np.ndarray]],
                                      n_steps: int, n_accum: int, grad_buf_start_idx: int = 0,
                                      num_grad_inputs: int = 0,
-                                     learning_rate: float = 0.001, init_weights: List[np.ndarray] = None,
-                                     data_size: int = None) -> str:
+                                     learning_rate: float = 0.001, init_weights: Optional[List[np.ndarray]] = None,
+                                     data_size: Optional[int] = None) -> str:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/codeGenerate.py` around lines 308 - 311, The function signature for generateTrainingTestInputsHeader uses implicit Optional defaults (init_weights: List[np.ndarray] = None and data_size: int = None); import typing.Optional and change those annotations to explicit Optionals (e.g., init_weights: Optional[List[np.ndarray]] and data_size: Optional[int]) so the types follow PEP 484; update the function signature accordingly and add the import if missing.

DeeployTest/testMVPTraining.py (1)
34-96: Consider extracting shared inference helpers to a common module. These helper functions (_infer_num_data_inputs, _infer_total_mb, _infer_data_size, _infer_n_accum) are duplicated from generateTrainingNetwork.py. While the comment at lines 34-36 acknowledges this, extracting them to a shared utility module (e.g., testUtils/trainingUtils.py) would reduce maintenance burden and ensure consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testMVPTraining.py` around lines 34 - 96, The four helper functions (_infer_num_data_inputs, _infer_total_mb, _infer_data_size, _infer_n_accum) are duplicated; extract them into a shared utility module (e.g., testUtils/trainingUtils.py) and replace the local copies with imports. Move the implementations into the new module, export the four functions (and _load_reference_losses if desired), update imports in DeeployTest/testMVPTraining.py (use from testUtils.trainingUtils import _infer_num_data_inputs, _infer_total_mb, _infer_data_size, _infer_n_accum), remove the duplicate definitions from testMVPTraining.py, run unit tests and update any relative import paths or packaging so both generateTrainingNetwork.py and testMVPTraining.py import the single shared implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Deeploy/Targets/Generic/Parsers.py`:
- Around line 2728-2748: The parser only checks arity in parseNode/parseNodeCtxt
but must validate tensor extents: in parseNodeCtxt (using ctxt.lookup results
for buffer, gradient, lazy_reset_grad, data_out) verify that buffer.shape,
gradient.shape and data_out.shape are identical (same number of elements and
per-dimension sizes) and that lazy_reset_grad is a scalar/has size 1; if any
check fails, return ctxt, False (or otherwise reject the node) instead of
proceeding; compute operatorRepresentation['size'] from the validated common
shape (e.g., prod(buffer.shape) after checks) and do not accept the node when
shapes mismatch to prevent out-of-bounds reads/writes.
In `@Deeploy/Targets/Generic/TypeCheckers.py`:
- Around line 577-582: The override checkOutputType in TypeCheckers currently
only compares output count and skips the base SignPropTypeChecker validation;
update checkOutputType (method on this class) to first compute
actual_num_outputs from operatorRepresentation.get('loss', '') as it does now,
then if counts match delegate to the base implementation (call
super().checkOutputType(inputs, operatorRepresentation)) and return the logical
AND of the count check and the base check result so both count gating and the
inherited property/type validation are enforced.
In `@Deeploy/Targets/PULPOpen/Platform.py`:
- Around line 17-27: The file has formatting errors (import wrapping and mapping
formatting) causing CI failures; run the project formatter (yapf) on
Deeploy/Targets/PULPOpen/Platform.py and reformat the long import lists and any
mapping/dictionary blocks so they follow the configured style. Specifically
rewrap the multi-line import statements that include symbols like GEMMLayer,
InPlaceAccumulatorV2Layer, LayerNormLayer and the parser imports such as
AddParser, ConcatParser, GELUParser, GEMMParser, etc., and also reformat the
other problematic ranges noted (around the imports at lines ~42-53, ~109-114,
and ~156-160) to eliminate trailing commas/incorrect wrapping per yapf rules
before committing.
In `@Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py`:
- Around line 5-7: The file imports an unused symbol Dict from typing which
triggers the CI lint hook; remove Dict from the import list in the typing import
line so the line reads only the used types (e.g., List, Tuple) and ensure any
references to Dict in functions or classes like NetworkContext, NodeTemplate,
OperatorRepresentation, or VariableBuffer are not affected; update the import
statement in FloatInPlaceAccumulatorV2Template.py accordingly.
In `@Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py`:
- Around line 14-18: ReluGradTileConstraint is an unused/inconsistent tile
constraint: remove the entire ReluGradTileConstraint class (including its
dataIn1Name/dataIn2Name/dataOutName assignments) since it inverts the
input/output mapping compared to GeluGradTileConstraint and the GELUGradParser
and there is no ReluGradParser to accompany it; alternatively, if you intend to
support ReLU backward, implement a matching ReluGradParser and ensure the class
uses the same contract as GELUGradParser (expect grad_in as the first input and
grad_out as the output) and update the dataIn/dataOut names accordingly so
mappings align with GeluGradTileConstraint and the parser contract.
In
`@Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py`:
- Around line 18-74: The wrapTilingSolution currently appends the scalar loss to
every tile which causes last-tile overwrite if the batch is split; modify
SoftmaxCrossEntropyLossDualOutputTileConstraint.wrapTilingSolution to detect
when tilingSchedules represents a split batch (e.g. len(tilingSchedules) > 1 or
multiple steps in each schedule) and prevent exporting a per-tile scalar: either
(a) raise an explicit error/AssertionError explaining that batch tiling is
forbidden for the dual-output variant, or (b) implement aggregation of partial
losses into a single scalar before adding cls.dataLossName to schedule; perform
this check after calling super().wrapTilingSolution (use variables
tilingSchedules, schedule.outputLoadSchedule, logProbVar, lossVar) and
return/raise accordingly.
In `@Deeploy/TilingExtension/TilerExtension.py`:
- Around line 477-490: The fallback assignment memoryBlock._addrSpace = (0, 0)
violates the addrSpace invariant (addrSpace[0] < addrSpace[1]) and must be
removed; in the alias-resolution loop (references: memoryBlock, memoryMap,
aliasBlocks, ctxt.dealiasBuffer, _addrSpace) change the "alias target not in
this memoryMap" branch to NOT assign a zero-width tuple — either raise a clear
exception indicating unresolved aliasTarget or simply skip setting
memoryBlock._addrSpace (leave it as None) so existing null checks handle it;
optionally emit a warning log mentioning the unresolved aliasTarget for
debugging.
In `@DeeployTest/generateTrainingNetwork.py`:
- Around line 178-202: The loop over graph_input_names is using an unused
variable name and silently skips empty arrays causing missing entries in
inputTypes/inputOffsets; update the for-loop to use an underscore for the unused
variable (e.g., for graph_idx, _ in enumerate(graph_input_names)) or otherwise
reference name if intended, and when np.prod(arr.shape) == 0 do not silently
pass—assign a sensible default type/offset (e.g., PointerClass(float32_t) and 0
or a sentinel) or explicitly log/raise so inputTypes[f"input_{graph_idx}"] and
inputOffsets[f"input_{graph_idx}"] are always set; ensure this change is applied
where npz_base is indexed and when calling inferTypeAndOffset so empty arrays
don’t leave entries unset.
In `@DeeployTest/Platforms/Siracusa/CMakeLists.txt`:
- Line 35: The line using target_compile_options is incorrect because
${NETWORK_LIB} is a target name, not a compile option; remove the line
target_compile_options(${ProjectId} INTERFACE ${NETWORK_LIB}) entirely. If the
intent was to inherit compile flags from the network library, ensure
target_link_libraries(${ProjectId} PUBLIC|PRIVATE ${NETWORK_LIB}) is used (it
already should propagate options) or explicitly add real option strings via
target_compile_options(${ProjectId} PUBLIC <options>) or extract flags from
${NETWORK_LIB} with get_target_property before applying them.
In `@DeeployTest/Platforms/Siracusa/src/deeploytraintest.c`:
- Around line 397-413: The test currently always returns 0 even when loss
verification reports mismatches; after the cluster loss comparison (where
loss_err_count is populated via LossCompareArgs and CompareLossesOnCluster)
change the function exit to return a non-zero failure code when loss_err_count >
0 (e.g., return loss_err_count or return 1) and return 0 otherwise so the
harness fails the process on verification errors; update the return at the end
of the function that currently returns 0 accordingly.
- Around line 121-137: The L3→L3 branch of l3_aware_copy currently allocates the
entire buffer in L2 (pi_l2_malloc(bytes)) which will fail for large tensors;
change it to perform bounded chunked transfers: allocate a fixed safe chunk size
(e.g., a MAX_L2_CHUNK constant), loop over the source with offsets, for each
iteration allocate an L2 tmp buffer of min(remaining, CHUNK), call ram_read(tmp,
src+offset, chunk_size) and ram_write(dst+offset, tmp, chunk_size), free the tmp
each iteration, and ensure you check pi_l2_malloc's return and handle allocation
failure by cleaning up and returning/propagating an error; retain existing
direct memcpy/ram_read/ram_write logic for other branches and reuse ram_read,
ram_write, pi_l2_malloc, pi_l2_free symbols.
In `@DeeployTest/testMVPTraining.py`:
- Around line 159-179: The loop over graph_input_names uses an unused variable
name (rename to _name) and silently skips empty arrays causing downstream
KeyError; update the for loop header to for graph_idx, _name in
enumerate(graph_input_names) and replace the np.prod(arr.shape) == 0 branch with
an explicit assignment so empty arrays still get entries in
inputTypes/inputOffsets (e.g., set inputTypes[f"input_{graph_idx}"] =
PointerClass(float32_t) and inputOffsets[f"input_{graph_idx}"] = 0 or another
safe default), and optionally add a debug log mentioning graph_idx when you
assign the default; this touches graph_input_names, grad_acc_set,
npz_base/npz_idx, inputTypes, inputOffsets, and inferTypeAndOffset.
In `@DeeployTest/testUtils/codeGenerate.py`:
- Around line 967-976: The code opens TrainingNetwork.h with
open(train_h_path).read() without a context manager; change it to read the file
using a with statement (e.g., with open(train_h_path, 'r') as f: train_h =
f.read()) before running the re.sub to produce train_h_new and then write with
the existing with open(..., 'w') block; ensure you reference the same variables
train_h_path, train_h, train_h_new and keep the regex replacement for
DeeployNetwork_MEMORYARENA_L1_len unchanged.
- Around line 199-210: The loop over output buffers (variables output_idx,
output_buffer, output_size, typeName using deployer.ctxt.lookup/is_buffer) is
currently dead—compute values but never emit initialization—so either remove the
loop entirely or implement the intended zero-initialization: for each
output_buffer discovered, compute the total element/byte count via output_size
and typeName and append the appropriate zeroing code to retStr (e.g., a memset
or generated loop using the buffer name) so the produced code actually
initializes output_X to zero; update retStr exactly where the loop currently
increments output_idx and ensure you reference the buffer name from
output_buffer when emitting the init.
In `@DeeployTest/testUtils/core/execution.py`:
- Around line 194-197: The second warning uses an unnecessary f-string: in the
block checking opt_script.exists() change the log.warning call that currently
uses f"generateOptimizerNetwork.py not found — skipping optimizer codegen" to a
plain string literal without the leading f so it reads
"generateOptimizerNetwork.py not found — skipping optimizer codegen"; update the
call near the Path(opt_dir).exists() check around the opt_dir/opt_script logic
in execution.py (the log.warning that references generateOptimizerNetwork.py) to
remove the extraneous f prefix.
- Around line 118-121: The second warning uses an f-string with no placeholders;
in the opt_script.exists() branch update the log.warning call that currently
uses f"testMVPOptimizer.py not found — skipping optimizer codegen" to a plain
string literal (remove the leading 'f') so the message is not unnecessarily
treated as an f-string when checking opt_script.exists() and logging the
warning.
---
Outside diff comments:
In `@Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py`:
- Around line 40-82: The template file's formatting doesn't match yapf rules:
run the formatter (e.g., yapf -i
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py) to fix import wrapping and
the long NodeTemplate/_PULPSGDTemplate call wrapping around referenceTemplate
and the multi-line string; ensure the _PULPSGDTemplate(...) invocation and the
assignment to referenceTemplate are wrapped/indented per yapf so the
triple-quoted template string lines align with the project's style.
---
Nitpick comments:
In `@Deeploy/DeeployTypes.py`:
- Around line 339-349: The code uses the builtin name 'next' as a local variable
which shadows Python's builtin; in the liveness traversal loop (where variables
live, queue, visited, buffNext and ctxt.lookup/VariableBuffer are used) rename
the local variable 'next' to a non-conflicting identifier such as 'next_name'
(and update its uses in queue.pop(), ctxt.lookup(next_name),
visited.add(next_name), and queue |= buffNext.aliases - visited) to avoid the
Ruff A001 builtin-shadowing violation.
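The renamed traversal can be sketched as a standalone breadth-first walk over alias sets; the dict-based buffer model here is an assumption, only the `next` → `next_name` rename mirrors the review.

```python
def collect_aliases(buffers: dict, start: str) -> set:
    """Walk the alias graph from `start`, visiting each buffer once."""
    visited = {start}
    queue = set(buffers[start]["aliases"])
    while queue:
        next_name = queue.pop()          # was `next`, which shadowed the builtin
        buff_next = buffers[next_name]
        visited.add(next_name)
        queue |= buff_next["aliases"] - visited
    return visited
```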
In `@Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py`:
- Around line 31-54: The printf in the SoftmaxCrossEntropyLoss dual-output
template (in referenceDualOutputTemplate) prints loss each run and should be
guarded or removed; wrap the printf(" [SCE] loss=...") call with a
compile-time or runtime guard (e.g., `#ifdef` DEBUG_LOSS / `#endif` or check a debug
flag) or replace it with a pluggable logger that can be disabled so production
builds do not emit or pay the cost of this output, ensuring you only change the
printf line (and related surrounding template text) inside
referenceDualOutputTemplate.
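A sketch of the compile-time guard, written as the kind of Python template string the file uses; `DEEPLOY_DEBUG_LOSS` is a hypothetical macro name and `${data_out}` stands in for the template's real placeholder.

```python
# Emitted C code only prints the loss when the build defines the debug macro,
# so production builds pay no printf cost.
loss_debug_snippet = """
#ifdef DEEPLOY_DEBUG_LOSS
    printf(" [SCE] loss=%f\\n", *${data_out});
#endif
"""
```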
In `@DeeployTest/deeployTrainingRunner_tiled_siracusa.py`:
- Around line 1-11: Update the script shebang to explicitly use Python 3 by
changing the first line from "#!/usr/bin/env python" to "#!/usr/bin/env python3"
so that running deeployTrainingRunner_tiled_siracusa.py invokes Python 3
consistently; no other code changes needed—this affects the interpreter used
when executing the module that calls main(tiling_enabled = True) from
testUtils.deeployTrainingRunner.
In `@DeeployTest/generateOptimizerNetwork.py`:
- Line 28: Remove the unused `numpy` import in generateOptimizerNetwork.py:
delete the line importing `numpy as np` since no code in this module references
`np` (functions like build_shared_buffer_maps and other imports should manage
their own numpy usage); ensure no other references to `np` remain in the file
after removal.
In `@DeeployTest/testMVPTraining.py`:
- Around line 34-96: The four helper functions (_infer_num_data_inputs,
_infer_total_mb, _infer_data_size, _infer_n_accum) are duplicated; extract them
into a shared utility module (e.g., testUtils/trainingUtils.py) and replace the
local copies with imports. Move the implementations into the new module, export
the four functions (and _load_reference_losses if desired), update imports in
DeeployTest/testMVPTraining.py (use from testUtils.trainingUtils import
_infer_num_data_inputs, _infer_total_mb, _infer_data_size, _infer_n_accum),
remove the duplicate definitions from testMVPTraining.py, run unit tests and
update any relative import paths or packaging so both generateTrainingNetwork.py
and testMVPTraining.py import the single shared implementation.
In `@DeeployTest/testUtils/codeGenerate.py`:
- Around line 656-658: The loop that calls os.system to run clang-format with a
shell is brittle and flagged for shell injection; replace the os.system usage in
the block where clang_format is defined and the for fname in [...] loop uses
dumpdir by invoking subprocess.run with an argument list (e.g.,
['clang-format','-i', f'--style={clang_format}', f'{dumpdir}/{fname}']), pass
check=True to raise on failure, and ensure subprocess is imported; this removes
shell parsing, keeps the same flags/behavior, and uses shell=False for safety.
- Around line 1160-1162: Replace the os.system calls used to run clang-format
with subprocess.run for robustness: in the block where clang_format is defined
and the loop over ['OptimizerNetwork.c', 'OptimizerNetwork.h'] executes, call
subprocess.run with a list of arguments (e.g.
['clang-format','-i','--style='+clang_format,f'{dumpdir}/{fname}']), set
check=True to raise on non-zero exit, and avoid shell=True; optionally capture
stdout/stderr or let it inherit for visibility. Ensure you import subprocess at
the top and handle CalledProcessError (or let it propagate) so failures are not
silently ignored.
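The `subprocess.run` replacement for both clang-format call sites can be sketched as below; the flag layout follows the review text, but the helper names and the default style value are illustrative.

```python
import subprocess

def clang_format_cmd(dumpdir: str, fname: str, clang_format: str = "llvm"):
    """Build the argument list; no shell parsing, so no injection risk."""
    return ['clang-format', '-i', f'--style={clang_format}', f'{dumpdir}/{fname}']

def run_clang_format(dumpdir: str, fnames, clang_format: str = "llvm") -> None:
    for fname in fnames:
        # shell=False is the default; check=True raises CalledProcessError
        # on a non-zero exit instead of silently ignoring failure.
        subprocess.run(clang_format_cmd(dumpdir, fname, clang_format), check=True)
```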
- Around line 308-311: The function signature for
generateTrainingTestInputsHeader uses implicit Optional defaults (init_weights:
List[np.ndarray] = None and data_size: int = None); import typing.Optional and
change those annotations to explicit Optionals (e.g., init_weights:
Optional[List[np.ndarray]] and data_size: Optional[int]) so the types follow PEP
484; update the function signature accordingly and add the import if missing.
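A sketch of the explicit-Optional signature; the body here is a stub purely to make the example runnable, not the project's real implementation.

```python
from typing import List, Optional
import numpy as np

def generateTrainingTestInputsHeader(init_weights: Optional[List[np.ndarray]] = None,
                                     data_size: Optional[int] = None) -> str:
    # PEP 484: a default of None requires an explicit Optional[...] annotation.
    weights = init_weights if init_weights is not None else []
    size = data_size if data_size is not None else 0
    return f"// {len(weights)} weight arrays, data_size={size}\n"
```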
In `@DeeployTest/testUtils/deeployTrainingRunner.py`:
- Around line 18-21: The module currently mutates os.environ['PATH'] at import
time which has global side effects; move that assignment into the entrypoint
function main() so PATH is only changed when the script is executed (not on
import). Locate the top-level os.environ['PATH'] modification and remove it from
module scope, then add the same assignment at the start of the
main(tiling_enabled: bool = False, default_platform: str = 'Siracusa',
default_simulator: str = 'gvsoc') function so the change is scoped to execution;
keep the exact string '/usr/bin:' + os.environ.get('PATH', '') and the
explanatory comment near the new location.
- Around line 97-100: Replace the list concatenation when extending gen_args
with iterable unpacking: instead of gen_args.extend(['--input-type-map'] +
list(args.input_type_map)) and the similar input_offset_map line, use
gen_args.extend(['--input-type-map', *args.input_type_map]) and
gen_args.extend(['--input-offset-map', *args.input_offset_map]) (apply to the
gen_args variable and the args.input_type_map / args.input_offset_map uses) so
the code uses iterable unpacking rather than building a new list.
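A tiny before/after sketch of the unpacking change; the flag name comes from the review, the map values are illustrative.

```python
input_type_map = ['input_0=float32', 'input_1=uint8']

gen_args = []
gen_args.extend(['--input-type-map'] + list(input_type_map))   # before: builds a temp list

gen_args2 = []
gen_args2.extend(['--input-type-map', *input_type_map])        # after: unpacks in place
```

Both forms produce the same list; the unpacked form just skips the intermediate concatenation.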
- Around line 103-116: The gen_args building is inconsistent: args.l2 is only
appended when not equal to its default (1024000) but args.l1 is always appended,
causing the default 64000 to be emitted; update the branch for args.l1 in
DeeployTest/testUtils/deeployTrainingRunner.py so it mirrors args.l2 by only
appending f'--l1={args.l1}' when getattr(args, 'l1', None) is truthy and args.l1
!= 64000 (the default defined in DeeployRunnerArgumentParser), keeping the
existing behavior for args.l2 and leaving other flags (defaultMemLevel,
memAllocStrategy, searchStrategy, profileTiling, plotMemAlloc) unchanged.
In `@DeeployTest/testUtils/tilingUtils.py`:
- Around line 63-66: The loop over tensorLifetimeMap uses an unused variable
named lifetime which lints as unused; change the loop signature in the for loop
that currently reads "for tensorName, lifetime in tensorLifetimeMap.items()" to
use a throwaway name (e.g., "_" or "_lifetime") so only tensorName is
considered, leaving the body (buffer = ctxt.lookup(tensorName); if
buffer.is_input: tensorLifetimeMap[tensorName] = (0, maxStepIdx)) unchanged;
this targets the for loop that iterates tensorLifetimeMap.items() and references
ctxt.lookup and buffer.is_input.

ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3d4be450-9169-4e18-beb8-8b248adb2fb6
📒 Files selected for processing (30)
- Deeploy/DeeployTypes.py
- Deeploy/Targets/Generic/Layers.py
- Deeploy/Targets/Generic/Parsers.py
- Deeploy/Targets/Generic/TypeCheckers.py
- Deeploy/Targets/PULPOpen/Bindings.py
- Deeploy/Targets/PULPOpen/Platform.py
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
- Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py
- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
- Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py
- Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py
- Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py
- Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py
- Deeploy/Targets/PULPOpen/Tiler.py
- Deeploy/TilingExtension/TilerExtension.py
- DeeployTest/CMakeLists.txt
- DeeployTest/Platforms/Siracusa/CMakeLists.txt
- DeeployTest/Platforms/Siracusa/src/deeploytraintest.c
- DeeployTest/deeployTrainingRunner.py
- DeeployTest/deeployTrainingRunner_siracusa.py
- DeeployTest/deeployTrainingRunner_tiled_siracusa.py
- DeeployTest/generateOptimizerNetwork.py
- DeeployTest/generateTrainingNetwork.py
- DeeployTest/testMVPOptimizer.py
- DeeployTest/testMVPTraining.py
- DeeployTest/testUtils/codeGenerate.py
- DeeployTest/testUtils/core/config.py
- DeeployTest/testUtils/core/execution.py
- DeeployTest/testUtils/deeployTrainingRunner.py
- DeeployTest/testUtils/tilingUtils.py
```python
def parseNode(self, node: gs.Node) -> bool:
    # Require exactly 3 inputs (buffer, gradient, lazy_reset_grad) and 1 output
    return len(node.inputs) == 3 and len(node.outputs) == 1

def parseNodeCtxt(self,
                  ctxt: NetworkContext,
                  node: gs.Node,
                  channels_first: bool = True) -> Tuple[NetworkContext, bool]:

    buffer = ctxt.lookup(node.inputs[0].name)
    gradient = ctxt.lookup(node.inputs[1].name)
    lazy_reset_grad = ctxt.lookup(node.inputs[2].name)
    data_out = ctxt.lookup(node.outputs[0].name)

    self.operatorRepresentation['accum_buffer'] = buffer.name
    self.operatorRepresentation['gradient'] = gradient.name
    self.operatorRepresentation['lazy_reset_grad'] = lazy_reset_grad.name
    self.operatorRepresentation['data_out'] = data_out.name
    self.operatorRepresentation['size'] = int(np.prod(buffer.shape))

    return ctxt, True
```
Reject shape-mismatched InPlaceAccumulatorV2 nodes.
This parser only checks arity, but the generated kernel/tile path assumes buffer, gradient, and data_out have the same extent and that lazy_reset_grad is a single flag. If a malformed graph slips through, size is derived from buffer.shape and later code can read/write past the smaller tensor.
Proposed validation
```diff
 def parseNodeCtxt(self,
                   ctxt: NetworkContext,
                   node: gs.Node,
                   channels_first: bool = True) -> Tuple[NetworkContext, bool]:
     buffer = ctxt.lookup(node.inputs[0].name)
     gradient = ctxt.lookup(node.inputs[1].name)
     lazy_reset_grad = ctxt.lookup(node.inputs[2].name)
     data_out = ctxt.lookup(node.outputs[0].name)
+
+    if tuple(buffer.shape) != tuple(gradient.shape) or tuple(buffer.shape) != tuple(data_out.shape):
+        return ctxt, False
+    if int(np.prod(lazy_reset_grad.shape)) != 1:
+        return ctxt, False

     self.operatorRepresentation['accum_buffer'] = buffer.name
     self.operatorRepresentation['gradient'] = gradient.name
     self.operatorRepresentation['lazy_reset_grad'] = lazy_reset_grad.name
     self.operatorRepresentation['data_out'] = data_out.name
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/Generic/Parsers.py` around lines 2728 - 2748, The parser only
checks arity in parseNode/parseNodeCtxt but must validate tensor extents: in
parseNodeCtxt (using ctxt.lookup results for buffer, gradient, lazy_reset_grad,
data_out) verify that buffer.shape, gradient.shape and data_out.shape are
identical (same number of elements and per-dimension sizes) and that
lazy_reset_grad is a scalar/has size 1; if any check fails, return ctxt, False
(or otherwise reject the node) instead of proceeding; compute
operatorRepresentation['size'] from the validated common shape (e.g.,
prod(buffer.shape) after checks) and do not accept the node when shapes mismatch
to prevent out-of-bounds reads/writes.
Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py (outdated comment, resolved)
Actionable comments posted: 2
♻️ Duplicate comments (2)
Deeploy/Targets/Generic/TypeCheckers.py (1)
577-581: ⚠️ Potential issue | 🟠 Major

Keep the inherited output-type validation in this override.

This arity check is necessary, but returning only the count match still bypasses `SignPropTypeChecker.checkOutputType()`. That means dual-output SCE can bind as long as the output count matches, even if the declared output pointer types do not.

✅ Minimal fix

```diff
 def checkOutputType(self, inputs: List[VariableBuffer],
                     operatorRepresentation: OperatorRepresentation) -> bool:
     # The parser sets 'loss' to a non-empty string for 2-output nodes, '' for 1-output.
     # Use this to determine the actual output count and match it against this binding.
     actual_num_outputs = 2 if operatorRepresentation.get('loss', '') != '' else 1
-    return actual_num_outputs == len(self.output_types)
+    return actual_num_outputs == len(self.output_types) and \
+        super().checkOutputType(inputs, operatorRepresentation)
```

DeeployTest/testMVPTraining.py (1)

160-174: ⚠️ Potential issue | 🟡 Minor

Register zero-length inputs instead of skipping them.

For zero-sized non-bool/non-float tensors, lines 173-174 leave `inputTypes`/`inputOffsets` unset for that graph input. The rest of the pipeline treats those maps as complete, so the first empty tensor here will fail later during deployer setup or code generation. If empty inputs are valid, assign a safe default entry instead of `pass`; while touching this block, rename `name` to `_name` as well.

Proposed fix

```diff
-    for graph_idx, name in enumerate(graph_input_names):
+    for graph_idx, _name in enumerate(graph_input_names):
         if graph_idx in grad_acc_set:
             inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t)
             inputOffsets[f"input_{graph_idx}"] = 0
         else:
             arr = npz_base[npz_idx]
             npz_idx += 1
             if arr.dtype == bool or arr.dtype == np.bool_:
                 inputTypes[f"input_{graph_idx}"] = PointerClass(uint8_t)
                 inputOffsets[f"input_{graph_idx}"] = 0
             elif arr.dtype in (np.float32, np.float64):
                 inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t)
                 inputOffsets[f"input_{graph_idx}"] = 0
             elif np.prod(arr.shape) == 0:
-                pass
+                inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t)
+                inputOffsets[f"input_{graph_idx}"] = 0
             else:
```
🧹 Nitpick comments (4)
DeeployTest/testMVPOptimizer.py (1)
110-118: Clarify or remove the contradictory comment on line 110.Line 110's comment states "optimizer is forward-only, no lifetime extension needed" but then immediately uses
TrainingSBTilerwhich explicitly extends input lifetimes. The follow-up comment (lines 114-117) correctly explains why lifetime extension IS needed. Consider removing or updating line 110 to avoid confusion.📝 Fix the misleading comment
- # 7. Wrap with SBTiler (single-buffering; optimizer is forward-only, no lifetime extension needed). + # 7. Wrap with TrainingSBTiler (single-buffering with extended input lifetimes). unique_params = f"{args.dumpdir}_L1{args.l1}_L2{args.l2}_{args.defaultMemLevel}_optimizer" testIdentifier = hashlib.md5(unique_params.encode()).hexdigest()[:16] # TrainingSBTiler extends all input buffer lifetimes to the end of the🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testMVPOptimizer.py` around lines 110 - 118, The top comment before creating unique_params and wrapping with TilerDeployerWrapper is contradictory—remove or update the phrase "optimizer is forward-only, no lifetime extension needed" and replace it with a clear note that TrainingSBTiler is intentionally used to extend input buffer lifetimes (via TrainingMemoryScheduler) to prevent allocator reuse of consumed inputs for later outputs; adjust the comment near the TilerDeployerWrapper/TrainingSBTiler mention so it accurately states that optimizer training requires lifetime extension to avoid weight buffer corruption.DeeployTest/testUtils/tilingUtils.py (1)
50-67: LGTM! Training memory scheduler implementation is correct.The
TrainingMemorySchedulercorrectly extends input tensor lifetimes to span the entire tiling schedule, ensuring forward-pass inputs remain allocated during the backward pass.One minor nit from static analysis: the loop variable
lifetimeon line 62 is unused.🧹 Rename unused loop variable
- for tensorName, lifetime in tensorLifetimeMap.items(): + for tensorName, _lifetime in tensorLifetimeMap.items(): buffer = ctxt.lookup(tensorName) if buffer.is_input: tensorLifetimeMap[tensorName] = (0, maxStepIdx)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/tilingUtils.py` around lines 50 - 67, The loop in TrainingMemoryScheduler._calculateLifetimes iterates "for tensorName, lifetime in tensorLifetimeMap.items()" but "lifetime" is unused; rename it to "_" or "_lifetime" to satisfy static analysis. Update the iteration over tensorLifetimeMap.items() (in the _calculateLifetimes method) to use the underscore name and keep the rest of the logic (looking up buffer via ctxt.lookup(tensorName) and updating tensorLifetimeMap) unchanged.DeeployTest/testUtils/deeployTrainingRunner.py (1)
16-19: Hardcoded/usr/binPATH prepend may cause issues.Unconditionally prepending
/usr/binto PATH at module import time could cause unexpected behavior in environments where a different Python or tool version is expected. This global side effect happens even if the module is just imported.Move PATH manipulation into main() or make it conditional
-# gapy (gvsoc launcher) uses `#!/usr/bin/env python3`. Put /usr/bin first so -# it resolves to /usr/bin/python3 which has all required packages (gapylib, -# prettytable, …) rather than the minimal venv python. -os.environ['PATH'] = '/usr/bin:' + os.environ.get('PATH', '') +def _setup_gapy_path(): + """Prepend /usr/bin to PATH so gapy resolves to system python3.""" + # gapy (gvsoc launcher) uses `#!/usr/bin/env python3`. + current_path = os.environ.get('PATH', '') + if not current_path.startswith('/usr/bin:'): + os.environ['PATH'] = '/usr/bin:' + current_pathThen call
_setup_gapy_path()at the start ofmain().🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/deeployTrainingRunner.py` around lines 16 - 19, The module-level PATH mutation should be moved into a small helper and invoked from main to avoid global side effects; create a function named _setup_gapy_path() that performs the os.environ['PATH'] = '/usr/bin:' + os.environ.get('PATH', '') change (optionally guarded by a check such as if '/usr/bin' not in os.environ.get('PATH','') or an env var like DEEPLY_ALLOW_PATH_PREPEND), remove the top-level PATH assignment, and call _setup_gapy_path() at the start of main() so the PATH modification occurs only when the script is executed, not on import.DeeployTest/testMVPTraining.py (1)
411-418: Preserve the original traceback here.
raise erewrites the traceback to this handler. A bareraisekeeps the original failure location, which is much more useful when codegen fails under CI or GVSoC.Proposed fix
if args.shouldFail: print("\033[92mTiled training network generation ended, failed as expected!\033[0m") sys.exit(0) else: - raise e + raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testMVPTraining.py` around lines 411 - 418, The except block around generateTiledTrainingNetwork currently re-raises the exception with "raise e", which resets the traceback; change it to a bare "raise" so the original traceback is preserved (inside the same except handling that checks args.shouldFail and prints the success message when appropriate).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Deeploy/Targets/PULPOpen/Bindings.py`:
- Around line 361-365: The list comprehension in
PULPSoftmaxCrossEntropyLossDualOutputBindings (and the other ~11 binding
definitions) uses the comprehension variable named type which shadows the
built-in; rename that variable to label_type (or similar) in the comprehension
for PULPSoftmaxCrossEntropyLossDualOutputBindings and do the same consistently
for other bindings using IntegerDataTypes so the generator expression for
NodeBinding(SoftmaxCrossEntropyLossChecker(...),
SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, ForkTransformer)
becomes a comprehension using label_type instead of type; update all occurrences
referencing the comprehension variable in this file (e.g., any usages inside
SoftmaxCrossEntropyLossChecker or other binding list comprehensions) to match
the new name.
In `@DeeployTest/testMVPTraining.py`:
- Around line 233-242: The auto-inference code currently uses floor division so
any remainder mini-batches get dropped when total_mb is later recomputed; change
the inference to preserve all samples by using ceiling division (e.g., compute
n_steps = ceil(total_mb / n_accum) or n_accum = ceil(total_mb / n_steps) via
math.ceil) or else explicitly handle the leftover partial batch instead of
truncating; update the branches around _infer_total_mb, _infer_n_accum, and the
recompute of total_mb so the final total_mb covers all original samples (or emit
a clear warning if you choose to pad/handle a partial final step).
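The floor-versus-ceiling difference can be sketched in two lines; `total_mb` and `n_accum` follow the review text, the function names are illustrative.

```python
import math

def steps_floor(total_mb: int, n_accum: int) -> int:
    return total_mb // n_accum            # drops a trailing partial step

def steps_ceil(total_mb: int, n_accum: int) -> int:
    return math.ceil(total_mb / n_accum)  # covers every mini-batch
```

With 10 mini-batches and 4 accumulation steps, floor division yields 2 steps (8 batches) and silently discards 2 samples; ceiling division yields 3 steps and covers all of them.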
---
Duplicate comments:
In `@Deeploy/Targets/Generic/TypeCheckers.py`:
- Around line 577-581: The override of checkOutputType currently only verifies
arity and skips the inherited type-pointer validation; update
Generic.TypeCheckers.checkOutputType to first compute actual_num_outputs (as it
does using operatorRepresentation.get('loss', '')), return False if
actual_num_outputs != len(self.output_types), and otherwise call and return the
superclass SignPropTypeChecker.checkOutputType(self, inputs,
operatorRepresentation) so the original pointer/type validation still runs;
reference the method names checkOutputType, SignPropTypeChecker.checkOutputType,
self.output_types, and operatorRepresentation.get('loss', '') to locate the code
to change.
In `@DeeployTest/testMVPTraining.py`:
- Around line 160-174: The loop over graph_input_names leaves
inputTypes/inputOffsets unset for zero-length non-bool/non-float arrays causing
downstream failures; inside the for loop (variables graph_idx, name -> rename to
_name) when you hit the branch elif np.prod(arr.shape) == 0, register a safe
default entry instead of pass (e.g., set inputTypes[f"input_{graph_idx}"] to a
PointerClass fallback such as float32_t and inputOffsets[f"input_{graph_idx}"] =
0) so the maps remain complete; ensure you still increment npz_idx and keep
handling of grad_acc_set, PointerClass, float32_t and uint8_t unchanged.
---
Nitpick comments:
In `@DeeployTest/testMVPOptimizer.py`:
- Around line 110-118: The top comment before creating unique_params and
wrapping with TilerDeployerWrapper is contradictory—remove or update the phrase
"optimizer is forward-only, no lifetime extension needed" and replace it with a
clear note that TrainingSBTiler is intentionally used to extend input buffer
lifetimes (via TrainingMemoryScheduler) to prevent allocator reuse of consumed
inputs for later outputs; adjust the comment near the
TilerDeployerWrapper/TrainingSBTiler mention so it accurately states that
optimizer training requires lifetime extension to avoid weight buffer
corruption.
In `@DeeployTest/testMVPTraining.py`:
- Around line 411-418: The except block around generateTiledTrainingNetwork
currently re-raises the exception with "raise e", which resets the traceback;
change it to a bare "raise" so the original traceback is preserved (inside the
same except handling that checks args.shouldFail and prints the success message
when appropriate).
In `@DeeployTest/testUtils/deeployTrainingRunner.py`:
- Around line 16-19: The module-level PATH mutation should be moved into a small
helper and invoked from main to avoid global side effects; create a function
named _setup_gapy_path() that performs the os.environ['PATH'] = '/usr/bin:' +
os.environ.get('PATH', '') change (optionally guarded by a check such as if
'/usr/bin' not in os.environ.get('PATH','') or an env var like
DEEPLY_ALLOW_PATH_PREPEND), remove the top-level PATH assignment, and call
_setup_gapy_path() at the start of main() so the PATH modification occurs only
when the script is executed, not on import.
In `@DeeployTest/testUtils/tilingUtils.py`:
- Around line 50-67: The loop in TrainingMemoryScheduler._calculateLifetimes
iterates "for tensorName, lifetime in tensorLifetimeMap.items()" but "lifetime"
is unused; rename it to "_" or "_lifetime" to satisfy static analysis. Update
the iteration over tensorLifetimeMap.items() (in the _calculateLifetimes method)
to use the underscore name and keep the rest of the logic (looking up buffer via
ctxt.lookup(tensorName) and updating tensorLifetimeMap) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 87d73b5e-2e69-4f9f-ad5d-4cccd6348c23
📒 Files selected for processing (19)
- Deeploy/Targets/Generic/TypeCheckers.py
- Deeploy/Targets/PULPOpen/Bindings.py
- Deeploy/Targets/PULPOpen/Platform.py
- Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py
- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
- Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py
- Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py
- Deeploy/Targets/PULPOpen/Tiler.py
- Deeploy/TilingExtension/TilerExtension.py
- DeeployTest/Platforms/Siracusa/src/deeploytraintest.c
- DeeployTest/deeployTrainingRunner.py
- DeeployTest/generateOptimizerNetwork.py
- DeeployTest/generateTrainingNetwork.py
- DeeployTest/testMVPOptimizer.py
- DeeployTest/testMVPTraining.py
- DeeployTest/testUtils/codeGenerate.py
- DeeployTest/testUtils/core/execution.py
- DeeployTest/testUtils/deeployTrainingRunner.py
- DeeployTest/testUtils/tilingUtils.py
✅ Files skipped from review due to trivial changes (1)
- DeeployTest/deeployTrainingRunner.py
🚧 Files skipped from review as they are similar to previous changes (7)
- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
- Deeploy/TilingExtension/TilerExtension.py
- Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py
- Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py
- Deeploy/Targets/PULPOpen/Tiler.py
- Deeploy/Targets/PULPOpen/Platform.py
- Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py
Deeploy/Targets/PULPOpen/Bindings.py (outdated)
```python
PULPSoftmaxCrossEntropyLossDualOutputBindings = [
    NodeBinding(
        SoftmaxCrossEntropyLossChecker([PointerClass(float32_t), PointerClass(type)],
                                       [PointerClass(float32_t), PointerClass(float32_t)]),
        SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, ForkTransformer) for type in IntegerDataTypes
```
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n Deeploy/Targets/PULPOpen/Bindings.py | sed -n '355,375p'
```

Repository: pulp-platform/Deeploy
Length of output: 1271
🏁 Script executed:

```shell
rg "for type in IntegerDataTypes" Deeploy/Targets/PULPOpen/Bindings.py
```

Repository: pulp-platform/Deeploy
Length of output: 1155
Rename the comprehension variable to avoid Ruff A001 throughout the file.
The variable type in the list comprehension shadows Python's built-in type, which Ruff correctly flags. This pattern appears in multiple binding definitions in this file (at least 11 instances). Rename type to label_type (or similar) consistently across all occurrences.
At lines 361-365 specifically:
♻️ Fix for PULPSoftmaxCrossEntropyLossDualOutputBindings
```diff
 PULPSoftmaxCrossEntropyLossDualOutputBindings = [
     NodeBinding(
-        SoftmaxCrossEntropyLossChecker([PointerClass(float32_t), PointerClass(type)],
+        SoftmaxCrossEntropyLossChecker([PointerClass(float32_t), PointerClass(label_type)],
                                        [PointerClass(float32_t), PointerClass(float32_t)]),
-        SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, ForkTransformer) for type in IntegerDataTypes
+        SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, ForkTransformer)
+    for label_type in IntegerDataTypes
 ]
```

🧰 Tools
🪛 Ruff (0.15.9)
[error] 365-365: Variable `type` is shadowing a Python builtin (A001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/PULPOpen/Bindings.py` around lines 361 - 365, The list
comprehension in PULPSoftmaxCrossEntropyLossDualOutputBindings (and the other
~11 binding definitions) uses the comprehension variable named type which
shadows the built-in; rename that variable to label_type (or similar) in the
comprehension for PULPSoftmaxCrossEntropyLossDualOutputBindings and do the same
consistently for other bindings using IntegerDataTypes so the generator
expression for NodeBinding(SoftmaxCrossEntropyLossChecker(...),
SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, ForkTransformer)
becomes a comprehension using label_type instead of type; update all occurrences
referencing the comprehension variable in this file (e.g., any usages inside
SoftmaxCrossEntropyLossChecker or other binding list comprehensions) to match
the new name.
…puts

Two fixes for CI regressions on PR pulp-platform#182:

1. SoftmaxCrossEntropyLoss: the canonical form is now 2 outputs (loss + log_prob). The previous design carried both a legacy 1-output form and a new 2-output 'dual output' form side by side, which tripped Deeploy's root-layer backtracking: Layer.parse() commits to the first mapper whose parser succeeds, and if that mapper's bindings all later fail typeCheck, the root-layer backtracker raises instead of trying the next mapper. That broke the existing upstream Tests/Kernels/FP32/Softmax/CrossEntropy.

   Fix: collapse everything to a single 2-output path. SoftmaxCrossEntropyLossParser now requires exactly 2 outputs, the SCE template is the former 'referenceDualOutputTemplate' (loss + log_prob), the PULPOpen binding uses 2-output pointer types, and the tile constraint contains the loss-patching logic that previously lived in SoftmaxCrossEntropyLossDualOutputTileConstraint (now deleted). SoftmaxCrossEntropyGradTileConstraint overrides dataLossName to '' so it falls straight through to the base-class single-output wrapper. The upstream Tests/Kernels/FP32/Softmax/CrossEntropy ONNX is regenerated with the canonical 2-output signature and its outputs.npz is updated with the computed scalar loss.

2. TilingExtension.TilerExtension: drop the 4-byte MiniMalloc size alignment that was added in the training branch to work around a Siracusa L3 DMA corner case. That change makes Snitch's tight-fit 5 kB Kernels/Integer/Add/Large test overflow L1 because the rounded-up per-buffer sizes no longer fit. For the MLP core the alignment is not needed. The alias-skip logic (skip zero-sized in-place alias outputs from the MiniMalloc CSV and resolve their addrSpace from the alias target after solving) is kept.
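For reference, the 2-output (loss + log_prob) semantics adopted here can be sketched in NumPy. This is a hedged illustration of the ONNX SoftmaxCrossEntropyLoss contract with mean reduction, not the Deeploy kernel:

```python
import numpy as np

def softmax_cross_entropy(logits: np.ndarray, labels: np.ndarray):
    """Sketch of the 2-output form: returns (scalar mean loss, log_prob).

    logits: (N, C) float scores; labels: (N,) integer class ids.
    """
    # Numerically stable log-softmax per row.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Mean negative log-likelihood of the true class.
    loss = -log_prob[np.arange(len(labels)), labels].mean()
    return loss, log_prob

loss, log_prob = softmax_cross_entropy(
    np.array([[2.0, 0.5], [0.1, 1.2]]), np.array([0, 1]))
assert log_prob.shape == (2, 2)
assert loss > 0.0
```

The gradient kernel only needs log_prob; the scalar loss output exists so the harness can compare against reference losses per step.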
Verified locally:
- Tests/Kernels/FP32/Softmax/CrossEntropy (upstream) PASSED
- Tests/Kernels/Integer/Add/Large (snitch tiled, L1=5 kB) PASSED
- deeployTrainingRunner_siracusa.py -t simplemlp_train PASSED
- deeployTrainingRunner_tiled_siracusa.py -t simplemlp_train PASSED
…gle binding

Drop the separate tiledReferenceTemplate / PULPInPlaceAccumulatorV2TiledBindings path. The non-tiled template's data_out write is redundant: InPlaceAccumulatorV2 is terminal in the training graph, no downstream kernel consumes data_out, and the alias registered in alignToContext already makes the graph output pointer resolve to accum_buffer's L2 slot. Emitting the write is not only unnecessary — in the tiled path it also triggers an L2 egress DMA whose destination may overlap with live buffers and corrupt L2, which is why the tiled variant existed in the first place. One template that only writes accum_buffer is correct for both modes.

Tiler.py now pulls the (single) PULPInPlaceAccumulatorV2Bindings through PULPInPlaceAccumulatorV2TilingReadyBindings.

Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled (deeployTrainingRunner_siracusa.py) and tiled (deeployTrainingRunner_tiled_siracusa.py --l1 64000 --l2 2000000) runs.
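The accumulator semantics relied on above can be sketched as follows (illustrative Python, assuming the operator simply adds the incoming gradient into a persistent buffer; names are hypothetical, not the kernel's):

```python
import numpy as np

def in_place_accumulate(accum_buffer: np.ndarray, data_in: np.ndarray) -> None:
    """Sketch of InPlaceAccumulatorV2: accumulate into accum_buffer in place.

    No separate data_out is written; the graph output resolves through an
    alias onto accum_buffer, as the commit message above argues.
    """
    accum_buffer += data_in

# e.g. n_accum = 4 micro-batch gradients summed before the SGD step
grads = np.ones(4, dtype=np.float32)
accum = np.zeros(4, dtype=np.float32)
for _ in range(4):
    in_place_accumulate(accum, grads)
assert np.allclose(accum, 4.0)
```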
The weight/weight_updated alias + shared allocTemplate mechanism works for any memory level the weight lives in, not only L2. Update the class docstring and inline comment to say "L2 or L3" instead of hard-coding L2. No code change.
Actionable comments posted: 2
📒 Files selected for processing (1)
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
```python
weight = ctxt.lookup(operatorRepresentation['weight'])
weight_updated = ctxt.lookup(operatorRepresentation['weight_updated'])

weight.aliases.add(weight_updated.name)
weight_updated.aliases.add(weight.name)
weight_updated._alias = weight.name
```
Add a scope-consistency guard before aliasing buffers.
At Line 31, _alias is set unconditionally. If weight and weight_updated are in different scopes (global/local), this can produce inconsistent memory-level expectations during tiling/scheduling. Validate scope parity (or explicitly co-locate) before aliasing.
Suggested fix:

```diff
 weight = ctxt.lookup(operatorRepresentation['weight'])
 weight_updated = ctxt.lookup(operatorRepresentation['weight_updated'])
+
+weight_is_global = ctxt.is_global(weight.name)
+updated_is_global = ctxt.is_global(weight_updated.name)
+if weight_is_global != updated_is_global:
+    raise RuntimeError(
+        f"SGD alias requires same scope: weight={weight.name} "
+        f"({'global' if weight_is_global else 'local'}), "
+        f"weight_updated={weight_updated.name} "
+        f"({'global' if updated_is_global else 'local'})")
```
```python
weight_updated.allocTemplate = NodeTemplate(" ${name} = (${type.typeName}) " + str(weight._instance) + ";")
weight_updated.deallocTemplate = NodeTemplate("")
```
Avoid freezing weight._instance into the alloc template at align-time.
Line 36 snapshots str(weight._instance) early. Since allocation/scheduling can still adapt buffer instances later, this risks stale alias assignment code for weight_updated. Prefer referencing the aliased symbol (weight.name) rather than the current _instance value.
Suggested fix:

```diff
-weight_updated.allocTemplate = NodeTemplate(" ${name} = (${type.typeName}) " + str(weight._instance) + ";")
+weight_updated.allocTemplate = NodeTemplate(
+    f" ${{name}} = (${{type.typeName}}) {weight.name};")
 weight_updated.deallocTemplate = NodeTemplate("")
```
…ments Drop restating-the-code comments and inline verbose "trick" narration; keep only the class docstring and the one-line note about egressing through data_out rather than accum_buffer. Also compact the load-schedule loops into list comprehensions. No behaviour change. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
ReluGradTileConstraint was an unused BOPTileConstraint subclass that got pulled in with the initial training-platform cherry-pick. Nothing in the tree imports or registers it, simplemlp_train does not contain any Relu or ReluGrad nodes, and Relu gradients are explicitly out of scope for this PR. Remove it. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
…tTileConstraint The canonicalisation commit 763b464 moved the loss-patching wrapTilingSolution logic into the base SoftmaxCrossEntropyTileConstraint (dataLossName = 'loss' plus loss-output extension) and the commit message declared the dual-output subclass "now deleted", but the git rm never actually happened — the file was still sitting on disk with no importers anywhere in the tree. Finish the deletion. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
Collapse the in-place alias collection and resolution in Tiler.minimalloc() into set/dict comprehensions, drop the verbose multi-line comment, and remove the JUNGVI: attribution from a block that wasn't authored by Victor. The target lookup is now O(1) via a name-keyed dict instead of an inner linear scan over memoryMap. No behaviour change. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
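The linear-scan-to-dict change described above follows a standard pattern; a generic sketch (illustrative names, not the Tiler code):

```python
# Hypothetical memory map entries; only `name` and `addrSpace` matter here.
memory_map = [{"name": f"buf{i}", "addrSpace": 0x1000 + 16 * i}
              for i in range(1000)]

# Inner linear scan: O(n) per lookup, O(n*m) overall for m alias resolutions.
def addr_linear(name):
    return next(e["addrSpace"] for e in memory_map if e["name"] == name)

# Name-keyed dict built once: O(n) to build, O(1) per lookup afterwards.
by_name = {e["name"]: e for e in memory_map}

def addr_dict(name):
    return by_name[name]["addrSpace"]

assert addr_linear("buf500") == addr_dict("buf500")
```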
…nceCode

Upstream PR pulp-platform#177 (13113de, "Fix/tiling stack scoping and tiling information corruption", Pu DENG) wraps each layer's emitted code in a C block:

```python
layerCode = reduce(lambda a, b: a + b, sections, "")
callStack += "{\n" + layerCode + "\n}\n"
```

so that per-layer call args become short-lived stack variables and RunNetwork's overall stack footprint goes down. cc1f68b silently reverted this hunk during a merge from devel — the training-platform branch was based on a pre-pulp-platform#177 snapshot and the conflict resolution went the wrong way. Restore the wrapping verbatim.

Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
…est fixtures Revert the regenerated 2-output network.onnx and outputs.npz under DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/ back to the upstream-devel versions. The training MLP path doesn't depend on these fixtures, and keeping the original files avoids touching unrelated kernel test data in this PR. Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
…nel test The new SoftmaxCrossEntropyLossParser requires exactly 2 outputs (loss + log_prob), so the upstream Tests/Kernels/FP32/Softmax/CrossEntropy fixture (legacy 1-output ONNX) is no longer parseable. Rather than regenerate it in this PR, just remove the test fixture directory and its entry in test_siracusa_config.py. CrossEntropyGrad is unaffected and stays. Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
…d code
The Siracusa training harness was computing loss_err_count, printing
"Errors: N out of M", and then unconditionally returning 0. Numerical
mismatches were therefore non-blocking — the simulator exit code stayed
green even when the training graph diverged from the reference. Make the
return value reflect the comparison result:
return loss_err_count == 0 ? 0 : 1;
Also drop dead code that was never reached or never used:
- training_cycles / optimizer_cycles locals (only commented-out printf)
- connect_optimizer_buffers() (body was just (void)0; never called)
- Several commented-out section header blocks and helper-call lines
No control-flow change apart from the return value.
Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at
every step) in both non-tiled and tiled runs.
Step A and Step C in run_optimizer_step had explicit comment headers but the cluster task call between them (which actually runs the optimizer kernel) had none, so the labelling jumped A → C. Add a one-line "Step B: run optimizer kernel on cluster" header for symmetry. Pure documentation, no logic change.
The _augment_path helper prepended GVSOC_INSTALL_DIR/bin, LLVM_INSTALL_DIR/bin
and (if a venv was active) the venv's bin to PATH before invoking cmake /
gvsoc. Two reasons to remove it from this PR:
1. On Siracusa it is a no-op — CMake's gvsoc_<target> custom command and
the LLVM toolchain calls go through ${GVSOC_INSTALL_DIR}/... and
${LLVM_INSTALL_DIR}/... absolute paths and never depend on PATH.
2. The venv kconfigtool.py shebang fixup is GAP9-specific. GAP9 is
explicitly out of scope for this PR (the commit message of cc1f68b
lists "GAP9 platform port (separate PR, depends on this)" under the
excluded scope), so this plumbing belongs in the GAP9 follow-up, not
here.
Test runners silently mutating PATH is also a documentation/debug
hazard — if a user has intentionally pinned a different gvsoc/llvm in
their PATH, this helper would silently override it. Removed both call
sites in cmake_configure() and run_simulation().
Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at
every step) in both non-tiled and tiled runs.
…tils.py

generateTrainingNetwork.py and testMVPTraining.py carried byte-identical copies of _load_reference_losses, _infer_num_data_inputs, _infer_total_mb, _infer_data_size, _infer_n_accum and the _GRAD_ACC constant. testMVPOptimizer.py and testMVPTraining.py each defined their own _mockScheduler, also byte-identical.

Move all of these to a new testUtils/trainingUtils.py module and import them from the three entry points so the next helper edit only has to happen in one place. No behaviour change; the three entry-point scripts still exist as before (the user asked to keep testMVP*.py and the two deeployTrainingRunner_siracusa.py / _tiled_siracusa.py stubs).

Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
…ranches in generate_network

The training path in execution.generate_network had two sibling branches (config.training and config.tiling / config.training) that were 90% identical:
- same Step 1 (run training codegen script with --n-steps / --n-accum / --num-data-inputs / -v / --debug / gen_args)
- same training_meta.json read-back
- same Step 2 optimizer loop with passthrough args and --defaultMemLevel default

The only real differences were the two script names (testMVPTraining.py vs generateTrainingNetwork.py and the corresponding optimizer pair), a 4-entry vs 8-entry passthrough list, and the "Tiled training" vs "Training" error-message prefix. Collapse into a single `if config.training:` branch that selects the three variants up front and reuses one body. The two inference branches (`elif config.tiling:` and `else:`) are left untouched.

Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
… trainingUtils.py
The four codegen entry points (generateTrainingNetwork.py, testMVPTraining.py,
generateOptimizerNetwork.py, testMVPOptimizer.py) each had a nearly identical
argparse block in their if __name__ == '__main__' tail, plus the same
try/except --shouldFail handshake. Each addition of a training-side arg
meant editing up to four scripts in lockstep.
Move the shared argparse groups and the handshake runner to
testUtils/trainingUtils.py:
- add_cores_arg(parser) (shared by all four)
- add_training_inference_args(parser) (--num-data-inputs / --n-steps /
--n-accum / --learning-rate / --tolerance;
shared by both training scripts)
- add_memory_level_args(parser) (--l1 / --l2 / --defaultMemLevel;
shared by the tiled training and
both optimizer scripts)
- add_tiling_solver_args(parser) (--memAllocStrategy / --searchStrategy /
--plotMemAlloc / --profileTiling;
shared by the two tiled variants)
- add_optimizer_training_dir_arg(parser) (shared by both optimizer scripts)
- add_should_fail_arg(parser) (shared by all four)
- run_with_shouldfail(fn, args, label) (try/except --shouldFail runner)
Each entry point's __main__ block shrinks to ~8 lines that compose the
groups it needs. CLI behaviour is unchanged — every argument keeps the
same name, default, dest and help text.
Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every
step) in both non-tiled and tiled runs.
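The parser-group pattern described above can be sketched generically. Helper and argument names here are hypothetical stand-ins for the trainingUtils ones; each builder mutates a parser in place so an entry point composes only the groups it needs:

```python
import argparse

# Hypothetical group builders in the style of add_training_inference_args.
def add_training_inference_args(parser):
    parser.add_argument("--n-steps", type=int, default=4)
    parser.add_argument("--learning-rate", type=float, default=0.01)

def add_memory_level_args(parser):
    parser.add_argument("--l1", type=int, default=64000)
    parser.add_argument("--l2", type=int, default=2000000)

# A tiled training entry point composes both groups; an optimizer-only
# entry point would call just add_memory_level_args.
parser = argparse.ArgumentParser()
add_training_inference_args(parser)
add_memory_level_args(parser)
args = parser.parse_args(["--n-steps", "8", "--l1", "128000"])
assert args.n_steps == 8 and args.l1 == 128000
assert args.learning_rate == 0.01  # untouched defaults survive composition
```

Because every builder keeps the same names, defaults and dests, CLI behaviour stays identical no matter which script composes it.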
… to trainingUtils The training branch of generate_network and the training-flag block in configure_cmake had four pieces of boilerplate worth extracting: - building the `[python, script, -d gen_dir, -t path, -p platform]` prefix twice (training + optimizer step) - the same `log.debug + subprocess.run + returncode check + raise` tail twice - the opt-side gen_args passthrough filter (4-line for-loop) - the six-line -DTRAINING=ON / -DN_TRAIN_STEPS=... / -DN_ACCUM_STEPS=... / -DTRAINING_NUM_DATA_INPUTS=... block in configure_cmake Move all four into testUtils/trainingUtils.py as build_codegen_cmd / run_codegen_subprocess / filter_passthrough_args / add_training_cmake_flags. They take primitive parameters only — no DeeployTestConfig dependency — so trainingUtils does not gain a back-edge into testUtils.core. The inference branches of generate_network (`elif config.tiling:` / `else:`) are left untouched. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
The last training-specific helper left in execution.py was _resolve_optimizer_dir, which derived the optimizer ONNX directory from the training test's name (replacing `_train` with `_optimizer`) with a config.optimizer_dir override. Move it to trainingUtils.py alongside the other test-harness helpers and change its signature to primitive parameters (test_dir: str, optimizer_dir: Optional[str]) so trainingUtils still has no back-edge to DeeployTestConfig. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
Minimize the footprint this PR leaves in testUtils/core/execution.py.
Previously the diff vs upstream devel was ~170 lines spread across
generate_network, configure_cmake, run_simulation, run_complete_test
and a variety of unrelated comment removals / inference-branch
restructuring. Most of that was incidental and had nothing to do with
training.
Reset execution.py to the upstream-devel state and add only the three
surgical hooks the training path actually needs:
1. One import line:
from ..trainingUtils import add_training_cmake_flags, run_training_codegen
2. A 3-line early-return block at the top of generate_network:
if config.training:
run_training_codegen(config, script_dir)
return
3. A one-line call inside configure_cmake to append the training
cmake defines:
add_training_cmake_flags(cmd, config.training, config.n_train_steps,
config.n_accum_steps, config.training_num_data_inputs)
Everything else — inference's generate_network body, run_simulation,
run_complete_test, all upstream comments — is byte-identical to upstream.
The full two-stage training codegen pipeline (training-script subprocess,
training_meta.json readback, optimizer-script subprocess with passthrough
filter and skip checks) now lives in a new run_training_codegen() in
testUtils/trainingUtils.py, which is the sole module where training
test-harness logic is concentrated.
Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every
step) in both non-tiled and tiled runs.
…ateTraining.py testUtils/codeGenerate.py had accumulated ~880 lines of training / optimizer codegen helpers piled on top of the upstream inference module, plus a dead "Initialize all output buffers to zero" block that had crept into generateTestNetworkImplementation (it declared local variables and looped over graph outputs without ever emitting anything into retStr). The net effect was that this PR's diff in codeGenerate.py was +883 lines and an interleaving that made the inference vs training boundary invisible to reviewers. Reset testUtils/codeGenerate.py to byte-identical upstream/devel and move every training-side helper to a new testUtils/codeGenerateTraining.py: - generateTrainingTestInputsHeader / generateTrainingTestOutputsHeader - generateTrainingNetworkHeader / generateTrainingNetworkImplementation - generateTrainingTestNetwork - build_shared_buffer_maps / _patch_shared_buffers / _patch_shared_arenas - _ensure_training_l1_capacity - generateOptimizerNetworkHeader / generateOptimizerNetworkImplementation - generateOptimizerTestNetwork codeGenerateTraining.py imports generateL3HexDump from codeGenerate (the only inference helper the training functions reuse), which is a clean forward dependency. No back-edge from codeGenerate to training. The four codegen entry points are updated to pull their training / optimizer helpers from testUtils.codeGenerateTraining instead of testUtils.codeGenerate. No other files change. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
…r.py
DeeployTest/deeployTrainingRunner.py was a third training-runner entry
point that peeked at --tiled / -p ahead of testUtils.deeployTrainingRunner.main()
and forwarded tiling_enabled accordingly. It duplicated what the two
explicit stubs already do by file name:
deeployTrainingRunner_siracusa.py → main(tiling_enabled=False)
deeployTrainingRunner_tiled_siracusa.py → main(tiling_enabled=True)
Nothing in the repo (py imports, CMake, CI workflows) references the
top-level file; only its own docstring and an offline AI_AGENT planning
doc mention it. Remove it so the training CLI surface is just the two
file-name-scoped stubs, consistent with the preference to keep the
non-tiled and tiled variants visually split.
Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at
every step) in both non-tiled and tiled runs.
…ngUtils
Apply a strict "only training-specific helpers get a dedicated function"
rule to testUtils/trainingUtils.py. The following wrappers were removed
because their content was not training-specific:
- add_cores_arg (--cores is generic codegen)
- add_memory_level_args (--l1 / --l2 / --defaultMemLevel — tiling generic)
- add_tiling_solver_args (--memAllocStrategy / --searchStrategy /
--plotMemAlloc / --profileTiling — tiling generic)
- add_should_fail_arg (--shouldFail is used by all codegen scripts)
- run_with_shouldfail (the try/except shouldFail handshake is used
by all codegen scripts)
- build_codegen_cmd (generic [python, script, -d, -t, -p] prefix)
- run_codegen_subprocess (generic log + subprocess.run + check)
- filter_passthrough_args (generic list comprehension)
The first five are now inlined in each of the four training codegen
entry points (generateTrainingNetwork.py, testMVPTraining.py,
generateOptimizerNetwork.py, testMVPOptimizer.py), matching the style of
the upstream inference scripts which define these args inline as well.
The last three (subprocess helpers) were only called from inside
run_training_codegen() itself, so they become inline code there.
What stays in trainingUtils.py are only helpers whose content is
genuinely training-specific: the inputs.npz/outputs.npz readers, the
Tiler _mockScheduler, the training-argument argparse builders
(add_training_inference_args, add_optimizer_training_dir_arg),
resolve_optimizer_dir, add_training_cmake_flags and run_training_codegen.
Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every
step) in both non-tiled and tiled runs.
…input defaults
Two fixes to the inputTypes / inputOffsets build loop in both
generateTrainingNetwork.py and testMVPTraining.py (the two files share
this pattern verbatim).
1. The `name` loop variable in `for graph_idx, name in enumerate(...)`
was dead — switched to `for graph_idx in range(len(...))` so the
intent is explicit and the lint warning goes away.
2. The zero-sized-input branch (`np.prod(arr.shape) == 0`) previously
`pass`ed without populating inputTypes[f"input_{idx}"] or
inputOffsets[f"input_{idx}"]. ONNX permits optional placeholder
inputs with shape like (0,) and the downstream deployer looks up
every input by key, so a zero-sized input would later raise a
confusing KeyError far from the cause. Populate the entries with
a trivial default (float32 pointer, offset 0) — the type does not
matter for codegen since the buffer has no elements, but the key
must exist.
Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every
step) in both non-tiled and tiled runs.
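The zero-sized-input default from point 2 can be sketched as follows (illustrative; the array list and the input_types / input_offsets map names are assumptions standing in for the entry-point code):

```python
import numpy as np

# Hypothetical stand-ins for the npz input arrays and the codegen maps.
arrays = [np.zeros((2, 3), dtype=np.float32), np.zeros((0,), dtype=np.float32)]
input_types, input_offsets = {}, {}

for graph_idx in range(len(arrays)):   # no dead `name` loop variable
    arr = arrays[graph_idx]
    key = f"input_{graph_idx}"
    if np.prod(arr.shape) == 0:
        # Optional placeholder input, e.g. shape (0,): the type is irrelevant
        # since the buffer has no elements, but the key must exist or a
        # downstream lookup raises a confusing KeyError far from the cause.
        input_types[key] = "float32"
        input_offsets[key] = 0
    else:
        input_types[key] = str(arr.dtype)
        input_offsets[key] = 0

assert "input_1" in input_types and input_offsets["input_1"] == 0
```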
CI pre-commit hook flagged: - yapf: collapse 3-line set comprehension in Tiler.aliasBlocks and rewrap two-line arg lists in trainingUtils.add_training_cmake_flags and run_training_codegen. - autoflake: drop unused `from typing import List` imports in testMVPTraining.py and testMVPOptimizer.py. Reformat only, no semantic change. MLP non-tiled + tiled regression still pass with diff=0.000000 on all four steps.
CCT_Train/CCT2_FT2 ONNX still uses the legacy 1-output SoftmaxCrossEntropyLoss / SoftmaxCrossEntropyLossGrad signature, which this PR deprecated in favour of the canonical 2-output (loss + log_prob) form in 763b464. The parser now rejects the old form, so CCT2_FT2 fails network generation with 'Did not find adequate mapping for graph ... SoftmaxCrossEntropyLossParser: Exhausted backtracking' at the L3-singlebuffer tiled check. Regenerating CCT_Train fixtures to the canonical 2-output form is out of scope for this PR; drop CCT2_FT2 from the L3 singlebuffer and L3 doublebuffer model lists. The other upstream coverage for CCT_2_32_32_128 (inference-side CCT) is retained.
Summary
Minimum viable MLP end-to-end training on Siracusa (untiled + tiled). Brings in only the framework + 4 operators needed to build a complete forward → backward → gradient-accumulation → SGD graph. Targets `devel`.
Deliberately excluded (future follow-up PRs): GAP9 port, all Conv/Pool/Norm gradients, MSELoss, GroupNorm, PULPTrainlib submodule, FP32 forward kernels in `TargetLibraries/`.
Training framework
Verification
```bash
# Untiled codegen + build
python DeeployTest/deeployTrainingRunner_siracusa.py \
    -t /app/Onnx4Deeploy/onnx/model/simplemlp_train \
    --optimizer-dir /app/Onnx4Deeploy/onnx/model/simplemlp_optimizer \
    --skipsim --n-accum 1
# -> ✓ Test simplemlp_train PASSED

# Tiled codegen + build
python DeeployTest/deeployTrainingRunner_tiled_siracusa.py \
    -t /app/Onnx4Deeploy/onnx/model/simplemlp_train \
    --optimizer-dir /app/Onnx4Deeploy/onnx/model/simplemlp_optimizer \
    --skipsim --n-accum 4
# -> ✓ Test simplemlp_train PASSED
```
Test plan