Training Platform Core: minimal MLP end-to-end training on Siracusa#182
runwangdl wants to merge 28 commits into pulp-platform:devel
Conversation
Bring in the minimum set of changes from the TrainingPlatform branch needed to generate working C code for an end-to-end MLP training graph (forward + backward + gradient accumulation + SGD weight update) on the Siracusa platform, both untiled and tiled. Scope is deliberately narrow: only MLP. No Conv/Pool/Norm gradients, no PULPTrainlib submodule, no new C kernel sources. The four operators involved (Gemm, SoftmaxCrossEntropyLoss + Grad, SGD, InPlaceAccumulatorV2) are all implemented as inline template strings, so nothing new lands in TargetLibraries/.

Operators:

- InPlaceAccumulatorV2: new ORT com.microsoft operator with full Parser/Layer/TypeChecker/Template/TileConstraint/Binding scaffolding.
- SoftmaxCrossEntropyLoss: the parser now accepts both the 1-output (legacy, log_prob only) and 2-output (loss + log_prob) signatures; the checker picks the correct binding via checkOutputType based on the 'loss' key. New PULPSoftmaxCrossEntropyLossDualOutputBindings and SoftmaxCrossEntropyLossDualOutputTileConstraint handle the dual-output case.
- SGD: SGDTemplate now aliases weight_updated onto weight so the tiled egress DMA writes updated weights back to the weight's L2 buffer.
- Gemm: tiny FloatGemmTemplate import cleanup.

Framework fixes (all operator-agnostic):

- DeeployTypes.VariableBuffer.isLive: treat is_input / is_output buffers as live across the whole step range.
- TilingExtension.TilerExtension: skip zero-sized in-place alias outputs in the MiniMalloc CSV and resolve their addrSpace from the alias target after the solver runs.
- TilingExtension.TilingVariableReplacement: change the per-tile reference update from '*ref = array[i];' to 'ref = &array[i];' to avoid permanently mutating static PI_L1 arrays between training steps.

Training test framework:

- testMVPTraining.py, testMVPOptimizer.py, generateTrainingNetwork.py, generateOptimizerNetwork.py: new codegen entry points for the two-graph (TrainingNetwork + OptimizerNetwork) architecture.
- deeployTrainingRunner.py, deeployTrainingRunner_siracusa.py, deeployTrainingRunner_tiled_siracusa.py and their testUtils helper: training test runners mirroring the inference deeployRunner pattern.
- testUtils/codeGenerate.py: training-specific codegen helpers (generateTrainingTestNetwork, generateOptimizerTestNetwork, build_shared_buffer_maps, _patch_shared_buffers / _shared_arenas, _ensure_training_l1_capacity) and the L3 hex-dump path.
- testUtils/core/execution.py: L3 flash-image detection and load.
- testUtils/tilingUtils.py: TrainingSBTiler.
- CMakeLists.txt (top + Siracusa): TRAINING-gated targets that build TrainingNetwork.c and OptimizerNetwork.c side by side.
- DeeployTest/Platforms/Siracusa/src/deeploytraintest.c: training test harness that drives the two networks across multiple training steps with gradient accumulation.

Scope deliberately excluded (future follow-up PRs):

- GAP9 platform port (separate PR, depends on this).
- All Conv/Pool/Norm/Relu gradient operators.
- MSELoss/MSELossGrad, GroupNormalization, dual-output MaxPool.
- PULPTrainlib submodule (only needed for ConvGrad kernels).
- FP32 forward AvgPool/BatchNorm/MaxPool/Layernorm/Relu kernels.

Verified locally: testMVPTraining.py -t simplemlp_train and testMVPOptimizer.py -t simplemlp_optimizer both succeed on Siracusa with --n-accum 1, producing TrainingNetwork.c (2027 lines) and OptimizerNetwork.c containing Gemm / SoftmaxCrossEntropyLoss / SoftmaxCrossEntropyLossGrad / InPlaceAccumulatorV2 / SGD.
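The per-step contract the two graphs implement can be sketched in a few lines of plain Python: the TrainingNetwork accumulates micro-batch gradients (InPlaceAccumulatorV2 semantics: overwrite on lazy reset, otherwise add in place), and the OptimizerNetwork then applies one SGD step. All names below are illustrative, not the generated C symbols.

```python
def accumulate(grad_acc, grad, lazy_reset):
    # InPlaceAccumulatorV2: first micro-batch overwrites, the rest add in place
    for i, g in enumerate(grad):
        grad_acc[i] = g if lazy_reset else grad_acc[i] + g
    return grad_acc

def sgd_step(weight, grad_acc, lr, n_accum):
    # weight_updated aliases weight, so the update happens in place
    for i in range(len(weight)):
        weight[i] -= lr * grad_acc[i] / n_accum
    return weight

w = [1.0, -2.0]
acc = [0.0, 0.0]
for step, g in enumerate([[1.0, 2.0], [3.0, 4.0]]):  # two micro-batches
    accumulate(acc, g, lazy_reset = (step == 0))
sgd_step(w, acc, lr = 0.5, n_accum = 2)
print(w)  # [0.0, -3.5]
```
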
📝 Walkthrough

Adds training support: new InPlaceAccumulatorV2 operator (parser, checker, layer, PULP template, tiling constraint, bindings), SoftmaxCrossEntropyLoss dual-output handling, SGD alias/allocation changes, extensive training/optimizer code-generation and runner scripts, tiler/allocation alias fixes, and CMake/runtime test-harness updates for training workflows.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant Generator
    participant Deployer
    participant Tiler
    participant CodeGen
    participant Build
    participant Simulator
    User->>CLI: invoke training runner
    CLI->>Generator: request training + optimizer generation
    Generator->>Deployer: import ONNX, mark grad buffers & aliases
    Deployer->>Tiler: request tiling (TrainingSBTiler)
    Tiler->>Deployer: return tiling schedules & replacements
    Deployer->>CodeGen: emit TrainingNetwork & OptimizerNetwork sources
    CodeGen->>Build: write files and patch shared buffer maps
    Build->>Simulator: run built binaries
    Simulator-->>User: stream logs/results
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py (1)
40-82: ⚠️ Potential issue | 🟡 Minor. Pipeline failure: formatting changes required by yapf.
The CI indicates yapf formatting is needed for lines 4-40 (import wrapping and NodeTemplate call wrapping). Please run the formatter to resolve:
```shell
yapf -i Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py` around lines 40 - 82, The template file's formatting doesn't match yapf rules: run the formatter (e.g., yapf -i Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py) to fix import wrapping and the long NodeTemplate/_PULPSGDTemplate call wrapping around referenceTemplate and the multi-line string; ensure the _PULPSGDTemplate(...) invocation and the assignment to referenceTemplate are wrapped/indented per yapf so the triple-quoted template string lines align with the project's style.
🧹 Nitpick comments (12)
DeeployTest/testUtils/deeployTrainingRunner.py (3)
18-21: PATH modification at module import time may have unintended side effects. Modifying os.environ['PATH'] at module load time affects the entire Python process globally. If this module is imported by other code (not just as an entry point), this could unexpectedly change path resolution for all subsequent subprocess calls. Consider moving this into the main() function to limit scope:

♻️ Suggested refactor

```diff
 from typing import Optional

-# gapy (gvsoc launcher) uses `#!/usr/bin/env python3`. Put /usr/bin first so
-# it resolves to /usr/bin/python3 which has all required packages (gapylib,
-# prettytable, …) rather than the minimal venv python.
-os.environ['PATH'] = '/usr/bin:' + os.environ.get('PATH', '')
-
 from .core import DeeployTestConfig, run_complete_test
```

Then at the start of main():

```python
def main(tiling_enabled: bool = False, default_platform: str = 'Siracusa', default_simulator: str = 'gvsoc'):
    # gapy (gvsoc launcher) uses `#!/usr/bin/env python3`. Put /usr/bin first so
    # it resolves to /usr/bin/python3 which has all required packages.
    os.environ['PATH'] = '/usr/bin:' + os.environ.get('PATH', '')
    ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/deeployTrainingRunner.py` around lines 18 - 21, The module currently mutates os.environ['PATH'] at import time which has global side effects; move that assignment into the entrypoint function main() so PATH is only changed when the script is executed (not on import). Locate the top-level os.environ['PATH'] modification and remove it from module scope, then add the same assignment at the start of the main(tiling_enabled: bool = False, default_platform: str = 'Siracusa', default_simulator: str = 'gvsoc') function so the change is scoped to execution; keep the exact string '/usr/bin:' + os.environ.get('PATH', '') and the explanatory comment near the new location.
97-100: Consider iterable unpacking for cleaner list construction. Per static analysis (Ruff RUF005), use iterable unpacking instead of concatenation:
♻️ Suggested fix
```diff
 if args.input_type_map:
-    gen_args.extend(['--input-type-map'] + list(args.input_type_map))
+    gen_args.extend(['--input-type-map', *args.input_type_map])
 if args.input_offset_map:
-    gen_args.extend(['--input-offset-map'] + list(args.input_offset_map))
+    gen_args.extend(['--input-offset-map', *args.input_offset_map])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/deeployTrainingRunner.py` around lines 97 - 100, Replace the list concatenation when extending gen_args with iterable unpacking: instead of gen_args.extend(['--input-type-map'] + list(args.input_type_map)) and the similar input_offset_map line, use gen_args.extend(['--input-type-map', *args.input_type_map]) and gen_args.extend(['--input-offset-map', *args.input_offset_map]) (apply to the gen_args variable and the args.input_type_map / args.input_offset_map uses) so the code uses iterable unpacking rather than building a new list.
103-116: Asymmetric default handling for l1 vs l2 arguments. Line 107 checks args.l2 != 1024000 before appending, but lines 105-106 don't have similar logic for l1. According to the context snippet from DeeployRunnerArgumentParser, the default for l1 is 64000. This inconsistency means --l1=64000 would be passed explicitly while --l2=1024000 would be omitted. If the intent is to omit default values from gen_args, apply consistent handling:
♻️ Suggested fix
```diff
 if getattr(args, 'defaultMemLevel', None):
     gen_args.append(f'--defaultMemLevel={args.defaultMemLevel}')
-if getattr(args, 'l1', None):
+if getattr(args, 'l1', None) and args.l1 != 64000:
     gen_args.append(f'--l1={args.l1}')
 if getattr(args, 'l2', None) and args.l2 != 1024000:
     gen_args.append(f'--l2={args.l2}')
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/deeployTrainingRunner.py` around lines 103 - 116, The gen_args building is inconsistent: args.l2 is only appended when not equal to its default (1024000) but args.l1 is always appended, causing the default 64000 to be emitted; update the branch for args.l1 in DeeployTest/testUtils/deeployTrainingRunner.py so it mirrors args.l2 by only appending f'--l1={args.l1}' when getattr(args, 'l1', None) is truthy and args.l1 != 64000 (the default defined in DeeployRunnerArgumentParser), keeping the existing behavior for args.l2 and leaving other flags (defaultMemLevel, memAllocStrategy, searchStrategy, profileTiling, plotMemAlloc) unchanged.

Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py (1)
31-54: Debug printf may cause unwanted output in production. Line 52 includes a printf statement that will output loss values during execution. While useful for debugging, this could flood logs in production or affect performance:

```c
printf(" [SCE] loss=%.6f\\r\\n", (double)${loss}[0]);
```

Consider either:

- Making it conditional on a debug flag (e.g., #ifdef DEBUG_LOSS)
- Removing it for production builds
- Using a logging mechanism that can be disabled
♻️ Suggested conditional approach
```diff
 ${loss}[0] = sce_total_loss / (float32_t)${batch};
+#ifdef DEBUG_SCE_LOSS
 printf(" [SCE] loss=%.6f\\r\\n", (double)${loss}[0]);
+#endif
 END_SINGLE_CORE
```

The numerical implementation using max-subtraction for stability is correct.
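For reference, the max-subtraction scheme the template relies on can be sketched standalone in Python (illustrative names, not the template's symbols): subtracting the row maximum before exponentiating keeps exp() finite for arbitrarily large logits.

```python
import math

def sce_loss(logits, label):
    # Stable log-softmax: subtract the row max, then loss = -log_prob[label]
    m = max(logits)
    log_sum = math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[label] - m - log_sum)

# math.exp(1002.0) alone would overflow; with max subtraction the loss is finite
loss = sce_loss([1000.0, 1001.0, 1002.0], 2)
assert math.isfinite(loss) and 0.0 < loss < 1.0
```
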
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py` around lines 31 - 54, The printf in the SoftmaxCrossEntropyLoss dual-output template (in referenceDualOutputTemplate) prints loss each run and should be guarded or removed; wrap the printf(" [SCE] loss=...") call with a compile-time or runtime guard (e.g., `#ifdef` DEBUG_LOSS / `#endif` or check a debug flag) or replace it with a pluggable logger that can be disabled so production builds do not emit or pay the cost of this output, ensuring you only change the printf line (and related surrounding template text) inside referenceDualOutputTemplate.

DeeployTest/deeployTrainingRunner_tiled_siracusa.py (1)
1-11: Consider using python3 in the shebang for explicit Python 3. The shebang #!/usr/bin/env python may resolve to Python 2 on some legacy systems. For consistency with the comment in deeployTrainingRunner.py (which mentions /usr/bin/python3), consider:

```diff
-#!/usr/bin/env python
+#!/usr/bin/env python3
```

Otherwise, the wrapper script is clean and correct.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/deeployTrainingRunner_tiled_siracusa.py` around lines 1 - 11, Update the script shebang to explicitly use Python 3 by changing the first line from "#!/usr/bin/env python" to "#!/usr/bin/env python3" so that running deeployTrainingRunner_tiled_siracusa.py invokes Python 3 consistently; no other code changes needed—this affects the interpreter used when executing the module that calls main(tiling_enabled = True) from testUtils.deeployTrainingRunner.

Deeploy/DeeployTypes.py (1)
339-349: LGTM: Liveness propagation for input/output buffers prevents premature deallocation. The change correctly ensures that input/output VariableBuffers (and their aliases) are treated as live throughout the network. This is essential for training, where weight buffers need to persist across forward/backward passes. One minor style issue: line 343 uses next as a variable name, which shadows the Python builtin (Ruff A001). Consider renaming:

♻️ Optional rename
```diff
 while len(queue) > 0:
-    next = queue.pop()
-    buffNext = ctxt.lookup(next)
+    next_name = queue.pop()
+    buffNext = ctxt.lookup(next_name)
     assert isinstance(buffNext, VariableBuffer)
     live |= buffNext._live or buffNext.is_input or buffNext.is_output
-    visited.add(next)
+    visited.add(next_name)
     queue |= buffNext.aliases - visited
```
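The alias-closure traversal behind this liveness check is a small worklist loop; a standalone sketch of the semantics, using plain dicts in place of the real NetworkContext/VariableBuffer classes (illustrative, not Deeploy's API):

```python
def is_live(name, buffers):
    # A buffer is live if it, or anything reachable through its alias set,
    # is marked live or is a network input/output.
    queue, visited, live = {name}, set(), False
    while queue:
        cur = queue.pop()
        buf = buffers[cur]
        live |= (buf.get("live", False) or buf.get("is_input", False)
                 or buf.get("is_output", False))
        visited.add(cur)                               # guards against alias cycles
        queue |= buf.get("aliases", set()) - visited
    return live

buffers = {
    "grad_acc": {"aliases": {"weight"}},
    "weight": {"is_input": True, "aliases": {"grad_acc"}},  # mutual aliases
}
print(is_live("grad_acc", buffers))  # True: aliases a network input
```
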
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/DeeployTypes.py` around lines 339 - 349, The code uses the builtin name 'next' as a local variable which shadows Python's builtin; in the liveness traversal loop (where variables live, queue, visited, buffNext and ctxt.lookup/VariableBuffer are used) rename the local variable 'next' to a non-conflicting identifier such as 'next_name' (and update its uses in queue.pop(), ctxt.lookup(next_name), visited.add(next_name), and queue |= buffNext.aliases - visited) to avoid the Ruff A001 builtin-shadowing violation.

DeeployTest/testUtils/tilingUtils.py (1)
63-66: Rename unused loop variable to satisfy linter. Static analysis correctly flags that lifetime is not used within the loop body. Only tensorName is needed to look up the buffer.

♻️ Proposed fix
```diff
-for tensorName, lifetime in tensorLifetimeMap.items():
+for tensorName, _lifetime in tensorLifetimeMap.items():
     buffer = ctxt.lookup(tensorName)
     if buffer.is_input:
         tensorLifetimeMap[tensorName] = (0, maxStepIdx)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/tilingUtils.py` around lines 63 - 66, The loop over tensorLifetimeMap uses an unused variable named lifetime which lints as unused; change the loop signature in the for loop that currently reads "for tensorName, lifetime in tensorLifetimeMap.items()" to use a throwaway name (e.g., "_" or "_lifetime") so only tensorName is considered, leaving the body (buffer = ctxt.lookup(tensorName); if buffer.is_input: tensorLifetimeMap[tensorName] = (0, maxStepIdx)) unchanged; this targets the for loop that iterates tensorLifetimeMap.items() and references ctxt.lookup and buffer.is_input.

DeeployTest/generateOptimizerNetwork.py (1)
28-28: Remove unused numpy import. numpy is imported as np but not used directly in this file. If build_shared_buffer_maps or other imported functions need numpy, they handle their own imports.

♻️ Proposed fix
```diff
-import numpy as np
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/generateOptimizerNetwork.py` at line 28, Remove the unused `numpy` import in generateOptimizerNetwork.py: delete the line importing `numpy as np` since no code in this module references `np` (functions like build_shared_buffer_maps and other imports should manage their own numpy usage); ensure no other references to `np` remain in the file after removal.

DeeployTest/testUtils/codeGenerate.py (3)
656-658: Shell injection warning is a false positive for internal tooling. The os.system call for clang-format uses paths derived from the dumpdir parameter, which is controlled by the test framework, not external user input. However, using subprocess.run would be more robust and avoid shell parsing issues with special characters in paths.

Proposed fix using subprocess.run

```diff
 clang_format = "{BasedOnStyle: llvm, IndentWidth: 2, ColumnLimit: 160}"
 for fname in ['TrainingNetwork.c', 'TrainingNetwork.h', 'testinputs.h', 'testoutputs.h']:
-    os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/{fname}')
+    subprocess.run(['clang-format', '-i', f'--style={clang_format}', f'{dumpdir}/{fname}'], check=False)
```

Note: the subprocess import already exists at top of file; add it if missing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/codeGenerate.py` around lines 656 - 658, The loop that calls os.system to run clang-format with a shell is brittle and flagged for shell injection; replace the os.system usage in the block where clang_format is defined and the for fname in [...] loop uses dumpdir by invoking subprocess.run with an argument list (e.g., ['clang-format','-i', f'--style={clang_format}', f'{dumpdir}/{fname}']), pass check=True to raise on failure, and ensure subprocess is imported; this removes shell parsing, keeps the same flags/behavior, and uses shell=False for safety.
1160-1162: Same os.system suggestion as for the training code generation. Consider using subprocess.run for consistency and robustness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/codeGenerate.py` around lines 1160 - 1162, Replace the os.system calls used to run clang-format with subprocess.run for robustness: in the block where clang_format is defined and the loop over ['OptimizerNetwork.c', 'OptimizerNetwork.h'] executes, call subprocess.run with a list of arguments (e.g. ['clang-format','-i','--style='+clang_format,f'{dumpdir}/{fname}']), set check=True to raise on non-zero exit, and avoid shell=True; optionally capture stdout/stderr or let it inherit for visibility. Ensure you import subprocess at the top and handle CalledProcessError (or let it propagate) so failures are not silently ignored.
308-311: Consider using explicit Optional type hints. PEP 484 prohibits implicit Optional (e.g., init_weights: List[np.ndarray] = None). Use explicit Optional[List[np.ndarray]] for clarity.

Proposed fix
```diff
 def generateTrainingTestInputsHeader(deployer: NetworkDeployer, all_mb_data: List[List[np.ndarray]],
                                      n_steps: int, n_accum: int, grad_buf_start_idx: int = 0,
                                      num_grad_inputs: int = 0,
-                                     learning_rate: float = 0.001, init_weights: List[np.ndarray] = None,
-                                     data_size: int = None) -> str:
+                                     learning_rate: float = 0.001, init_weights: Optional[List[np.ndarray]] = None,
+                                     data_size: Optional[int] = None) -> str:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/codeGenerate.py` around lines 308 - 311, The function signature for generateTrainingTestInputsHeader uses implicit Optional defaults (init_weights: List[np.ndarray] = None and data_size: int = None); import typing.Optional and change those annotations to explicit Optionals (e.g., init_weights: Optional[List[np.ndarray]] and data_size: Optional[int]) so the types follow PEP 484; update the function signature accordingly and add the import if missing.

DeeployTest/testMVPTraining.py (1)
34-96: Consider extracting shared inference helpers to a common module. These helper functions (_infer_num_data_inputs, _infer_total_mb, _infer_data_size, _infer_n_accum) are duplicated from generateTrainingNetwork.py. While the comment at lines 34-36 acknowledges this, extracting them to a shared utility module (e.g., testUtils/trainingUtils.py) would reduce maintenance burden and ensure consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testMVPTraining.py` around lines 34 - 96, The four helper functions (_infer_num_data_inputs, _infer_total_mb, _infer_data_size, _infer_n_accum) are duplicated; extract them into a shared utility module (e.g., testUtils/trainingUtils.py) and replace the local copies with imports. Move the implementations into the new module, export the four functions (and _load_reference_losses if desired), update imports in DeeployTest/testMVPTraining.py (use from testUtils.trainingUtils import _infer_num_data_inputs, _infer_total_mb, _infer_data_size, _infer_n_accum), remove the duplicate definitions from testMVPTraining.py, run unit tests and update any relative import paths or packaging so both generateTrainingNetwork.py and testMVPTraining.py import the single shared implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Deeploy/Targets/Generic/Parsers.py`:
- Around line 2728-2748: The parser only checks arity in parseNode/parseNodeCtxt
but must validate tensor extents: in parseNodeCtxt (using ctxt.lookup results
for buffer, gradient, lazy_reset_grad, data_out) verify that buffer.shape,
gradient.shape and data_out.shape are identical (same number of elements and
per-dimension sizes) and that lazy_reset_grad is a scalar/has size 1; if any
check fails, return ctxt, False (or otherwise reject the node) instead of
proceeding; compute operatorRepresentation['size'] from the validated common
shape (e.g., prod(buffer.shape) after checks) and do not accept the node when
shapes mismatch to prevent out-of-bounds reads/writes.
In `@Deeploy/Targets/Generic/TypeCheckers.py`:
- Around line 577-582: The override checkOutputType in TypeCheckers currently
only compares output count and skips the base SignPropTypeChecker validation;
update checkOutputType (method on this class) to first compute
actual_num_outputs from operatorRepresentation.get('loss', '') as it does now,
then if counts match delegate to the base implementation (call
super().checkOutputType(inputs, operatorRepresentation)) and return the logical
AND of the count check and the base check result so both count gating and the
inherited property/type validation are enforced.
In `@Deeploy/Targets/PULPOpen/Platform.py`:
- Around line 17-27: The file has formatting errors (import wrapping and mapping
formatting) causing CI failures; run the project formatter (yapf) on
Deeploy/Targets/PULPOpen/Platform.py and reformat the long import lists and any
mapping/dictionary blocks so they follow the configured style. Specifically
rewrap the multi-line import statements that include symbols like GEMMLayer,
InPlaceAccumulatorV2Layer, LayerNormLayer and the parser imports such as
AddParser, ConcatParser, GELUParser, GEMMParser, etc., and also reformat the
other problematic ranges noted (around the imports at lines ~42-53, ~109-114,
and ~156-160) to eliminate trailing commas/incorrect wrapping per yapf rules
before committing.
In `@Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py`:
- Around line 5-7: The file imports an unused symbol Dict from typing which
triggers the CI lint hook; remove Dict from the import list in the typing import
line so the line reads only the used types (e.g., List, Tuple) and ensure any
references to Dict in functions or classes like NetworkContext, NodeTemplate,
OperatorRepresentation, or VariableBuffer are not affected; update the import
statement in FloatInPlaceAccumulatorV2Template.py accordingly.
In `@Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py`:
- Around line 14-18: ReluGradTileConstraint is an unused/inconsistent tile
constraint: remove the entire ReluGradTileConstraint class (including its
dataIn1Name/dataIn2Name/dataOutName assignments) since it inverts the
input/output mapping compared to GeluGradTileConstraint and the GELUGradParser
and there is no ReluGradParser to accompany it; alternatively, if you intend to
support ReLU backward, implement a matching ReluGradParser and ensure the class
uses the same contract as GELUGradParser (expect grad_in as the first input and
grad_out as the output) and update the dataIn/dataOut names accordingly so
mappings align with GeluGradTileConstraint and the parser contract.
In
`@Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py`:
- Around line 18-74: The wrapTilingSolution currently appends the scalar loss to
every tile which causes last-tile overwrite if the batch is split; modify
SoftmaxCrossEntropyLossDualOutputTileConstraint.wrapTilingSolution to detect
when tilingSchedules represents a split batch (e.g. len(tilingSchedules) > 1 or
multiple steps in each schedule) and prevent exporting a per-tile scalar: either
(a) raise an explicit error/AssertionError explaining that batch tiling is
forbidden for the dual-output variant, or (b) implement aggregation of partial
losses into a single scalar before adding cls.dataLossName to schedule; perform
this check after calling super().wrapTilingSolution (use variables
tilingSchedules, schedule.outputLoadSchedule, logProbVar, lossVar) and
return/raise accordingly.
In `@Deeploy/TilingExtension/TilerExtension.py`:
- Around line 477-490: The fallback assignment memoryBlock._addrSpace = (0, 0)
violates the addrSpace invariant (addrSpace[0] < addrSpace[1]) and must be
removed; in the alias-resolution loop (references: memoryBlock, memoryMap,
aliasBlocks, ctxt.dealiasBuffer, _addrSpace) change the "alias target not in
this memoryMap" branch to NOT assign a zero-width tuple — either raise a clear
exception indicating unresolved aliasTarget or simply skip setting
memoryBlock._addrSpace (leave it as None) so existing null checks handle it;
optionally emit a warning log mentioning the unresolved aliasTarget for
debugging.
In `@DeeployTest/generateTrainingNetwork.py`:
- Around line 178-202: The loop over graph_input_names is using an unused
variable name and silently skips empty arrays causing missing entries in
inputTypes/inputOffsets; update the for-loop to use an underscore for the unused
variable (e.g., for graph_idx, _ in enumerate(graph_input_names)) or otherwise
reference name if intended, and when np.prod(arr.shape) == 0 do not silently
pass—assign a sensible default type/offset (e.g., PointerClass(float32_t) and 0
or a sentinel) or explicitly log/raise so inputTypes[f"input_{graph_idx}"] and
inputOffsets[f"input_{graph_idx}"] are always set; ensure this change is applied
where npz_base is indexed and when calling inferTypeAndOffset so empty arrays
don’t leave entries unset.
In `@DeeployTest/Platforms/Siracusa/CMakeLists.txt`:
- Line 35: The line using target_compile_options is incorrect because
${NETWORK_LIB} is a target name, not a compile option; remove the line
target_compile_options(${ProjectId} INTERFACE ${NETWORK_LIB}) entirely. If the
intent was to inherit compile flags from the network library, ensure
target_link_libraries(${ProjectId} PUBLIC|PRIVATE ${NETWORK_LIB}) is used (it
already should propagate options) or explicitly add real option strings via
target_compile_options(${ProjectId} PUBLIC <options>) or extract flags from
${NETWORK_LIB} with get_target_property before applying them.
In `@DeeployTest/Platforms/Siracusa/src/deeploytraintest.c`:
- Around line 397-413: The test currently always returns 0 even when loss
verification reports mismatches; after the cluster loss comparison (where
loss_err_count is populated via LossCompareArgs and CompareLossesOnCluster)
change the function exit to return a non-zero failure code when loss_err_count >
0 (e.g., return loss_err_count or return 1) and return 0 otherwise so the
harness fails the process on verification errors; update the return at the end
of the function that currently returns 0 accordingly.
- Around line 121-137: The L3→L3 branch of l3_aware_copy currently allocates the
entire buffer in L2 (pi_l2_malloc(bytes)) which will fail for large tensors;
change it to perform bounded chunked transfers: allocate a fixed safe chunk size
(e.g., a MAX_L2_CHUNK constant), loop over the source with offsets, for each
iteration allocate an L2 tmp buffer of min(remaining, CHUNK), call ram_read(tmp,
src+offset, chunk_size) and ram_write(dst+offset, tmp, chunk_size), free the tmp
each iteration, and ensure you check pi_l2_malloc's return and handle allocation
failure by cleaning up and returning/propagating an error; retain existing
direct memcpy/ram_read/ram_write logic for other branches and reuse ram_read,
ram_write, pi_l2_malloc, pi_l2_free symbols.
In `@DeeployTest/testMVPTraining.py`:
- Around line 159-179: The loop over graph_input_names uses an unused variable
name (rename to _name) and silently skips empty arrays causing downstream
KeyError; update the for loop header to for graph_idx, _name in
enumerate(graph_input_names) and replace the np.prod(arr.shape) == 0 branch with
an explicit assignment so empty arrays still get entries in
inputTypes/inputOffsets (e.g., set inputTypes[f"input_{graph_idx}"] =
PointerClass(float32_t) and inputOffsets[f"input_{graph_idx}"] = 0 or another
safe default), and optionally add a debug log mentioning graph_idx when you
assign the default; this touches graph_input_names, grad_acc_set,
npz_base/npz_idx, inputTypes, inputOffsets, and inferTypeAndOffset.
In `@DeeployTest/testUtils/codeGenerate.py`:
- Around line 967-976: The code opens TrainingNetwork.h with
open(train_h_path).read() without a context manager; change it to read the file
using a with statement (e.g., with open(train_h_path, 'r') as f: train_h =
f.read()) before running the re.sub to produce train_h_new and then write with
the existing with open(..., 'w') block; ensure you reference the same variables
train_h_path, train_h, train_h_new and keep the regex replacement for
DeeployNetwork_MEMORYARENA_L1_len unchanged.
- Around line 199-210: The loop over output buffers (variables output_idx,
output_buffer, output_size, typeName using deployer.ctxt.lookup/is_buffer) is
currently dead—compute values but never emit initialization—so either remove the
loop entirely or implement the intended zero-initialization: for each
output_buffer discovered, compute the total element/byte count via output_size
and typeName and append the appropriate zeroing code to retStr (e.g., a memset
or generated loop using the buffer name) so the produced code actually
initializes output_X to zero; update retStr exactly where the loop currently
increments output_idx and ensure you reference the buffer name from
output_buffer when emitting the init.
In `@DeeployTest/testUtils/core/execution.py`:
- Around line 194-197: The second warning uses an unnecessary f-string: in the
block checking opt_script.exists() change the log.warning call that currently
uses f"generateOptimizerNetwork.py not found — skipping optimizer codegen" to a
plain string literal without the leading f so it reads
"generateOptimizerNetwork.py not found — skipping optimizer codegen"; update the
call near the Path(opt_dir).exists() check around the opt_dir/opt_script logic
in execution.py (the log.warning that references generateOptimizerNetwork.py) to
remove the extraneous f prefix.
- Around line 118-121: The second warning uses an f-string with no placeholders;
in the opt_script.exists() branch update the log.warning call that currently
uses f"testMVPOptimizer.py not found — skipping optimizer codegen" to a plain
string literal (remove the leading 'f') so the message is not unnecessarily
treated as an f-string when checking opt_script.exists() and logging the
warning.
---
Outside diff comments:
In `@Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py`:
- Around line 40-82: The template file's formatting doesn't match yapf rules:
run the formatter (e.g., yapf -i
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py) to fix import wrapping and
the long NodeTemplate/_PULPSGDTemplate call wrapping around referenceTemplate
and the multi-line string; ensure the _PULPSGDTemplate(...) invocation and the
assignment to referenceTemplate are wrapped/indented per yapf so the
triple-quoted template string lines align with the project's style.
---
Nitpick comments:
In `@Deeploy/DeeployTypes.py`:
- Around line 339-349: The code uses the builtin name 'next' as a local variable
which shadows Python's builtin; in the liveness traversal loop (where variables
live, queue, visited, buffNext and ctxt.lookup/VariableBuffer are used) rename
the local variable 'next' to a non-conflicting identifier such as 'next_name'
(and update its uses in queue.pop(), ctxt.lookup(next_name),
visited.add(next_name), and queue |= buffNext.aliases - visited) to avoid the
Ruff A001 builtin-shadowing violation.
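The renamed traversal can be sketched as a standalone breadth-first walk over alias sets; the dict-based buffer model here is an assumption, only the `next` → `next_name` rename mirrors the review.

```python
def collect_aliases(buffers: dict, start: str) -> set:
    """Walk the alias graph from `start`, visiting each buffer once."""
    visited = {start}
    queue = set(buffers[start]["aliases"])
    while queue:
        next_name = queue.pop()          # was `next`, which shadowed the builtin
        buff_next = buffers[next_name]
        visited.add(next_name)
        queue |= buff_next["aliases"] - visited
    return visited
```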
In `@Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py`:
- Around line 31-54: The printf in the SoftmaxCrossEntropyLoss dual-output
template (in referenceDualOutputTemplate) prints loss each run and should be
guarded or removed; wrap the printf(" [SCE] loss=...") call with a
compile-time or runtime guard (e.g., `#ifdef` DEBUG_LOSS / `#endif` or check a debug
flag) or replace it with a pluggable logger that can be disabled so production
builds do not emit or pay the cost of this output, ensuring you only change the
printf line (and related surrounding template text) inside
referenceDualOutputTemplate.
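A sketch of the compile-time guard, written as the kind of Python template string the file uses; `DEEPLOY_DEBUG_LOSS` is a hypothetical macro name and `${data_out}` stands in for the template's real placeholder.

```python
# Emitted C code only prints the loss when the build defines the debug macro,
# so production builds pay no printf cost.
loss_debug_snippet = """
#ifdef DEEPLOY_DEBUG_LOSS
    printf(" [SCE] loss=%f\\n", *${data_out});
#endif
"""
```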
In `@DeeployTest/deeployTrainingRunner_tiled_siracusa.py`:
- Around line 1-11: Update the script shebang to explicitly use Python 3 by
changing the first line from "#!/usr/bin/env python" to "#!/usr/bin/env python3"
so that running deeployTrainingRunner_tiled_siracusa.py invokes Python 3
consistently; no other code changes needed—this affects the interpreter used
when executing the module that calls main(tiling_enabled = True) from
testUtils.deeployTrainingRunner.
In `@DeeployTest/generateOptimizerNetwork.py`:
- Line 28: Remove the unused `numpy` import in generateOptimizerNetwork.py:
delete the line importing `numpy as np` since no code in this module references
`np` (functions like build_shared_buffer_maps and other imports should manage
their own numpy usage); ensure no other references to `np` remain in the file
after removal.
In `@DeeployTest/testMVPTraining.py`:
- Around line 34-96: The four helper functions (_infer_num_data_inputs,
_infer_total_mb, _infer_data_size, _infer_n_accum) are duplicated; extract them
into a shared utility module (e.g., testUtils/trainingUtils.py) and replace the
local copies with imports. Move the implementations into the new module, export
the four functions (and _load_reference_losses if desired), update imports in
DeeployTest/testMVPTraining.py (use from testUtils.trainingUtils import
_infer_num_data_inputs, _infer_total_mb, _infer_data_size, _infer_n_accum),
remove the duplicate definitions from testMVPTraining.py, run unit tests and
update any relative import paths or packaging so both generateTrainingNetwork.py
and testMVPTraining.py import the single shared implementation.
In `@DeeployTest/testUtils/codeGenerate.py`:
- Around line 656-658: The loop that calls os.system to run clang-format with a
shell is brittle and flagged for shell injection; replace the os.system usage in
the block where clang_format is defined and the for fname in [...] loop uses
dumpdir by invoking subprocess.run with an argument list (e.g.,
['clang-format','-i', f'--style={clang_format}', f'{dumpdir}/{fname}']), pass
check=True to raise on failure, and ensure subprocess is imported; this removes
shell parsing, keeps the same flags/behavior, and uses shell=False for safety.
- Around line 1160-1162: Replace the os.system calls used to run clang-format
with subprocess.run for robustness: in the block where clang_format is defined
and the loop over ['OptimizerNetwork.c', 'OptimizerNetwork.h'] executes, call
subprocess.run with a list of arguments (e.g.
['clang-format','-i','--style='+clang_format,f'{dumpdir}/{fname}']), set
check=True to raise on non-zero exit, and avoid shell=True; optionally capture
stdout/stderr or let it inherit for visibility. Ensure you import subprocess at
the top and handle CalledProcessError (or let it propagate) so failures are not
silently ignored.
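The `subprocess.run` replacement for both clang-format call sites can be sketched as below; the flag layout follows the review text, but the helper names and the default style value are illustrative.

```python
import subprocess

def clang_format_cmd(dumpdir: str, fname: str, clang_format: str = "llvm"):
    """Build the argument list; no shell parsing, so no injection risk."""
    return ['clang-format', '-i', f'--style={clang_format}', f'{dumpdir}/{fname}']

def run_clang_format(dumpdir: str, fnames, clang_format: str = "llvm") -> None:
    for fname in fnames:
        # shell=False is the default; check=True raises CalledProcessError
        # on a non-zero exit instead of silently ignoring failure.
        subprocess.run(clang_format_cmd(dumpdir, fname, clang_format), check=True)
```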
- Around line 308-311: The function signature for
generateTrainingTestInputsHeader uses implicit Optional defaults (init_weights:
List[np.ndarray] = None and data_size: int = None); import typing.Optional and
change those annotations to explicit Optionals (e.g., init_weights:
Optional[List[np.ndarray]] and data_size: Optional[int]) so the types follow PEP
484; update the function signature accordingly and add the import if missing.
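A sketch of the explicit-Optional signature; the body here is a stub purely to make the example runnable, not the project's real implementation.

```python
from typing import List, Optional
import numpy as np

def generateTrainingTestInputsHeader(init_weights: Optional[List[np.ndarray]] = None,
                                     data_size: Optional[int] = None) -> str:
    # PEP 484: a default of None requires an explicit Optional[...] annotation.
    weights = init_weights if init_weights is not None else []
    size = data_size if data_size is not None else 0
    return f"// {len(weights)} weight arrays, data_size={size}\n"
```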
In `@DeeployTest/testUtils/deeployTrainingRunner.py`:
- Around line 18-21: The module currently mutates os.environ['PATH'] at import
time which has global side effects; move that assignment into the entrypoint
function main() so PATH is only changed when the script is executed (not on
import). Locate the top-level os.environ['PATH'] modification and remove it from
module scope, then add the same assignment at the start of the
main(tiling_enabled: bool = False, default_platform: str = 'Siracusa',
default_simulator: str = 'gvsoc') function so the change is scoped to execution;
keep the exact string '/usr/bin:' + os.environ.get('PATH', '') and the
explanatory comment near the new location.
- Around line 97-100: Replace the list concatenation when extending gen_args
with iterable unpacking: instead of gen_args.extend(['--input-type-map'] +
list(args.input_type_map)) and the similar input_offset_map line, use
gen_args.extend(['--input-type-map', *args.input_type_map]) and
gen_args.extend(['--input-offset-map', *args.input_offset_map]) (apply to the
gen_args variable and the args.input_type_map / args.input_offset_map uses) so
the code uses iterable unpacking rather than building a new list.
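A tiny before/after sketch of the unpacking change; the flag name comes from the review, the map values are illustrative.

```python
input_type_map = ['input_0=float32', 'input_1=uint8']

gen_args = []
gen_args.extend(['--input-type-map'] + list(input_type_map))   # before: builds a temp list

gen_args2 = []
gen_args2.extend(['--input-type-map', *input_type_map])        # after: unpacks in place
```

Both forms produce the same list; the unpacked form just skips the intermediate concatenation.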
- Around line 103-116: The gen_args building is inconsistent: args.l2 is only
appended when not equal to its default (1024000) but args.l1 is always appended,
causing the default 64000 to be emitted; update the branch for args.l1 in
DeeployTest/testUtils/deeployTrainingRunner.py so it mirrors args.l2 by only
appending f'--l1={args.l1}' when getattr(args, 'l1', None) is truthy and args.l1
!= 64000 (the default defined in DeeployRunnerArgumentParser), keeping the
existing behavior for args.l2 and leaving other flags (defaultMemLevel,
memAllocStrategy, searchStrategy, profileTiling, plotMemAlloc) unchanged.
In `@DeeployTest/testUtils/tilingUtils.py`:
- Around line 63-66: The loop over tensorLifetimeMap uses an unused variable
named lifetime which lints as unused; change the loop signature in the for loop
that currently reads "for tensorName, lifetime in tensorLifetimeMap.items()" to
use a throwaway name (e.g., "_" or "_lifetime") so only tensorName is
considered, leaving the body (buffer = ctxt.lookup(tensorName); if
buffer.is_input: tensorLifetimeMap[tensorName] = (0, maxStepIdx)) unchanged;
this targets the for loop that iterates tensorLifetimeMap.items() and references
ctxt.lookup and buffer.is_input.

ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3d4be450-9169-4e18-beb8-8b248adb2fb6
📒 Files selected for processing (30)
- Deeploy/DeeployTypes.py
- Deeploy/Targets/Generic/Layers.py
- Deeploy/Targets/Generic/Parsers.py
- Deeploy/Targets/Generic/TypeCheckers.py
- Deeploy/Targets/PULPOpen/Bindings.py
- Deeploy/Targets/PULPOpen/Platform.py
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
- Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py
- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
- Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py
- Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py
- Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py
- Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py
- Deeploy/Targets/PULPOpen/Tiler.py
- Deeploy/TilingExtension/TilerExtension.py
- DeeployTest/CMakeLists.txt
- DeeployTest/Platforms/Siracusa/CMakeLists.txt
- DeeployTest/Platforms/Siracusa/src/deeploytraintest.c
- DeeployTest/deeployTrainingRunner.py
- DeeployTest/deeployTrainingRunner_siracusa.py
- DeeployTest/deeployTrainingRunner_tiled_siracusa.py
- DeeployTest/generateOptimizerNetwork.py
- DeeployTest/generateTrainingNetwork.py
- DeeployTest/testMVPOptimizer.py
- DeeployTest/testMVPTraining.py
- DeeployTest/testUtils/codeGenerate.py
- DeeployTest/testUtils/core/config.py
- DeeployTest/testUtils/core/execution.py
- DeeployTest/testUtils/deeployTrainingRunner.py
- DeeployTest/testUtils/tilingUtils.py
```python
def parseNode(self, node: gs.Node) -> bool:
    # Require exactly 3 inputs (buffer, gradient, lazy_reset_grad) and 1 output
    return len(node.inputs) == 3 and len(node.outputs) == 1

def parseNodeCtxt(self,
                  ctxt: NetworkContext,
                  node: gs.Node,
                  channels_first: bool = True) -> Tuple[NetworkContext, bool]:

    buffer = ctxt.lookup(node.inputs[0].name)
    gradient = ctxt.lookup(node.inputs[1].name)
    lazy_reset_grad = ctxt.lookup(node.inputs[2].name)
    data_out = ctxt.lookup(node.outputs[0].name)

    self.operatorRepresentation['accum_buffer'] = buffer.name
    self.operatorRepresentation['gradient'] = gradient.name
    self.operatorRepresentation['lazy_reset_grad'] = lazy_reset_grad.name
    self.operatorRepresentation['data_out'] = data_out.name
    self.operatorRepresentation['size'] = int(np.prod(buffer.shape))

    return ctxt, True
```
Reject shape-mismatched InPlaceAccumulatorV2 nodes.
This parser only checks arity, but the generated kernel/tile path assumes buffer, gradient, and data_out have the same extent and that lazy_reset_grad is a single flag. If a malformed graph slips through, size is derived from buffer.shape and later code can read/write past the smaller tensor.
Proposed validation
```diff
 def parseNodeCtxt(self,
                   ctxt: NetworkContext,
                   node: gs.Node,
                   channels_first: bool = True) -> Tuple[NetworkContext, bool]:
     buffer = ctxt.lookup(node.inputs[0].name)
     gradient = ctxt.lookup(node.inputs[1].name)
     lazy_reset_grad = ctxt.lookup(node.inputs[2].name)
     data_out = ctxt.lookup(node.outputs[0].name)
+
+    if tuple(buffer.shape) != tuple(gradient.shape) or tuple(buffer.shape) != tuple(data_out.shape):
+        return ctxt, False
+    if int(np.prod(lazy_reset_grad.shape)) != 1:
+        return ctxt, False

     self.operatorRepresentation['accum_buffer'] = buffer.name
     self.operatorRepresentation['gradient'] = gradient.name
     self.operatorRepresentation['lazy_reset_grad'] = lazy_reset_grad.name
     self.operatorRepresentation['data_out'] = data_out.name
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/Generic/Parsers.py` around lines 2728 - 2748, The parser only
checks arity in parseNode/parseNodeCtxt but must validate tensor extents: in
parseNodeCtxt (using ctxt.lookup results for buffer, gradient, lazy_reset_grad,
data_out) verify that buffer.shape, gradient.shape and data_out.shape are
identical (same number of elements and per-dimension sizes) and that
lazy_reset_grad is a scalar/has size 1; if any check fails, return ctxt, False
(or otherwise reject the node) instead of proceeding; compute
operatorRepresentation['size'] from the validated common shape (e.g.,
prod(buffer.shape) after checks) and do not accept the node when shapes mismatch
to prevent out-of-bounds reads/writes.
Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py (outdated comment, resolved)
Actionable comments posted: 2
♻️ Duplicate comments (2)
Deeploy/Targets/Generic/TypeCheckers.py (1)
577-581: ⚠️ Potential issue | 🟠 Major

Keep the inherited output-type validation in this override.

This arity check is necessary, but returning only the count match still bypasses `SignPropTypeChecker.checkOutputType()`. That means dual-output SCE can bind as long as the output count matches, even if the declared output pointer types do not.

✅ Minimal fix

```diff
 def checkOutputType(self, inputs: List[VariableBuffer],
                     operatorRepresentation: OperatorRepresentation) -> bool:
     # The parser sets 'loss' to a non-empty string for 2-output nodes, '' for 1-output.
     # Use this to determine the actual output count and match it against this binding.
     actual_num_outputs = 2 if operatorRepresentation.get('loss', '') != '' else 1
-    return actual_num_outputs == len(self.output_types)
+    return actual_num_outputs == len(self.output_types) and \
+        super().checkOutputType(inputs, operatorRepresentation)
```

DeeployTest/testMVPTraining.py (1)

160-174: ⚠️ Potential issue | 🟡 Minor

Register zero-length inputs instead of skipping them.

For zero-sized non-bool/non-float tensors, lines 173-174 leave `inputTypes`/`inputOffsets` unset for that graph input. The rest of the pipeline treats those maps as complete, so the first empty tensor here will fail later during deployer setup or code generation. If empty inputs are valid, assign a safe default entry instead of `pass`; while touching this block, rename `name` to `_name` as well.

Proposed fix

```diff
-    for graph_idx, name in enumerate(graph_input_names):
+    for graph_idx, _name in enumerate(graph_input_names):
         if graph_idx in grad_acc_set:
             inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t)
             inputOffsets[f"input_{graph_idx}"] = 0
         else:
             arr = npz_base[npz_idx]
             npz_idx += 1
             if arr.dtype == bool or arr.dtype == np.bool_:
                 inputTypes[f"input_{graph_idx}"] = PointerClass(uint8_t)
                 inputOffsets[f"input_{graph_idx}"] = 0
             elif arr.dtype in (np.float32, np.float64):
                 inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t)
                 inputOffsets[f"input_{graph_idx}"] = 0
             elif np.prod(arr.shape) == 0:
-                pass
+                inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t)
+                inputOffsets[f"input_{graph_idx}"] = 0
             else:
```
🧹 Nitpick comments (4)
DeeployTest/testMVPOptimizer.py (1)
110-118: Clarify or remove the contradictory comment on line 110.Line 110's comment states "optimizer is forward-only, no lifetime extension needed" but then immediately uses
TrainingSBTilerwhich explicitly extends input lifetimes. The follow-up comment (lines 114-117) correctly explains why lifetime extension IS needed. Consider removing or updating line 110 to avoid confusion.📝 Fix the misleading comment
- # 7. Wrap with SBTiler (single-buffering; optimizer is forward-only, no lifetime extension needed). + # 7. Wrap with TrainingSBTiler (single-buffering with extended input lifetimes). unique_params = f"{args.dumpdir}_L1{args.l1}_L2{args.l2}_{args.defaultMemLevel}_optimizer" testIdentifier = hashlib.md5(unique_params.encode()).hexdigest()[:16] # TrainingSBTiler extends all input buffer lifetimes to the end of the🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testMVPOptimizer.py` around lines 110 - 118, The top comment before creating unique_params and wrapping with TilerDeployerWrapper is contradictory—remove or update the phrase "optimizer is forward-only, no lifetime extension needed" and replace it with a clear note that TrainingSBTiler is intentionally used to extend input buffer lifetimes (via TrainingMemoryScheduler) to prevent allocator reuse of consumed inputs for later outputs; adjust the comment near the TilerDeployerWrapper/TrainingSBTiler mention so it accurately states that optimizer training requires lifetime extension to avoid weight buffer corruption.DeeployTest/testUtils/tilingUtils.py (1)
50-67: LGTM! Training memory scheduler implementation is correct.The
TrainingMemorySchedulercorrectly extends input tensor lifetimes to span the entire tiling schedule, ensuring forward-pass inputs remain allocated during the backward pass.One minor nit from static analysis: the loop variable
lifetimeon line 62 is unused.🧹 Rename unused loop variable
- for tensorName, lifetime in tensorLifetimeMap.items(): + for tensorName, _lifetime in tensorLifetimeMap.items(): buffer = ctxt.lookup(tensorName) if buffer.is_input: tensorLifetimeMap[tensorName] = (0, maxStepIdx)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/tilingUtils.py` around lines 50 - 67, The loop in TrainingMemoryScheduler._calculateLifetimes iterates "for tensorName, lifetime in tensorLifetimeMap.items()" but "lifetime" is unused; rename it to "_" or "_lifetime" to satisfy static analysis. Update the iteration over tensorLifetimeMap.items() (in the _calculateLifetimes method) to use the underscore name and keep the rest of the logic (looking up buffer via ctxt.lookup(tensorName) and updating tensorLifetimeMap) unchanged.DeeployTest/testUtils/deeployTrainingRunner.py (1)
16-19: Hardcoded/usr/binPATH prepend may cause issues.Unconditionally prepending
/usr/binto PATH at module import time could cause unexpected behavior in environments where a different Python or tool version is expected. This global side effect happens even if the module is just imported.Move PATH manipulation into main() or make it conditional
-# gapy (gvsoc launcher) uses `#!/usr/bin/env python3`. Put /usr/bin first so -# it resolves to /usr/bin/python3 which has all required packages (gapylib, -# prettytable, …) rather than the minimal venv python. -os.environ['PATH'] = '/usr/bin:' + os.environ.get('PATH', '') +def _setup_gapy_path(): + """Prepend /usr/bin to PATH so gapy resolves to system python3.""" + # gapy (gvsoc launcher) uses `#!/usr/bin/env python3`. + current_path = os.environ.get('PATH', '') + if not current_path.startswith('/usr/bin:'): + os.environ['PATH'] = '/usr/bin:' + current_pathThen call
_setup_gapy_path()at the start ofmain().🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testUtils/deeployTrainingRunner.py` around lines 16 - 19, The module-level PATH mutation should be moved into a small helper and invoked from main to avoid global side effects; create a function named _setup_gapy_path() that performs the os.environ['PATH'] = '/usr/bin:' + os.environ.get('PATH', '') change (optionally guarded by a check such as if '/usr/bin' not in os.environ.get('PATH','') or an env var like DEEPLY_ALLOW_PATH_PREPEND), remove the top-level PATH assignment, and call _setup_gapy_path() at the start of main() so the PATH modification occurs only when the script is executed, not on import.DeeployTest/testMVPTraining.py (1)
411-418: Preserve the original traceback here.
raise erewrites the traceback to this handler. A bareraisekeeps the original failure location, which is much more useful when codegen fails under CI or GVSoC.Proposed fix
if args.shouldFail: print("\033[92mTiled training network generation ended, failed as expected!\033[0m") sys.exit(0) else: - raise e + raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/testMVPTraining.py` around lines 411 - 418, The except block around generateTiledTrainingNetwork currently re-raises the exception with "raise e", which resets the traceback; change it to a bare "raise" so the original traceback is preserved (inside the same except handling that checks args.shouldFail and prints the success message when appropriate).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Deeploy/Targets/PULPOpen/Bindings.py`:
- Around line 361-365: The list comprehension in
PULPSoftmaxCrossEntropyLossDualOutputBindings (and the other ~11 binding
definitions) uses the comprehension variable named type which shadows the
built-in; rename that variable to label_type (or similar) in the comprehension
for PULPSoftmaxCrossEntropyLossDualOutputBindings and do the same consistently
for other bindings using IntegerDataTypes so the generator expression for
NodeBinding(SoftmaxCrossEntropyLossChecker(...),
SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, ForkTransformer)
becomes a comprehension using label_type instead of type; update all occurrences
referencing the comprehension variable in this file (e.g., any usages inside
SoftmaxCrossEntropyLossChecker or other binding list comprehensions) to match
the new name.
In `@DeeployTest/testMVPTraining.py`:
- Around line 233-242: The auto-inference code currently uses floor division so
any remainder mini-batches get dropped when total_mb is later recomputed; change
the inference to preserve all samples by using ceiling division (e.g., compute
n_steps = ceil(total_mb / n_accum) or n_accum = ceil(total_mb / n_steps) via
math.ceil) or else explicitly handle the leftover partial batch instead of
truncating; update the branches around _infer_total_mb, _infer_n_accum, and the
recompute of total_mb so the final total_mb covers all original samples (or emit
a clear warning if you choose to pad/handle a partial final step).
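The floor-versus-ceiling difference can be sketched in two lines; `total_mb` and `n_accum` follow the review text, the function names are illustrative.

```python
import math

def steps_floor(total_mb: int, n_accum: int) -> int:
    return total_mb // n_accum            # drops a trailing partial step

def steps_ceil(total_mb: int, n_accum: int) -> int:
    return math.ceil(total_mb / n_accum)  # covers every mini-batch
```

With 10 mini-batches and 4 accumulation steps, floor division yields 2 steps (8 batches) and silently discards 2 samples; ceiling division yields 3 steps and covers all of them.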
---
Duplicate comments:
In `@Deeploy/Targets/Generic/TypeCheckers.py`:
- Around line 577-581: The override of checkOutputType currently only verifies
arity and skips the inherited type-pointer validation; update
Generic.TypeCheckers.checkOutputType to first compute actual_num_outputs (as it
does using operatorRepresentation.get('loss', '')), return False if
actual_num_outputs != len(self.output_types), and otherwise call and return the
superclass SignPropTypeChecker.checkOutputType(self, inputs,
operatorRepresentation) so the original pointer/type validation still runs;
reference the method names checkOutputType, SignPropTypeChecker.checkOutputType,
self.output_types, and operatorRepresentation.get('loss', '') to locate the code
to change.
In `@DeeployTest/testMVPTraining.py`:
- Around line 160-174: The loop over graph_input_names leaves
inputTypes/inputOffsets unset for zero-length non-bool/non-float arrays causing
downstream failures; inside the for loop (variables graph_idx, name -> rename to
_name) when you hit the branch elif np.prod(arr.shape) == 0, register a safe
default entry instead of pass (e.g., set inputTypes[f"input_{graph_idx}"] to a
PointerClass fallback such as float32_t and inputOffsets[f"input_{graph_idx}"] =
0) so the maps remain complete; ensure you still increment npz_idx and keep
handling of grad_acc_set, PointerClass, float32_t and uint8_t unchanged.
---
Nitpick comments:
In `@DeeployTest/testMVPOptimizer.py`:
- Around line 110-118: The top comment before creating unique_params and
wrapping with TilerDeployerWrapper is contradictory—remove or update the phrase
"optimizer is forward-only, no lifetime extension needed" and replace it with a
clear note that TrainingSBTiler is intentionally used to extend input buffer
lifetimes (via TrainingMemoryScheduler) to prevent allocator reuse of consumed
inputs for later outputs; adjust the comment near the
TilerDeployerWrapper/TrainingSBTiler mention so it accurately states that
optimizer training requires lifetime extension to avoid weight buffer
corruption.
In `@DeeployTest/testMVPTraining.py`:
- Around line 411-418: The except block around generateTiledTrainingNetwork
currently re-raises the exception with "raise e", which resets the traceback;
change it to a bare "raise" so the original traceback is preserved (inside the
same except handling that checks args.shouldFail and prints the success message
when appropriate).
In `@DeeployTest/testUtils/deeployTrainingRunner.py`:
- Around line 16-19: The module-level PATH mutation should be moved into a small
helper and invoked from main to avoid global side effects; create a function
named _setup_gapy_path() that performs the os.environ['PATH'] = '/usr/bin:' +
os.environ.get('PATH', '') change (optionally guarded by a check such as if
'/usr/bin' not in os.environ.get('PATH','') or an env var like
DEEPLY_ALLOW_PATH_PREPEND), remove the top-level PATH assignment, and call
_setup_gapy_path() at the start of main() so the PATH modification occurs only
when the script is executed, not on import.
In `@DeeployTest/testUtils/tilingUtils.py`:
- Around line 50-67: The loop in TrainingMemoryScheduler._calculateLifetimes
iterates "for tensorName, lifetime in tensorLifetimeMap.items()" but "lifetime"
is unused; rename it to "_" or "_lifetime" to satisfy static analysis. Update
the iteration over tensorLifetimeMap.items() (in the _calculateLifetimes method)
to use the underscore name and keep the rest of the logic (looking up buffer via
ctxt.lookup(tensorName) and updating tensorLifetimeMap) unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 87d73b5e-2e69-4f9f-ad5d-4cccd6348c23
📒 Files selected for processing (19)
- Deeploy/Targets/Generic/TypeCheckers.py
- Deeploy/Targets/PULPOpen/Bindings.py
- Deeploy/Targets/PULPOpen/Platform.py
- Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py
- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
- Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py
- Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py
- Deeploy/Targets/PULPOpen/Tiler.py
- Deeploy/TilingExtension/TilerExtension.py
- DeeployTest/Platforms/Siracusa/src/deeploytraintest.c
- DeeployTest/deeployTrainingRunner.py
- DeeployTest/generateOptimizerNetwork.py
- DeeployTest/generateTrainingNetwork.py
- DeeployTest/testMVPOptimizer.py
- DeeployTest/testMVPTraining.py
- DeeployTest/testUtils/codeGenerate.py
- DeeployTest/testUtils/core/execution.py
- DeeployTest/testUtils/deeployTrainingRunner.py
- DeeployTest/testUtils/tilingUtils.py
✅ Files skipped from review due to trivial changes (1)
- DeeployTest/deeployTrainingRunner.py
🚧 Files skipped from review as they are similar to previous changes (7)
- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
- Deeploy/TilingExtension/TilerExtension.py
- Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py
- Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py
- Deeploy/Targets/PULPOpen/Tiler.py
- Deeploy/Targets/PULPOpen/Platform.py
- Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py
Deeploy/Targets/PULPOpen/Bindings.py (outdated)
```python
PULPSoftmaxCrossEntropyLossDualOutputBindings = [
    NodeBinding(
        SoftmaxCrossEntropyLossChecker([PointerClass(float32_t), PointerClass(type)],
                                       [PointerClass(float32_t), PointerClass(float32_t)]),
        SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, ForkTransformer) for type in IntegerDataTypes
```
🧩 Analysis chain
🏁 Script executed:

```shell
cat -n Deeploy/Targets/PULPOpen/Bindings.py | sed -n '355,375p'
```

Repository: pulp-platform/Deeploy
Length of output: 1271
🏁 Script executed:

```shell
rg "for type in IntegerDataTypes" Deeploy/Targets/PULPOpen/Bindings.py
```

Repository: pulp-platform/Deeploy
Length of output: 1155
Rename the comprehension variable to avoid Ruff A001 throughout the file.
The variable type in the list comprehension shadows Python's built-in type, which Ruff correctly flags. This pattern appears in multiple binding definitions in this file (at least 11 instances). Rename type to label_type (or similar) consistently across all occurrences.
At lines 361-365 specifically:
♻️ Fix for PULPSoftmaxCrossEntropyLossDualOutputBindings
```diff
 PULPSoftmaxCrossEntropyLossDualOutputBindings = [
     NodeBinding(
-        SoftmaxCrossEntropyLossChecker([PointerClass(float32_t), PointerClass(type)],
+        SoftmaxCrossEntropyLossChecker([PointerClass(float32_t), PointerClass(label_type)],
                                        [PointerClass(float32_t), PointerClass(float32_t)]),
-        SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, ForkTransformer) for type in IntegerDataTypes
+        SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, ForkTransformer)
+    for label_type in IntegerDataTypes
 ]
```

🧰 Tools
🪛 Ruff (0.15.9)
[error] 365-365: Variable `type` is shadowing a Python builtin (A001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/PULPOpen/Bindings.py` around lines 361 - 365, The list
comprehension in PULPSoftmaxCrossEntropyLossDualOutputBindings (and the other
~11 binding definitions) uses the comprehension variable named type which
shadows the built-in; rename that variable to label_type (or similar) in the
comprehension for PULPSoftmaxCrossEntropyLossDualOutputBindings and do the same
consistently for other bindings using IntegerDataTypes so the generator
expression for NodeBinding(SoftmaxCrossEntropyLossChecker(...),
SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, ForkTransformer)
becomes a comprehension using label_type instead of type; update all occurrences
referencing the comprehension variable in this file (e.g., any usages inside
SoftmaxCrossEntropyLossChecker or other binding list comprehensions) to match
the new name.
…puts

Two fixes for CI regressions on PR pulp-platform#182:

1. SoftmaxCrossEntropyLoss: the canonical form is now 2 outputs (loss + log_prob). The previous design carried both a legacy 1-output form and a new 2-output 'dual output' form side by side, which tripped Deeploy's root-layer backtracking: Layer.parse() commits to the first mapper whose parser succeeds, and if that mapper's bindings all later fail typeCheck, the root-layer backtracker raises instead of trying the next mapper. That broke the existing upstream Tests/Kernels/FP32/Softmax/CrossEntropy.

   Fix: collapse everything to a single 2-output path. SoftmaxCrossEntropyLossParser now requires exactly 2 outputs, the SCE template is the former 'referenceDualOutputTemplate' (loss + log_prob), the PULPOpen binding uses 2-output pointer types, and the tile constraint contains the loss-patching logic that previously lived in SoftmaxCrossEntropyLossDualOutputTileConstraint (now deleted). SoftmaxCrossEntropyGradTileConstraint overrides dataLossName to '' so it falls straight through to the base-class single-output wrapper. The upstream Tests/Kernels/FP32/Softmax/CrossEntropy ONNX is regenerated with the canonical 2-output signature and its outputs.npz is updated with the computed scalar loss.

2. TilingExtension.TilerExtension: drop the 4-byte MiniMalloc size alignment that was added in the training branch to work around a Siracusa L3 DMA corner case. That change makes Snitch's tight-fit 5 kB Kernels/Integer/Add/Large test overflow L1 because the rounded-up per-buffer sizes no longer fit. For the MLP core the alignment is not needed. The alias-skip logic (skip zero-sized in-place alias outputs from the MiniMalloc CSV and resolve their addrSpace from the alias target after solving) is kept.
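For reference, the 2-output (loss + log_prob) semantics adopted here can be sketched in NumPy. This is a hedged illustration of the ONNX SoftmaxCrossEntropyLoss contract with mean reduction, not the Deeploy kernel:

```python
import numpy as np

def softmax_cross_entropy(logits: np.ndarray, labels: np.ndarray):
    """Sketch of the 2-output form: returns (scalar mean loss, log_prob).

    logits: (N, C) float scores; labels: (N,) integer class ids.
    """
    # Numerically stable log-softmax per row.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Mean negative log-likelihood of the true class.
    loss = -log_prob[np.arange(len(labels)), labels].mean()
    return loss, log_prob

loss, log_prob = softmax_cross_entropy(
    np.array([[2.0, 0.5], [0.1, 1.2]]), np.array([0, 1]))
assert log_prob.shape == (2, 2)
assert loss > 0.0
```

The gradient kernel only needs log_prob; the scalar loss output exists so the harness can compare against reference losses per step.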
Verified locally:
- Tests/Kernels/FP32/Softmax/CrossEntropy (upstream) PASSED
- Tests/Kernels/Integer/Add/Large (snitch tiled, L1=5 kB) PASSED
- deeployTrainingRunner_siracusa.py -t simplemlp_train PASSED
- deeployTrainingRunner_tiled_siracusa.py -t simplemlp_train PASSED
…gle binding

Drop the separate tiledReferenceTemplate / PULPInPlaceAccumulatorV2TiledBindings path. The non-tiled template's data_out write is redundant: InPlaceAccumulatorV2 is terminal in the training graph, no downstream kernel consumes data_out, and the alias registered in alignToContext already makes the graph output pointer resolve to accum_buffer's L2 slot. Emitting the write is not only unnecessary — in the tiled path it also triggers an L2 egress DMA whose destination may overlap with live buffers and corrupt L2, which is why the tiled variant existed in the first place. One template that only writes accum_buffer is correct for both modes.

Tiler.py now pulls the (single) PULPInPlaceAccumulatorV2Bindings through PULPInPlaceAccumulatorV2TilingReadyBindings.

Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled (deeployTrainingRunner_siracusa.py) and tiled (deeployTrainingRunner_tiled_siracusa.py --l1 64000 --l2 2000000) runs.
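The accumulator semantics relied on above can be sketched as follows (illustrative Python, assuming the operator simply adds the incoming gradient into a persistent buffer; names are hypothetical, not the kernel's):

```python
import numpy as np

def in_place_accumulate(accum_buffer: np.ndarray, data_in: np.ndarray) -> None:
    """Sketch of InPlaceAccumulatorV2: accumulate into accum_buffer in place.

    No separate data_out is written; the graph output resolves through an
    alias onto accum_buffer, as the commit message above argues.
    """
    accum_buffer += data_in

# e.g. n_accum = 4 micro-batch gradients summed before the SGD step
grads = np.ones(4, dtype=np.float32)
accum = np.zeros(4, dtype=np.float32)
for _ in range(4):
    in_place_accumulate(accum, grads)
assert np.allclose(accum, 4.0)
```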
The weight/weight_updated alias + shared allocTemplate mechanism works for any memory level the weight lives in, not only L2. Update the class docstring and inline comment to say "L2 or L3" instead of hard-coding L2. No code change.
Actionable comments posted: 2
📒 Files selected for processing (1)
Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py
```python
weight = ctxt.lookup(operatorRepresentation['weight'])
weight_updated = ctxt.lookup(operatorRepresentation['weight_updated'])

weight.aliases.add(weight_updated.name)
weight_updated.aliases.add(weight.name)
weight_updated._alias = weight.name
```
Add a scope-consistency guard before aliasing buffers.
At Line 31, _alias is set unconditionally. If weight and weight_updated are in different scopes (global/local), this can produce inconsistent memory-level expectations during tiling/scheduling. Validate scope parity (or explicitly co-locate) before aliasing.
Suggested fix:

```diff
 weight = ctxt.lookup(operatorRepresentation['weight'])
 weight_updated = ctxt.lookup(operatorRepresentation['weight_updated'])
+
+weight_is_global = ctxt.is_global(weight.name)
+updated_is_global = ctxt.is_global(weight_updated.name)
+if weight_is_global != updated_is_global:
+    raise RuntimeError(
+        f"SGD alias requires same scope: weight={weight.name} "
+        f"({'global' if weight_is_global else 'local'}), "
+        f"weight_updated={weight_updated.name} "
+        f"({'global' if updated_is_global else 'local'})")
```
```python
weight_updated.allocTemplate = NodeTemplate(" ${name} = (${type.typeName}) " + str(weight._instance) + ";")
weight_updated.deallocTemplate = NodeTemplate("")
```
Avoid freezing weight._instance into the alloc template at align-time.
Line 36 snapshots str(weight._instance) early. Since allocation/scheduling can still adapt buffer instances later, this risks stale alias assignment code for weight_updated. Prefer referencing the aliased symbol (weight.name) rather than the current _instance value.
Suggested fix:

```diff
-weight_updated.allocTemplate = NodeTemplate(" ${name} = (${type.typeName}) " + str(weight._instance) + ";")
+weight_updated.allocTemplate = NodeTemplate(
+    f" ${{name}} = (${{type.typeName}}) {weight.name};")
 weight_updated.deallocTemplate = NodeTemplate("")
```
…ments Drop restating-the-code comments and inline verbose "trick" narration; keep only the class docstring and the one-line note about egressing through data_out rather than accum_buffer. Also compact the load-schedule loops into list comprehensions. No behaviour change. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
ReluGradTileConstraint was an unused BOPTileConstraint subclass that got pulled in with the initial training-platform cherry-pick. Nothing in the tree imports or registers it, simplemlp_train does not contain any Relu or ReluGrad nodes, and Relu gradients are explicitly out of scope for this PR. Remove it. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
…tTileConstraint The canonicalisation commit 763b464 moved the loss-patching wrapTilingSolution logic into the base SoftmaxCrossEntropyTileConstraint (dataLossName = 'loss' plus loss-output extension) and the commit message declared the dual-output subclass "now deleted", but the git rm never actually happened — the file was still sitting on disk with no importers anywhere in the tree. Finish the deletion. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
Collapse the in-place alias collection and resolution in Tiler.minimalloc() into set/dict comprehensions, drop the verbose multi-line comment, and remove the JUNGVI: attribution from a block that wasn't authored by Victor. The target lookup is now O(1) via a name-keyed dict instead of an inner linear scan over memoryMap. No behaviour change. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
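The linear-scan-to-dict change described above follows a standard pattern; a generic sketch (illustrative names, not the Tiler code):

```python
# Hypothetical memory map entries; only `name` and `addrSpace` matter here.
memory_map = [{"name": f"buf{i}", "addrSpace": 0x1000 + 16 * i}
              for i in range(1000)]

# Inner linear scan: O(n) per lookup, O(n*m) overall for m alias resolutions.
def addr_linear(name):
    return next(e["addrSpace"] for e in memory_map if e["name"] == name)

# Name-keyed dict built once: O(n) to build, O(1) per lookup afterwards.
by_name = {e["name"]: e for e in memory_map}

def addr_dict(name):
    return by_name[name]["addrSpace"]

assert addr_linear("buf500") == addr_dict("buf500")
```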
…nceCode

Upstream PR pulp-platform#177 (13113de, "Fix/tiling stack scoping and tiling information corruption", Pu DENG) wraps each layer's emitted code in a C block:

```python
layerCode = reduce(lambda a, b: a + b, sections, "")
callStack += "{\n" + layerCode + "\n}\n"
```

so that per-layer call args become short-lived stack variables and RunNetwork's overall stack footprint goes down. cc1f68b silently reverted this hunk during a merge from devel — the training-platform branch was based on a pre-pulp-platform#177 snapshot and the conflict resolution went the wrong way. Restore the wrapping verbatim.

Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
…est fixtures Revert the regenerated 2-output network.onnx and outputs.npz under DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/ back to the upstream-devel versions. The training MLP path doesn't depend on these fixtures, and keeping the original files avoids touching unrelated kernel test data in this PR. Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
…nel test The new SoftmaxCrossEntropyLossParser requires exactly 2 outputs (loss + log_prob), so the upstream Tests/Kernels/FP32/Softmax/CrossEntropy fixture (legacy 1-output ONNX) is no longer parseable. Rather than regenerate it in this PR, just remove the test fixture directory and its entry in test_siracusa_config.py. CrossEntropyGrad is unaffected and stays. Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
…d code
The Siracusa training harness was computing loss_err_count, printing
"Errors: N out of M", and then unconditionally returning 0. Numerical
mismatches were therefore non-blocking — the simulator exit code stayed
green even when the training graph diverged from the reference. Make the
return value reflect the comparison result:
return loss_err_count == 0 ? 0 : 1;
Also drop dead code that was never reached or never used:
- training_cycles / optimizer_cycles locals (only commented-out printf)
- connect_optimizer_buffers() (body was just (void)0; never called)
- Several commented-out section header blocks and helper-call lines
No control-flow change apart from the return value.
Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at
every step) in both non-tiled and tiled runs.
Step A and Step C in run_optimizer_step had explicit comment headers but the cluster task call between them (which actually runs the optimizer kernel) had none, so the labelling jumped A → C. Add a one-line "Step B: run optimizer kernel on cluster" header for symmetry. Pure documentation, no logic change.
The _augment_path helper prepended GVSOC_INSTALL_DIR/bin, LLVM_INSTALL_DIR/bin
and (if a venv was active) the venv's bin to PATH before invoking cmake /
gvsoc. Two reasons to remove it from this PR:
1. On Siracusa it is a no-op — CMake's gvsoc_<target> custom command and
the LLVM toolchain calls go through ${GVSOC_INSTALL_DIR}/... and
${LLVM_INSTALL_DIR}/... absolute paths and never depend on PATH.
2. The venv kconfigtool.py shebang fixup is GAP9-specific. GAP9 is
explicitly out of scope for this PR (the commit message of cc1f68b
lists "GAP9 platform port (separate PR, depends on this)" under the
excluded scope), so this plumbing belongs in the GAP9 follow-up, not
here.
Test runners silently mutating PATH is also a documentation/debug
hazard — if a user has intentionally pinned a different gvsoc/llvm in
their PATH, this helper would silently override it. Removed both call
sites in cmake_configure() and run_simulation().
Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at
every step) in both non-tiled and tiled runs.
…tils.py

generateTrainingNetwork.py and testMVPTraining.py carried byte-identical copies of _load_reference_losses, _infer_num_data_inputs, _infer_total_mb, _infer_data_size, _infer_n_accum and the _GRAD_ACC constant. testMVPOptimizer.py and testMVPTraining.py each defined their own _mockScheduler, also byte-identical.

Move all of these to a new testUtils/trainingUtils.py module and import them from the three entry points so the next helper edit only has to happen in one place. No behaviour change; the three entry-point scripts still exist as before (the user asked to keep testMVP*.py and the two deeployTrainingRunner_siracusa.py / _tiled_siracusa.py stubs).

Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
…ranches in generate_network

The training path in execution.generate_network had two sibling branches (config.training and config.tiling / config.training) that were 90% identical:
- same Step 1 (run training codegen script with --n-steps / --n-accum / --num-data-inputs / -v / --debug / gen_args)
- same training_meta.json read-back
- same Step 2 optimizer loop with passthrough args and --defaultMemLevel default

The only real differences were the two script names (testMVPTraining.py vs generateTrainingNetwork.py and the corresponding optimizer pair), a 4-entry vs 8-entry passthrough list, and the "Tiled training" vs "Training" error-message prefix. Collapse into a single `if config.training:` branch that selects the three variants up front and reuses one body. The two inference branches (`elif config.tiling:` and `else:`) are left untouched.

Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
… trainingUtils.py
The four codegen entry points (generateTrainingNetwork.py, testMVPTraining.py,
generateOptimizerNetwork.py, testMVPOptimizer.py) each had a nearly identical
argparse block in their if __name__ == '__main__' tail, plus the same
try/except --shouldFail handshake. Each addition of a training-side arg
meant editing up to four scripts in lockstep.
Move the shared argparse groups and the handshake runner to
testUtils/trainingUtils.py:
- add_cores_arg(parser) (shared by all four)
- add_training_inference_args(parser) (--num-data-inputs / --n-steps /
--n-accum / --learning-rate / --tolerance;
shared by both training scripts)
- add_memory_level_args(parser) (--l1 / --l2 / --defaultMemLevel;
shared by the tiled training and
both optimizer scripts)
- add_tiling_solver_args(parser) (--memAllocStrategy / --searchStrategy /
--plotMemAlloc / --profileTiling;
shared by the two tiled variants)
- add_optimizer_training_dir_arg(parser) (shared by both optimizer scripts)
- add_should_fail_arg(parser) (shared by all four)
- run_with_shouldfail(fn, args, label) (try/except --shouldFail runner)
Each entry point's __main__ block shrinks to ~8 lines that compose the
groups it needs. CLI behaviour is unchanged — every argument keeps the
same name, default, dest and help text.
Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every
step) in both non-tiled and tiled runs.
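The parser-group pattern described above can be sketched generically. Helper and argument names here are hypothetical stand-ins for the trainingUtils ones; each builder mutates a parser in place so an entry point composes only the groups it needs:

```python
import argparse

# Hypothetical group builders in the style of add_training_inference_args.
def add_training_inference_args(parser):
    parser.add_argument("--n-steps", type=int, default=4)
    parser.add_argument("--learning-rate", type=float, default=0.01)

def add_memory_level_args(parser):
    parser.add_argument("--l1", type=int, default=64000)
    parser.add_argument("--l2", type=int, default=2000000)

# A tiled training entry point composes both groups; an optimizer-only
# entry point would call just add_memory_level_args.
parser = argparse.ArgumentParser()
add_training_inference_args(parser)
add_memory_level_args(parser)
args = parser.parse_args(["--n-steps", "8", "--l1", "128000"])
assert args.n_steps == 8 and args.l1 == 128000
assert args.learning_rate == 0.01  # untouched defaults survive composition
```

Because every builder keeps the same names, defaults and dests, CLI behaviour stays identical no matter which script composes it.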
… to trainingUtils The training branch of generate_network and the training-flag block in configure_cmake had four pieces of boilerplate worth extracting: - building the `[python, script, -d gen_dir, -t path, -p platform]` prefix twice (training + optimizer step) - the same `log.debug + subprocess.run + returncode check + raise` tail twice - the opt-side gen_args passthrough filter (4-line for-loop) - the six-line -DTRAINING=ON / -DN_TRAIN_STEPS=... / -DN_ACCUM_STEPS=... / -DTRAINING_NUM_DATA_INPUTS=... block in configure_cmake Move all four into testUtils/trainingUtils.py as build_codegen_cmd / run_codegen_subprocess / filter_passthrough_args / add_training_cmake_flags. They take primitive parameters only — no DeeployTestConfig dependency — so trainingUtils does not gain a back-edge into testUtils.core. The inference branches of generate_network (`elif config.tiling:` / `else:`) are left untouched. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
The last training-specific helper left in execution.py was _resolve_optimizer_dir, which derived the optimizer ONNX directory from the training test's name (replacing `_train` with `_optimizer`) with a config.optimizer_dir override. Move it to trainingUtils.py alongside the other test-harness helpers and change its signature to primitive parameters (test_dir: str, optimizer_dir: Optional[str]) so trainingUtils still has no back-edge to DeeployTestConfig. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
Minimize the footprint this PR leaves in testUtils/core/execution.py.
Previously the diff vs upstream devel was ~170 lines spread across
generate_network, configure_cmake, run_simulation, run_complete_test
and a variety of unrelated comment removals / inference-branch
restructuring. Most of that was incidental and had nothing to do with
training.
Reset execution.py to the upstream-devel state and add only the three
surgical hooks the training path actually needs:
1. One import line:
from ..trainingUtils import add_training_cmake_flags, run_training_codegen
2. A 3-line early-return block at the top of generate_network:
if config.training:
run_training_codegen(config, script_dir)
return
3. A one-line call inside configure_cmake to append the training
cmake defines:
add_training_cmake_flags(cmd, config.training, config.n_train_steps,
config.n_accum_steps, config.training_num_data_inputs)
Everything else — inference's generate_network body, run_simulation,
run_complete_test, all upstream comments — is byte-identical to upstream.
The full two-stage training codegen pipeline (training-script subprocess,
training_meta.json readback, optimizer-script subprocess with passthrough
filter and skip checks) now lives in a new run_training_codegen() in
testUtils/trainingUtils.py, which is the sole module where training
test-harness logic is concentrated.
Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every
step) in both non-tiled and tiled runs.
…ateTraining.py testUtils/codeGenerate.py had accumulated ~880 lines of training / optimizer codegen helpers piled on top of the upstream inference module, plus a dead "Initialize all output buffers to zero" block that had crept into generateTestNetworkImplementation (it declared local variables and looped over graph outputs without ever emitting anything into retStr). The net effect was that this PR's diff in codeGenerate.py was +883 lines and an interleaving that made the inference vs training boundary invisible to reviewers. Reset testUtils/codeGenerate.py to byte-identical upstream/devel and move every training-side helper to a new testUtils/codeGenerateTraining.py: - generateTrainingTestInputsHeader / generateTrainingTestOutputsHeader - generateTrainingNetworkHeader / generateTrainingNetworkImplementation - generateTrainingTestNetwork - build_shared_buffer_maps / _patch_shared_buffers / _patch_shared_arenas - _ensure_training_l1_capacity - generateOptimizerNetworkHeader / generateOptimizerNetworkImplementation - generateOptimizerTestNetwork codeGenerateTraining.py imports generateL3HexDump from codeGenerate (the only inference helper the training functions reuse), which is a clean forward dependency. No back-edge from codeGenerate to training. The four codegen entry points are updated to pull their training / optimizer helpers from testUtils.codeGenerateTraining instead of testUtils.codeGenerate. No other files change. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs.
…r.py
DeeployTest/deeployTrainingRunner.py was a third training-runner entry
point that peeked at --tiled / -p ahead of testUtils.deeployTrainingRunner.main()
and forwarded tiling_enabled accordingly. It duplicated what the two
explicit stubs already do by file name:
deeployTrainingRunner_siracusa.py → main(tiling_enabled=False)
deeployTrainingRunner_tiled_siracusa.py → main(tiling_enabled=True)
Nothing in the repo (py imports, CMake, CI workflows) references the
top-level file; only its own docstring and an offline AI_AGENT planning
doc mention it. Remove it so the training CLI surface is just the two
file-name-scoped stubs, consistent with the preference to keep the
non-tiled and tiled variants visually split.
Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at
every step) in both non-tiled and tiled runs.
…ngUtils
Apply a strict "only training-specific helpers get a dedicated function"
rule to testUtils/trainingUtils.py. The following wrappers were removed
because their content was not training-specific:
- add_cores_arg (--cores is generic codegen)
- add_memory_level_args (--l1 / --l2 / --defaultMemLevel — tiling generic)
- add_tiling_solver_args (--memAllocStrategy / --searchStrategy /
--plotMemAlloc / --profileTiling — tiling generic)
- add_should_fail_arg (--shouldFail is used by all codegen scripts)
- run_with_shouldfail (the try/except shouldFail handshake is used
by all codegen scripts)
- build_codegen_cmd (generic [python, script, -d, -t, -p] prefix)
- run_codegen_subprocess (generic log + subprocess.run + check)
- filter_passthrough_args (generic list comprehension)
The first five are now inlined in each of the four training codegen
entry points (generateTrainingNetwork.py, testMVPTraining.py,
generateOptimizerNetwork.py, testMVPOptimizer.py), matching the style of
the upstream inference scripts which define these args inline as well.
The last three (subprocess helpers) were only called from inside
run_training_codegen() itself, so they become inline code there.
What stays in trainingUtils.py are only helpers whose content is
genuinely training-specific: the inputs.npz/outputs.npz readers, the
Tiler _mockScheduler, the training-argument argparse builders
(add_training_inference_args, add_optimizer_training_dir_arg),
resolve_optimizer_dir, add_training_cmake_flags and run_training_codegen.
Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every
step) in both non-tiled and tiled runs.
…input defaults
Two fixes to the inputTypes / inputOffsets build loop in both
generateTrainingNetwork.py and testMVPTraining.py (the two files share
this pattern verbatim).
1. The `name` loop variable in `for graph_idx, name in enumerate(...)`
was dead — switched to `for graph_idx in range(len(...))` so the
intent is explicit and the lint warning goes away.
2. The zero-sized-input branch (`np.prod(arr.shape) == 0`) previously
`pass`ed without populating inputTypes[f"input_{idx}"] or
inputOffsets[f"input_{idx}"]. ONNX permits optional placeholder
inputs with shape like (0,) and the downstream deployer looks up
every input by key, so a zero-sized input would later raise a
confusing KeyError far from the cause. Populate the entries with
a trivial default (float32 pointer, offset 0) — the type does not
matter for codegen since the buffer has no elements, but the key
must exist.
Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every
step) in both non-tiled and tiled runs.
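The zero-sized-input default from point 2 can be sketched as follows (illustrative; the array list and the input_types / input_offsets map names are assumptions standing in for the entry-point code):

```python
import numpy as np

# Hypothetical stand-ins for the npz input arrays and the codegen maps.
arrays = [np.zeros((2, 3), dtype=np.float32), np.zeros((0,), dtype=np.float32)]
input_types, input_offsets = {}, {}

for graph_idx in range(len(arrays)):   # no dead `name` loop variable
    arr = arrays[graph_idx]
    key = f"input_{graph_idx}"
    if np.prod(arr.shape) == 0:
        # Optional placeholder input, e.g. shape (0,): the type is irrelevant
        # since the buffer has no elements, but the key must exist or a
        # downstream lookup raises a confusing KeyError far from the cause.
        input_types[key] = "float32"
        input_offsets[key] = 0
    else:
        input_types[key] = str(arr.dtype)
        input_offsets[key] = 0

assert "input_1" in input_types and input_offsets["input_1"] == 0
```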
CI pre-commit hook flagged: - yapf: collapse 3-line set comprehension in Tiler.aliasBlocks and rewrap two-line arg lists in trainingUtils.add_training_cmake_flags and run_training_codegen. - autoflake: drop unused `from typing import List` imports in testMVPTraining.py and testMVPOptimizer.py. Reformat only, no semantic change. MLP non-tiled + tiled regression still pass with diff=0.000000 on all four steps.
CCT_Train/CCT2_FT2 ONNX still uses the legacy 1-output SoftmaxCrossEntropyLoss / SoftmaxCrossEntropyLossGrad signature, which this PR deprecated in favour of the canonical 2-output (loss + log_prob) form in 763b464. The parser now rejects the old form, so CCT2_FT2 fails network generation with 'Did not find adequate mapping for graph ... SoftmaxCrossEntropyLossParser: Exhausted backtracking' at the L3-singlebuffer tiled check. Regenerating CCT_Train fixtures to the canonical 2-output form is out of scope for this PR; drop CCT2_FT2 from the L3 singlebuffer and L3 doublebuffer model lists. The other upstream coverage for CCT_2_32_32_128 (inference-side CCT) is retained.
Summary
Minimum viable MLP end-to-end training on Siracusa (untiled + tiled). Brings in only the framework + 4 operators needed to build a complete forward → backward → gradient-accumulation → SGD graph. Targets `devel`.
Deliberately excluded (future follow-up PRs): GAP9 port, all Conv/Pool/Norm gradients, MSELoss, GroupNorm, PULPTrainlib submodule, FP32 forward kernels in `TargetLibraries/`.
Training framework
Verification
```bash
# Untiled codegen + build
python DeeployTest/deeployTrainingRunner_siracusa.py \
    -t /app/Onnx4Deeploy/onnx/model/simplemlp_train \
    --optimizer-dir /app/Onnx4Deeploy/onnx/model/simplemlp_optimizer \
    --skipsim --n-accum 1
# -> ✓ Test simplemlp_train PASSED

# Tiled codegen + build
python DeeployTest/deeployTrainingRunner_tiled_siracusa.py \
    -t /app/Onnx4Deeploy/onnx/model/simplemlp_train \
    --optimizer-dir /app/Onnx4Deeploy/onnx/model/simplemlp_optimizer \
    --skipsim --n-accum 4
# -> ✓ Test simplemlp_train PASSED
```
Test plan