From cc1f68b786624fcfa3782657a6d58f2b9cef77da Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 10:42:55 +0000 Subject: [PATCH 01/28] training-platform core: minimal MLP end-to-end training on Siracusa Bring in the minimum set of changes from the TrainingPlatform branch needed to generate working C code for an end-to-end MLP training graph (forward + backward + gradient accumulation + SGD weight update) on the Siracusa platform, both untiled and tiled. Scope is deliberately narrow - only MLP. No Conv/Pool/Norm gradients, no PULPTrainlib submodule, no new C kernel sources. The four operators involved (Gemm, SoftmaxCrossEntropyLoss + Grad, SGD, InPlaceAccumulatorV2) are all implemented as inline template strings so nothing new lands in TargetLibraries/. Operators: - InPlaceAccumulatorV2: new ORT com.microsoft operator with full Parser/Layer/TypeChecker/Template/TileConstraint/Binding scaffolding. - SoftmaxCrossEntropyLoss: parser now accepts both 1-output (legacy, log_prob only) and 2-output (loss + log_prob) signatures; checker picks the correct binding via checkOutputType based on the 'loss' key. New PULPSoftmaxCrossEntropyLossDualOutputBindings and SoftmaxCrossEntropyLossDualOutputTileConstraint handle the dual output case. - SGD: SGDTemplate now aliases weight_updated onto weight so the tiled egress DMA writes updated weights back to the weight's L2 buffer. - Gemm: tiny FloatGemmTemplate import cleanup. Framework fixes (all operator-agnostic): - DeeployTypes.VariableBuffer.isLive: treat is_input / is_output buffers as live across the whole step range. - TilingExtension.TilerExtension: skip zero-sized in-place alias outputs from the MiniMalloc CSV and resolve their addrSpace from the alias target after the solver runs. - TilingExtension.TilingVariableReplacement: change per-tile reference update from '*ref = array[i];' to 'ref = &array[i];' to avoid permanent mutation of static PI_L1 arrays between training steps. 
Training test framework: - testMVPTraining.py, testMVPOptimizer.py, generateTrainingNetwork.py, generateOptimizerNetwork.py: new codegen entry points for the two-graph (TrainingNetwork + OptimizerNetwork) architecture. - deeployTrainingRunner.py, deeployTrainingRunner_siracusa.py, deeployTrainingRunner_tiled_siracusa.py and their testUtils helper: training test runners mirroring the inference deeployRunner pattern. - testUtils/codeGenerate.py: training-specific codegen helpers (generateTrainingTestNetwork, generateOptimizerTestNetwork, build_shared_buffer_maps, _patch_shared_buffers / _shared_arenas, _ensure_training_l1_capacity) and L3 hex-dump path. - testUtils/core/execution.py: L3 flash-image detection and load. - testUtils/tilingUtils.py: TrainingSBTiler. - CMakeLists.txt (top + Siracusa): TRAINING-gated targets that build TrainingNetwork.c and OptimizerNetwork.c side by side. - DeeployTest/Platforms/Siracusa/src/deeploytraintest.c: training test harness that drives the two networks across multiple training steps with gradient accumulation. Scope deliberately excluded (future follow-up PRs): - GAP9 platform port (separate PR, depends on this). - All Conv/Pool/Norm/Relu gradient operators. - MSELoss/MSELossGrad, GroupNormalization, dual-output MaxPool. - PULPTrainlib submodule (only needed for ConvGrad kernels). - FP32 forward AvgPool/BatchNorm/MaxPool/Layernorm/Relu kernels. Verified locally: testMVPTraining.py -t simplemlp_train and testMVPOptimizer.py -t simplemlp_optimizer both succeed on Siracusa with --n-accum 1, producing TrainingNetwork.c (2027 lines) and OptimizerNetwork.c containing Gemm / SoftmaxCrossEntropyLoss / SoftmaxCrossEntropyLossGrad / InPlaceAccumulatorV2 / SGD. 
--- Deeploy/DeeployTypes.py | 7 +- Deeploy/Targets/Generic/Layers.py | 16 + Deeploy/Targets/Generic/Parsers.py | 55 +- Deeploy/Targets/Generic/TypeCheckers.py | 37 +- Deeploy/Targets/PULPOpen/Bindings.py | 38 +- Deeploy/Targets/PULPOpen/Platform.py | 46 +- .../PULPOpen/Templates/FloatGemmTemplate.py | 5 +- .../FloatInPlaceAccumulatorV2Template.py | 89 ++ .../Targets/PULPOpen/Templates/SGDTemplate.py | 39 +- .../SoftmaxCrossEntropyLossTemplate.py | 25 + .../InPlaceAccumulatorV2TileConstraint.py | 102 ++ .../TileConstraints/SGDTileConstraint.py | 6 + ...rossEntropyLossDualOutputTileConstraint.py | 74 ++ Deeploy/Targets/PULPOpen/Tiler.py | 18 +- Deeploy/TilingExtension/TilerExtension.py | 41 +- DeeployTest/CMakeLists.txt | 43 +- DeeployTest/Platforms/Siracusa/CMakeLists.txt | 43 +- .../Platforms/Siracusa/src/deeploytraintest.c | 415 +++++++++ DeeployTest/deeployTrainingRunner.py | 30 + DeeployTest/deeployTrainingRunner_siracusa.py | 11 + .../deeployTrainingRunner_tiled_siracusa.py | 11 + DeeployTest/generateOptimizerNetwork.py | 161 ++++ DeeployTest/generateTrainingNetwork.py | 373 ++++++++ DeeployTest/testMVPOptimizer.py | 236 +++++ DeeployTest/testMVPTraining.py | 421 +++++++++ DeeployTest/testUtils/codeGenerate.py | 875 +++++++++++++++++- DeeployTest/testUtils/core/config.py | 8 + DeeployTest/testUtils/core/execution.py | 299 +++++- .../testUtils/deeployTrainingRunner.py | 149 +++ DeeployTest/testUtils/tilingUtils.py | 29 +- 30 files changed, 3589 insertions(+), 113 deletions(-) create mode 100644 Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py create mode 100644 Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py create mode 100644 Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py create mode 100644 DeeployTest/Platforms/Siracusa/src/deeploytraintest.c create mode 100644 DeeployTest/deeployTrainingRunner.py create mode 100644 DeeployTest/deeployTrainingRunner_siracusa.py 
create mode 100644 DeeployTest/deeployTrainingRunner_tiled_siracusa.py create mode 100644 DeeployTest/generateOptimizerNetwork.py create mode 100644 DeeployTest/generateTrainingNetwork.py create mode 100644 DeeployTest/testMVPOptimizer.py create mode 100644 DeeployTest/testMVPTraining.py create mode 100644 DeeployTest/testUtils/deeployTrainingRunner.py diff --git a/Deeploy/DeeployTypes.py b/Deeploy/DeeployTypes.py index 797bd44c47..771f00c07d 100644 --- a/Deeploy/DeeployTypes.py +++ b/Deeploy/DeeployTypes.py @@ -336,14 +336,14 @@ def has_live_aliases(self, ctxt: NetworkContext) -> bool: True if this VariableBuffer has any live aliases, False otherwise """ # Do a breadth-first search across the aliasing double-linked list - live = self._live + live = self._live or self.is_input or self.is_output queue = set(self.aliases) visited = set(self.name) while len(queue) > 0: next = queue.pop() buffNext = ctxt.lookup(next) assert isinstance(buffNext, VariableBuffer) - live |= buffNext._live + live |= buffNext._live or buffNext.is_input or buffNext.is_output visited.add(next) queue |= buffNext.aliases - visited return live @@ -2800,8 +2800,7 @@ def generateInferenceCode(self) -> str: self.ctxt, code = node.generate(self.ctxt) sections = reduce(lambda a, b: a + b, code, []) - layerCode = reduce(lambda a, b: a + b, sections, "") - callStack += "{\n" + layerCode + "\n}\n" + callStack += reduce(lambda a, b: a + b, sections, "") return callStack diff --git a/Deeploy/Targets/Generic/Layers.py b/Deeploy/Targets/Generic/Layers.py index cc733937cc..7ead6556b7 100644 --- a/Deeploy/Targets/Generic/Layers.py +++ b/Deeploy/Targets/Generic/Layers.py @@ -492,6 +492,22 @@ def __init__(self, maps: List[NodeMapper]): super().__init__(maps) +class InPlaceAccumulatorV2Layer(ONNXLayer): + """Layer for ORT InPlaceAccumulatorV2 operator (com.microsoft). 
+ + Gradient accumulation with optional reset: + if lazy_reset_grad: out = gradient + else: out = buffer + gradient + """ + + def __init__(self, maps: List[NodeMapper]): + super().__init__(maps) + + def computeOps(self): + # One conditional check + one element-wise op (copy or add) per element + return self.mapper.parser.operatorRepresentation['size'] + + class LinearAttentionLayer(ONNXLayer): def __init__(self, maps: List[NodeMapper]): diff --git a/Deeploy/Targets/Generic/Parsers.py b/Deeploy/Targets/Generic/Parsers.py index ad787d9e4b..1323cc069a 100644 --- a/Deeploy/Targets/Generic/Parsers.py +++ b/Deeploy/Targets/Generic/Parsers.py @@ -2617,7 +2617,8 @@ def __init__(self): def parseNode(self, node: gs.Node) -> bool: - ret = all([len(node.inputs) == 2, len(node.outputs) == 1]) + # Accept 1 output (log_prob only) or 2 outputs (loss + log_prob) + ret = all([len(node.inputs) == 2, len(node.outputs) in (1, 2)]) return ret @@ -2628,7 +2629,15 @@ def parseNodeCtxt(self, logits = ctxt.lookup(node.inputs[0].name) labels = ctxt.lookup(node.inputs[1].name) - log_prob = ctxt.lookup(node.outputs[0].name) + if len(node.outputs) == 2: + # Dual-output: outputs[0]=loss (scalar), outputs[1]=log_prob + loss = ctxt.lookup(node.outputs[0].name) + log_prob = ctxt.lookup(node.outputs[1].name) + self.operatorRepresentation['loss'] = loss.name + else: + # Single-output (legacy): outputs[0]=log_prob + log_prob = ctxt.lookup(node.outputs[0].name) + self.operatorRepresentation['loss'] = '' self.operatorRepresentation['logits'] = logits.name self.operatorRepresentation['labels'] = labels.name self.operatorRepresentation['log_prob'] = log_prob.name @@ -2697,6 +2706,48 @@ def parseNodeCtxt(self, return ctxt, True +class InPlaceAccumulatorV2Parser(NodeParser): + """Parser for ORT InPlaceAccumulatorV2 operator (com.microsoft). 
+ + Semantics: + if lazy_reset_grad: out = gradient (reset) + else: out = buffer + gradient (accumulate) + + Inputs: + 0: buffer - current accumulation buffer (float tensor) + 1: gradient - new gradient to accumulate (float tensor, same shape) + 2: lazy_reset_grad - reset flag; if true, overwrite; else add (bool[1]) + + Output: + 0: output_buffer - updated accumulation buffer (float tensor) + """ + + def __init__(self): + super().__init__() + + def parseNode(self, node: gs.Node) -> bool: + # Require exactly 3 inputs (buffer, gradient, lazy_reset_grad) and 1 output + return len(node.inputs) == 3 and len(node.outputs) == 1 + + def parseNodeCtxt(self, + ctxt: NetworkContext, + node: gs.Node, + channels_first: bool = True) -> Tuple[NetworkContext, bool]: + + buffer = ctxt.lookup(node.inputs[0].name) + gradient = ctxt.lookup(node.inputs[1].name) + lazy_reset_grad = ctxt.lookup(node.inputs[2].name) + data_out = ctxt.lookup(node.outputs[0].name) + + self.operatorRepresentation['accum_buffer'] = buffer.name + self.operatorRepresentation['gradient'] = gradient.name + self.operatorRepresentation['lazy_reset_grad'] = lazy_reset_grad.name + self.operatorRepresentation['data_out'] = data_out.name + self.operatorRepresentation['size'] = int(np.prod(buffer.shape)) + + return ctxt, True + + class BatchNormParser(NodeParser): def __init__(self): diff --git a/Deeploy/Targets/Generic/TypeCheckers.py b/Deeploy/Targets/Generic/TypeCheckers.py index c2c8d436f8..d65dc455d2 100644 --- a/Deeploy/Targets/Generic/TypeCheckers.py +++ b/Deeploy/Targets/Generic/TypeCheckers.py @@ -574,14 +574,21 @@ class SoftmaxCrossEntropyLossChecker(SignPropTypeChecker): def __init__(self, input_types: Sequence[Type[Pointer]], output_types: Sequence[Type[Pointer]]): super().__init__(input_types, output_types) + def checkOutputType(self, inputs: List[VariableBuffer], + operatorRepresentation: OperatorRepresentation) -> bool: + # The parser sets 'loss' to a non-empty string for 2-output nodes, '' for 1-output. 
+ # Use this to determine the actual output count and match it against this binding. + actual_num_outputs = 2 if operatorRepresentation.get('loss', '') != '' else 1 + return actual_num_outputs == len(self.output_types) + def _inferNumLevels(self, inputs: List[VariableBuffer], operatorRepresentation: OperatorRepresentation) -> Optional[List[int]]: - return [2**(self.input_types[0].referencedType.typeWidth)] + return [2**(self.input_types[0].referencedType.typeWidth)] * len(self.output_types) def _inferSignedness(self, inputs: List[VariableBuffer], operatorRepresentation: OperatorRepresentation) -> Optional[List[bool]]: - return [False] + return [False] * len(self.output_types) class SGDChecker(SignPropTypeChecker): @@ -598,6 +605,32 @@ def _inferSignedness(self, inputs: List[VariableBuffer], return [True] +class InPlaceAccumulatorV2Checker(SignPropTypeChecker): + """Type checker for ORT InPlaceAccumulatorV2 operator (com.microsoft). + + Inputs: + 0: buffer (float32*) + 1: gradient (float32*) + 2: lazy_reset_grad (uint8_t* or bool* - 1 element) + + Output: + 0: output_buffer (float32*) + """ + + def __init__(self, input_types: Sequence[Type[Pointer]], output_types: Sequence[Type[Pointer]]): + super().__init__(input_types, output_types) + + def _inferNumLevels(self, inputs: List[VariableBuffer], + operatorRepresentation: OperatorRepresentation) -> List[int]: + # Output has same precision as the buffer input (float32) + return [2**(self.input_types[0].referencedType.typeWidth)] + + def _inferSignedness(self, inputs: List[VariableBuffer], + operatorRepresentation: OperatorRepresentation) -> List[bool]: + # Float32 output is signed + return [True] + + class BatchNormChecker(SignPropTypeChecker): def __init__(self, input_types: Sequence[Type[Pointer]], output_types: Sequence[Type[Pointer]]): diff --git a/Deeploy/Targets/PULPOpen/Bindings.py b/Deeploy/Targets/PULPOpen/Bindings.py index 5d7b02ae62..b3029e7adc 100644 --- a/Deeploy/Targets/PULPOpen/Bindings.py +++ 
b/Deeploy/Targets/PULPOpen/Bindings.py @@ -18,9 +18,9 @@ from Deeploy.Targets.Generic.Templates import AddTemplate, ConcatTemplate, DequantTemplate, FloatReduceSumTemplate, \ GatherTemplate, QuantTemplate, RQSiGELUTemplate, SliceTemplate, iHardswishTemplate from Deeploy.Targets.Generic.TypeCheckers import AddChecker, ConcatChecker, ConvChecker, DequantChecker, \ - GatherChecker, GELUChecker, GEMMChecker, HardswishChecker, LayerNormChecker, MatMulChecker, MulChecker, \ - QuantChecker, ReduceMeanChecker, ReluChecker, ReshapeChecker, RQAddChecker, RQHardswishChecker, SGDChecker, \ - SliceChecker, SoftmaxChecker, SoftmaxCrossEntropyLossChecker, TransposeChecker + GatherChecker, GELUChecker, GEMMChecker, HardswishChecker, InPlaceAccumulatorV2Checker, LayerNormChecker, \ + MatMulChecker, MulChecker, QuantChecker, ReduceMeanChecker, ReluChecker, ReshapeChecker, RQAddChecker, \ + RQHardswishChecker, SGDChecker, SliceChecker, SoftmaxChecker, SoftmaxCrossEntropyLossChecker, TransposeChecker from Deeploy.Targets.PULPOpen.CodeTransformationPasses.PULPClusterSynch import PULPSynchCoresPass from Deeploy.Targets.PULPOpen.CodeTransformationPasses.PULPClusterTiling import PULPClusterTiling from Deeploy.Targets.PULPOpen.CodeTransformationPasses.PULPL3Tiling import PULPL3Tiling @@ -29,11 +29,12 @@ from Deeploy.Targets.PULPOpen.DMA.L3Dma import l3DmaHack from Deeploy.Targets.PULPOpen.DMA.MchanDma import MchanDma from Deeploy.Targets.PULPOpen.Templates import ConvTemplate, DMASliceTemplate, FloatAddTemplate, FloatConvTemplate, \ - FloatGELUTemplate, FloatGemmTemplate, FloatLayernormTemplate, FloatMatMulTemplate, FloatMaxPoolTemplate, \ - FloatMulTemplate, FloatReduceMeanTemplate, FloatReluTemplate, FloatSoftmaxTemplate, GEMMTemplate, \ - MatrixVectorTemplate, MaxPoolTemplate, MulTemplate, ReduceMeanTemplate, RequantShiftTemplate, ReshapeTemplate, \ - RQAddTemplate, RQSiHardswishTemplate, SGDTemplate, SoftmaxCrossEntropyLossTemplate, TallGEMMTemplate, \ - TransposeTemplate, 
UniformRequantShiftTemplate, iRMSNormTemplate, iSoftmaxTemplate + FloatGELUTemplate, FloatGemmTemplate, FloatInPlaceAccumulatorV2Template, FloatLayernormTemplate, \ + FloatMatMulTemplate, FloatMaxPoolTemplate, FloatMulTemplate, FloatReduceMeanTemplate, FloatReluTemplate, \ + FloatSoftmaxTemplate, GEMMTemplate, MatrixVectorTemplate, MaxPoolTemplate, MulTemplate, ReduceMeanTemplate, \ + RequantShiftTemplate, ReshapeTemplate, RQAddTemplate, RQSiHardswishTemplate, SGDTemplate, \ + SoftmaxCrossEntropyLossTemplate, TallGEMMTemplate, TransposeTemplate, UniformRequantShiftTemplate, \ + iRMSNormTemplate, iSoftmaxTemplate from Deeploy.Targets.PULPOpen.TypeCheckers import PULPConvChecker, PULPLinearChecker, PULPMaxPoolChecker, \ PULPRequantShiftChecker from Deeploy.TilingExtension.CodeTransformationPasses.TilingVariableReplacement import TilingVariableReplacement, \ @@ -357,6 +358,13 @@ SoftmaxCrossEntropyLossTemplate.referenceTemplate, ForkTransformer) for type in IntegerDataTypes ] +PULPSoftmaxCrossEntropyLossDualOutputBindings = [ + NodeBinding( + SoftmaxCrossEntropyLossChecker([PointerClass(float32_t), PointerClass(type)], + [PointerClass(float32_t), PointerClass(float32_t)]), + SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, ForkTransformer) for type in IntegerDataTypes +] + PULPSoftmaxCrossEntropyLossGradBindings = [ NodeBinding( SoftmaxCrossEntropyLossChecker([PointerClass(float32_t), PointerClass(type)], [PointerClass(float32_t)]), @@ -368,6 +376,20 @@ SGDTemplate.referenceTemplate, ForkTransformer) ] +PULPInPlaceAccumulatorV2Bindings = [ + NodeBinding( + InPlaceAccumulatorV2Checker( + [PointerClass(float32_t), PointerClass(float32_t), PointerClass(uint8_t)], [PointerClass(float32_t)]), + FloatInPlaceAccumulatorV2Template.referenceTemplate, ForkTransformer) +] + +PULPInPlaceAccumulatorV2TiledBindings = [ + NodeBinding( + InPlaceAccumulatorV2Checker( + [PointerClass(float32_t), PointerClass(float32_t), PointerClass(uint8_t)], [PointerClass(float32_t)]), + 
FloatInPlaceAccumulatorV2Template.tiledReferenceTemplate, ForkTransformer) +] + PULPTransposeBindings = [ NodeBinding(TransposeChecker([PointerClass(type)], [PointerClass(type)]), TransposeTemplate.referenceTemplate, ForkTransformer) for type in IntegerDataTypes diff --git a/Deeploy/Targets/PULPOpen/Platform.py b/Deeploy/Targets/PULPOpen/Platform.py index 7456dd9e1b..56481f9220 100644 --- a/Deeploy/Targets/PULPOpen/Platform.py +++ b/Deeploy/Targets/PULPOpen/Platform.py @@ -14,17 +14,17 @@ from Deeploy.Targets.Generic.Bindings import BasicGEMMBindings, BasicPad1DBindings, BasicPad2DBindings, \ BasicRQIntegerDivBinding from Deeploy.Targets.Generic.Layers import AddLayer, ConcatLayer, ConvLayer, GatherLayer, GELUGradLayer, GELULayer, \ - GEMMLayer, LayerNormGradLayer, LayerNormLayer, MatMulLayer, MaxPoolLayer, MulLayer, PadLayer, QuantLayer, \ - ReduceMeanLayer, ReduceSumLayer, ReluLayer, RequantShiftLayer, ReshapeLayer, RQIntegerDivLayer, RQSiGELULayer, \ - RQSiHardswishLayer, SGDLayer, SliceLayer, SoftmaxCrossEntropyLossGradLayer, SoftmaxCrossEntropyLossLayer, \ - SoftmaxGradLayer, SoftmaxLayer, TransposeLayer, iHardswishLayer, iRMSNormLayer + GEMMLayer, InPlaceAccumulatorV2Layer, LayerNormGradLayer, LayerNormLayer, MatMulLayer, MaxPoolLayer, MulLayer, \ + PadLayer, QuantLayer, ReduceMeanLayer, ReduceSumLayer, ReluLayer, RequantShiftLayer, ReshapeLayer, \ + RQIntegerDivLayer, RQSiGELULayer, RQSiHardswishLayer, SGDLayer, SliceLayer, SoftmaxCrossEntropyLossGradLayer, \ + SoftmaxCrossEntropyLossLayer, SoftmaxGradLayer, SoftmaxLayer, TransposeLayer, iHardswishLayer, iRMSNormLayer from Deeploy.Targets.Generic.Parsers import AddParser, ConcatParser, DequantParser, FlattenParser, GatherParser, \ - GELUGradParser, GELUParser, GEMMParser, LayerNormGradParser, LayerNormParser, MatMulParser, MaxPool1DParser, \ - MaxPool2DParser, MulParser, Pad1DParser, Pad2DParser, QuantParser, ReduceSumParser, ReluParser, \ - RequantShiftParser, ReshapeParser, RQAddParser, RQIntegerDivParser, 
RQSiGELUParser, RQSiHardswishParser, \ - SGDParser, SliceParser, SoftmaxCrossEntropyLossGradParser, SoftmaxCrossEntropyLossParser, SoftmaxGradParser, \ - SoftmaxParser, TransposeParser, UniformRequantShiftParser, UnsqueezeParser, iHardswishParser, iRMSNormParser, \ - iSoftmaxParser + GELUGradParser, GELUParser, GEMMParser, InPlaceAccumulatorV2Parser, LayerNormGradParser, LayerNormParser, \ + MatMulParser, MaxPool1DParser, MaxPool2DParser, MulParser, Pad1DParser, Pad2DParser, QuantParser, \ + ReduceSumParser, ReluParser, RequantShiftParser, ReshapeParser, RQAddParser, RQIntegerDivParser, RQSiGELUParser, \ + RQSiHardswishParser, SGDParser, SliceParser, SoftmaxCrossEntropyLossGradParser, SoftmaxCrossEntropyLossParser, \ + SoftmaxGradParser, SoftmaxParser, TransposeParser, UniformRequantShiftParser, UnsqueezeParser, iHardswishParser, \ + iRMSNormParser, iSoftmaxParser from Deeploy.Targets.Generic.Templates import AllocateTemplate as BasicAllocateTemplate from Deeploy.Targets.Generic.TopologyOptimizationPasses.Passes import DequantPatternPass, IntegerDivRequantMergePass, \ MergeConstAddAndRequantPass, MergeTrueIntegerDivRequantShiftPass, QuantPatternPass, RQSSplitPass, \ @@ -39,14 +39,15 @@ from Deeploy.Targets.PULPOpen.Tiler import PULPAddTilingReadyBindings, PULPConcatTilingReadyBindings, \ PULPConv2DTilingReadyBindings, PULPDWConv2DTilingReadyBindings, PULPFlattenTilingReadyBindings, \ PULPFPGELUGradTilingReadyBindings, PULPFPGELUTilingReadyBindings, PULPFPGEMMTilingReadyBindings, \ - PULPGatherTilingReadyBindings, PULPiHardswishTilingReadyBindings, PULPiRMSNormTilingReadyBindings, \ - PULPiRQSGELUTilingReadyBindings, PULPLayernormGradTilingReadyBindings, PULPLayernormTilingReadyBindings, \ - PULPMatMulTilingReadyBindings, PULPMaxPool1DTilingReadyBindings, PULPMaxPool2DTilingReadyBindings, \ - PULPMulTilingReadyBindings, PULPReduceMeanTilingReadyBindings, PULPReduceSumTilingReadyBindings, \ - PULPReluTilingReadyBindings, PULPRQAddTilingReadyBindings, 
PULPRQSConv1DTilingReadyBindings, \ - PULPRQSConv2DTilingReadyBindings, PULPRQSDWConv2DTilingReadyBindings, PULPRQSGEMMTilingReadyBindings, \ - PULPRQSiHardswishTilingReadyBindings, PULPRQSMatrixVecTilingReadyBindings, PULPRQSTallGEMMTilingReadyBindings, \ - PULPRQSTilingReadyBindings, PULPSGDTilingReadyBindings, PULPSliceTilingReadyBindings, \ + PULPGatherTilingReadyBindings, PULPiHardswishTilingReadyBindings, PULPInPlaceAccumulatorV2TilingReadyBindings, \ + PULPiRMSNormTilingReadyBindings, PULPiRQSGELUTilingReadyBindings, PULPLayernormGradTilingReadyBindings, \ + PULPLayernormTilingReadyBindings, PULPMatMulTilingReadyBindings, PULPMaxPool1DTilingReadyBindings, \ + PULPMaxPool2DTilingReadyBindings, PULPMulTilingReadyBindings, PULPReduceMeanTilingReadyBindings, \ + PULPReduceSumTilingReadyBindings, PULPReluTilingReadyBindings, PULPRQAddTilingReadyBindings, \ + PULPRQSConv1DTilingReadyBindings, PULPRQSConv2DTilingReadyBindings, PULPRQSDWConv2DTilingReadyBindings, \ + PULPRQSGEMMTilingReadyBindings, PULPRQSiHardswishTilingReadyBindings, PULPRQSMatrixVecTilingReadyBindings, \ + PULPRQSTallGEMMTilingReadyBindings, PULPRQSTilingReadyBindings, PULPSGDTilingReadyBindings, \ + PULPSliceTilingReadyBindings, PULPSoftmaxCrossEntropyDualOutputTilingReadyBindings, \ PULPSoftmaxCrossEntropyGradTilingReadyBindings, PULPSoftmaxCrossEntropyTilingReadyBindings, \ PULPSoftmaxGradTilingReadyBindings, PULPSoftmaxTilingReadyBindings, PULPTransposeTilingReadyBindings, \ PULPUniformRQSTilingReadyBindings @@ -105,9 +106,12 @@ iHardswishMapper = NodeMapper(iHardswishParser(), PULPiHardswishTilingReadyBindings) RQSiHardswishMapper = NodeMapper(RQSiHardswishParser(), PULPRQSiHardswishTilingReadyBindings) SoftmaxCrossEntropyLossMapper = NodeMapper(SoftmaxCrossEntropyLossParser(), PULPSoftmaxCrossEntropyTilingReadyBindings) +SoftmaxCrossEntropyLossDualOutputMapper = NodeMapper(SoftmaxCrossEntropyLossParser(), + PULPSoftmaxCrossEntropyDualOutputTilingReadyBindings) 
SoftmaxCrossEntropyLossGradMapper = NodeMapper(SoftmaxCrossEntropyLossGradParser(), PULPSoftmaxCrossEntropyGradTilingReadyBindings) SGDMapper = NodeMapper(SGDParser(), PULPSGDTilingReadyBindings) +InPlaceAccumulatorV2Mapper = NodeMapper(InPlaceAccumulatorV2Parser(), PULPInPlaceAccumulatorV2TilingReadyBindings) QuantMapper = NodeMapper(QuantParser(), BasicQuantBindings) DequantMapper = NodeMapper(DequantParser(), BasicDequantBindings) GEMMDequantMapper = NodeMapper(PULPGEMMParser(), BasicGEMMBindings) @@ -149,9 +153,11 @@ 'Quant': QuantLayer([QuantMapper]), 'Dequant': QuantLayer([DequantMapper]), 'SoftmaxGrad': SoftmaxGradLayer([SoftmaxGradMapper]), - 'SoftmaxCrossEntropyLoss': SoftmaxCrossEntropyLossLayer([SoftmaxCrossEntropyLossMapper]), + 'SoftmaxCrossEntropyLoss': + SoftmaxCrossEntropyLossLayer([SoftmaxCrossEntropyLossDualOutputMapper, SoftmaxCrossEntropyLossMapper]), 'SoftmaxCrossEntropyLossGrad': SoftmaxCrossEntropyLossGradLayer([SoftmaxCrossEntropyLossGradMapper]), - 'SGD': SGDLayer([SGDMapper]) + 'SGD': SGDLayer([SGDMapper]), + 'InPlaceAccumulatorV2': InPlaceAccumulatorV2Layer([InPlaceAccumulatorV2Mapper]), } diff --git a/Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py b/Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py index 59499706e5..ef046f191d 100644 --- a/Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py +++ b/Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py @@ -4,7 +4,8 @@ from typing import Dict, List, Tuple -from Deeploy.AbstractDataTypes import float32_tPtr +from Deeploy.AbstractDataTypes import PointerClass +from Deeploy.CommonExtensions.DataTypes import float32_t from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation @@ -19,7 +20,7 @@ def alignToContext(self, ctxt: NetworkContext, if 'C' not in operatorRepresentation or operatorRepresentation['C'] is None: # No bias case - set C to NULL and provide a default type operatorRepresentation['C'] = None - operatorRepresentation['C_type'] = 
float32_tPtr # Default to fp32 type + operatorRepresentation['C_type'] = PointerClass(float32_t) # Default to fp32 type operatorRepresentation['C_batched'] = False return ctxt, operatorRepresentation, [] diff --git a/Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py b/Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py new file mode 100644 index 0000000000..2c01219dbd --- /dev/null +++ b/Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py @@ -0,0 +1,89 @@ +# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 + +from typing import Dict, List, Tuple + +from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation, VariableBuffer + + +class _PULPInPlaceAccumulatorV2Template(NodeTemplate): + """True in-place InPlaceAccumulatorV2 template for PULP. + + Writes the result directly into accum_buffer (the graph input) rather + than into a separate data_out buffer. data_out is registered as an + alias of accum_buffer so the memory allocator knows they share memory + and will not free accum_buffer prematurely. 
+ + Semantics: + if lazy_reset_grad: accum_buffer = gradient (reset) + else: accum_buffer += gradient (accumulate) + """ + + def __init__(self, templateStr): + super().__init__(templateStr) + + def alignToContext( + self, ctxt: NetworkContext, + operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, OperatorRepresentation, List[str]]: + accum_buffer = ctxt.lookup(operatorRepresentation['accum_buffer']) + data_out = ctxt.lookup(operatorRepresentation['data_out']) + + accum_buffer.aliases.add(data_out.name) + data_out.aliases.add(accum_buffer.name) + data_out._alias = accum_buffer.name + return ctxt, operatorRepresentation, [] + + +referenceTemplate = _PULPInPlaceAccumulatorV2Template(""" +// InPlaceAccumulatorV2 - true in-place (Name: ${nodeName}, Op: ${nodeOp}) +// Writes result to accum_buffer (in-place) and data_out (explicit output). +// In training, data_out aliases accum_buffer (same or separate allocation). +// Reset (lazy_reset_grad=1): accum_buffer = gradient +// Accum (lazy_reset_grad=0): accum_buffer += gradient +int8_t ${nodeName}_core_id = pi_core_id(); +int8_t ${nodeName}_log2Core = log2(NUM_CORES); +int32_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0); +int32_t ${nodeName}_start = MIN(${nodeName}_chunk * ${nodeName}_core_id, (int32_t)${size}); +int32_t ${nodeName}_stop = MIN(${nodeName}_start + ${nodeName}_chunk, (int32_t)${size}); + +if (${lazy_reset_grad}[0]) { + for (int32_t i = ${nodeName}_start; i < ${nodeName}_stop; i++) { + ${accum_buffer}[i] = ${gradient}[i]; + ${data_out}[i] = ${gradient}[i]; + } +} else { + for (int32_t i = ${nodeName}_start; i < ${nodeName}_stop; i++) { + ${accum_buffer}[i] += ${gradient}[i]; + ${data_out}[i] = ${accum_buffer}[i]; + } +} +""") + +# Tiled variant: writes only to ${accum_buffer} (no ${data_out} write). +# In the tiled context the optimizer reads the gradient directly from +# accum_buffer's L2 address (input_4/input_5). 
data_out's L2 address may +# overlap with other live buffers, so writing to it via DMA would corrupt L2. +# Omitting ${data_out} means we do not need a DMA egress for it at all. +tiledReferenceTemplate = _PULPInPlaceAccumulatorV2Template(""" +// InPlaceAccumulatorV2 - tiled in-place (Name: ${nodeName}, Op: ${nodeOp}) +// Tiled variant: result written only to accum_buffer (egressed to L2 by DMA). +// data_out is NOT written here — optimizer reads gradient from accum_buffer. +// Reset (lazy_reset_grad=1): accum_buffer = gradient +// Accum (lazy_reset_grad=0): accum_buffer += gradient +int8_t ${nodeName}_core_id = pi_core_id(); +int8_t ${nodeName}_log2Core = log2(NUM_CORES); +int32_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0); +int32_t ${nodeName}_start = MIN(${nodeName}_chunk * ${nodeName}_core_id, (int32_t)${size}); +int32_t ${nodeName}_stop = MIN(${nodeName}_start + ${nodeName}_chunk, (int32_t)${size}); + +if (${lazy_reset_grad}[0]) { + for (int32_t i = ${nodeName}_start; i < ${nodeName}_stop; i++) { + ${accum_buffer}[i] = ${gradient}[i]; + } +} else { + for (int32_t i = ${nodeName}_start; i < ${nodeName}_stop; i++) { + ${accum_buffer}[i] += ${gradient}[i]; + } +} +""") diff --git a/Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py b/Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py index 1592fe30c4..da27aab47c 100644 --- a/Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py +++ b/Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py @@ -2,9 +2,42 @@ # # SPDX-License-Identifier: Apache-2.0 -from Deeploy.DeeployTypes import NodeTemplate +from typing import List, Tuple -referenceTemplate = NodeTemplate(""" +from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation, VariableBuffer + + +class _PULPSGDTemplate(NodeTemplate): + """In-place SGD template for PULP. + + weight_updated is aliased to weight so the memory allocator places them + at the same L2 address. 
This ensures the tiled egress DMA writes the + updated weight back to weight's L2 buffer — the same buffer the training + network reads from on the next forward pass. + """ + + def __init__(self, templateStr): + super().__init__(templateStr) + + def alignToContext( + self, ctxt: NetworkContext, + operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, OperatorRepresentation, List[str]]: + weight = ctxt.lookup(operatorRepresentation['weight']) + weight_updated = ctxt.lookup(operatorRepresentation['weight_updated']) + + weight.aliases.add(weight_updated.name) + weight_updated.aliases.add(weight.name) + weight_updated._alias = weight.name + + # Make weight_updated share weight's L2 allocation (no separate malloc). + # The egress DMA then writes updated weights back to weight's L2 address. + weight_updated.allocTemplate = NodeTemplate( + " ${name} = (${type.typeName}) " + str(weight._instance) + ";") + weight_updated.deallocTemplate = NodeTemplate("") + return ctxt, operatorRepresentation, [] + + +referenceTemplate = _PULPSGDTemplate(""" // SGD Weight Update with Separated Multiplication and Subtraction Unrolling // (Name: ${nodeName}, Op: ${nodeOp}) int8_t ${nodeName}_core_id = pi_core_id(); @@ -46,4 +79,4 @@ float32_t temp_grad = learning_rate * ref_${grad}[i]; ref_${weight_updated}[i] = ref_${weight}[i] - temp_grad; } -""") \ No newline at end of file +""") diff --git a/Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py b/Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py index c1aefe01a3..4a3da4b3ee 100644 --- a/Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py +++ b/Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py @@ -28,6 +28,31 @@ END_SINGLE_CORE """) +referenceDualOutputTemplate = NodeTemplate(""" +BEGIN_SINGLE_CORE + // SoftmaxCrossEntropyLoss dual-output (Name: ${nodeName}, Op: ${nodeOp}) + float32_t sce_total_loss = 0.0f; + for (uint32_t i = 0; i < ${batch}; i++) 
{ + float32_t sce_max_logit = ${logits}[i * ${num_classes}]; + for (uint32_t j = 1; j < ${num_classes}; j++) { + if (${logits}[i * ${num_classes} + j] > sce_max_logit) + sce_max_logit = ${logits}[i * ${num_classes} + j]; + } + float32_t sce_sum_exp = 0.0f; + for (uint32_t j = 0; j < ${num_classes}; j++) + sce_sum_exp += expf(${logits}[i * ${num_classes} + j] - sce_max_logit); + float32_t sce_log_sum_exp = logf(sce_sum_exp); + for (uint32_t j = 0; j < ${num_classes}; j++) + ${log_prob}[i * ${num_classes} + j] = + ${logits}[i * ${num_classes} + j] - sce_max_logit - sce_log_sum_exp; + sce_total_loss += -(${logits}[i * ${num_classes} + (uint32_t)(${labels}[i])] + - sce_max_logit - sce_log_sum_exp); + } + ${loss}[0] = sce_total_loss / (float32_t)${batch}; + printf(" [SCE] loss=%.6f\\r\\n", (double)${loss}[0]); +END_SINGLE_CORE +""") + referenceGradientTemplate = NodeTemplate(""" BEGIN_SINGLE_CORE // SoftmaxCrossEntropyLossGrad (Name: ${nodeName}, Op: ${nodeOp}) diff --git a/Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py b/Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py new file mode 100644 index 0000000000..2d3cfa4c3e --- /dev/null +++ b/Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py @@ -0,0 +1,102 @@ +# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 + +from typing import Dict, List, Tuple + +import numpy as np + +from Deeploy.AbstractDataTypes import PointerClass +from Deeploy.CommonExtensions.DataTypes import uint16_t +from Deeploy.DeeployTypes import NetworkContext, OperatorRepresentation +from Deeploy.Targets.Generic.TileConstraints.BOPTileConstraint import BOPTileConstraint +from Deeploy.TilingExtension.MemoryConstraints import NodeMemoryConstraint +from Deeploy.TilingExtension.TilerModel import TilerModel +from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, HyperRectangle, TilingSchedule, 
\ + VariableReplacementScheme + + +class InPlaceAccumulatorV2TileConstraint(BOPTileConstraint): + """Tile constraint for InPlaceAccumulatorV2. + + Tiles buffer and gradient together (same shape); lazy_reset_grad is a + scalar (1 element) and is not tiled. + """ + + dataIn1Name = 'accum_buffer' + dataIn2Name = 'gradient' + dataOutName = 'data_out' + + @classmethod + def addGeometricalConstraint(cls, tilerModel: TilerModel, parseDict: Dict, ctxt: NetworkContext) -> TilerModel: + # Register buffer, gradient, data_out and add BOP equality constraints + tilerModel = super().addGeometricalConstraint(tilerModel, parseDict, ctxt) + + # Register lazy_reset_grad (scalar flag, not tiled): fix all dims to full size + lazyResetName = parseDict['lazy_reset_grad'] + tilerModel.addTensorDimToModel(ctxt, lazyResetName) + lazyResetTensor = ctxt.lookup(lazyResetName) + shape = lazyResetTensor.shape + dims = [shape] if isinstance(shape, int) else shape + for idx, dim in enumerate(dims): + dimVar = tilerModel.getTensorDimVar(lazyResetName, idx) + tilerModel.addConstraint(dimVar == dim) + + return tilerModel + + @classmethod + def serializeTilingSolution( + cls, tilingSolution: NodeMemoryConstraint, absoluteOutputCubes: List[AbsoluteHyperRectangle], + targetMemLevel: str, ctxt: NetworkContext, + operatorRepresentation: OperatorRepresentation) -> Tuple[VariableReplacementScheme, TilingSchedule]: + outputCubes = [cube.rectangle for cube in absoluteOutputCubes] + + # Egress strategy: use data_out (the proper graph output, present in + # outputTensorMemoryConstraints) rather than accum_buffer (a graph input, + # only in inputTensorMemoryConstraints). This avoids two core-class issues: + # 1. accum_buffer appearing in BOTH inputBaseOffsets and outputBaseOffsets + # causes a duplicate-hoist KeyError in TilingVariableReplacement. + # 2. The egress DMA lookup uses outputTensorMemoryConstraints; accum_buffer + # is not there and would raise a KeyError. 
+ # + # The trick: force outputBaseOffsets[data_out] to the SAME L1 arena offset as + # inputBaseOffsets[accum_buffer]. Both data_out_ref and accum_buffer_ref then + # map to the same physical L1 address. The tiled kernel writes to ${accum_buffer} + # (= accum_buffer_ref in L1); the egress DMA transfers data_out_ref (same L1 + # bytes) to data_out's L2 address, which is what the optimizer reads. + addrNames = [cls.dataIn1Name, cls.dataIn2Name, cls.dataOutName, 'lazy_reset_grad'] + inputBaseOffsets, outputBaseOffsets = cls.extractBaseAddr(tilingSolution, targetMemLevel, + operatorRepresentation, addrNames) + + # Pin data_out's L1 tile to the same arena slot as accum_buffer's L1 tile. + outputBaseOffsets[cls.dataOutName] = inputBaseOffsets[cls.dataIn1Name] + + replacements = {"size": []} + replacementTypes = {"size": PointerClass(uint16_t)} + + lazyResetName = operatorRepresentation['lazy_reset_grad'] + lazyResetShape = ctxt.lookup(lazyResetName).shape + lazyResetDims = (lazyResetShape,) if isinstance(lazyResetShape, int) else tuple(lazyResetShape) + lazyResetCube = HyperRectangle((0,) * len(lazyResetDims), lazyResetDims) + + inputLoadSchedule = [] + outputLoadSchedule = [] + + for cube in outputCubes: + replacements["size"].append(int(np.prod(cube.dims))) + inputLoadSchedule.append({ + cls.dataIn1Name: cube, + cls.dataIn2Name: cube, + 'lazy_reset_grad': lazyResetCube, + }) + + for out in outputCubes: + # Egress: DMA from data_out_ref (same L1 slot as accum_buffer_ref) → data_out L2. 
+ outputLoadSchedule.append({ + cls.dataOutName: out, + }) + + tilingSchedule = TilingSchedule(inputBaseOffsets, outputBaseOffsets, inputLoadSchedule, outputLoadSchedule) + variableReplacementSchedule = VariableReplacementScheme(replacements, replacementTypes) + + return variableReplacementSchedule, tilingSchedule diff --git a/Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py b/Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py index b7757786e1..ebef4910ca 100644 --- a/Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py +++ b/Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py @@ -10,3 +10,9 @@ class SGDTileConstraint(BOPTileConstraint): dataIn1Name = 'weight' dataIn2Name = 'grad' dataOutName = 'weight_updated' + +class ReluGradTileConstraint(BOPTileConstraint): + + dataIn1Name = 'grad_out' + dataIn2Name = 'data_in' + dataOutName = 'grad_in' diff --git a/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py b/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py new file mode 100644 index 0000000000..3456632b79 --- /dev/null +++ b/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py @@ -0,0 +1,74 @@ +# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 + +import copy +from typing import Dict, List, Tuple, Union + +from Deeploy.DeeployTypes import NetworkContext, OperatorRepresentation +from Deeploy.TilingExtension.MemoryConstraints import NodeMemoryConstraint +from Deeploy.TilingExtension.TileConstraint import TileConstraint +from Deeploy.TilingExtension.TilerModel import TilerModel +from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, HyperRectangle, TilingSchedule, \ + VariableReplacementScheme +from Deeploy.Targets.PULPOpen.TileConstraints.SoftmaxCrossEntropyTileConstraint import \ + SoftmaxCrossEntropyTileConstraint + + +class 
SoftmaxCrossEntropyLossDualOutputTileConstraint(SoftmaxCrossEntropyTileConstraint): + """TileConstraint for SoftmaxCrossEntropyLoss with 2 outputs: + - log_prob : [batch, num_classes] (primary output — same as single-output version) + - loss : [] 0-d scalar (scalar cross-entropy mean) + + Both batch and num_classes are pinned to their full size by the inherited + addPolicyConstraint, so no actual tiling of SCE occurs. The sole purpose of + this subclass is to override wrapTilingSolution so that the base-class + single-output assertion is bypassed, and the scalar loss buffer is included + in the DMA output schedule. + """ + + # Key in operatorRepresentation for the scalar loss output buffer name. + dataLossName = 'loss' + + @classmethod + def wrapTilingSolution( + cls, tilingSolution: NodeMemoryConstraint, targetMemLevel: str, ctxt: NetworkContext, + operatorRepresentation: OperatorRepresentation) -> Tuple[VariableReplacementScheme, List[TilingSchedule]]: + + logProbVar = operatorRepresentation[cls.dataOutName] # e.g. "onnx::log_prob::3" + lossVar = operatorRepresentation.get(cls.dataLossName, '') + + # If loss is absent (empty string — single-output fallback) or not in the + # memory constraint dict, delegate straight to the parent unchanged. + if not lossVar or lossVar not in tilingSolution.outputTensorMemoryConstraints: + return super().wrapTilingSolution(tilingSolution, targetMemLevel, ctxt, operatorRepresentation) + + # Build a single-output copy of tilingSolution (log_prob only) so that + # the base-class assertion `len(outputTensorMemoryConstraints) == 1` passes. + singleOutputSolution = copy.deepcopy(tilingSolution) + singleOutputSolution.outputTensorMemoryConstraints = { + logProbVar: tilingSolution.outputTensorMemoryConstraints[logProbVar] + } + + # Call the base-class wrapTilingSolution, which runs cube computation and + # calls serializeTilingSolution for log_prob. 
+ varReplacement, tilingSchedules = super().wrapTilingSolution( + singleOutputSolution, targetMemLevel, ctxt, operatorRepresentation) + + # Extend each TilingSchedule to include the scalar loss output. + # The loss tensor is always 1 element (0-d scalar represented as [1] for DMA). + lossAddr = TileConstraint.getBaseAddr(tilingSolution, targetMemLevel, lossVar) + + # If the address is None (IO tensor with runtime-determined address, or tensor + # not allocated at this memory level), skip — same logic as sanitizeTilingSchedule. + if lossAddr == [None]: + return varReplacement, tilingSchedules + + lossRect = HyperRectangle((0,), (1,)) + + for schedule in tilingSchedules: + schedule.outputBaseOffsets[cls.dataLossName] = lossAddr + for step in schedule.outputLoadSchedule: + step[cls.dataLossName] = lossRect + + return varReplacement, tilingSchedules diff --git a/Deeploy/Targets/PULPOpen/Tiler.py b/Deeploy/Targets/PULPOpen/Tiler.py index 901106459e..a135d43812 100644 --- a/Deeploy/Targets/PULPOpen/Tiler.py +++ b/Deeploy/Targets/PULPOpen/Tiler.py @@ -19,10 +19,11 @@ PULPGatherBindings, PULPiHardswishBindings, PULPiRMSNormBindings, PULPiRQSGELUBindings, PULPLayernormBinding, \ PULPLayernormGradBinding, PULPMatMulBindings, PULPMaxPool1DBindings, PULPMaxPool2DBindings, PULPMulBindings, \ PULPReduceMeanBindings, PULPReduceSumBindings, PULPReluBinding, PULPReshapeBindings, PULPRQAddBindings, \ - PULPRQSBindings, PULPRQSConv1DBindings, PULPRQSConv2DBindings, PULPRQSDWConv2DBindings, PULPRQSGEMMBindings, \ + PULPInPlaceAccumulatorV2Bindings, PULPInPlaceAccumulatorV2TiledBindings, PULPRQSBindings, \ + PULPRQSConv1DBindings, PULPRQSConv2DBindings, PULPRQSDWConv2DBindings, PULPRQSGEMMBindings, \ PULPRQSiHardswishBindings, PULPRQSMatrixVecBindings, PULPRQSTallGEMMBindings, PULPSGDBindings, PULPSliceBindings, \ - PULPSoftmaxBindings, PULPSoftmaxCrossEntropyLossBindings, PULPSoftmaxCrossEntropyLossGradBindings, \ - PULPSoftmaxGradBindings, PULPTransposeBindings, 
PULPUniformRQSBindings + PULPSoftmaxBindings, PULPSoftmaxCrossEntropyLossBindings, PULPSoftmaxCrossEntropyLossDualOutputBindings, \ + PULPSoftmaxCrossEntropyLossGradBindings, PULPSoftmaxGradBindings, PULPTransposeBindings, PULPUniformRQSBindings from Deeploy.Targets.PULPOpen.TileConstraints.ConvTileConstraint import Conv2DTileConstraint, RQConv1DTileConstraint, \ RQConv2DTileConstraint from Deeploy.Targets.PULPOpen.TileConstraints.DWConvTileConstraint import DWConv2DTileConstraint, \ @@ -30,6 +31,8 @@ from Deeploy.Targets.PULPOpen.TileConstraints.GatherTileConstraint import GatherTileConstraint from Deeploy.Targets.PULPOpen.TileConstraints.GeluTileConstraint import GeluGradTileConstraint from Deeploy.Targets.PULPOpen.TileConstraints.GEMMTileConstraint import FloatGEMMTileConstraint, GEMMTileConstraint +from Deeploy.Targets.PULPOpen.TileConstraints.InPlaceAccumulatorV2TileConstraint import \ + InPlaceAccumulatorV2TileConstraint from Deeploy.Targets.PULPOpen.TileConstraints.iSoftmaxTileConstraint import SoftmaxGradTileConstraint, \ iSoftmaxTileConstraint from Deeploy.Targets.PULPOpen.TileConstraints.LayernormTileConstraint import LayernormGradTileConstraint, \ @@ -41,6 +44,8 @@ from Deeploy.Targets.PULPOpen.TileConstraints.RequantShiftTileConstraint import RequantShiftTileConstraint from Deeploy.Targets.PULPOpen.TileConstraints.SGDTileConstraint import SGDTileConstraint from Deeploy.Targets.PULPOpen.TileConstraints.SliceConstraint import SliceTileConstraint +from Deeploy.Targets.PULPOpen.TileConstraints.SoftmaxCrossEntropyLossDualOutputTileConstraint import \ + SoftmaxCrossEntropyLossDualOutputTileConstraint from Deeploy.Targets.PULPOpen.TileConstraints.SoftmaxCrossEntropyTileConstraint import \ SoftmaxCrossEntropyGradTileConstraint, SoftmaxCrossEntropyTileConstraint from Deeploy.TilingExtension.TilerExtension import TilingReadyNodeBindings @@ -143,6 +148,10 @@ PULPSoftmaxCrossEntropyTilingReadyBindings = TilingReadyNodeBindings( nodeBindings = 
PULPSoftmaxCrossEntropyLossBindings, tileConstraint = SoftmaxCrossEntropyTileConstraint()) +PULPSoftmaxCrossEntropyDualOutputTilingReadyBindings = TilingReadyNodeBindings( + nodeBindings = PULPSoftmaxCrossEntropyLossDualOutputBindings, + tileConstraint = SoftmaxCrossEntropyLossDualOutputTileConstraint()) + PULPSoftmaxCrossEntropyGradTilingReadyBindings = TilingReadyNodeBindings( nodeBindings = PULPSoftmaxCrossEntropyLossGradBindings, tileConstraint = SoftmaxCrossEntropyGradTileConstraint()) @@ -155,6 +164,9 @@ PULPSGDTilingReadyBindings = TilingReadyNodeBindings(nodeBindings = PULPSGDBindings, tileConstraint = SGDTileConstraint()) +PULPInPlaceAccumulatorV2TilingReadyBindings = TilingReadyNodeBindings( + nodeBindings = PULPInPlaceAccumulatorV2TiledBindings, tileConstraint = InPlaceAccumulatorV2TileConstraint()) + PULPSliceTilingReadyBindings = TilingReadyNodeBindings(nodeBindings = PULPSliceBindings, tileConstraint = SliceTileConstraint()) diff --git a/Deeploy/TilingExtension/TilerExtension.py b/Deeploy/TilingExtension/TilerExtension.py index 2186d4d4c4..e42ddf13ad 100644 --- a/Deeploy/TilingExtension/TilerExtension.py +++ b/Deeploy/TilingExtension/TilerExtension.py @@ -333,7 +333,7 @@ def _convertCtxtToStaticSchedule(self, ctxt: NetworkContext, if _buffer._memoryLevel != memoryLevel: continue - if hasattr(_buffer, "_alias") and ctxt.is_global(_buffer._alias): + if hasattr(_buffer, "_alias") and ctxt.is_global(_buffer._alias) and _buffer._alias not in blockNames: continue if hasattr(_buffer, "_alias") and _buffer._alias in blockNames: @@ -398,11 +398,32 @@ def minimalloc(self, memoryMap, ctxt, nodeMemoryConstraint, capacity: int, memor environment variable to be set to the installation directory. """ + blockNames = [block.name for block in memoryMap] + + # In-place alias outputs are costless — their storage is + # already accounted for by the alias target. 
This mirrors the + # zero-cost logic in _buildCostVector (MemoryScheduler.py) and the + # skip logic in _allocateStaticBuffer. + # We skip them from the MiniMalloc CSV (MiniMalloc does not accept + # size-0 entries) and resolve their addrSpace from the alias target + # after the solver runs. + # NOTE: Only skip when alias target is in the SAME memoryMap. + # When alias target is global (e.g. L2 weight) but we're allocating + # L1, the buffer still needs its own L1 space. + aliasBlocks = set() + for memoryBlock in memoryMap: + _buffer = ctxt.lookup(memoryBlock.name) + if hasattr(_buffer, "_alias") and _buffer._alias in blockNames: + aliasBlocks.add(memoryBlock.name) + with open(f"{self._minimalloc_input}.csv", mode = "w", newline = "") as file: writer = csv.writer(file, lineterminator = "\n") writer.writerow(["id", "lower", "upper", "size"]) for memoryBlock in memoryMap: + if memoryBlock.name in aliasBlocks: + continue + _buffer = ctxt.lookup(memoryBlock.name) if nodeMemoryConstraint is None: _bufferSize = _buffer.size if isinstance( @@ -419,11 +440,12 @@ def minimalloc(self, memoryMap, ctxt, nodeMemoryConstraint, capacity: int, memor 8) * nodeMemoryConstraint.tensorMemoryConstraints[ memoryBlock.name].memoryConstraints[memoryLevel].multiBufferCoefficient + _alignedSize = ((int(_bufferSize) + 3) // 4) * 4 writer.writerow([ memoryBlock.name, str(memoryBlock.lifetime[0]), str(memoryBlock.lifetime[1] + 1), - str(int(_bufferSize)) + str(_alignedSize) ]) try: @@ -452,6 +474,21 @@ def minimalloc(self, memoryMap, ctxt, nodeMemoryConstraint, capacity: int, memor if memoryBlock.name == row[0]: memoryBlock._addrSpace = (int(row[-1]), int(row[-1]) + int(row[-2])) + # JUNGVI: Alias blocks were skipped in the MiniMalloc CSV. + # Resolve their addrSpace from their alias target so that + # downstream code can access it if needed. 
+ for memoryBlock in memoryMap: + if memoryBlock.name in aliasBlocks: + _buffer = ctxt.lookup(memoryBlock.name) + aliasTarget = ctxt.dealiasBuffer(memoryBlock.name) + for targetBlock in memoryMap: + if targetBlock.name == aliasTarget: + memoryBlock._addrSpace = targetBlock._addrSpace + break + else: + # Alias target not in this memoryMap — use zero offset + memoryBlock._addrSpace = (0, 0) + return memoryMap def computeTilingSchedule(self, ctxt: NetworkContext) -> TilingSolution: diff --git a/DeeployTest/CMakeLists.txt b/DeeployTest/CMakeLists.txt index b7f3535790..3d6480d5f9 100644 --- a/DeeployTest/CMakeLists.txt +++ b/DeeployTest/CMakeLists.txt @@ -6,8 +6,16 @@ include_directories(${GENERATED_SOURCE}) set(CMAKE_EXPORT_COMPILE_COMMANDS ON) -add_library(network OBJECT ${GENERATED_SOURCE}/Network.c) -target_link_libraries(network PUBLIC deeploylib) +if(TRAINING) + add_library(training_network OBJECT ${GENERATED_SOURCE}/TrainingNetwork.c) + target_link_libraries(training_network PUBLIC deeploylib) + # Optimizer network (SGD kernel, compiled separately to allow different prefix) + add_library(optimizer_network OBJECT ${GENERATED_SOURCE}/OptimizerNetwork.c) + target_link_libraries(optimizer_network PUBLIC deeploylib) +else() + add_library(network OBJECT ${GENERATED_SOURCE}/Network.c) + target_link_libraries(network PUBLIC deeploylib) +endif() if(platform STREQUAL MemPool) add_subdirectory(Platforms/MemPool) @@ -29,7 +37,12 @@ elseif(DEEPLOY_ARCH STREQUAL PULP) ) if (NOT HEXLIST) - target_compile_options(network PUBLIC -DNOFLASH) + if(TRAINING) + target_compile_options(training_network PUBLIC -DNOFLASH) + target_compile_options(optimizer_network PUBLIC -DNOFLASH) + else() + target_compile_options(network PUBLIC -DNOFLASH) + endif() else() gvsoc_flags_add_files_to_hyperflash(GVSOC_HEX_HYPERFLASH_FLAGS HEXLIST) list(APPEND GVSOC_EXTRA_FLAGS ${GVSOC_HEX_HYPERFLASH_FLAGS}) @@ -37,9 +50,12 @@ elseif(DEEPLOY_ARCH STREQUAL PULP) # SCHEREMO: Waive warnings # Pointer sign 
warnings are caused by the data width abstraction used in Deeploy. Signedness is not explicitly modelled, as this is handled by kernels - target_compile_options(network PRIVATE - -Wno-pointer-sign - ) + if(TRAINING) + target_compile_options(training_network PRIVATE -Wno-pointer-sign) + target_compile_options(optimizer_network PRIVATE -Wno-pointer-sign) + else() + target_compile_options(network PRIVATE -Wno-pointer-sign) + endif() if(platform STREQUAL Siracusa OR platform STREQUAL Siracusa_w_neureka) add_subdirectory(Platforms/Siracusa) @@ -61,7 +77,12 @@ elseif(platform STREQUAL GAP9) if (NOT HEXLIST) # L2 mode: No flash/readfs files # Data lives in L2 memory only - target_compile_options(network PUBLIC -DNOFLASH) + if(TRAINING) + target_compile_options(training_network PUBLIC -DNOFLASH) + target_compile_options(optimizer_network PUBLIC -DNOFLASH) + else() + target_compile_options(network PUBLIC -DNOFLASH) + endif() message(STATUS "[Deeploy GAP9] L2 mode: No hex files found, -DNOFLASH set") message(STATUS "[Deeploy GAP9] If you expect L3 mode, ensure Python codegen created hex files in ${GENERATED_SOURCE}/hex/") else() @@ -77,5 +98,13 @@ elseif(platform STREQUAL GAP9) message(STATUS "GAPY_RUNNER_ARGS: ${GAPY_RUNNER_ARGS}") endif() + # Waive warnings in generated code + if(TRAINING) + target_compile_options(training_network PRIVATE -Wno-pointer-sign -Wno-sign-compare) + target_compile_options(optimizer_network PRIVATE -Wno-pointer-sign -Wno-sign-compare) + else() + target_compile_options(network PRIVATE -Wno-pointer-sign -Wno-sign-compare) + endif() + add_subdirectory(Platforms/GAP9) endif() diff --git a/DeeployTest/Platforms/Siracusa/CMakeLists.txt b/DeeployTest/Platforms/Siracusa/CMakeLists.txt index 45e6191490..28ac5131f2 100644 --- a/DeeployTest/Platforms/Siracusa/CMakeLists.txt +++ b/DeeployTest/Platforms/Siracusa/CMakeLists.txt @@ -1,19 +1,46 @@ # SPDX-FileCopyrightText: 2024 ETH Zurich and University of Bologna -# # SPDX-License-Identifier: Apache-2.0 
set(ProjectId ${TESTNAME}) -file(GLOB_RECURSE SOURCES - src/CycleCounter.c - src/deeploytest.c -) +option(TRAINING "Use training harness instead of inference harness" OFF) + +# Compile-time training parameters (override via -D on cmake command line) +set(N_TRAIN_STEPS "1" CACHE STRING "Number of optimizer steps") +set(N_ACCUM_STEPS "1" CACHE STRING "Number of mini-batches per optimizer step") +set(TRAINING_NUM_DATA_INPUTS "2" CACHE STRING "Number of data inputs per mini-batch") + +if(TRAINING) + file(GLOB_RECURSE SOURCES + src/CycleCounter.c + src/deeploytraintest.c + ) + set(NETWORK_LIB training_network) +else() + file(GLOB_RECURSE SOURCES + src/CycleCounter.c + src/deeploytest.c + ) + set(NETWORK_LIB network) +endif() add_deeploy_executable(${ProjectId} EXCLUDE_FROM_ALL ${SOURCES}) target_include_directories(${ProjectId} PRIVATE ${CMAKE_CURRENT_LIST_DIR}/inc) -target_link_libraries(${ProjectId} PRIVATE network deeploylib) -target_compile_options(${ProjectId} INTERFACE network) -add_gvsoc_emulation(${ProjectId} "siracusa") +if(TRAINING) + target_link_libraries(${ProjectId} PRIVATE ${NETWORK_LIB} optimizer_network deeploylib) +else() + target_link_libraries(${ProjectId} PRIVATE ${NETWORK_LIB} deeploylib) +endif() +target_compile_options(${ProjectId} INTERFACE ${NETWORK_LIB}) +if(TRAINING) + target_compile_definitions(${ProjectId} PRIVATE + N_TRAIN_STEPS=${N_TRAIN_STEPS} + N_ACCUM_STEPS=${N_ACCUM_STEPS} + TRAINING_NUM_DATA_INPUTS=${TRAINING_NUM_DATA_INPUTS} + ) +endif() + +add_gvsoc_emulation(${ProjectId} "siracusa") link_compile_dump(${TESTNAME}) diff --git a/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c b/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c new file mode 100644 index 0000000000..2b43c90710 --- /dev/null +++ b/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c @@ -0,0 +1,415 @@ +/* + * SPDX-FileCopyrightText: 2020 ETH Zurich and University of Bologna + * + * SPDX-License-Identifier: Apache-2.0 + */ + +/* + * Training harness for 
Siracusa — Phase 2 (with Deeploy-compiled OptimizerNetwork) + * + * Loop structure: + * + * InitTrainingNetwork() + * InitOptimizerNetwork() + * Connect optimizer buffers → training network's weight/grad buffers + * + * for update_step in [0, N_TRAIN_STEPS): // optimizer steps + * for accum_step in [0, N_ACCUM_STEPS): // mini-batches per update + * lazy_reset_grad = (accum_step == 0) // reset on first, accumulate on rest + * load data for this mini-batch + * RunTrainingNetwork() // fwd + bwd + InPlaceAccumulatorV2 + * store loss value + * // SGD weight update via Deeploy-compiled optimizer kernel: + * copy weights + grad_acc → optimizer input buffers + * RunOptimizerNetwork() + * copy weight_updated ← optimizer output buffers → training weight buffers + * + * Numerical verification: + * - Compare stored loss values against testLossRef[] (from testoutputs.h) + * + * Buffer layout in DeeployNetwork_inputs[] (must match ONNX input order): + * [0 .. TRAINING_NUM_DATA_INPUTS-1] data + labels (per mini-batch) + * [TRAINING_NUM_DATA_INPUTS .. + * .. TRAINING_GRAD_BUF_START_IDX-1] weights (persistent) + * [TRAINING_GRAD_BUF_START_IDX .. + * .. 
+TRAINING_NUM_GRAD_INPUTS-1] grad accumulation bufs (persistent) + * [DeeployNetwork_num_inputs-1] lazy_reset_grad uint8 + * + * Optimizer buffer layout in DeeployOptNetwork_inputs[] (interleaved pairs): + * [2*i] weight_i (copied from DeeployNetwork_inputs[TRAINING_NUM_DATA_INPUTS+i]) + * [2*i+1] grad_acc_i (copied from DeeployNetwork_inputs[TRAINING_GRAD_BUF_START_IDX+i]) + * DeeployOptNetwork_outputs[i] = weight_i_updated + * → copied back to DeeployNetwork_inputs[TRAINING_NUM_DATA_INPUTS+i] + * + * Compile-time constants (emitted by code generator into testinputs.h): + * N_TRAIN_STEPS number of optimizer (weight-update) steps + * N_ACCUM_STEPS number of mini-batches accumulated per update + * TRAINING_NUM_DATA_INPUTS inputs that change each mini-batch (data + labels) + * TRAINING_GRAD_BUF_START_IDX first grad acc buffer index in DeeployNetwork_inputs[] + * TRAINING_NUM_GRAD_INPUTS number of grad accumulation buffers (== number of weights) + * TRAINING_NUM_WEIGHT_INPUTS number of trainable weight buffers + * TRAINING_LEARNING_RATE SGD learning rate (for reference — embedded in optimizer ONNX) + * + * Reference comparison constants (emitted into testoutputs.h): + * N_LOSS_REFS number of reference loss values + * NUM_WEIGHT_REFS number of reference weight tensors + * TRAINING_TOLERANCE_ABS absolute comparison tolerance + */ + +#include +#include +#include + +#include "CycleCounter.h" +#include "OptimizerNetwork.h" +#include "TrainingNetwork.h" +#include "dory_mem.h" +#include "pmsis.h" +#include "testinputs.h" +#include "testoutputs.h" + +/* Helper: true when ptr is in L2 (CPU-accessible); false when in L3 (external RAM) */ +#define IS_L2(ptr) ((uint32_t)(ptr) >= 0x10000000u) + +/* ------------------------------------------------------------------------- + * Compile-time defaults — override via CMake target_compile_definitions + * ---------------------------------------------------------------------- */ + +#ifndef N_TRAIN_STEPS +#define N_TRAIN_STEPS 1 +#endif + 
+#ifndef N_ACCUM_STEPS +#define N_ACCUM_STEPS 1 +#endif + +#ifndef TRAINING_NUM_DATA_INPUTS +#define TRAINING_NUM_DATA_INPUTS 2 +#endif + +#define MAINSTACKSIZE 12000 +#define SLAVESTACKSIZE 3800 + +/* ------------------------------------------------------------------------- + * Cluster device + * ---------------------------------------------------------------------- */ + +struct pi_device cluster_dev; + + +/* ------------------------------------------------------------------------- + * Loss storage (one value per forward pass) + * ---------------------------------------------------------------------- */ + +#define TOTAL_FWD_PASSES (N_TRAIN_STEPS * N_ACCUM_STEPS) +static float stored_losses[TOTAL_FWD_PASSES]; + +/* ------------------------------------------------------------------------- + * Optimizer buffer connection + * + * Connect DeeployOptNetwork_inputs[]/outputs[] to the training network's + * weight and grad acc buffers via memcpy. + * + * Optimizer ONNX input order: [w0, g0, w1, g1, ...] (interleaved pairs) + * Optimizer ONNX output order: [w0_updated, w1_updated, ...] 
+ * ---------------------------------------------------------------------- */ + +/* ------------------------------------------------------------------------- + * L3-aware memory transfer: handles all combinations of L2/L3 src and dst + * ---------------------------------------------------------------------- */ + +static void l3_aware_copy(void *dst, const void *src, uint32_t bytes) { + if (IS_L2(dst) && IS_L2(src)) { + memcpy(dst, src, bytes); + } else if (IS_L2(dst)) { + /* L3 → L2 */ + ram_read(dst, (void *)src, bytes); + } else if (IS_L2(src)) { + /* L2 → L3 */ + ram_write(dst, (void *)src, bytes); + } else { + /* L3 → L3: stage through a temporary L2 buffer */ + void *tmp = pi_l2_malloc(bytes); + ram_read(tmp, (void *)src, bytes); + ram_write(dst, tmp, bytes); + pi_l2_free(tmp, bytes); + } +} + +static void connect_optimizer_buffers(void) { +#if defined(TRAINING_NUM_WEIGHT_INPUTS) && (TRAINING_NUM_WEIGHT_INPUTS > 0) + /* Nothing to pre-allocate — InitOptimizerNetwork() already allocated the + * optimizer's static buffers and set DeeployOptNetwork_inputs[]/outputs[]. + * We only need to sync data at each optimizer step (see run_optimizer_step). */ + (void)0; +#endif +} + +static void run_optimizer_step(void) { +#if defined(TRAINING_NUM_WEIGHT_INPUTS) && (TRAINING_NUM_WEIGHT_INPUTS > 0) + /* --- Step A: copy current weights + grad acc → optimizer input buffers --- + * Skipped when codegen has shared the buffers (pointer equality test). 
*/ + for (uint32_t wi = 0; wi < (uint32_t)TRAINING_NUM_WEIGHT_INPUTS; wi++) { + uint32_t train_w_idx = (uint32_t)TRAINING_NUM_DATA_INPUTS + wi; + uint32_t train_g_idx = (uint32_t)TRAINING_GRAD_BUF_START_IDX + wi; + uint32_t opt_w_in = 2u * wi; + uint32_t opt_g_in = 2u * wi + 1u; + + if (DeeployOptNetwork_inputs[opt_w_in] != DeeployNetwork_inputs[train_w_idx]) { + l3_aware_copy(DeeployOptNetwork_inputs[opt_w_in], + DeeployNetwork_inputs[train_w_idx], + DeeployOptNetwork_inputs_bytes[opt_w_in]); + } + if (DeeployOptNetwork_inputs[opt_g_in] != DeeployNetwork_inputs[train_g_idx]) { + l3_aware_copy(DeeployOptNetwork_inputs[opt_g_in], + DeeployNetwork_inputs[train_g_idx], + DeeployOptNetwork_inputs_bytes[opt_g_in]); + } + } + + struct pi_cluster_task opt_task; + pi_cluster_task(&opt_task, RunOptimizerNetwork, NULL); + opt_task.stack_size = MAINSTACKSIZE; + opt_task.slave_stack_size = SLAVESTACKSIZE; + pi_cluster_send_task_to_cl(&cluster_dev, &opt_task); + + /* --- Step C: copy weight_updated back to training network's weight buffers --- + * Skipped when codegen has shared the output buffer with the training input. */ + for (uint32_t wi = 0; wi < (uint32_t)TRAINING_NUM_WEIGHT_INPUTS; wi++) { + uint32_t train_w_idx = (uint32_t)TRAINING_NUM_DATA_INPUTS + wi; + uint32_t opt_w_out = wi; + + if (DeeployOptNetwork_outputs[opt_w_out] == DeeployNetwork_inputs[train_w_idx]) { + continue; /* in-place: training buffer already updated */ + } + + uint32_t opt_bytes = DeeployOptNetwork_outputs_bytes[opt_w_out]; + uint32_t train_bytes = DeeployNetwork_inputs_bytes[train_w_idx]; + if (opt_bytes == train_bytes) { + l3_aware_copy(DeeployNetwork_inputs[train_w_idx], + DeeployOptNetwork_outputs[opt_w_out], + opt_bytes); + } else { + /* Broadcasted bias: fill every tile with updated value. */ + for (uint32_t off = 0; off < train_bytes; off += opt_bytes) { + uint32_t chunk = (off + opt_bytes <= train_bytes) ? 
opt_bytes : (train_bytes - off); + l3_aware_copy((char *)DeeployNetwork_inputs[train_w_idx] + off, + DeeployOptNetwork_outputs[opt_w_out], + chunk); + } + } + } +#endif /* TRAINING_NUM_WEIGHT_INPUTS */ +} + +/* ------------------------------------------------------------------------- + * Numerical comparison helpers — run on cluster (FC has no FPU) + * ---------------------------------------------------------------------- */ + +typedef struct { + float *computed; + float *reference; + uint32_t n; + uint32_t *err_count; +} LossCompareArgs; + +static void CompareLossesOnCluster(void *args) { + if (pi_core_id() != 0) return; + LossCompareArgs *a = (LossCompareArgs *)args; + float tol = TRAINING_TOLERANCE_ABS; /* read on cluster — has FPU */ + uint32_t errors = 0; + for (uint32_t i = 0; i < a->n; i++) { + float diff = a->computed[i] - a->reference[i]; + if (diff < 0.0f) diff = -diff; + printf(" [loss %u] computed=%.6f ref=%.6f diff=%.6f TOL=%.6f\r\n", + i, (double)a->computed[i], (double)a->reference[i], + (double)diff, (double)tol); + if (diff > tol) { + errors++; + } + } + *a->err_count = errors; +} + +/* ------------------------------------------------------------------------- + * main + * ---------------------------------------------------------------------- */ + +int main(void) { + + +printf("=== Siracusa Training Harness (Phase 2 — with OptimizerNetwork) ===\r\n"); +printf("N_TRAIN_STEPS=%u N_ACCUM_STEPS=%u DATA_INPUTS=%u\r\n", + (unsigned)N_TRAIN_STEPS, (unsigned)N_ACCUM_STEPS, + (unsigned)TRAINING_NUM_DATA_INPUTS); + + +// /* ------------------------------------------------------------------ +// * Cluster bring-up +// * ------------------------------------------------------------------ */ + + struct pi_cluster_conf conf; + pi_cluster_conf_init(&conf); + conf.id = 0; + pi_open_from_conf(&cluster_dev, &conf); + if (pi_cluster_open(&cluster_dev)) + return -1; + +#ifndef NOFLASH + mem_init(); + open_fs(); +#endif + + struct pi_cluster_task cluster_task; + + /* 
------------------------------------------------------------------ + * Init training network + * ------------------------------------------------------------------ */ + + printf("Initializing TrainingNetwork...\r\n"); + pi_cluster_task(&cluster_task, InitTrainingNetwork, NULL); + cluster_task.stack_size = MAINSTACKSIZE; + cluster_task.slave_stack_size = SLAVESTACKSIZE; + pi_cluster_send_task_to_cl(&cluster_dev, &cluster_task); + + /* ------------------------------------------------------------------ + * Zero-initialise gradient accumulation buffers. + * ------------------------------------------------------------------ */ + + +for (uint32_t _gi = 0; _gi < (uint32_t)TRAINING_NUM_GRAD_INPUTS; _gi++) { + uint32_t _idx = (uint32_t)TRAINING_GRAD_BUF_START_IDX + _gi; + uint32_t bytes = DeeployNetwork_inputs_bytes[_idx]; + void *buf = DeeployNetwork_inputs[_idx]; + if (IS_L2(buf)) { + memset(buf, 0, bytes); + } else { + /* Write zeros into L3 via DMA using a temporary L2 zero page */ + uint8_t *zero_page = pi_l2_malloc(512); + memset(zero_page, 0, 512); + for (uint32_t off = 0; off < bytes; off += 512) { + uint32_t chunk = (off + 512 <= bytes) ? 512 : (bytes - off); + ram_write((char *)buf + off, zero_page, chunk); + } + pi_l2_free(zero_page, 512); + } +} + + /* ------------------------------------------------------------------ + * Init optimizer network + * ------------------------------------------------------------------ */ + + printf("Initializing OptimizerNetwork...\r\n"); + pi_cluster_task(&cluster_task, InitOptimizerNetwork, NULL); + cluster_task.stack_size = MAINSTACKSIZE; + cluster_task.slave_stack_size = SLAVESTACKSIZE; + pi_cluster_send_task_to_cl(&cluster_dev, &cluster_task); + +// connect_optimizer_buffers(); + +// /* ------------------------------------------------------------------ +// * lazy_reset_grad is the last input of the training network. 
+// * ------------------------------------------------------------------ */ + + uint32_t reset_idx = DeeployNetwork_num_inputs - 1; + + /* ------------------------------------------------------------------ + * Copy initial weights into network input buffers. + * (InitTrainingNetwork only malloc's them; testInitWeights[] holds + * the actual starting values from inputs.npz.) + * ------------------------------------------------------------------ */ + +#if defined(TRAINING_NUM_WEIGHT_INPUTS) && (TRAINING_NUM_WEIGHT_INPUTS > 0) + for (uint32_t wi = 0; wi < (uint32_t)TRAINING_NUM_WEIGHT_INPUTS; wi++) { + uint32_t idx = (uint32_t)TRAINING_NUM_DATA_INPUTS + wi; + l3_aware_copy(DeeployNetwork_inputs[idx], testInitWeights[wi], DeeployNetwork_inputs_bytes[idx]); + } +#endif + + printf("Starting training (%u optimizer steps x %u accum steps)...\r\n", + (unsigned)N_TRAIN_STEPS, (unsigned)N_ACCUM_STEPS); + + uint32_t training_cycles = 0; + uint32_t optimizer_cycles = 0; + + for (uint32_t update_step = 0; update_step < N_TRAIN_STEPS; update_step++) { + + for (uint32_t accum_step = 0; accum_step < N_ACCUM_STEPS; accum_step++) { + + uint32_t mb = update_step * N_ACCUM_STEPS + accum_step; + + printf(" update %u/%u accum %u/%u (mini-batch %u)\r\n", + update_step + 1, (unsigned)N_TRAIN_STEPS, + accum_step + 1, (unsigned)N_ACCUM_STEPS, + mb); + + + /* ① Set lazy_reset_grad. */ + { + void *reset_ptr = DeeployNetwork_inputs[reset_idx]; + uint8_t reset_val = (accum_step == 0) ? 1u : 0u; + if (IS_L2(reset_ptr)) { + *((uint8_t *)reset_ptr) = reset_val; + } else { + ram_write(reset_ptr, &reset_val, sizeof(uint8_t)); + } + } + + /* ② Load this mini-batch's data + labels (cycle through unique samples). */ + for (uint32_t buf = 0; buf < TRAINING_NUM_DATA_INPUTS; buf++) { + l3_aware_copy(DeeployNetwork_inputs[buf], + testDataVector[mb % TRAINING_DATA_SIZE][buf], + DeeployNetwork_inputs_bytes[buf]); + } + + /* ③ Forward + backward + InPlaceAccumulatorV2. 
*/ + pi_cluster_task(&cluster_task, RunTrainingNetwork, NULL); + cluster_task.stack_size = MAINSTACKSIZE; + cluster_task.slave_stack_size = SLAVESTACKSIZE; + pi_cluster_send_task_to_cl(&cluster_dev, &cluster_task); + + /* ④ Store loss — use memcpy to avoid float registers on FC (no FPU). */ + { + void *loss_ptr = DeeployNetwork_outputs[0]; + if (IS_L2(loss_ptr)) { + memcpy(&stored_losses[mb], loss_ptr, sizeof(float)); + } else { + ram_read(&stored_losses[mb], loss_ptr, sizeof(float)); + } + } + + } /* end accum_step loop */ + + /* ⑤ SGD weight update via Deeploy-compiled OptimizerNetwork. */ + run_optimizer_step(); + + } /* end update_step loop */ + + // printf("Training complete.\r\n"); + // printf("Total training cycles : %u\r\n", training_cycles); + // printf("Total optimizer cycles : %u\r\n", optimizer_cycles); + + + /* ------------------------------------------------------------------ + * Numerical verification — run on cluster (FC has no FPU) + * ------------------------------------------------------------------ */ + + uint32_t loss_err_count = 0; + uint32_t total_loss_checks = (TOTAL_FWD_PASSES < N_LOSS_REFS) ? 
TOTAL_FWD_PASSES : N_LOSS_REFS; + LossCompareArgs loss_cmp_args = { + .computed = stored_losses, + .reference = (float *)testLossRef, + .n = total_loss_checks, + .err_count = &loss_err_count, + }; + pi_cluster_task(&cluster_task, CompareLossesOnCluster, &loss_cmp_args); + cluster_task.stack_size = MAINSTACKSIZE; + cluster_task.slave_stack_size = SLAVESTACKSIZE; + pi_cluster_send_task_to_cl(&cluster_dev, &cluster_task); + printf("Errors: %u out of %u\r\n", (unsigned)loss_err_count, (unsigned)total_loss_checks); + + + + return 0; + +} diff --git a/DeeployTest/deeployTrainingRunner.py b/DeeployTest/deeployTrainingRunner.py new file mode 100644 index 0000000000..815d713ad9 --- /dev/null +++ b/DeeployTest/deeployTrainingRunner.py @@ -0,0 +1,30 @@ +#!/usr/bin/env python +# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 +""" +CLI runner for training tests on Siracusa and GAP9. + +Usage: + python deeployTrainingRunner.py -t <testDir> [-p Siracusa|GAP9] [--tiled] [options] + +Examples: + python deeployTrainingRunner.py -t Tests/Models/MLP_Train/simplemlp_train + python deeployTrainingRunner.py -t Tests/Models/MLP_Train/simplemlp_train -p GAP9 + python deeployTrainingRunner.py -t Tests/Models/SmallTransformer/tinytransformer_train --tiled + python deeployTrainingRunner.py -t Tests/Models/SmallTransformer/tinytransformer_train --tiled -p GAP9 +""" + +import argparse +import sys + +from testUtils.deeployTrainingRunner import main + +if __name__ == '__main__': + # Peek at --tiled and -p before passing to main(), which builds its own parser.
+ pre = argparse.ArgumentParser(add_help=False) + pre.add_argument('--tiled', action='store_true', default=False) + pre.add_argument('-p', '--platform', default='Siracusa') + known, _ = pre.parse_known_args() + + sys.exit(main(tiling_enabled=known.tiled, default_platform=known.platform)) diff --git a/DeeployTest/deeployTrainingRunner_siracusa.py b/DeeployTest/deeployTrainingRunner_siracusa.py new file mode 100644 index 0000000000..c13cc31411 --- /dev/null +++ b/DeeployTest/deeployTrainingRunner_siracusa.py @@ -0,0 +1,11 @@ +#!/usr/bin/env python +# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 + +import sys + +from testUtils.deeployTrainingRunner import main + +if __name__ == '__main__': + sys.exit(main(tiling_enabled = False)) diff --git a/DeeployTest/deeployTrainingRunner_tiled_siracusa.py b/DeeployTest/deeployTrainingRunner_tiled_siracusa.py new file mode 100644 index 0000000000..3509fc04fe --- /dev/null +++ b/DeeployTest/deeployTrainingRunner_tiled_siracusa.py @@ -0,0 +1,11 @@ +#!/usr/bin/env python +# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 + +import sys + +from testUtils.deeployTrainingRunner import main + +if __name__ == '__main__': + sys.exit(main(tiling_enabled = True)) diff --git a/DeeployTest/generateOptimizerNetwork.py b/DeeployTest/generateOptimizerNetwork.py new file mode 100644 index 0000000000..b2d3031fe9 --- /dev/null +++ b/DeeployTest/generateOptimizerNetwork.py @@ -0,0 +1,161 @@ +# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 +""" +Optimizer network code-generation entry point. + +Loads the optimizer ONNX graph (containing Deeploy SGD nodes) and emits +OptimizerNetwork.c / OptimizerNetwork.h into the specified output directory. 
+ +The generated code uses the prefix ``DeeployOptNetwork_`` (instead of the +default ``DeeployNetwork_``) so that it can be linked together with the +training network without symbol conflicts. + +Usage +----- + /usr/bin/python generateOptimizerNetwork.py \\ + -t <testDir> \\ # directory containing network.onnx + -d <dumpDir> \\ # where to write OptimizerNetwork.c/h + -p Siracusa \\ + --cores 8 \\ + --lr 0.001 +""" + +import os +import sys +from pathlib import Path + +import numpy as np +import onnx +import onnx_graphsurgeon as gs +from testUtils.codeGenerate import build_shared_buffer_maps, generateOptimizerTestNetwork +from testUtils.platformMapping import mapDeployer, mapPlatform, setupMemoryPlatform +from testUtils.testRunner import TestGeneratorArgumentParser + +from Deeploy.AbstractDataTypes import PointerClass +from Deeploy.CommonExtensions.DataTypes import float32_t +from Deeploy.DeeployTypes import _NoVerbosity +from Deeploy.Logging import DEFAULT_LOGGER as log +from Deeploy.MemoryLevelExtension.MemoryLevels import MemoryHierarchy, MemoryLevel +from Deeploy.MemoryLevelExtension.NetworkDeployers.MemoryLevelDeployer import MemoryDeployerWrapper +from Deeploy.MemoryLevelExtension.OptimizationPasses.MemoryLevelAnnotationPasses import AnnotateDefaultMemoryLevel +from Deeploy.Targets.PULPOpen.Platform import PULPClusterEngine + + +def generateOptimizerNetwork(args): + log.debug("Arguments: %s", args) + + # 1. Load optimizer network.onnx + onnx_path = f'{args.dir}/network.onnx' + onnx_model = onnx.load_model(onnx_path) + graph = gs.import_onnx(onnx_model) + + log.debug(f"Optimizer ONNX inputs: {[i.name for i in onnx_model.graph.input]}") + log.debug(f"Optimizer ONNX outputs: {[o.name for o in onnx_model.graph.output]}") + + # 2.
Platform setup + platform, signProp = mapPlatform(args.platform) + log.debug(f"Platform: {platform} (sign: {signProp})") + + clusters = [e for e in platform.engines if isinstance(e, PULPClusterEngine)] + for cluster in clusters: + cluster.n_cores = args.cores + + # 3. All optimizer inputs are float32 (weights + grad acc buffers). + graph_input_names = [inp.name for inp in onnx_model.graph.input] + inputTypes = {f"input_{i}": PointerClass(float32_t) for i in range(len(graph_input_names))} + inputOffsets = {f"input_{i}": 0 for i in range(len(graph_input_names))} + + # 4. Create and prepare deployer + _DEEPLOYSTATEDIR = os.path.join(args.dumpdir, "deeployStates_optimizer") + + deployer = mapDeployer(platform, + graph, + inputTypes, + name="DeeployOptimizerNetwork", + deeployStateDir=_DEEPLOYSTATEDIR, + inputOffsets=inputOffsets) + + # Set up memory hierarchy so AnnotateDefaultMemoryLevel assigns the correct + # memory level to ConstantBuffers (weights). The optimizer graph is NOT + # tiled, but it must share the same memory-level view as the training graph + # so that weights end up in the same physical location (L2 when L3 is the + # training default, see AnnotateDefaultMemoryLevel). + L3 = MemoryLevel(name="L3", neighbourNames=["L2"], size=64000000) + L2 = MemoryLevel(name="L2", neighbourNames=["L3", "L1"], size=args.l2) + L1 = MemoryLevel(name="L1", neighbourNames=["L2"], size=args.l1) + memoryHierarchy = MemoryHierarchy([L3, L2, L1]) + memoryHierarchy.setDefaultMemoryLevel(args.defaultMemLevel) + defaultTargetMemoryLevel = memoryHierarchy.memoryLevels[args.defaultMemLevel] + + deployer.Platform = setupMemoryPlatform(deployer.Platform, memoryHierarchy, defaultTargetMemoryLevel) + deployer = MemoryDeployerWrapper(deployer, [AnnotateDefaultMemoryLevel(memoryHierarchy)]) + + verbosityCfg = _NoVerbosity + _ = deployer.prepare(verbosityCfg) + + # 5. 
Build shared-buffer maps when the training ONNX is available + shared_input_map: dict = {} + shared_output_map: dict = {} + training_onnx = Path(args.training_dir) / "network.onnx" if args.training_dir else None + if training_onnx and training_onnx.exists(): + shared_input_map, shared_output_map = build_shared_buffer_maps(str(training_onnx), onnx_model) + log.debug(f"[SharedBuffers] input map: {shared_input_map}") + log.debug(f"[SharedBuffers] output map: {shared_output_map}") + log.info(f"[OptimizerNetwork] Sharing {len(shared_input_map)} inputs and " + f"{len(shared_output_map)} outputs with TrainingNetwork") + else: + if args.training_dir: + log.warning(f"[OptimizerNetwork] training_dir set but {training_onnx} not found — " + "generating standalone OptimizerNetwork (no buffer sharing)") + + # 6. Generate OptimizerNetwork.c / OptimizerNetwork.h + os.makedirs(args.dumpdir, exist_ok=True) + generateOptimizerTestNetwork(deployer, args.dumpdir, verbosityCfg, shared_input_map, shared_output_map) + + log.info(f"Optimizer network code generated in: {args.dumpdir}") + print(f"[OptimizerNetwork] Generated OptimizerNetwork.c/h in {args.dumpdir}") + +if __name__ == '__main__': + + parser = TestGeneratorArgumentParser(description="Deeploy Optimizer Network Code Generation.") + parser.add_argument( + "--cores", + type=int, + default=1, + help="Number of cluster cores. Default: 1.", + ) + parser.add_argument( + "--lr", + type=float, + default=0.001, + help="Learning rate (informational only; embedded in optimizer ONNX attributes). Default: 0.001.", + ) + parser.add_argument("--defaultMemLevel", type=str, default="L2", + help="Default memory level (L2 or L3). Must match the training graph. Default: L2.") + parser.add_argument("--l1", type=int, default=64000, help="L1 size in bytes. Default: 64000.") + parser.add_argument("--l2", type=int, default=1024000, help="L2 size in bytes. 
Default: 1024000.") + parser.add_argument( + "--training-dir", + type=str, + default=None, + help="Directory containing the training network.onnx. When provided, " + "weight and grad-acc buffers are shared with TrainingNetwork instead " + "of being allocated independently.", + ) + parser.add_argument('--shouldFail', action='store_true') + parser.set_defaults(shouldFail=False) + + args = parser.parse_args() + + try: + generateOptimizerNetwork(args) + except Exception as e: + if args.shouldFail: + print("\033[92mOptimizer network generation ended, failed as expected!\033[0m") + sys.exit(0) + else: + raise e + + if args.shouldFail: + raise RuntimeError("Expected to fail!") diff --git a/DeeployTest/generateTrainingNetwork.py b/DeeployTest/generateTrainingNetwork.py new file mode 100644 index 0000000000..d27e74aba8 --- /dev/null +++ b/DeeployTest/generateTrainingNetwork.py @@ -0,0 +1,373 @@ +# SPDX-FileCopyrightText: 2024 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 + +import json +import os +import sys + +import numpy as np +import onnx +import onnx_graphsurgeon as gs +from testUtils.codeGenerate import generateTrainingTestNetwork +from testUtils.platformMapping import mapDeployer, mapPlatform +from testUtils.testRunner import TestGeneratorArgumentParser +from testUtils.typeMapping import inferTypeAndOffset + +from Deeploy.AbstractDataTypes import PointerClass +from Deeploy.CommonExtensions.DataTypes import float32_t, uint8_t +from Deeploy.DeeployTypes import _NoVerbosity +from Deeploy.Logging import DEFAULT_LOGGER as log +from Deeploy.Targets.PULPOpen.Platform import PULPClusterEngine, PULPPlatform + +_GRAD_ACC = "_grad.accumulation.buffer" + + +def _load_reference_losses(train_dir: str) -> list: + """Load reference loss values from outputs.npz.""" + outputs_path = os.path.join(train_dir, "outputs.npz") + if not os.path.exists(outputs_path): + log.warning(f"outputs.npz not found at {outputs_path} — loss comparison skipped") + return 
None + + try: + outputs = np.load(outputs_path) + except Exception as e: + log.warning(f"Failed to load outputs.npz: {e} — loss comparison skipped") + return None + + for key in outputs.files: + if 'loss' in key.lower(): + vals = [float(v) for v in np.array(outputs[key]).flatten().tolist()] + log.info(f"Reference losses loaded from outputs.npz['{key}']: {vals}") + return vals + + log.warning("No 'loss' key found in outputs.npz — loss comparison skipped") + return None + + +def _infer_num_data_inputs(inputs_path: str) -> int: + """Auto-detect number of data inputs from inputs.npz. + + Data inputs are the base arr_* entries that have per-mini-batch + variants (mb1_arr_*) in the npz — i.e. entries that actually change + across mini-batches. + + Raises ValueError if no mb1 entries are found (single-mini-batch case) + where the data/weight boundary cannot be determined automatically. + """ + inputs = np.load(inputs_path) + base_keys = sorted(k for k in inputs.files if not k.startswith('mb') and not k.startswith('meta_')) + count = sum(1 for k in base_keys if f'mb1_{k}' in inputs.files) + if count == 0: + raise ValueError( + "Cannot auto-detect num_data_inputs: inputs.npz has only one mini-batch " + "(no mb1_arr_* entries found). Please pass --num-data-inputs explicitly." + ) + return count + + +def _infer_total_mb(inputs_path: str) -> int: + """Count total mini-batches from inputs.npz. + + New format: inputs.npz contains meta_n_batches (total training mini-batches) + and meta_data_size (number of unique samples stored; C harness cycles via modulo). + + Legacy format: count 1 + number of unique mb* indices. 
+ """ + inputs = np.load(inputs_path) + if "meta_n_batches" in inputs.files: + return int(inputs["meta_n_batches"].flat[0]) + mb_indices = set() + for key in inputs.files: + if key.startswith('mb'): + try: + idx = int(key.split('_')[0][2:]) + mb_indices.add(idx) + except ValueError: + pass + return 1 + len(mb_indices) + + +def _infer_data_size(inputs_path: str) -> int: + """Return the number of unique input samples stored in inputs.npz. + + New format: reads meta_data_size. + Legacy format: same as _infer_total_mb (all batches were unique). + """ + inputs = np.load(inputs_path) + if "meta_data_size" in inputs.files: + return int(inputs["meta_data_size"].flat[0]) + return _infer_total_mb(inputs_path) + + +def _infer_n_accum(inputs_path: str) -> int: + """Return the gradient accumulation step count stored in inputs.npz. + + New format: reads meta_n_accum written by the exporter. + Legacy format: defaults to 1 (no gradient accumulation). + """ + inputs = np.load(inputs_path) + if "meta_n_accum" in inputs.files: + return int(inputs["meta_n_accum"].flat[0]) + return 1 + + +def generateTrainingNetwork(args): + log.debug("Arguments: %s", args) + + # 1. Load network.onnx (training graph) + onnx_graph = onnx.load_model(f'{args.dir}/network.onnx') + graph = gs.import_onnx(onnx_graph) + + # 1a. Handle UNDEFINED-typed outputs in training ONNX graphs. + # Backward pass ONNX often doesn't propagate types for gradient outputs. + # (i) Strip UNDEFINED-typed outputs that have no consumers. + # (ii) Patch UNDEFINED-typed outputs WITH consumers to float32 (training default). 
+ _stripped = False + _patched = False + for node in graph.nodes: + filtered = [out for out in node.outputs + if not (out.dtype == 0 and len(out.outputs) == 0)] + if len(filtered) < len(node.outputs): + node.outputs = filtered + _stripped = True + for out in node.outputs: + if out.dtype == 0 and len(out.outputs) > 0: + out.dtype = np.dtype(np.float32) + _patched = True + if _stripped: + graph.cleanup() + log.debug("Stripped UNDEFINED-typed unused optional outputs from graph nodes") + if _patched: + log.debug("Patched UNDEFINED-typed outputs with consumers to float32") + + # 2. Load inputs.npz (new format: no grad acc buf entries) + inputs_path = f'{args.dir}/inputs.npz' + inputs = np.load(inputs_path) + + # 3. Platform setup + platform, signProp = mapPlatform(args.platform) + + log.debug(f"Platform: {platform} (sign: {signProp})") + + # Set cores on cluster engines (same pattern as generateNetwork.py) + clusters = [engine for engine in platform.engines if isinstance(engine, PULPClusterEngine)] + for cluster in clusters: + cluster.n_cores = args.cores + + # 4. Identify grad acc buf positions in the ONNX graph. + graph_input_names = [inp.name for inp in onnx_graph.graph.input] + grad_acc_set = {i for i, n in enumerate(graph_input_names) if _GRAD_ACC in n} + non_grad_indices = [i for i in range(len(graph_input_names)) if i not in grad_acc_set] + + # Base npz arrays: keys that are neither per-mb entries (mb*) nor metadata (meta_*) + base_keys = sorted(k for k in inputs.files if not k.startswith('mb') and not k.startswith('meta_')) + npz_base = [inputs[k] for k in base_keys] + + if len(npz_base) != len(non_grad_indices): + raise ValueError( + f"inputs.npz has {len(npz_base)} base entries but network.onnx has " + f"{len(non_grad_indices)} non-grad-buf inputs. " + f"Re-generate inputs.npz with the updated exporter.") + + # Build inputTypes / inputOffsets for ALL graph input positions. 
+ inputTypes = {} + inputOffsets = {} + + npz_idx = 0 + for graph_idx, name in enumerate(graph_input_names): + if graph_idx in grad_acc_set: + inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t) + inputOffsets[f"input_{graph_idx}"] = 0 + else: + arr = npz_base[npz_idx] + npz_idx += 1 + + if arr.dtype == bool or arr.dtype == np.bool_: + inputTypes[f"input_{graph_idx}"] = PointerClass(uint8_t) + inputOffsets[f"input_{graph_idx}"] = 0 + elif arr.dtype in (np.float32, np.float64): + # Float32 training parameters always stay float32. + # inferTypeAndOffset would misclassify integer-valued floats + # (e.g. LayerNorm gamma=1.0 / beta=0.0) as int8_t. + inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t) + inputOffsets[f"input_{graph_idx}"] = 0 + elif np.prod(arr.shape) == 0: + pass + else: + values = arr.reshape(-1).astype(np.float32) + _type, offset = inferTypeAndOffset(values, signProp=False) + inputTypes[f"input_{graph_idx}"] = _type + inputOffsets[f"input_{graph_idx}"] = offset + + # 5. Create deployer + _DEEPLOYSTATEDIR = os.path.join(args.dumpdir, "deeployStates") + + deployer = mapDeployer(platform, + graph, + inputTypes, + name="DeeployTrainingNetwork", + deeployStateDir=_DEEPLOYSTATEDIR, + inputOffsets=inputOffsets) + + log.debug(f"Deployer: {deployer}") + + # 6. Prepare deployer + verbosityCfg = _NoVerbosity + + _ = deployer.prepare(verbosityCfg) + + # 7. Resolve num_data_inputs, n_steps, n_accum (auto-detect when not given). 
+ + # num_data_inputs: detect from npz mb1 variants if not specified + num_data = args.num_data_inputs + if num_data is None: + num_data = _infer_num_data_inputs(inputs_path) + log.info(f"Auto-detected num_data_inputs={num_data} from inputs.npz") + + # n_steps / n_accum: derive from inputs.npz mini-batch count if not specified + n_steps = args.n_steps + n_accum = args.n_accum + if n_steps is None or n_accum is None: + total_mb = _infer_total_mb(inputs_path) + log.info(f"Auto-detected total_mb={total_mb} from inputs.npz") + if n_steps is None and n_accum is None: + n_accum = _infer_n_accum(inputs_path) + n_steps = max(1, total_mb // n_accum) + elif n_steps is None: + n_steps = max(1, total_mb // n_accum) + else: + n_accum = max(1, total_mb // n_steps) + + log.info(f"Training config: n_steps={n_steps} n_accum={n_accum} num_data_inputs={num_data}") + + # 8. Build unique_mb_data from npz (only data_size unique samples). + # The C harness cycles through them via mb % TRAINING_DATA_SIZE. + total_mb = n_steps * n_accum + data_size = _infer_data_size(inputs_path) + log.info(f"Data cycling: data_size={data_size}, total_mb={total_mb}") + mb0_data = list(npz_base[:num_data]) + + unique_mb_data = [] + for mb in range(data_size): + if mb == 0: + unique_mb_data.append(mb0_data) + else: + mb_row = [] + for buf_idx in range(num_data): + key = f"mb{mb}_arr_{buf_idx:04d}" + mb_row.append(inputs[key] if key in inputs else mb0_data[buf_idx]) + unique_mb_data.append(mb_row) + + # Grad acc buf info for testinputs.h. + if grad_acc_set: + sorted_grad = sorted(grad_acc_set) + grad_buf_start_idx = sorted_grad[0] + else: + grad_buf_start_idx = -1 + num_grad_inputs = len(grad_acc_set) + + # Initial weight arrays: npz_base[num_data .. grad_buf_start_idx-1] + if grad_buf_start_idx > num_data: + init_weights = list(npz_base[num_data:grad_buf_start_idx]) + else: + init_weights = [] + + # 9. Load reference loss from outputs.npz. + reference_losses = _load_reference_losses(args.dir) + + # 10. 
Generate all output files + os.makedirs(args.dumpdir, exist_ok=True) + + generateTrainingTestNetwork(deployer, + unique_mb_data, + args.dumpdir, + verbosityCfg, + n_steps=n_steps, + n_accum=n_accum, + num_data_inputs=num_data, + grad_buf_start_idx=grad_buf_start_idx, + num_grad_inputs=num_grad_inputs, + learning_rate=args.learning_rate, + reference_losses=reference_losses, + init_weights=init_weights, + data_size=data_size, + tolerance_abs=args.tolerance_abs) + + # 11. Write resolved config for execution.py to pick up after subprocess call. + meta = { + "n_train_steps": n_steps, + "n_accum_steps": n_accum, + "training_num_data_inputs": num_data, + } + meta_path = os.path.join(args.dumpdir, "training_meta.json") + with open(meta_path, 'w') as f: + json.dump(meta, f, indent=2) + log.info(f"Training meta written to {meta_path}: {meta}") + + +if __name__ == '__main__': + + parser = TestGeneratorArgumentParser(description="Deeploy Training Code Generation Utility.") + parser.add_argument( + "--cores", + type=int, + default=1, + help="Number of cores on which the network is run. " + "Currently required for im2col buffer sizing on Siracusa. Default: 1.", + ) + parser.add_argument( + "--num-data-inputs", + type=int, + dest="num_data_inputs", + default=None, + help="Number of DATA inputs that change per mini-batch. " + "Auto-detected from ONNX graph if not specified.", + ) + parser.add_argument( + "--n-steps", + type=int, + dest="n_steps", + default=None, + help="N_TRAIN_STEPS: number of gradient-accumulation update steps. " + "Auto-detected from inputs.npz mini-batch count if not specified.", + ) + parser.add_argument( + "--n-accum", + type=int, + dest="n_accum", + default=None, + help="N_ACCUM_STEPS: number of mini-batches per update step. 
" + "Auto-detected from inputs.npz mini-batch count if not specified.", + ) + parser.add_argument( + "--learning-rate", + type=float, + dest="learning_rate", + default=0.001, + help="SGD learning rate emitted as TRAINING_LEARNING_RATE in testinputs.h. Default: 0.001.", + ) + parser.add_argument( + "--tolerance", + type=float, + dest="tolerance_abs", + default=1e-3, + help="Absolute loss tolerance emitted as TRAINING_TOLERANCE_ABS in testoutputs.h. Default: 1e-3.", + ) + parser.add_argument('--shouldFail', action='store_true') + parser.set_defaults(shouldFail=False) + + args = parser.parse_args() + + try: + generateTrainingNetwork(args) + except Exception as e: + if args.shouldFail: + print("\033[92mTraining network generation ended, failed as expected!\033[0m") + sys.exit(0) + else: + raise e + + if args.shouldFail: + raise RuntimeError("Expected to fail!") diff --git a/DeeployTest/testMVPOptimizer.py b/DeeployTest/testMVPOptimizer.py new file mode 100644 index 0000000000..9e29d79c55 --- /dev/null +++ b/DeeployTest/testMVPOptimizer.py @@ -0,0 +1,236 @@ +# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 +""" +Tiled optimizer network code-generation entry point. + +Loads the optimizer ONNX graph (containing Deeploy SGD nodes) and emits +OptimizerNetwork.c / OptimizerNetwork.h into the specified output directory, +using the SB-Tiler to tile SGD kernels through L1. + +The generated code uses the prefix ``DeeployOptNetwork_`` (instead of the +default ``DeeployNetwork_``) so that it can be linked together with the +training network without symbol conflicts. 
+ +Usage +----- + /usr/bin/python testMVPOptimizer.py \\ + -t <testDir> \\ # directory containing network.onnx + -d <dumpDir> \\ # where to write OptimizerNetwork.c/h + -p Siracusa \\ + --cores 8 \\ + --l1 64000 \\ + --l2 1024000 \\ + --defaultMemLevel L2 +""" + +import hashlib +import os +import sys +from pathlib import Path +from typing import List + +import onnx +import onnx_graphsurgeon as gs +from testUtils.codeGenerate import build_shared_buffer_maps, generateOptimizerTestNetwork +from testUtils.platformMapping import mapDeployer, mapPlatform, setupMemoryPlatform +from testUtils.testRunner import TestGeneratorArgumentParser +from testUtils.tilingUtils import TrainingSBTiler + +from Deeploy.AbstractDataTypes import PointerClass +from Deeploy.CommonExtensions.DataTypes import float32_t +from Deeploy.DeeployTypes import CodeGenVerbosity, _NoVerbosity +from Deeploy.Logging import DEFAULT_LOGGER as log +from Deeploy.MemoryLevelExtension.MemoryLevels import MemoryHierarchy, MemoryLevel +from Deeploy.MemoryLevelExtension.NetworkDeployers.MemoryLevelDeployer import MemoryDeployerWrapper +from Deeploy.MemoryLevelExtension.OptimizationPasses.MemoryLevelAnnotationPasses import AnnotateDefaultMemoryLevel, \
 AnnotateIOMemoryLevel +from Deeploy.Targets.PULPOpen.Platform import PULPClusterEngine +from Deeploy.TilingExtension.TilerExtension import TilerDeployerWrapper + + +def _mockScheduler(graph: gs.Graph) -> List[List[gs.Node]]: + """Wrap every node in a singleton list for the Tiler pattern interface.""" + return [[node] for node in graph.nodes] + + +def generateTiledOptimizerNetwork(args) -> None: + log.debug("Arguments: %s", args) + + # 1. Load optimizer network.onnx + onnx_path = f'{args.dir}/network.onnx' + onnx_model = onnx.load_model(onnx_path) + graph = gs.import_onnx(onnx_model) + + log.debug(f"Optimizer ONNX inputs: {[i.name for i in onnx_model.graph.input]}") + log.debug(f"Optimizer ONNX outputs: {[o.name for o in onnx_model.graph.output]}") + + # 2.
Platform setup + platform, signProp = mapPlatform(args.platform) + log.debug(f"Platform: {platform} (sign: {signProp})") + + clusters = [e for e in platform.engines if isinstance(e, PULPClusterEngine)] + for cluster in clusters: + cluster.n_cores = args.cores + + # 3. All optimizer inputs are float32 (weights + grad acc buffers). + graph_input_names = [inp.name for inp in onnx_model.graph.input] + inputTypes = {f"input_{i}": PointerClass(float32_t) for i in range(len(graph_input_names))} + inputOffsets = {f"input_{i}": 0 for i in range(len(graph_input_names))} + + # 4. Create deployer with _mockScheduler (required for TilerDeployerWrapper). + _DEEPLOYSTATEDIR = os.path.join(args.dumpdir, "deeployStates_optimizer") + + deployer = mapDeployer(platform, + graph, + inputTypes, + name="DeeployOptimizerNetwork", + deeployStateDir=_DEEPLOYSTATEDIR, + inputOffsets=inputOffsets, + scheduler=_mockScheduler) + + # 5. Set up memory hierarchy. + # Tiles execute in L1; optimizer I/O (weights, grads) live in L2 (or L3). + L3 = MemoryLevel(name="L3", neighbourNames=["L2"], size=64_000_000) + L2 = MemoryLevel(name="L2", neighbourNames=["L3", "L1"], size=args.l2) + L1 = MemoryLevel(name="L1", neighbourNames=["L2"], size=args.l1) + memoryHierarchy = MemoryHierarchy([L3, L2, L1]) + memoryHierarchy.setDefaultMemoryLevel(args.defaultMemLevel) + + defaultTargetMemLevel = L1 + defaultIoMemLevel = memoryHierarchy.memoryLevels[args.defaultMemLevel] + + # 6. Wrap with memory-level annotation. + deployer.Platform = setupMemoryPlatform(deployer.Platform, memoryHierarchy, defaultTargetMemLevel) + deployer = MemoryDeployerWrapper(deployer, [ + AnnotateIOMemoryLevel(defaultIoMemLevel.name), + AnnotateDefaultMemoryLevel(memoryHierarchy), + ]) + + # 7. Wrap with SBTiler (single-buffering; optimizer is forward-only, no lifetime extension needed). 
+ unique_params = f"{args.dumpdir}_L1{args.l1}_L2{args.l2}_{args.defaultMemLevel}_optimizer" + testIdentifier = hashlib.md5(unique_params.encode()).hexdigest()[:16] + + # TrainingSBTiler extends all input buffer lifetimes to the end of the + # schedule (via TrainingMemoryScheduler). This prevents the allocator from + # reusing the space of a consumed input (e.g. fc1 weight) for a later + # output (e.g. fc2 updated weight), which would corrupt the weight buffer. + deployer = TilerDeployerWrapper(deployer, TrainingSBTiler, testName=testIdentifier, workDir=args.dumpdir) + deployer.tiler.visualizeMemoryAlloc = args.plotMemAlloc + deployer.tiler.memoryAllocStrategy = args.memAllocStrategy + deployer.tiler.searchStrategy = args.searchStrategy + + # 8. Prepare deployer. + verbosityCfg = _NoVerbosity + if args.profileTiling: + verbosityCfg = CodeGenVerbosity(tilingProfiling=True) + _ = deployer.prepare(verbosityCfg) + + # 9. Build shared-buffer maps when the training ONNX is available + shared_input_map: dict = {} + shared_output_map: dict = {} + training_onnx = Path(args.training_dir) / "network.onnx" if args.training_dir else None + if training_onnx and training_onnx.exists(): + shared_input_map, shared_output_map = build_shared_buffer_maps(str(training_onnx), onnx_model) + log.debug(f"[SharedBuffers] input map: {shared_input_map}") + log.debug(f"[SharedBuffers] output map: {shared_output_map}") + log.info(f"[TiledOptimizerNetwork] Sharing {len(shared_input_map)} inputs and " + f"{len(shared_output_map)} outputs with TrainingNetwork") + else: + if args.training_dir: + log.warning(f"[TiledOptimizerNetwork] training_dir set but {training_onnx} not found — " + "generating standalone OptimizerNetwork (no buffer sharing)") + + # 10. 
Generate OptimizerNetwork.c / OptimizerNetwork.h + os.makedirs(args.dumpdir, exist_ok=True) + generateOptimizerTestNetwork(deployer, args.dumpdir, verbosityCfg, shared_input_map, shared_output_map) + + log.info(f"Tiled optimizer network code generated in: {args.dumpdir}") + print(f"[TiledOptimizerNetwork] Generated OptimizerNetwork.c/h in {args.dumpdir}") + + +if __name__ == '__main__': + + parser = TestGeneratorArgumentParser(description="Deeploy Tiled Optimizer Network Code Generation.") + + parser.add_argument( + "--cores", + type=int, + default=1, + help="Number of cluster cores. Default: 1.", + ) + parser.add_argument( + "--lr", + type=float, + default=0.001, + help="Learning rate (informational only; embedded in optimizer ONNX attributes). Default: 0.001.", + ) + parser.add_argument( + '--l1', + type=int, + dest='l1', + default=64_000, + help='L1 size in bytes. Default: 64000.', + ) + parser.add_argument( + '--l2', + type=int, + dest='l2', + default=1_024_000, + help='L2 size in bytes. Default: 1024000.', + ) + parser.add_argument( + '--defaultMemLevel', + type=str, + dest='defaultMemLevel', + default="L2", + help='Default memory level for optimizer I/O buffers (L2 or L3). Must match the training graph. Default: L2.', + ) + parser.add_argument( + '--memAllocStrategy', + type=str, + dest='memAllocStrategy', + default="MiniMalloc", + help='Memory allocation strategy. Default: MiniMalloc.', + ) + parser.add_argument( + '--searchStrategy', + type=str, + dest='searchStrategy', + default="random-max", + help='CP solver search strategy. 
Default: random-max.', + ) + parser.add_argument( + '--plotMemAlloc', + action='store_true', + help='Save memory allocation plots in the deeployStates folder.', + ) + parser.add_argument( + '--profileTiling', + action='store_true', + help='Enable tiling profiling (inserts cycle counters around each tiled kernel).', + ) + parser.add_argument( + "--training-dir", + type=str, + default=None, + help="Directory containing the training network.onnx. When provided, " + "weight and grad-acc buffers are shared with TrainingNetwork instead " + "of being allocated independently.", + ) + parser.add_argument('--shouldFail', action='store_true') + parser.set_defaults(shouldFail=False) + + args = parser.parse_args() + + try: + generateTiledOptimizerNetwork(args) + except Exception as e: + if args.shouldFail: + print("\033[92mTiled optimizer network generation ended, failed as expected!\033[0m") + sys.exit(0) + else: + raise e + + if args.shouldFail: + raise RuntimeError("Expected to fail!") diff --git a/DeeployTest/testMVPTraining.py b/DeeployTest/testMVPTraining.py new file mode 100644 index 0000000000..30b23dd1e3 --- /dev/null +++ b/DeeployTest/testMVPTraining.py @@ -0,0 +1,421 @@ +# SPDX-FileCopyrightText: 2024 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 + +import hashlib +import json +import os +import sys +from typing import List + +import numpy as np +import onnx +import onnx_graphsurgeon as gs +from testUtils.codeGenerate import generateTrainingTestNetwork +from testUtils.platformMapping import mapDeployer, mapPlatform, setupMemoryPlatform +from testUtils.testRunner import TestGeneratorArgumentParser +from testUtils.tilingUtils import TrainingSBTiler +from testUtils.typeMapping import inferTypeAndOffset + +from Deeploy.AbstractDataTypes import PointerClass +from Deeploy.CommonExtensions.DataTypes import float32_t, uint8_t +from Deeploy.DeeployTypes import CodeGenVerbosity, NetworkDeployer, _NoVerbosity +from Deeploy.Logging import 
def _load_reference_losses(train_dir: str) -> list:
    """Return the per-step reference losses stored in outputs.npz, or None.

    Looks for the first key containing 'loss' (case-insensitive) inside
    ``<train_dir>/outputs.npz`` and flattens it to a list of floats. Any
    failure (missing file, unreadable archive, no loss key) is logged as a
    warning and reported as None so the caller can skip loss comparison.
    """
    outputs_path = os.path.join(train_dir, "outputs.npz")
    if not os.path.exists(outputs_path):
        log.warning(f"outputs.npz not found at {outputs_path} — loss comparison skipped")
        return None
    try:
        archive = np.load(outputs_path)
    except Exception as e:
        log.warning(f"Failed to load outputs.npz: {e} — loss comparison skipped")
        return None
    # First key whose name mentions 'loss' wins, in archive order.
    key = next((k for k in archive.files if 'loss' in k.lower()), None)
    if key is None:
        log.warning("No 'loss' key found in outputs.npz — loss comparison skipped")
        return None
    vals = [float(v) for v in np.array(archive[key]).flatten().tolist()]
    log.info(f"Reference losses loaded from outputs.npz['{key}']: {vals}")
    return vals
Please pass --num-data-inputs explicitly.") + return count + + +def _infer_total_mb(inputs_path: str) -> int: + inputs = np.load(inputs_path) + if "meta_n_batches" in inputs.files: + return int(inputs["meta_n_batches"].flat[0]) + mb_indices = set() + for key in inputs.files: + if key.startswith('mb'): + try: + idx = int(key.split('_')[0][2:]) + mb_indices.add(idx) + except ValueError: + pass + return 1 + len(mb_indices) + + +def _infer_data_size(inputs_path: str) -> int: + inputs = np.load(inputs_path) + if "meta_data_size" in inputs.files: + return int(inputs["meta_data_size"].flat[0]) + return _infer_total_mb(inputs_path) + + +def _infer_n_accum(inputs_path: str) -> int: + inputs = np.load(inputs_path) + if "meta_n_accum" in inputs.files: + return int(inputs["meta_n_accum"].flat[0]) + return 1 + + +# --------------------------------------------------------------------------- +# Mock scheduler (same as testMVP.py) +# --------------------------------------------------------------------------- + +def _mockScheduler(graph: gs.Graph) -> List[List[gs.Node]]: + """Wrap every node in a singleton list for the Tiler pattern interface.""" + return [[node] for node in graph.nodes] + + +# --------------------------------------------------------------------------- +# Main generation function +# --------------------------------------------------------------------------- + +def generateTiledTrainingNetwork(args) -> None: + log.debug("Arguments: %s", args) + + # 1. Load network.onnx (training graph with forward + backward ops). + onnx_graph = onnx.load_model(f'{args.dir}/network.onnx') + graph = gs.import_onnx(onnx_graph) + + # 1a. Strip UNDEFINED-typed unused optional outputs (e.g. MaxPool mask indices). 
+ _stripped = False + for node in graph.nodes: + filtered = [out for out in node.outputs if not (out.dtype == 0 and len(out.outputs) == 0)] + if len(filtered) < len(node.outputs): + node.outputs = filtered + _stripped = True + if _stripped: + graph.cleanup() + log.debug("Stripped UNDEFINED-typed unused optional outputs from graph nodes") + + # 2. Load inputs.npz. + inputs_path = f'{args.dir}/inputs.npz' + inputs = np.load(inputs_path) + + # 3. Platform setup. + platform, signProp = mapPlatform(args.platform) + log.debug(f"Platform: {platform} (sign: {signProp})") + + clusters = [engine for engine in platform.engines if isinstance(engine, PULPClusterEngine)] + for cluster in clusters: + cluster.n_cores = args.cores + + # 4. Identify grad acc buf positions in the ONNX graph. + graph_input_names = [inp.name for inp in onnx_graph.graph.input] + grad_acc_set = {i for i, n in enumerate(graph_input_names) if _GRAD_ACC in n} + non_grad_indices = [i for i in range(len(graph_input_names)) if i not in grad_acc_set] + + base_keys = sorted(k for k in inputs.files if not k.startswith('mb') and not k.startswith('meta_')) + npz_base = [inputs[k] for k in base_keys] + + if len(npz_base) != len(non_grad_indices): + raise ValueError( + f"inputs.npz has {len(npz_base)} base entries but network.onnx has " + f"{len(non_grad_indices)} non-grad-buf inputs. " + f"Re-generate inputs.npz with the updated exporter.") + + # 5. Build inputTypes / inputOffsets for ALL graph input positions. 
+ inputTypes = {} + inputOffsets = {} + + npz_idx = 0 + for graph_idx, name in enumerate(graph_input_names): + if graph_idx in grad_acc_set: + inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t) + inputOffsets[f"input_{graph_idx}"] = 0 + else: + arr = npz_base[npz_idx] + npz_idx += 1 + if arr.dtype == bool or arr.dtype == np.bool_: + inputTypes[f"input_{graph_idx}"] = PointerClass(uint8_t) + inputOffsets[f"input_{graph_idx}"] = 0 + elif arr.dtype in (np.float32, np.float64): + inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t) + inputOffsets[f"input_{graph_idx}"] = 0 + elif np.prod(arr.shape) == 0: + pass + else: + values = arr.reshape(-1).astype(np.float32) + _type, offset = inferTypeAndOffset(values, signProp=False) + inputTypes[f"input_{graph_idx}"] = _type + inputOffsets[f"input_{graph_idx}"] = offset + + # 6. Create deployer with _mockScheduler (required for TilerDeployerWrapper). + _DEEPLOYSTATEDIR = os.path.join(args.dumpdir, "deeployStates") + + deployer = mapDeployer(platform, + graph, + inputTypes, + name="DeeployTrainingNetwork", + deeployStateDir=_DEEPLOYSTATEDIR, + inputOffsets=inputOffsets, + scheduler=_mockScheduler) + + # 7. Set up memory hierarchy. + L3 = MemoryLevel(name="L3", neighbourNames=["L2"], size=64_000_000) + L2 = MemoryLevel(name="L2", neighbourNames=["L3", "L1"], size=args.l2) + L1 = MemoryLevel(name="L1", neighbourNames=["L2"], size=args.l1) + memoryHierarchy = MemoryHierarchy([L3, L2, L1]) + memoryHierarchy.setDefaultMemoryLevel(args.defaultMemLevel) + + defaultTargetMemLevel = L1 + defaultIoMemLevel = memoryHierarchy.memoryLevels[args.defaultMemLevel] + + # 8. Wrap with memory-level annotation. + deployer.Platform = setupMemoryPlatform(deployer.Platform, memoryHierarchy, defaultTargetMemLevel) + + deployer = MemoryDeployerWrapper(deployer, [ + AnnotateIOMemoryLevel(defaultIoMemLevel.name), + AnnotateDefaultMemoryLevel(memoryHierarchy), + ]) + + # 9. 
Wrap with tiler (TrainingSBTiler: SB strategy + extended input lifetimes for backward pass). + unique_params = f"{args.dumpdir}_L1{args.l1}_L2{args.l2}_{args.defaultMemLevel}" + testIdentifier = hashlib.md5(unique_params.encode()).hexdigest()[:16] + + deployer = TilerDeployerWrapper(deployer, TrainingSBTiler, testName=testIdentifier, workDir=args.dumpdir) + deployer.tiler.visualizeMemoryAlloc = args.plotMemAlloc + deployer.tiler.memoryAllocStrategy = args.memAllocStrategy + deployer.tiler.searchStrategy = args.searchStrategy + + # 10. Prepare deployer. + verbosityCfg = _NoVerbosity + if args.profileTiling: + verbosityCfg = CodeGenVerbosity(tilingProfiling = True) + _ = deployer.prepare(verbosityCfg) + + # 11. Resolve num_data_inputs, n_steps, n_accum. + num_data = args.num_data_inputs + if num_data is None: + num_data = _infer_num_data_inputs(inputs_path) + log.info(f"Auto-detected num_data_inputs={num_data} from inputs.npz") + + n_steps = args.n_steps + n_accum = args.n_accum + if n_steps is None or n_accum is None: + total_mb = _infer_total_mb(inputs_path) + log.info(f"Auto-detected total_mb={total_mb} from inputs.npz") + if n_steps is None and n_accum is None: + n_accum = _infer_n_accum(inputs_path) + n_steps = max(1, total_mb // n_accum) + elif n_steps is None: + n_steps = max(1, total_mb // n_accum) + else: + n_accum = max(1, total_mb // n_steps) + + log.info(f"Training config: n_steps={n_steps} n_accum={n_accum} num_data_inputs={num_data}") + + # 12. Build unique_mb_data from npz. 
+ total_mb = n_steps * n_accum + data_size = _infer_data_size(inputs_path) + log.info(f"Data cycling: data_size={data_size}, total_mb={total_mb}") + mb0_data = list(npz_base[:num_data]) + + unique_mb_data = [] + for mb in range(data_size): + if mb == 0: + unique_mb_data.append(mb0_data) + else: + mb_row = [] + for buf_idx in range(num_data): + key = f"mb{mb}_arr_{buf_idx:04d}" + mb_row.append(inputs[key] if key in inputs else mb0_data[buf_idx]) + unique_mb_data.append(mb_row) + + # Grad acc buf info for testinputs.h. + if grad_acc_set: + sorted_grad = sorted(grad_acc_set) + grad_buf_start_idx = sorted_grad[0] + else: + grad_buf_start_idx = -1 + num_grad_inputs = len(grad_acc_set) + + if grad_buf_start_idx > num_data: + init_weights = list(npz_base[num_data:grad_buf_start_idx]) + else: + init_weights = [] + + # 13. Load reference losses. + reference_losses = _load_reference_losses(args.dir) + + # 14. Generate output files. + os.makedirs(args.dumpdir, exist_ok=True) + + generateTrainingTestNetwork(deployer, + unique_mb_data, + args.dumpdir, + verbosityCfg, + n_steps=n_steps, + n_accum=n_accum, + num_data_inputs=num_data, + grad_buf_start_idx=grad_buf_start_idx, + num_grad_inputs=num_grad_inputs, + learning_rate=args.learning_rate, + reference_losses=reference_losses, + init_weights=init_weights, + data_size=data_size, + tolerance_abs=args.tolerance_abs) + + # 15. Write resolved config for execution.py to pick up. 
+ meta = { + "n_train_steps": n_steps, + "n_accum_steps": n_accum, + "training_num_data_inputs": num_data, + } + meta_path = os.path.join(args.dumpdir, "training_meta.json") + with open(meta_path, 'w') as f: + json.dump(meta, f, indent=2) + log.info(f"Training meta written to {meta_path}: {meta}") + + +# --------------------------------------------------------------------------- +# CLI +# --------------------------------------------------------------------------- + +if __name__ == '__main__': + + parser = TestGeneratorArgumentParser(description="Deeploy Tiled Training Code Generation Utility.") + + # Training params (same as generateTrainingNetwork.py) + parser.add_argument( + "--cores", + type=int, + default=1, + help="Number of cores on which the network is run. Default: 1.", + ) + parser.add_argument( + "--num-data-inputs", + type=int, + dest="num_data_inputs", + default=None, + help="Number of DATA inputs that change per mini-batch. Auto-detected if not specified.", + ) + parser.add_argument( + "--n-steps", + type=int, + dest="n_steps", + default=None, + help="N_TRAIN_STEPS: number of gradient-accumulation update steps.", + ) + parser.add_argument( + "--n-accum", + type=int, + dest="n_accum", + default=None, + help="N_ACCUM_STEPS: number of mini-batches per update step.", + ) + parser.add_argument( + "--learning-rate", + type=float, + dest="learning_rate", + default=0.001, + help="SGD learning rate emitted as TRAINING_LEARNING_RATE in testinputs.h. Default: 0.001.", + ) + + # Tiling params (same as testMVP.py) + parser.add_argument( + '--l1', + type=int, + dest='l1', + default=64_000, + help='Set L1 size in bytes. Default: 64000.', + ) + parser.add_argument( + '--l2', + type=int, + dest='l2', + default=1_024_000, + help='Set L2 size in bytes. Default: 1024000.', + ) + parser.add_argument( + '--defaultMemLevel', + type=str, + dest='defaultMemLevel', + default="L2", + help='Default memory level for IO buffers. 
Default: L2.', + ) + parser.add_argument( + '--memAllocStrategy', + type=str, + dest='memAllocStrategy', + default="MiniMalloc", + help='Memory allocation strategy. Default: MiniMalloc.', + ) + parser.add_argument( + '--searchStrategy', + type=str, + dest='searchStrategy', + default="random-max", + help='CP solver search strategy. Default: random-max.', + ) + parser.add_argument( + '--plotMemAlloc', + action='store_true', + help='Save memory allocation plots in the deeployStates folder.', + ) + parser.add_argument( + '--profileTiling', + action='store_true', + help='Enable tiling profiling (inserts cycle counters around each tiled kernel).', + ) + parser.add_argument( + '--tolerance', + type=float, + dest='tolerance_abs', + default=1e-3, + help='Absolute loss tolerance emitted as TRAINING_TOLERANCE_ABS in testoutputs.h. Default: 1e-3.', + ) + parser.add_argument('--shouldFail', action='store_true') + parser.set_defaults(shouldFail=False) + + args = parser.parse_args() + + try: + generateTiledTrainingNetwork(args) + except Exception as e: + if args.shouldFail: + print("\033[92mTiled training network generation ended, failed as expected!\033[0m") + sys.exit(0) + else: + raise e + + if args.shouldFail: + raise RuntimeError("Expected to fail!") diff --git a/DeeployTest/testUtils/codeGenerate.py b/DeeployTest/testUtils/codeGenerate.py index 39a44d9442..ea73d320e1 100644 --- a/DeeployTest/testUtils/codeGenerate.py +++ b/DeeployTest/testUtils/codeGenerate.py @@ -3,7 +3,9 @@ # SPDX-License-Identifier: Apache-2.0 import os -from typing import List, Tuple +import re +from pathlib import Path +from typing import Dict, List, Optional, Tuple import numpy as np @@ -194,6 +196,15 @@ def generateTestNetworkImplementation(deployer: NetworkDeployer, verbosityCfg: C """ retStr += deployer.generateEngineInitializationCode() retStr += deployer.generateBufferAllocationCode() + + # Initialize all output buffers to zero + output_idx = 0 + while 
deployer.ctxt.is_buffer(f'output_{output_idx}'): + output_buffer = deployer.ctxt.lookup(f'output_{output_idx}') + output_size = np.prod(output_buffer.shape) if hasattr(output_buffer, 'shape') else output_buffer._type.referencedType.typeWidth + typeName = output_buffer._type.referencedType.typeName + output_idx += 1 + retStr += """ } """ @@ -287,3 +298,865 @@ def generateTestNetwork(deployer: NetworkDeployer, test_inputs: List[np.ndarray] os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/Network.h') os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/testoutputs.h') os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/testinputs.h') + + +# --------------------------------------------------------------------------- +# Training code-generation helpers +# --------------------------------------------------------------------------- + + +def generateTrainingTestInputsHeader(deployer: NetworkDeployer, all_mb_data: List[List[np.ndarray]], n_steps: int, + n_accum: int, grad_buf_start_idx: int = 0, num_grad_inputs: int = 0, + learning_rate: float = 0.001, init_weights: List[np.ndarray] = None, + data_size: int = None) -> str: + """Generate testinputs.h for training tests. + + Parameters + ---------- + deployer : NetworkDeployer + Prepared deployer (used to look up buffer types). + all_mb_data : list of list of np.ndarray + Per-mini-batch DATA arrays: ``all_mb_data[mb][buf]`` is the array for + mini-batch *mb* and DATA buffer *buf*. All mini-batches must have the + same number of buffers. + n_steps : int + N_TRAIN_STEPS macro value. + n_accum : int + N_ACCUM_STEPS macro value. + grad_buf_start_idx : int + Index of the first grad accumulation buffer in DeeployNetwork_inputs[]. + Used to emit TRAINING_GRAD_BUF_START_IDX. Pass 0 (and num_grad_inputs=0) + to suppress the define (e.g. when no grad bufs exist). + num_grad_inputs : int + Number of grad accumulation buffers. Used to emit TRAINING_NUM_GRAD_INPUTS. 
+ + Returns + ------- + str + C header string. + """ + total_mb = n_steps * n_accum + num_data = len(all_mb_data[0]) if all_mb_data else 0 + # data_size: number of unique samples stored in C arrays. + # C harness cycles: testDataVector[mb % TRAINING_DATA_SIZE]. + # Defaults to total_mb (no cycling) for backward compatibility. + effective_data_size = data_size if (data_size is not None and data_size < total_mb) else total_mb + + retStr = "" + retStr += f"#define N_TRAIN_STEPS {n_steps}\n" + retStr += f"#define N_ACCUM_STEPS {n_accum}\n" + retStr += f"#define TRAINING_DATA_SIZE {effective_data_size}\n" + retStr += f"#define TRAINING_NUM_DATA_INPUTS {num_data}\n" + if num_grad_inputs > 0: + retStr += f"#define TRAINING_GRAD_BUF_START_IDX {grad_buf_start_idx}\n" + retStr += f"#define TRAINING_NUM_GRAD_INPUTS {num_grad_inputs}\n" + num_weight_inputs = grad_buf_start_idx - num_data + retStr += f"#define TRAINING_NUM_WEIGHT_INPUTS {num_weight_inputs}\n" + retStr += f"#define TRAINING_LEARNING_RATE {learning_rate:.10g}f\n" + retStr += "\n" + + # Emit per-mini-batch buffer arrays — only effective_data_size unique rows. + # all_mb_data must contain exactly effective_data_size rows. + for mb in range(effective_data_size): + mb_data = all_mb_data[mb] if mb < len(all_mb_data) else all_mb_data[-1] + row_entries = [] + for buf_idx, arr in enumerate(mb_data): + values = arr.reshape(-1) + + # Determine C type from deployer context (buffer "input_N"). 
+ input_key = f"input_{buf_idx}" + if deployer.ctxt.is_buffer(input_key): + buffer = deployer.ctxt.lookup(input_key) + typeName = buffer._type.referencedType.typeName + typeWidth = buffer._type.referencedType.typeWidth + else: + # Fallback: infer from numpy dtype + if arr.dtype == np.float32 or arr.dtype == np.float64: + typeName = "float32_t" + typeWidth = 32 + elif arr.dtype == np.int64: + typeName = "int64_t" + typeWidth = 64 + elif arr.dtype == np.bool_ or arr.dtype == bool: + typeName = "uint8_t" + typeWidth = 8 + else: + typeName = "int32_t" + typeWidth = 32 + + buf_name = f"testData_mb{mb}_buf{buf_idx}" + row_entries.append(buf_name) + + # Format values + if typeName == 'float32_t': + list_str = ", ".join([ + f'{float(x)}f' if not (np.isinf(x) or np.isnan(x)) else str(x) for x in values.astype(np.float32) + ]) + else: + list_str = ", ".join([str(x) for x in values]) + + # 4-byte alignment padding + total_bytes = (values.size * typeWidth) // 8 + pad_bytes = (-total_bytes) % 4 + if pad_bytes: + paddingElements = (pad_bytes * 8 + typeWidth - 1) // typeWidth + list_str += ", " + ", ".join("0" for _ in range(paddingElements)) + + retStr += f"{typeName} {buf_name}[] = {{{list_str}}};\n" + + # Emit the row pointer array for this mini-batch + row_name = f"testDataRow{mb}" + retStr += f"void* {row_name}[] = {{{', '.join(f'(void*){e}' for e in row_entries)}}};\n" + retStr += "\n" + + # Emit the top-level vector of row pointers (only unique samples; C harness cycles via modulo). + retStr += f"void** testDataVector[{effective_data_size}] = {{{', '.join(f'testDataRow{mb}' for mb in range(effective_data_size))}}};\n" + + # Emit initial weight arrays (one per weight input, indices num_data..grad_buf_start_idx-1). 
def generateTrainingTestOutputsHeader(
    reference_losses: List = None,
    tolerance_abs: float = 1e-3,
) -> str:
    """Generate testoutputs.h for training tests — loss comparison only.

    Parameters
    ----------
    reference_losses : list of float, optional
        Reference loss value for each forward pass (one per mini-batch step).
        If None (or empty), loss comparison is skipped.
    tolerance_abs : float
        Absolute comparison tolerance emitted as TRAINING_TOLERANCE_ABS.

    Returns
    -------
    str
        C header string.
    """
    parts = [
        "// testoutputs.h — Phase 2: loss verification\n",
        f"#define TRAINING_TOLERANCE_ABS {tolerance_abs:.10g}f\n\n",
    ]

    if reference_losses is not None and len(reference_losses) > 0:
        count = len(reference_losses)
        formatted = ", ".join(f"{float(loss):.10g}f" for loss in reference_losses)
        parts.append("// Expected loss for each forward pass (one per mini-batch)\n")
        parts.append(f"#define N_LOSS_REFS {count}\n")
        parts.append(f"float32_t testLossRef[{count}] = {{{formatted}}};\n\n")
    else:
        parts.append("// No loss reference available — loss comparison skipped.\n")
        parts.append("#define N_LOSS_REFS 0\n\n")

    return "".join(parts)
def generateTrainingTestNetwork(deployer: NetworkDeployer, all_mb_data: List[List[np.ndarray]], dumpdir: str,
                                verbosityCfg: CodeGenVerbosity, n_steps: int = 1, n_accum: int = 1,
                                num_data_inputs: int = 2, grad_buf_start_idx: int = 0, num_grad_inputs: int = 0,
                                learning_rate: float = 0.001, reference_losses: List = None,
                                init_weights: List = None, data_size: int = None,
                                tolerance_abs: float = 1e-3) -> None:
    """Generate all training test files: testinputs.h, testoutputs.h, TrainingNetwork.h, TrainingNetwork.c.

    Parameters
    ----------
    deployer : NetworkDeployer
        Prepared deployer (ctxt.name must already be set to "DeeployTrainingNetwork").
    all_mb_data : list of list of np.ndarray
        Per-mini-batch DATA arrays: ``all_mb_data[mb][buf]`` is the array for
        mini-batch *mb* and DATA buffer *buf*.
    dumpdir : str
        Output directory for generated files.
    verbosityCfg : CodeGenVerbosity
        Verbosity configuration.
    n_steps : int
        N_TRAIN_STEPS value.
    n_accum : int
        N_ACCUM_STEPS value.
    num_data_inputs : int
        Number of data inputs (TRAINING_NUM_DATA_INPUTS).
    grad_buf_start_idx : int
        Index of the first grad accumulation buffer in DeeployNetwork_inputs[].
    num_grad_inputs : int
        Number of grad accumulation buffers (TRAINING_NUM_GRAD_INPUTS).
    learning_rate : float
        Forwarded to testinputs.h as TRAINING_LEARNING_RATE.
    reference_losses : list of float, optional
        Forwarded to testoutputs.h; None disables loss comparison.
    init_weights : list of np.ndarray, optional
        Initial weight arrays for inputs num_data_inputs..grad_buf_start_idx-1.
    data_size : int, optional
        Number of unique data samples emitted; the C harness cycles modulo it.
    tolerance_abs : float
        Forwarded to testoutputs.h as TRAINING_TOLERANCE_ABS.
    """
    assert deployer.prepared, "An unprepared deployer was given"

    os.makedirs(dumpdir, exist_ok=True)

    # testinputs.h
    testInputStr = generateTrainingTestInputsHeader(deployer, all_mb_data, n_steps, n_accum, grad_buf_start_idx,
                                                    num_grad_inputs, learning_rate, init_weights=init_weights,
                                                    data_size=data_size)
    with open(f'{dumpdir}/testinputs.h', 'w') as f:
        f.write(testInputStr)

    # testoutputs.h
    testOutputStr = generateTrainingTestOutputsHeader(
        reference_losses=reference_losses,
        tolerance_abs=tolerance_abs,
    )
    with open(f'{dumpdir}/testoutputs.h', 'w') as f:
        f.write(testOutputStr)

    # TrainingNetwork.h
    headerStr = generateTrainingNetworkHeader(deployer)
    with open(f'{dumpdir}/TrainingNetwork.h', 'w') as f:
        f.write(headerStr)

    # TrainingNetwork.c
    implStr = generateTrainingNetworkImplementation(deployer, verbosityCfg)
    with open(f'{dumpdir}/TrainingNetwork.c', 'w') as f:
        f.write(implStr)

    # Normalize formatting of all four emitted files (no-op if clang-format
    # is not installed; os.system's return code is deliberately ignored).
    clang_format = "{BasedOnStyle: llvm, IndentWidth: 2, ColumnLimit: 160}"
    for fname in ['TrainingNetwork.c', 'TrainingNetwork.h', 'testinputs.h', 'testoutputs.h']:
        os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/{fname}')

    # Build initial-value list for every input_N buffer so that L3 hex files
    # can be written. The list must cover all N where "input_N" exists in the
    # deployer context. Layout (must match DeeployNetwork_inputs[] order):
    #   [0 .. num_data_inputs-1]              → first mini-batch data
    #   [num_data_inputs .. grad_start-1]     → initial weights
    #   [grad_start .. grad_start+num_grad-1] → zeros (grad acc bufs)
    #   [last]                                → lazy_reset_grad = 1 (uint8)
    l3_initial_inputs: List[np.ndarray] = []
    # Count how many input_N buffers exist in the deployer context
    n_total_inputs = sum(1 for name in deployer.ctxt.globalObjects
                         if name.startswith("input_") and name[len("input_"):].isdigit())
    for i in range(n_total_inputs):
        # NOTE(review): the data branch keys off len(all_mb_data[0]) rather
        # than num_data_inputs — these should agree; confirm at call sites.
        if all_mb_data and i < len(all_mb_data[0]):
            # Data / label input
            l3_initial_inputs.append(all_mb_data[0][i])
        elif (init_weights is not None and grad_buf_start_idx > 0
              and num_data_inputs <= i < grad_buf_start_idx):
            # Weight input
            wi = i - num_data_inputs
            l3_initial_inputs.append(init_weights[wi] if wi < len(init_weights) else np.array([0.0], dtype=np.float32))
        elif (grad_buf_start_idx > 0 and num_grad_inputs > 0
              and grad_buf_start_idx <= i < grad_buf_start_idx + num_grad_inputs):
            # Gradient accumulation buffer — zero-initialised
            buf = deployer.ctxt.globalObjects.get(f"input_{i}")
            shape = buf.shape if (buf is not None and hasattr(buf, 'shape')) else (1,)
            l3_initial_inputs.append(np.zeros(shape, dtype=np.float32))
        else:
            # lazy_reset_grad (last input) or any unknown slot — default 1 / uint8
            buf = deployer.ctxt.globalObjects.get(f"input_{i}")
            shape = buf.shape if (buf is not None and hasattr(buf, 'shape')) else (1,)
            l3_initial_inputs.append(np.ones(shape, dtype=np.uint8))

    generateL3HexDump(deployer, os.path.join(dumpdir, 'hex'), l3_initial_inputs, [])
--------------------------------------------------------------------------- +# Optimizer network code-generation helpers +# --------------------------------------------------------------------------- + +_OPT_PREFIX = "DeeployOptNetwork_" +_TRAIN_PREFIX = "DeeployNetwork_" + + +def build_shared_buffer_maps(train_onnx_path: str, opt_onnx_model) -> Tuple[Dict[int, int], Dict[int, int]]: + """Build optimizer→training index maps for tensors shared between the two graphs. + + The optimizer ONNX inputs are interleaved weight/grad pairs that have the + same tensor names as inputs in the training ONNX graph. We match by name + so that ``InitOptimizerNetwork`` can reference the already-allocated + ``DeeployNetwork_input_N`` pointers instead of allocating fresh buffers. + + Parameters + ---------- + train_onnx_path : str + Path to the training ``network.onnx``. + opt_onnx_model : + Already-loaded optimizer ONNX model (``onnx.ModelProto``). + + Returns + ------- + shared_input_map : Dict[int, int] + opt_input_idx → train_input_idx + shared_output_map : Dict[int, int] + opt_output_idx → train_input_idx (SGD outputs == updated weights, + same physical buffer as the weight input) + """ + import onnx as _onnx + train_model = _onnx.load_model(train_onnx_path) + train_names = [inp.name for inp in train_model.graph.input] + train_name_to_idx = {name: i for i, name in enumerate(train_names)} + + opt_input_names = [inp.name for inp in opt_onnx_model.graph.input] + opt_output_names = [out.name for out in opt_onnx_model.graph.output] + + shared_input_map: Dict[int, int] = {} + for opt_idx, name in enumerate(opt_input_names): + if name in train_name_to_idx: + shared_input_map[opt_idx] = train_name_to_idx[name] + + shared_output_map: Dict[int, int] = {} + for opt_idx, name in enumerate(opt_output_names): + # Try exact match first; then strip the '_updated' suffix that the SGD + # node appends to output tensor names (e.g. 'conv1_weight_updated' → 'conv1_weight'). 
+ lookup_name = name + if lookup_name not in train_name_to_idx and lookup_name.endswith('_updated'): + lookup_name = lookup_name[: -len('_updated')] + if lookup_name in train_name_to_idx: + shared_output_map[opt_idx] = train_name_to_idx[lookup_name] + + return shared_input_map, shared_output_map + + +def _patch_shared_buffers(retStr: str, shared_input_map: Dict[int, int], shared_output_map: Dict[int, int]) -> str: + """Redirect optimizer I/O buffers to Training's already-allocated buffers. + + Must be called AFTER the _TRAIN_PREFIX → _OPT_PREFIX substitution so that + the generated symbols already carry the ``DeeployOptNetwork_`` prefix. + + Handles two allocation styles produced by Deeploy: + + *Non-tiled* (per-buffer malloc):: + + DeeployOptNetwork_input_N = (SomeType *)pi_l2_malloc(sizeof(...)); + + *Tiled* (single arena with offsets):: + + DeeployOptNetwork_input_N = (float32_t *)((char *)DeeployOptNetwork_MEMORYARENA_L2 + OFFSET); + + Both are replaced with direct pointers into the TrainingNetwork arenas:: + + DeeployOptNetwork_input_N = (float32_t *)DeeployNetwork_input_M; + + After all I/O pointers are redirected, if a ``MEMORYARENA_L2`` or + ``MEMORYARENA_L3`` allocation is no longer referenced anywhere in the Init + body (i.e., the shared buffers consumed the entire arena), the now-unused + malloc is also removed to reclaim the L2/L3 memory. + + Parameters + ---------- + retStr : str + The already-prefix-substituted C source string. + shared_input_map : Dict[int, int] + Optimizer input index → training input index. + shared_output_map : Dict[int, int] + Optimizer output index → training input index (in-place update). + + Returns + ------- + str + Patched C source string. 
+ """ + if not shared_input_map and not shared_output_map: + return retStr + + # ------------------------------------------------------------------ + # Pattern 1 (non-tiled): individual pi_*_malloc per buffer + # ------------------------------------------------------------------ + _malloc_pat = re.compile( + r'(DeeployOptNetwork_(input|output)_(\d+))\s*=\s*\([^)]+\s*\*\s*\)\s*pi_\w+_malloc\([^;]+\);' + ) + + # ------------------------------------------------------------------ + # Pattern 2 (tiled): arena-offset assignment + # DeeployOptNetwork_input_N = (Type *)((char *)DeeployOptNetwork_MEMORYARENA_Lx + OFFSET); + # ------------------------------------------------------------------ + _arena_pat = re.compile( + r'(DeeployOptNetwork_(input|output)_(\d+))\s*=\s*\([^)]+\s*\*\s*\)' + r'\s*\(\s*\(char\s*\*\)\s*DeeployOptNetwork_MEMORYARENA_L\w+\s*\+\s*\d+\s*\)\s*;' + ) + + def _make_replacement(symbol: str, kind: str, idx: int) -> Optional[str]: + if kind == "input" and idx in shared_input_map: + train_idx = shared_input_map[idx] + return f'{symbol} = (float32_t *){_TRAIN_PREFIX}input_{train_idx}; /* shared with TrainingNetwork */' + if kind == "output" and idx in shared_output_map: + train_idx = shared_output_map[idx] + return f'{symbol} = (float32_t *){_TRAIN_PREFIX}input_{train_idx}; /* in-place, shared with TrainingNetwork */' + return None + + def _replace(m: re.Match) -> str: + replacement = _make_replacement(m.group(1), m.group(2), int(m.group(3))) + return replacement if replacement is not None else m.group(0) + + retStr = _malloc_pat.sub(_replace, retStr) + retStr = _arena_pat.sub(_replace, retStr) + + # ------------------------------------------------------------------ + # Arena elimination: if a MEMORYARENA_Lx is no longer used for any + # pointer arithmetic after the redirects, its malloc is dead and can + # be removed to reclaim L2/L3. The global declaration is left in + # place (harmless; the variable will be NULL at runtime). 
+ # ------------------------------------------------------------------
+ for level in ('L2', 'L3'):
+ arena_sym = f'DeeployOptNetwork_MEMORYARENA_{level}'
+ # Pattern for the malloc assignment line itself
+ malloc_line_pat = re.compile(
+ rf'[^\n]*{re.escape(arena_sym)}\s*=\s*\([^)]+\)\s*pi_\w+_malloc\([^;]+\);\s*\n'
+ )
+ # Pattern for any use of the arena in pointer arithmetic:
+ # (char *)ARENA + OFFSET or (void *)ARENA etc.
+ arena_use_pat = re.compile(
+ rf'\(\s*(?:char|void|int8_t)\s*\*\s*\)\s*{re.escape(arena_sym)}'
+ )
+ if not arena_use_pat.search(retStr):
+ # No remaining pointer arithmetic — the malloc is dead
+ retStr = malloc_line_pat.sub('', retStr)
+
+ # ------------------------------------------------------------------
+ # Inject TrainingNetwork header so DeeployNetwork_input_N symbols resolve
+ # ------------------------------------------------------------------
+ retStr = retStr.replace(
+ '#include "OptimizerNetwork.h"',
+ '#include "OptimizerNetwork.h"\n#include "TrainingNetwork.h"',
+ )
+ return retStr
+
+
+def _patch_shared_arenas(retStr: str, train_c_source: str) -> str:
+ """Redirect the optimizer's L1 arena allocation to reuse the training network's L1 arena.
+
+ TrainingNetwork and OptimizerNetwork run strictly sequentially: RunTrainingNetwork()
+ completes before RunOptimizerNetwork() starts. Their L1/L2 tile-working arenas
+ therefore never overlap in time and can share the same physical memory.
+
+ Only the L1 arena is shared: it is pure tile-compute scratch whose content is
+ dead after each kernel returns. The L2 arena is NOT shared because it may hold
+ persistent tensor data (weights, activations) at fixed offsets in non-tiled mode;
+ sharing it would let the optimizer's L2 staging buffers overwrite that data.
+
+ Must be called AFTER the _TRAIN_PREFIX → _OPT_PREFIX substitution.
+
+ Parameters
+ ----------
+ retStr : str
+ The already-prefix-substituted C source string for the optimizer.
+ train_c_source : str + The full text of TrainingNetwork.c (used to confirm the arena symbols exist). + + Returns + ------- + str + Patched C source string. + """ + for level in ('L1',): + train_sym = f'DeeployNetwork_MEMORYARENA_{level}' + # Only alias if the training network actually has this arena + if train_sym not in train_c_source: + continue + + opt_sym = f'DeeployOptNetwork_MEMORYARENA_{level}' + opt_malloc_pat = re.compile( + rf'({re.escape(opt_sym)})\s*=\s*\([^)]+\)\s*\w+\(sizeof\([^)]+\)\s*\*\s*\d+\)\s*;' + ) + if not opt_malloc_pat.search(retStr): + continue + + replacement = f'{opt_sym} = (int8_t *){train_sym}; /* shared with TrainingNetwork */' + retStr = opt_malloc_pat.sub(replacement, retStr) + + # Inject TrainingNetwork header if not already present + # (_patch_shared_buffers may have already added it; guard against duplicates) + if '#include "TrainingNetwork.h"' not in retStr: + retStr = retStr.replace( + '#include "OptimizerNetwork.h"', + '#include "OptimizerNetwork.h"\n#include "TrainingNetwork.h"', + ) + + return retStr + + +def _ensure_training_l1_capacity(dumpdir: str, train_c_source: str, opt_alloc_code: str) -> str: + """Enlarge TrainingNetwork's L1 arena to cover the optimizer's L1 needs. + + Since the two networks share the same L1 arena, TrainingNetwork must allocate + at least max(train_L1, opt_L1) bytes. When the optimizer needs more L1 than + training (rare but possible, e.g. autoencoder), this function patches + TrainingNetwork.c and TrainingNetwork.h in-place and returns the updated + TrainingNetwork.c source string. + + Parameters + ---------- + dumpdir : str + Directory containing TrainingNetwork.c and TrainingNetwork.h. + train_c_source : str + Current content of TrainingNetwork.c. + opt_alloc_code : str + Optimizer buffer-allocation code after _TRAIN_PREFIX → _OPT_PREFIX + substitution (used to extract the optimizer's L1 size). + + Returns + ------- + str + (Possibly updated) TrainingNetwork.c source string. 
+ """ + m_opt = re.search( + r'DeeployOptNetwork_MEMORYARENA_L1\s*=\s*\([^)]+\)\s*pmsis_l1_malloc\(sizeof\([^)]+\)\s*\*\s*(\d+)\)', + opt_alloc_code, + ) + if not m_opt: + return train_c_source + + opt_l1 = int(m_opt.group(1)) + + m_train = re.search( + r'(DeeployNetwork_MEMORYARENA_L1\s*=\s*\([^)]+\)\s*pmsis_l1_malloc\(sizeof\([^)]+\)\s*\*\s*)(\d+)(\))', + train_c_source, + ) + if not m_train: + return train_c_source + + train_l1 = int(m_train.group(2)) + if opt_l1 <= train_l1: + return train_c_source # Already large enough + + new_l1 = opt_l1 + + # Patch TrainingNetwork.c malloc size + train_c_new = train_c_source.replace( + m_train.group(0), + f'{m_train.group(1)}{new_l1}{m_train.group(3)}', + 1, + ) + train_c_path = os.path.join(dumpdir, 'TrainingNetwork.c') + with open(train_c_path, 'w') as f: + f.write(train_c_new) + + # Patch TrainingNetwork.h _len constant + train_h_path = os.path.join(dumpdir, 'TrainingNetwork.h') + if os.path.exists(train_h_path): + train_h = open(train_h_path).read() + train_h_new = re.sub( + r'(DeeployNetwork_MEMORYARENA_L1_len\s*=\s*)\d+', + rf'\g<1>{new_l1}', + train_h, + ) + with open(train_h_path, 'w') as f: + f.write(train_h_new) + + return train_c_new + + +def generateOptimizerNetworkHeader(deployer: NetworkDeployer) -> str: + """Generate OptimizerNetwork.h. + + Reuses the Deeploy deployer's output and applies two transformations: + 1. Replace the buffer prefix ``DeeployNetwork_`` → ``DeeployOptNetwork_`` + 2. Inject ``RunOptimizerNetwork`` / ``InitOptimizerNetwork`` function declarations. + + Parameters + ---------- + deployer : NetworkDeployer + Prepared deployer for the optimizer ONNX graph. + + Returns + ------- + str + C header string. 
+ """
+ retStr = ""
+ retStr += """
+#ifndef __DEEPLOY_OPTIMIZER_HEADER__
+#define __DEEPLOY_OPTIMIZER_HEADER__
+#include
+#include
+#include
+"""
+ retStr += deployer.generateIncludeString()
+ if isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)):
+ retStr += """
+void RunOptimizerNetwork();
+void InitOptimizerNetwork();
+
+"""
+ else:
+ retStr += """
+void RunOptimizerNetwork(uint32_t core_id, uint32_t numThreads);
+void InitOptimizerNetwork(uint32_t core_id, uint32_t numThreads);
+
+"""
+ retStr += deployer.generateIOBufferInitializationCode()
+ retStr += """
+#endif
+"""
+ # Prefix substitution: all Deeploy-generated DeeployNetwork_ → DeeployOptNetwork_
+ retStr = retStr.replace(_TRAIN_PREFIX, _OPT_PREFIX)
+ return retStr
+
+
+def generateOptimizerNetworkImplementation(deployer: NetworkDeployer,
+ verbosityCfg: CodeGenVerbosity,
+ shared_input_map: Optional[Dict[int, int]] = None,
+ shared_output_map: Optional[Dict[int, int]] = None,
+ train_c_source: Optional[str] = None) -> str:
+ """Generate OptimizerNetwork.c.
+
+ Parameters
+ ----------
+ deployer : NetworkDeployer
+ Prepared deployer for the optimizer ONNX graph.
+ verbosityCfg : CodeGenVerbosity
+ Verbosity configuration.
+ shared_input_map : Dict[int, int], optional
+ Optimizer input index → training input index for shared weight/grad buffers.
+ When provided, those malloc calls are replaced with references to the
+ already-allocated TrainingNetwork buffers.
+ shared_output_map : Dict[int, int], optional
+ Optimizer output index → training input index for in-place shared outputs.
+ train_c_source : str, optional
+ Full text of TrainingNetwork.c. When provided, the optimizer's L1 arena
+ malloc call is replaced with a direct pointer to the training L1 arena,
+ saving one L1 allocation (safe because the two networks run
+ strictly sequentially; the L2 arena is deliberately not shared).
+
+ Returns
+ -------
+ str
+ C implementation string.
+ """ + retStr = "" + retStr += """#include +#include +#include +""" + retStr += deployer.generateIncludeString() + retStr += """ +#include "OptimizerNetwork.h" + +""" + retStr += deployer.generateBufferInitializationCode() + retStr += deployer.generateGlobalDefinitionCode() + + if isinstance(deployer.Platform, MemPoolPlatform): + retStr += deployer.generateInferenceInitializationCode() + retStr += """ +void RunOptimizerNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){ +""" + elif isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)): + retStr += """ +void RunOptimizerNetwork(){ +""" + retStr += deployer.generateInferenceInitializationCode() + else: + retStr += """ +void RunOptimizerNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){ +""" + retStr += deployer.generateInferenceInitializationCode() + + retStr += deployer.generateFunction(verbosityCfg) + + if isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)): + retStr += """ +} + +void InitOptimizerNetwork(){ +""" + else: + retStr += """ +} + +void InitOptimizerNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){ +""" + retStr += deployer.generateEngineInitializationCode() + retStr += deployer.generateBufferAllocationCode() + retStr += """ +} +""" + # Prefix substitution + retStr = retStr.replace(_TRAIN_PREFIX, _OPT_PREFIX) + # Replace malloc calls for shared weight/grad buffers with Training pointers + retStr = _patch_shared_buffers(retStr, shared_input_map or {}, shared_output_map or {}) + # Redirect optimizer L1/L2 arena mallocs to reuse training arenas + if train_c_source: + retStr = _patch_shared_arenas(retStr, train_c_source) + return retStr + + +def generateOptimizerTestNetwork(deployer: NetworkDeployer, + dumpdir: str, + verbosityCfg: CodeGenVerbosity, + shared_input_map: Optional[Dict[int, int]] = 
None, + shared_output_map: Optional[Dict[int, int]] = None) -> None: + """Generate OptimizerNetwork.h and OptimizerNetwork.c. + + Parameters + ---------- + deployer : NetworkDeployer + Prepared deployer for the optimizer ONNX graph. + dumpdir : str + Output directory for generated files. + verbosityCfg : CodeGenVerbosity + Verbosity configuration. + shared_input_map : Dict[int, int], optional + Optimizer input index → training input index for shared weight/grad buffers. + shared_output_map : Dict[int, int], optional + Optimizer output index → training input index for in-place shared outputs. + """ + assert deployer.prepared, "An unprepared deployer was given" + + os.makedirs(dumpdir, exist_ok=True) + + train_c_path = os.path.join(dumpdir, 'TrainingNetwork.c') + train_c_source: Optional[str] = None + if os.path.exists(train_c_path): + with open(train_c_path, 'r') as f: + train_c_source = f.read() + + # Enlarge training L1 arena if optimizer needs more (so unconditional L1 sharing is safe) + if train_c_source: + opt_alloc_preview = deployer.generateBufferAllocationCode().replace(_TRAIN_PREFIX, _OPT_PREFIX) + train_c_source = _ensure_training_l1_capacity(dumpdir, train_c_source, opt_alloc_preview) + + headerStr = generateOptimizerNetworkHeader(deployer) + with open(f'{dumpdir}/OptimizerNetwork.h', 'w') as f: + f.write(headerStr) + + implStr = generateOptimizerNetworkImplementation(deployer, verbosityCfg, shared_input_map, shared_output_map, + train_c_source) + with open(f'{dumpdir}/OptimizerNetwork.c', 'w') as f: + f.write(implStr) + + clang_format = "{BasedOnStyle: llvm, IndentWidth: 2, ColumnLimit: 160}" + for fname in ['OptimizerNetwork.c', 'OptimizerNetwork.h']: + os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/{fname}') diff --git a/DeeployTest/testUtils/core/config.py b/DeeployTest/testUtils/core/config.py index e932c23962..0ecf45d467 100644 --- a/DeeployTest/testUtils/core/config.py +++ b/DeeployTest/testUtils/core/config.py @@ -24,6 +24,14 @@ 
class DeeployTestConfig: gen_args: List[str] = None verbose: int = 0 debug: bool = False + training: bool = False + # None means "auto-detect from ONNX graph / inputs.npz during codegen" + n_train_steps: Optional[int] = None + n_accum_steps: Optional[int] = None + training_num_data_inputs: Optional[int] = None + # Directory containing the optimizer ONNX (network.onnx with SGD nodes). + # If None, defaults to /../simplemlp_optimizer when training=True. + optimizer_dir: Optional[str] = None def __post_init__(self): if self.cmake_args is None: diff --git a/DeeployTest/testUtils/core/execution.py b/DeeployTest/testUtils/core/execution.py index 1dcddeea62..9aff13cede 100644 --- a/DeeployTest/testUtils/core/execution.py +++ b/DeeployTest/testUtils/core/execution.py @@ -2,6 +2,7 @@ # # SPDX-License-Identifier: Apache-2.0 +import json import os import shutil import subprocess @@ -14,10 +15,56 @@ from .output_parser import TestResult, parse_test_output +def _augment_path(env: dict) -> dict: + """Prepend gvsoc/llvm bin dirs to PATH based on installed env vars. + + The install dirs are already set as env vars (GVSOC_INSTALL_DIR, + LLVM_INSTALL_DIR) but their bin/ subdirectories may not be in PATH. + + If a virtual environment is active (VIRTUAL_ENV is set), its bin dir + is prepended so that shebang-invoked scripts (kconfigtool.py, gapy) + resolve python3 to the venv interpreter, which has kconfiglib. + Without this, /usr/bin/python3 would be picked up instead, which + lacks kconfiglib and causes CMake kconfig setup to fail. 
+ """
+ venv = env.get('VIRTUAL_ENV', '')
+ extra = [str(Path(venv) / 'bin')] if venv else ['/usr/bin']
+ for var in ('GVSOC_INSTALL_DIR', 'LLVM_INSTALL_DIR'):
+ install_dir = env.get(var, '')
+ if install_dir:
+ bin_dir = str(Path(install_dir) / 'bin')
+ current = env.get('PATH', '').split(':')
+ if bin_dir not in current:
+ extra.append(bin_dir)
+ env['PATH'] = ':'.join(extra) + ':' + env.get('PATH', '')
+ return env
+
+
+def _resolve_optimizer_dir(config: DeeployTestConfig) -> str:
+ """Return the optimizer ONNX directory for this config.
+
+ Falls back to ``<test_dir parent>/<name>_optimizer`` if not explicitly set,
+ where ``<name>`` is derived by replacing the '_train' suffix of the test
+ directory name with '_optimizer' (e.g. simplemlp_train → simplemlp_optimizer,
+ sleepconvit_train → sleepconvit_optimizer).
+ """
+ if config.optimizer_dir:
+ return config.optimizer_dir
+ test_parent = Path(config.test_dir).parent
+ test_dir_name = Path(config.test_dir).name
+ optimizer_name = test_dir_name.replace("_train", "_optimizer")
+ return str(test_parent / optimizer_name)
+
+
 def generate_network(config: DeeployTestConfig, skip: bool = False) -> None:
 """
 Generate network code from ONNX model.
+ In training mode, generates both TrainingNetwork (fwd+bwd) and
+ OptimizerNetwork (SGD) into the same gen_dir. Auto-detected training
+ parameters (n_steps, n_accum, num_data_inputs) are written to
+ gen_dir/training_meta.json and read back into config after codegen.
+ Raises: RuntimeError: If network generation fails """ @@ -27,31 +74,175 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: script_dir = Path(__file__).parent.parent.parent - if config.tiling: + if config.training and config.tiling: + # --- Tiled training: testMVPTraining.py (tiling pipeline + training init) --- + generation_script = script_dir / "testMVPTraining.py" + cmd = [ + sys.executable, + str(generation_script), + "-d", config.gen_dir, + "-t", config.test_dir, + "-p", config.platform, + ] + if config.n_train_steps is not None: + cmd.append(f"--n-steps={config.n_train_steps}") + if config.n_accum_steps is not None: + cmd.append(f"--n-accum={config.n_accum_steps}") + if config.training_num_data_inputs is not None: + cmd.append(f"--num-data-inputs={config.training_num_data_inputs}") + if config.verbose > 0: + cmd.append("-" + "v" * config.verbose) + if config.debug: + cmd.append("--debug") + cmd.extend(config.gen_args) + + log.debug(f"[Execution] Tiled training generation command: {' '.join(cmd)}") + result = subprocess.run(cmd, check=False) + if result.returncode != 0: + raise RuntimeError(f"Tiled training network generation failed for {config.test_name}") + + # Read back auto-detected values written by testMVPTraining.py + meta_path = Path(config.gen_dir) / "training_meta.json" + if meta_path.exists(): + with open(meta_path) as f: + meta = json.load(f) + config.n_train_steps = meta["n_train_steps"] + config.n_accum_steps = meta["n_accum_steps"] + config.training_num_data_inputs = meta["training_num_data_inputs"] + log.info(f"[Execution] Training meta: {meta}") + + # --- Step 2: Tiled optimizer network (SGD via testMVPOptimizer.py) --- + opt_dir = _resolve_optimizer_dir(config) + opt_script = script_dir / "testMVPOptimizer.py" + + if not Path(opt_dir).exists(): + log.warning(f"Optimizer directory not found: {opt_dir} — skipping optimizer codegen") + elif not opt_script.exists(): + log.warning(f"testMVPOptimizer.py not found — 
skipping optimizer codegen") + else: + opt_cmd = [ + sys.executable, + str(opt_script), + "-d", config.gen_dir, + "-t", opt_dir, + "-p", config.platform, + f"--training-dir={config.test_dir}", + ] + _OPT_PASSTHROUGH = ("--cores", "--l1", "--l2", + "--defaultMemLevel", + "--memAllocStrategy", "--searchStrategy", + "--plotMemAlloc", "--profileTiling") + for arg in config.gen_args: + if any(arg.startswith(p) for p in _OPT_PASSTHROUGH): + opt_cmd.append(arg) + # If no --defaultMemLevel was passed through, default to L2 + if not any(arg.startswith("--defaultMemLevel") for arg in opt_cmd): + opt_cmd.append("--defaultMemLevel=L2") + if config.verbose > 0: + opt_cmd.append("-" + "v" * config.verbose) + + log.debug(f"[Execution] Tiled optimizer generation command: {' '.join(opt_cmd)}") + result = subprocess.run(opt_cmd, check=False) + if result.returncode != 0: + raise RuntimeError(f"Tiled optimizer network generation failed for {config.test_name}") + + return # early return — tiled training path complete + + elif config.training: + # --- Step 1: Training network (forward + backward + accumulation) --- + generation_script = script_dir / "generateTrainingNetwork.py" + cmd = [ + sys.executable, + str(generation_script), + "-d", config.gen_dir, + "-t", config.test_dir, + "-p", config.platform, + ] + # Only pass values when explicitly set; otherwise let the script auto-detect + if config.n_train_steps is not None: + cmd.append(f"--n-steps={config.n_train_steps}") + if config.n_accum_steps is not None: + cmd.append(f"--n-accum={config.n_accum_steps}") + if config.training_num_data_inputs is not None: + cmd.append(f"--num-data-inputs={config.training_num_data_inputs}") + + if config.verbose > 0: + cmd.append("-" + "v" * config.verbose) + if config.debug: + cmd.append("--debug") + cmd.extend(config.gen_args) + + log.debug(f"[Execution] Training generation command: {' '.join(cmd)}") + result = subprocess.run(cmd, check=False) + if result.returncode != 0: + raise 
RuntimeError(f"Training network generation failed for {config.test_name}") + + # Read back auto-detected values written by generateTrainingNetwork.py + meta_path = Path(config.gen_dir) / "training_meta.json" + if meta_path.exists(): + with open(meta_path) as f: + meta = json.load(f) + config.n_train_steps = meta["n_train_steps"] + config.n_accum_steps = meta["n_accum_steps"] + config.training_num_data_inputs = meta["training_num_data_inputs"] + log.info(f"[Execution] Training meta: {meta}") + + # --- Step 2: Optimizer network (SGD) --- + opt_dir = _resolve_optimizer_dir(config) + opt_script = script_dir / "generateOptimizerNetwork.py" + + if not Path(opt_dir).exists(): + log.warning(f"Optimizer directory not found: {opt_dir} — skipping optimizer codegen") + elif not opt_script.exists(): + log.warning(f"generateOptimizerNetwork.py not found — skipping optimizer codegen") + else: + opt_cmd = [ + sys.executable, + str(opt_script), + "-d", config.gen_dir, + "-t", opt_dir, + "-p", config.platform, + f"--training-dir={config.test_dir}", + ] + _OPT_PASSTHROUGH = ("--cores", "--l1", "--l2", "--defaultMemLevel") + for arg in config.gen_args: + if any(arg.startswith(p) for p in _OPT_PASSTHROUGH): + opt_cmd.append(arg) + if not any(arg.startswith("--defaultMemLevel") for arg in opt_cmd): + opt_cmd.append("--defaultMemLevel=L2") + if config.verbose > 0: + opt_cmd.append("-" + "v" * config.verbose) + + log.debug(f"[Execution] Optimizer generation command: {' '.join(opt_cmd)}") + result = subprocess.run(opt_cmd, check=False) + if result.returncode != 0: + raise RuntimeError(f"Optimizer network generation failed for {config.test_name}") + + return # early return — training path complete + + elif config.tiling: generation_script = script_dir / "testMVP.py" + cmd = [ + sys.executable, + str(generation_script), + "-d", config.gen_dir, + "-t", config.test_dir, + "-p", config.platform, + ] else: generation_script = script_dir / "generateNetwork.py" + cmd = [ + sys.executable, + 
str(generation_script), + "-d", config.gen_dir, + "-t", config.test_dir, + "-p", config.platform, + ] - cmd = [ - "python", - str(generation_script), - "-d", - config.gen_dir, - "-t", - config.test_dir, - "-p", - config.platform, - ] - - # Add verbosity flags if config.verbose > 0: cmd.append("-" + "v" * config.verbose) - - # Add debug flag if config.debug: cmd.append("--debug") - - # Add additional generation arguments cmd.extend(config.gen_args) log.debug(f"[Execution] Generation command: {' '.join(cmd)}") @@ -72,7 +263,6 @@ def configure_cmake(config: DeeployTestConfig) -> None: if cmake_cmd == "cmake" and shutil.which("cmake") is None: raise RuntimeError("CMake not found. Please install CMake or set CMAKE environment variable") - # Build CMake command cmd = [ cmake_cmd, f"-DTOOLCHAIN={config.toolchain}", @@ -102,11 +292,22 @@ def configure_cmake(config: DeeployTestConfig) -> None: else: cmd.append("-Dgvsoc_simulation=OFF") - # Last argument is the source directory + if config.training: + cmd.append("-DTRAINING=ON") + # Only add cmake defines when the values are known (after codegen) + if config.n_train_steps is not None: + cmd.append(f"-DN_TRAIN_STEPS={config.n_train_steps}") + if config.n_accum_steps is not None: + cmd.append(f"-DN_ACCUM_STEPS={config.n_accum_steps}") + if config.training_num_data_inputs is not None: + cmd.append(f"-DTRAINING_NUM_DATA_INPUTS={config.training_num_data_inputs}") + else: + cmd.append("-DTRAINING=OFF") + script_dir = Path(__file__).parent.parent.parent cmd.append(str(script_dir.parent)) - env = os.environ.copy() + env = _augment_path(os.environ.copy()) if config.verbose >= 3: env["VERBOSE"] = "1" @@ -162,44 +363,49 @@ def run_simulation(config: DeeployTestConfig, skip: bool = False) -> TestResult: if config.simulator == 'none': raise RuntimeError("No simulator specified!") + env = _augment_path(os.environ.copy()) + if config.verbose >= 3: + env["VERBOSE"] = "1" + if config.simulator == 'host': - # Run binary directly binary_path = 
Path(config.build_dir) / "bin" / config.test_name cmd = [str(binary_path)] - else: - # Run via CMake target - cmake_cmd = os.environ.get("CMAKE", "cmake") - cmd = [ - cmake_cmd, - "--build", - config.build_dir, - "--target", - f"{config.simulator}_{config.test_name}", - ] - env = os.environ.copy() - if config.verbose >= 3: - env["VERBOSE"] = "1" + elif config.simulator == 'gvsoc': + cmake_cmd = os.environ.get("CMAKE", "cmake") + cmd = [cmake_cmd, "--build", config.build_dir, "--target", + f"gvsoc_{config.test_name}"] - if config.simulator == 'banshee': + elif config.simulator == 'banshee': if config.verbose == 1: env["BANSHEE_LOG"] = "warn" elif config.verbose == 2: env["BANSHEE_LOG"] = "info" elif config.verbose >= 3: env["BANSHEE_LOG"] = "debug" + cmake_cmd = os.environ.get("CMAKE", "cmake") + cmd = [cmake_cmd, "--build", config.build_dir, "--target", + f"{config.simulator}_{config.test_name}"] - log.debug(f"[Execution] Simulation command: {' '.join(cmd)}") + else: + cmake_cmd = os.environ.get("CMAKE", "cmake") + cmd = [cmake_cmd, "--build", config.build_dir, "--target", + f"{config.simulator}_{config.test_name}"] - result = subprocess.run(cmd, capture_output = True, text = True, env = env) + log.debug(f"[Execution] Simulation command: {' '.join(cmd)}") - if result.stdout: - print(result.stdout, end = '') - if result.stderr: - print(result.stderr, end = '', file = sys.stderr) + # Stream output in real-time (line-buffered) and capture for parsing. 
+ proc = subprocess.Popen(cmd, stdout = subprocess.PIPE, stderr = subprocess.STDOUT, + text = True, env = env, bufsize = 1) + stdout_lines = [] + for line in proc.stdout: + print(line, end = '', flush = True) + stdout_lines.append(line) + proc.stdout.close() + proc.wait() + stdout_output = ''.join(stdout_lines) - # Parse output for error count and cycles - test_result = parse_test_output(result.stdout, result.stderr) + test_result = parse_test_output(stdout_output, '') if not test_result.success and test_result.error_count == -1: log.warning(f"Could not parse error count from output") @@ -213,16 +419,9 @@ def run_complete_test(config: DeeployTestConfig, skipgen: bool = False, skipsim: """ log.info(f"################## Testing {config.test_name} on {config.platform} Platform ##################") - # Step 1: Generate network generate_network(config, skip = skipgen) - - # Step 2: Configure CMake configure_cmake(config) - - # Step 3: Build binary build_binary(config) - - # Step 4: Run simulation result = run_simulation(config, skip = skipsim) return result diff --git a/DeeployTest/testUtils/deeployTrainingRunner.py b/DeeployTest/testUtils/deeployTrainingRunner.py new file mode 100644 index 0000000000..9ee4a64cf4 --- /dev/null +++ b/DeeployTest/testUtils/deeployTrainingRunner.py @@ -0,0 +1,149 @@ +# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 +""" +Common entry point for Siracusa training test runners (non-tiled and tiled). + +Usage: + from testUtils.deeployTrainingRunner import main + sys.exit(main(tiling_enabled=False)) # non-tiled + sys.exit(main(tiling_enabled=True)) # tiled (SBTiler) +""" + +import os +import sys +from pathlib import Path +from typing import Optional + +# gapy (gvsoc launcher) uses `#!/usr/bin/env python3`. Put /usr/bin first so +# it resolves to /usr/bin/python3 which has all required packages (gapylib, +# prettytable, …) rather than the minimal venv python. 
+os.environ['PATH'] = '/usr/bin:' + os.environ.get('PATH', '') + +from .core import DeeployTestConfig, run_complete_test +from .core.paths import get_test_paths +from .deeployRunner import DeeployRunnerArgumentParser, print_colored_result, print_configuration + + +def main(tiling_enabled: bool = False, default_platform: str = 'Siracusa', default_simulator: str = 'gvsoc'): + """ + Build parser, parse args, create DeeployTestConfig, and run the training test. + + Parameters + ---------- + tiling_enabled: + True → passes tiling args (--l1, --l2, …) and sets tiling=True in config. + default_platform: + Platform used when -p is not given on the command line. + default_simulator: + Simulator used when -s is not given on the command line. + """ + + parser = DeeployRunnerArgumentParser(tiling_arguments = tiling_enabled, platform_required = False) + + parser.add_argument('--cores', type = int, default = 8, help = 'Number of cluster cores (default: 8)\n') + parser.add_argument('--n-steps', + metavar = '', + dest = 'n_steps', + type = int, + default = None, + help = 'N_TRAIN_STEPS: optimizer steps (auto-detected if not given)\n') + parser.add_argument('--n-accum', + metavar = '', + dest = 'n_accum', + type = int, + default = None, + help = 'N_ACCUM_STEPS: mini-batches per update step (auto-detected if not given)\n') + parser.add_argument('--num-data-inputs', + metavar = '', + dest = 'num_data_inputs', + type = int, + default = None, + help = 'Inputs that change each mini-batch (auto-detected if not given)\n') + parser.add_argument('--optimizer-dir', + metavar = '', + dest = 'optimizer_dir', + type = str, + default = None, + help = 'Directory containing the optimizer network.onnx ' + "(default: auto-derived by replacing '_train' with '_optimizer')\n") + parser.add_argument('--tolerance', + metavar = '', + dest = 'tolerance', + type = float, + default = None, + help = 'Absolute loss tolerance for pass/fail comparison (default: auto from generateTrainingNetwork.py)\n') + + args 
= parser.parse_args() + + platform = default_platform + simulator = args.simulator if args.simulator else default_simulator + + script_path = Path(__file__).resolve() + base_dir = script_path.parent.parent + + gen_dir, test_dir_abs, test_name = get_test_paths(args.dir, platform, base_dir = str(base_dir)) + + worker_id = os.environ.get('PYTEST_XDIST_WORKER', 'master') + build_dir = str(base_dir / f'TEST_{platform.upper()}' / f'build_{worker_id}') + + cmake_args = [f'-DNUM_CORES={args.cores}'] + if args.cmake: + cmake_args.extend(args.cmake) + + gen_args = [f'--cores={args.cores}'] + if args.tolerance is not None: + gen_args.append(f'--tolerance={args.tolerance}') + if args.input_type_map: + gen_args.extend(['--input-type-map'] + list(args.input_type_map)) + if args.input_offset_map: + gen_args.extend(['--input-offset-map'] + list(args.input_offset_map)) + + if tiling_enabled: + if getattr(args, 'defaultMemLevel', None): + gen_args.append(f'--defaultMemLevel={args.defaultMemLevel}') + if getattr(args, 'l1', None): + gen_args.append(f'--l1={args.l1}') + if getattr(args, 'l2', None) and args.l2 != 1024000: + gen_args.append(f'--l2={args.l2}') + if getattr(args, 'memAllocStrategy', None): + gen_args.append(f'--memAllocStrategy={args.memAllocStrategy}') + if getattr(args, 'searchStrategy', None): + gen_args.append(f'--searchStrategy={args.searchStrategy}') + if getattr(args, 'profileTiling', False): + gen_args.append('--profileTiling') + if getattr(args, 'plotMemAlloc', False): + gen_args.append('--plotMemAlloc') + + config = DeeployTestConfig( + test_name = test_name, + test_dir = test_dir_abs, + platform = platform, + simulator = simulator, + tiling = tiling_enabled, + gen_dir = gen_dir, + build_dir = build_dir, + toolchain = args.toolchain, + toolchain_install_dir = args.toolchain_install_dir, + cmake_args = cmake_args, + gen_args = gen_args, + verbose = args.verbose, + debug = args.debug, + training = True, + n_train_steps = args.n_steps, + n_accum_steps = 
args.n_accum, + training_num_data_inputs = args.num_data_inputs, + optimizer_dir = args.optimizer_dir, + ) + + print_configuration(config) + + try: + result = run_complete_test(config, skipgen = args.skipgen, skipsim = args.skipsim) + print_colored_result(result, config.test_name) + return 0 if result.success else 1 + except Exception as e: + RED = '\033[91m' + RESET = '\033[0m' + print(f'\n{RED}✗ Test {config.test_name} FAILED with exception: {e}{RESET}') + return 1 diff --git a/DeeployTest/testUtils/tilingUtils.py b/DeeployTest/testUtils/tilingUtils.py index 0c3986cd6e..1e4b143cfb 100644 --- a/DeeployTest/testUtils/tilingUtils.py +++ b/DeeployTest/testUtils/tilingUtils.py @@ -2,11 +2,13 @@ # # SPDX-License-Identifier: Apache-2.0 -from typing import List, Union +from typing import Dict, List, Optional, Tuple, Union from ortools.constraint_solver.pywrapcp import IntVar from Deeploy.DeeployTypes import NetworkContext, SubGraph, TransientBuffer +from Deeploy.TilingExtension.MemoryConstraints import PatternMemoryConstraints +from Deeploy.TilingExtension.MemoryScheduler import MemoryScheduler from Deeploy.TilingExtension.TilerExtension import Tiler from Deeploy.TilingExtension.TilerModel import TilerModel @@ -43,3 +45,28 @@ class SBTiler(Tiler): def multiBufferStrategy(self, tilerModel: TilerModel, ctxt: NetworkContext, pattern: SubGraph, path: List[str], hop: str, tensorName: str) -> Union[int, IntVar]: return 1 + + +class TrainingMemoryScheduler(MemoryScheduler): + """MemoryScheduler variant for training networks. + + Extends input tensor lifetimes to the end of the full tiling schedule so + that forward-pass inputs remain live during the backward pass. 
+ """ + + def _calculateLifetimes( + self, ctxt: NetworkContext, patternMemoryConstraint: PatternMemoryConstraints, + memoryLevel: str) -> Tuple[Dict[str, Tuple[int, int]], Dict]: + tensorLifetimeMap, tensorMap = super()._calculateLifetimes(ctxt, patternMemoryConstraint, memoryLevel) + + maxStepIdx = len(patternMemoryConstraint.nodeConstraints) + for tensorName, lifetime in tensorLifetimeMap.items(): + buffer = ctxt.lookup(tensorName) + if buffer.is_input: + tensorLifetimeMap[tensorName] = (0, maxStepIdx) + + return tensorLifetimeMap, tensorMap + + +class TrainingSBTiler(SBTiler): + memorySchedulerClass = TrainingMemoryScheduler From 284f14576d22d7836142e41bd31919e4aa79ca63 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 11:45:50 +0000 Subject: [PATCH 02/28] training-platform core: apply pre-commit formatting (yapf/isort/clang-format) --- Deeploy/Targets/Generic/TypeCheckers.py | 3 +- Deeploy/Targets/PULPOpen/Bindings.py | 8 +- Deeploy/Targets/PULPOpen/Platform.py | 124 +++++++---- .../FloatInPlaceAccumulatorV2Template.py | 4 +- .../Targets/PULPOpen/Templates/SGDTemplate.py | 5 +- .../TileConstraints/SGDTileConstraint.py | 1 + ...rossEntropyLossDualOutputTileConstraint.py | 17 +- Deeploy/Targets/PULPOpen/Tiler.py | 14 +- Deeploy/TilingExtension/TilerExtension.py | 3 +- .../Platforms/Siracusa/src/deeploytraintest.c | 204 +++++++++--------- DeeployTest/deeployTrainingRunner.py | 8 +- DeeployTest/generateOptimizerNetwork.py | 54 ++--- DeeployTest/generateTrainingNetwork.py | 102 +++++---- DeeployTest/testMVPOptimizer.py | 96 ++++----- DeeployTest/testMVPTraining.py | 160 +++++++------- DeeployTest/testUtils/codeGenerate.py | 95 ++++---- DeeployTest/testUtils/core/execution.py | 85 +++++--- .../testUtils/deeployTrainingRunner.py | 17 +- DeeployTest/testUtils/tilingUtils.py | 7 +- 19 files changed, 537 insertions(+), 470 deletions(-) diff --git a/Deeploy/Targets/Generic/TypeCheckers.py b/Deeploy/Targets/Generic/TypeCheckers.py index 
d65dc455d2..7e9bc923cf 100644 --- a/Deeploy/Targets/Generic/TypeCheckers.py +++ b/Deeploy/Targets/Generic/TypeCheckers.py @@ -574,8 +574,7 @@ class SoftmaxCrossEntropyLossChecker(SignPropTypeChecker): def __init__(self, input_types: Sequence[Type[Pointer]], output_types: Sequence[Type[Pointer]]): super().__init__(input_types, output_types) - def checkOutputType(self, inputs: List[VariableBuffer], - operatorRepresentation: OperatorRepresentation) -> bool: + def checkOutputType(self, inputs: List[VariableBuffer], operatorRepresentation: OperatorRepresentation) -> bool: # The parser sets 'loss' to a non-empty string for 2-output nodes, '' for 1-output. # Use this to determine the actual output count and match it against this binding. actual_num_outputs = 2 if operatorRepresentation.get('loss', '') != '' else 1 diff --git a/Deeploy/Targets/PULPOpen/Bindings.py b/Deeploy/Targets/PULPOpen/Bindings.py index b3029e7adc..2a1bc9ec02 100644 --- a/Deeploy/Targets/PULPOpen/Bindings.py +++ b/Deeploy/Targets/PULPOpen/Bindings.py @@ -379,14 +379,16 @@ PULPInPlaceAccumulatorV2Bindings = [ NodeBinding( InPlaceAccumulatorV2Checker( - [PointerClass(float32_t), PointerClass(float32_t), PointerClass(uint8_t)], [PointerClass(float32_t)]), - FloatInPlaceAccumulatorV2Template.referenceTemplate, ForkTransformer) + [PointerClass(float32_t), PointerClass(float32_t), + PointerClass(uint8_t)], [PointerClass(float32_t)]), FloatInPlaceAccumulatorV2Template.referenceTemplate, + ForkTransformer) ] PULPInPlaceAccumulatorV2TiledBindings = [ NodeBinding( InPlaceAccumulatorV2Checker( - [PointerClass(float32_t), PointerClass(float32_t), PointerClass(uint8_t)], [PointerClass(float32_t)]), + [PointerClass(float32_t), PointerClass(float32_t), + PointerClass(uint8_t)], [PointerClass(float32_t)]), FloatInPlaceAccumulatorV2Template.tiledReferenceTemplate, ForkTransformer) ] diff --git a/Deeploy/Targets/PULPOpen/Platform.py b/Deeploy/Targets/PULPOpen/Platform.py index 56481f9220..0766548e43 100644 --- 
a/Deeploy/Targets/PULPOpen/Platform.py +++ b/Deeploy/Targets/PULPOpen/Platform.py @@ -20,8 +20,8 @@ SoftmaxCrossEntropyLossLayer, SoftmaxGradLayer, SoftmaxLayer, TransposeLayer, iHardswishLayer, iRMSNormLayer from Deeploy.Targets.Generic.Parsers import AddParser, ConcatParser, DequantParser, FlattenParser, GatherParser, \ GELUGradParser, GELUParser, GEMMParser, InPlaceAccumulatorV2Parser, LayerNormGradParser, LayerNormParser, \ - MatMulParser, MaxPool1DParser, MaxPool2DParser, MulParser, Pad1DParser, Pad2DParser, QuantParser, \ - ReduceSumParser, ReluParser, RequantShiftParser, ReshapeParser, RQAddParser, RQIntegerDivParser, RQSiGELUParser, \ + MatMulParser, MaxPool1DParser, MaxPool2DParser, MulParser, Pad1DParser, Pad2DParser, QuantParser, ReduceSumParser, \ + ReluParser, RequantShiftParser, ReshapeParser, RQAddParser, RQIntegerDivParser, RQSiGELUParser, \ RQSiHardswishParser, SGDParser, SliceParser, SoftmaxCrossEntropyLossGradParser, SoftmaxCrossEntropyLossParser, \ SoftmaxGradParser, SoftmaxParser, TransposeParser, UniformRequantShiftParser, UnsqueezeParser, iHardswishParser, \ iRMSNormParser, iSoftmaxParser @@ -116,48 +116,88 @@ DequantMapper = NodeMapper(DequantParser(), BasicDequantBindings) GEMMDequantMapper = NodeMapper(PULPGEMMParser(), BasicGEMMBindings) PULPMapping = { - 'Conv': ConvLayer([FPConv2DMapper, FPDWConv2DMapper]), - 'RequantizedConv': PULPRQSConvLayer([Conv2DMapper, DWConv2DMapper, Conv1DMapper, DWConv1DMapper]), - 'RequantizedGemm': PULPRQSGEMMLayer([MatrixVecMapper, TallGEMMMapper, GEMMMapper]), - 'Gemm': GEMMLayer([FloatGEMMMapper, GEMMDequantMapper]), - 'Gelu': GELULayer([GELUMapper]), - 'GeluGrad': GELUGradLayer([GELUGradMapper]), - 'LayerNormalization': LayerNormLayer([LayerNormMapper]), - 'LayerNormalizationGrad': LayerNormGradLayer([LayerNormGradMapper]), - 'MaxPool': MaxPoolLayer([MaxPool1DMapper, MaxPool2DMapper]), - 'RequantizediGELU': RQSiGELULayer([RQGELU_int8_Mapper]), - 'RQIntegerDiv': RQIntegerDivLayer([RQIntegerDivMapper]), - 
'MatMul': MatMulLayer([MatMulMapper]), - 'IntegerMean': ReduceMeanLayer([ReduceMeanMapper]), - 'iSoftmax': SoftmaxLayer([Softmax_int8_Mapper]), - 'Softmax': SoftmaxLayer([SoftmaxMapper]), - 'ReduceMean': ReduceMeanLayer([ReduceMeanMapper]), - 'ReduceSum': ReduceSumLayer([ReduceSumMapper]), - 'RequantShift': RequantShiftLayer([UniformRequantShiftMapper, RequantShiftMapper]), - 'Add': AddLayer([AddMapper]), - 'Flatten': ReshapeLayer([FlattenMapper]), - 'Gather': GatherLayer([GatherMapper]), - 'Mul': MulLayer([MulMapper]), - 'Pad': PadLayer([Pad1DMapper, Pad2DMapper]), - 'Relu': ReluLayer([ReluMapper]), - 'Reshape': ReshapeLayer([ReshapeMapper]), - 'Squeeze': ReshapeLayer([UnsqueezeMapper]), - 'Transpose': TransposeLayer([TransposeMapper]), - 'Unsqueeze': ReshapeLayer([UnsqueezeMapper]), - 'Slice': SliceLayer([SliceMapper, DMASliceMapper]), - 'RequantizedAdd': AddLayer([RQAddMapper]), - 'Concat': ConcatLayer([ConcatMapper]), - 'iRMSNorm': iRMSNormLayer([iRMSNormMapper]), - 'iHardswish': iHardswishLayer([iHardswishMapper]), - 'RequantizediHardswish': RQSiHardswishLayer([RQSiHardswishMapper]), - 'Quant': QuantLayer([QuantMapper]), - 'Dequant': QuantLayer([DequantMapper]), - 'SoftmaxGrad': SoftmaxGradLayer([SoftmaxGradMapper]), + 'Conv': + ConvLayer([FPConv2DMapper, FPDWConv2DMapper]), + 'RequantizedConv': + PULPRQSConvLayer([Conv2DMapper, DWConv2DMapper, Conv1DMapper, DWConv1DMapper]), + 'RequantizedGemm': + PULPRQSGEMMLayer([MatrixVecMapper, TallGEMMMapper, GEMMMapper]), + 'Gemm': + GEMMLayer([FloatGEMMMapper, GEMMDequantMapper]), + 'Gelu': + GELULayer([GELUMapper]), + 'GeluGrad': + GELUGradLayer([GELUGradMapper]), + 'LayerNormalization': + LayerNormLayer([LayerNormMapper]), + 'LayerNormalizationGrad': + LayerNormGradLayer([LayerNormGradMapper]), + 'MaxPool': + MaxPoolLayer([MaxPool1DMapper, MaxPool2DMapper]), + 'RequantizediGELU': + RQSiGELULayer([RQGELU_int8_Mapper]), + 'RQIntegerDiv': + RQIntegerDivLayer([RQIntegerDivMapper]), + 'MatMul': + 
MatMulLayer([MatMulMapper]), + 'IntegerMean': + ReduceMeanLayer([ReduceMeanMapper]), + 'iSoftmax': + SoftmaxLayer([Softmax_int8_Mapper]), + 'Softmax': + SoftmaxLayer([SoftmaxMapper]), + 'ReduceMean': + ReduceMeanLayer([ReduceMeanMapper]), + 'ReduceSum': + ReduceSumLayer([ReduceSumMapper]), + 'RequantShift': + RequantShiftLayer([UniformRequantShiftMapper, RequantShiftMapper]), + 'Add': + AddLayer([AddMapper]), + 'Flatten': + ReshapeLayer([FlattenMapper]), + 'Gather': + GatherLayer([GatherMapper]), + 'Mul': + MulLayer([MulMapper]), + 'Pad': + PadLayer([Pad1DMapper, Pad2DMapper]), + 'Relu': + ReluLayer([ReluMapper]), + 'Reshape': + ReshapeLayer([ReshapeMapper]), + 'Squeeze': + ReshapeLayer([UnsqueezeMapper]), + 'Transpose': + TransposeLayer([TransposeMapper]), + 'Unsqueeze': + ReshapeLayer([UnsqueezeMapper]), + 'Slice': + SliceLayer([SliceMapper, DMASliceMapper]), + 'RequantizedAdd': + AddLayer([RQAddMapper]), + 'Concat': + ConcatLayer([ConcatMapper]), + 'iRMSNorm': + iRMSNormLayer([iRMSNormMapper]), + 'iHardswish': + iHardswishLayer([iHardswishMapper]), + 'RequantizediHardswish': + RQSiHardswishLayer([RQSiHardswishMapper]), + 'Quant': + QuantLayer([QuantMapper]), + 'Dequant': + QuantLayer([DequantMapper]), + 'SoftmaxGrad': + SoftmaxGradLayer([SoftmaxGradMapper]), 'SoftmaxCrossEntropyLoss': SoftmaxCrossEntropyLossLayer([SoftmaxCrossEntropyLossDualOutputMapper, SoftmaxCrossEntropyLossMapper]), - 'SoftmaxCrossEntropyLossGrad': SoftmaxCrossEntropyLossGradLayer([SoftmaxCrossEntropyLossGradMapper]), - 'SGD': SGDLayer([SGDMapper]), - 'InPlaceAccumulatorV2': InPlaceAccumulatorV2Layer([InPlaceAccumulatorV2Mapper]), + 'SoftmaxCrossEntropyLossGrad': + SoftmaxCrossEntropyLossGradLayer([SoftmaxCrossEntropyLossGradMapper]), + 'SGD': + SGDLayer([SGDMapper]), + 'InPlaceAccumulatorV2': + InPlaceAccumulatorV2Layer([InPlaceAccumulatorV2Mapper]), } diff --git a/Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py 
b/Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py index 2c01219dbd..d1cfcc5d01 100644 --- a/Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py +++ b/Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py @@ -2,9 +2,9 @@ # # SPDX-License-Identifier: Apache-2.0 -from typing import Dict, List, Tuple +from typing import List, Tuple -from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation, VariableBuffer +from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation class _PULPInPlaceAccumulatorV2Template(NodeTemplate): diff --git a/Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py b/Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py index da27aab47c..3be74c38d6 100644 --- a/Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py +++ b/Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py @@ -4,7 +4,7 @@ from typing import List, Tuple -from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation, VariableBuffer +from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation class _PULPSGDTemplate(NodeTemplate): @@ -31,8 +31,7 @@ def alignToContext( # Make weight_updated share weight's L2 allocation (no separate malloc). # The egress DMA then writes updated weights back to weight's L2 address. 
- weight_updated.allocTemplate = NodeTemplate( - " ${name} = (${type.typeName}) " + str(weight._instance) + ";") + weight_updated.allocTemplate = NodeTemplate(" ${name} = (${type.typeName}) " + str(weight._instance) + ";") weight_updated.deallocTemplate = NodeTemplate("") return ctxt, operatorRepresentation, [] diff --git a/Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py b/Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py index ebef4910ca..951713d85d 100644 --- a/Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py +++ b/Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py @@ -11,6 +11,7 @@ class SGDTileConstraint(BOPTileConstraint): dataIn2Name = 'grad' dataOutName = 'weight_updated' + class ReluGradTileConstraint(BOPTileConstraint): dataIn1Name = 'grad_out' diff --git a/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py b/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py index 3456632b79..a261869711 100644 --- a/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py +++ b/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py @@ -3,16 +3,13 @@ # SPDX-License-Identifier: Apache-2.0 import copy -from typing import Dict, List, Tuple, Union +from typing import List, Tuple from Deeploy.DeeployTypes import NetworkContext, OperatorRepresentation +from Deeploy.Targets.PULPOpen.TileConstraints.SoftmaxCrossEntropyTileConstraint import SoftmaxCrossEntropyTileConstraint from Deeploy.TilingExtension.MemoryConstraints import NodeMemoryConstraint from Deeploy.TilingExtension.TileConstraint import TileConstraint -from Deeploy.TilingExtension.TilerModel import TilerModel -from Deeploy.TilingExtension.TilingCodegen import AbsoluteHyperRectangle, HyperRectangle, TilingSchedule, \ - VariableReplacementScheme -from 
Deeploy.Targets.PULPOpen.TileConstraints.SoftmaxCrossEntropyTileConstraint import \ - SoftmaxCrossEntropyTileConstraint +from Deeploy.TilingExtension.TilingCodegen import HyperRectangle, TilingSchedule, VariableReplacementScheme class SoftmaxCrossEntropyLossDualOutputTileConstraint(SoftmaxCrossEntropyTileConstraint): @@ -35,8 +32,8 @@ def wrapTilingSolution( cls, tilingSolution: NodeMemoryConstraint, targetMemLevel: str, ctxt: NetworkContext, operatorRepresentation: OperatorRepresentation) -> Tuple[VariableReplacementScheme, List[TilingSchedule]]: - logProbVar = operatorRepresentation[cls.dataOutName] # e.g. "onnx::log_prob::3" - lossVar = operatorRepresentation.get(cls.dataLossName, '') + logProbVar = operatorRepresentation[cls.dataOutName] # e.g. "onnx::log_prob::3" + lossVar = operatorRepresentation.get(cls.dataLossName, '') # If loss is absent (empty string — single-output fallback) or not in the # memory constraint dict, delegate straight to the parent unchanged. @@ -52,8 +49,8 @@ def wrapTilingSolution( # Call the base-class wrapTilingSolution, which runs cube computation and # calls serializeTilingSolution for log_prob. - varReplacement, tilingSchedules = super().wrapTilingSolution( - singleOutputSolution, targetMemLevel, ctxt, operatorRepresentation) + varReplacement, tilingSchedules = super().wrapTilingSolution(singleOutputSolution, targetMemLevel, ctxt, + operatorRepresentation) # Extend each TilingSchedule to include the scalar loss output. # The loss tensor is always 1 element (0-d scalar represented as [1] for DMA). 
diff --git a/Deeploy/Targets/PULPOpen/Tiler.py b/Deeploy/Targets/PULPOpen/Tiler.py index a135d43812..aa16369f04 100644 --- a/Deeploy/Targets/PULPOpen/Tiler.py +++ b/Deeploy/Targets/PULPOpen/Tiler.py @@ -16,13 +16,13 @@ from Deeploy.Targets.Generic.TileConstraints.UnaryTileConstraint import UnaryTileConstraint from Deeploy.Targets.PULPOpen.Bindings import PULPAddBindings, PULPConcatBindings, PULPFloatConv2DBindings, \ PULPFloatDWConv2DBindings, PULPFloatGELUBinding, PULPFloatGELUGradBinding, PULPFloatGEMMBindings, \ - PULPGatherBindings, PULPiHardswishBindings, PULPiRMSNormBindings, PULPiRQSGELUBindings, PULPLayernormBinding, \ - PULPLayernormGradBinding, PULPMatMulBindings, PULPMaxPool1DBindings, PULPMaxPool2DBindings, PULPMulBindings, \ - PULPReduceMeanBindings, PULPReduceSumBindings, PULPReluBinding, PULPReshapeBindings, PULPRQAddBindings, \ - PULPInPlaceAccumulatorV2Bindings, PULPInPlaceAccumulatorV2TiledBindings, PULPRQSBindings, \ - PULPRQSConv1DBindings, PULPRQSConv2DBindings, PULPRQSDWConv2DBindings, PULPRQSGEMMBindings, \ - PULPRQSiHardswishBindings, PULPRQSMatrixVecBindings, PULPRQSTallGEMMBindings, PULPSGDBindings, PULPSliceBindings, \ - PULPSoftmaxBindings, PULPSoftmaxCrossEntropyLossBindings, PULPSoftmaxCrossEntropyLossDualOutputBindings, \ + PULPGatherBindings, PULPiHardswishBindings, PULPInPlaceAccumulatorV2TiledBindings, PULPiRMSNormBindings, \ + PULPiRQSGELUBindings, PULPLayernormBinding, PULPLayernormGradBinding, PULPMatMulBindings, PULPMaxPool1DBindings, \ + PULPMaxPool2DBindings, PULPMulBindings, PULPReduceMeanBindings, PULPReduceSumBindings, PULPReluBinding, \ + PULPReshapeBindings, PULPRQAddBindings, PULPRQSBindings, PULPRQSConv1DBindings, PULPRQSConv2DBindings, \ + PULPRQSDWConv2DBindings, PULPRQSGEMMBindings, PULPRQSiHardswishBindings, PULPRQSMatrixVecBindings, \ + PULPRQSTallGEMMBindings, PULPSGDBindings, PULPSliceBindings, PULPSoftmaxBindings, \ + PULPSoftmaxCrossEntropyLossBindings, PULPSoftmaxCrossEntropyLossDualOutputBindings, \ 
PULPSoftmaxCrossEntropyLossGradBindings, PULPSoftmaxGradBindings, PULPTransposeBindings, PULPUniformRQSBindings from Deeploy.Targets.PULPOpen.TileConstraints.ConvTileConstraint import Conv2DTileConstraint, RQConv1DTileConstraint, \ RQConv2DTileConstraint diff --git a/Deeploy/TilingExtension/TilerExtension.py b/Deeploy/TilingExtension/TilerExtension.py index e42ddf13ad..294f5b400a 100644 --- a/Deeploy/TilingExtension/TilerExtension.py +++ b/Deeploy/TilingExtension/TilerExtension.py @@ -333,7 +333,8 @@ def _convertCtxtToStaticSchedule(self, ctxt: NetworkContext, if _buffer._memoryLevel != memoryLevel: continue - if hasattr(_buffer, "_alias") and ctxt.is_global(_buffer._alias) and _buffer._alias not in blockNames: + if hasattr(_buffer, "_alias") and ctxt.is_global( + _buffer._alias) and _buffer._alias not in blockNames: continue if hasattr(_buffer, "_alias") and _buffer._alias in blockNames: diff --git a/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c b/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c index 2b43c90710..6b324ca7ad 100644 --- a/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c +++ b/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c @@ -5,7 +5,8 @@ */ /* - * Training harness for Siracusa — Phase 2 (with Deeploy-compiled OptimizerNetwork) + * Training harness for Siracusa — Phase 2 (with Deeploy-compiled + * OptimizerNetwork) * * Loop structure: * @@ -15,40 +16,43 @@ * * for update_step in [0, N_TRAIN_STEPS): // optimizer steps * for accum_step in [0, N_ACCUM_STEPS): // mini-batches per update - * lazy_reset_grad = (accum_step == 0) // reset on first, accumulate on rest - * load data for this mini-batch - * RunTrainingNetwork() // fwd + bwd + InPlaceAccumulatorV2 - * store loss value + * lazy_reset_grad = (accum_step == 0) // reset on first, + * accumulate on rest load data for this mini-batch RunTrainingNetwork() // fwd + * + bwd + InPlaceAccumulatorV2 store loss value * // SGD weight update via Deeploy-compiled optimizer kernel: * 
copy weights + grad_acc → optimizer input buffers * RunOptimizerNetwork() - * copy weight_updated ← optimizer output buffers → training weight buffers + * copy weight_updated ← optimizer output buffers → training weight + * buffers * * Numerical verification: * - Compare stored loss values against testLossRef[] (from testoutputs.h) * * Buffer layout in DeeployNetwork_inputs[] (must match ONNX input order): - * [0 .. TRAINING_NUM_DATA_INPUTS-1] data + labels (per mini-batch) - * [TRAINING_NUM_DATA_INPUTS .. + * [0 .. TRAINING_NUM_DATA_INPUTS-1] data + labels (per + * mini-batch) [TRAINING_NUM_DATA_INPUTS .. * .. TRAINING_GRAD_BUF_START_IDX-1] weights (persistent) * [TRAINING_GRAD_BUF_START_IDX .. - * .. +TRAINING_NUM_GRAD_INPUTS-1] grad accumulation bufs (persistent) - * [DeeployNetwork_num_inputs-1] lazy_reset_grad uint8 + * .. +TRAINING_NUM_GRAD_INPUTS-1] grad accumulation bufs + * (persistent) [DeeployNetwork_num_inputs-1] lazy_reset_grad + * uint8 * * Optimizer buffer layout in DeeployOptNetwork_inputs[] (interleaved pairs): - * [2*i] weight_i (copied from DeeployNetwork_inputs[TRAINING_NUM_DATA_INPUTS+i]) - * [2*i+1] grad_acc_i (copied from DeeployNetwork_inputs[TRAINING_GRAD_BUF_START_IDX+i]) + * [2*i] weight_i (copied from + * DeeployNetwork_inputs[TRAINING_NUM_DATA_INPUTS+i]) [2*i+1] grad_acc_i (copied + * from DeeployNetwork_inputs[TRAINING_GRAD_BUF_START_IDX+i]) * DeeployOptNetwork_outputs[i] = weight_i_updated * → copied back to DeeployNetwork_inputs[TRAINING_NUM_DATA_INPUTS+i] * * Compile-time constants (emitted by code generator into testinputs.h): * N_TRAIN_STEPS number of optimizer (weight-update) steps * N_ACCUM_STEPS number of mini-batches accumulated per update - * TRAINING_NUM_DATA_INPUTS inputs that change each mini-batch (data + labels) - * TRAINING_GRAD_BUF_START_IDX first grad acc buffer index in DeeployNetwork_inputs[] - * TRAINING_NUM_GRAD_INPUTS number of grad accumulation buffers (== number of weights) - * TRAINING_NUM_WEIGHT_INPUTS number 
of trainable weight buffers - * TRAINING_LEARNING_RATE SGD learning rate (for reference — embedded in optimizer ONNX) + * TRAINING_NUM_DATA_INPUTS inputs that change each mini-batch (data + + * labels) TRAINING_GRAD_BUF_START_IDX first grad acc buffer index in + * DeeployNetwork_inputs[] TRAINING_NUM_GRAD_INPUTS number of grad + * accumulation buffers (== number of weights) TRAINING_NUM_WEIGHT_INPUTS number + * of trainable weight buffers TRAINING_LEARNING_RATE SGD learning rate (for + * reference — embedded in optimizer ONNX) * * Reference comparison constants (emitted into testoutputs.h): * N_LOSS_REFS number of reference loss values @@ -68,8 +72,9 @@ #include "testinputs.h" #include "testoutputs.h" -/* Helper: true when ptr is in L2 (CPU-accessible); false when in L3 (external RAM) */ -#define IS_L2(ptr) ((uint32_t)(ptr) >= 0x10000000u) +/* Helper: true when ptr is in L2 (CPU-accessible); false when in L3 (external + * RAM) */ +#define IS_L2(ptr) ((uint32_t)(ptr) >= 0x10000000u) /* ------------------------------------------------------------------------- * Compile-time defaults — override via CMake target_compile_definitions @@ -87,7 +92,7 @@ #define TRAINING_NUM_DATA_INPUTS 2 #endif -#define MAINSTACKSIZE 12000 +#define MAINSTACKSIZE 12000 #define SLAVESTACKSIZE 3800 /* ------------------------------------------------------------------------- @@ -96,7 +101,6 @@ struct pi_device cluster_dev; - /* ------------------------------------------------------------------------- * Loss storage (one value per forward pass) * ---------------------------------------------------------------------- */ @@ -140,7 +144,8 @@ static void connect_optimizer_buffers(void) { #if defined(TRAINING_NUM_WEIGHT_INPUTS) && (TRAINING_NUM_WEIGHT_INPUTS > 0) /* Nothing to pre-allocate — InitOptimizerNetwork() already allocated the * optimizer's static buffers and set DeeployOptNetwork_inputs[]/outputs[]. - * We only need to sync data at each optimizer step (see run_optimizer_step). 
*/ + * We only need to sync data at each optimizer step (see run_optimizer_step). + */ (void)0; #endif } @@ -152,15 +157,17 @@ static void run_optimizer_step(void) { for (uint32_t wi = 0; wi < (uint32_t)TRAINING_NUM_WEIGHT_INPUTS; wi++) { uint32_t train_w_idx = (uint32_t)TRAINING_NUM_DATA_INPUTS + wi; uint32_t train_g_idx = (uint32_t)TRAINING_GRAD_BUF_START_IDX + wi; - uint32_t opt_w_in = 2u * wi; - uint32_t opt_g_in = 2u * wi + 1u; + uint32_t opt_w_in = 2u * wi; + uint32_t opt_g_in = 2u * wi + 1u; - if (DeeployOptNetwork_inputs[opt_w_in] != DeeployNetwork_inputs[train_w_idx]) { + if (DeeployOptNetwork_inputs[opt_w_in] != + DeeployNetwork_inputs[train_w_idx]) { l3_aware_copy(DeeployOptNetwork_inputs[opt_w_in], DeeployNetwork_inputs[train_w_idx], DeeployOptNetwork_inputs_bytes[opt_w_in]); } - if (DeeployOptNetwork_inputs[opt_g_in] != DeeployNetwork_inputs[train_g_idx]) { + if (DeeployOptNetwork_inputs[opt_g_in] != + DeeployNetwork_inputs[train_g_idx]) { l3_aware_copy(DeeployOptNetwork_inputs[opt_g_in], DeeployNetwork_inputs[train_g_idx], DeeployOptNetwork_inputs_bytes[opt_g_in]); @@ -169,33 +176,34 @@ static void run_optimizer_step(void) { struct pi_cluster_task opt_task; pi_cluster_task(&opt_task, RunOptimizerNetwork, NULL); - opt_task.stack_size = MAINSTACKSIZE; + opt_task.stack_size = MAINSTACKSIZE; opt_task.slave_stack_size = SLAVESTACKSIZE; pi_cluster_send_task_to_cl(&cluster_dev, &opt_task); - /* --- Step C: copy weight_updated back to training network's weight buffers --- - * Skipped when codegen has shared the output buffer with the training input. */ + /* --- Step C: copy weight_updated back to training network's weight buffers + * --- Skipped when codegen has shared the output buffer with the training + * input. 
*/ for (uint32_t wi = 0; wi < (uint32_t)TRAINING_NUM_WEIGHT_INPUTS; wi++) { - uint32_t train_w_idx = (uint32_t)TRAINING_NUM_DATA_INPUTS + wi; - uint32_t opt_w_out = wi; + uint32_t train_w_idx = (uint32_t)TRAINING_NUM_DATA_INPUTS + wi; + uint32_t opt_w_out = wi; - if (DeeployOptNetwork_outputs[opt_w_out] == DeeployNetwork_inputs[train_w_idx]) { - continue; /* in-place: training buffer already updated */ + if (DeeployOptNetwork_outputs[opt_w_out] == + DeeployNetwork_inputs[train_w_idx]) { + continue; /* in-place: training buffer already updated */ } - uint32_t opt_bytes = DeeployOptNetwork_outputs_bytes[opt_w_out]; + uint32_t opt_bytes = DeeployOptNetwork_outputs_bytes[opt_w_out]; uint32_t train_bytes = DeeployNetwork_inputs_bytes[train_w_idx]; if (opt_bytes == train_bytes) { l3_aware_copy(DeeployNetwork_inputs[train_w_idx], - DeeployOptNetwork_outputs[opt_w_out], - opt_bytes); + DeeployOptNetwork_outputs[opt_w_out], opt_bytes); } else { /* Broadcasted bias: fill every tile with updated value. */ for (uint32_t off = 0; off < train_bytes; off += opt_bytes) { - uint32_t chunk = (off + opt_bytes <= train_bytes) ? opt_bytes : (train_bytes - off); + uint32_t chunk = + (off + opt_bytes <= train_bytes) ? 
opt_bytes : (train_bytes - off); l3_aware_copy((char *)DeeployNetwork_inputs[train_w_idx] + off, - DeeployOptNetwork_outputs[opt_w_out], - chunk); + DeeployOptNetwork_outputs[opt_w_out], chunk); } } } @@ -207,23 +215,25 @@ static void run_optimizer_step(void) { * ---------------------------------------------------------------------- */ typedef struct { - float *computed; - float *reference; - uint32_t n; + float *computed; + float *reference; + uint32_t n; uint32_t *err_count; } LossCompareArgs; static void CompareLossesOnCluster(void *args) { - if (pi_core_id() != 0) return; + if (pi_core_id() != 0) + return; LossCompareArgs *a = (LossCompareArgs *)args; - float tol = TRAINING_TOLERANCE_ABS; /* read on cluster — has FPU */ + float tol = TRAINING_TOLERANCE_ABS; /* read on cluster — has FPU */ uint32_t errors = 0; for (uint32_t i = 0; i < a->n; i++) { float diff = a->computed[i] - a->reference[i]; - if (diff < 0.0f) diff = -diff; - printf(" [loss %u] computed=%.6f ref=%.6f diff=%.6f TOL=%.6f\r\n", - i, (double)a->computed[i], (double)a->reference[i], - (double)diff, (double)tol); + if (diff < 0.0f) + diff = -diff; + printf(" [loss %u] computed=%.6f ref=%.6f diff=%.6f TOL=%.6f\r\n", i, + (double)a->computed[i], (double)a->reference[i], (double)diff, + (double)tol); if (diff > tol) { errors++; } @@ -237,16 +247,15 @@ static void CompareLossesOnCluster(void *args) { int main(void) { + printf("=== Siracusa Training Harness (Phase 2 — with OptimizerNetwork) " + "===\r\n"); + printf("N_TRAIN_STEPS=%u N_ACCUM_STEPS=%u DATA_INPUTS=%u\r\n", + (unsigned)N_TRAIN_STEPS, (unsigned)N_ACCUM_STEPS, + (unsigned)TRAINING_NUM_DATA_INPUTS); -printf("=== Siracusa Training Harness (Phase 2 — with OptimizerNetwork) ===\r\n"); -printf("N_TRAIN_STEPS=%u N_ACCUM_STEPS=%u DATA_INPUTS=%u\r\n", - (unsigned)N_TRAIN_STEPS, (unsigned)N_ACCUM_STEPS, - (unsigned)TRAINING_NUM_DATA_INPUTS); - - -// /* ------------------------------------------------------------------ -// * Cluster bring-up -// * 
------------------------------------------------------------------ */ + // /* ------------------------------------------------------------------ + // * Cluster bring-up + // * ------------------------------------------------------------------ */ struct pi_cluster_conf conf; pi_cluster_conf_init(&conf); @@ -268,7 +277,7 @@ printf("N_TRAIN_STEPS=%u N_ACCUM_STEPS=%u DATA_INPUTS=%u\r\n", printf("Initializing TrainingNetwork...\r\n"); pi_cluster_task(&cluster_task, InitTrainingNetwork, NULL); - cluster_task.stack_size = MAINSTACKSIZE; + cluster_task.stack_size = MAINSTACKSIZE; cluster_task.slave_stack_size = SLAVESTACKSIZE; pi_cluster_send_task_to_cl(&cluster_dev, &cluster_task); @@ -276,24 +285,23 @@ printf("N_TRAIN_STEPS=%u N_ACCUM_STEPS=%u DATA_INPUTS=%u\r\n", * Zero-initialise gradient accumulation buffers. * ------------------------------------------------------------------ */ - -for (uint32_t _gi = 0; _gi < (uint32_t)TRAINING_NUM_GRAD_INPUTS; _gi++) { - uint32_t _idx = (uint32_t)TRAINING_GRAD_BUF_START_IDX + _gi; - uint32_t bytes = DeeployNetwork_inputs_bytes[_idx]; - void *buf = DeeployNetwork_inputs[_idx]; - if (IS_L2(buf)) { - memset(buf, 0, bytes); - } else { - /* Write zeros into L3 via DMA using a temporary L2 zero page */ - uint8_t *zero_page = pi_l2_malloc(512); - memset(zero_page, 0, 512); - for (uint32_t off = 0; off < bytes; off += 512) { - uint32_t chunk = (off + 512 <= bytes) ? 
512 : (bytes - off); - ram_write((char *)buf + off, zero_page, chunk); + for (uint32_t _gi = 0; _gi < (uint32_t)TRAINING_NUM_GRAD_INPUTS; _gi++) { + uint32_t _idx = (uint32_t)TRAINING_GRAD_BUF_START_IDX + _gi; + uint32_t bytes = DeeployNetwork_inputs_bytes[_idx]; + void *buf = DeeployNetwork_inputs[_idx]; + if (IS_L2(buf)) { + memset(buf, 0, bytes); + } else { + /* Write zeros into L3 via DMA using a temporary L2 zero page */ + uint8_t *zero_page = pi_l2_malloc(512); + memset(zero_page, 0, 512); + for (uint32_t off = 0; off < bytes; off += 512) { + uint32_t chunk = (off + 512 <= bytes) ? 512 : (bytes - off); + ram_write((char *)buf + off, zero_page, chunk); + } + pi_l2_free(zero_page, 512); } - pi_l2_free(zero_page, 512); } -} /* ------------------------------------------------------------------ * Init optimizer network @@ -301,15 +309,15 @@ for (uint32_t _gi = 0; _gi < (uint32_t)TRAINING_NUM_GRAD_INPUTS; _gi++) { printf("Initializing OptimizerNetwork...\r\n"); pi_cluster_task(&cluster_task, InitOptimizerNetwork, NULL); - cluster_task.stack_size = MAINSTACKSIZE; + cluster_task.stack_size = MAINSTACKSIZE; cluster_task.slave_stack_size = SLAVESTACKSIZE; pi_cluster_send_task_to_cl(&cluster_dev, &cluster_task); -// connect_optimizer_buffers(); + // connect_optimizer_buffers(); -// /* ------------------------------------------------------------------ -// * lazy_reset_grad is the last input of the training network. -// * ------------------------------------------------------------------ */ + // /* ------------------------------------------------------------------ + // * lazy_reset_grad is the last input of the training network. 
+ // * ------------------------------------------------------------------ */ uint32_t reset_idx = DeeployNetwork_num_inputs - 1; @@ -322,15 +330,16 @@ for (uint32_t _gi = 0; _gi < (uint32_t)TRAINING_NUM_GRAD_INPUTS; _gi++) { #if defined(TRAINING_NUM_WEIGHT_INPUTS) && (TRAINING_NUM_WEIGHT_INPUTS > 0) for (uint32_t wi = 0; wi < (uint32_t)TRAINING_NUM_WEIGHT_INPUTS; wi++) { uint32_t idx = (uint32_t)TRAINING_NUM_DATA_INPUTS + wi; - l3_aware_copy(DeeployNetwork_inputs[idx], testInitWeights[wi], DeeployNetwork_inputs_bytes[idx]); + l3_aware_copy(DeeployNetwork_inputs[idx], testInitWeights[wi], + DeeployNetwork_inputs_bytes[idx]); } #endif printf("Starting training (%u optimizer steps x %u accum steps)...\r\n", (unsigned)N_TRAIN_STEPS, (unsigned)N_ACCUM_STEPS); - uint32_t training_cycles = 0; - uint32_t optimizer_cycles = 0; + uint32_t training_cycles = 0; + uint32_t optimizer_cycles = 0; for (uint32_t update_step = 0; update_step < N_TRAIN_STEPS; update_step++) { @@ -339,10 +348,8 @@ for (uint32_t _gi = 0; _gi < (uint32_t)TRAINING_NUM_GRAD_INPUTS; _gi++) { uint32_t mb = update_step * N_ACCUM_STEPS + accum_step; printf(" update %u/%u accum %u/%u (mini-batch %u)\r\n", - update_step + 1, (unsigned)N_TRAIN_STEPS, - accum_step + 1, (unsigned)N_ACCUM_STEPS, - mb); - + update_step + 1, (unsigned)N_TRAIN_STEPS, accum_step + 1, + (unsigned)N_ACCUM_STEPS, mb); /* ① Set lazy_reset_grad. */ { @@ -355,7 +362,8 @@ for (uint32_t _gi = 0; _gi < (uint32_t)TRAINING_NUM_GRAD_INPUTS; _gi++) { } } - /* ② Load this mini-batch's data + labels (cycle through unique samples). */ + /* ② Load this mini-batch's data + labels (cycle through unique samples). + */ for (uint32_t buf = 0; buf < TRAINING_NUM_DATA_INPUTS; buf++) { l3_aware_copy(DeeployNetwork_inputs[buf], testDataVector[mb % TRAINING_DATA_SIZE][buf], @@ -364,7 +372,7 @@ for (uint32_t _gi = 0; _gi < (uint32_t)TRAINING_NUM_GRAD_INPUTS; _gi++) { /* ③ Forward + backward + InPlaceAccumulatorV2. 
*/ pi_cluster_task(&cluster_task, RunTrainingNetwork, NULL); - cluster_task.stack_size = MAINSTACKSIZE; + cluster_task.stack_size = MAINSTACKSIZE; cluster_task.slave_stack_size = SLAVESTACKSIZE; pi_cluster_send_task_to_cl(&cluster_dev, &cluster_task); @@ -389,27 +397,25 @@ for (uint32_t _gi = 0; _gi < (uint32_t)TRAINING_NUM_GRAD_INPUTS; _gi++) { // printf("Total training cycles : %u\r\n", training_cycles); // printf("Total optimizer cycles : %u\r\n", optimizer_cycles); - /* ------------------------------------------------------------------ * Numerical verification — run on cluster (FC has no FPU) * ------------------------------------------------------------------ */ uint32_t loss_err_count = 0; - uint32_t total_loss_checks = (TOTAL_FWD_PASSES < N_LOSS_REFS) ? TOTAL_FWD_PASSES : N_LOSS_REFS; + uint32_t total_loss_checks = + (TOTAL_FWD_PASSES < N_LOSS_REFS) ? TOTAL_FWD_PASSES : N_LOSS_REFS; LossCompareArgs loss_cmp_args = { - .computed = stored_losses, - .reference = (float *)testLossRef, - .n = total_loss_checks, - .err_count = &loss_err_count, + .computed = stored_losses, + .reference = (float *)testLossRef, + .n = total_loss_checks, + .err_count = &loss_err_count, }; pi_cluster_task(&cluster_task, CompareLossesOnCluster, &loss_cmp_args); - cluster_task.stack_size = MAINSTACKSIZE; + cluster_task.stack_size = MAINSTACKSIZE; cluster_task.slave_stack_size = SLAVESTACKSIZE; pi_cluster_send_task_to_cl(&cluster_dev, &cluster_task); - printf("Errors: %u out of %u\r\n", (unsigned)loss_err_count, (unsigned)total_loss_checks); - - + printf("Errors: %u out of %u\r\n", (unsigned)loss_err_count, + (unsigned)total_loss_checks); return 0; - } diff --git a/DeeployTest/deeployTrainingRunner.py b/DeeployTest/deeployTrainingRunner.py index 815d713ad9..7dfc7d965d 100644 --- a/DeeployTest/deeployTrainingRunner.py +++ b/DeeployTest/deeployTrainingRunner.py @@ -22,9 +22,9 @@ if __name__ == '__main__': # Peek at --tiled and -p before passing to main(), which builds its own parser. 
- pre = argparse.ArgumentParser(add_help=False) - pre.add_argument('--tiled', action='store_true', default=False) - pre.add_argument('-p', '--platform', default='Siracusa') + pre = argparse.ArgumentParser(add_help = False) + pre.add_argument('--tiled', action = 'store_true', default = False) + pre.add_argument('-p', '--platform', default = 'Siracusa') known, _ = pre.parse_known_args() - sys.exit(main(tiling_enabled=known.tiled, default_platform=known.platform)) + sys.exit(main(tiling_enabled = known.tiled, default_platform = known.platform)) diff --git a/DeeployTest/generateOptimizerNetwork.py b/DeeployTest/generateOptimizerNetwork.py index b2d3031fe9..567f8e1a1e 100644 --- a/DeeployTest/generateOptimizerNetwork.py +++ b/DeeployTest/generateOptimizerNetwork.py @@ -25,7 +25,6 @@ import sys from pathlib import Path -import numpy as np import onnx import onnx_graphsurgeon as gs from testUtils.codeGenerate import build_shared_buffer_maps, generateOptimizerTestNetwork @@ -72,18 +71,18 @@ def generateOptimizerNetwork(args): deployer = mapDeployer(platform, graph, inputTypes, - name="DeeployOptimizerNetwork", - deeployStateDir=_DEEPLOYSTATEDIR, - inputOffsets=inputOffsets) + name = "DeeployOptimizerNetwork", + deeployStateDir = _DEEPLOYSTATEDIR, + inputOffsets = inputOffsets) # Set up memory hierarchy so AnnotateDefaultMemoryLevel assigns the correct # memory level to ConstantBuffers (weights). The optimizer graph is NOT # tiled, but it must share the same memory-level view as the training graph # so that weights end up in the same physical location (L2 when L3 is the # training default, see AnnotateDefaultMemoryLevel). 
- L3 = MemoryLevel(name="L3", neighbourNames=["L2"], size=64000000) - L2 = MemoryLevel(name="L2", neighbourNames=["L3", "L1"], size=args.l2) - L1 = MemoryLevel(name="L1", neighbourNames=["L2"], size=args.l1) + L3 = MemoryLevel(name = "L3", neighbourNames = ["L2"], size = 64000000) + L2 = MemoryLevel(name = "L2", neighbourNames = ["L3", "L1"], size = args.l2) + L1 = MemoryLevel(name = "L1", neighbourNames = ["L2"], size = args.l1) memoryHierarchy = MemoryHierarchy([L3, L2, L1]) memoryHierarchy.setDefaultMemoryLevel(args.defaultMemLevel) defaultTargetMemoryLevel = memoryHierarchy.memoryLevels[args.defaultMemLevel] @@ -110,41 +109,44 @@ def generateOptimizerNetwork(args): "generating standalone OptimizerNetwork (no buffer sharing)") # 6. Generate OptimizerNetwork.c / OptimizerNetwork.h - os.makedirs(args.dumpdir, exist_ok=True) + os.makedirs(args.dumpdir, exist_ok = True) generateOptimizerTestNetwork(deployer, args.dumpdir, verbosityCfg, shared_input_map, shared_output_map) log.info(f"Optimizer network code generated in: {args.dumpdir}") print(f"[OptimizerNetwork] Generated OptimizerNetwork.c/h in {args.dumpdir}") + if __name__ == '__main__': - parser = TestGeneratorArgumentParser(description="Deeploy Optimizer Network Code Generation.") + parser = TestGeneratorArgumentParser(description = "Deeploy Optimizer Network Code Generation.") parser.add_argument( "--cores", - type=int, - default=1, - help="Number of cluster cores. Default: 1.", + type = int, + default = 1, + help = "Number of cluster cores. Default: 1.", ) parser.add_argument( "--lr", - type=float, - default=0.001, - help="Learning rate (informational only; embedded in optimizer ONNX attributes). Default: 0.001.", + type = float, + default = 0.001, + help = "Learning rate (informational only; embedded in optimizer ONNX attributes). Default: 0.001.", ) - parser.add_argument("--defaultMemLevel", type=str, default="L2", - help="Default memory level (L2 or L3). Must match the training graph. 
Default: L2.") - parser.add_argument("--l1", type=int, default=64000, help="L1 size in bytes. Default: 64000.") - parser.add_argument("--l2", type=int, default=1024000, help="L2 size in bytes. Default: 1024000.") + parser.add_argument("--defaultMemLevel", + type = str, + default = "L2", + help = "Default memory level (L2 or L3). Must match the training graph. Default: L2.") + parser.add_argument("--l1", type = int, default = 64000, help = "L1 size in bytes. Default: 64000.") + parser.add_argument("--l2", type = int, default = 1024000, help = "L2 size in bytes. Default: 1024000.") parser.add_argument( "--training-dir", - type=str, - default=None, - help="Directory containing the training network.onnx. When provided, " - "weight and grad-acc buffers are shared with TrainingNetwork instead " - "of being allocated independently.", + type = str, + default = None, + help = "Directory containing the training network.onnx. When provided, " + "weight and grad-acc buffers are shared with TrainingNetwork instead " + "of being allocated independently.", ) - parser.add_argument('--shouldFail', action='store_true') - parser.set_defaults(shouldFail=False) + parser.add_argument('--shouldFail', action = 'store_true') + parser.set_defaults(shouldFail = False) args = parser.parse_args() diff --git a/DeeployTest/generateTrainingNetwork.py b/DeeployTest/generateTrainingNetwork.py index d27e74aba8..bab1c33b36 100644 --- a/DeeployTest/generateTrainingNetwork.py +++ b/DeeployTest/generateTrainingNetwork.py @@ -18,7 +18,7 @@ from Deeploy.CommonExtensions.DataTypes import float32_t, uint8_t from Deeploy.DeeployTypes import _NoVerbosity from Deeploy.Logging import DEFAULT_LOGGER as log -from Deeploy.Targets.PULPOpen.Platform import PULPClusterEngine, PULPPlatform +from Deeploy.Targets.PULPOpen.Platform import PULPClusterEngine _GRAD_ACC = "_grad.accumulation.buffer" @@ -60,10 +60,8 @@ def _infer_num_data_inputs(inputs_path: str) -> int: base_keys = sorted(k for k in inputs.files if not 
k.startswith('mb') and not k.startswith('meta_')) count = sum(1 for k in base_keys if f'mb1_{k}' in inputs.files) if count == 0: - raise ValueError( - "Cannot auto-detect num_data_inputs: inputs.npz has only one mini-batch " - "(no mb1_arr_* entries found). Please pass --num-data-inputs explicitly." - ) + raise ValueError("Cannot auto-detect num_data_inputs: inputs.npz has only one mini-batch " + "(no mb1_arr_* entries found). Please pass --num-data-inputs explicitly.") return count @@ -127,8 +125,7 @@ def generateTrainingNetwork(args): _stripped = False _patched = False for node in graph.nodes: - filtered = [out for out in node.outputs - if not (out.dtype == 0 and len(out.outputs) == 0)] + filtered = [out for out in node.outputs if not (out.dtype == 0 and len(out.outputs) == 0)] if len(filtered) < len(node.outputs): node.outputs = filtered _stripped = True @@ -166,10 +163,9 @@ def generateTrainingNetwork(args): npz_base = [inputs[k] for k in base_keys] if len(npz_base) != len(non_grad_indices): - raise ValueError( - f"inputs.npz has {len(npz_base)} base entries but network.onnx has " - f"{len(non_grad_indices)} non-grad-buf inputs. " - f"Re-generate inputs.npz with the updated exporter.") + raise ValueError(f"inputs.npz has {len(npz_base)} base entries but network.onnx has " + f"{len(non_grad_indices)} non-grad-buf inputs. " + f"Re-generate inputs.npz with the updated exporter.") # Build inputTypes / inputOffsets for ALL graph input positions. 
inputTypes = {} @@ -197,7 +193,7 @@ def generateTrainingNetwork(args): pass else: values = arr.reshape(-1).astype(np.float32) - _type, offset = inferTypeAndOffset(values, signProp=False) + _type, offset = inferTypeAndOffset(values, signProp = False) inputTypes[f"input_{graph_idx}"] = _type inputOffsets[f"input_{graph_idx}"] = offset @@ -207,9 +203,9 @@ def generateTrainingNetwork(args): deployer = mapDeployer(platform, graph, inputTypes, - name="DeeployTrainingNetwork", - deeployStateDir=_DEEPLOYSTATEDIR, - inputOffsets=inputOffsets) + name = "DeeployTrainingNetwork", + deeployStateDir = _DEEPLOYSTATEDIR, + inputOffsets = inputOffsets) log.debug(f"Deployer: {deployer}") @@ -278,22 +274,22 @@ def generateTrainingNetwork(args): reference_losses = _load_reference_losses(args.dir) # 10. Generate all output files - os.makedirs(args.dumpdir, exist_ok=True) + os.makedirs(args.dumpdir, exist_ok = True) generateTrainingTestNetwork(deployer, unique_mb_data, args.dumpdir, verbosityCfg, - n_steps=n_steps, - n_accum=n_accum, - num_data_inputs=num_data, - grad_buf_start_idx=grad_buf_start_idx, - num_grad_inputs=num_grad_inputs, - learning_rate=args.learning_rate, - reference_losses=reference_losses, - init_weights=init_weights, - data_size=data_size, - tolerance_abs=args.tolerance_abs) + n_steps = n_steps, + n_accum = n_accum, + num_data_inputs = num_data, + grad_buf_start_idx = grad_buf_start_idx, + num_grad_inputs = num_grad_inputs, + learning_rate = args.learning_rate, + reference_losses = reference_losses, + init_weights = init_weights, + data_size = data_size, + tolerance_abs = args.tolerance_abs) # 11. Write resolved config for execution.py to pick up after subprocess call. 
meta = { @@ -303,60 +299,60 @@ def generateTrainingNetwork(args): } meta_path = os.path.join(args.dumpdir, "training_meta.json") with open(meta_path, 'w') as f: - json.dump(meta, f, indent=2) + json.dump(meta, f, indent = 2) log.info(f"Training meta written to {meta_path}: {meta}") if __name__ == '__main__': - parser = TestGeneratorArgumentParser(description="Deeploy Training Code Generation Utility.") + parser = TestGeneratorArgumentParser(description = "Deeploy Training Code Generation Utility.") parser.add_argument( "--cores", - type=int, - default=1, - help="Number of cores on which the network is run. " + type = int, + default = 1, + help = "Number of cores on which the network is run. " "Currently required for im2col buffer sizing on Siracusa. Default: 1.", ) parser.add_argument( "--num-data-inputs", - type=int, - dest="num_data_inputs", - default=None, - help="Number of DATA inputs that change per mini-batch. " + type = int, + dest = "num_data_inputs", + default = None, + help = "Number of DATA inputs that change per mini-batch. " "Auto-detected from ONNX graph if not specified.", ) parser.add_argument( "--n-steps", - type=int, - dest="n_steps", - default=None, - help="N_TRAIN_STEPS: number of gradient-accumulation update steps. " + type = int, + dest = "n_steps", + default = None, + help = "N_TRAIN_STEPS: number of gradient-accumulation update steps. " "Auto-detected from inputs.npz mini-batch count if not specified.", ) parser.add_argument( "--n-accum", - type=int, - dest="n_accum", - default=None, - help="N_ACCUM_STEPS: number of mini-batches per update step. " + type = int, + dest = "n_accum", + default = None, + help = "N_ACCUM_STEPS: number of mini-batches per update step. " "Auto-detected from inputs.npz mini-batch count if not specified.", ) parser.add_argument( "--learning-rate", - type=float, - dest="learning_rate", - default=0.001, - help="SGD learning rate emitted as TRAINING_LEARNING_RATE in testinputs.h. 
Default: 0.001.", + type = float, + dest = "learning_rate", + default = 0.001, + help = "SGD learning rate emitted as TRAINING_LEARNING_RATE in testinputs.h. Default: 0.001.", ) parser.add_argument( "--tolerance", - type=float, - dest="tolerance_abs", - default=1e-3, - help="Absolute loss tolerance emitted as TRAINING_TOLERANCE_ABS in testoutputs.h. Default: 1e-3.", + type = float, + dest = "tolerance_abs", + default = 1e-3, + help = "Absolute loss tolerance emitted as TRAINING_TOLERANCE_ABS in testoutputs.h. Default: 1e-3.", ) - parser.add_argument('--shouldFail', action='store_true') - parser.set_defaults(shouldFail=False) + parser.add_argument('--shouldFail', action = 'store_true') + parser.set_defaults(shouldFail = False) args = parser.parse_args() diff --git a/DeeployTest/testMVPOptimizer.py b/DeeployTest/testMVPOptimizer.py index 9e29d79c55..3fdf4faae6 100644 --- a/DeeployTest/testMVPOptimizer.py +++ b/DeeployTest/testMVPOptimizer.py @@ -84,16 +84,16 @@ def generateTiledOptimizerNetwork(args) -> None: deployer = mapDeployer(platform, graph, inputTypes, - name="DeeployOptimizerNetwork", - deeployStateDir=_DEEPLOYSTATEDIR, - inputOffsets=inputOffsets, - scheduler=_mockScheduler) + name = "DeeployOptimizerNetwork", + deeployStateDir = _DEEPLOYSTATEDIR, + inputOffsets = inputOffsets, + scheduler = _mockScheduler) # 5. Set up memory hierarchy. # Tiles execute in L1; optimizer I/O (weights, grads) live in L2 (or L3). 
- L3 = MemoryLevel(name="L3", neighbourNames=["L2"], size=64_000_000) - L2 = MemoryLevel(name="L2", neighbourNames=["L3", "L1"], size=args.l2) - L1 = MemoryLevel(name="L1", neighbourNames=["L2"], size=args.l1) + L3 = MemoryLevel(name = "L3", neighbourNames = ["L2"], size = 64_000_000) + L2 = MemoryLevel(name = "L2", neighbourNames = ["L3", "L1"], size = args.l2) + L1 = MemoryLevel(name = "L1", neighbourNames = ["L2"], size = args.l1) memoryHierarchy = MemoryHierarchy([L3, L2, L1]) memoryHierarchy.setDefaultMemoryLevel(args.defaultMemLevel) @@ -115,7 +115,7 @@ def generateTiledOptimizerNetwork(args) -> None: # schedule (via TrainingMemoryScheduler). This prevents the allocator from # reusing the space of a consumed input (e.g. fc1 weight) for a later # output (e.g. fc2 updated weight), which would corrupt the weight buffer. - deployer = TilerDeployerWrapper(deployer, TrainingSBTiler, testName=testIdentifier, workDir=args.dumpdir) + deployer = TilerDeployerWrapper(deployer, TrainingSBTiler, testName = testIdentifier, workDir = args.dumpdir) deployer.tiler.visualizeMemoryAlloc = args.plotMemAlloc deployer.tiler.memoryAllocStrategy = args.memAllocStrategy deployer.tiler.searchStrategy = args.searchStrategy @@ -123,7 +123,7 @@ def generateTiledOptimizerNetwork(args) -> None: # 8. Prepare deployer. verbosityCfg = _NoVerbosity if args.profileTiling: - verbosityCfg = CodeGenVerbosity(tilingProfiling=True) + verbosityCfg = CodeGenVerbosity(tilingProfiling = True) _ = deployer.prepare(verbosityCfg) # 9. Build shared-buffer maps when the training ONNX is available @@ -142,7 +142,7 @@ def generateTiledOptimizerNetwork(args) -> None: "generating standalone OptimizerNetwork (no buffer sharing)") # 10. 
Generate OptimizerNetwork.c / OptimizerNetwork.h - os.makedirs(args.dumpdir, exist_ok=True) + os.makedirs(args.dumpdir, exist_ok = True) generateOptimizerTestNetwork(deployer, args.dumpdir, verbosityCfg, shared_input_map, shared_output_map) log.info(f"Tiled optimizer network code generated in: {args.dumpdir}") @@ -151,75 +151,75 @@ def generateTiledOptimizerNetwork(args) -> None: if __name__ == '__main__': - parser = TestGeneratorArgumentParser(description="Deeploy Tiled Optimizer Network Code Generation.") + parser = TestGeneratorArgumentParser(description = "Deeploy Tiled Optimizer Network Code Generation.") parser.add_argument( "--cores", - type=int, - default=1, - help="Number of cluster cores. Default: 1.", + type = int, + default = 1, + help = "Number of cluster cores. Default: 1.", ) parser.add_argument( "--lr", - type=float, - default=0.001, - help="Learning rate (informational only; embedded in optimizer ONNX attributes). Default: 0.001.", + type = float, + default = 0.001, + help = "Learning rate (informational only; embedded in optimizer ONNX attributes). Default: 0.001.", ) parser.add_argument( '--l1', - type=int, - dest='l1', - default=64_000, - help='L1 size in bytes. Default: 64000.', + type = int, + dest = 'l1', + default = 64_000, + help = 'L1 size in bytes. Default: 64000.', ) parser.add_argument( '--l2', - type=int, - dest='l2', - default=1_024_000, - help='L2 size in bytes. Default: 1024000.', + type = int, + dest = 'l2', + default = 1_024_000, + help = 'L2 size in bytes. Default: 1024000.', ) parser.add_argument( '--defaultMemLevel', - type=str, - dest='defaultMemLevel', - default="L2", - help='Default memory level for optimizer I/O buffers (L2 or L3). Must match the training graph. Default: L2.', + type = str, + dest = 'defaultMemLevel', + default = "L2", + help = 'Default memory level for optimizer I/O buffers (L2 or L3). Must match the training graph. 
Default: L2.', ) parser.add_argument( '--memAllocStrategy', - type=str, - dest='memAllocStrategy', - default="MiniMalloc", - help='Memory allocation strategy. Default: MiniMalloc.', + type = str, + dest = 'memAllocStrategy', + default = "MiniMalloc", + help = 'Memory allocation strategy. Default: MiniMalloc.', ) parser.add_argument( '--searchStrategy', - type=str, - dest='searchStrategy', - default="random-max", - help='CP solver search strategy. Default: random-max.', + type = str, + dest = 'searchStrategy', + default = "random-max", + help = 'CP solver search strategy. Default: random-max.', ) parser.add_argument( '--plotMemAlloc', - action='store_true', - help='Save memory allocation plots in the deeployStates folder.', + action = 'store_true', + help = 'Save memory allocation plots in the deeployStates folder.', ) parser.add_argument( '--profileTiling', - action='store_true', - help='Enable tiling profiling (inserts cycle counters around each tiled kernel).', + action = 'store_true', + help = 'Enable tiling profiling (inserts cycle counters around each tiled kernel).', ) parser.add_argument( "--training-dir", - type=str, - default=None, - help="Directory containing the training network.onnx. When provided, " - "weight and grad-acc buffers are shared with TrainingNetwork instead " - "of being allocated independently.", + type = str, + default = None, + help = "Directory containing the training network.onnx. 
When provided, " + "weight and grad-acc buffers are shared with TrainingNetwork instead " + "of being allocated independently.", ) - parser.add_argument('--shouldFail', action='store_true') - parser.set_defaults(shouldFail=False) + parser.add_argument('--shouldFail', action = 'store_true') + parser.set_defaults(shouldFail = False) args = parser.parse_args() diff --git a/DeeployTest/testMVPTraining.py b/DeeployTest/testMVPTraining.py index 30b23dd1e3..438e6985ce 100644 --- a/DeeployTest/testMVPTraining.py +++ b/DeeployTest/testMVPTraining.py @@ -19,7 +19,7 @@ from Deeploy.AbstractDataTypes import PointerClass from Deeploy.CommonExtensions.DataTypes import float32_t, uint8_t -from Deeploy.DeeployTypes import CodeGenVerbosity, NetworkDeployer, _NoVerbosity +from Deeploy.DeeployTypes import CodeGenVerbosity, _NoVerbosity from Deeploy.Logging import DEFAULT_LOGGER as log from Deeploy.MemoryLevelExtension.MemoryLevels import MemoryHierarchy, MemoryLevel from Deeploy.MemoryLevelExtension.NetworkDeployers.MemoryLevelDeployer import MemoryDeployerWrapper @@ -30,11 +30,11 @@ _GRAD_ACC = "_grad.accumulation.buffer" - # --------------------------------------------------------------------------- # Helpers copied from generateTrainingNetwork.py # --------------------------------------------------------------------------- + def _load_reference_losses(train_dir: str) -> list: """Load reference loss values from outputs.npz.""" outputs_path = os.path.join(train_dir, "outputs.npz") @@ -60,9 +60,8 @@ def _infer_num_data_inputs(inputs_path: str) -> int: base_keys = sorted(k for k in inputs.files if not k.startswith('mb') and not k.startswith('meta_')) count = sum(1 for k in base_keys if f'mb1_{k}' in inputs.files) if count == 0: - raise ValueError( - "Cannot auto-detect num_data_inputs: inputs.npz has only one mini-batch " - "(no mb1_arr_* entries found). 
Please pass --num-data-inputs explicitly.") + raise ValueError("Cannot auto-detect num_data_inputs: inputs.npz has only one mini-batch " + "(no mb1_arr_* entries found). Please pass --num-data-inputs explicitly.") return count @@ -99,6 +98,7 @@ def _infer_n_accum(inputs_path: str) -> int: # Mock scheduler (same as testMVP.py) # --------------------------------------------------------------------------- + def _mockScheduler(graph: gs.Graph) -> List[List[gs.Node]]: """Wrap every node in a singleton list for the Tiler pattern interface.""" return [[node] for node in graph.nodes] @@ -108,6 +108,7 @@ def _mockScheduler(graph: gs.Graph) -> List[List[gs.Node]]: # Main generation function # --------------------------------------------------------------------------- + def generateTiledTrainingNetwork(args) -> None: log.debug("Arguments: %s", args) @@ -147,10 +148,9 @@ def generateTiledTrainingNetwork(args) -> None: npz_base = [inputs[k] for k in base_keys] if len(npz_base) != len(non_grad_indices): - raise ValueError( - f"inputs.npz has {len(npz_base)} base entries but network.onnx has " - f"{len(non_grad_indices)} non-grad-buf inputs. " - f"Re-generate inputs.npz with the updated exporter.") + raise ValueError(f"inputs.npz has {len(npz_base)} base entries but network.onnx has " + f"{len(non_grad_indices)} non-grad-buf inputs. " + f"Re-generate inputs.npz with the updated exporter.") # 5. Build inputTypes / inputOffsets for ALL graph input positions. 
inputTypes = {} @@ -174,7 +174,7 @@ def generateTiledTrainingNetwork(args) -> None: pass else: values = arr.reshape(-1).astype(np.float32) - _type, offset = inferTypeAndOffset(values, signProp=False) + _type, offset = inferTypeAndOffset(values, signProp = False) inputTypes[f"input_{graph_idx}"] = _type inputOffsets[f"input_{graph_idx}"] = offset @@ -184,15 +184,15 @@ def generateTiledTrainingNetwork(args) -> None: deployer = mapDeployer(platform, graph, inputTypes, - name="DeeployTrainingNetwork", - deeployStateDir=_DEEPLOYSTATEDIR, - inputOffsets=inputOffsets, - scheduler=_mockScheduler) + name = "DeeployTrainingNetwork", + deeployStateDir = _DEEPLOYSTATEDIR, + inputOffsets = inputOffsets, + scheduler = _mockScheduler) # 7. Set up memory hierarchy. - L3 = MemoryLevel(name="L3", neighbourNames=["L2"], size=64_000_000) - L2 = MemoryLevel(name="L2", neighbourNames=["L3", "L1"], size=args.l2) - L1 = MemoryLevel(name="L1", neighbourNames=["L2"], size=args.l1) + L3 = MemoryLevel(name = "L3", neighbourNames = ["L2"], size = 64_000_000) + L2 = MemoryLevel(name = "L2", neighbourNames = ["L3", "L1"], size = args.l2) + L1 = MemoryLevel(name = "L1", neighbourNames = ["L2"], size = args.l1) memoryHierarchy = MemoryHierarchy([L3, L2, L1]) memoryHierarchy.setDefaultMemoryLevel(args.defaultMemLevel) @@ -211,7 +211,7 @@ def generateTiledTrainingNetwork(args) -> None: unique_params = f"{args.dumpdir}_L1{args.l1}_L2{args.l2}_{args.defaultMemLevel}" testIdentifier = hashlib.md5(unique_params.encode()).hexdigest()[:16] - deployer = TilerDeployerWrapper(deployer, TrainingSBTiler, testName=testIdentifier, workDir=args.dumpdir) + deployer = TilerDeployerWrapper(deployer, TrainingSBTiler, testName = testIdentifier, workDir = args.dumpdir) deployer.tiler.visualizeMemoryAlloc = args.plotMemAlloc deployer.tiler.memoryAllocStrategy = args.memAllocStrategy deployer.tiler.searchStrategy = args.searchStrategy @@ -277,22 +277,22 @@ def generateTiledTrainingNetwork(args) -> None: reference_losses 
= _load_reference_losses(args.dir) # 14. Generate output files. - os.makedirs(args.dumpdir, exist_ok=True) + os.makedirs(args.dumpdir, exist_ok = True) generateTrainingTestNetwork(deployer, unique_mb_data, args.dumpdir, verbosityCfg, - n_steps=n_steps, - n_accum=n_accum, - num_data_inputs=num_data, - grad_buf_start_idx=grad_buf_start_idx, - num_grad_inputs=num_grad_inputs, - learning_rate=args.learning_rate, - reference_losses=reference_losses, - init_weights=init_weights, - data_size=data_size, - tolerance_abs=args.tolerance_abs) + n_steps = n_steps, + n_accum = n_accum, + num_data_inputs = num_data, + grad_buf_start_idx = grad_buf_start_idx, + num_grad_inputs = num_grad_inputs, + learning_rate = args.learning_rate, + reference_losses = reference_losses, + init_weights = init_weights, + data_size = data_size, + tolerance_abs = args.tolerance_abs) # 15. Write resolved config for execution.py to pick up. meta = { @@ -302,7 +302,7 @@ def generateTiledTrainingNetwork(args) -> None: } meta_path = os.path.join(args.dumpdir, "training_meta.json") with open(meta_path, 'w') as f: - json.dump(meta, f, indent=2) + json.dump(meta, f, indent = 2) log.info(f"Training meta written to {meta_path}: {meta}") @@ -312,99 +312,99 @@ def generateTiledTrainingNetwork(args) -> None: if __name__ == '__main__': - parser = TestGeneratorArgumentParser(description="Deeploy Tiled Training Code Generation Utility.") + parser = TestGeneratorArgumentParser(description = "Deeploy Tiled Training Code Generation Utility.") # Training params (same as generateTrainingNetwork.py) parser.add_argument( "--cores", - type=int, - default=1, - help="Number of cores on which the network is run. Default: 1.", + type = int, + default = 1, + help = "Number of cores on which the network is run. Default: 1.", ) parser.add_argument( "--num-data-inputs", - type=int, - dest="num_data_inputs", - default=None, - help="Number of DATA inputs that change per mini-batch. 
Auto-detected if not specified.", + type = int, + dest = "num_data_inputs", + default = None, + help = "Number of DATA inputs that change per mini-batch. Auto-detected if not specified.", ) parser.add_argument( "--n-steps", - type=int, - dest="n_steps", - default=None, - help="N_TRAIN_STEPS: number of gradient-accumulation update steps.", + type = int, + dest = "n_steps", + default = None, + help = "N_TRAIN_STEPS: number of gradient-accumulation update steps.", ) parser.add_argument( "--n-accum", - type=int, - dest="n_accum", - default=None, - help="N_ACCUM_STEPS: number of mini-batches per update step.", + type = int, + dest = "n_accum", + default = None, + help = "N_ACCUM_STEPS: number of mini-batches per update step.", ) parser.add_argument( "--learning-rate", - type=float, - dest="learning_rate", - default=0.001, - help="SGD learning rate emitted as TRAINING_LEARNING_RATE in testinputs.h. Default: 0.001.", + type = float, + dest = "learning_rate", + default = 0.001, + help = "SGD learning rate emitted as TRAINING_LEARNING_RATE in testinputs.h. Default: 0.001.", ) # Tiling params (same as testMVP.py) parser.add_argument( '--l1', - type=int, - dest='l1', - default=64_000, - help='Set L1 size in bytes. Default: 64000.', + type = int, + dest = 'l1', + default = 64_000, + help = 'Set L1 size in bytes. Default: 64000.', ) parser.add_argument( '--l2', - type=int, - dest='l2', - default=1_024_000, - help='Set L2 size in bytes. Default: 1024000.', + type = int, + dest = 'l2', + default = 1_024_000, + help = 'Set L2 size in bytes. Default: 1024000.', ) parser.add_argument( '--defaultMemLevel', - type=str, - dest='defaultMemLevel', - default="L2", - help='Default memory level for IO buffers. Default: L2.', + type = str, + dest = 'defaultMemLevel', + default = "L2", + help = 'Default memory level for IO buffers. Default: L2.', ) parser.add_argument( '--memAllocStrategy', - type=str, - dest='memAllocStrategy', - default="MiniMalloc", - help='Memory allocation strategy. 
Default: MiniMalloc.', + type = str, + dest = 'memAllocStrategy', + default = "MiniMalloc", + help = 'Memory allocation strategy. Default: MiniMalloc.', ) parser.add_argument( '--searchStrategy', - type=str, - dest='searchStrategy', - default="random-max", - help='CP solver search strategy. Default: random-max.', + type = str, + dest = 'searchStrategy', + default = "random-max", + help = 'CP solver search strategy. Default: random-max.', ) parser.add_argument( '--plotMemAlloc', - action='store_true', - help='Save memory allocation plots in the deeployStates folder.', + action = 'store_true', + help = 'Save memory allocation plots in the deeployStates folder.', ) parser.add_argument( '--profileTiling', - action='store_true', - help='Enable tiling profiling (inserts cycle counters around each tiled kernel).', + action = 'store_true', + help = 'Enable tiling profiling (inserts cycle counters around each tiled kernel).', ) parser.add_argument( '--tolerance', - type=float, - dest='tolerance_abs', - default=1e-3, - help='Absolute loss tolerance emitted as TRAINING_TOLERANCE_ABS in testoutputs.h. Default: 1e-3.', + type = float, + dest = 'tolerance_abs', + default = 1e-3, + help = 'Absolute loss tolerance emitted as TRAINING_TOLERANCE_ABS in testoutputs.h. 
Default: 1e-3.', ) - parser.add_argument('--shouldFail', action='store_true') - parser.set_defaults(shouldFail=False) + parser.add_argument('--shouldFail', action = 'store_true') + parser.set_defaults(shouldFail = False) args = parser.parse_args() diff --git a/DeeployTest/testUtils/codeGenerate.py b/DeeployTest/testUtils/codeGenerate.py index ea73d320e1..aa18f155b2 100644 --- a/DeeployTest/testUtils/codeGenerate.py +++ b/DeeployTest/testUtils/codeGenerate.py @@ -4,7 +4,6 @@ import os import re -from pathlib import Path from typing import Dict, List, Optional, Tuple import numpy as np @@ -201,7 +200,8 @@ def generateTestNetworkImplementation(deployer: NetworkDeployer, verbosityCfg: C output_idx = 0 while deployer.ctxt.is_buffer(f'output_{output_idx}'): output_buffer = deployer.ctxt.lookup(f'output_{output_idx}') - output_size = np.prod(output_buffer.shape) if hasattr(output_buffer, 'shape') else output_buffer._type.referencedType.typeWidth + output_size = np.prod(output_buffer.shape) if hasattr(output_buffer, + 'shape') else output_buffer._type.referencedType.typeWidth typeName = output_buffer._type.referencedType.typeName output_idx += 1 @@ -305,9 +305,14 @@ def generateTestNetwork(deployer: NetworkDeployer, test_inputs: List[np.ndarray] # --------------------------------------------------------------------------- -def generateTrainingTestInputsHeader(deployer: NetworkDeployer, all_mb_data: List[List[np.ndarray]], n_steps: int, - n_accum: int, grad_buf_start_idx: int = 0, num_grad_inputs: int = 0, - learning_rate: float = 0.001, init_weights: List[np.ndarray] = None, +def generateTrainingTestInputsHeader(deployer: NetworkDeployer, + all_mb_data: List[List[np.ndarray]], + n_steps: int, + n_accum: int, + grad_buf_start_idx: int = 0, + num_grad_inputs: int = 0, + learning_rate: float = 0.001, + init_weights: List[np.ndarray] = None, data_size: int = None) -> str: """Generate testinputs.h for training tests. 
@@ -389,9 +394,8 @@ def generateTrainingTestInputsHeader(deployer: NetworkDeployer, all_mb_data: Lis # Format values if typeName == 'float32_t': - list_str = ", ".join([ - f'{float(x)}f' if not (np.isinf(x) or np.isnan(x)) else str(x) for x in values.astype(np.float32) - ]) + list_str = ", ".join( + [f'{float(x)}f' if not (np.isinf(x) or np.isnan(x)) else str(x) for x in values.astype(np.float32)]) else: list_str = ", ".join([str(x) for x in values]) @@ -594,11 +598,19 @@ def generateTrainingNetworkImplementation(deployer: NetworkDeployer, verbosityCf return retStr -def generateTrainingTestNetwork(deployer: NetworkDeployer, all_mb_data: List[List[np.ndarray]], dumpdir: str, - verbosityCfg: CodeGenVerbosity, n_steps: int = 1, n_accum: int = 1, - num_data_inputs: int = 2, grad_buf_start_idx: int = 0, num_grad_inputs: int = 0, - learning_rate: float = 0.001, reference_losses: List = None, - init_weights: List = None, data_size: int = None, +def generateTrainingTestNetwork(deployer: NetworkDeployer, + all_mb_data: List[List[np.ndarray]], + dumpdir: str, + verbosityCfg: CodeGenVerbosity, + n_steps: int = 1, + n_accum: int = 1, + num_data_inputs: int = 2, + grad_buf_start_idx: int = 0, + num_grad_inputs: int = 0, + learning_rate: float = 0.001, + reference_losses: List = None, + init_weights: List = None, + data_size: int = None, tolerance_abs: float = 1e-3) -> None: """Generate all training test files: testinputs.h, testoutputs.h, TrainingNetwork.h, TrainingNetwork.c. 
@@ -626,19 +638,25 @@ def generateTrainingTestNetwork(deployer: NetworkDeployer, all_mb_data: List[Lis """ assert deployer.prepared, "An unprepared deployer was given" - os.makedirs(dumpdir, exist_ok=True) + os.makedirs(dumpdir, exist_ok = True) # testinputs.h - testInputStr = generateTrainingTestInputsHeader(deployer, all_mb_data, n_steps, n_accum, grad_buf_start_idx, - num_grad_inputs, learning_rate, init_weights=init_weights, - data_size=data_size) + testInputStr = generateTrainingTestInputsHeader(deployer, + all_mb_data, + n_steps, + n_accum, + grad_buf_start_idx, + num_grad_inputs, + learning_rate, + init_weights = init_weights, + data_size = data_size) with open(f'{dumpdir}/testinputs.h', 'w') as f: f.write(testInputStr) # testoutputs.h testOutputStr = generateTrainingTestOutputsHeader( - reference_losses=reference_losses, - tolerance_abs=tolerance_abs, + reference_losses = reference_losses, + tolerance_abs = tolerance_abs, ) with open(f'{dumpdir}/testoutputs.h', 'w') as f: f.write(testOutputStr) @@ -666,28 +684,28 @@ def generateTrainingTestNetwork(deployer: NetworkDeployer, all_mb_data: List[Lis # [last] → lazy_reset_grad = 1 (uint8) l3_initial_inputs: List[np.ndarray] = [] # Count how many input_N buffers exist in the deployer context - n_total_inputs = sum(1 for name in deployer.ctxt.globalObjects - if name.startswith("input_") and name[len("input_"):].isdigit()) + n_total_inputs = sum( + 1 for name in deployer.ctxt.globalObjects if name.startswith("input_") and name[len("input_"):].isdigit()) for i in range(n_total_inputs): if all_mb_data and i < len(all_mb_data[0]): # Data / label input l3_initial_inputs.append(all_mb_data[0][i]) - elif (init_weights is not None and grad_buf_start_idx > 0 - and num_data_inputs <= i < grad_buf_start_idx): + elif (init_weights is not None and grad_buf_start_idx > 0 and num_data_inputs <= i < grad_buf_start_idx): # Weight input wi = i - num_data_inputs - l3_initial_inputs.append(init_weights[wi] if wi < len(init_weights) 
else np.array([0.0], dtype=np.float32)) + l3_initial_inputs.append(init_weights[wi] if wi < + len(init_weights) else np.array([0.0], dtype = np.float32)) elif (grad_buf_start_idx > 0 and num_grad_inputs > 0 and grad_buf_start_idx <= i < grad_buf_start_idx + num_grad_inputs): # Gradient accumulation buffer — zero-initialised buf = deployer.ctxt.globalObjects.get(f"input_{i}") shape = buf.shape if (buf is not None and hasattr(buf, 'shape')) else (1,) - l3_initial_inputs.append(np.zeros(shape, dtype=np.float32)) + l3_initial_inputs.append(np.zeros(shape, dtype = np.float32)) else: # lazy_reset_grad (last input) or any unknown slot — default 1 / uint8 buf = deployer.ctxt.globalObjects.get(f"input_{i}") shape = buf.shape if (buf is not None and hasattr(buf, 'shape')) else (1,) - l3_initial_inputs.append(np.ones(shape, dtype=np.uint8)) + l3_initial_inputs.append(np.ones(shape, dtype = np.uint8)) generateL3HexDump(deployer, os.path.join(dumpdir, 'hex'), l3_initial_inputs, []) @@ -742,7 +760,7 @@ def build_shared_buffer_maps(train_onnx_path: str, opt_onnx_model) -> Tuple[Dict # node appends to output tensor names (e.g. 'conv1_weight_updated' → 'conv1_weight'). 
lookup_name = name if lookup_name not in train_name_to_idx and lookup_name.endswith('_updated'): - lookup_name = lookup_name[: -len('_updated')] + lookup_name = lookup_name[:-len('_updated')] if lookup_name in train_name_to_idx: shared_output_map[opt_idx] = train_name_to_idx[lookup_name] @@ -795,17 +813,14 @@ def _patch_shared_buffers(retStr: str, shared_input_map: Dict[int, int], shared_ # Pattern 1 (non-tiled): individual pi_*_malloc per buffer # ------------------------------------------------------------------ _malloc_pat = re.compile( - r'(DeeployOptNetwork_(input|output)_(\d+))\s*=\s*\([^)]+\s*\*\s*\)\s*pi_\w+_malloc\([^;]+\);' - ) + r'(DeeployOptNetwork_(input|output)_(\d+))\s*=\s*\([^)]+\s*\*\s*\)\s*pi_\w+_malloc\([^;]+\);') # ------------------------------------------------------------------ # Pattern 2 (tiled): arena-offset assignment # DeeployOptNetwork_input_N = (Type *)((char *)DeeployOptNetwork_MEMORYARENA_Lx + OFFSET); # ------------------------------------------------------------------ - _arena_pat = re.compile( - r'(DeeployOptNetwork_(input|output)_(\d+))\s*=\s*\([^)]+\s*\*\s*\)' - r'\s*\(\s*\(char\s*\*\)\s*DeeployOptNetwork_MEMORYARENA_L\w+\s*\+\s*\d+\s*\)\s*;' - ) + _arena_pat = re.compile(r'(DeeployOptNetwork_(input|output)_(\d+))\s*=\s*\([^)]+\s*\*\s*\)' + r'\s*\(\s*\(char\s*\*\)\s*DeeployOptNetwork_MEMORYARENA_L\w+\s*\+\s*\d+\s*\)\s*;') def _make_replacement(symbol: str, kind: str, idx: int) -> Optional[str]: if kind == "input" and idx in shared_input_map: @@ -832,14 +847,10 @@ def _replace(m: re.Match) -> str: for level in ('L2', 'L3'): arena_sym = f'DeeployOptNetwork_MEMORYARENA_{level}' # Pattern for the malloc assignment line itself - malloc_line_pat = re.compile( - rf'[^\n]*{re.escape(arena_sym)}\s*=\s*\([^)]+\)\s*pi_\w+_malloc\([^;]+\);\s*\n' - ) + malloc_line_pat = re.compile(rf'[^\n]*{re.escape(arena_sym)}\s*=\s*\([^)]+\)\s*pi_\w+_malloc\([^;]+\);\s*\n') # Pattern for any use of the arena in pointer arithmetic: # (char *)ARENA + OFFSET 
or (void *)ARENA etc. - arena_use_pat = re.compile( - rf'\(\s*(?:char|void|int8_t)\s*\*\s*\)\s*{re.escape(arena_sym)}' - ) + arena_use_pat = re.compile(rf'\(\s*(?:char|void|int8_t)\s*\*\s*\)\s*{re.escape(arena_sym)}') if not arena_use_pat.search(retStr): # No remaining pointer arithmetic — the malloc is dead retStr = malloc_line_pat.sub('', retStr) @@ -887,9 +898,7 @@ def _patch_shared_arenas(retStr: str, train_c_source: str) -> str: continue opt_sym = f'DeeployOptNetwork_MEMORYARENA_{level}' - opt_malloc_pat = re.compile( - rf'({re.escape(opt_sym)})\s*=\s*\([^)]+\)\s*\w+\(sizeof\([^)]+\)\s*\*\s*\d+\)\s*;' - ) + opt_malloc_pat = re.compile(rf'({re.escape(opt_sym)})\s*=\s*\([^)]+\)\s*\w+\(sizeof\([^)]+\)\s*\*\s*\d+\)\s*;') if not opt_malloc_pat.search(retStr): continue @@ -1135,7 +1144,7 @@ def generateOptimizerTestNetwork(deployer: NetworkDeployer, """ assert deployer.prepared, "An unprepared deployer was given" - os.makedirs(dumpdir, exist_ok=True) + os.makedirs(dumpdir, exist_ok = True) train_c_path = os.path.join(dumpdir, 'TrainingNetwork.c') train_c_source: Optional[str] = None diff --git a/DeeployTest/testUtils/core/execution.py b/DeeployTest/testUtils/core/execution.py index 9aff13cede..6073800980 100644 --- a/DeeployTest/testUtils/core/execution.py +++ b/DeeployTest/testUtils/core/execution.py @@ -80,9 +80,12 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: cmd = [ sys.executable, str(generation_script), - "-d", config.gen_dir, - "-t", config.test_dir, - "-p", config.platform, + "-d", + config.gen_dir, + "-t", + config.test_dir, + "-p", + config.platform, ] if config.n_train_steps is not None: cmd.append(f"--n-steps={config.n_train_steps}") @@ -97,7 +100,7 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: cmd.extend(config.gen_args) log.debug(f"[Execution] Tiled training generation command: {' '.join(cmd)}") - result = subprocess.run(cmd, check=False) + result = subprocess.run(cmd, check = False) 
if result.returncode != 0: raise RuntimeError(f"Tiled training network generation failed for {config.test_name}") @@ -123,15 +126,16 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: opt_cmd = [ sys.executable, str(opt_script), - "-d", config.gen_dir, - "-t", opt_dir, - "-p", config.platform, + "-d", + config.gen_dir, + "-t", + opt_dir, + "-p", + config.platform, f"--training-dir={config.test_dir}", ] - _OPT_PASSTHROUGH = ("--cores", "--l1", "--l2", - "--defaultMemLevel", - "--memAllocStrategy", "--searchStrategy", - "--plotMemAlloc", "--profileTiling") + _OPT_PASSTHROUGH = ("--cores", "--l1", "--l2", "--defaultMemLevel", "--memAllocStrategy", + "--searchStrategy", "--plotMemAlloc", "--profileTiling") for arg in config.gen_args: if any(arg.startswith(p) for p in _OPT_PASSTHROUGH): opt_cmd.append(arg) @@ -142,7 +146,7 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: opt_cmd.append("-" + "v" * config.verbose) log.debug(f"[Execution] Tiled optimizer generation command: {' '.join(opt_cmd)}") - result = subprocess.run(opt_cmd, check=False) + result = subprocess.run(opt_cmd, check = False) if result.returncode != 0: raise RuntimeError(f"Tiled optimizer network generation failed for {config.test_name}") @@ -154,9 +158,12 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: cmd = [ sys.executable, str(generation_script), - "-d", config.gen_dir, - "-t", config.test_dir, - "-p", config.platform, + "-d", + config.gen_dir, + "-t", + config.test_dir, + "-p", + config.platform, ] # Only pass values when explicitly set; otherwise let the script auto-detect if config.n_train_steps is not None: @@ -173,7 +180,7 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: cmd.extend(config.gen_args) log.debug(f"[Execution] Training generation command: {' '.join(cmd)}") - result = subprocess.run(cmd, check=False) + result = subprocess.run(cmd, check = False) if result.returncode 
!= 0: raise RuntimeError(f"Training network generation failed for {config.test_name}") @@ -199,9 +206,12 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: opt_cmd = [ sys.executable, str(opt_script), - "-d", config.gen_dir, - "-t", opt_dir, - "-p", config.platform, + "-d", + config.gen_dir, + "-t", + opt_dir, + "-p", + config.platform, f"--training-dir={config.test_dir}", ] _OPT_PASSTHROUGH = ("--cores", "--l1", "--l2", "--defaultMemLevel") @@ -214,7 +224,7 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: opt_cmd.append("-" + "v" * config.verbose) log.debug(f"[Execution] Optimizer generation command: {' '.join(opt_cmd)}") - result = subprocess.run(opt_cmd, check=False) + result = subprocess.run(opt_cmd, check = False) if result.returncode != 0: raise RuntimeError(f"Optimizer network generation failed for {config.test_name}") @@ -225,18 +235,24 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: cmd = [ sys.executable, str(generation_script), - "-d", config.gen_dir, - "-t", config.test_dir, - "-p", config.platform, + "-d", + config.gen_dir, + "-t", + config.test_dir, + "-p", + config.platform, ] else: generation_script = script_dir / "generateNetwork.py" cmd = [ sys.executable, str(generation_script), - "-d", config.gen_dir, - "-t", config.test_dir, - "-p", config.platform, + "-d", + config.gen_dir, + "-t", + config.test_dir, + "-p", + config.platform, ] if config.verbose > 0: @@ -373,8 +389,7 @@ def run_simulation(config: DeeployTestConfig, skip: bool = False) -> TestResult: elif config.simulator == 'gvsoc': cmake_cmd = os.environ.get("CMAKE", "cmake") - cmd = [cmake_cmd, "--build", config.build_dir, "--target", - f"gvsoc_{config.test_name}"] + cmd = [cmake_cmd, "--build", config.build_dir, "--target", f"gvsoc_{config.test_name}"] elif config.simulator == 'banshee': if config.verbose == 1: @@ -384,19 +399,21 @@ def run_simulation(config: DeeployTestConfig, skip: bool = False) 
-> TestResult: elif config.verbose >= 3: env["BANSHEE_LOG"] = "debug" cmake_cmd = os.environ.get("CMAKE", "cmake") - cmd = [cmake_cmd, "--build", config.build_dir, "--target", - f"{config.simulator}_{config.test_name}"] + cmd = [cmake_cmd, "--build", config.build_dir, "--target", f"{config.simulator}_{config.test_name}"] else: cmake_cmd = os.environ.get("CMAKE", "cmake") - cmd = [cmake_cmd, "--build", config.build_dir, "--target", - f"{config.simulator}_{config.test_name}"] + cmd = [cmake_cmd, "--build", config.build_dir, "--target", f"{config.simulator}_{config.test_name}"] log.debug(f"[Execution] Simulation command: {' '.join(cmd)}") # Stream output in real-time (line-buffered) and capture for parsing. - proc = subprocess.Popen(cmd, stdout = subprocess.PIPE, stderr = subprocess.STDOUT, - text = True, env = env, bufsize = 1) + proc = subprocess.Popen(cmd, + stdout = subprocess.PIPE, + stderr = subprocess.STDOUT, + text = True, + env = env, + bufsize = 1) stdout_lines = [] for line in proc.stdout: print(line, end = '', flush = True) diff --git a/DeeployTest/testUtils/deeployTrainingRunner.py b/DeeployTest/testUtils/deeployTrainingRunner.py index 9ee4a64cf4..8f523bf264 100644 --- a/DeeployTest/testUtils/deeployTrainingRunner.py +++ b/DeeployTest/testUtils/deeployTrainingRunner.py @@ -11,9 +11,7 @@ """ import os -import sys from pathlib import Path -from typing import Optional # gapy (gvsoc launcher) uses `#!/usr/bin/env python3`. 
Put /usr/bin first so # it resolves to /usr/bin/python3 which has all required packages (gapylib, @@ -66,13 +64,14 @@ def main(tiling_enabled: bool = False, default_platform: str = 'Siracusa', defau type = str, default = None, help = 'Directory containing the optimizer network.onnx ' - "(default: auto-derived by replacing '_train' with '_optimizer')\n") - parser.add_argument('--tolerance', - metavar = '', - dest = 'tolerance', - type = float, - default = None, - help = 'Absolute loss tolerance for pass/fail comparison (default: auto from generateTrainingNetwork.py)\n') + "(default: auto-derived by replacing '_train' with '_optimizer')\n") + parser.add_argument( + '--tolerance', + metavar = '', + dest = 'tolerance', + type = float, + default = None, + help = 'Absolute loss tolerance for pass/fail comparison (default: auto from generateTrainingNetwork.py)\n') args = parser.parse_args() diff --git a/DeeployTest/testUtils/tilingUtils.py b/DeeployTest/testUtils/tilingUtils.py index 1e4b143cfb..1dfb43bea4 100644 --- a/DeeployTest/testUtils/tilingUtils.py +++ b/DeeployTest/testUtils/tilingUtils.py @@ -2,7 +2,7 @@ # # SPDX-License-Identifier: Apache-2.0 -from typing import Dict, List, Optional, Tuple, Union +from typing import Dict, List, Tuple, Union from ortools.constraint_solver.pywrapcp import IntVar @@ -54,9 +54,8 @@ class TrainingMemoryScheduler(MemoryScheduler): that forward-pass inputs remain live during the backward pass. 
""" - def _calculateLifetimes( - self, ctxt: NetworkContext, patternMemoryConstraint: PatternMemoryConstraints, - memoryLevel: str) -> Tuple[Dict[str, Tuple[int, int]], Dict]: + def _calculateLifetimes(self, ctxt: NetworkContext, patternMemoryConstraint: PatternMemoryConstraints, + memoryLevel: str) -> Tuple[Dict[str, Tuple[int, int]], Dict]: tensorLifetimeMap, tensorMap = super()._calculateLifetimes(ctxt, patternMemoryConstraint, memoryLevel) maxStepIdx = len(patternMemoryConstraint.nodeConstraints) From 763b464719bec97dbdf847ff941f992b7b4d33a3 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 12:35:50 +0000 Subject: [PATCH 03/28] training-platform core: canonicalise SoftmaxCrossEntropyLoss to 2 outputs Two fixes for CI regressions on PR #182: 1. SoftmaxCrossEntropyLoss: canonical form is now 2 outputs (loss + log_prob). The previous design carried both a legacy 1-output form and a new 2-output 'dual output' form side by side, which tripped Deeploy's root-layer backtracking: Layer.parse() commits to the first mapper whose parser succeeds, and if that mapper's bindings all later fail typeCheck, the root-layer backtracker raises instead of trying the next mapper. That broke the existing upstream Tests/Kernels/FP32/Softmax/CrossEntropy. Fix: collapse everything to a single 2-output path. SoftmaxCrossEntropy LossParser now requires exactly 2 outputs, the SCE template is the former 'referenceDualOutputTemplate' (loss + log_prob), the PULPOpen binding uses 2-output pointer types, and the tile constraint contains the loss patching logic that previously lived in SoftmaxCrossEntropyLossDualOutputTileConstraint (now deleted). SoftmaxCrossEntropyGradTileConstraint overrides dataLossName to '' so it falls straight through to the base-class single-output wrapper. The upstream Tests/Kernels/FP32/Softmax/CrossEntropy ONNX is regenerated with the canonical 2-output signature and its outputs.npz is updated with the computed scalar loss. 2. 
TilingExtension.TilerExtension: drop the 4-byte MiniMalloc size alignment that was added in the training branch to work around a Siracusa L3 DMA corner case. That change makes Snitch's tight-fit 5 kB Kernels/Integer/Add/Large test overflow L1 because the rounded-up per-buffer sizes no longer fit. For MLP core the alignment is not needed. The alias-skip logic (skip zero-sized in-place alias outputs from the MiniMalloc CSV and resolve their addrSpace from the alias target after solving) is kept. Verified locally: - Tests/Kernels/FP32/Softmax/CrossEntropy (upstream) PASSED - Tests/Kernels/Integer/Add/Large (snitch tiled, L1=5 kB) PASSED - deeployTrainingRunner_siracusa.py -t simplemlp_train PASSED - deeployTrainingRunner_tiled_siracusa.py -t simplemlp_train PASSED --- Deeploy/Targets/Generic/Parsers.py | 26 ++-- Deeploy/Targets/Generic/TypeCheckers.py | 6 - Deeploy/Targets/PULPOpen/Bindings.py | 8 +- Deeploy/Targets/PULPOpen/Platform.py | 132 ++++++------------ .../SoftmaxCrossEntropyLossTemplate.py | 27 +--- .../SoftmaxCrossEntropyTileConstraint.py | 54 +++++++ Deeploy/Targets/PULPOpen/Tiler.py | 10 +- Deeploy/TilingExtension/TilerExtension.py | 3 +- .../FP32/Softmax/CrossEntropy/network.onnx | Bin 248 -> 204 bytes .../FP32/Softmax/CrossEntropy/outputs.npz | Bin 430 -> 674 bytes 10 files changed, 116 insertions(+), 150 deletions(-) diff --git a/Deeploy/Targets/Generic/Parsers.py b/Deeploy/Targets/Generic/Parsers.py index 1323cc069a..385eb03dff 100644 --- a/Deeploy/Targets/Generic/Parsers.py +++ b/Deeploy/Targets/Generic/Parsers.py @@ -2611,16 +2611,18 @@ def parseNodeCtxt(self, class SoftmaxCrossEntropyLossParser(NodeParser): + """SoftmaxCrossEntropyLoss parser. + + The canonical form has two outputs: a scalar mean cross-entropy loss and + a per-sample log_prob tensor, matching the signature emitted by ONNX + Runtime when exporting training graphs. 
+ """ def __init__(self): super().__init__() def parseNode(self, node: gs.Node) -> bool: - - # Accept 1 output (log_prob only) or 2 outputs (loss + log_prob) - ret = all([len(node.inputs) == 2, len(node.outputs) in (1, 2)]) - - return ret + return all([len(node.inputs) == 2, len(node.outputs) == 2]) def parseNodeCtxt(self, ctxt: NetworkContext, @@ -2629,17 +2631,13 @@ def parseNodeCtxt(self, logits = ctxt.lookup(node.inputs[0].name) labels = ctxt.lookup(node.inputs[1].name) - if len(node.outputs) == 2: - # Dual-output: outputs[0]=loss (scalar), outputs[1]=log_prob - loss = ctxt.lookup(node.outputs[0].name) - log_prob = ctxt.lookup(node.outputs[1].name) - self.operatorRepresentation['loss'] = loss.name - else: - # Single-output (legacy): outputs[0]=log_prob - log_prob = ctxt.lookup(node.outputs[0].name) - self.operatorRepresentation['loss'] = '' + # outputs[0] = loss (0-d scalar, shape [1] after Deeploy normalisation) + # outputs[1] = log_prob tensor + loss = ctxt.lookup(node.outputs[0].name) + log_prob = ctxt.lookup(node.outputs[1].name) self.operatorRepresentation['logits'] = logits.name self.operatorRepresentation['labels'] = labels.name + self.operatorRepresentation['loss'] = loss.name self.operatorRepresentation['log_prob'] = log_prob.name self.operatorRepresentation['batch'] = logits.shape[0] self.operatorRepresentation['num_classes'] = logits.shape[1] diff --git a/Deeploy/Targets/Generic/TypeCheckers.py b/Deeploy/Targets/Generic/TypeCheckers.py index 7e9bc923cf..85453563c3 100644 --- a/Deeploy/Targets/Generic/TypeCheckers.py +++ b/Deeploy/Targets/Generic/TypeCheckers.py @@ -574,12 +574,6 @@ class SoftmaxCrossEntropyLossChecker(SignPropTypeChecker): def __init__(self, input_types: Sequence[Type[Pointer]], output_types: Sequence[Type[Pointer]]): super().__init__(input_types, output_types) - def checkOutputType(self, inputs: List[VariableBuffer], operatorRepresentation: OperatorRepresentation) -> bool: - # The parser sets 'loss' to a non-empty string for 
2-output nodes, '' for 1-output. - # Use this to determine the actual output count and match it against this binding. - actual_num_outputs = 2 if operatorRepresentation.get('loss', '') != '' else 1 - return actual_num_outputs == len(self.output_types) - def _inferNumLevels(self, inputs: List[VariableBuffer], operatorRepresentation: OperatorRepresentation) -> Optional[List[int]]: diff --git a/Deeploy/Targets/PULPOpen/Bindings.py b/Deeploy/Targets/PULPOpen/Bindings.py index 2a1bc9ec02..04bd81c172 100644 --- a/Deeploy/Targets/PULPOpen/Bindings.py +++ b/Deeploy/Targets/PULPOpen/Bindings.py @@ -353,16 +353,10 @@ ] PULPSoftmaxCrossEntropyLossBindings = [ - NodeBinding( - SoftmaxCrossEntropyLossChecker([PointerClass(float32_t), PointerClass(type)], [PointerClass(float32_t)]), - SoftmaxCrossEntropyLossTemplate.referenceTemplate, ForkTransformer) for type in IntegerDataTypes -] - -PULPSoftmaxCrossEntropyLossDualOutputBindings = [ NodeBinding( SoftmaxCrossEntropyLossChecker([PointerClass(float32_t), PointerClass(type)], [PointerClass(float32_t), PointerClass(float32_t)]), - SoftmaxCrossEntropyLossTemplate.referenceDualOutputTemplate, ForkTransformer) for type in IntegerDataTypes + SoftmaxCrossEntropyLossTemplate.referenceTemplate, ForkTransformer) for type in IntegerDataTypes ] PULPSoftmaxCrossEntropyLossGradBindings = [ diff --git a/Deeploy/Targets/PULPOpen/Platform.py b/Deeploy/Targets/PULPOpen/Platform.py index 0766548e43..2413942869 100644 --- a/Deeploy/Targets/PULPOpen/Platform.py +++ b/Deeploy/Targets/PULPOpen/Platform.py @@ -47,10 +47,9 @@ PULPRQSConv1DTilingReadyBindings, PULPRQSConv2DTilingReadyBindings, PULPRQSDWConv2DTilingReadyBindings, \ PULPRQSGEMMTilingReadyBindings, PULPRQSiHardswishTilingReadyBindings, PULPRQSMatrixVecTilingReadyBindings, \ PULPRQSTallGEMMTilingReadyBindings, PULPRQSTilingReadyBindings, PULPSGDTilingReadyBindings, \ - PULPSliceTilingReadyBindings, PULPSoftmaxCrossEntropyDualOutputTilingReadyBindings, \ - 
PULPSoftmaxCrossEntropyGradTilingReadyBindings, PULPSoftmaxCrossEntropyTilingReadyBindings, \ - PULPSoftmaxGradTilingReadyBindings, PULPSoftmaxTilingReadyBindings, PULPTransposeTilingReadyBindings, \ - PULPUniformRQSTilingReadyBindings + PULPSliceTilingReadyBindings, PULPSoftmaxCrossEntropyGradTilingReadyBindings, \ + PULPSoftmaxCrossEntropyTilingReadyBindings, PULPSoftmaxGradTilingReadyBindings, PULPSoftmaxTilingReadyBindings, \ + PULPTransposeTilingReadyBindings, PULPUniformRQSTilingReadyBindings from Deeploy.Targets.PULPOpen.TopologyOptimizationPasses.Passes import PULPAddRequantMergePass, \ PULPConvRequantMergePass, PULPGEMMRequantMergePass, PULPMatMulRequantMergePass @@ -106,8 +105,6 @@ iHardswishMapper = NodeMapper(iHardswishParser(), PULPiHardswishTilingReadyBindings) RQSiHardswishMapper = NodeMapper(RQSiHardswishParser(), PULPRQSiHardswishTilingReadyBindings) SoftmaxCrossEntropyLossMapper = NodeMapper(SoftmaxCrossEntropyLossParser(), PULPSoftmaxCrossEntropyTilingReadyBindings) -SoftmaxCrossEntropyLossDualOutputMapper = NodeMapper(SoftmaxCrossEntropyLossParser(), - PULPSoftmaxCrossEntropyDualOutputTilingReadyBindings) SoftmaxCrossEntropyLossGradMapper = NodeMapper(SoftmaxCrossEntropyLossGradParser(), PULPSoftmaxCrossEntropyGradTilingReadyBindings) SGDMapper = NodeMapper(SGDParser(), PULPSGDTilingReadyBindings) @@ -116,88 +113,47 @@ DequantMapper = NodeMapper(DequantParser(), BasicDequantBindings) GEMMDequantMapper = NodeMapper(PULPGEMMParser(), BasicGEMMBindings) PULPMapping = { - 'Conv': - ConvLayer([FPConv2DMapper, FPDWConv2DMapper]), - 'RequantizedConv': - PULPRQSConvLayer([Conv2DMapper, DWConv2DMapper, Conv1DMapper, DWConv1DMapper]), - 'RequantizedGemm': - PULPRQSGEMMLayer([MatrixVecMapper, TallGEMMMapper, GEMMMapper]), - 'Gemm': - GEMMLayer([FloatGEMMMapper, GEMMDequantMapper]), - 'Gelu': - GELULayer([GELUMapper]), - 'GeluGrad': - GELUGradLayer([GELUGradMapper]), - 'LayerNormalization': - LayerNormLayer([LayerNormMapper]), - 'LayerNormalizationGrad': - 
LayerNormGradLayer([LayerNormGradMapper]), - 'MaxPool': - MaxPoolLayer([MaxPool1DMapper, MaxPool2DMapper]), - 'RequantizediGELU': - RQSiGELULayer([RQGELU_int8_Mapper]), - 'RQIntegerDiv': - RQIntegerDivLayer([RQIntegerDivMapper]), - 'MatMul': - MatMulLayer([MatMulMapper]), - 'IntegerMean': - ReduceMeanLayer([ReduceMeanMapper]), - 'iSoftmax': - SoftmaxLayer([Softmax_int8_Mapper]), - 'Softmax': - SoftmaxLayer([SoftmaxMapper]), - 'ReduceMean': - ReduceMeanLayer([ReduceMeanMapper]), - 'ReduceSum': - ReduceSumLayer([ReduceSumMapper]), - 'RequantShift': - RequantShiftLayer([UniformRequantShiftMapper, RequantShiftMapper]), - 'Add': - AddLayer([AddMapper]), - 'Flatten': - ReshapeLayer([FlattenMapper]), - 'Gather': - GatherLayer([GatherMapper]), - 'Mul': - MulLayer([MulMapper]), - 'Pad': - PadLayer([Pad1DMapper, Pad2DMapper]), - 'Relu': - ReluLayer([ReluMapper]), - 'Reshape': - ReshapeLayer([ReshapeMapper]), - 'Squeeze': - ReshapeLayer([UnsqueezeMapper]), - 'Transpose': - TransposeLayer([TransposeMapper]), - 'Unsqueeze': - ReshapeLayer([UnsqueezeMapper]), - 'Slice': - SliceLayer([SliceMapper, DMASliceMapper]), - 'RequantizedAdd': - AddLayer([RQAddMapper]), - 'Concat': - ConcatLayer([ConcatMapper]), - 'iRMSNorm': - iRMSNormLayer([iRMSNormMapper]), - 'iHardswish': - iHardswishLayer([iHardswishMapper]), - 'RequantizediHardswish': - RQSiHardswishLayer([RQSiHardswishMapper]), - 'Quant': - QuantLayer([QuantMapper]), - 'Dequant': - QuantLayer([DequantMapper]), - 'SoftmaxGrad': - SoftmaxGradLayer([SoftmaxGradMapper]), - 'SoftmaxCrossEntropyLoss': - SoftmaxCrossEntropyLossLayer([SoftmaxCrossEntropyLossDualOutputMapper, SoftmaxCrossEntropyLossMapper]), - 'SoftmaxCrossEntropyLossGrad': - SoftmaxCrossEntropyLossGradLayer([SoftmaxCrossEntropyLossGradMapper]), - 'SGD': - SGDLayer([SGDMapper]), - 'InPlaceAccumulatorV2': - InPlaceAccumulatorV2Layer([InPlaceAccumulatorV2Mapper]), + 'Conv': ConvLayer([FPConv2DMapper, FPDWConv2DMapper]), + 'RequantizedConv': PULPRQSConvLayer([Conv2DMapper, 
DWConv2DMapper, Conv1DMapper, DWConv1DMapper]), + 'RequantizedGemm': PULPRQSGEMMLayer([MatrixVecMapper, TallGEMMMapper, GEMMMapper]), + 'Gemm': GEMMLayer([FloatGEMMMapper, GEMMDequantMapper]), + 'Gelu': GELULayer([GELUMapper]), + 'GeluGrad': GELUGradLayer([GELUGradMapper]), + 'LayerNormalization': LayerNormLayer([LayerNormMapper]), + 'LayerNormalizationGrad': LayerNormGradLayer([LayerNormGradMapper]), + 'MaxPool': MaxPoolLayer([MaxPool1DMapper, MaxPool2DMapper]), + 'RequantizediGELU': RQSiGELULayer([RQGELU_int8_Mapper]), + 'RQIntegerDiv': RQIntegerDivLayer([RQIntegerDivMapper]), + 'MatMul': MatMulLayer([MatMulMapper]), + 'IntegerMean': ReduceMeanLayer([ReduceMeanMapper]), + 'iSoftmax': SoftmaxLayer([Softmax_int8_Mapper]), + 'Softmax': SoftmaxLayer([SoftmaxMapper]), + 'ReduceMean': ReduceMeanLayer([ReduceMeanMapper]), + 'ReduceSum': ReduceSumLayer([ReduceSumMapper]), + 'RequantShift': RequantShiftLayer([UniformRequantShiftMapper, RequantShiftMapper]), + 'Add': AddLayer([AddMapper]), + 'Flatten': ReshapeLayer([FlattenMapper]), + 'Gather': GatherLayer([GatherMapper]), + 'Mul': MulLayer([MulMapper]), + 'Pad': PadLayer([Pad1DMapper, Pad2DMapper]), + 'Relu': ReluLayer([ReluMapper]), + 'Reshape': ReshapeLayer([ReshapeMapper]), + 'Squeeze': ReshapeLayer([UnsqueezeMapper]), + 'Transpose': TransposeLayer([TransposeMapper]), + 'Unsqueeze': ReshapeLayer([UnsqueezeMapper]), + 'Slice': SliceLayer([SliceMapper, DMASliceMapper]), + 'RequantizedAdd': AddLayer([RQAddMapper]), + 'Concat': ConcatLayer([ConcatMapper]), + 'iRMSNorm': iRMSNormLayer([iRMSNormMapper]), + 'iHardswish': iHardswishLayer([iHardswishMapper]), + 'RequantizediHardswish': RQSiHardswishLayer([RQSiHardswishMapper]), + 'Quant': QuantLayer([QuantMapper]), + 'Dequant': QuantLayer([DequantMapper]), + 'SoftmaxGrad': SoftmaxGradLayer([SoftmaxGradMapper]), + 'SoftmaxCrossEntropyLoss': SoftmaxCrossEntropyLossLayer([SoftmaxCrossEntropyLossMapper]), + 'SoftmaxCrossEntropyLossGrad': 
SoftmaxCrossEntropyLossGradLayer([SoftmaxCrossEntropyLossGradMapper]), + 'SGD': SGDLayer([SGDMapper]), + 'InPlaceAccumulatorV2': InPlaceAccumulatorV2Layer([InPlaceAccumulatorV2Mapper]), } diff --git a/Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py b/Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py index 4a3da4b3ee..914a18c3ed 100644 --- a/Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py +++ b/Deeploy/Targets/PULPOpen/Templates/SoftmaxCrossEntropyLossTemplate.py @@ -4,33 +4,11 @@ from Deeploy.DeeployTypes import NodeTemplate +# Canonical SoftmaxCrossEntropyLoss: emits both a scalar mean loss and the +# per-sample log_prob tensor. referenceTemplate = NodeTemplate(""" BEGIN_SINGLE_CORE // SoftmaxCrossEntropyLoss (Name: ${nodeName}, Op: ${nodeOp}) - for (uint32_t i = 0; i < ${batch}; i++) { - float max_logit = ${logits}[i * ${num_classes} + 0]; - for (uint32_t j = 1; j < ${num_classes}; j++) { - if (${logits}[i * ${num_classes} + j] > max_logit) { - max_logit = ${logits}[i * ${num_classes} + j]; - } - } - - float32_t sum_exp = 0.0f; - for (uint32_t j = 0; j < ${num_classes}; j++) { - sum_exp += expf(${logits}[i * ${num_classes} + j] - max_logit); - } - - for (uint32_t j = 0; j < ${num_classes}; j++) { - // log_prob = logit - max_logit - log(sum_exp) - ${log_prob}[i * ${num_classes} + j] = ${logits}[i * ${num_classes} + j] - max_logit - logf(sum_exp); - } - } -END_SINGLE_CORE -""") - -referenceDualOutputTemplate = NodeTemplate(""" -BEGIN_SINGLE_CORE - // SoftmaxCrossEntropyLoss dual-output (Name: ${nodeName}, Op: ${nodeOp}) float32_t sce_total_loss = 0.0f; for (uint32_t i = 0; i < ${batch}; i++) { float32_t sce_max_logit = ${logits}[i * ${num_classes}]; @@ -49,7 +27,6 @@ - sce_max_logit - sce_log_sum_exp); } ${loss}[0] = sce_total_loss / (float32_t)${batch}; - printf(" [SCE] loss=%.6f\\r\\n", (double)${loss}[0]); END_SINGLE_CORE """) diff --git 
a/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyTileConstraint.py b/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyTileConstraint.py index 38c984de63..78957136e5 100644 --- a/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyTileConstraint.py +++ b/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyTileConstraint.py @@ -2,6 +2,7 @@ # # SPDX-License-Identifier: Apache-2.0 +import copy from typing import Dict, List, Tuple, Union from ortools.constraint_solver.pywrapcp import IntVar @@ -17,10 +18,18 @@ class SoftmaxCrossEntropyTileConstraint(TileConstraint): + """TileConstraint for SoftmaxCrossEntropyLoss (2 outputs: loss + log_prob). + + Both batch and num_classes are pinned to their full size by + addPolicyConstraint, so SCE itself is never tiled — the sole purpose of + the wrapTilingSolution override is to bypass the base-class single-output + assertion and carry the scalar loss buffer through the DMA schedule. + """ dataIn1Name = 'logits' dataIn2Name = 'labels' dataOutName = 'log_prob' + dataLossName = 'loss' @classmethod def addGeometricalConstraint(cls, tilerModel: TilerModel, parseDict: Dict, ctxt: NetworkContext) -> TilerModel: @@ -108,8 +117,53 @@ def serializeTilingSolution( return variableReplacementSchedule, tilingSchedule + @classmethod + def wrapTilingSolution( + cls, tilingSolution: NodeMemoryConstraint, targetMemLevel: str, ctxt: NetworkContext, + operatorRepresentation: OperatorRepresentation) -> Tuple[VariableReplacementScheme, List[TilingSchedule]]: + """Override the base-class single-output wrapper. + + SoftmaxCrossEntropyLoss emits two outputs (loss + log_prob) but the + base-class wrapTilingSolution asserts exactly one. We run the base + wrapper on a log_prob-only slice of the tiling solution and then patch + the scalar loss address / rectangle back into each resulting schedule. + + Grad subclasses that do not have a scalar loss output fall straight + through to the base-class behaviour. 
+ """ + lossVar = operatorRepresentation.get(cls.dataLossName, '') + + # No scalar loss output (e.g. Grad subclass) — plain base-class path. + if not lossVar or lossVar not in tilingSolution.outputTensorMemoryConstraints: + return super().wrapTilingSolution(tilingSolution, targetMemLevel, ctxt, operatorRepresentation) + + # Log_prob-only slice of the tiling solution so the single-output + # assertion in the base class passes. + logProbVar = operatorRepresentation[cls.dataOutName] + singleOutputSolution = copy.deepcopy(tilingSolution) + singleOutputSolution.outputTensorMemoryConstraints = { + logProbVar: tilingSolution.outputTensorMemoryConstraints[logProbVar] + } + + varReplacement, tilingSchedules = super().wrapTilingSolution(singleOutputSolution, targetMemLevel, ctxt, + operatorRepresentation) + + # Patch the scalar loss into each schedule's output list. + lossAddr = TileConstraint.getBaseAddr(tilingSolution, targetMemLevel, lossVar) + if lossAddr == [None]: + return varReplacement, tilingSchedules + + lossRect = HyperRectangle((0,), (1,)) + for schedule in tilingSchedules: + schedule.outputBaseOffsets[cls.dataLossName] = lossAddr + for step in schedule.outputLoadSchedule: + step[cls.dataLossName] = lossRect + + return varReplacement, tilingSchedules + class SoftmaxCrossEntropyGradTileConstraint(SoftmaxCrossEntropyTileConstraint): dataIn1Name = 'log_prob' dataIn2Name = 'labels' dataOutName = 'grad' + dataLossName = '' # no scalar loss output — fall through to base wrapper diff --git a/Deeploy/Targets/PULPOpen/Tiler.py b/Deeploy/Targets/PULPOpen/Tiler.py index aa16369f04..b473d4be57 100644 --- a/Deeploy/Targets/PULPOpen/Tiler.py +++ b/Deeploy/Targets/PULPOpen/Tiler.py @@ -22,8 +22,8 @@ PULPReshapeBindings, PULPRQAddBindings, PULPRQSBindings, PULPRQSConv1DBindings, PULPRQSConv2DBindings, \ PULPRQSDWConv2DBindings, PULPRQSGEMMBindings, PULPRQSiHardswishBindings, PULPRQSMatrixVecBindings, \ PULPRQSTallGEMMBindings, PULPSGDBindings, PULPSliceBindings, 
PULPSoftmaxBindings, \ - PULPSoftmaxCrossEntropyLossBindings, PULPSoftmaxCrossEntropyLossDualOutputBindings, \ - PULPSoftmaxCrossEntropyLossGradBindings, PULPSoftmaxGradBindings, PULPTransposeBindings, PULPUniformRQSBindings + PULPSoftmaxCrossEntropyLossBindings, PULPSoftmaxCrossEntropyLossGradBindings, PULPSoftmaxGradBindings, \ + PULPTransposeBindings, PULPUniformRQSBindings from Deeploy.Targets.PULPOpen.TileConstraints.ConvTileConstraint import Conv2DTileConstraint, RQConv1DTileConstraint, \ RQConv2DTileConstraint from Deeploy.Targets.PULPOpen.TileConstraints.DWConvTileConstraint import DWConv2DTileConstraint, \ @@ -44,8 +44,6 @@ from Deeploy.Targets.PULPOpen.TileConstraints.RequantShiftTileConstraint import RequantShiftTileConstraint from Deeploy.Targets.PULPOpen.TileConstraints.SGDTileConstraint import SGDTileConstraint from Deeploy.Targets.PULPOpen.TileConstraints.SliceConstraint import SliceTileConstraint -from Deeploy.Targets.PULPOpen.TileConstraints.SoftmaxCrossEntropyLossDualOutputTileConstraint import \ - SoftmaxCrossEntropyLossDualOutputTileConstraint from Deeploy.Targets.PULPOpen.TileConstraints.SoftmaxCrossEntropyTileConstraint import \ SoftmaxCrossEntropyGradTileConstraint, SoftmaxCrossEntropyTileConstraint from Deeploy.TilingExtension.TilerExtension import TilingReadyNodeBindings @@ -148,10 +146,6 @@ PULPSoftmaxCrossEntropyTilingReadyBindings = TilingReadyNodeBindings( nodeBindings = PULPSoftmaxCrossEntropyLossBindings, tileConstraint = SoftmaxCrossEntropyTileConstraint()) -PULPSoftmaxCrossEntropyDualOutputTilingReadyBindings = TilingReadyNodeBindings( - nodeBindings = PULPSoftmaxCrossEntropyLossDualOutputBindings, - tileConstraint = SoftmaxCrossEntropyLossDualOutputTileConstraint()) - PULPSoftmaxCrossEntropyGradTilingReadyBindings = TilingReadyNodeBindings( nodeBindings = PULPSoftmaxCrossEntropyLossGradBindings, tileConstraint = SoftmaxCrossEntropyGradTileConstraint()) diff --git a/Deeploy/TilingExtension/TilerExtension.py 
b/Deeploy/TilingExtension/TilerExtension.py index 294f5b400a..aa5a02aed9 100644 --- a/Deeploy/TilingExtension/TilerExtension.py +++ b/Deeploy/TilingExtension/TilerExtension.py @@ -441,12 +441,11 @@ def minimalloc(self, memoryMap, ctxt, nodeMemoryConstraint, capacity: int, memor 8) * nodeMemoryConstraint.tensorMemoryConstraints[ memoryBlock.name].memoryConstraints[memoryLevel].multiBufferCoefficient - _alignedSize = ((int(_bufferSize) + 3) // 4) * 4 writer.writerow([ memoryBlock.name, str(memoryBlock.lifetime[0]), str(memoryBlock.lifetime[1] + 1), - str(_alignedSize) + str(int(_bufferSize)) ]) try: diff --git a/DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/network.onnx b/DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/network.onnx index 4e132a326b6bc5e445a44421602b0615b3ec9506..bdd27350f43925ec9b7ba7fb0f81fa22f7907771 100644 GIT binary patch delta 82 zcmeytc!p7%gTv||BUcC)TTXs@W=S!SPE1P8DHdYM$uBMz;sA=q7Zl|uNii2EPmDBV f;Ve#0ow&eUScnUvfQyTRgHecui;07A;x#1z>kbxU delta 126 zcmX@Z_=7Q$gF{HVI6ti}Bq_FGF%TJl8Z77D_4ydMVUXf=qQD}|ll!+P#-tX!;pt92cK+2-b15S$m2j)2h9+0|y zaetqu!vWtfi}x3WZrH!?efk0Zc>V+FE#LPW7O5WCC#Zga;ak9gIk$K04?AOcpl>oe zqkerM21y65tw}p@M1kXgowMeFU*h5W&3@P%_+oE-U|+WPfuk|s_6K+~GKnzb z@)O7h5YWH~qG3Kn*9G!}0#F4A^MDlUg9spKVB7~JbMn*U3ySiSK!FqB&B_LnVFJPe KAT7fL;sF3|EM(#U delta 207 zcmZ3)x{g^Wz?+#xgaHB+8M?O|_MLcPjpyWv8VBC*>N%jY(*HopqRazMiv9=YIs_h& zxP5VducyNSpD&B|7lv-wzwdqe0ls+t18FVa_Zt+c9@s0Wet_Xyz=7GfckB;4V|bu< z@=-?p;OLgJ1K$mj4qRK4cHoEt#{oNM%>#eL!}pv0usQJ6-uS@2Z0`fdV!rJU@MdHZ mVMcY00wV(h2=hSQ2%;Jo879{ Date: Fri, 10 Apr 2026 13:32:21 +0000 Subject: [PATCH 04/28] training-platform core: collapse InPlaceAccumulatorV2 template to single binding MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Drop the separate tiledReferenceTemplate / PULPInPlaceAccumulatorV2TiledBindings path. 
The non-tiled template's data_out write is redundant: InPlaceAccumulatorV2 is terminal in the training graph, no downstream kernel consumes data_out, and the alias registered in alignToContext already makes the graph output pointer resolve to accum_buffer's L2 slot. Emitting the write is not only unnecessary — in the tiled path it also triggers an L2 egress DMA whose destination may overlap with live buffers and corrupt L2, which is why the tiled variant existed in the first place. One template that only writes accum_buffer is correct for both modes. Tiler.py now pulls the (single) PULPInPlaceAccumulatorV2Bindings through PULPInPlaceAccumulatorV2TilingReadyBindings. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled (deeployTrainingRunner_siracusa.py) and tiled (deeployTrainingRunner_tiled_siracusa.py --l1 64000 --l2 2000000) runs. --- Deeploy/Targets/PULPOpen/Bindings.py | 8 --- .../FloatInPlaceAccumulatorV2Template.py | 63 +++++++------------ Deeploy/Targets/PULPOpen/Tiler.py | 4 +- 3 files changed, 23 insertions(+), 52 deletions(-) diff --git a/Deeploy/Targets/PULPOpen/Bindings.py b/Deeploy/Targets/PULPOpen/Bindings.py index 04bd81c172..06674a7498 100644 --- a/Deeploy/Targets/PULPOpen/Bindings.py +++ b/Deeploy/Targets/PULPOpen/Bindings.py @@ -378,14 +378,6 @@ ForkTransformer) ] -PULPInPlaceAccumulatorV2TiledBindings = [ - NodeBinding( - InPlaceAccumulatorV2Checker( - [PointerClass(float32_t), PointerClass(float32_t), - PointerClass(uint8_t)], [PointerClass(float32_t)]), - FloatInPlaceAccumulatorV2Template.tiledReferenceTemplate, ForkTransformer) -] - PULPTransposeBindings = [ NodeBinding(TransposeChecker([PointerClass(type)], [PointerClass(type)]), TransposeTemplate.referenceTemplate, ForkTransformer) for type in IntegerDataTypes diff --git a/Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py b/Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py index d1cfcc5d01..f7864c7261 
100644 --- a/Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py +++ b/Deeploy/Targets/PULPOpen/Templates/FloatInPlaceAccumulatorV2Template.py @@ -10,19 +10,25 @@ class _PULPInPlaceAccumulatorV2Template(NodeTemplate): """True in-place InPlaceAccumulatorV2 template for PULP. - Writes the result directly into accum_buffer (the graph input) rather - than into a separate data_out buffer. data_out is registered as an - alias of accum_buffer so the memory allocator knows they share memory - and will not free accum_buffer prematurely. + Writes the accumulation result into ``accum_buffer`` (the graph input). + ``data_out`` is registered as an alias of ``accum_buffer`` so the memory + allocator knows they share memory and will not free ``accum_buffer`` + prematurely. + + ``data_out`` is intentionally *not* written by the emitted C code: + + - InPlaceAccumulatorV2 is terminal in the training graph — no downstream + kernel consumes ``data_out``; it only exists as a symbolic output so + the graph stays well-formed. + - In the tiled path, emitting a write to ``data_out`` would also make + Deeploy generate an L2 egress DMA for it, and ``data_out``'s L2 slot + may overlap with other live buffers, corrupting L2. 
Semantics: - if lazy_reset_grad: accum_buffer = gradient (reset) - else: accum_buffer += gradient (accumulate) + if lazy_reset_grad: accum_buffer = gradient (reset) + else: accum_buffer += gradient (accumulate) """ - def __init__(self, templateStr): - super().__init__(templateStr) - def alignToContext( self, ctxt: NetworkContext, operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, OperatorRepresentation, List[str]]: @@ -32,43 +38,16 @@ def alignToContext( accum_buffer.aliases.add(data_out.name) data_out.aliases.add(accum_buffer.name) data_out._alias = accum_buffer.name + return ctxt, operatorRepresentation, [] referenceTemplate = _PULPInPlaceAccumulatorV2Template(""" -// InPlaceAccumulatorV2 - true in-place (Name: ${nodeName}, Op: ${nodeOp}) -// Writes result to accum_buffer (in-place) and data_out (explicit output). -// In training, data_out aliases accum_buffer (same or separate allocation). -// Reset (lazy_reset_grad=1): accum_buffer = gradient -// Accum (lazy_reset_grad=0): accum_buffer += gradient -int8_t ${nodeName}_core_id = pi_core_id(); -int8_t ${nodeName}_log2Core = log2(NUM_CORES); -int32_t ${nodeName}_chunk = (${size} >> ${nodeName}_log2Core) + ((${size} & (NUM_CORES-1))!=0); -int32_t ${nodeName}_start = MIN(${nodeName}_chunk * ${nodeName}_core_id, (int32_t)${size}); -int32_t ${nodeName}_stop = MIN(${nodeName}_start + ${nodeName}_chunk, (int32_t)${size}); - -if (${lazy_reset_grad}[0]) { - for (int32_t i = ${nodeName}_start; i < ${nodeName}_stop; i++) { - ${accum_buffer}[i] = ${gradient}[i]; - ${data_out}[i] = ${gradient}[i]; - } -} else { - for (int32_t i = ${nodeName}_start; i < ${nodeName}_stop; i++) { - ${accum_buffer}[i] += ${gradient}[i]; - ${data_out}[i] = ${accum_buffer}[i]; - } -} -""") - -# Tiled variant: writes only to ${accum_buffer} (no ${data_out} write). -# In the tiled context the optimizer reads the gradient directly from -# accum_buffer's L2 address (input_4/input_5). 
data_out's L2 address may -# overlap with other live buffers, so writing to it via DMA would corrupt L2. -# Omitting ${data_out} means we do not need a DMA egress for it at all. -tiledReferenceTemplate = _PULPInPlaceAccumulatorV2Template(""" -// InPlaceAccumulatorV2 - tiled in-place (Name: ${nodeName}, Op: ${nodeOp}) -// Tiled variant: result written only to accum_buffer (egressed to L2 by DMA). -// data_out is NOT written here — optimizer reads gradient from accum_buffer. +// InPlaceAccumulatorV2 (Name: ${nodeName}, Op: ${nodeOp}) +// Writes result into accum_buffer (in-place). data_out is an alias of +// accum_buffer and is deliberately not written — it has no downstream +// consumer, and emitting a write would trigger an L2 egress DMA whose +// destination may overlap with live buffers in the tiled path. // Reset (lazy_reset_grad=1): accum_buffer = gradient // Accum (lazy_reset_grad=0): accum_buffer += gradient int8_t ${nodeName}_core_id = pi_core_id(); diff --git a/Deeploy/Targets/PULPOpen/Tiler.py b/Deeploy/Targets/PULPOpen/Tiler.py index b473d4be57..cc9b4e0ca4 100644 --- a/Deeploy/Targets/PULPOpen/Tiler.py +++ b/Deeploy/Targets/PULPOpen/Tiler.py @@ -16,7 +16,7 @@ from Deeploy.Targets.Generic.TileConstraints.UnaryTileConstraint import UnaryTileConstraint from Deeploy.Targets.PULPOpen.Bindings import PULPAddBindings, PULPConcatBindings, PULPFloatConv2DBindings, \ PULPFloatDWConv2DBindings, PULPFloatGELUBinding, PULPFloatGELUGradBinding, PULPFloatGEMMBindings, \ - PULPGatherBindings, PULPiHardswishBindings, PULPInPlaceAccumulatorV2TiledBindings, PULPiRMSNormBindings, \ + PULPGatherBindings, PULPiHardswishBindings, PULPInPlaceAccumulatorV2Bindings, PULPiRMSNormBindings, \ PULPiRQSGELUBindings, PULPLayernormBinding, PULPLayernormGradBinding, PULPMatMulBindings, PULPMaxPool1DBindings, \ PULPMaxPool2DBindings, PULPMulBindings, PULPReduceMeanBindings, PULPReduceSumBindings, PULPReluBinding, \ PULPReshapeBindings, PULPRQAddBindings, PULPRQSBindings, 
PULPRQSConv1DBindings, PULPRQSConv2DBindings, \ @@ -159,7 +159,7 @@ tileConstraint = SGDTileConstraint()) PULPInPlaceAccumulatorV2TilingReadyBindings = TilingReadyNodeBindings( - nodeBindings = PULPInPlaceAccumulatorV2TiledBindings, tileConstraint = InPlaceAccumulatorV2TileConstraint()) + nodeBindings = PULPInPlaceAccumulatorV2Bindings, tileConstraint = InPlaceAccumulatorV2TileConstraint()) PULPSliceTilingReadyBindings = TilingReadyNodeBindings(nodeBindings = PULPSliceBindings, tileConstraint = SliceTileConstraint()) From b844fe181c5d26d1dddf855ccb30aaa736ddc235 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 13:37:15 +0000 Subject: [PATCH 05/28] training-platform core: generalise SGD alias comment to L2 or L3 The weight/weight_updated alias + shared allocTemplate mechanism works for any memory level weight lives in, not only L2. Update the class docstring and inline comment to say "L2 or L3" instead of hard-coding L2. No code change. --- Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py b/Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py index 3be74c38d6..d31d2c2797 100644 --- a/Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py +++ b/Deeploy/Targets/PULPOpen/Templates/SGDTemplate.py @@ -11,9 +11,10 @@ class _PULPSGDTemplate(NodeTemplate): """In-place SGD template for PULP. weight_updated is aliased to weight so the memory allocator places them - at the same L2 address. This ensures the tiled egress DMA writes the - updated weight back to weight's L2 buffer — the same buffer the training - network reads from on the next forward pass. + at the same address in whichever memory level weight lives in (L2 or L3). + This ensures the tiled egress DMA writes the updated weight back to + weight's buffer — the same buffer the training network reads from on the + next forward pass. 
""" def __init__(self, templateStr): @@ -29,8 +30,9 @@ def alignToContext( weight_updated.aliases.add(weight.name) weight_updated._alias = weight.name - # Make weight_updated share weight's L2 allocation (no separate malloc). - # The egress DMA then writes updated weights back to weight's L2 address. + # Make weight_updated share weight's allocation (no separate malloc), + # regardless of which memory level (L2 or L3) weight is placed in. + # The egress DMA then writes updated weights back to weight's address. weight_updated.allocTemplate = NodeTemplate(" ${name} = (${type.typeName}) " + str(weight._instance) + ";") weight_updated.deallocTemplate = NodeTemplate("") return ctxt, operatorRepresentation, [] From ceeb951c5311fe2aad2803e165e6265986dba008 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 13:53:33 +0000 Subject: [PATCH 06/28] training-platform core: trim InPlaceAccumulatorV2 tile constraint comments Drop restating-the-code comments and inline verbose "trick" narration; keep only the class docstring and the one-line note about egressing through data_out rather than accum_buffer. Also compact the load-schedule loops into list comprehensions. No behaviour change. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. --- .../InPlaceAccumulatorV2TileConstraint.py | 48 +++++-------------- 1 file changed, 12 insertions(+), 36 deletions(-) diff --git a/Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py b/Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py index 2d3cfa4c3e..ec66c22b5e 100644 --- a/Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py +++ b/Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py @@ -19,8 +19,8 @@ class InPlaceAccumulatorV2TileConstraint(BOPTileConstraint): """Tile constraint for InPlaceAccumulatorV2. 
- Tiles buffer and gradient together (same shape); lazy_reset_grad is a - scalar (1 element) and is not tiled. + Tiles accum_buffer and gradient together (same shape); lazy_reset_grad + is a scalar (1 element) and is not tiled. """ dataIn1Name = 'accum_buffer' @@ -29,14 +29,12 @@ class InPlaceAccumulatorV2TileConstraint(BOPTileConstraint): @classmethod def addGeometricalConstraint(cls, tilerModel: TilerModel, parseDict: Dict, ctxt: NetworkContext) -> TilerModel: - # Register buffer, gradient, data_out and add BOP equality constraints tilerModel = super().addGeometricalConstraint(tilerModel, parseDict, ctxt) - # Register lazy_reset_grad (scalar flag, not tiled): fix all dims to full size + # lazy_reset_grad is a scalar flag — pin full size so it is not tiled. lazyResetName = parseDict['lazy_reset_grad'] tilerModel.addTensorDimToModel(ctxt, lazyResetName) - lazyResetTensor = ctxt.lookup(lazyResetName) - shape = lazyResetTensor.shape + shape = ctxt.lookup(lazyResetName).shape dims = [shape] if isinstance(shape, int) else shape for idx, dim in enumerate(dims): dimVar = tilerModel.getTensorDimVar(lazyResetName, idx) @@ -51,50 +49,28 @@ def serializeTilingSolution( operatorRepresentation: OperatorRepresentation) -> Tuple[VariableReplacementScheme, TilingSchedule]: outputCubes = [cube.rectangle for cube in absoluteOutputCubes] - # Egress strategy: use data_out (the proper graph output, present in - # outputTensorMemoryConstraints) rather than accum_buffer (a graph input, - # only in inputTensorMemoryConstraints). This avoids two core-class issues: - # 1. accum_buffer appearing in BOTH inputBaseOffsets and outputBaseOffsets - # causes a duplicate-hoist KeyError in TilingVariableReplacement. - # 2. The egress DMA lookup uses outputTensorMemoryConstraints; accum_buffer - # is not there and would raise a KeyError. - # - # The trick: force outputBaseOffsets[data_out] to the SAME L1 arena offset as - # inputBaseOffsets[accum_buffer]. 
Both data_out_ref and accum_buffer_ref then - # map to the same physical L1 address. The tiled kernel writes to ${accum_buffer} - # (= accum_buffer_ref in L1); the egress DMA transfers data_out_ref (same L1 - # bytes) to data_out's L2 address, which is what the optimizer reads. + # Use data_out as the egress target rather than accum_buffer (a graph input). addrNames = [cls.dataIn1Name, cls.dataIn2Name, cls.dataOutName, 'lazy_reset_grad'] inputBaseOffsets, outputBaseOffsets = cls.extractBaseAddr(tilingSolution, targetMemLevel, operatorRepresentation, addrNames) - - # Pin data_out's L1 tile to the same arena slot as accum_buffer's L1 tile. outputBaseOffsets[cls.dataOutName] = inputBaseOffsets[cls.dataIn1Name] replacements = {"size": []} replacementTypes = {"size": PointerClass(uint16_t)} - lazyResetName = operatorRepresentation['lazy_reset_grad'] - lazyResetShape = ctxt.lookup(lazyResetName).shape + lazyResetShape = ctxt.lookup(operatorRepresentation['lazy_reset_grad']).shape lazyResetDims = (lazyResetShape,) if isinstance(lazyResetShape, int) else tuple(lazyResetShape) lazyResetCube = HyperRectangle((0,) * len(lazyResetDims), lazyResetDims) - inputLoadSchedule = [] - outputLoadSchedule = [] + inputLoadSchedule = [{ + cls.dataIn1Name: cube, + cls.dataIn2Name: cube, + 'lazy_reset_grad': lazyResetCube, + } for cube in outputCubes] + outputLoadSchedule = [{cls.dataOutName: out} for out in outputCubes] for cube in outputCubes: replacements["size"].append(int(np.prod(cube.dims))) - inputLoadSchedule.append({ - cls.dataIn1Name: cube, - cls.dataIn2Name: cube, - 'lazy_reset_grad': lazyResetCube, - }) - - for out in outputCubes: - # Egress: DMA from data_out_ref (same L1 slot as accum_buffer_ref) → data_out L2. 
- outputLoadSchedule.append({ - cls.dataOutName: out, - }) tilingSchedule = TilingSchedule(inputBaseOffsets, outputBaseOffsets, inputLoadSchedule, outputLoadSchedule) variableReplacementSchedule = VariableReplacementScheme(replacements, replacementTypes) From ecbffa0c0d124a3e97be4044899518bbd5d4bc84 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 13:55:36 +0000 Subject: [PATCH 07/28] training-platform core: drop leftover egress-target comment --- .../TileConstraints/InPlaceAccumulatorV2TileConstraint.py | 1 - 1 file changed, 1 deletion(-) diff --git a/Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py b/Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py index ec66c22b5e..fb2b4bde78 100644 --- a/Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py +++ b/Deeploy/Targets/PULPOpen/TileConstraints/InPlaceAccumulatorV2TileConstraint.py @@ -49,7 +49,6 @@ def serializeTilingSolution( operatorRepresentation: OperatorRepresentation) -> Tuple[VariableReplacementScheme, TilingSchedule]: outputCubes = [cube.rectangle for cube in absoluteOutputCubes] - # Use data_out as the egress target rather than accum_buffer (a graph input). addrNames = [cls.dataIn1Name, cls.dataIn2Name, cls.dataOutName, 'lazy_reset_grad'] inputBaseOffsets, outputBaseOffsets = cls.extractBaseAddr(tilingSolution, targetMemLevel, operatorRepresentation, addrNames) From 728c68f8c21e2d028f0ba1b09dde647ed05724b5 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 14:00:04 +0000 Subject: [PATCH 08/28] training-platform core: drop stray ReluGradTileConstraint ReluGradTileConstraint was an unused BOPTileConstraint subclass that got pulled in with the initial training-platform cherry-pick. Nothing in the tree imports or registers it, simplemlp_train does not contain any Relu or ReluGrad nodes, and Relu gradients are explicitly out of scope for this PR. Remove it. 
Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. --- .../Targets/PULPOpen/TileConstraints/SGDTileConstraint.py | 7 ------- 1 file changed, 7 deletions(-) diff --git a/Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py b/Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py index 951713d85d..b7757786e1 100644 --- a/Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py +++ b/Deeploy/Targets/PULPOpen/TileConstraints/SGDTileConstraint.py @@ -10,10 +10,3 @@ class SGDTileConstraint(BOPTileConstraint): dataIn1Name = 'weight' dataIn2Name = 'grad' dataOutName = 'weight_updated' - - -class ReluGradTileConstraint(BOPTileConstraint): - - dataIn1Name = 'grad_out' - dataIn2Name = 'data_in' - dataOutName = 'grad_in' From fc24a844ec2337c679ff4660ac6bc2d13cedda4f Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 14:03:13 +0000 Subject: [PATCH 09/28] training-platform core: delete stray SoftmaxCrossEntropyLossDualOutputTileConstraint MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The canonicalisation commit 763b4647 moved the loss-patching wrapTilingSolution logic into the base SoftmaxCrossEntropyTileConstraint (dataLossName = 'loss' plus loss-output extension) and the commit message declared the dual-output subclass "now deleted", but the git rm never actually happened — the file was still sitting on disk with no importers anywhere in the tree. Finish the deletion. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. 
--- ...rossEntropyLossDualOutputTileConstraint.py | 71 ------------------- 1 file changed, 71 deletions(-) delete mode 100644 Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py diff --git a/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py b/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py deleted file mode 100644 index a261869711..0000000000 --- a/Deeploy/Targets/PULPOpen/TileConstraints/SoftmaxCrossEntropyLossDualOutputTileConstraint.py +++ /dev/null @@ -1,71 +0,0 @@ -# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna -# -# SPDX-License-Identifier: Apache-2.0 - -import copy -from typing import List, Tuple - -from Deeploy.DeeployTypes import NetworkContext, OperatorRepresentation -from Deeploy.Targets.PULPOpen.TileConstraints.SoftmaxCrossEntropyTileConstraint import SoftmaxCrossEntropyTileConstraint -from Deeploy.TilingExtension.MemoryConstraints import NodeMemoryConstraint -from Deeploy.TilingExtension.TileConstraint import TileConstraint -from Deeploy.TilingExtension.TilingCodegen import HyperRectangle, TilingSchedule, VariableReplacementScheme - - -class SoftmaxCrossEntropyLossDualOutputTileConstraint(SoftmaxCrossEntropyTileConstraint): - """TileConstraint for SoftmaxCrossEntropyLoss with 2 outputs: - - log_prob : [batch, num_classes] (primary output — same as single-output version) - - loss : [] 0-d scalar (scalar cross-entropy mean) - - Both batch and num_classes are pinned to their full size by the inherited - addPolicyConstraint, so no actual tiling of SCE occurs. The sole purpose of - this subclass is to override wrapTilingSolution so that the base-class - single-output assertion is bypassed, and the scalar loss buffer is included - in the DMA output schedule. - """ - - # Key in operatorRepresentation for the scalar loss output buffer name. 
- dataLossName = 'loss' - - @classmethod - def wrapTilingSolution( - cls, tilingSolution: NodeMemoryConstraint, targetMemLevel: str, ctxt: NetworkContext, - operatorRepresentation: OperatorRepresentation) -> Tuple[VariableReplacementScheme, List[TilingSchedule]]: - - logProbVar = operatorRepresentation[cls.dataOutName] # e.g. "onnx::log_prob::3" - lossVar = operatorRepresentation.get(cls.dataLossName, '') - - # If loss is absent (empty string — single-output fallback) or not in the - # memory constraint dict, delegate straight to the parent unchanged. - if not lossVar or lossVar not in tilingSolution.outputTensorMemoryConstraints: - return super().wrapTilingSolution(tilingSolution, targetMemLevel, ctxt, operatorRepresentation) - - # Build a single-output copy of tilingSolution (log_prob only) so that - # the base-class assertion `len(outputTensorMemoryConstraints) == 1` passes. - singleOutputSolution = copy.deepcopy(tilingSolution) - singleOutputSolution.outputTensorMemoryConstraints = { - logProbVar: tilingSolution.outputTensorMemoryConstraints[logProbVar] - } - - # Call the base-class wrapTilingSolution, which runs cube computation and - # calls serializeTilingSolution for log_prob. - varReplacement, tilingSchedules = super().wrapTilingSolution(singleOutputSolution, targetMemLevel, ctxt, - operatorRepresentation) - - # Extend each TilingSchedule to include the scalar loss output. - # The loss tensor is always 1 element (0-d scalar represented as [1] for DMA). - lossAddr = TileConstraint.getBaseAddr(tilingSolution, targetMemLevel, lossVar) - - # If the address is None (IO tensor with runtime-determined address, or tensor - # not allocated at this memory level), skip — same logic as sanitizeTilingSchedule. 
- if lossAddr == [None]: - return varReplacement, tilingSchedules - - lossRect = HyperRectangle((0,), (1,)) - - for schedule in tilingSchedules: - schedule.outputBaseOffsets[cls.dataLossName] = lossAddr - for step in schedule.outputLoadSchedule: - step[cls.dataLossName] = lossRect - - return varReplacement, tilingSchedules From 12597be6977af0499e1e59a0e2915aac8a547a79 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 14:17:39 +0000 Subject: [PATCH 10/28] training-platform core: simplify MiniMalloc alias-skip block Collapse the in-place alias collection and resolution in Tiler.minimalloc() into set/dict comprehensions, drop the verbose multi-line comment, and remove the JUNGVI: attribution from a block that wasn't authored by Victor. The target lookup is now O(1) via a name-keyed dict instead of an inner linear scan over memoryMap. No behaviour change. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. --- Deeploy/TilingExtension/TilerExtension.py | 45 ++++++++--------------- 1 file changed, 16 insertions(+), 29 deletions(-) diff --git a/Deeploy/TilingExtension/TilerExtension.py b/Deeploy/TilingExtension/TilerExtension.py index aa5a02aed9..a11979c5dc 100644 --- a/Deeploy/TilingExtension/TilerExtension.py +++ b/Deeploy/TilingExtension/TilerExtension.py @@ -399,23 +399,17 @@ def minimalloc(self, memoryMap, ctxt, nodeMemoryConstraint, capacity: int, memor environment variable to be set to the installation directory. """ - blockNames = [block.name for block in memoryMap] - - # In-place alias outputs are costless — their storage is - # already accounted for by the alias target. This mirrors the - # zero-cost logic in _buildCostVector (MemoryScheduler.py) and the - # skip logic in _allocateStaticBuffer. 
- # We skip them from the MiniMalloc CSV (MiniMalloc does not accept - # size-0 entries) and resolve their addrSpace from the alias target + blockNames = {block.name for block in memoryMap} + + # In-place alias outputs whose target is in the same memoryMap share + # storage with the target — skip them from the MiniMalloc CSV (it + # rejects size-0 entries) and copy their addrSpace from the target # after the solver runs. - # NOTE: Only skip when alias target is in the SAME memoryMap. - # When alias target is global (e.g. L2 weight) but we're allocating - # L1, the buffer still needs its own L1 space. - aliasBlocks = set() - for memoryBlock in memoryMap: - _buffer = ctxt.lookup(memoryBlock.name) - if hasattr(_buffer, "_alias") and _buffer._alias in blockNames: - aliasBlocks.add(memoryBlock.name) + aliasBlocks = { + block.name + for block in memoryMap + if getattr(ctxt.lookup(block.name), "_alias", None) in blockNames + } with open(f"{self._minimalloc_input}.csv", mode = "w", newline = "") as file: writer = csv.writer(file, lineterminator = "\n") @@ -474,20 +468,13 @@ def minimalloc(self, memoryMap, ctxt, nodeMemoryConstraint, capacity: int, memor if memoryBlock.name == row[0]: memoryBlock._addrSpace = (int(row[-1]), int(row[-1]) + int(row[-2])) - # JUNGVI: Alias blocks were skipped in the MiniMalloc CSV. - # Resolve their addrSpace from their alias target so that - # downstream code can access it if needed. + # Resolve skipped alias blocks: copy addrSpace from the alias target. 
+ targetBlocks = {block.name: block for block in memoryMap} for memoryBlock in memoryMap: - if memoryBlock.name in aliasBlocks: - _buffer = ctxt.lookup(memoryBlock.name) - aliasTarget = ctxt.dealiasBuffer(memoryBlock.name) - for targetBlock in memoryMap: - if targetBlock.name == aliasTarget: - memoryBlock._addrSpace = targetBlock._addrSpace - break - else: - # Alias target not in this memoryMap — use zero offset - memoryBlock._addrSpace = (0, 0) + if memoryBlock.name not in aliasBlocks: + continue + target = targetBlocks.get(ctxt.dealiasBuffer(memoryBlock.name)) + memoryBlock._addrSpace = target._addrSpace if target is not None else (0, 0) return memoryMap From b42ea1db48701673f463f13fcacd704429338bce Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 14:24:36 +0000 Subject: [PATCH 11/28] training-platform core: restore per-layer { } block in generateInferenceCode MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Upstream PR #177 (13113deb, "Fix/tiling stack scoping and tiling information corruption", Pu DENG) wraps each layer's emitted code in a C block: layerCode = reduce(lambda a, b: a + b, sections, "") callStack += "{\n" + layerCode + "\n}\n" so that per-layer call args become short-lived stack variables and RunNetwork's overall stack footprint goes down. cc1f68b7 silently reverted this hunk during a merge from devel — the training-platform branch was based on a pre-#177 snapshot and the conflict resolution went the wrong way. Restore the wrapping verbatim. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. 
--- Deeploy/DeeployTypes.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/Deeploy/DeeployTypes.py b/Deeploy/DeeployTypes.py index 771f00c07d..4c647a3ab4 100644 --- a/Deeploy/DeeployTypes.py +++ b/Deeploy/DeeployTypes.py @@ -2800,7 +2800,8 @@ def generateInferenceCode(self) -> str: self.ctxt, code = node.generate(self.ctxt) sections = reduce(lambda a, b: a + b, code, []) - callStack += reduce(lambda a, b: a + b, sections, "") + layerCode = reduce(lambda a, b: a + b, sections, "") + callStack += "{\n" + layerCode + "\n}\n" return callStack From 52850216d37af1c2d5793bc3e1d4cc259ab714b5 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 14:29:36 +0000 Subject: [PATCH 12/28] training-platform core: restore upstream SoftmaxCrossEntropy kernel test fixtures Revert the regenerated 2-output network.onnx and outputs.npz under DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/ back to the upstream-devel versions. The training MLP path doesn't depend on these fixtures, and keeping the original files avoids touching unrelated kernel test data in this PR. Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. 
--- .../FP32/Softmax/CrossEntropy/network.onnx | Bin 204 -> 248 bytes .../FP32/Softmax/CrossEntropy/outputs.npz | Bin 674 -> 430 bytes 2 files changed, 0 insertions(+), 0 deletions(-) diff --git a/DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/network.onnx b/DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/network.onnx index bdd27350f43925ec9b7ba7fb0f81fa22f7907771..4e132a326b6bc5e445a44421602b0615b3ec9506 100644 GIT binary patch delta 126 zcmX@Z_=7Q$gF{HVI6ti}Bq_FGF%TJl8Z77D_kbxU diff --git a/DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/outputs.npz b/DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/outputs.npz index 991e9da29a1e1174feec0b47cd367add669db6fc..fede142f839de7f0a1ec73fd200cfc5616159db9 100644 GIT binary patch delta 207 zcmZ3)x{g^Wz?+#xgaHB+8M?O|_MLcPjpyWv8VBC*>N%jY(*HopqRazMiv9=YIs_h& zxP5VducyNSpD&B|7lv-wzwdqe0ls+t18FVa_Zt+c9@s0Wet_Xyz=7GfckB;4V|bu< z@=-?p;OLgJ1K$mj4qRK4cHoEt#{oNM%>#eL!}pv0usQJ6-uS@2Z0`fdV!rJU@MdHZ mVMcY00wV(h2=hSQ2%;Jo879{4ydMVUXf=qQD}|ll!+P#-tX!;pt92cK+2-b15S$m2j)2h9+0|y zaetqu!vWtfi}x3WZrH!?efk0Zc>V+FE#LPW7O5WCC#Zga;ak9gIk$K04?AOcpl>oe zqkerM21y65tw}p@M1kXgowMeFU*h5W&3@P%_+oE-U|+WPfuk|s_6K+~GKnzb z@)O7h5YWH~qG3Kn*9G!}0#F4A^MDlUg9spKVB7~JbMn*U3ySiSK!FqB&B_LnVFJPe KAT7fL;sF3|EM(#U From 40e83397cd881d7c28caf954bfc67296c3a76e40 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 14:33:17 +0000 Subject: [PATCH 13/28] training-platform core: drop legacy 1-output Softmax/CrossEntropy kernel test The new SoftmaxCrossEntropyLossParser requires exactly 2 outputs (loss + log_prob), so the upstream Tests/Kernels/FP32/Softmax/CrossEntropy fixture (legacy 1-output ONNX) is no longer parseable. Rather than regenerate it in this PR, just remove the test fixture directory and its entry in test_siracusa_config.py. CrossEntropyGrad is unaffected and stays. Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. 
--- .../Kernels/FP32/Softmax/CrossEntropy/inputs.npz | Bin 702 -> 0 bytes .../FP32/Softmax/CrossEntropy/network.onnx | Bin 248 -> 0 bytes .../Kernels/FP32/Softmax/CrossEntropy/outputs.npz | Bin 430 -> 0 bytes DeeployTest/test_siracusa_config.py | 1 - 4 files changed, 1 deletion(-) delete mode 100644 DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/inputs.npz delete mode 100644 DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/network.onnx delete mode 100644 DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/outputs.npz diff --git a/DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/inputs.npz b/DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/inputs.npz deleted file mode 100644 index b51a843019963b93b8b3fcce5386cdfdc0ed8320..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 702 zcmWIWW@Zs#fB;2?X0NAD6c`y8K$wd`gdrzCJ+q`(FR!4IkwE|~3Q`G@1%b(ap}ql; zj0|NA)#@p!#mPnLRtoAiX(sAA3hHV3MI}XvdGYy0DXAcFx5S*{RG@fqMq)uKkgs8) zqhM&DsiRPM>h7QYzU@K#)y(GZKbjt}-{@(?e!+AP`;Q!=`$MHB*&FO^ zwVz5&W?U%;riy`~0i0G4=?GmPYN#OVivjA) Vfrn9mH!B-RmI(+8fV2hJVE`*6r-uLl diff --git a/DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/network.onnx b/DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/network.onnx deleted file mode 100644 index 4e132a326b6bc5e445a44421602b0615b3ec9506..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 248 zcmdR{3IzU?0WNndX>b3A?7-RjCF;W>H}nIiEwcirKXf7mt^MWDY4|HCgv?* zWX9!;^rFOqj3^118-)0{csLk^IJlTNSb&%-N(AmIE-nrZb|H``OOg~9+>PiuoLIOR G1b6`(D@dRK diff --git a/DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/outputs.npz b/DeeployTest/Tests/Kernels/FP32/Softmax/CrossEntropy/outputs.npz deleted file mode 100644 index fede142f839de7f0a1ec73fd200cfc5616159db9..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 430 zcmWIWW@Zs#fB;2??k$IX6&M*9K$wR?gdrzCJ-(nQKS?jIppub604xqt3z7$c$$p`} z0g;RhWenBoDXGQDMe0@x>NaU6>N*PQY57GZMTvRw`9&$IAaS?EoZ?iVcyUHzK`M~1 
zVWOj8XrQU1P^&;L;F>&9jGk^JVe= z!q5%-_q|U)z!%SdAg$&5euE;_1A7J44={WSI57M6j{RY03=i~9WjJ70C3s+7)6M;p zT3ilPAJbb^|51RvD?TruY%l1BSEauz(0B=Sn5oTQB3kgFI*#Hht Date: Fri, 10 Apr 2026 14:40:07 +0000 Subject: [PATCH 14/28] training-platform core: propagate loss verification result + drop dead code MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The Siracusa training harness was computing loss_err_count, printing "Errors: N out of M", and then unconditionally returning 0. Numerical mismatches were therefore non-blocking — the simulator exit code stayed green even when the training graph diverged from the reference. Make the return value reflect the comparison result: return loss_err_count == 0 ? 0 : 1; Also drop dead code that was never reached or never used: - training_cycles / optimizer_cycles locals (only commented-out printf) - connect_optimizer_buffers() (body was just (void)0; never called) - Several commented-out section header blocks and helper-call lines No control-flow change apart from the return value. Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. --- .../Platforms/Siracusa/src/deeploytraintest.c | 39 +------------------ 1 file changed, 1 insertion(+), 38 deletions(-) diff --git a/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c b/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c index 6b324ca7ad..00efca649e 100644 --- a/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c +++ b/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c @@ -108,16 +108,6 @@ struct pi_device cluster_dev; #define TOTAL_FWD_PASSES (N_TRAIN_STEPS * N_ACCUM_STEPS) static float stored_losses[TOTAL_FWD_PASSES]; -/* ------------------------------------------------------------------------- - * Optimizer buffer connection - * - * Connect DeeployOptNetwork_inputs[]/outputs[] to the training network's - * weight and grad acc buffers via memcpy. 
- * - * Optimizer ONNX input order: [w0, g0, w1, g1, ...] (interleaved pairs) - * Optimizer ONNX output order: [w0_updated, w1_updated, ...] - * ---------------------------------------------------------------------- */ - /* ------------------------------------------------------------------------- * L3-aware memory transfer: handles all combinations of L2/L3 src and dst * ---------------------------------------------------------------------- */ @@ -140,16 +130,6 @@ static void l3_aware_copy(void *dst, const void *src, uint32_t bytes) { } } -static void connect_optimizer_buffers(void) { -#if defined(TRAINING_NUM_WEIGHT_INPUTS) && (TRAINING_NUM_WEIGHT_INPUTS > 0) - /* Nothing to pre-allocate — InitOptimizerNetwork() already allocated the - * optimizer's static buffers and set DeeployOptNetwork_inputs[]/outputs[]. - * We only need to sync data at each optimizer step (see run_optimizer_step). - */ - (void)0; -#endif -} - static void run_optimizer_step(void) { #if defined(TRAINING_NUM_WEIGHT_INPUTS) && (TRAINING_NUM_WEIGHT_INPUTS > 0) /* --- Step A: copy current weights + grad acc → optimizer input buffers --- @@ -253,10 +233,6 @@ int main(void) { (unsigned)N_TRAIN_STEPS, (unsigned)N_ACCUM_STEPS, (unsigned)TRAINING_NUM_DATA_INPUTS); - // /* ------------------------------------------------------------------ - // * Cluster bring-up - // * ------------------------------------------------------------------ */ - struct pi_cluster_conf conf; pi_cluster_conf_init(&conf); conf.id = 0; @@ -313,12 +289,6 @@ int main(void) { cluster_task.slave_stack_size = SLAVESTACKSIZE; pi_cluster_send_task_to_cl(&cluster_dev, &cluster_task); - // connect_optimizer_buffers(); - - // /* ------------------------------------------------------------------ - // * lazy_reset_grad is the last input of the training network. 
- // * ------------------------------------------------------------------ */ - uint32_t reset_idx = DeeployNetwork_num_inputs - 1; /* ------------------------------------------------------------------ @@ -338,9 +308,6 @@ int main(void) { printf("Starting training (%u optimizer steps x %u accum steps)...\r\n", (unsigned)N_TRAIN_STEPS, (unsigned)N_ACCUM_STEPS); - uint32_t training_cycles = 0; - uint32_t optimizer_cycles = 0; - for (uint32_t update_step = 0; update_step < N_TRAIN_STEPS; update_step++) { for (uint32_t accum_step = 0; accum_step < N_ACCUM_STEPS; accum_step++) { @@ -393,10 +360,6 @@ int main(void) { } /* end update_step loop */ - // printf("Training complete.\r\n"); - // printf("Total training cycles : %u\r\n", training_cycles); - // printf("Total optimizer cycles : %u\r\n", optimizer_cycles); - /* ------------------------------------------------------------------ * Numerical verification — run on cluster (FC has no FPU) * ------------------------------------------------------------------ */ @@ -417,5 +380,5 @@ int main(void) { printf("Errors: %u out of %u\r\n", (unsigned)loss_err_count, (unsigned)total_loss_checks); - return 0; + return loss_err_count == 0 ? 0 : 1; } From f177a5b57f7c54a886917b5fd8cbc2f7398d2a21 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 14:44:12 +0000 Subject: [PATCH 15/28] training-platform core: label Step B in run_optimizer_step MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Step A and Step C in run_optimizer_step had explicit comment headers but the cluster task call between them (which actually runs the optimizer kernel) had none, so the labelling jumped A → C. Add a one-line "Step B: run optimizer kernel on cluster" header for symmetry. Pure documentation, no logic change. 
--- DeeployTest/Platforms/Siracusa/src/deeploytraintest.c | 1 + 1 file changed, 1 insertion(+) diff --git a/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c b/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c index 00efca649e..50eb34d748 100644 --- a/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c +++ b/DeeployTest/Platforms/Siracusa/src/deeploytraintest.c @@ -154,6 +154,7 @@ static void run_optimizer_step(void) { } } + /* --- Step B: run optimizer kernel on cluster --- */ struct pi_cluster_task opt_task; pi_cluster_task(&opt_task, RunOptimizerNetwork, NULL); opt_task.stack_size = MAINSTACKSIZE; From 0f853fc856b9aab18d482d7c7e6be19319eb346e Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 14:52:49 +0000 Subject: [PATCH 16/28] training-platform core: drop _augment_path PATH manipulation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The _augment_path helper prepended GVSOC_INSTALL_DIR/bin, LLVM_INSTALL_DIR/bin and (if a venv was active) the venv's bin to PATH before invoking cmake / gvsoc. Two reasons to remove it from this PR: 1. On Siracusa it is a no-op — CMake's gvsoc_ custom command and the LLVM toolchain calls go through ${GVSOC_INSTALL_DIR}/... and ${LLVM_INSTALL_DIR}/... absolute paths and never depend on PATH. 2. The venv kconfigtool.py shebang fixup is GAP9-specific. GAP9 is explicitly out of scope for this PR (the commit message of cc1f68b7 lists "GAP9 platform port (separate PR, depends on this)" under the excluded scope), so this plumbing belongs in the GAP9 follow-up, not here. Test runners silently mutating PATH is also a documentation/debug hazard — if a user has intentionally pinned a different gvsoc/llvm in their PATH, this helper would silently override it. Removed both call sites in cmake_configure() and run_simulation(). Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. 
--- DeeployTest/testUtils/core/execution.py | 29 ++----------------------- 1 file changed, 2 insertions(+), 27 deletions(-) diff --git a/DeeployTest/testUtils/core/execution.py b/DeeployTest/testUtils/core/execution.py index 6073800980..1d77be0698 100644 --- a/DeeployTest/testUtils/core/execution.py +++ b/DeeployTest/testUtils/core/execution.py @@ -15,31 +15,6 @@ from .output_parser import TestResult, parse_test_output -def _augment_path(env: dict) -> dict: - """Prepend gvsoc/llvm bin dirs to PATH based on installed env vars. - - The install dirs are already set as env vars (GVSOC_INSTALL_DIR, - LLVM_INSTALL_DIR) but their bin/ subdirectories may not be in PATH. - - If a virtual environment is active (VIRTUAL_ENV is set), its bin dir - is prepended so that shebang-invoked scripts (kconfigtool.py, gapy) - resolve python3 to the venv interpreter, which has kconfiglib. - Without this, /usr/bin/python3 would be picked up instead, which - lacks kconfiglib and causes CMake kconfig setup to fail. - """ - venv = env.get('VIRTUAL_ENV', '') - extra = [str(Path(venv) / 'bin')] if venv else ['/usr/bin'] - for var in ('GVSOC_INSTALL_DIR', 'LLVM_INSTALL_DIR'): - install_dir = env.get(var, '') - if install_dir: - bin_dir = str(Path(install_dir) / 'bin') - current = env.get('PATH', '').split(':') - if bin_dir not in current: - extra.append(bin_dir) - env['PATH'] = ':'.join(extra) + ':' + env.get('PATH', '') - return env - - def _resolve_optimizer_dir(config: DeeployTestConfig) -> str: """Return the optimizer ONNX directory for this config. 
@@ -323,7 +298,7 @@ def configure_cmake(config: DeeployTestConfig) -> None: script_dir = Path(__file__).parent.parent.parent cmd.append(str(script_dir.parent)) - env = _augment_path(os.environ.copy()) + env = os.environ.copy() if config.verbose >= 3: env["VERBOSE"] = "1" @@ -379,7 +354,7 @@ def run_simulation(config: DeeployTestConfig, skip: bool = False) -> TestResult: if config.simulator == 'none': raise RuntimeError("No simulator specified!") - env = _augment_path(os.environ.copy()) + env = os.environ.copy() if config.verbose >= 3: env["VERBOSE"] = "1" From 55c91d04081445b87a3faf06aa5842eb0e7ee4bd Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 17:36:57 +0000 Subject: [PATCH 17/28] training-platform core: extract training codegen helpers to trainingUtils.py generateTrainingNetwork.py and testMVPTraining.py carried byte-identical copies of _load_reference_losses, _infer_num_data_inputs, _infer_total_mb, _infer_data_size, _infer_n_accum and the _GRAD_ACC constant. testMVPOptimizer.py and testMVPTraining.py each defined their own _mockScheduler, also byte-identical. Move all of these to a new testUtils/trainingUtils.py module and import them from the three entry points so the next helper edit only has to happen in one place. No behaviour change; the three entry-point scripts still exist as before (user asked to keep testMVP*.py and the two deeployTrainingRunner_siracusa.py / _tiled_siracusa.py stubs). Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. 
--- DeeployTest/generateTrainingNetwork.py | 92 +------------------ DeeployTest/testMVPOptimizer.py | 6 +- DeeployTest/testMVPTraining.py | 82 +---------------- DeeployTest/testUtils/trainingUtils.py | 119 +++++++++++++++++++++++++ 4 files changed, 124 insertions(+), 175 deletions(-) create mode 100644 DeeployTest/testUtils/trainingUtils.py diff --git a/DeeployTest/generateTrainingNetwork.py b/DeeployTest/generateTrainingNetwork.py index bab1c33b36..48a158cb8c 100644 --- a/DeeployTest/generateTrainingNetwork.py +++ b/DeeployTest/generateTrainingNetwork.py @@ -12,6 +12,8 @@ from testUtils.codeGenerate import generateTrainingTestNetwork from testUtils.platformMapping import mapDeployer, mapPlatform from testUtils.testRunner import TestGeneratorArgumentParser +from testUtils.trainingUtils import _GRAD_ACC, _infer_data_size, _infer_n_accum, _infer_num_data_inputs, \ + _infer_total_mb, _load_reference_losses from testUtils.typeMapping import inferTypeAndOffset from Deeploy.AbstractDataTypes import PointerClass @@ -20,96 +22,6 @@ from Deeploy.Logging import DEFAULT_LOGGER as log from Deeploy.Targets.PULPOpen.Platform import PULPClusterEngine -_GRAD_ACC = "_grad.accumulation.buffer" - - -def _load_reference_losses(train_dir: str) -> list: - """Load reference loss values from outputs.npz.""" - outputs_path = os.path.join(train_dir, "outputs.npz") - if not os.path.exists(outputs_path): - log.warning(f"outputs.npz not found at {outputs_path} — loss comparison skipped") - return None - - try: - outputs = np.load(outputs_path) - except Exception as e: - log.warning(f"Failed to load outputs.npz: {e} — loss comparison skipped") - return None - - for key in outputs.files: - if 'loss' in key.lower(): - vals = [float(v) for v in np.array(outputs[key]).flatten().tolist()] - log.info(f"Reference losses loaded from outputs.npz['{key}']: {vals}") - return vals - - log.warning("No 'loss' key found in outputs.npz — loss comparison skipped") - return None - - -def 
_infer_num_data_inputs(inputs_path: str) -> int: - """Auto-detect number of data inputs from inputs.npz. - - Data inputs are the base arr_* entries that have per-mini-batch - variants (mb1_arr_*) in the npz — i.e. entries that actually change - across mini-batches. - - Raises ValueError if no mb1 entries are found (single-mini-batch case) - where the data/weight boundary cannot be determined automatically. - """ - inputs = np.load(inputs_path) - base_keys = sorted(k for k in inputs.files if not k.startswith('mb') and not k.startswith('meta_')) - count = sum(1 for k in base_keys if f'mb1_{k}' in inputs.files) - if count == 0: - raise ValueError("Cannot auto-detect num_data_inputs: inputs.npz has only one mini-batch " - "(no mb1_arr_* entries found). Please pass --num-data-inputs explicitly.") - return count - - -def _infer_total_mb(inputs_path: str) -> int: - """Count total mini-batches from inputs.npz. - - New format: inputs.npz contains meta_n_batches (total training mini-batches) - and meta_data_size (number of unique samples stored; C harness cycles via modulo). - - Legacy format: count 1 + number of unique mb* indices. - """ - inputs = np.load(inputs_path) - if "meta_n_batches" in inputs.files: - return int(inputs["meta_n_batches"].flat[0]) - mb_indices = set() - for key in inputs.files: - if key.startswith('mb'): - try: - idx = int(key.split('_')[0][2:]) - mb_indices.add(idx) - except ValueError: - pass - return 1 + len(mb_indices) - - -def _infer_data_size(inputs_path: str) -> int: - """Return the number of unique input samples stored in inputs.npz. - - New format: reads meta_data_size. - Legacy format: same as _infer_total_mb (all batches were unique). - """ - inputs = np.load(inputs_path) - if "meta_data_size" in inputs.files: - return int(inputs["meta_data_size"].flat[0]) - return _infer_total_mb(inputs_path) - - -def _infer_n_accum(inputs_path: str) -> int: - """Return the gradient accumulation step count stored in inputs.npz. 
- - New format: reads meta_n_accum written by the exporter. - Legacy format: defaults to 1 (no gradient accumulation). - """ - inputs = np.load(inputs_path) - if "meta_n_accum" in inputs.files: - return int(inputs["meta_n_accum"].flat[0]) - return 1 - def generateTrainingNetwork(args): log.debug("Arguments: %s", args) diff --git a/DeeployTest/testMVPOptimizer.py b/DeeployTest/testMVPOptimizer.py index 3fdf4faae6..d89277690e 100644 --- a/DeeployTest/testMVPOptimizer.py +++ b/DeeployTest/testMVPOptimizer.py @@ -36,6 +36,7 @@ from testUtils.platformMapping import mapDeployer, mapPlatform, setupMemoryPlatform from testUtils.testRunner import TestGeneratorArgumentParser from testUtils.tilingUtils import TrainingSBTiler +from testUtils.trainingUtils import _mockScheduler from Deeploy.AbstractDataTypes import PointerClass from Deeploy.CommonExtensions.DataTypes import float32_t @@ -49,11 +50,6 @@ from Deeploy.TilingExtension.TilerExtension import TilerDeployerWrapper -def _mockScheduler(graph: gs.Graph) -> List[List[gs.Node]]: - """Wrap every node in a singleton list for the Tiler pattern interface.""" - return [[node] for node in graph.nodes] - - def generateTiledOptimizerNetwork(args) -> None: log.debug("Arguments: %s", args) diff --git a/DeeployTest/testMVPTraining.py b/DeeployTest/testMVPTraining.py index 438e6985ce..76b42225e8 100644 --- a/DeeployTest/testMVPTraining.py +++ b/DeeployTest/testMVPTraining.py @@ -15,6 +15,8 @@ from testUtils.platformMapping import mapDeployer, mapPlatform, setupMemoryPlatform from testUtils.testRunner import TestGeneratorArgumentParser from testUtils.tilingUtils import TrainingSBTiler +from testUtils.trainingUtils import _GRAD_ACC, _infer_data_size, _infer_n_accum, _infer_num_data_inputs, \ + _infer_total_mb, _load_reference_losses, _mockScheduler from testUtils.typeMapping import inferTypeAndOffset from Deeploy.AbstractDataTypes import PointerClass @@ -28,86 +30,6 @@ from Deeploy.Targets.PULPOpen.Platform import PULPClusterEngine from 
Deeploy.TilingExtension.TilerExtension import TilerDeployerWrapper -_GRAD_ACC = "_grad.accumulation.buffer" - -# --------------------------------------------------------------------------- -# Helpers copied from generateTrainingNetwork.py -# --------------------------------------------------------------------------- - - -def _load_reference_losses(train_dir: str) -> list: - """Load reference loss values from outputs.npz.""" - outputs_path = os.path.join(train_dir, "outputs.npz") - if not os.path.exists(outputs_path): - log.warning(f"outputs.npz not found at {outputs_path} — loss comparison skipped") - return None - try: - outputs = np.load(outputs_path) - except Exception as e: - log.warning(f"Failed to load outputs.npz: {e} — loss comparison skipped") - return None - for key in outputs.files: - if 'loss' in key.lower(): - vals = [float(v) for v in np.array(outputs[key]).flatten().tolist()] - log.info(f"Reference losses loaded from outputs.npz['{key}']: {vals}") - return vals - log.warning("No 'loss' key found in outputs.npz — loss comparison skipped") - return None - - -def _infer_num_data_inputs(inputs_path: str) -> int: - inputs = np.load(inputs_path) - base_keys = sorted(k for k in inputs.files if not k.startswith('mb') and not k.startswith('meta_')) - count = sum(1 for k in base_keys if f'mb1_{k}' in inputs.files) - if count == 0: - raise ValueError("Cannot auto-detect num_data_inputs: inputs.npz has only one mini-batch " - "(no mb1_arr_* entries found). 
Please pass --num-data-inputs explicitly.") - return count - - -def _infer_total_mb(inputs_path: str) -> int: - inputs = np.load(inputs_path) - if "meta_n_batches" in inputs.files: - return int(inputs["meta_n_batches"].flat[0]) - mb_indices = set() - for key in inputs.files: - if key.startswith('mb'): - try: - idx = int(key.split('_')[0][2:]) - mb_indices.add(idx) - except ValueError: - pass - return 1 + len(mb_indices) - - -def _infer_data_size(inputs_path: str) -> int: - inputs = np.load(inputs_path) - if "meta_data_size" in inputs.files: - return int(inputs["meta_data_size"].flat[0]) - return _infer_total_mb(inputs_path) - - -def _infer_n_accum(inputs_path: str) -> int: - inputs = np.load(inputs_path) - if "meta_n_accum" in inputs.files: - return int(inputs["meta_n_accum"].flat[0]) - return 1 - - -# --------------------------------------------------------------------------- -# Mock scheduler (same as testMVP.py) -# --------------------------------------------------------------------------- - - -def _mockScheduler(graph: gs.Graph) -> List[List[gs.Node]]: - """Wrap every node in a singleton list for the Tiler pattern interface.""" - return [[node] for node in graph.nodes] - - -# --------------------------------------------------------------------------- -# Main generation function -# --------------------------------------------------------------------------- - def generateTiledTrainingNetwork(args) -> None: log.debug("Arguments: %s", args) diff --git a/DeeployTest/testUtils/trainingUtils.py b/DeeployTest/testUtils/trainingUtils.py new file mode 100644 index 0000000000..155902ead7 --- /dev/null +++ b/DeeployTest/testUtils/trainingUtils.py @@ -0,0 +1,119 @@ +# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 +""" +Shared helpers used by the training / optimizer code-generation entry points +(generateTrainingNetwork.py, testMVPTraining.py, testMVPOptimizer.py). 
+ +These helpers read metadata and reference values out of inputs.npz / outputs.npz +produced by the training ONNX exporter, and provide the singleton-pattern +"scheduler" the Tiler expects when each node is handled independently. +""" + +import os +from typing import List, Optional + +import numpy as np +import onnx_graphsurgeon as gs + +from Deeploy.Logging import DEFAULT_LOGGER as log + +# Graph input name marker identifying gradient accumulation buffers. +_GRAD_ACC = "_grad.accumulation.buffer" + + +def _load_reference_losses(train_dir: str) -> Optional[list]: + """Load reference loss values from outputs.npz. + + Returns the list of per-mini-batch loss values if any key in + outputs.npz contains 'loss', otherwise None (with a warning). + """ + outputs_path = os.path.join(train_dir, "outputs.npz") + if not os.path.exists(outputs_path): + log.warning(f"outputs.npz not found at {outputs_path} — loss comparison skipped") + return None + + try: + outputs = np.load(outputs_path) + except Exception as e: + log.warning(f"Failed to load outputs.npz: {e} — loss comparison skipped") + return None + + for key in outputs.files: + if 'loss' in key.lower(): + vals = [float(v) for v in np.array(outputs[key]).flatten().tolist()] + log.info(f"Reference losses loaded from outputs.npz['{key}']: {vals}") + return vals + + log.warning("No 'loss' key found in outputs.npz — loss comparison skipped") + return None + + +def _infer_num_data_inputs(inputs_path: str) -> int: + """Auto-detect number of data inputs from inputs.npz. + + Data inputs are the base arr_* entries that have per-mini-batch + variants (mb1_arr_*) in the npz — i.e. entries that actually change + across mini-batches. + + Raises ValueError if no mb1 entries are found (single-mini-batch case) + where the data/weight boundary cannot be determined automatically. 
+ """ + inputs = np.load(inputs_path) + base_keys = sorted(k for k in inputs.files if not k.startswith('mb') and not k.startswith('meta_')) + count = sum(1 for k in base_keys if f'mb1_{k}' in inputs.files) + if count == 0: + raise ValueError("Cannot auto-detect num_data_inputs: inputs.npz has only one mini-batch " + "(no mb1_arr_* entries found). Please pass --num-data-inputs explicitly.") + return count + + +def _infer_total_mb(inputs_path: str) -> int: + """Count total mini-batches from inputs.npz. + + New format: inputs.npz contains meta_n_batches (total training mini-batches) + and meta_data_size (number of unique samples stored; C harness cycles via modulo). + + Legacy format: count 1 + number of unique mb* indices. + """ + inputs = np.load(inputs_path) + if "meta_n_batches" in inputs.files: + return int(inputs["meta_n_batches"].flat[0]) + mb_indices = set() + for key in inputs.files: + if key.startswith('mb'): + try: + idx = int(key.split('_')[0][2:]) + mb_indices.add(idx) + except ValueError: + pass + return 1 + len(mb_indices) + + +def _infer_data_size(inputs_path: str) -> int: + """Return the number of unique input samples stored in inputs.npz. + + New format: reads meta_data_size. + Legacy format: same as _infer_total_mb (all batches were unique). + """ + inputs = np.load(inputs_path) + if "meta_data_size" in inputs.files: + return int(inputs["meta_data_size"].flat[0]) + return _infer_total_mb(inputs_path) + + +def _infer_n_accum(inputs_path: str) -> int: + """Return the gradient accumulation step count stored in inputs.npz. + + New format: reads meta_n_accum written by the exporter. + Legacy format: defaults to 1 (no gradient accumulation). 
+ """ + inputs = np.load(inputs_path) + if "meta_n_accum" in inputs.files: + return int(inputs["meta_n_accum"].flat[0]) + return 1 + + +def _mockScheduler(graph: gs.Graph) -> List[List[gs.Node]]: + """Wrap every node in a singleton list for the Tiler pattern interface.""" + return [[node] for node in graph.nodes] From 2d53fe2ae2ef69e3b3d20f739660c6ef603d4475 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 17:41:41 +0000 Subject: [PATCH 18/28] training-platform core: collapse duplicate tiled/non-tiled training branches in generate_network The training path in execution.generate_network had two sibling branches (config.training and config.tiling / config.training) that were 90% identical: - same Step 1 (run training codegen script with --n-steps / --n-accum / --num-data-inputs / -v / --debug / gen_args) - same training_meta.json read-back - same Step 2 optimizer loop with passthrough args and --defaultMemLevel default The only real differences were the two script names (testMVPTraining.py vs generateTrainingNetwork.py and the corresponding optimizer pair), a 4-entry vs 8-entry passthrough list, and the "Tiled training" vs "Training" error-message prefix. Collapse into a single `if config.training:` branch that selects the three variants up front and reuses one body. The two inference branches (`elif config.tiling:` and `else:`) are left untouched. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. 
--- DeeployTest/testUtils/core/execution.py | 115 +++++------------------- 1 file changed, 22 insertions(+), 93 deletions(-) diff --git a/DeeployTest/testUtils/core/execution.py b/DeeployTest/testUtils/core/execution.py index 1d77be0698..739d23c2fe 100644 --- a/DeeployTest/testUtils/core/execution.py +++ b/DeeployTest/testUtils/core/execution.py @@ -49,90 +49,23 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: script_dir = Path(__file__).parent.parent.parent - if config.training and config.tiling: - # --- Tiled training: testMVPTraining.py (tiling pipeline + training init) --- - generation_script = script_dir / "testMVPTraining.py" - cmd = [ - sys.executable, - str(generation_script), - "-d", - config.gen_dir, - "-t", - config.test_dir, - "-p", - config.platform, - ] - if config.n_train_steps is not None: - cmd.append(f"--n-steps={config.n_train_steps}") - if config.n_accum_steps is not None: - cmd.append(f"--n-accum={config.n_accum_steps}") - if config.training_num_data_inputs is not None: - cmd.append(f"--num-data-inputs={config.training_num_data_inputs}") - if config.verbose > 0: - cmd.append("-" + "v" * config.verbose) - if config.debug: - cmd.append("--debug") - cmd.extend(config.gen_args) - - log.debug(f"[Execution] Tiled training generation command: {' '.join(cmd)}") - result = subprocess.run(cmd, check = False) - if result.returncode != 0: - raise RuntimeError(f"Tiled training network generation failed for {config.test_name}") - - # Read back auto-detected values written by testMVPTraining.py - meta_path = Path(config.gen_dir) / "training_meta.json" - if meta_path.exists(): - with open(meta_path) as f: - meta = json.load(f) - config.n_train_steps = meta["n_train_steps"] - config.n_accum_steps = meta["n_accum_steps"] - config.training_num_data_inputs = meta["training_num_data_inputs"] - log.info(f"[Execution] Training meta: {meta}") - - # --- Step 2: Tiled optimizer network (SGD via testMVPOptimizer.py) --- - opt_dir = 
_resolve_optimizer_dir(config) - opt_script = script_dir / "testMVPOptimizer.py" - - if not Path(opt_dir).exists(): - log.warning(f"Optimizer directory not found: {opt_dir} — skipping optimizer codegen") - elif not opt_script.exists(): - log.warning(f"testMVPOptimizer.py not found — skipping optimizer codegen") + if config.training: + if config.tiling: + training_script = script_dir / "testMVPTraining.py" + optimizer_script = script_dir / "testMVPOptimizer.py" + opt_passthrough = ("--cores", "--l1", "--l2", "--defaultMemLevel", "--memAllocStrategy", + "--searchStrategy", "--plotMemAlloc", "--profileTiling") + stage = "Tiled training" else: - opt_cmd = [ - sys.executable, - str(opt_script), - "-d", - config.gen_dir, - "-t", - opt_dir, - "-p", - config.platform, - f"--training-dir={config.test_dir}", - ] - _OPT_PASSTHROUGH = ("--cores", "--l1", "--l2", "--defaultMemLevel", "--memAllocStrategy", - "--searchStrategy", "--plotMemAlloc", "--profileTiling") - for arg in config.gen_args: - if any(arg.startswith(p) for p in _OPT_PASSTHROUGH): - opt_cmd.append(arg) - # If no --defaultMemLevel was passed through, default to L2 - if not any(arg.startswith("--defaultMemLevel") for arg in opt_cmd): - opt_cmd.append("--defaultMemLevel=L2") - if config.verbose > 0: - opt_cmd.append("-" + "v" * config.verbose) + training_script = script_dir / "generateTrainingNetwork.py" + optimizer_script = script_dir / "generateOptimizerNetwork.py" + opt_passthrough = ("--cores", "--l1", "--l2", "--defaultMemLevel") + stage = "Training" - log.debug(f"[Execution] Tiled optimizer generation command: {' '.join(opt_cmd)}") - result = subprocess.run(opt_cmd, check = False) - if result.returncode != 0: - raise RuntimeError(f"Tiled optimizer network generation failed for {config.test_name}") - - return # early return — tiled training path complete - - elif config.training: # --- Step 1: Training network (forward + backward + accumulation) --- - generation_script = script_dir / 
"generateTrainingNetwork.py" cmd = [ sys.executable, - str(generation_script), + str(training_script), "-d", config.gen_dir, "-t", @@ -140,26 +73,25 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: "-p", config.platform, ] - # Only pass values when explicitly set; otherwise let the script auto-detect + # Only pass values when explicitly set; otherwise let the script auto-detect. if config.n_train_steps is not None: cmd.append(f"--n-steps={config.n_train_steps}") if config.n_accum_steps is not None: cmd.append(f"--n-accum={config.n_accum_steps}") if config.training_num_data_inputs is not None: cmd.append(f"--num-data-inputs={config.training_num_data_inputs}") - if config.verbose > 0: cmd.append("-" + "v" * config.verbose) if config.debug: cmd.append("--debug") cmd.extend(config.gen_args) - log.debug(f"[Execution] Training generation command: {' '.join(cmd)}") + log.debug(f"[Execution] {stage} generation command: {' '.join(cmd)}") result = subprocess.run(cmd, check = False) if result.returncode != 0: - raise RuntimeError(f"Training network generation failed for {config.test_name}") + raise RuntimeError(f"{stage} network generation failed for {config.test_name}") - # Read back auto-detected values written by generateTrainingNetwork.py + # Read back auto-detected values written by the training generation script. 
meta_path = Path(config.gen_dir) / "training_meta.json" if meta_path.exists(): with open(meta_path) as f: @@ -171,16 +103,14 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: # --- Step 2: Optimizer network (SGD) --- opt_dir = _resolve_optimizer_dir(config) - opt_script = script_dir / "generateOptimizerNetwork.py" - if not Path(opt_dir).exists(): log.warning(f"Optimizer directory not found: {opt_dir} — skipping optimizer codegen") - elif not opt_script.exists(): - log.warning(f"generateOptimizerNetwork.py not found — skipping optimizer codegen") + elif not optimizer_script.exists(): + log.warning(f"{optimizer_script.name} not found — skipping optimizer codegen") else: opt_cmd = [ sys.executable, - str(opt_script), + str(optimizer_script), "-d", config.gen_dir, "-t", @@ -189,19 +119,18 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: config.platform, f"--training-dir={config.test_dir}", ] - _OPT_PASSTHROUGH = ("--cores", "--l1", "--l2", "--defaultMemLevel") for arg in config.gen_args: - if any(arg.startswith(p) for p in _OPT_PASSTHROUGH): + if any(arg.startswith(p) for p in opt_passthrough): opt_cmd.append(arg) if not any(arg.startswith("--defaultMemLevel") for arg in opt_cmd): opt_cmd.append("--defaultMemLevel=L2") if config.verbose > 0: opt_cmd.append("-" + "v" * config.verbose) - log.debug(f"[Execution] Optimizer generation command: {' '.join(opt_cmd)}") + log.debug(f"[Execution] {stage} optimizer generation command: {' '.join(opt_cmd)}") result = subprocess.run(opt_cmd, check = False) if result.returncode != 0: - raise RuntimeError(f"Optimizer network generation failed for {config.test_name}") + raise RuntimeError(f"{stage} optimizer network generation failed for {config.test_name}") return # early return — training path complete From b689d3fe144ed4de3683d5eefff3bc5665dc59c9 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 17:55:50 +0000 Subject: [PATCH 19/28] training-platform core: 
extract training codegen argparse builders to trainingUtils.py MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The four codegen entry points (generateTrainingNetwork.py, testMVPTraining.py, generateOptimizerNetwork.py, testMVPOptimizer.py) each had a nearly identical argparse block in their if __name__ == '__main__' tail, plus the same try/except --shouldFail handshake. Each addition of a training-side arg meant editing up to four scripts in lockstep. Move the shared argparse groups and the handshake runner to testUtils/trainingUtils.py: - add_cores_arg(parser) (shared by all four) - add_training_inference_args(parser) (--num-data-inputs / --n-steps / --n-accum / --learning-rate / --tolerance; shared by both training scripts) - add_memory_level_args(parser) (--l1 / --l2 / --defaultMemLevel; shared by the tiled training and both optimizer scripts) - add_tiling_solver_args(parser) (--memAllocStrategy / --searchStrategy / --plotMemAlloc / --profileTiling; shared by the two tiled variants) - add_optimizer_training_dir_arg(parser) (shared by both optimizer scripts) - add_should_fail_arg(parser) (shared by all four) - run_with_shouldfail(fn, args, label) (try/except --shouldFail runner) Each entry point's __main__ block shrinks to ~8 lines that compose the groups it needs. CLI behaviour is unchanged — every argument keeps the same name, default, dest and help text. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. 
--- DeeployTest/generateOptimizerNetwork.py | 43 +----- DeeployTest/generateTrainingNetwork.py | 68 +--------- DeeployTest/testMVPOptimizer.py | 85 ++---------- DeeployTest/testMVPTraining.py | 115 ++-------------- DeeployTest/testUtils/trainingUtils.py | 166 +++++++++++++++++++++++- 5 files changed, 190 insertions(+), 287 deletions(-) diff --git a/DeeployTest/generateOptimizerNetwork.py b/DeeployTest/generateOptimizerNetwork.py index 567f8e1a1e..2b484e0110 100644 --- a/DeeployTest/generateOptimizerNetwork.py +++ b/DeeployTest/generateOptimizerNetwork.py @@ -30,6 +30,8 @@ from testUtils.codeGenerate import build_shared_buffer_maps, generateOptimizerTestNetwork from testUtils.platformMapping import mapDeployer, mapPlatform, setupMemoryPlatform from testUtils.testRunner import TestGeneratorArgumentParser +from testUtils.trainingUtils import add_cores_arg, add_memory_level_args, add_optimizer_training_dir_arg, \ + add_should_fail_arg, run_with_shouldfail from Deeploy.AbstractDataTypes import PointerClass from Deeploy.CommonExtensions.DataTypes import float32_t @@ -117,47 +119,16 @@ def generateOptimizerNetwork(args): if __name__ == '__main__': - parser = TestGeneratorArgumentParser(description = "Deeploy Optimizer Network Code Generation.") - parser.add_argument( - "--cores", - type = int, - default = 1, - help = "Number of cluster cores. Default: 1.", - ) + add_cores_arg(parser) parser.add_argument( "--lr", type = float, default = 0.001, help = "Learning rate (informational only; embedded in optimizer ONNX attributes). Default: 0.001.", ) - parser.add_argument("--defaultMemLevel", - type = str, - default = "L2", - help = "Default memory level (L2 or L3). Must match the training graph. Default: L2.") - parser.add_argument("--l1", type = int, default = 64000, help = "L1 size in bytes. Default: 64000.") - parser.add_argument("--l2", type = int, default = 1024000, help = "L2 size in bytes. 
Default: 1024000.") - parser.add_argument( - "--training-dir", - type = str, - default = None, - help = "Directory containing the training network.onnx. When provided, " - "weight and grad-acc buffers are shared with TrainingNetwork instead " - "of being allocated independently.", - ) - parser.add_argument('--shouldFail', action = 'store_true') - parser.set_defaults(shouldFail = False) - + add_memory_level_args(parser) + add_optimizer_training_dir_arg(parser) + add_should_fail_arg(parser) args = parser.parse_args() - - try: - generateOptimizerNetwork(args) - except Exception as e: - if args.shouldFail: - print("\033[92mOptimizer network generation ended, failed as expected!\033[0m") - sys.exit(0) - else: - raise e - - if args.shouldFail: - raise RuntimeError("Expected to fail!") + run_with_shouldfail(generateOptimizerNetwork, args, "Optimizer network generation") diff --git a/DeeployTest/generateTrainingNetwork.py b/DeeployTest/generateTrainingNetwork.py index 48a158cb8c..ab6ed5bff4 100644 --- a/DeeployTest/generateTrainingNetwork.py +++ b/DeeployTest/generateTrainingNetwork.py @@ -13,7 +13,8 @@ from testUtils.platformMapping import mapDeployer, mapPlatform from testUtils.testRunner import TestGeneratorArgumentParser from testUtils.trainingUtils import _GRAD_ACC, _infer_data_size, _infer_n_accum, _infer_num_data_inputs, \ - _infer_total_mb, _load_reference_losses + _infer_total_mb, _load_reference_losses, add_cores_arg, add_should_fail_arg, add_training_inference_args, \ + run_with_shouldfail from testUtils.typeMapping import inferTypeAndOffset from Deeploy.AbstractDataTypes import PointerClass @@ -216,66 +217,9 @@ def generateTrainingNetwork(args): if __name__ == '__main__': - parser = TestGeneratorArgumentParser(description = "Deeploy Training Code Generation Utility.") - parser.add_argument( - "--cores", - type = int, - default = 1, - help = "Number of cores on which the network is run. " - "Currently required for im2col buffer sizing on Siracusa. 
Default: 1.", - ) - parser.add_argument( - "--num-data-inputs", - type = int, - dest = "num_data_inputs", - default = None, - help = "Number of DATA inputs that change per mini-batch. " - "Auto-detected from ONNX graph if not specified.", - ) - parser.add_argument( - "--n-steps", - type = int, - dest = "n_steps", - default = None, - help = "N_TRAIN_STEPS: number of gradient-accumulation update steps. " - "Auto-detected from inputs.npz mini-batch count if not specified.", - ) - parser.add_argument( - "--n-accum", - type = int, - dest = "n_accum", - default = None, - help = "N_ACCUM_STEPS: number of mini-batches per update step. " - "Auto-detected from inputs.npz mini-batch count if not specified.", - ) - parser.add_argument( - "--learning-rate", - type = float, - dest = "learning_rate", - default = 0.001, - help = "SGD learning rate emitted as TRAINING_LEARNING_RATE in testinputs.h. Default: 0.001.", - ) - parser.add_argument( - "--tolerance", - type = float, - dest = "tolerance_abs", - default = 1e-3, - help = "Absolute loss tolerance emitted as TRAINING_TOLERANCE_ABS in testoutputs.h. 
Default: 1e-3.", - ) - parser.add_argument('--shouldFail', action = 'store_true') - parser.set_defaults(shouldFail = False) - + add_cores_arg(parser) + add_training_inference_args(parser) + add_should_fail_arg(parser) args = parser.parse_args() - - try: - generateTrainingNetwork(args) - except Exception as e: - if args.shouldFail: - print("\033[92mTraining network generation ended, failed as expected!\033[0m") - sys.exit(0) - else: - raise e - - if args.shouldFail: - raise RuntimeError("Expected to fail!") + run_with_shouldfail(generateTrainingNetwork, args, "Training network generation") diff --git a/DeeployTest/testMVPOptimizer.py b/DeeployTest/testMVPOptimizer.py index d89277690e..02804df243 100644 --- a/DeeployTest/testMVPOptimizer.py +++ b/DeeployTest/testMVPOptimizer.py @@ -36,7 +36,8 @@ from testUtils.platformMapping import mapDeployer, mapPlatform, setupMemoryPlatform from testUtils.testRunner import TestGeneratorArgumentParser from testUtils.tilingUtils import TrainingSBTiler -from testUtils.trainingUtils import _mockScheduler +from testUtils.trainingUtils import _mockScheduler, add_cores_arg, add_memory_level_args, \ + add_optimizer_training_dir_arg, add_should_fail_arg, add_tiling_solver_args, run_with_shouldfail from Deeploy.AbstractDataTypes import PointerClass from Deeploy.CommonExtensions.DataTypes import float32_t @@ -146,87 +147,17 @@ def generateTiledOptimizerNetwork(args) -> None: if __name__ == '__main__': - parser = TestGeneratorArgumentParser(description = "Deeploy Tiled Optimizer Network Code Generation.") - - parser.add_argument( - "--cores", - type = int, - default = 1, - help = "Number of cluster cores. Default: 1.", - ) + add_cores_arg(parser) parser.add_argument( "--lr", type = float, default = 0.001, help = "Learning rate (informational only; embedded in optimizer ONNX attributes). Default: 0.001.", ) - parser.add_argument( - '--l1', - type = int, - dest = 'l1', - default = 64_000, - help = 'L1 size in bytes. 
Default: 64000.', - ) - parser.add_argument( - '--l2', - type = int, - dest = 'l2', - default = 1_024_000, - help = 'L2 size in bytes. Default: 1024000.', - ) - parser.add_argument( - '--defaultMemLevel', - type = str, - dest = 'defaultMemLevel', - default = "L2", - help = 'Default memory level for optimizer I/O buffers (L2 or L3). Must match the training graph. Default: L2.', - ) - parser.add_argument( - '--memAllocStrategy', - type = str, - dest = 'memAllocStrategy', - default = "MiniMalloc", - help = 'Memory allocation strategy. Default: MiniMalloc.', - ) - parser.add_argument( - '--searchStrategy', - type = str, - dest = 'searchStrategy', - default = "random-max", - help = 'CP solver search strategy. Default: random-max.', - ) - parser.add_argument( - '--plotMemAlloc', - action = 'store_true', - help = 'Save memory allocation plots in the deeployStates folder.', - ) - parser.add_argument( - '--profileTiling', - action = 'store_true', - help = 'Enable tiling profiling (inserts cycle counters around each tiled kernel).', - ) - parser.add_argument( - "--training-dir", - type = str, - default = None, - help = "Directory containing the training network.onnx. 
When provided, " - "weight and grad-acc buffers are shared with TrainingNetwork instead " - "of being allocated independently.", - ) - parser.add_argument('--shouldFail', action = 'store_true') - parser.set_defaults(shouldFail = False) - + add_memory_level_args(parser) + add_tiling_solver_args(parser) + add_optimizer_training_dir_arg(parser) + add_should_fail_arg(parser) args = parser.parse_args() - - try: - generateTiledOptimizerNetwork(args) - except Exception as e: - if args.shouldFail: - print("\033[92mTiled optimizer network generation ended, failed as expected!\033[0m") - sys.exit(0) - else: - raise e - - if args.shouldFail: - raise RuntimeError("Expected to fail!") + run_with_shouldfail(generateTiledOptimizerNetwork, args, "Tiled optimizer network generation") diff --git a/DeeployTest/testMVPTraining.py b/DeeployTest/testMVPTraining.py index 76b42225e8..b965e91a40 100644 --- a/DeeployTest/testMVPTraining.py +++ b/DeeployTest/testMVPTraining.py @@ -16,7 +16,8 @@ from testUtils.testRunner import TestGeneratorArgumentParser from testUtils.tilingUtils import TrainingSBTiler from testUtils.trainingUtils import _GRAD_ACC, _infer_data_size, _infer_n_accum, _infer_num_data_inputs, \ - _infer_total_mb, _load_reference_losses, _mockScheduler + _infer_total_mb, _load_reference_losses, _mockScheduler, add_cores_arg, add_memory_level_args, \ + add_should_fail_arg, add_tiling_solver_args, add_training_inference_args, run_with_shouldfail from testUtils.typeMapping import inferTypeAndOffset from Deeploy.AbstractDataTypes import PointerClass @@ -233,111 +234,11 @@ def generateTiledTrainingNetwork(args) -> None: # --------------------------------------------------------------------------- if __name__ == '__main__': - parser = TestGeneratorArgumentParser(description = "Deeploy Tiled Training Code Generation Utility.") - - # Training params (same as generateTrainingNetwork.py) - parser.add_argument( - "--cores", - type = int, - default = 1, - help = "Number of cores on which 
the network is run. Default: 1.", - ) - parser.add_argument( - "--num-data-inputs", - type = int, - dest = "num_data_inputs", - default = None, - help = "Number of DATA inputs that change per mini-batch. Auto-detected if not specified.", - ) - parser.add_argument( - "--n-steps", - type = int, - dest = "n_steps", - default = None, - help = "N_TRAIN_STEPS: number of gradient-accumulation update steps.", - ) - parser.add_argument( - "--n-accum", - type = int, - dest = "n_accum", - default = None, - help = "N_ACCUM_STEPS: number of mini-batches per update step.", - ) - parser.add_argument( - "--learning-rate", - type = float, - dest = "learning_rate", - default = 0.001, - help = "SGD learning rate emitted as TRAINING_LEARNING_RATE in testinputs.h. Default: 0.001.", - ) - - # Tiling params (same as testMVP.py) - parser.add_argument( - '--l1', - type = int, - dest = 'l1', - default = 64_000, - help = 'Set L1 size in bytes. Default: 64000.', - ) - parser.add_argument( - '--l2', - type = int, - dest = 'l2', - default = 1_024_000, - help = 'Set L2 size in bytes. Default: 1024000.', - ) - parser.add_argument( - '--defaultMemLevel', - type = str, - dest = 'defaultMemLevel', - default = "L2", - help = 'Default memory level for IO buffers. Default: L2.', - ) - parser.add_argument( - '--memAllocStrategy', - type = str, - dest = 'memAllocStrategy', - default = "MiniMalloc", - help = 'Memory allocation strategy. Default: MiniMalloc.', - ) - parser.add_argument( - '--searchStrategy', - type = str, - dest = 'searchStrategy', - default = "random-max", - help = 'CP solver search strategy. 
Default: random-max.', - ) - parser.add_argument( - '--plotMemAlloc', - action = 'store_true', - help = 'Save memory allocation plots in the deeployStates folder.', - ) - parser.add_argument( - '--profileTiling', - action = 'store_true', - help = 'Enable tiling profiling (inserts cycle counters around each tiled kernel).', - ) - parser.add_argument( - '--tolerance', - type = float, - dest = 'tolerance_abs', - default = 1e-3, - help = 'Absolute loss tolerance emitted as TRAINING_TOLERANCE_ABS in testoutputs.h. Default: 1e-3.', - ) - parser.add_argument('--shouldFail', action = 'store_true') - parser.set_defaults(shouldFail = False) - + add_cores_arg(parser) + add_training_inference_args(parser) + add_memory_level_args(parser) + add_tiling_solver_args(parser) + add_should_fail_arg(parser) args = parser.parse_args() - - try: - generateTiledTrainingNetwork(args) - except Exception as e: - if args.shouldFail: - print("\033[92mTiled training network generation ended, failed as expected!\033[0m") - sys.exit(0) - else: - raise e - - if args.shouldFail: - raise RuntimeError("Expected to fail!") + run_with_shouldfail(generateTiledTrainingNetwork, args, "Tiled training network generation") diff --git a/DeeployTest/testUtils/trainingUtils.py b/DeeployTest/testUtils/trainingUtils.py index 155902ead7..08f3c1eea0 100644 --- a/DeeployTest/testUtils/trainingUtils.py +++ b/DeeployTest/testUtils/trainingUtils.py @@ -3,15 +3,22 @@ # SPDX-License-Identifier: Apache-2.0 """ Shared helpers used by the training / optimizer code-generation entry points -(generateTrainingNetwork.py, testMVPTraining.py, testMVPOptimizer.py). +(generateTrainingNetwork.py, testMVPTraining.py, generateOptimizerNetwork.py, +testMVPOptimizer.py). -These helpers read metadata and reference values out of inputs.npz / outputs.npz -produced by the training ONNX exporter, and provide the singleton-pattern -"scheduler" the Tiler expects when each node is handled independently. +Three kinds of helpers live here: + +1. 
inputs.npz / outputs.npz readers (``_load_reference_losses``, ``_infer_*``). +2. The singleton ``_mockScheduler`` the Tiler expects for per-node tiling. +3. argparse builders and the ``--shouldFail`` handshake runner that each + codegen entry point would otherwise have to duplicate verbatim in its + ``if __name__ == '__main__':`` block. """ +import argparse import os -from typing import List, Optional +import sys +from typing import Callable, List, Optional import numpy as np import onnx_graphsurgeon as gs @@ -117,3 +124,152 @@ def _infer_n_accum(inputs_path: str) -> int: def _mockScheduler(graph: gs.Graph) -> List[List[gs.Node]]: """Wrap every node in a singleton list for the Tiler pattern interface.""" return [[node] for node in graph.nodes] + + +# --------------------------------------------------------------------------- +# argparse builders +# +# The four training / optimizer codegen entry points all define the same +# arguments in their __main__ blocks. These helpers add the shared groups +# to an existing parser so each entry point only has to compose the groups +# it actually needs. +# --------------------------------------------------------------------------- + + +def add_cores_arg(parser: argparse.ArgumentParser) -> None: + parser.add_argument( + "--cores", + type = int, + default = 1, + help = "Number of cores on which the network is run. Default: 1.", + ) + + +def add_training_inference_args(parser: argparse.ArgumentParser) -> None: + """Arguments consumed by both training codegen entry points.""" + parser.add_argument( + "--num-data-inputs", + type = int, + dest = "num_data_inputs", + default = None, + help = "Number of DATA inputs that change per mini-batch. " + "Auto-detected if not specified.", + ) + parser.add_argument( + "--n-steps", + type = int, + dest = "n_steps", + default = None, + help = "N_TRAIN_STEPS: number of gradient-accumulation update steps. 
" + "Auto-detected if not specified.", + ) + parser.add_argument( + "--n-accum", + type = int, + dest = "n_accum", + default = None, + help = "N_ACCUM_STEPS: number of mini-batches per update step. " + "Auto-detected if not specified.", + ) + parser.add_argument( + "--learning-rate", + type = float, + dest = "learning_rate", + default = 0.001, + help = "SGD learning rate emitted as TRAINING_LEARNING_RATE in testinputs.h. Default: 0.001.", + ) + parser.add_argument( + "--tolerance", + type = float, + dest = "tolerance_abs", + default = 1e-3, + help = "Absolute loss tolerance emitted as TRAINING_TOLERANCE_ABS in testoutputs.h. Default: 1e-3.", + ) + + +def add_memory_level_args(parser: argparse.ArgumentParser) -> None: + """L1/L2 sizes and the default IO memory level.""" + parser.add_argument( + "--l1", + type = int, + dest = "l1", + default = 64_000, + help = "Set L1 size in bytes. Default: 64000.", + ) + parser.add_argument( + "--l2", + type = int, + dest = "l2", + default = 1_024_000, + help = "Set L2 size in bytes. Default: 1024000.", + ) + parser.add_argument( + "--defaultMemLevel", + type = str, + dest = "defaultMemLevel", + default = "L2", + help = "Default memory level for IO buffers. Default: L2.", + ) + + +def add_tiling_solver_args(parser: argparse.ArgumentParser) -> None: + """Arguments specific to the tiled codegen path.""" + parser.add_argument( + "--memAllocStrategy", + type = str, + dest = "memAllocStrategy", + default = "MiniMalloc", + help = "Memory allocation strategy. Default: MiniMalloc.", + ) + parser.add_argument( + "--searchStrategy", + type = str, + dest = "searchStrategy", + default = "random-max", + help = "CP solver search strategy. 
Default: random-max.", + ) + parser.add_argument( + "--plotMemAlloc", + action = "store_true", + help = "Save memory allocation plots in the deeployStates folder.", + ) + parser.add_argument( + "--profileTiling", + action = "store_true", + help = "Enable tiling profiling (inserts cycle counters around each tiled kernel).", + ) + + +def add_optimizer_training_dir_arg(parser: argparse.ArgumentParser) -> None: + parser.add_argument( + "--training-dir", + type = str, + default = None, + help = "Directory containing the training network.onnx. When provided, " + "weight and grad-acc buffers are shared with TrainingNetwork instead " + "of being allocated independently.", + ) + + +def add_should_fail_arg(parser: argparse.ArgumentParser) -> None: + parser.add_argument("--shouldFail", action = "store_true") + parser.set_defaults(shouldFail = False) + + +def run_with_shouldfail(fn: Callable[[argparse.Namespace], None], args: argparse.Namespace, + stage_label: str) -> None: + """Invoke ``fn(args)`` honouring the ``--shouldFail`` handshake. + + On success with ``--shouldFail``: raises ``RuntimeError("Expected to fail!")``. + On exception with ``--shouldFail``: prints a green success banner and exits 0. + Otherwise: exception propagates, success returns normally. 
+ """ + try: + fn(args) + except Exception: + if args.shouldFail: + print(f"\033[92m{stage_label} ended, failed as expected!\033[0m") + sys.exit(0) + raise + if args.shouldFail: + raise RuntimeError("Expected to fail!") From 4218ba15a138b9dd192194e7ff7a477ac1cfcf5b Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 18:15:19 +0000 Subject: [PATCH 20/28] training-platform core: lift execution.py training subprocess helpers to trainingUtils MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The training branch of generate_network and the training-flag block in configure_cmake had four pieces of boilerplate worth extracting: - building the `[python, script, -d gen_dir, -t path, -p platform]` prefix twice (training + optimizer step) - the same `log.debug + subprocess.run + returncode check + raise` tail twice - the opt-side gen_args passthrough filter (4-line for-loop) - the six-line -DTRAINING=ON / -DN_TRAIN_STEPS=... / -DN_ACCUM_STEPS=... / -DTRAINING_NUM_DATA_INPUTS=... block in configure_cmake Move all four into testUtils/trainingUtils.py as build_codegen_cmd / run_codegen_subprocess / filter_passthrough_args / add_training_cmake_flags. They take primitive parameters only — no DeeployTestConfig dependency — so trainingUtils does not gain a back-edge into testUtils.core. The inference branches of generate_network (`elif config.tiling:` / `else:`) are left untouched. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. 
--- DeeployTest/testUtils/core/execution.py | 55 ++++----------------- DeeployTest/testUtils/trainingUtils.py | 66 ++++++++++++++++++++++++- 2 files changed, 74 insertions(+), 47 deletions(-) diff --git a/DeeployTest/testUtils/core/execution.py b/DeeployTest/testUtils/core/execution.py index 739d23c2fe..d66255382d 100644 --- a/DeeployTest/testUtils/core/execution.py +++ b/DeeployTest/testUtils/core/execution.py @@ -11,6 +11,8 @@ from Deeploy.Logging import DEFAULT_LOGGER as log +from ..trainingUtils import add_training_cmake_flags, build_codegen_cmd, filter_passthrough_args, \ + run_codegen_subprocess from .config import DeeployTestConfig from .output_parser import TestResult, parse_test_output @@ -63,16 +65,7 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: stage = "Training" # --- Step 1: Training network (forward + backward + accumulation) --- - cmd = [ - sys.executable, - str(training_script), - "-d", - config.gen_dir, - "-t", - config.test_dir, - "-p", - config.platform, - ] + cmd = build_codegen_cmd(training_script, config.test_dir, config.gen_dir, config.platform) # Only pass values when explicitly set; otherwise let the script auto-detect. if config.n_train_steps is not None: cmd.append(f"--n-steps={config.n_train_steps}") @@ -85,11 +78,7 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: if config.debug: cmd.append("--debug") cmd.extend(config.gen_args) - - log.debug(f"[Execution] {stage} generation command: {' '.join(cmd)}") - result = subprocess.run(cmd, check = False) - if result.returncode != 0: - raise RuntimeError(f"{stage} network generation failed for {config.test_name}") + run_codegen_subprocess(cmd, f"{stage} network generation", config.test_name) # Read back auto-detected values written by the training generation script. 
meta_path = Path(config.gen_dir) / "training_meta.json" @@ -108,29 +97,14 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: elif not optimizer_script.exists(): log.warning(f"{optimizer_script.name} not found — skipping optimizer codegen") else: - opt_cmd = [ - sys.executable, - str(optimizer_script), - "-d", - config.gen_dir, - "-t", - opt_dir, - "-p", - config.platform, - f"--training-dir={config.test_dir}", - ] - for arg in config.gen_args: - if any(arg.startswith(p) for p in opt_passthrough): - opt_cmd.append(arg) + opt_cmd = build_codegen_cmd(optimizer_script, opt_dir, config.gen_dir, config.platform) + opt_cmd.append(f"--training-dir={config.test_dir}") + opt_cmd.extend(filter_passthrough_args(config.gen_args, opt_passthrough)) if not any(arg.startswith("--defaultMemLevel") for arg in opt_cmd): opt_cmd.append("--defaultMemLevel=L2") if config.verbose > 0: opt_cmd.append("-" + "v" * config.verbose) - - log.debug(f"[Execution] {stage} optimizer generation command: {' '.join(opt_cmd)}") - result = subprocess.run(opt_cmd, check = False) - if result.returncode != 0: - raise RuntimeError(f"{stage} optimizer network generation failed for {config.test_name}") + run_codegen_subprocess(opt_cmd, f"{stage} optimizer network generation", config.test_name) return # early return — training path complete @@ -212,17 +186,8 @@ def configure_cmake(config: DeeployTestConfig) -> None: else: cmd.append("-Dgvsoc_simulation=OFF") - if config.training: - cmd.append("-DTRAINING=ON") - # Only add cmake defines when the values are known (after codegen) - if config.n_train_steps is not None: - cmd.append(f"-DN_TRAIN_STEPS={config.n_train_steps}") - if config.n_accum_steps is not None: - cmd.append(f"-DN_ACCUM_STEPS={config.n_accum_steps}") - if config.training_num_data_inputs is not None: - cmd.append(f"-DTRAINING_NUM_DATA_INPUTS={config.training_num_data_inputs}") - else: - cmd.append("-DTRAINING=OFF") + add_training_cmake_flags(cmd, config.training, 
config.n_train_steps, config.n_accum_steps, + config.training_num_data_inputs) script_dir = Path(__file__).parent.parent.parent cmd.append(str(script_dir.parent)) diff --git a/DeeployTest/testUtils/trainingUtils.py b/DeeployTest/testUtils/trainingUtils.py index 08f3c1eea0..6b21834507 100644 --- a/DeeployTest/testUtils/trainingUtils.py +++ b/DeeployTest/testUtils/trainingUtils.py @@ -6,19 +6,28 @@ (generateTrainingNetwork.py, testMVPTraining.py, generateOptimizerNetwork.py, testMVPOptimizer.py). -Three kinds of helpers live here: +Four kinds of helpers live here: 1. inputs.npz / outputs.npz readers (``_load_reference_losses``, ``_infer_*``). 2. The singleton ``_mockScheduler`` the Tiler expects for per-node tiling. 3. argparse builders and the ``--shouldFail`` handshake runner that each codegen entry point would otherwise have to duplicate verbatim in its ``if __name__ == '__main__':`` block. +4. Subprocess helpers (``build_codegen_cmd``, ``run_codegen_subprocess``, + ``filter_passthrough_args``, ``add_training_cmake_flags``) used by the + core test execution module to dispatch the training / optimizer codegen + scripts and assemble the training-side cmake defines. + +The subprocess helpers take primitive parameters (no ``DeeployTestConfig`` +dependency) so this module stays free of a back-edge to ``testUtils.core``. """ import argparse import os +import subprocess import sys -from typing import Callable, List, Optional +from pathlib import Path +from typing import Callable, Iterable, List, Optional, Sequence, Tuple import numpy as np import onnx_graphsurgeon as gs @@ -273,3 +282,56 @@ def run_with_shouldfail(fn: Callable[[argparse.Namespace], None], args: argparse raise if args.shouldFail: raise RuntimeError("Expected to fail!") + + +# --------------------------------------------------------------------------- +# Subprocess helpers for the test execution harness. 
+# +# These are used by testUtils/core/execution.py to dispatch the training / +# optimizer codegen scripts. Kept here (rather than as local helpers in +# execution.py) so that every training-related helper lives in one module. +# They take primitive parameters only — no DeeployTestConfig — to avoid +# layering core → training back-edges. +# --------------------------------------------------------------------------- + + +def build_codegen_cmd(script: Path, test_path: str, gen_dir: str, platform: str) -> List[str]: + """Return the common ``[python, script, -d gen_dir, -t test_path, -p platform]`` prefix.""" + return [ + sys.executable, + str(script), + "-d", + gen_dir, + "-t", + test_path, + "-p", + platform, + ] + + +def run_codegen_subprocess(cmd: Sequence[str], stage_label: str, test_name: str) -> None: + """Run ``cmd`` as a subprocess, log it, and raise with a stage/test-aware message on failure.""" + log.debug(f"[Execution] {stage_label} command: {' '.join(cmd)}") + result = subprocess.run(list(cmd), check = False) + if result.returncode != 0: + raise RuntimeError(f"{stage_label} failed for {test_name}") + + +def filter_passthrough_args(gen_args: Iterable[str], passthrough: Tuple[str, ...]) -> List[str]: + """Return the subset of ``gen_args`` whose entries start with any prefix in ``passthrough``.""" + return [arg for arg in gen_args if any(arg.startswith(p) for p in passthrough)] + + +def add_training_cmake_flags(cmd: List[str], training: bool, n_train_steps: Optional[int], + n_accum_steps: Optional[int], training_num_data_inputs: Optional[int]) -> None: + """Append -DTRAINING=ON/OFF plus any known -DN_TRAIN_STEPS / -DN_ACCUM_STEPS / + -DTRAINING_NUM_DATA_INPUTS defines to ``cmd``. 
In-place.""" + cmd.append(f"-DTRAINING={'ON' if training else 'OFF'}") + if not training: + return + if n_train_steps is not None: + cmd.append(f"-DN_TRAIN_STEPS={n_train_steps}") + if n_accum_steps is not None: + cmd.append(f"-DN_ACCUM_STEPS={n_accum_steps}") + if training_num_data_inputs is not None: + cmd.append(f"-DTRAINING_NUM_DATA_INPUTS={training_num_data_inputs}") From f5255d3962eada774bcb9ad2ba1809314b39c3c1 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 18:22:10 +0000 Subject: [PATCH 21/28] training-platform core: lift _resolve_optimizer_dir to trainingUtils The last training-specific helper left in execution.py was _resolve_optimizer_dir, which derived the optimizer ONNX directory from the training test's name (replacing `_train` with `_optimizer`) with a config.optimizer_dir override. Move it to trainingUtils.py alongside the other test-harness helpers and change its signature to primitive parameters (test_dir: str, optimizer_dir: Optional[str]) so trainingUtils still has no back-edge to DeeployTestConfig. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. 
--- DeeployTest/testUtils/core/execution.py | 20 ++------------------ DeeployTest/testUtils/trainingUtils.py | 16 ++++++++++++++++ 2 files changed, 18 insertions(+), 18 deletions(-) diff --git a/DeeployTest/testUtils/core/execution.py b/DeeployTest/testUtils/core/execution.py index d66255382d..559ca84c18 100644 --- a/DeeployTest/testUtils/core/execution.py +++ b/DeeployTest/testUtils/core/execution.py @@ -12,27 +12,11 @@ from Deeploy.Logging import DEFAULT_LOGGER as log from ..trainingUtils import add_training_cmake_flags, build_codegen_cmd, filter_passthrough_args, \ - run_codegen_subprocess + resolve_optimizer_dir, run_codegen_subprocess from .config import DeeployTestConfig from .output_parser import TestResult, parse_test_output -def _resolve_optimizer_dir(config: DeeployTestConfig) -> str: - """Return the optimizer ONNX directory for this config. - - Falls back to /../_optimizer if not explicitly set, - where is derived by replacing the '_train' suffix of the test - directory name with '_optimizer' (e.g. simplemlp_train → simplemlp_optimizer, - sleepconvit_train → sleepconvit_optimizer). - """ - if config.optimizer_dir: - return config.optimizer_dir - test_parent = Path(config.test_dir).parent - test_dir_name = Path(config.test_dir).name - optimizer_name = test_dir_name.replace("_train", "_optimizer") - return str(test_parent / optimizer_name) - - def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: """ Generate network code from ONNX model. 
@@ -91,7 +75,7 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: log.info(f"[Execution] Training meta: {meta}") # --- Step 2: Optimizer network (SGD) --- - opt_dir = _resolve_optimizer_dir(config) + opt_dir = resolve_optimizer_dir(config.test_dir, config.optimizer_dir) if not Path(opt_dir).exists(): log.warning(f"Optimizer directory not found: {opt_dir} — skipping optimizer codegen") elif not optimizer_script.exists(): diff --git a/DeeployTest/testUtils/trainingUtils.py b/DeeployTest/testUtils/trainingUtils.py index 6b21834507..63ada099a9 100644 --- a/DeeployTest/testUtils/trainingUtils.py +++ b/DeeployTest/testUtils/trainingUtils.py @@ -295,6 +295,22 @@ def run_with_shouldfail(fn: Callable[[argparse.Namespace], None], args: argparse # --------------------------------------------------------------------------- +def resolve_optimizer_dir(test_dir: str, optimizer_dir: Optional[str]) -> str: + """Return the optimizer ONNX directory for a training test. + + If ``optimizer_dir`` is explicitly set, it is returned as-is. Otherwise + fall back to ``/../_optimizer``, where ```` is + derived by replacing the ``_train`` suffix of the test directory's base + name with ``_optimizer`` (e.g. ``simplemlp_train`` → ``simplemlp_optimizer``, + ``sleepconvit_train`` → ``sleepconvit_optimizer``). 
+ """ + if optimizer_dir: + return optimizer_dir + test_path = Path(test_dir) + optimizer_name = test_path.name.replace("_train", "_optimizer") + return str(test_path.parent / optimizer_name) + + def build_codegen_cmd(script: Path, test_path: str, gen_dir: str, platform: str) -> List[str]: """Return the common ``[python, script, -d gen_dir, -t test_path, -p platform]`` prefix.""" return [ From 5a839b10e0834e20942fd49cab989413b735bcd0 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 18:32:18 +0000 Subject: [PATCH 22/28] training-platform core: decouple execution.py from training pipeline MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Minimize the footprint this PR leaves in testUtils/core/execution.py. Previously the diff vs upstream devel was ~170 lines spread across generate_network, configure_cmake, run_simulation, run_complete_test and a variety of unrelated comment removals / inference-branch restructuring. Most of that was incidental and had nothing to do with training. Reset execution.py to the upstream-devel state and add only the three surgical hooks the training path actually needs: 1. One import line: from ..trainingUtils import add_training_cmake_flags, run_training_codegen 2. A 3-line early-return block at the top of generate_network: if config.training: run_training_codegen(config, script_dir) return 3. A one-line call inside configure_cmake to append the training cmake defines: add_training_cmake_flags(cmd, config.training, config.n_train_steps, config.n_accum_steps, config.training_num_data_inputs) Everything else — inference's generate_network body, run_simulation, run_complete_test, all upstream comments — is byte-identical to upstream. 
The full two-stage training codegen pipeline (training-script subprocess, training_meta.json readback, optimizer-script subprocess with passthrough filter and skip checks) now lives in a new run_training_codegen() in testUtils/trainingUtils.py, which is the sole module where training test-harness logic is concentrated. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. --- DeeployTest/testUtils/core/execution.py | 169 ++++++++---------------- DeeployTest/testUtils/trainingUtils.py | 79 +++++++++++ 2 files changed, 133 insertions(+), 115 deletions(-) diff --git a/DeeployTest/testUtils/core/execution.py b/DeeployTest/testUtils/core/execution.py index 559ca84c18..2fb1224c92 100644 --- a/DeeployTest/testUtils/core/execution.py +++ b/DeeployTest/testUtils/core/execution.py @@ -2,7 +2,6 @@ # # SPDX-License-Identifier: Apache-2.0 -import json import os import shutil import subprocess @@ -11,8 +10,7 @@ from Deeploy.Logging import DEFAULT_LOGGER as log -from ..trainingUtils import add_training_cmake_flags, build_codegen_cmd, filter_passthrough_args, \ - resolve_optimizer_dir, run_codegen_subprocess +from ..trainingUtils import add_training_cmake_flags, run_training_codegen from .config import DeeployTestConfig from .output_parser import TestResult, parse_test_output @@ -21,11 +19,6 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: """ Generate network code from ONNX model. - In training mode, generates both TrainingNetwork (fwd+bwd) and - OptimizerNetwork (SGD) into the same gen_dir. Auto-detected training - parameters (n_steps, n_accum, num_data_inputs) are written to - gen_dir/training_meta.json and read back into config after codegen. 
- Raises: RuntimeError: If network generation fails """ @@ -36,91 +29,34 @@ def generate_network(config: DeeployTestConfig, skip: bool = False) -> None: script_dir = Path(__file__).parent.parent.parent if config.training: - if config.tiling: - training_script = script_dir / "testMVPTraining.py" - optimizer_script = script_dir / "testMVPOptimizer.py" - opt_passthrough = ("--cores", "--l1", "--l2", "--defaultMemLevel", "--memAllocStrategy", - "--searchStrategy", "--plotMemAlloc", "--profileTiling") - stage = "Tiled training" - else: - training_script = script_dir / "generateTrainingNetwork.py" - optimizer_script = script_dir / "generateOptimizerNetwork.py" - opt_passthrough = ("--cores", "--l1", "--l2", "--defaultMemLevel") - stage = "Training" - - # --- Step 1: Training network (forward + backward + accumulation) --- - cmd = build_codegen_cmd(training_script, config.test_dir, config.gen_dir, config.platform) - # Only pass values when explicitly set; otherwise let the script auto-detect. - if config.n_train_steps is not None: - cmd.append(f"--n-steps={config.n_train_steps}") - if config.n_accum_steps is not None: - cmd.append(f"--n-accum={config.n_accum_steps}") - if config.training_num_data_inputs is not None: - cmd.append(f"--num-data-inputs={config.training_num_data_inputs}") - if config.verbose > 0: - cmd.append("-" + "v" * config.verbose) - if config.debug: - cmd.append("--debug") - cmd.extend(config.gen_args) - run_codegen_subprocess(cmd, f"{stage} network generation", config.test_name) - - # Read back auto-detected values written by the training generation script. 
- meta_path = Path(config.gen_dir) / "training_meta.json" - if meta_path.exists(): - with open(meta_path) as f: - meta = json.load(f) - config.n_train_steps = meta["n_train_steps"] - config.n_accum_steps = meta["n_accum_steps"] - config.training_num_data_inputs = meta["training_num_data_inputs"] - log.info(f"[Execution] Training meta: {meta}") - - # --- Step 2: Optimizer network (SGD) --- - opt_dir = resolve_optimizer_dir(config.test_dir, config.optimizer_dir) - if not Path(opt_dir).exists(): - log.warning(f"Optimizer directory not found: {opt_dir} — skipping optimizer codegen") - elif not optimizer_script.exists(): - log.warning(f"{optimizer_script.name} not found — skipping optimizer codegen") - else: - opt_cmd = build_codegen_cmd(optimizer_script, opt_dir, config.gen_dir, config.platform) - opt_cmd.append(f"--training-dir={config.test_dir}") - opt_cmd.extend(filter_passthrough_args(config.gen_args, opt_passthrough)) - if not any(arg.startswith("--defaultMemLevel") for arg in opt_cmd): - opt_cmd.append("--defaultMemLevel=L2") - if config.verbose > 0: - opt_cmd.append("-" + "v" * config.verbose) - run_codegen_subprocess(opt_cmd, f"{stage} optimizer network generation", config.test_name) - - return # early return — training path complete - - elif config.tiling: + run_training_codegen(config, script_dir) + return + + if config.tiling: generation_script = script_dir / "testMVP.py" - cmd = [ - sys.executable, - str(generation_script), - "-d", - config.gen_dir, - "-t", - config.test_dir, - "-p", - config.platform, - ] else: generation_script = script_dir / "generateNetwork.py" - cmd = [ - sys.executable, - str(generation_script), - "-d", - config.gen_dir, - "-t", - config.test_dir, - "-p", - config.platform, - ] + cmd = [ + "python", + str(generation_script), + "-d", + config.gen_dir, + "-t", + config.test_dir, + "-p", + config.platform, + ] + + # Add verbosity flags if config.verbose > 0: cmd.append("-" + "v" * config.verbose) + + # Add debug flag if config.debug: 
cmd.append("--debug") + + # Add additional generation arguments cmd.extend(config.gen_args) log.debug(f"[Execution] Generation command: {' '.join(cmd)}") @@ -141,6 +77,7 @@ def configure_cmake(config: DeeployTestConfig) -> None: if cmake_cmd == "cmake" and shutil.which("cmake") is None: raise RuntimeError("CMake not found. Please install CMake or set CMAKE environment variable") + # Build CMake command cmd = [ cmake_cmd, f"-DTOOLCHAIN={config.toolchain}", @@ -173,6 +110,7 @@ def configure_cmake(config: DeeployTestConfig) -> None: add_training_cmake_flags(cmd, config.training, config.n_train_steps, config.n_accum_steps, config.training_num_data_inputs) + # Last argument is the source directory script_dir = Path(__file__).parent.parent.parent cmd.append(str(script_dir.parent)) @@ -232,50 +170,44 @@ def run_simulation(config: DeeployTestConfig, skip: bool = False) -> TestResult: if config.simulator == 'none': raise RuntimeError("No simulator specified!") - env = os.environ.copy() - if config.verbose >= 3: - env["VERBOSE"] = "1" - if config.simulator == 'host': + # Run binary directly binary_path = Path(config.build_dir) / "bin" / config.test_name cmd = [str(binary_path)] - - elif config.simulator == 'gvsoc': + else: + # Run via CMake target cmake_cmd = os.environ.get("CMAKE", "cmake") - cmd = [cmake_cmd, "--build", config.build_dir, "--target", f"gvsoc_{config.test_name}"] + cmd = [ + cmake_cmd, + "--build", + config.build_dir, + "--target", + f"{config.simulator}_{config.test_name}", + ] - elif config.simulator == 'banshee': + env = os.environ.copy() + if config.verbose >= 3: + env["VERBOSE"] = "1" + + if config.simulator == 'banshee': if config.verbose == 1: env["BANSHEE_LOG"] = "warn" elif config.verbose == 2: env["BANSHEE_LOG"] = "info" elif config.verbose >= 3: env["BANSHEE_LOG"] = "debug" - cmake_cmd = os.environ.get("CMAKE", "cmake") - cmd = [cmake_cmd, "--build", config.build_dir, "--target", f"{config.simulator}_{config.test_name}"] - - else: - cmake_cmd = 
os.environ.get("CMAKE", "cmake") - cmd = [cmake_cmd, "--build", config.build_dir, "--target", f"{config.simulator}_{config.test_name}"] log.debug(f"[Execution] Simulation command: {' '.join(cmd)}") - # Stream output in real-time (line-buffered) and capture for parsing. - proc = subprocess.Popen(cmd, - stdout = subprocess.PIPE, - stderr = subprocess.STDOUT, - text = True, - env = env, - bufsize = 1) - stdout_lines = [] - for line in proc.stdout: - print(line, end = '', flush = True) - stdout_lines.append(line) - proc.stdout.close() - proc.wait() - stdout_output = ''.join(stdout_lines) - - test_result = parse_test_output(stdout_output, '') + result = subprocess.run(cmd, capture_output = True, text = True, env = env) + + if result.stdout: + print(result.stdout, end = '') + if result.stderr: + print(result.stderr, end = '', file = sys.stderr) + + # Parse output for error count and cycles + test_result = parse_test_output(result.stdout, result.stderr) if not test_result.success and test_result.error_count == -1: log.warning(f"Could not parse error count from output") @@ -289,9 +221,16 @@ def run_complete_test(config: DeeployTestConfig, skipgen: bool = False, skipsim: """ log.info(f"################## Testing {config.test_name} on {config.platform} Platform ##################") + # Step 1: Generate network generate_network(config, skip = skipgen) + + # Step 2: Configure CMake configure_cmake(config) + + # Step 3: Build binary build_binary(config) + + # Step 4: Run simulation result = run_simulation(config, skip = skipsim) return result diff --git a/DeeployTest/testUtils/trainingUtils.py b/DeeployTest/testUtils/trainingUtils.py index 63ada099a9..78f02e7218 100644 --- a/DeeployTest/testUtils/trainingUtils.py +++ b/DeeployTest/testUtils/trainingUtils.py @@ -23,6 +23,7 @@ """ import argparse +import json import os import subprocess import sys @@ -351,3 +352,81 @@ def add_training_cmake_flags(cmd: List[str], training: bool, n_train_steps: Opti 
cmd.append(f"-DN_ACCUM_STEPS={n_accum_steps}") if training_num_data_inputs is not None: cmd.append(f"-DTRAINING_NUM_DATA_INPUTS={training_num_data_inputs}") + + +def run_training_codegen(config, script_dir: Path) -> None: + """Drive the two-stage training codegen pipeline for one test. + + Runs the training network codegen script (generateTrainingNetwork.py or + testMVPTraining.py) followed by the matching optimizer codegen script + (generateOptimizerNetwork.py or testMVPOptimizer.py), and writes back + any auto-detected training parameters from ``training_meta.json`` into + ``config``. + + The single entry point keeps ``testUtils.core.execution.generate_network`` + oblivious to training internals — it only has to call this and return. + + Parameters + ---------- + config : DeeployTestConfig + The test configuration (must have ``training=True``). Training + fields (``n_train_steps``, ``n_accum_steps``, + ``training_num_data_inputs``) may be updated in-place from the + training_meta.json written by the codegen script. + script_dir : Path + ``DeeployTest/`` — the directory that hosts the four codegen scripts. 
+ """ + if config.tiling: + training_script = script_dir / "testMVPTraining.py" + optimizer_script = script_dir / "testMVPOptimizer.py" + opt_passthrough = ("--cores", "--l1", "--l2", "--defaultMemLevel", "--memAllocStrategy", + "--searchStrategy", "--plotMemAlloc", "--profileTiling") + stage = "Tiled training" + else: + training_script = script_dir / "generateTrainingNetwork.py" + optimizer_script = script_dir / "generateOptimizerNetwork.py" + opt_passthrough = ("--cores", "--l1", "--l2", "--defaultMemLevel") + stage = "Training" + + # --- Step 1: Training network (forward + backward + accumulation) --- + cmd = build_codegen_cmd(training_script, config.test_dir, config.gen_dir, config.platform) + if config.n_train_steps is not None: + cmd.append(f"--n-steps={config.n_train_steps}") + if config.n_accum_steps is not None: + cmd.append(f"--n-accum={config.n_accum_steps}") + if config.training_num_data_inputs is not None: + cmd.append(f"--num-data-inputs={config.training_num_data_inputs}") + if config.verbose > 0: + cmd.append("-" + "v" * config.verbose) + if config.debug: + cmd.append("--debug") + cmd.extend(config.gen_args) + run_codegen_subprocess(cmd, f"{stage} network generation", config.test_name) + + # Read back auto-detected values written by the training generation script. 
+ meta_path = Path(config.gen_dir) / "training_meta.json" + if meta_path.exists(): + with open(meta_path) as f: + meta = json.load(f) + config.n_train_steps = meta["n_train_steps"] + config.n_accum_steps = meta["n_accum_steps"] + config.training_num_data_inputs = meta["training_num_data_inputs"] + log.info(f"[Execution] Training meta: {meta}") + + # --- Step 2: Optimizer network (SGD) --- + opt_dir = resolve_optimizer_dir(config.test_dir, config.optimizer_dir) + if not Path(opt_dir).exists(): + log.warning(f"Optimizer directory not found: {opt_dir} — skipping optimizer codegen") + return + if not optimizer_script.exists(): + log.warning(f"{optimizer_script.name} not found — skipping optimizer codegen") + return + + opt_cmd = build_codegen_cmd(optimizer_script, opt_dir, config.gen_dir, config.platform) + opt_cmd.append(f"--training-dir={config.test_dir}") + opt_cmd.extend(filter_passthrough_args(config.gen_args, opt_passthrough)) + if not any(arg.startswith("--defaultMemLevel") for arg in opt_cmd): + opt_cmd.append("--defaultMemLevel=L2") + if config.verbose > 0: + opt_cmd.append("-" + "v" * config.verbose) + run_codegen_subprocess(opt_cmd, f"{stage} optimizer network generation", config.test_name) From 3d309b7d229a067bbb4d3615834a21220e36569a Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 18:39:44 +0000 Subject: [PATCH 23/28] training-platform core: extract training codegen helpers to codeGenerateTraining.py testUtils/codeGenerate.py had accumulated ~880 lines of training / optimizer codegen helpers piled on top of the upstream inference module, plus a dead "Initialize all output buffers to zero" block that had crept into generateTestNetworkImplementation (it declared local variables and looped over graph outputs without ever emitting anything into retStr). The net effect was that this PR's diff in codeGenerate.py was +883 lines and an interleaving that made the inference vs training boundary invisible to reviewers. 
Reset testUtils/codeGenerate.py to byte-identical upstream/devel and move every training-side helper to a new testUtils/codeGenerateTraining.py: - generateTrainingTestInputsHeader / generateTrainingTestOutputsHeader - generateTrainingNetworkHeader / generateTrainingNetworkImplementation - generateTrainingTestNetwork - build_shared_buffer_maps / _patch_shared_buffers / _patch_shared_arenas - _ensure_training_l1_capacity - generateOptimizerNetworkHeader / generateOptimizerNetworkImplementation - generateOptimizerTestNetwork codeGenerateTraining.py imports generateL3HexDump from codeGenerate (the only inference helper the training functions reuse), which is a clean forward dependency. No back-edge from codeGenerate to training. The four codegen entry points are updated to pull their training / optimizer helpers from testUtils.codeGenerateTraining instead of testUtils.codeGenerate. No other files change. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. 
--- DeeployTest/generateOptimizerNetwork.py | 2 +- DeeployTest/generateTrainingNetwork.py | 2 +- DeeployTest/testMVPOptimizer.py | 2 +- DeeployTest/testMVPTraining.py | 2 +- DeeployTest/testUtils/codeGenerate.py | 884 +---------------- DeeployTest/testUtils/codeGenerateTraining.py | 892 ++++++++++++++++++ 6 files changed, 897 insertions(+), 887 deletions(-) create mode 100644 DeeployTest/testUtils/codeGenerateTraining.py diff --git a/DeeployTest/generateOptimizerNetwork.py b/DeeployTest/generateOptimizerNetwork.py index 2b484e0110..a277a3a2a8 100644 --- a/DeeployTest/generateOptimizerNetwork.py +++ b/DeeployTest/generateOptimizerNetwork.py @@ -27,7 +27,7 @@ import onnx import onnx_graphsurgeon as gs -from testUtils.codeGenerate import build_shared_buffer_maps, generateOptimizerTestNetwork +from testUtils.codeGenerateTraining import build_shared_buffer_maps, generateOptimizerTestNetwork from testUtils.platformMapping import mapDeployer, mapPlatform, setupMemoryPlatform from testUtils.testRunner import TestGeneratorArgumentParser from testUtils.trainingUtils import add_cores_arg, add_memory_level_args, add_optimizer_training_dir_arg, \ diff --git a/DeeployTest/generateTrainingNetwork.py b/DeeployTest/generateTrainingNetwork.py index ab6ed5bff4..7ce3e5d35f 100644 --- a/DeeployTest/generateTrainingNetwork.py +++ b/DeeployTest/generateTrainingNetwork.py @@ -9,7 +9,7 @@ import numpy as np import onnx import onnx_graphsurgeon as gs -from testUtils.codeGenerate import generateTrainingTestNetwork +from testUtils.codeGenerateTraining import generateTrainingTestNetwork from testUtils.platformMapping import mapDeployer, mapPlatform from testUtils.testRunner import TestGeneratorArgumentParser from testUtils.trainingUtils import _GRAD_ACC, _infer_data_size, _infer_n_accum, _infer_num_data_inputs, \ diff --git a/DeeployTest/testMVPOptimizer.py b/DeeployTest/testMVPOptimizer.py index 02804df243..e90c20dd10 100644 --- a/DeeployTest/testMVPOptimizer.py +++ 
b/DeeployTest/testMVPOptimizer.py @@ -32,7 +32,7 @@ import onnx import onnx_graphsurgeon as gs -from testUtils.codeGenerate import build_shared_buffer_maps, generateOptimizerTestNetwork +from testUtils.codeGenerateTraining import build_shared_buffer_maps, generateOptimizerTestNetwork from testUtils.platformMapping import mapDeployer, mapPlatform, setupMemoryPlatform from testUtils.testRunner import TestGeneratorArgumentParser from testUtils.tilingUtils import TrainingSBTiler diff --git a/DeeployTest/testMVPTraining.py b/DeeployTest/testMVPTraining.py index b965e91a40..71f44f81d9 100644 --- a/DeeployTest/testMVPTraining.py +++ b/DeeployTest/testMVPTraining.py @@ -11,7 +11,7 @@ import numpy as np import onnx import onnx_graphsurgeon as gs -from testUtils.codeGenerate import generateTrainingTestNetwork +from testUtils.codeGenerateTraining import generateTrainingTestNetwork from testUtils.platformMapping import mapDeployer, mapPlatform, setupMemoryPlatform from testUtils.testRunner import TestGeneratorArgumentParser from testUtils.tilingUtils import TrainingSBTiler diff --git a/DeeployTest/testUtils/codeGenerate.py b/DeeployTest/testUtils/codeGenerate.py index aa18f155b2..39a44d9442 100644 --- a/DeeployTest/testUtils/codeGenerate.py +++ b/DeeployTest/testUtils/codeGenerate.py @@ -3,8 +3,7 @@ # SPDX-License-Identifier: Apache-2.0 import os -import re -from typing import Dict, List, Optional, Tuple +from typing import List, Tuple import numpy as np @@ -195,16 +194,6 @@ def generateTestNetworkImplementation(deployer: NetworkDeployer, verbosityCfg: C """ retStr += deployer.generateEngineInitializationCode() retStr += deployer.generateBufferAllocationCode() - - # Initialize all output buffers to zero - output_idx = 0 - while deployer.ctxt.is_buffer(f'output_{output_idx}'): - output_buffer = deployer.ctxt.lookup(f'output_{output_idx}') - output_size = np.prod(output_buffer.shape) if hasattr(output_buffer, - 'shape') else output_buffer._type.referencedType.typeWidth - 
typeName = output_buffer._type.referencedType.typeName - output_idx += 1 - retStr += """ } """ @@ -298,874 +287,3 @@ def generateTestNetwork(deployer: NetworkDeployer, test_inputs: List[np.ndarray] os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/Network.h') os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/testoutputs.h') os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/testinputs.h') - - -# --------------------------------------------------------------------------- -# Training code-generation helpers -# --------------------------------------------------------------------------- - - -def generateTrainingTestInputsHeader(deployer: NetworkDeployer, - all_mb_data: List[List[np.ndarray]], - n_steps: int, - n_accum: int, - grad_buf_start_idx: int = 0, - num_grad_inputs: int = 0, - learning_rate: float = 0.001, - init_weights: List[np.ndarray] = None, - data_size: int = None) -> str: - """Generate testinputs.h for training tests. - - Parameters - ---------- - deployer : NetworkDeployer - Prepared deployer (used to look up buffer types). - all_mb_data : list of list of np.ndarray - Per-mini-batch DATA arrays: ``all_mb_data[mb][buf]`` is the array for - mini-batch *mb* and DATA buffer *buf*. All mini-batches must have the - same number of buffers. - n_steps : int - N_TRAIN_STEPS macro value. - n_accum : int - N_ACCUM_STEPS macro value. - grad_buf_start_idx : int - Index of the first grad accumulation buffer in DeeployNetwork_inputs[]. - Used to emit TRAINING_GRAD_BUF_START_IDX. Pass 0 (and num_grad_inputs=0) - to suppress the define (e.g. when no grad bufs exist). - num_grad_inputs : int - Number of grad accumulation buffers. Used to emit TRAINING_NUM_GRAD_INPUTS. - - Returns - ------- - str - C header string. - """ - total_mb = n_steps * n_accum - num_data = len(all_mb_data[0]) if all_mb_data else 0 - # data_size: number of unique samples stored in C arrays. - # C harness cycles: testDataVector[mb % TRAINING_DATA_SIZE]. 
- # Defaults to total_mb (no cycling) for backward compatibility. - effective_data_size = data_size if (data_size is not None and data_size < total_mb) else total_mb - - retStr = "" - retStr += f"#define N_TRAIN_STEPS {n_steps}\n" - retStr += f"#define N_ACCUM_STEPS {n_accum}\n" - retStr += f"#define TRAINING_DATA_SIZE {effective_data_size}\n" - retStr += f"#define TRAINING_NUM_DATA_INPUTS {num_data}\n" - if num_grad_inputs > 0: - retStr += f"#define TRAINING_GRAD_BUF_START_IDX {grad_buf_start_idx}\n" - retStr += f"#define TRAINING_NUM_GRAD_INPUTS {num_grad_inputs}\n" - num_weight_inputs = grad_buf_start_idx - num_data - retStr += f"#define TRAINING_NUM_WEIGHT_INPUTS {num_weight_inputs}\n" - retStr += f"#define TRAINING_LEARNING_RATE {learning_rate:.10g}f\n" - retStr += "\n" - - # Emit per-mini-batch buffer arrays — only effective_data_size unique rows. - # all_mb_data must contain exactly effective_data_size rows. - for mb in range(effective_data_size): - mb_data = all_mb_data[mb] if mb < len(all_mb_data) else all_mb_data[-1] - row_entries = [] - for buf_idx, arr in enumerate(mb_data): - values = arr.reshape(-1) - - # Determine C type from deployer context (buffer "input_N"). 
- input_key = f"input_{buf_idx}" - if deployer.ctxt.is_buffer(input_key): - buffer = deployer.ctxt.lookup(input_key) - typeName = buffer._type.referencedType.typeName - typeWidth = buffer._type.referencedType.typeWidth - else: - # Fallback: infer from numpy dtype - if arr.dtype == np.float32 or arr.dtype == np.float64: - typeName = "float32_t" - typeWidth = 32 - elif arr.dtype == np.int64: - typeName = "int64_t" - typeWidth = 64 - elif arr.dtype == np.bool_ or arr.dtype == bool: - typeName = "uint8_t" - typeWidth = 8 - else: - typeName = "int32_t" - typeWidth = 32 - - buf_name = f"testData_mb{mb}_buf{buf_idx}" - row_entries.append(buf_name) - - # Format values - if typeName == 'float32_t': - list_str = ", ".join( - [f'{float(x)}f' if not (np.isinf(x) or np.isnan(x)) else str(x) for x in values.astype(np.float32)]) - else: - list_str = ", ".join([str(x) for x in values]) - - # 4-byte alignment padding - total_bytes = (values.size * typeWidth) // 8 - pad_bytes = (-total_bytes) % 4 - if pad_bytes: - paddingElements = (pad_bytes * 8 + typeWidth - 1) // typeWidth - list_str += ", " + ", ".join("0" for _ in range(paddingElements)) - - retStr += f"{typeName} {buf_name}[] = {{{list_str}}};\n" - - # Emit the row pointer array for this mini-batch - row_name = f"testDataRow{mb}" - retStr += f"void* {row_name}[] = {{{', '.join(f'(void*){e}' for e in row_entries)}}};\n" - retStr += "\n" - - # Emit the top-level vector of row pointers (only unique samples; C harness cycles via modulo). - retStr += f"void** testDataVector[{effective_data_size}] = {{{', '.join(f'testDataRow{mb}' for mb in range(effective_data_size))}}};\n" - - # Emit initial weight arrays (one per weight input, indices num_data..grad_buf_start_idx-1). 
- if init_weights: - retStr += "\n" - weight_entries = [] - num_data = len(all_mb_data[0]) if all_mb_data else 0 - for wi, arr in enumerate(init_weights): - buf_global_idx = num_data + wi - input_key = f"input_{buf_global_idx}" - if deployer.ctxt.is_buffer(input_key): - buffer = deployer.ctxt.lookup(input_key) - typeName = buffer._type.referencedType.typeName - typeWidth = buffer._type.referencedType.typeWidth - else: - typeName = "float32_t" - typeWidth = 32 - values = arr.reshape(-1).astype(np.float32) - # Tile values to match Deeploy's internal (possibly sequence-length-tiled) shape. - if deployer.ctxt.is_buffer(input_key): - expected_nelems = int(np.prod(deployer.ctxt.lookup(input_key).shape)) - if expected_nelems > len(values) and expected_nelems % len(values) == 0: - values = np.tile(values, expected_nelems // len(values)) - list_str = ", ".join([f'{float(x)}f' for x in values]) - buf_name = f"testInitWeight_{wi}" - weight_entries.append(buf_name) - retStr += f"{typeName} {buf_name}[] = {{{list_str}}};\n" - retStr += f"void* testInitWeights[{len(weight_entries)}] = {{{', '.join(f'(void*){e}' for e in weight_entries)}}};\n" - - return retStr - - -def generateTrainingTestOutputsHeader( - reference_losses: List = None, - tolerance_abs: float = 1e-3, -) -> str: - """Generate testoutputs.h for training tests — loss comparison only. - - Parameters - ---------- - reference_losses : list of float, optional - Reference loss value for each forward pass (one per mini-batch step). - If None, loss comparison is skipped. - tolerance_abs : float - Absolute comparison tolerance emitted as TRAINING_TOLERANCE_ABS. - - Returns - ------- - str - C header string. 
- """ - has_loss = reference_losses is not None and len(reference_losses) > 0 - - retStr = "// testoutputs.h — Phase 2: loss verification\n" - retStr += f"#define TRAINING_TOLERANCE_ABS {tolerance_abs:.10g}f\n\n" - - if has_loss: - n = len(reference_losses) - retStr += "// Expected loss for each forward pass (one per mini-batch)\n" - retStr += f"#define N_LOSS_REFS {n}\n" - vals = ", ".join(f"{float(v):.10g}f" for v in reference_losses) - retStr += f"float32_t testLossRef[{n}] = {{{vals}}};\n\n" - else: - retStr += "// No loss reference available — loss comparison skipped.\n" - retStr += "#define N_LOSS_REFS 0\n\n" - - return retStr - - -def generateTrainingNetworkHeader(deployer: NetworkDeployer) -> str: - """Generate TrainingNetwork.h — same as generateTestNetworkHeader but with - RunTrainingNetwork / InitTrainingNetwork function names and a distinct header guard. - - Parameters - ---------- - deployer : NetworkDeployer - Prepared deployer. - - Returns - ------- - str - C header string. - """ - retStr = "" - - retStr += """ -#ifndef __DEEPLOY_TRAINING_HEADER__ -#define __DEEPLOY_TRAINING_HEADER__ -#include -#include -#include -""" - retStr += deployer.generateIncludeString() - if isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)): - retStr += """ -void RunTrainingNetwork(); -void InitTrainingNetwork(); - -""" - else: - retStr += """ -void RunTrainingNetwork(uint32_t core_id, uint32_t numThreads); -void InitTrainingNetwork(uint32_t core_id, uint32_t numThread); - -""" - - retStr += deployer.generateIOBufferInitializationCode() - retStr += """ -#endif -""" - - return retStr - - -def generateTrainingNetworkImplementation(deployer: NetworkDeployer, verbosityCfg: CodeGenVerbosity) -> str: - """Generate TrainingNetwork.c — same as generateTestNetworkImplementation but with - RunTrainingNetwork / InitTrainingNetwork function names and including TrainingNetwork.h. 
- - Parameters - ---------- - deployer : NetworkDeployer - Prepared deployer. - verbosityCfg : CodeGenVerbosity - Verbosity configuration. - - Returns - ------- - str - C implementation string. - """ - retStr = "" - - retStr += """#include -#include -#include -""" - retStr += deployer.generateIncludeString() - retStr += """ - -#include "TrainingNetwork.h" - -""" - - retStr += deployer.generateBufferInitializationCode() - retStr += deployer.generateGlobalDefinitionCode() - - if isinstance(deployer.Platform, MemPoolPlatform): - retStr += deployer.generateInferenceInitializationCode() - retStr += """ -void RunTrainingNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){ -""" - elif isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)): - retStr += """ -void RunTrainingNetwork(){ -""" - retStr += deployer.generateInferenceInitializationCode() - else: - retStr += """ -void RunTrainingNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){ -""" - retStr += deployer.generateInferenceInitializationCode() - - retStr += deployer.generateFunction(verbosityCfg) - if isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)): - retStr += """ -} - -void InitTrainingNetwork(){ -""" - else: - retStr += """ -} - -void InitTrainingNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){ -""" - retStr += deployer.generateEngineInitializationCode() - retStr += deployer.generateBufferAllocationCode() - retStr += """ -} -""" - - return retStr - - -def generateTrainingTestNetwork(deployer: NetworkDeployer, - all_mb_data: List[List[np.ndarray]], - dumpdir: str, - verbosityCfg: CodeGenVerbosity, - n_steps: int = 1, - n_accum: int = 1, - num_data_inputs: int = 2, - grad_buf_start_idx: int = 0, - num_grad_inputs: int = 0, - learning_rate: float = 0.001, - reference_losses: List = None, - init_weights: 
List = None, - data_size: int = None, - tolerance_abs: float = 1e-3) -> None: - """Generate all training test files: testinputs.h, testoutputs.h, TrainingNetwork.h, TrainingNetwork.c. - - Parameters - ---------- - deployer : NetworkDeployer - Prepared deployer (ctxt.name must already be set to "DeeployTrainingNetwork"). - all_mb_data : list of list of np.ndarray - Per-mini-batch DATA arrays: ``all_mb_data[mb][buf]`` is the array for - mini-batch *mb* and DATA buffer *buf*. - dumpdir : str - Output directory for generated files. - verbosityCfg : CodeGenVerbosity - Verbosity configuration. - n_steps : int - N_TRAIN_STEPS value. - n_accum : int - N_ACCUM_STEPS value. - num_data_inputs : int - Number of data inputs (TRAINING_NUM_DATA_INPUTS). - grad_buf_start_idx : int - Index of the first grad accumulation buffer in DeeployNetwork_inputs[]. - num_grad_inputs : int - Number of grad accumulation buffers (TRAINING_NUM_GRAD_INPUTS). - """ - assert deployer.prepared, "An unprepared deployer was given" - - os.makedirs(dumpdir, exist_ok = True) - - # testinputs.h - testInputStr = generateTrainingTestInputsHeader(deployer, - all_mb_data, - n_steps, - n_accum, - grad_buf_start_idx, - num_grad_inputs, - learning_rate, - init_weights = init_weights, - data_size = data_size) - with open(f'{dumpdir}/testinputs.h', 'w') as f: - f.write(testInputStr) - - # testoutputs.h - testOutputStr = generateTrainingTestOutputsHeader( - reference_losses = reference_losses, - tolerance_abs = tolerance_abs, - ) - with open(f'{dumpdir}/testoutputs.h', 'w') as f: - f.write(testOutputStr) - - # TrainingNetwork.h - headerStr = generateTrainingNetworkHeader(deployer) - with open(f'{dumpdir}/TrainingNetwork.h', 'w') as f: - f.write(headerStr) - - # TrainingNetwork.c - implStr = generateTrainingNetworkImplementation(deployer, verbosityCfg) - with open(f'{dumpdir}/TrainingNetwork.c', 'w') as f: - f.write(implStr) - - clang_format = "{BasedOnStyle: llvm, IndentWidth: 2, ColumnLimit: 160}" - for fname in 
['TrainingNetwork.c', 'TrainingNetwork.h', 'testinputs.h', 'testoutputs.h']: - os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/{fname}') - - # Build initial-value list for every input_N buffer so that L3 hex files - # can be written. The list must cover all N where "input_N" exists in the - # deployer context. Layout (must match DeeployNetwork_inputs[] order): - # [0 .. num_data_inputs-1] → first mini-batch data - # [num_data_inputs .. grad_start-1] → initial weights - # [grad_start .. grad_start+num_grad-1] → zeros (grad acc bufs) - # [last] → lazy_reset_grad = 1 (uint8) - l3_initial_inputs: List[np.ndarray] = [] - # Count how many input_N buffers exist in the deployer context - n_total_inputs = sum( - 1 for name in deployer.ctxt.globalObjects if name.startswith("input_") and name[len("input_"):].isdigit()) - for i in range(n_total_inputs): - if all_mb_data and i < len(all_mb_data[0]): - # Data / label input - l3_initial_inputs.append(all_mb_data[0][i]) - elif (init_weights is not None and grad_buf_start_idx > 0 and num_data_inputs <= i < grad_buf_start_idx): - # Weight input - wi = i - num_data_inputs - l3_initial_inputs.append(init_weights[wi] if wi < - len(init_weights) else np.array([0.0], dtype = np.float32)) - elif (grad_buf_start_idx > 0 and num_grad_inputs > 0 - and grad_buf_start_idx <= i < grad_buf_start_idx + num_grad_inputs): - # Gradient accumulation buffer — zero-initialised - buf = deployer.ctxt.globalObjects.get(f"input_{i}") - shape = buf.shape if (buf is not None and hasattr(buf, 'shape')) else (1,) - l3_initial_inputs.append(np.zeros(shape, dtype = np.float32)) - else: - # lazy_reset_grad (last input) or any unknown slot — default 1 / uint8 - buf = deployer.ctxt.globalObjects.get(f"input_{i}") - shape = buf.shape if (buf is not None and hasattr(buf, 'shape')) else (1,) - l3_initial_inputs.append(np.ones(shape, dtype = np.uint8)) - - generateL3HexDump(deployer, os.path.join(dumpdir, 'hex'), l3_initial_inputs, []) - - -# 
--------------------------------------------------------------------------- -# Optimizer network code-generation helpers -# --------------------------------------------------------------------------- - -_OPT_PREFIX = "DeeployOptNetwork_" -_TRAIN_PREFIX = "DeeployNetwork_" - - -def build_shared_buffer_maps(train_onnx_path: str, opt_onnx_model) -> Tuple[Dict[int, int], Dict[int, int]]: - """Build optimizer→training index maps for tensors shared between the two graphs. - - The optimizer ONNX inputs are interleaved weight/grad pairs that have the - same tensor names as inputs in the training ONNX graph. We match by name - so that ``InitOptimizerNetwork`` can reference the already-allocated - ``DeeployNetwork_input_N`` pointers instead of allocating fresh buffers. - - Parameters - ---------- - train_onnx_path : str - Path to the training ``network.onnx``. - opt_onnx_model : - Already-loaded optimizer ONNX model (``onnx.ModelProto``). - - Returns - ------- - shared_input_map : Dict[int, int] - opt_input_idx → train_input_idx - shared_output_map : Dict[int, int] - opt_output_idx → train_input_idx (SGD outputs == updated weights, - same physical buffer as the weight input) - """ - import onnx as _onnx - train_model = _onnx.load_model(train_onnx_path) - train_names = [inp.name for inp in train_model.graph.input] - train_name_to_idx = {name: i for i, name in enumerate(train_names)} - - opt_input_names = [inp.name for inp in opt_onnx_model.graph.input] - opt_output_names = [out.name for out in opt_onnx_model.graph.output] - - shared_input_map: Dict[int, int] = {} - for opt_idx, name in enumerate(opt_input_names): - if name in train_name_to_idx: - shared_input_map[opt_idx] = train_name_to_idx[name] - - shared_output_map: Dict[int, int] = {} - for opt_idx, name in enumerate(opt_output_names): - # Try exact match first; then strip the '_updated' suffix that the SGD - # node appends to output tensor names (e.g. 'conv1_weight_updated' → 'conv1_weight'). 
- lookup_name = name - if lookup_name not in train_name_to_idx and lookup_name.endswith('_updated'): - lookup_name = lookup_name[:-len('_updated')] - if lookup_name in train_name_to_idx: - shared_output_map[opt_idx] = train_name_to_idx[lookup_name] - - return shared_input_map, shared_output_map - - -def _patch_shared_buffers(retStr: str, shared_input_map: Dict[int, int], shared_output_map: Dict[int, int]) -> str: - """Redirect optimizer I/O buffers to Training's already-allocated buffers. - - Must be called AFTER the _TRAIN_PREFIX → _OPT_PREFIX substitution so that - the generated symbols already carry the ``DeeployOptNetwork_`` prefix. - - Handles two allocation styles produced by Deeploy: - - *Non-tiled* (per-buffer malloc):: - - DeeployOptNetwork_input_N = (SomeType *)pi_l2_malloc(sizeof(...)); - - *Tiled* (single arena with offsets):: - - DeeployOptNetwork_input_N = (float32_t *)((char *)DeeployOptNetwork_MEMORYARENA_L2 + OFFSET); - - Both are replaced with direct pointers into the TrainingNetwork arenas:: - - DeeployOptNetwork_input_N = (float32_t *)DeeployNetwork_input_M; - - After all I/O pointers are redirected, if a ``MEMORYARENA_L2`` or - ``MEMORYARENA_L3`` allocation is no longer referenced anywhere in the Init - body (i.e., the shared buffers consumed the entire arena), the now-unused - malloc is also removed to reclaim the L2/L3 memory. - - Parameters - ---------- - retStr : str - The already-prefix-substituted C source string. - shared_input_map : Dict[int, int] - Optimizer input index → training input index. - shared_output_map : Dict[int, int] - Optimizer output index → training input index (in-place update). - - Returns - ------- - str - Patched C source string. 
- """ - if not shared_input_map and not shared_output_map: - return retStr - - # ------------------------------------------------------------------ - # Pattern 1 (non-tiled): individual pi_*_malloc per buffer - # ------------------------------------------------------------------ - _malloc_pat = re.compile( - r'(DeeployOptNetwork_(input|output)_(\d+))\s*=\s*\([^)]+\s*\*\s*\)\s*pi_\w+_malloc\([^;]+\);') - - # ------------------------------------------------------------------ - # Pattern 2 (tiled): arena-offset assignment - # DeeployOptNetwork_input_N = (Type *)((char *)DeeployOptNetwork_MEMORYARENA_Lx + OFFSET); - # ------------------------------------------------------------------ - _arena_pat = re.compile(r'(DeeployOptNetwork_(input|output)_(\d+))\s*=\s*\([^)]+\s*\*\s*\)' - r'\s*\(\s*\(char\s*\*\)\s*DeeployOptNetwork_MEMORYARENA_L\w+\s*\+\s*\d+\s*\)\s*;') - - def _make_replacement(symbol: str, kind: str, idx: int) -> Optional[str]: - if kind == "input" and idx in shared_input_map: - train_idx = shared_input_map[idx] - return f'{symbol} = (float32_t *){_TRAIN_PREFIX}input_{train_idx}; /* shared with TrainingNetwork */' - if kind == "output" and idx in shared_output_map: - train_idx = shared_output_map[idx] - return f'{symbol} = (float32_t *){_TRAIN_PREFIX}input_{train_idx}; /* in-place, shared with TrainingNetwork */' - return None - - def _replace(m: re.Match) -> str: - replacement = _make_replacement(m.group(1), m.group(2), int(m.group(3))) - return replacement if replacement is not None else m.group(0) - - retStr = _malloc_pat.sub(_replace, retStr) - retStr = _arena_pat.sub(_replace, retStr) - - # ------------------------------------------------------------------ - # Arena elimination: if a MEMORYARENA_Lx is no longer used for any - # pointer arithmetic after the redirects, its malloc is dead and can - # be removed to reclaim L2/L3. The global declaration is left in - # place (harmless; the variable will be NULL at runtime). 
- # ------------------------------------------------------------------ - for level in ('L2', 'L3'): - arena_sym = f'DeeployOptNetwork_MEMORYARENA_{level}' - # Pattern for the malloc assignment line itself - malloc_line_pat = re.compile(rf'[^\n]*{re.escape(arena_sym)}\s*=\s*\([^)]+\)\s*pi_\w+_malloc\([^;]+\);\s*\n') - # Pattern for any use of the arena in pointer arithmetic: - # (char *)ARENA + OFFSET or (void *)ARENA etc. - arena_use_pat = re.compile(rf'\(\s*(?:char|void|int8_t)\s*\*\s*\)\s*{re.escape(arena_sym)}') - if not arena_use_pat.search(retStr): - # No remaining pointer arithmetic — the malloc is dead - retStr = malloc_line_pat.sub('', retStr) - - # ------------------------------------------------------------------ - # Inject TrainingNetwork header so DeeployNetwork_input_N symbols resolve - # ------------------------------------------------------------------ - retStr = retStr.replace( - '#include "OptimizerNetwork.h"', - '#include "OptimizerNetwork.h"\n#include "TrainingNetwork.h"', - ) - return retStr - - -def _patch_shared_arenas(retStr: str, train_c_source: str) -> str: - """Redirect optimizer L1/L2 arena allocations to reuse training network's arenas. - - TrainingNetwork and OptimizerNetwork run strictly sequentially: RunTrainingNetwork() - completes before RunOptimizerNetwork() starts. Their L1/L2 tile-working arenas - therefore never overlap in time and can share the same physical memory. - - Only the L1 arena is shared: it is pure tile-compute scratch whose content is - dead after each kernel returns. The L2 arena is NOT shared because it may hold - persistent tensor data (weights, activations) at fixed offsets in non-tiled mode; - sharing it would let the optimizer's L2 staging buffers overwrite that data. - - Must be called AFTER the _TRAIN_PREFIX → _OPT_PREFIX substitution. - - Parameters - ---------- - retStr : str - The already-prefix-substituted C source string for the optimizer. 
- train_c_source : str - The full text of TrainingNetwork.c (used to confirm the arena symbols exist). - - Returns - ------- - str - Patched C source string. - """ - for level in ('L1',): - train_sym = f'DeeployNetwork_MEMORYARENA_{level}' - # Only alias if the training network actually has this arena - if train_sym not in train_c_source: - continue - - opt_sym = f'DeeployOptNetwork_MEMORYARENA_{level}' - opt_malloc_pat = re.compile(rf'({re.escape(opt_sym)})\s*=\s*\([^)]+\)\s*\w+\(sizeof\([^)]+\)\s*\*\s*\d+\)\s*;') - if not opt_malloc_pat.search(retStr): - continue - - replacement = f'{opt_sym} = (int8_t *){train_sym}; /* shared with TrainingNetwork */' - retStr = opt_malloc_pat.sub(replacement, retStr) - - # Inject TrainingNetwork header if not already present - # (_patch_shared_buffers may have already added it; guard against duplicates) - if '#include "TrainingNetwork.h"' not in retStr: - retStr = retStr.replace( - '#include "OptimizerNetwork.h"', - '#include "OptimizerNetwork.h"\n#include "TrainingNetwork.h"', - ) - - return retStr - - -def _ensure_training_l1_capacity(dumpdir: str, train_c_source: str, opt_alloc_code: str) -> str: - """Enlarge TrainingNetwork's L1 arena to cover the optimizer's L1 needs. - - Since the two networks share the same L1 arena, TrainingNetwork must allocate - at least max(train_L1, opt_L1) bytes. When the optimizer needs more L1 than - training (rare but possible, e.g. autoencoder), this function patches - TrainingNetwork.c and TrainingNetwork.h in-place and returns the updated - TrainingNetwork.c source string. - - Parameters - ---------- - dumpdir : str - Directory containing TrainingNetwork.c and TrainingNetwork.h. - train_c_source : str - Current content of TrainingNetwork.c. - opt_alloc_code : str - Optimizer buffer-allocation code after _TRAIN_PREFIX → _OPT_PREFIX - substitution (used to extract the optimizer's L1 size). - - Returns - ------- - str - (Possibly updated) TrainingNetwork.c source string. 
- """ - m_opt = re.search( - r'DeeployOptNetwork_MEMORYARENA_L1\s*=\s*\([^)]+\)\s*pmsis_l1_malloc\(sizeof\([^)]+\)\s*\*\s*(\d+)\)', - opt_alloc_code, - ) - if not m_opt: - return train_c_source - - opt_l1 = int(m_opt.group(1)) - - m_train = re.search( - r'(DeeployNetwork_MEMORYARENA_L1\s*=\s*\([^)]+\)\s*pmsis_l1_malloc\(sizeof\([^)]+\)\s*\*\s*)(\d+)(\))', - train_c_source, - ) - if not m_train: - return train_c_source - - train_l1 = int(m_train.group(2)) - if opt_l1 <= train_l1: - return train_c_source # Already large enough - - new_l1 = opt_l1 - - # Patch TrainingNetwork.c malloc size - train_c_new = train_c_source.replace( - m_train.group(0), - f'{m_train.group(1)}{new_l1}{m_train.group(3)}', - 1, - ) - train_c_path = os.path.join(dumpdir, 'TrainingNetwork.c') - with open(train_c_path, 'w') as f: - f.write(train_c_new) - - # Patch TrainingNetwork.h _len constant - train_h_path = os.path.join(dumpdir, 'TrainingNetwork.h') - if os.path.exists(train_h_path): - train_h = open(train_h_path).read() - train_h_new = re.sub( - r'(DeeployNetwork_MEMORYARENA_L1_len\s*=\s*)\d+', - rf'\g<1>{new_l1}', - train_h, - ) - with open(train_h_path, 'w') as f: - f.write(train_h_new) - - return train_c_new - - -def generateOptimizerNetworkHeader(deployer: NetworkDeployer) -> str: - """Generate OptimizerNetwork.h. - - Reuses the Deeploy deployer's output and applies two transformations: - 1. Replace the buffer prefix ``DeeployNetwork_`` → ``DeeployOptNetwork_`` - 2. Inject ``RunOptimizerNetwork`` / ``InitOptimizerNetwork`` function declarations. - - Parameters - ---------- - deployer : NetworkDeployer - Prepared deployer for the optimizer ONNX graph. - - Returns - ------- - str - C header string. 
- """ - retStr = "" - retStr += """ -#ifndef __DEEPLOY_OPTIMIZER_HEADER__ -#define __DEEPLOY_OPTIMIZER_HEADER__ -#include -#include -#include -""" - retStr += deployer.generateIncludeString() - if isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)): - retStr += """ -void RunOptimizerNetwork(); -void InitOptimizerNetwork(); - -""" - else: - retStr += """ -void RunOptimizerNetwork(uint32_t core_id, uint32_t numThreads); -void InitOptimizerNetwork(uint32_t core_id, uint32_t numThreads); - -""" - retStr += deployer.generateIOBufferInitializationCode() - retStr += """ -#endif -""" - # Prefix substitution: all Deeploy-generated DeeployNetwork_ → DeeployOptNetwork_ - retStr = retStr.replace(_TRAIN_PREFIX, _OPT_PREFIX) - return retStr - - -def generateOptimizerNetworkImplementation(deployer: NetworkDeployer, - verbosityCfg: CodeGenVerbosity, - shared_input_map: Optional[Dict[int, int]] = None, - shared_output_map: Optional[Dict[int, int]] = None, - train_c_source: Optional[str] = None) -> str: - """Generate OptimizerNetwork.c. - - Parameters - ---------- - deployer : NetworkDeployer - Prepared deployer for the optimizer ONNX graph. - verbosityCfg : CodeGenVerbosity - Verbosity configuration. - shared_input_map : Dict[int, int], optional - Optimizer input index → training input index for shared weight/grad buffers. - When provided, those malloc calls are replaced with references to the - already-allocated TrainingNetwork buffers. - shared_output_map : Dict[int, int], optional - Optimizer output index → training input index for in-place shared outputs. - train_c_source : str, optional - Full text of TrainingNetwork.c. When provided, the optimizer's L1/L2 arena - malloc calls are replaced with direct pointers to the training arenas, - saving one L1 and one L2 allocation (safe because the two networks run - strictly sequentially). - - Returns - ------- - str - C implementation string. 
- """ - retStr = "" - retStr += """#include -#include -#include -""" - retStr += deployer.generateIncludeString() - retStr += """ -#include "OptimizerNetwork.h" - -""" - retStr += deployer.generateBufferInitializationCode() - retStr += deployer.generateGlobalDefinitionCode() - - if isinstance(deployer.Platform, MemPoolPlatform): - retStr += deployer.generateInferenceInitializationCode() - retStr += """ -void RunOptimizerNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){ -""" - elif isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)): - retStr += """ -void RunOptimizerNetwork(){ -""" - retStr += deployer.generateInferenceInitializationCode() - else: - retStr += """ -void RunOptimizerNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){ -""" - retStr += deployer.generateInferenceInitializationCode() - - retStr += deployer.generateFunction(verbosityCfg) - - if isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)): - retStr += """ -} - -void InitOptimizerNetwork(){ -""" - else: - retStr += """ -} - -void InitOptimizerNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){ -""" - retStr += deployer.generateEngineInitializationCode() - retStr += deployer.generateBufferAllocationCode() - retStr += """ -} -""" - # Prefix substitution - retStr = retStr.replace(_TRAIN_PREFIX, _OPT_PREFIX) - # Replace malloc calls for shared weight/grad buffers with Training pointers - retStr = _patch_shared_buffers(retStr, shared_input_map or {}, shared_output_map or {}) - # Redirect optimizer L1/L2 arena mallocs to reuse training arenas - if train_c_source: - retStr = _patch_shared_arenas(retStr, train_c_source) - return retStr - - -def generateOptimizerTestNetwork(deployer: NetworkDeployer, - dumpdir: str, - verbosityCfg: CodeGenVerbosity, - shared_input_map: Optional[Dict[int, int]] = 
None, - shared_output_map: Optional[Dict[int, int]] = None) -> None: - """Generate OptimizerNetwork.h and OptimizerNetwork.c. - - Parameters - ---------- - deployer : NetworkDeployer - Prepared deployer for the optimizer ONNX graph. - dumpdir : str - Output directory for generated files. - verbosityCfg : CodeGenVerbosity - Verbosity configuration. - shared_input_map : Dict[int, int], optional - Optimizer input index → training input index for shared weight/grad buffers. - shared_output_map : Dict[int, int], optional - Optimizer output index → training input index for in-place shared outputs. - """ - assert deployer.prepared, "An unprepared deployer was given" - - os.makedirs(dumpdir, exist_ok = True) - - train_c_path = os.path.join(dumpdir, 'TrainingNetwork.c') - train_c_source: Optional[str] = None - if os.path.exists(train_c_path): - with open(train_c_path, 'r') as f: - train_c_source = f.read() - - # Enlarge training L1 arena if optimizer needs more (so unconditional L1 sharing is safe) - if train_c_source: - opt_alloc_preview = deployer.generateBufferAllocationCode().replace(_TRAIN_PREFIX, _OPT_PREFIX) - train_c_source = _ensure_training_l1_capacity(dumpdir, train_c_source, opt_alloc_preview) - - headerStr = generateOptimizerNetworkHeader(deployer) - with open(f'{dumpdir}/OptimizerNetwork.h', 'w') as f: - f.write(headerStr) - - implStr = generateOptimizerNetworkImplementation(deployer, verbosityCfg, shared_input_map, shared_output_map, - train_c_source) - with open(f'{dumpdir}/OptimizerNetwork.c', 'w') as f: - f.write(implStr) - - clang_format = "{BasedOnStyle: llvm, IndentWidth: 2, ColumnLimit: 160}" - for fname in ['OptimizerNetwork.c', 'OptimizerNetwork.h']: - os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/{fname}') diff --git a/DeeployTest/testUtils/codeGenerateTraining.py b/DeeployTest/testUtils/codeGenerateTraining.py new file mode 100644 index 0000000000..4ef9a9fd8a --- /dev/null +++ b/DeeployTest/testUtils/codeGenerateTraining.py @@ -0,0 
+1,892 @@ +# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna +# +# SPDX-License-Identifier: Apache-2.0 +""" +Code-generation helpers for the training / optimizer test harness. + +These functions emit the C source, header and data files for training tests +that drive both a TrainingNetwork (forward + backward + gradient accumulation) +and an OptimizerNetwork (SGD weight update) on the target platform. + +Kept as a separate module from testUtils.codeGenerate (which handles plain +inference codegen) so this PR's training-side additions touch the inference +helpers only through imports, not by interleaving with inference definitions. +""" + +import os +import re +from typing import Dict, List, Optional, Tuple + +import numpy as np + +from Deeploy.DeeployTypes import CodeGenVerbosity, NetworkDeployer +from Deeploy.Targets.MemPool.Platform import MemPoolPlatform +from Deeploy.Targets.PULPOpen.Platform import MemoryPULPPlatform, MemoryPULPPlatformWrapper, PULPPlatform + +from .codeGenerate import generateL3HexDump + + +def generateTrainingTestInputsHeader(deployer: NetworkDeployer, + all_mb_data: List[List[np.ndarray]], + n_steps: int, + n_accum: int, + grad_buf_start_idx: int = 0, + num_grad_inputs: int = 0, + learning_rate: float = 0.001, + init_weights: List[np.ndarray] = None, + data_size: int = None) -> str: + """Generate testinputs.h for training tests. + + Parameters + ---------- + deployer : NetworkDeployer + Prepared deployer (used to look up buffer types). + all_mb_data : list of list of np.ndarray + Per-mini-batch DATA arrays: ``all_mb_data[mb][buf]`` is the array for + mini-batch *mb* and DATA buffer *buf*. All mini-batches must have the + same number of buffers. + n_steps : int + N_TRAIN_STEPS macro value. + n_accum : int + N_ACCUM_STEPS macro value. + grad_buf_start_idx : int + Index of the first grad accumulation buffer in DeeployNetwork_inputs[]. + Used to emit TRAINING_GRAD_BUF_START_IDX. 
Pass 0 (and num_grad_inputs=0) + to suppress the define (e.g. when no grad bufs exist). + num_grad_inputs : int + Number of grad accumulation buffers. Used to emit TRAINING_NUM_GRAD_INPUTS. + + Returns + ------- + str + C header string. + """ + total_mb = n_steps * n_accum + num_data = len(all_mb_data[0]) if all_mb_data else 0 + # data_size: number of unique samples stored in C arrays. + # C harness cycles: testDataVector[mb % TRAINING_DATA_SIZE]. + # Defaults to total_mb (no cycling) for backward compatibility. + effective_data_size = data_size if (data_size is not None and data_size < total_mb) else total_mb + + retStr = "" + retStr += f"#define N_TRAIN_STEPS {n_steps}\n" + retStr += f"#define N_ACCUM_STEPS {n_accum}\n" + retStr += f"#define TRAINING_DATA_SIZE {effective_data_size}\n" + retStr += f"#define TRAINING_NUM_DATA_INPUTS {num_data}\n" + if num_grad_inputs > 0: + retStr += f"#define TRAINING_GRAD_BUF_START_IDX {grad_buf_start_idx}\n" + retStr += f"#define TRAINING_NUM_GRAD_INPUTS {num_grad_inputs}\n" + num_weight_inputs = grad_buf_start_idx - num_data + retStr += f"#define TRAINING_NUM_WEIGHT_INPUTS {num_weight_inputs}\n" + retStr += f"#define TRAINING_LEARNING_RATE {learning_rate:.10g}f\n" + retStr += "\n" + + # Emit per-mini-batch buffer arrays — only effective_data_size unique rows. + # all_mb_data must contain exactly effective_data_size rows. + for mb in range(effective_data_size): + mb_data = all_mb_data[mb] if mb < len(all_mb_data) else all_mb_data[-1] + row_entries = [] + for buf_idx, arr in enumerate(mb_data): + values = arr.reshape(-1) + + # Determine C type from deployer context (buffer "input_N"). 
+ input_key = f"input_{buf_idx}" + if deployer.ctxt.is_buffer(input_key): + buffer = deployer.ctxt.lookup(input_key) + typeName = buffer._type.referencedType.typeName + typeWidth = buffer._type.referencedType.typeWidth + else: + # Fallback: infer from numpy dtype + if arr.dtype == np.float32 or arr.dtype == np.float64: + typeName = "float32_t" + typeWidth = 32 + elif arr.dtype == np.int64: + typeName = "int64_t" + typeWidth = 64 + elif arr.dtype == np.bool_ or arr.dtype == bool: + typeName = "uint8_t" + typeWidth = 8 + else: + typeName = "int32_t" + typeWidth = 32 + + buf_name = f"testData_mb{mb}_buf{buf_idx}" + row_entries.append(buf_name) + + # Format values + if typeName == 'float32_t': + list_str = ", ".join( + [f'{float(x)}f' if not (np.isinf(x) or np.isnan(x)) else str(x) for x in values.astype(np.float32)]) + else: + list_str = ", ".join([str(x) for x in values]) + + # 4-byte alignment padding + total_bytes = (values.size * typeWidth) // 8 + pad_bytes = (-total_bytes) % 4 + if pad_bytes: + paddingElements = (pad_bytes * 8 + typeWidth - 1) // typeWidth + list_str += ", " + ", ".join("0" for _ in range(paddingElements)) + + retStr += f"{typeName} {buf_name}[] = {{{list_str}}};\n" + + # Emit the row pointer array for this mini-batch + row_name = f"testDataRow{mb}" + retStr += f"void* {row_name}[] = {{{', '.join(f'(void*){e}' for e in row_entries)}}};\n" + retStr += "\n" + + # Emit the top-level vector of row pointers (only unique samples; C harness cycles via modulo). + retStr += f"void** testDataVector[{effective_data_size}] = {{{', '.join(f'testDataRow{mb}' for mb in range(effective_data_size))}}};\n" + + # Emit initial weight arrays (one per weight input, indices num_data..grad_buf_start_idx-1). 
+ if init_weights: + retStr += "\n" + weight_entries = [] + num_data = len(all_mb_data[0]) if all_mb_data else 0 + for wi, arr in enumerate(init_weights): + buf_global_idx = num_data + wi + input_key = f"input_{buf_global_idx}" + if deployer.ctxt.is_buffer(input_key): + buffer = deployer.ctxt.lookup(input_key) + typeName = buffer._type.referencedType.typeName + typeWidth = buffer._type.referencedType.typeWidth + else: + typeName = "float32_t" + typeWidth = 32 + values = arr.reshape(-1).astype(np.float32) + # Tile values to match Deeploy's internal (possibly sequence-length-tiled) shape. + if deployer.ctxt.is_buffer(input_key): + expected_nelems = int(np.prod(deployer.ctxt.lookup(input_key).shape)) + if expected_nelems > len(values) and expected_nelems % len(values) == 0: + values = np.tile(values, expected_nelems // len(values)) + list_str = ", ".join([f'{float(x)}f' for x in values]) + buf_name = f"testInitWeight_{wi}" + weight_entries.append(buf_name) + retStr += f"{typeName} {buf_name}[] = {{{list_str}}};\n" + retStr += f"void* testInitWeights[{len(weight_entries)}] = {{{', '.join(f'(void*){e}' for e in weight_entries)}}};\n" + + return retStr + + +def generateTrainingTestOutputsHeader( + reference_losses: List = None, + tolerance_abs: float = 1e-3, +) -> str: + """Generate testoutputs.h for training tests — loss comparison only. + + Parameters + ---------- + reference_losses : list of float, optional + Reference loss value for each forward pass (one per mini-batch step). + If None, loss comparison is skipped. + tolerance_abs : float + Absolute comparison tolerance emitted as TRAINING_TOLERANCE_ABS. + + Returns + ------- + str + C header string. 
+ """ + has_loss = reference_losses is not None and len(reference_losses) > 0 + + retStr = "// testoutputs.h — Phase 2: loss verification\n" + retStr += f"#define TRAINING_TOLERANCE_ABS {tolerance_abs:.10g}f\n\n" + + if has_loss: + n = len(reference_losses) + retStr += "// Expected loss for each forward pass (one per mini-batch)\n" + retStr += f"#define N_LOSS_REFS {n}\n" + vals = ", ".join(f"{float(v):.10g}f" for v in reference_losses) + retStr += f"float32_t testLossRef[{n}] = {{{vals}}};\n\n" + else: + retStr += "// No loss reference available — loss comparison skipped.\n" + retStr += "#define N_LOSS_REFS 0\n\n" + + return retStr + + +def generateTrainingNetworkHeader(deployer: NetworkDeployer) -> str: + """Generate TrainingNetwork.h — same as generateTestNetworkHeader but with + RunTrainingNetwork / InitTrainingNetwork function names and a distinct header guard. + + Parameters + ---------- + deployer : NetworkDeployer + Prepared deployer. + + Returns + ------- + str + C header string. + """ + retStr = "" + + retStr += """ +#ifndef __DEEPLOY_TRAINING_HEADER__ +#define __DEEPLOY_TRAINING_HEADER__ +#include +#include +#include +""" + retStr += deployer.generateIncludeString() + if isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)): + retStr += """ +void RunTrainingNetwork(); +void InitTrainingNetwork(); + +""" + else: + retStr += """ +void RunTrainingNetwork(uint32_t core_id, uint32_t numThreads); +void InitTrainingNetwork(uint32_t core_id, uint32_t numThread); + +""" + + retStr += deployer.generateIOBufferInitializationCode() + retStr += """ +#endif +""" + + return retStr + + +def generateTrainingNetworkImplementation(deployer: NetworkDeployer, verbosityCfg: CodeGenVerbosity) -> str: + """Generate TrainingNetwork.c — same as generateTestNetworkImplementation but with + RunTrainingNetwork / InitTrainingNetwork function names and including TrainingNetwork.h. 
+
+    Parameters
+    ----------
+    deployer : NetworkDeployer
+        Prepared deployer.
+    verbosityCfg : CodeGenVerbosity
+        Verbosity configuration.
+
+    Returns
+    -------
+    str
+        C implementation string.
+    """
+    retStr = ""
+
+    retStr += """#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+"""
+    retStr += deployer.generateIncludeString()
+    retStr += """
+
+#include "TrainingNetwork.h"
+
+"""
+
+    retStr += deployer.generateBufferInitializationCode()
+    retStr += deployer.generateGlobalDefinitionCode()
+
+    if isinstance(deployer.Platform, MemPoolPlatform):
+        retStr += deployer.generateInferenceInitializationCode()
+        retStr += """
+void RunTrainingNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){
+"""
+    elif isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)):
+        retStr += """
+void RunTrainingNetwork(){
+"""
+        retStr += deployer.generateInferenceInitializationCode()
+    else:
+        retStr += """
+void RunTrainingNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){
+"""
+        retStr += deployer.generateInferenceInitializationCode()
+
+    retStr += deployer.generateFunction(verbosityCfg)
+    if isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)):
+        retStr += """
+}
+
+void InitTrainingNetwork(){
+"""
+    else:
+        retStr += """
+}
+
+void InitTrainingNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){
+"""
+    retStr += deployer.generateEngineInitializationCode()
+    retStr += deployer.generateBufferAllocationCode()
+    retStr += """
+}
+"""
+
+    return retStr
+
+
+def generateTrainingTestNetwork(deployer: NetworkDeployer,
+                                all_mb_data: List[List[np.ndarray]],
+                                dumpdir: str,
+                                verbosityCfg: CodeGenVerbosity,
+                                n_steps: int = 1,
+                                n_accum: int = 1,
+                                num_data_inputs: int = 2,
+                                grad_buf_start_idx: int = 0,
+                                num_grad_inputs: int = 0,
+                                learning_rate: float = 0.001,
+                                reference_losses: List = None,
+                                init_weights:
List = None, + data_size: int = None, + tolerance_abs: float = 1e-3) -> None: + """Generate all training test files: testinputs.h, testoutputs.h, TrainingNetwork.h, TrainingNetwork.c. + + Parameters + ---------- + deployer : NetworkDeployer + Prepared deployer (ctxt.name must already be set to "DeeployTrainingNetwork"). + all_mb_data : list of list of np.ndarray + Per-mini-batch DATA arrays: ``all_mb_data[mb][buf]`` is the array for + mini-batch *mb* and DATA buffer *buf*. + dumpdir : str + Output directory for generated files. + verbosityCfg : CodeGenVerbosity + Verbosity configuration. + n_steps : int + N_TRAIN_STEPS value. + n_accum : int + N_ACCUM_STEPS value. + num_data_inputs : int + Number of data inputs (TRAINING_NUM_DATA_INPUTS). + grad_buf_start_idx : int + Index of the first grad accumulation buffer in DeeployNetwork_inputs[]. + num_grad_inputs : int + Number of grad accumulation buffers (TRAINING_NUM_GRAD_INPUTS). + """ + assert deployer.prepared, "An unprepared deployer was given" + + os.makedirs(dumpdir, exist_ok = True) + + # testinputs.h + testInputStr = generateTrainingTestInputsHeader(deployer, + all_mb_data, + n_steps, + n_accum, + grad_buf_start_idx, + num_grad_inputs, + learning_rate, + init_weights = init_weights, + data_size = data_size) + with open(f'{dumpdir}/testinputs.h', 'w') as f: + f.write(testInputStr) + + # testoutputs.h + testOutputStr = generateTrainingTestOutputsHeader( + reference_losses = reference_losses, + tolerance_abs = tolerance_abs, + ) + with open(f'{dumpdir}/testoutputs.h', 'w') as f: + f.write(testOutputStr) + + # TrainingNetwork.h + headerStr = generateTrainingNetworkHeader(deployer) + with open(f'{dumpdir}/TrainingNetwork.h', 'w') as f: + f.write(headerStr) + + # TrainingNetwork.c + implStr = generateTrainingNetworkImplementation(deployer, verbosityCfg) + with open(f'{dumpdir}/TrainingNetwork.c', 'w') as f: + f.write(implStr) + + clang_format = "{BasedOnStyle: llvm, IndentWidth: 2, ColumnLimit: 160}" + for fname in 
['TrainingNetwork.c', 'TrainingNetwork.h', 'testinputs.h', 'testoutputs.h']: + os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/{fname}') + + # Build initial-value list for every input_N buffer so that L3 hex files + # can be written. The list must cover all N where "input_N" exists in the + # deployer context. Layout (must match DeeployNetwork_inputs[] order): + # [0 .. num_data_inputs-1] → first mini-batch data + # [num_data_inputs .. grad_start-1] → initial weights + # [grad_start .. grad_start+num_grad-1] → zeros (grad acc bufs) + # [last] → lazy_reset_grad = 1 (uint8) + l3_initial_inputs: List[np.ndarray] = [] + # Count how many input_N buffers exist in the deployer context + n_total_inputs = sum( + 1 for name in deployer.ctxt.globalObjects if name.startswith("input_") and name[len("input_"):].isdigit()) + for i in range(n_total_inputs): + if all_mb_data and i < len(all_mb_data[0]): + # Data / label input + l3_initial_inputs.append(all_mb_data[0][i]) + elif (init_weights is not None and grad_buf_start_idx > 0 and num_data_inputs <= i < grad_buf_start_idx): + # Weight input + wi = i - num_data_inputs + l3_initial_inputs.append(init_weights[wi] if wi < + len(init_weights) else np.array([0.0], dtype = np.float32)) + elif (grad_buf_start_idx > 0 and num_grad_inputs > 0 + and grad_buf_start_idx <= i < grad_buf_start_idx + num_grad_inputs): + # Gradient accumulation buffer — zero-initialised + buf = deployer.ctxt.globalObjects.get(f"input_{i}") + shape = buf.shape if (buf is not None and hasattr(buf, 'shape')) else (1,) + l3_initial_inputs.append(np.zeros(shape, dtype = np.float32)) + else: + # lazy_reset_grad (last input) or any unknown slot — default 1 / uint8 + buf = deployer.ctxt.globalObjects.get(f"input_{i}") + shape = buf.shape if (buf is not None and hasattr(buf, 'shape')) else (1,) + l3_initial_inputs.append(np.ones(shape, dtype = np.uint8)) + + generateL3HexDump(deployer, os.path.join(dumpdir, 'hex'), l3_initial_inputs, []) + + +# 
--------------------------------------------------------------------------- +# Optimizer network code-generation helpers +# --------------------------------------------------------------------------- + +_OPT_PREFIX = "DeeployOptNetwork_" +_TRAIN_PREFIX = "DeeployNetwork_" + + +def build_shared_buffer_maps(train_onnx_path: str, opt_onnx_model) -> Tuple[Dict[int, int], Dict[int, int]]: + """Build optimizer→training index maps for tensors shared between the two graphs. + + The optimizer ONNX inputs are interleaved weight/grad pairs that have the + same tensor names as inputs in the training ONNX graph. We match by name + so that ``InitOptimizerNetwork`` can reference the already-allocated + ``DeeployNetwork_input_N`` pointers instead of allocating fresh buffers. + + Parameters + ---------- + train_onnx_path : str + Path to the training ``network.onnx``. + opt_onnx_model : + Already-loaded optimizer ONNX model (``onnx.ModelProto``). + + Returns + ------- + shared_input_map : Dict[int, int] + opt_input_idx → train_input_idx + shared_output_map : Dict[int, int] + opt_output_idx → train_input_idx (SGD outputs == updated weights, + same physical buffer as the weight input) + """ + import onnx as _onnx + train_model = _onnx.load_model(train_onnx_path) + train_names = [inp.name for inp in train_model.graph.input] + train_name_to_idx = {name: i for i, name in enumerate(train_names)} + + opt_input_names = [inp.name for inp in opt_onnx_model.graph.input] + opt_output_names = [out.name for out in opt_onnx_model.graph.output] + + shared_input_map: Dict[int, int] = {} + for opt_idx, name in enumerate(opt_input_names): + if name in train_name_to_idx: + shared_input_map[opt_idx] = train_name_to_idx[name] + + shared_output_map: Dict[int, int] = {} + for opt_idx, name in enumerate(opt_output_names): + # Try exact match first; then strip the '_updated' suffix that the SGD + # node appends to output tensor names (e.g. 'conv1_weight_updated' → 'conv1_weight'). 
+ lookup_name = name + if lookup_name not in train_name_to_idx and lookup_name.endswith('_updated'): + lookup_name = lookup_name[:-len('_updated')] + if lookup_name in train_name_to_idx: + shared_output_map[opt_idx] = train_name_to_idx[lookup_name] + + return shared_input_map, shared_output_map + + +def _patch_shared_buffers(retStr: str, shared_input_map: Dict[int, int], shared_output_map: Dict[int, int]) -> str: + """Redirect optimizer I/O buffers to Training's already-allocated buffers. + + Must be called AFTER the _TRAIN_PREFIX → _OPT_PREFIX substitution so that + the generated symbols already carry the ``DeeployOptNetwork_`` prefix. + + Handles two allocation styles produced by Deeploy: + + *Non-tiled* (per-buffer malloc):: + + DeeployOptNetwork_input_N = (SomeType *)pi_l2_malloc(sizeof(...)); + + *Tiled* (single arena with offsets):: + + DeeployOptNetwork_input_N = (float32_t *)((char *)DeeployOptNetwork_MEMORYARENA_L2 + OFFSET); + + Both are replaced with direct pointers into the TrainingNetwork arenas:: + + DeeployOptNetwork_input_N = (float32_t *)DeeployNetwork_input_M; + + After all I/O pointers are redirected, if a ``MEMORYARENA_L2`` or + ``MEMORYARENA_L3`` allocation is no longer referenced anywhere in the Init + body (i.e., the shared buffers consumed the entire arena), the now-unused + malloc is also removed to reclaim the L2/L3 memory. + + Parameters + ---------- + retStr : str + The already-prefix-substituted C source string. + shared_input_map : Dict[int, int] + Optimizer input index → training input index. + shared_output_map : Dict[int, int] + Optimizer output index → training input index (in-place update). + + Returns + ------- + str + Patched C source string. 
+ """ + if not shared_input_map and not shared_output_map: + return retStr + + # ------------------------------------------------------------------ + # Pattern 1 (non-tiled): individual pi_*_malloc per buffer + # ------------------------------------------------------------------ + _malloc_pat = re.compile( + r'(DeeployOptNetwork_(input|output)_(\d+))\s*=\s*\([^)]+\s*\*\s*\)\s*pi_\w+_malloc\([^;]+\);') + + # ------------------------------------------------------------------ + # Pattern 2 (tiled): arena-offset assignment + # DeeployOptNetwork_input_N = (Type *)((char *)DeeployOptNetwork_MEMORYARENA_Lx + OFFSET); + # ------------------------------------------------------------------ + _arena_pat = re.compile(r'(DeeployOptNetwork_(input|output)_(\d+))\s*=\s*\([^)]+\s*\*\s*\)' + r'\s*\(\s*\(char\s*\*\)\s*DeeployOptNetwork_MEMORYARENA_L\w+\s*\+\s*\d+\s*\)\s*;') + + def _make_replacement(symbol: str, kind: str, idx: int) -> Optional[str]: + if kind == "input" and idx in shared_input_map: + train_idx = shared_input_map[idx] + return f'{symbol} = (float32_t *){_TRAIN_PREFIX}input_{train_idx}; /* shared with TrainingNetwork */' + if kind == "output" and idx in shared_output_map: + train_idx = shared_output_map[idx] + return f'{symbol} = (float32_t *){_TRAIN_PREFIX}input_{train_idx}; /* in-place, shared with TrainingNetwork */' + return None + + def _replace(m: re.Match) -> str: + replacement = _make_replacement(m.group(1), m.group(2), int(m.group(3))) + return replacement if replacement is not None else m.group(0) + + retStr = _malloc_pat.sub(_replace, retStr) + retStr = _arena_pat.sub(_replace, retStr) + + # ------------------------------------------------------------------ + # Arena elimination: if a MEMORYARENA_Lx is no longer used for any + # pointer arithmetic after the redirects, its malloc is dead and can + # be removed to reclaim L2/L3. The global declaration is left in + # place (harmless; the variable will be NULL at runtime). 
+ # ------------------------------------------------------------------ + for level in ('L2', 'L3'): + arena_sym = f'DeeployOptNetwork_MEMORYARENA_{level}' + # Pattern for the malloc assignment line itself + malloc_line_pat = re.compile(rf'[^\n]*{re.escape(arena_sym)}\s*=\s*\([^)]+\)\s*pi_\w+_malloc\([^;]+\);\s*\n') + # Pattern for any use of the arena in pointer arithmetic: + # (char *)ARENA + OFFSET or (void *)ARENA etc. + arena_use_pat = re.compile(rf'\(\s*(?:char|void|int8_t)\s*\*\s*\)\s*{re.escape(arena_sym)}') + if not arena_use_pat.search(retStr): + # No remaining pointer arithmetic — the malloc is dead + retStr = malloc_line_pat.sub('', retStr) + + # ------------------------------------------------------------------ + # Inject TrainingNetwork header so DeeployNetwork_input_N symbols resolve + # ------------------------------------------------------------------ + retStr = retStr.replace( + '#include "OptimizerNetwork.h"', + '#include "OptimizerNetwork.h"\n#include "TrainingNetwork.h"', + ) + return retStr + + +def _patch_shared_arenas(retStr: str, train_c_source: str) -> str: + """Redirect optimizer L1/L2 arena allocations to reuse training network's arenas. + + TrainingNetwork and OptimizerNetwork run strictly sequentially: RunTrainingNetwork() + completes before RunOptimizerNetwork() starts. Their L1/L2 tile-working arenas + therefore never overlap in time and can share the same physical memory. + + Only the L1 arena is shared: it is pure tile-compute scratch whose content is + dead after each kernel returns. The L2 arena is NOT shared because it may hold + persistent tensor data (weights, activations) at fixed offsets in non-tiled mode; + sharing it would let the optimizer's L2 staging buffers overwrite that data. + + Must be called AFTER the _TRAIN_PREFIX → _OPT_PREFIX substitution. + + Parameters + ---------- + retStr : str + The already-prefix-substituted C source string for the optimizer. 
+ train_c_source : str + The full text of TrainingNetwork.c (used to confirm the arena symbols exist). + + Returns + ------- + str + Patched C source string. + """ + for level in ('L1',): + train_sym = f'DeeployNetwork_MEMORYARENA_{level}' + # Only alias if the training network actually has this arena + if train_sym not in train_c_source: + continue + + opt_sym = f'DeeployOptNetwork_MEMORYARENA_{level}' + opt_malloc_pat = re.compile(rf'({re.escape(opt_sym)})\s*=\s*\([^)]+\)\s*\w+\(sizeof\([^)]+\)\s*\*\s*\d+\)\s*;') + if not opt_malloc_pat.search(retStr): + continue + + replacement = f'{opt_sym} = (int8_t *){train_sym}; /* shared with TrainingNetwork */' + retStr = opt_malloc_pat.sub(replacement, retStr) + + # Inject TrainingNetwork header if not already present + # (_patch_shared_buffers may have already added it; guard against duplicates) + if '#include "TrainingNetwork.h"' not in retStr: + retStr = retStr.replace( + '#include "OptimizerNetwork.h"', + '#include "OptimizerNetwork.h"\n#include "TrainingNetwork.h"', + ) + + return retStr + + +def _ensure_training_l1_capacity(dumpdir: str, train_c_source: str, opt_alloc_code: str) -> str: + """Enlarge TrainingNetwork's L1 arena to cover the optimizer's L1 needs. + + Since the two networks share the same L1 arena, TrainingNetwork must allocate + at least max(train_L1, opt_L1) bytes. When the optimizer needs more L1 than + training (rare but possible, e.g. autoencoder), this function patches + TrainingNetwork.c and TrainingNetwork.h in-place and returns the updated + TrainingNetwork.c source string. + + Parameters + ---------- + dumpdir : str + Directory containing TrainingNetwork.c and TrainingNetwork.h. + train_c_source : str + Current content of TrainingNetwork.c. + opt_alloc_code : str + Optimizer buffer-allocation code after _TRAIN_PREFIX → _OPT_PREFIX + substitution (used to extract the optimizer's L1 size). + + Returns + ------- + str + (Possibly updated) TrainingNetwork.c source string. 
+ """ + m_opt = re.search( + r'DeeployOptNetwork_MEMORYARENA_L1\s*=\s*\([^)]+\)\s*pmsis_l1_malloc\(sizeof\([^)]+\)\s*\*\s*(\d+)\)', + opt_alloc_code, + ) + if not m_opt: + return train_c_source + + opt_l1 = int(m_opt.group(1)) + + m_train = re.search( + r'(DeeployNetwork_MEMORYARENA_L1\s*=\s*\([^)]+\)\s*pmsis_l1_malloc\(sizeof\([^)]+\)\s*\*\s*)(\d+)(\))', + train_c_source, + ) + if not m_train: + return train_c_source + + train_l1 = int(m_train.group(2)) + if opt_l1 <= train_l1: + return train_c_source # Already large enough + + new_l1 = opt_l1 + + # Patch TrainingNetwork.c malloc size + train_c_new = train_c_source.replace( + m_train.group(0), + f'{m_train.group(1)}{new_l1}{m_train.group(3)}', + 1, + ) + train_c_path = os.path.join(dumpdir, 'TrainingNetwork.c') + with open(train_c_path, 'w') as f: + f.write(train_c_new) + + # Patch TrainingNetwork.h _len constant + train_h_path = os.path.join(dumpdir, 'TrainingNetwork.h') + if os.path.exists(train_h_path): + train_h = open(train_h_path).read() + train_h_new = re.sub( + r'(DeeployNetwork_MEMORYARENA_L1_len\s*=\s*)\d+', + rf'\g<1>{new_l1}', + train_h, + ) + with open(train_h_path, 'w') as f: + f.write(train_h_new) + + return train_c_new + + +def generateOptimizerNetworkHeader(deployer: NetworkDeployer) -> str: + """Generate OptimizerNetwork.h. + + Reuses the Deeploy deployer's output and applies two transformations: + 1. Replace the buffer prefix ``DeeployNetwork_`` → ``DeeployOptNetwork_`` + 2. Inject ``RunOptimizerNetwork`` / ``InitOptimizerNetwork`` function declarations. + + Parameters + ---------- + deployer : NetworkDeployer + Prepared deployer for the optimizer ONNX graph. + + Returns + ------- + str + C header string. 
+ """ + retStr = "" + retStr += """ +#ifndef __DEEPLOY_OPTIMIZER_HEADER__ +#define __DEEPLOY_OPTIMIZER_HEADER__ +#include +#include +#include +""" + retStr += deployer.generateIncludeString() + if isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)): + retStr += """ +void RunOptimizerNetwork(); +void InitOptimizerNetwork(); + +""" + else: + retStr += """ +void RunOptimizerNetwork(uint32_t core_id, uint32_t numThreads); +void InitOptimizerNetwork(uint32_t core_id, uint32_t numThreads); + +""" + retStr += deployer.generateIOBufferInitializationCode() + retStr += """ +#endif +""" + # Prefix substitution: all Deeploy-generated DeeployNetwork_ → DeeployOptNetwork_ + retStr = retStr.replace(_TRAIN_PREFIX, _OPT_PREFIX) + return retStr + + +def generateOptimizerNetworkImplementation(deployer: NetworkDeployer, + verbosityCfg: CodeGenVerbosity, + shared_input_map: Optional[Dict[int, int]] = None, + shared_output_map: Optional[Dict[int, int]] = None, + train_c_source: Optional[str] = None) -> str: + """Generate OptimizerNetwork.c. + + Parameters + ---------- + deployer : NetworkDeployer + Prepared deployer for the optimizer ONNX graph. + verbosityCfg : CodeGenVerbosity + Verbosity configuration. + shared_input_map : Dict[int, int], optional + Optimizer input index → training input index for shared weight/grad buffers. + When provided, those malloc calls are replaced with references to the + already-allocated TrainingNetwork buffers. + shared_output_map : Dict[int, int], optional + Optimizer output index → training input index for in-place shared outputs. + train_c_source : str, optional + Full text of TrainingNetwork.c. When provided, the optimizer's L1/L2 arena + malloc calls are replaced with direct pointers to the training arenas, + saving one L1 and one L2 allocation (safe because the two networks run + strictly sequentially). + + Returns + ------- + str + C implementation string. 
+ """ + retStr = "" + retStr += """#include +#include +#include +""" + retStr += deployer.generateIncludeString() + retStr += """ +#include "OptimizerNetwork.h" + +""" + retStr += deployer.generateBufferInitializationCode() + retStr += deployer.generateGlobalDefinitionCode() + + if isinstance(deployer.Platform, MemPoolPlatform): + retStr += deployer.generateInferenceInitializationCode() + retStr += """ +void RunOptimizerNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){ +""" + elif isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)): + retStr += """ +void RunOptimizerNetwork(){ +""" + retStr += deployer.generateInferenceInitializationCode() + else: + retStr += """ +void RunOptimizerNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){ +""" + retStr += deployer.generateInferenceInitializationCode() + + retStr += deployer.generateFunction(verbosityCfg) + + if isinstance(deployer.Platform, (PULPPlatform, MemoryPULPPlatform, MemoryPULPPlatformWrapper)): + retStr += """ +} + +void InitOptimizerNetwork(){ +""" + else: + retStr += """ +} + +void InitOptimizerNetwork(__attribute__((unused)) uint32_t core_id, __attribute__((unused)) uint32_t numThreads){ +""" + retStr += deployer.generateEngineInitializationCode() + retStr += deployer.generateBufferAllocationCode() + retStr += """ +} +""" + # Prefix substitution + retStr = retStr.replace(_TRAIN_PREFIX, _OPT_PREFIX) + # Replace malloc calls for shared weight/grad buffers with Training pointers + retStr = _patch_shared_buffers(retStr, shared_input_map or {}, shared_output_map or {}) + # Redirect optimizer L1/L2 arena mallocs to reuse training arenas + if train_c_source: + retStr = _patch_shared_arenas(retStr, train_c_source) + return retStr + + +def generateOptimizerTestNetwork(deployer: NetworkDeployer, + dumpdir: str, + verbosityCfg: CodeGenVerbosity, + shared_input_map: Optional[Dict[int, int]] = 
None, + shared_output_map: Optional[Dict[int, int]] = None) -> None: + """Generate OptimizerNetwork.h and OptimizerNetwork.c. + + Parameters + ---------- + deployer : NetworkDeployer + Prepared deployer for the optimizer ONNX graph. + dumpdir : str + Output directory for generated files. + verbosityCfg : CodeGenVerbosity + Verbosity configuration. + shared_input_map : Dict[int, int], optional + Optimizer input index → training input index for shared weight/grad buffers. + shared_output_map : Dict[int, int], optional + Optimizer output index → training input index for in-place shared outputs. + """ + assert deployer.prepared, "An unprepared deployer was given" + + os.makedirs(dumpdir, exist_ok = True) + + train_c_path = os.path.join(dumpdir, 'TrainingNetwork.c') + train_c_source: Optional[str] = None + if os.path.exists(train_c_path): + with open(train_c_path, 'r') as f: + train_c_source = f.read() + + # Enlarge training L1 arena if optimizer needs more (so unconditional L1 sharing is safe) + if train_c_source: + opt_alloc_preview = deployer.generateBufferAllocationCode().replace(_TRAIN_PREFIX, _OPT_PREFIX) + train_c_source = _ensure_training_l1_capacity(dumpdir, train_c_source, opt_alloc_preview) + + headerStr = generateOptimizerNetworkHeader(deployer) + with open(f'{dumpdir}/OptimizerNetwork.h', 'w') as f: + f.write(headerStr) + + implStr = generateOptimizerNetworkImplementation(deployer, verbosityCfg, shared_input_map, shared_output_map, + train_c_source) + with open(f'{dumpdir}/OptimizerNetwork.c', 'w') as f: + f.write(implStr) + + clang_format = "{BasedOnStyle: llvm, IndentWidth: 2, ColumnLimit: 160}" + for fname in ['OptimizerNetwork.c', 'OptimizerNetwork.h']: + os.system(f'clang-format -i --style="{clang_format}" {dumpdir}/{fname}') From 969f593a786aa3d74f757be1354ddb0c3f3707cb Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 18:42:57 +0000 Subject: [PATCH 24/28] training-platform core: drop redundant top-level deeployTrainingRunner.py 
MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit DeeployTest/deeployTrainingRunner.py was a third training-runner entry point that peeked at --tiled / -p ahead of testUtils.deeployTrainingRunner.main() and forwarded tiling_enabled accordingly. It duplicated what the two explicit stubs already do by file name: deeployTrainingRunner_siracusa.py → main(tiling_enabled=False) deeployTrainingRunner_tiled_siracusa.py → main(tiling_enabled=True) Nothing in the repo (py imports, CMake, CI workflows) references the top-level file; only its own docstring and an offline AI_AGENT planning doc mention it. Remove it so the training CLI surface is just the two file-name-scoped stubs, consistent with the preference to keep the non-tiled and tiled variants visually split. Verified on Siracusa: simplemlp_train still passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. --- DeeployTest/deeployTrainingRunner.py | 30 ---------------------------- 1 file changed, 30 deletions(-) delete mode 100644 DeeployTest/deeployTrainingRunner.py diff --git a/DeeployTest/deeployTrainingRunner.py b/DeeployTest/deeployTrainingRunner.py deleted file mode 100644 index 7dfc7d965d..0000000000 --- a/DeeployTest/deeployTrainingRunner.py +++ /dev/null @@ -1,30 +0,0 @@ -#!/usr/bin/env python -# SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna -# -# SPDX-License-Identifier: Apache-2.0 -""" -CLI runner for training tests on Siracusa and GAP9. 
- -Usage: - python deeployTrainingRunner.py -t [-p Siracusa|GAP9] [--tiled] [options] - -Examples: - python deeployTrainingRunner.py -t Tests/Models/MLP_Train/simplemlp_train - python deeployTrainingRunner.py -t Tests/Models/MLP_Train/simplemlp_train -p GAP9 - python deeployTrainingRunner.py -t Tests/Models/SmallTransformer/tinytransformer_train --tiled - python deeployTrainingRunner.py -t Tests/Models/SmallTransformer/tinytransformer_train --tiled -p GAP9 -""" - -import argparse -import sys - -from testUtils.deeployTrainingRunner import main - -if __name__ == '__main__': - # Peek at --tiled and -p before passing to main(), which builds its own parser. - pre = argparse.ArgumentParser(add_help = False) - pre.add_argument('--tiled', action = 'store_true', default = False) - pre.add_argument('-p', '--platform', default = 'Siracusa') - known, _ = pre.parse_known_args() - - sys.exit(main(tiling_enabled = known.tiled, default_platform = known.platform)) From ac4df5b150eda45ed4b2f5f4427fdf1da270d4f5 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 18:55:00 +0000 Subject: [PATCH 25/28] training-platform core: drop non-training helper wrappers from trainingUtils MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Apply a strict "only training-specific helpers get a dedicated function" rule to testUtils/trainingUtils.py. 
The following wrappers were removed because their content was not training-specific: - add_cores_arg (--cores is generic codegen) - add_memory_level_args (--l1 / --l2 / --defaultMemLevel — tiling generic) - add_tiling_solver_args (--memAllocStrategy / --searchStrategy / --plotMemAlloc / --profileTiling — tiling generic) - add_should_fail_arg (--shouldFail is used by all codegen scripts) - run_with_shouldfail (the try/except shouldFail handshake is used by all codegen scripts) - build_codegen_cmd (generic [python, script, -d, -t, -p] prefix) - run_codegen_subprocess (generic log + subprocess.run + check) - filter_passthrough_args (generic list comprehension) The first five are now inlined in each of the four training codegen entry points (generateTrainingNetwork.py, testMVPTraining.py, generateOptimizerNetwork.py, testMVPOptimizer.py), matching the style of the upstream inference scripts which define these args inline as well. The last three (subprocess helpers) were only called from inside run_training_codegen() itself, so they become inline code there. What stays in trainingUtils.py are only helpers whose content is genuinely training-specific: the inputs.npz/outputs.npz readers, the Tiler _mockScheduler, the training-argument argparse builders (add_training_inference_args, add_optimizer_training_dir_arg), resolve_optimizer_dir, add_training_cmake_flags and run_training_codegen. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. 
--- DeeployTest/generateOptimizerNetwork.py | 26 +++- DeeployTest/generateTrainingNetwork.py | 19 ++- DeeployTest/testMVPOptimizer.py | 41 +++++- DeeployTest/testMVPTraining.py | 41 +++++- DeeployTest/testUtils/trainingUtils.py | 186 ++++++------------------ 5 files changed, 146 insertions(+), 167 deletions(-) diff --git a/DeeployTest/generateOptimizerNetwork.py b/DeeployTest/generateOptimizerNetwork.py index a277a3a2a8..d13b29505e 100644 --- a/DeeployTest/generateOptimizerNetwork.py +++ b/DeeployTest/generateOptimizerNetwork.py @@ -30,8 +30,7 @@ from testUtils.codeGenerateTraining import build_shared_buffer_maps, generateOptimizerTestNetwork from testUtils.platformMapping import mapDeployer, mapPlatform, setupMemoryPlatform from testUtils.testRunner import TestGeneratorArgumentParser -from testUtils.trainingUtils import add_cores_arg, add_memory_level_args, add_optimizer_training_dir_arg, \ - add_should_fail_arg, run_with_shouldfail +from testUtils.trainingUtils import add_optimizer_training_dir_arg from Deeploy.AbstractDataTypes import PointerClass from Deeploy.CommonExtensions.DataTypes import float32_t @@ -120,15 +119,30 @@ def generateOptimizerNetwork(args): if __name__ == '__main__': parser = TestGeneratorArgumentParser(description = "Deeploy Optimizer Network Code Generation.") - add_cores_arg(parser) + parser.add_argument("--cores", type = int, default = 1, help = "Number of cluster cores. Default: 1.") parser.add_argument( "--lr", type = float, default = 0.001, help = "Learning rate (informational only; embedded in optimizer ONNX attributes). Default: 0.001.", ) - add_memory_level_args(parser) + parser.add_argument("--l1", type = int, default = 64_000, help = "L1 size in bytes. Default: 64000.") + parser.add_argument("--l2", type = int, default = 1_024_000, help = "L2 size in bytes. Default: 1024000.") + parser.add_argument("--defaultMemLevel", + type = str, + default = "L2", + help = "Default memory level for IO buffers. 
Default: L2.") add_optimizer_training_dir_arg(parser) - add_should_fail_arg(parser) + parser.add_argument("--shouldFail", action = "store_true") + parser.set_defaults(shouldFail = False) args = parser.parse_args() - run_with_shouldfail(generateOptimizerNetwork, args, "Optimizer network generation") + + try: + generateOptimizerNetwork(args) + except Exception: + if args.shouldFail: + print("\033[92mOptimizer network generation ended, failed as expected!\033[0m") + sys.exit(0) + raise + if args.shouldFail: + raise RuntimeError("Expected to fail!") diff --git a/DeeployTest/generateTrainingNetwork.py b/DeeployTest/generateTrainingNetwork.py index 7ce3e5d35f..dd0ce87718 100644 --- a/DeeployTest/generateTrainingNetwork.py +++ b/DeeployTest/generateTrainingNetwork.py @@ -13,8 +13,7 @@ from testUtils.platformMapping import mapDeployer, mapPlatform from testUtils.testRunner import TestGeneratorArgumentParser from testUtils.trainingUtils import _GRAD_ACC, _infer_data_size, _infer_n_accum, _infer_num_data_inputs, \ - _infer_total_mb, _load_reference_losses, add_cores_arg, add_should_fail_arg, add_training_inference_args, \ - run_with_shouldfail + _infer_total_mb, _load_reference_losses, add_training_inference_args from testUtils.typeMapping import inferTypeAndOffset from Deeploy.AbstractDataTypes import PointerClass @@ -218,8 +217,18 @@ def generateTrainingNetwork(args): if __name__ == '__main__': parser = TestGeneratorArgumentParser(description = "Deeploy Training Code Generation Utility.") - add_cores_arg(parser) + parser.add_argument("--cores", type = int, default = 1, help = "Number of cluster cores. 
Default: 1.") add_training_inference_args(parser) - add_should_fail_arg(parser) + parser.add_argument("--shouldFail", action = "store_true") + parser.set_defaults(shouldFail = False) args = parser.parse_args() - run_with_shouldfail(generateTrainingNetwork, args, "Training network generation") + + try: + generateTrainingNetwork(args) + except Exception: + if args.shouldFail: + print("\033[92mTraining network generation ended, failed as expected!\033[0m") + sys.exit(0) + raise + if args.shouldFail: + raise RuntimeError("Expected to fail!") diff --git a/DeeployTest/testMVPOptimizer.py b/DeeployTest/testMVPOptimizer.py index e90c20dd10..f75fe4902e 100644 --- a/DeeployTest/testMVPOptimizer.py +++ b/DeeployTest/testMVPOptimizer.py @@ -36,8 +36,7 @@ from testUtils.platformMapping import mapDeployer, mapPlatform, setupMemoryPlatform from testUtils.testRunner import TestGeneratorArgumentParser from testUtils.tilingUtils import TrainingSBTiler -from testUtils.trainingUtils import _mockScheduler, add_cores_arg, add_memory_level_args, \ - add_optimizer_training_dir_arg, add_should_fail_arg, add_tiling_solver_args, run_with_shouldfail +from testUtils.trainingUtils import _mockScheduler, add_optimizer_training_dir_arg from Deeploy.AbstractDataTypes import PointerClass from Deeploy.CommonExtensions.DataTypes import float32_t @@ -148,16 +147,44 @@ def generateTiledOptimizerNetwork(args) -> None: if __name__ == '__main__': parser = TestGeneratorArgumentParser(description = "Deeploy Tiled Optimizer Network Code Generation.") - add_cores_arg(parser) + parser.add_argument("--cores", type = int, default = 1, help = "Number of cluster cores. Default: 1.") parser.add_argument( "--lr", type = float, default = 0.001, help = "Learning rate (informational only; embedded in optimizer ONNX attributes). Default: 0.001.", ) - add_memory_level_args(parser) - add_tiling_solver_args(parser) + parser.add_argument("--l1", type = int, default = 64_000, help = "L1 size in bytes. 
Default: 64000.") + parser.add_argument("--l2", type = int, default = 1_024_000, help = "L2 size in bytes. Default: 1024000.") + parser.add_argument("--defaultMemLevel", + type = str, + default = "L2", + help = "Default memory level for IO buffers. Default: L2.") + parser.add_argument("--memAllocStrategy", + type = str, + default = "MiniMalloc", + help = "Memory allocation strategy. Default: MiniMalloc.") + parser.add_argument("--searchStrategy", + type = str, + default = "random-max", + help = "CP solver search strategy. Default: random-max.") + parser.add_argument("--plotMemAlloc", + action = "store_true", + help = "Save memory allocation plots in the deeployStates folder.") + parser.add_argument("--profileTiling", + action = "store_true", + help = "Enable tiling profiling (inserts cycle counters around each tiled kernel).") add_optimizer_training_dir_arg(parser) - add_should_fail_arg(parser) + parser.add_argument("--shouldFail", action = "store_true") + parser.set_defaults(shouldFail = False) args = parser.parse_args() - run_with_shouldfail(generateTiledOptimizerNetwork, args, "Tiled optimizer network generation") + + try: + generateTiledOptimizerNetwork(args) + except Exception: + if args.shouldFail: + print("\033[92mTiled optimizer network generation ended, failed as expected!\033[0m") + sys.exit(0) + raise + if args.shouldFail: + raise RuntimeError("Expected to fail!") diff --git a/DeeployTest/testMVPTraining.py b/DeeployTest/testMVPTraining.py index 71f44f81d9..90beee5070 100644 --- a/DeeployTest/testMVPTraining.py +++ b/DeeployTest/testMVPTraining.py @@ -16,8 +16,7 @@ from testUtils.testRunner import TestGeneratorArgumentParser from testUtils.tilingUtils import TrainingSBTiler from testUtils.trainingUtils import _GRAD_ACC, _infer_data_size, _infer_n_accum, _infer_num_data_inputs, \ - _infer_total_mb, _load_reference_losses, _mockScheduler, add_cores_arg, add_memory_level_args, \ - add_should_fail_arg, add_tiling_solver_args, add_training_inference_args, 
run_with_shouldfail + _infer_total_mb, _load_reference_losses, _mockScheduler, add_training_inference_args from testUtils.typeMapping import inferTypeAndOffset from Deeploy.AbstractDataTypes import PointerClass @@ -235,10 +234,38 @@ def generateTiledTrainingNetwork(args) -> None: if __name__ == '__main__': parser = TestGeneratorArgumentParser(description = "Deeploy Tiled Training Code Generation Utility.") - add_cores_arg(parser) + parser.add_argument("--cores", type = int, default = 1, help = "Number of cluster cores. Default: 1.") add_training_inference_args(parser) - add_memory_level_args(parser) - add_tiling_solver_args(parser) - add_should_fail_arg(parser) + parser.add_argument("--l1", type = int, default = 64_000, help = "L1 size in bytes. Default: 64000.") + parser.add_argument("--l2", type = int, default = 1_024_000, help = "L2 size in bytes. Default: 1024000.") + parser.add_argument("--defaultMemLevel", + type = str, + default = "L2", + help = "Default memory level for IO buffers. Default: L2.") + parser.add_argument("--memAllocStrategy", + type = str, + default = "MiniMalloc", + help = "Memory allocation strategy. Default: MiniMalloc.") + parser.add_argument("--searchStrategy", + type = str, + default = "random-max", + help = "CP solver search strategy. 
Default: random-max.") + parser.add_argument("--plotMemAlloc", + action = "store_true", + help = "Save memory allocation plots in the deeployStates folder.") + parser.add_argument("--profileTiling", + action = "store_true", + help = "Enable tiling profiling (inserts cycle counters around each tiled kernel).") + parser.add_argument("--shouldFail", action = "store_true") + parser.set_defaults(shouldFail = False) args = parser.parse_args() - run_with_shouldfail(generateTiledTrainingNetwork, args, "Tiled training network generation") + + try: + generateTiledTrainingNetwork(args) + except Exception: + if args.shouldFail: + print("\033[92mTiled training network generation ended, failed as expected!\033[0m") + sys.exit(0) + raise + if args.shouldFail: + raise RuntimeError("Expected to fail!") diff --git a/DeeployTest/testUtils/trainingUtils.py b/DeeployTest/testUtils/trainingUtils.py index 78f02e7218..1f3e030032 100644 --- a/DeeployTest/testUtils/trainingUtils.py +++ b/DeeployTest/testUtils/trainingUtils.py @@ -6,20 +6,22 @@ (generateTrainingNetwork.py, testMVPTraining.py, generateOptimizerNetwork.py, testMVPOptimizer.py). -Four kinds of helpers live here: +Four kinds of helpers live here, all strictly training-specific: 1. inputs.npz / outputs.npz readers (``_load_reference_losses``, ``_infer_*``). 2. The singleton ``_mockScheduler`` the Tiler expects for per-node tiling. -3. argparse builders and the ``--shouldFail`` handshake runner that each - codegen entry point would otherwise have to duplicate verbatim in its - ``if __name__ == '__main__':`` block. -4. Subprocess helpers (``build_codegen_cmd``, ``run_codegen_subprocess``, - ``filter_passthrough_args``, ``add_training_cmake_flags``) used by the - core test execution module to dispatch the training / optimizer codegen - scripts and assemble the training-side cmake defines. 
- -The subprocess helpers take primitive parameters (no ``DeeployTestConfig`` -dependency) so this module stays free of a back-edge to ``testUtils.core``. +3. Training-only argparse builders (``add_training_inference_args``, + ``add_optimizer_training_dir_arg``). +4. The core hooks invoked by ``testUtils.core.execution`` + (``resolve_optimizer_dir``, ``run_training_codegen``, + ``add_training_cmake_flags``). + +Generic helpers (``--cores`` / ``--l1`` / ``--l2`` / ``--defaultMemLevel`` / +``--memAllocStrategy`` / ``--searchStrategy`` / ``--plotMemAlloc`` / +``--profileTiling`` / ``--shouldFail`` arg definitions and the ``shouldFail`` +try/except handshake) are deliberately *not* wrapped into functions here: +they are not training-specific and belong inline in whichever entry point +needs them, consistent with the upstream inference codegen scripts. """ import argparse @@ -28,7 +30,7 @@ import subprocess import sys from pathlib import Path -from typing import Callable, Iterable, List, Optional, Sequence, Tuple +from typing import List, Optional import numpy as np import onnx_graphsurgeon as gs @@ -146,15 +148,6 @@ def _mockScheduler(graph: gs.Graph) -> List[List[gs.Node]]: # --------------------------------------------------------------------------- -def add_cores_arg(parser: argparse.ArgumentParser) -> None: - parser.add_argument( - "--cores", - type = int, - default = 1, - help = "Number of cores on which the network is run. Default: 1.", - ) - - def add_training_inference_args(parser: argparse.ArgumentParser) -> None: """Arguments consumed by both training codegen entry points.""" parser.add_argument( @@ -197,59 +190,6 @@ def add_training_inference_args(parser: argparse.ArgumentParser) -> None: ) -def add_memory_level_args(parser: argparse.ArgumentParser) -> None: - """L1/L2 sizes and the default IO memory level.""" - parser.add_argument( - "--l1", - type = int, - dest = "l1", - default = 64_000, - help = "Set L1 size in bytes. 
Default: 64000.", - ) - parser.add_argument( - "--l2", - type = int, - dest = "l2", - default = 1_024_000, - help = "Set L2 size in bytes. Default: 1024000.", - ) - parser.add_argument( - "--defaultMemLevel", - type = str, - dest = "defaultMemLevel", - default = "L2", - help = "Default memory level for IO buffers. Default: L2.", - ) - - -def add_tiling_solver_args(parser: argparse.ArgumentParser) -> None: - """Arguments specific to the tiled codegen path.""" - parser.add_argument( - "--memAllocStrategy", - type = str, - dest = "memAllocStrategy", - default = "MiniMalloc", - help = "Memory allocation strategy. Default: MiniMalloc.", - ) - parser.add_argument( - "--searchStrategy", - type = str, - dest = "searchStrategy", - default = "random-max", - help = "CP solver search strategy. Default: random-max.", - ) - parser.add_argument( - "--plotMemAlloc", - action = "store_true", - help = "Save memory allocation plots in the deeployStates folder.", - ) - parser.add_argument( - "--profileTiling", - action = "store_true", - help = "Enable tiling profiling (inserts cycle counters around each tiled kernel).", - ) - - def add_optimizer_training_dir_arg(parser: argparse.ArgumentParser) -> None: parser.add_argument( "--training-dir", @@ -261,41 +201,6 @@ def add_optimizer_training_dir_arg(parser: argparse.ArgumentParser) -> None: ) -def add_should_fail_arg(parser: argparse.ArgumentParser) -> None: - parser.add_argument("--shouldFail", action = "store_true") - parser.set_defaults(shouldFail = False) - - -def run_with_shouldfail(fn: Callable[[argparse.Namespace], None], args: argparse.Namespace, - stage_label: str) -> None: - """Invoke ``fn(args)`` honouring the ``--shouldFail`` handshake. - - On success with ``--shouldFail``: raises ``RuntimeError("Expected to fail!")``. - On exception with ``--shouldFail``: prints a green success banner and exits 0. - Otherwise: exception propagates, success returns normally. 
- """ - try: - fn(args) - except Exception: - if args.shouldFail: - print(f"\033[92m{stage_label} ended, failed as expected!\033[0m") - sys.exit(0) - raise - if args.shouldFail: - raise RuntimeError("Expected to fail!") - - -# --------------------------------------------------------------------------- -# Subprocess helpers for the test execution harness. -# -# These are used by testUtils/core/execution.py to dispatch the training / -# optimizer codegen scripts. Kept here (rather than as local helpers in -# execution.py) so that every training-related helper lives in one module. -# They take primitive parameters only — no DeeployTestConfig — to avoid -# layering core → training back-edges. -# --------------------------------------------------------------------------- - - def resolve_optimizer_dir(test_dir: str, optimizer_dir: Optional[str]) -> str: """Return the optimizer ONNX directory for a training test. @@ -312,33 +217,6 @@ def resolve_optimizer_dir(test_dir: str, optimizer_dir: Optional[str]) -> str: return str(test_path.parent / optimizer_name) -def build_codegen_cmd(script: Path, test_path: str, gen_dir: str, platform: str) -> List[str]: - """Return the common ``[python, script, -d gen_dir, -t test_path, -p platform]`` prefix.""" - return [ - sys.executable, - str(script), - "-d", - gen_dir, - "-t", - test_path, - "-p", - platform, - ] - - -def run_codegen_subprocess(cmd: Sequence[str], stage_label: str, test_name: str) -> None: - """Run ``cmd`` as a subprocess, log it, and raise with a stage/test-aware message on failure.""" - log.debug(f"[Execution] {stage_label} command: {' '.join(cmd)}") - result = subprocess.run(list(cmd), check = False) - if result.returncode != 0: - raise RuntimeError(f"{stage_label} failed for {test_name}") - - -def filter_passthrough_args(gen_args: Iterable[str], passthrough: Tuple[str, ...]) -> List[str]: - """Return the subset of ``gen_args`` whose entries start with any prefix in ``passthrough``.""" - return [arg for arg in 
gen_args if any(arg.startswith(p) for p in passthrough)] - - def add_training_cmake_flags(cmd: List[str], training: bool, n_train_steps: Optional[int], n_accum_steps: Optional[int], training_num_data_inputs: Optional[int]) -> None: """Append -DTRAINING=ON/OFF plus any known -DN_TRAIN_STEPS / -DN_ACCUM_STEPS / @@ -389,7 +267,16 @@ def run_training_codegen(config, script_dir: Path) -> None: stage = "Training" # --- Step 1: Training network (forward + backward + accumulation) --- - cmd = build_codegen_cmd(training_script, config.test_dir, config.gen_dir, config.platform) + cmd = [ + sys.executable, + str(training_script), + "-d", + config.gen_dir, + "-t", + config.test_dir, + "-p", + config.platform, + ] if config.n_train_steps is not None: cmd.append(f"--n-steps={config.n_train_steps}") if config.n_accum_steps is not None: @@ -401,7 +288,10 @@ def run_training_codegen(config, script_dir: Path) -> None: if config.debug: cmd.append("--debug") cmd.extend(config.gen_args) - run_codegen_subprocess(cmd, f"{stage} network generation", config.test_name) + + log.debug(f"[Execution] {stage} network generation command: {' '.join(cmd)}") + if subprocess.run(cmd, check = False).returncode != 0: + raise RuntimeError(f"{stage} network generation failed for {config.test_name}") # Read back auto-detected values written by the training generation script. 
meta_path = Path(config.gen_dir) / "training_meta.json" @@ -422,11 +312,23 @@ def run_training_codegen(config, script_dir: Path) -> None: log.warning(f"{optimizer_script.name} not found — skipping optimizer codegen") return - opt_cmd = build_codegen_cmd(optimizer_script, opt_dir, config.gen_dir, config.platform) - opt_cmd.append(f"--training-dir={config.test_dir}") - opt_cmd.extend(filter_passthrough_args(config.gen_args, opt_passthrough)) + opt_cmd = [ + sys.executable, + str(optimizer_script), + "-d", + config.gen_dir, + "-t", + opt_dir, + "-p", + config.platform, + f"--training-dir={config.test_dir}", + ] + opt_cmd.extend(arg for arg in config.gen_args if any(arg.startswith(p) for p in opt_passthrough)) if not any(arg.startswith("--defaultMemLevel") for arg in opt_cmd): opt_cmd.append("--defaultMemLevel=L2") if config.verbose > 0: opt_cmd.append("-" + "v" * config.verbose) - run_codegen_subprocess(opt_cmd, f"{stage} optimizer network generation", config.test_name) + + log.debug(f"[Execution] {stage} optimizer network generation command: {' '.join(opt_cmd)}") + if subprocess.run(opt_cmd, check = False).returncode != 0: + raise RuntimeError(f"{stage} optimizer network generation failed for {config.test_name}") From 9d3445ce587c2b5d1d82531c91420ebd6244af60 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 19:09:12 +0000 Subject: [PATCH 26/28] training-platform core: drop unused loop var and populate zero-sized input defaults MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two fixes to the inputTypes / inputOffsets build loop in both generateTrainingNetwork.py and testMVPTraining.py (the two files share this pattern verbatim). 1. The `name` loop variable in `for graph_idx, name in enumerate(...)` was dead — switched to `for graph_idx in range(len(...))` so the intent is explicit and the lint warning goes away. 2. 
The zero-sized-input branch (`np.prod(arr.shape) == 0`) previously `pass`ed without populating inputTypes[f"input_{idx}"] or inputOffsets[f"input_{idx}"]. ONNX permits optional placeholder inputs with shape like (0,) and the downstream deployer looks up every input by key, so a zero-sized input would later raise a confusing KeyError far from the cause. Populate the entries with a trivial default (float32 pointer, offset 0) — the type does not matter for codegen since the buffer has no elements, but the key must exist. Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every step) in both non-tiled and tiled runs. --- DeeployTest/generateTrainingNetwork.py | 8 ++++++-- DeeployTest/testMVPTraining.py | 8 ++++++-- 2 files changed, 12 insertions(+), 4 deletions(-) diff --git a/DeeployTest/generateTrainingNetwork.py b/DeeployTest/generateTrainingNetwork.py index dd0ce87718..febd95afdb 100644 --- a/DeeployTest/generateTrainingNetwork.py +++ b/DeeployTest/generateTrainingNetwork.py @@ -84,7 +84,7 @@ def generateTrainingNetwork(args): inputOffsets = {} npz_idx = 0 - for graph_idx, name in enumerate(graph_input_names): + for graph_idx in range(len(graph_input_names)): if graph_idx in grad_acc_set: inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t) inputOffsets[f"input_{graph_idx}"] = 0 @@ -102,7 +102,11 @@ def generateTrainingNetwork(args): inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t) inputOffsets[f"input_{graph_idx}"] = 0 elif np.prod(arr.shape) == 0: - pass + # Zero-sized input (ONNX allows shape (0, ...) for optional + # placeholders). No data to infer from, but downstream still + # looks up input_{idx} by key, so populate with a trivial default. 
+ inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t) + inputOffsets[f"input_{graph_idx}"] = 0 else: values = arr.reshape(-1).astype(np.float32) _type, offset = inferTypeAndOffset(values, signProp = False) diff --git a/DeeployTest/testMVPTraining.py b/DeeployTest/testMVPTraining.py index 90beee5070..fc2afb231d 100644 --- a/DeeployTest/testMVPTraining.py +++ b/DeeployTest/testMVPTraining.py @@ -79,7 +79,7 @@ def generateTiledTrainingNetwork(args) -> None: inputOffsets = {} npz_idx = 0 - for graph_idx, name in enumerate(graph_input_names): + for graph_idx in range(len(graph_input_names)): if graph_idx in grad_acc_set: inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t) inputOffsets[f"input_{graph_idx}"] = 0 @@ -93,7 +93,11 @@ def generateTiledTrainingNetwork(args) -> None: inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t) inputOffsets[f"input_{graph_idx}"] = 0 elif np.prod(arr.shape) == 0: - pass + # Zero-sized input (ONNX allows shape (0, ...) for optional + # placeholders). No data to infer from, but downstream still + # looks up input_{idx} by key, so populate with a trivial default. + inputTypes[f"input_{graph_idx}"] = PointerClass(float32_t) + inputOffsets[f"input_{graph_idx}"] = 0 else: values = arr.reshape(-1).astype(np.float32) _type, offset = inferTypeAndOffset(values, signProp = False) From 40c5da922b829deca202619d31f4be462736a9ff Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 20:04:24 +0000 Subject: [PATCH 27/28] training-platform core: apply pre-commit yapf + autoflake autofixes CI pre-commit hook flagged: - yapf: collapse 3-line set comprehension in Tiler.aliasBlocks and rewrap two-line arg lists in trainingUtils.add_training_cmake_flags and run_training_codegen. - autoflake: drop unused `from typing import List` imports in testMVPTraining.py and testMVPOptimizer.py. Reformat only, no semantic change. MLP non-tiled + tiled regression still pass with diff=0.000000 on all four steps. 
--- Deeploy/TilingExtension/TilerExtension.py | 4 +--- DeeployTest/testMVPOptimizer.py | 1 - DeeployTest/testMVPTraining.py | 1 - DeeployTest/testUtils/trainingUtils.py | 8 ++++---- 4 files changed, 5 insertions(+), 9 deletions(-) diff --git a/Deeploy/TilingExtension/TilerExtension.py b/Deeploy/TilingExtension/TilerExtension.py index a11979c5dc..3a583fd452 100644 --- a/Deeploy/TilingExtension/TilerExtension.py +++ b/Deeploy/TilingExtension/TilerExtension.py @@ -406,9 +406,7 @@ def minimalloc(self, memoryMap, ctxt, nodeMemoryConstraint, capacity: int, memor # rejects size-0 entries) and copy their addrSpace from the target # after the solver runs. aliasBlocks = { - block.name - for block in memoryMap - if getattr(ctxt.lookup(block.name), "_alias", None) in blockNames + block.name for block in memoryMap if getattr(ctxt.lookup(block.name), "_alias", None) in blockNames } with open(f"{self._minimalloc_input}.csv", mode = "w", newline = "") as file: diff --git a/DeeployTest/testMVPOptimizer.py b/DeeployTest/testMVPOptimizer.py index f75fe4902e..3a94bf8e48 100644 --- a/DeeployTest/testMVPOptimizer.py +++ b/DeeployTest/testMVPOptimizer.py @@ -28,7 +28,6 @@ import os import sys from pathlib import Path -from typing import List import onnx import onnx_graphsurgeon as gs diff --git a/DeeployTest/testMVPTraining.py b/DeeployTest/testMVPTraining.py index fc2afb231d..c0e4e7c2d8 100644 --- a/DeeployTest/testMVPTraining.py +++ b/DeeployTest/testMVPTraining.py @@ -6,7 +6,6 @@ import json import os import sys -from typing import List import numpy as np import onnx diff --git a/DeeployTest/testUtils/trainingUtils.py b/DeeployTest/testUtils/trainingUtils.py index 1f3e030032..a3386cd7ca 100644 --- a/DeeployTest/testUtils/trainingUtils.py +++ b/DeeployTest/testUtils/trainingUtils.py @@ -217,8 +217,8 @@ def resolve_optimizer_dir(test_dir: str, optimizer_dir: Optional[str]) -> str: return str(test_path.parent / optimizer_name) -def add_training_cmake_flags(cmd: List[str], training: bool, 
n_train_steps: Optional[int], - n_accum_steps: Optional[int], training_num_data_inputs: Optional[int]) -> None: +def add_training_cmake_flags(cmd: List[str], training: bool, n_train_steps: Optional[int], n_accum_steps: Optional[int], + training_num_data_inputs: Optional[int]) -> None: """Append -DTRAINING=ON/OFF plus any known -DN_TRAIN_STEPS / -DN_ACCUM_STEPS / -DTRAINING_NUM_DATA_INPUTS defines to ``cmd``. In-place.""" cmd.append(f"-DTRAINING={'ON' if training else 'OFF'}") @@ -257,8 +257,8 @@ def run_training_codegen(config, script_dir: Path) -> None: if config.tiling: training_script = script_dir / "testMVPTraining.py" optimizer_script = script_dir / "testMVPOptimizer.py" - opt_passthrough = ("--cores", "--l1", "--l2", "--defaultMemLevel", "--memAllocStrategy", - "--searchStrategy", "--plotMemAlloc", "--profileTiling") + opt_passthrough = ("--cores", "--l1", "--l2", "--defaultMemLevel", "--memAllocStrategy", "--searchStrategy", + "--plotMemAlloc", "--profileTiling") stage = "Tiled training" else: training_script = script_dir / "generateTrainingNetwork.py" From 191a30b19c6bab74670e2f0277426bdd81101347 Mon Sep 17 00:00:00 2001 From: runwangdl Date: Fri, 10 Apr 2026 20:05:03 +0000 Subject: [PATCH 28/28] training-platform core: drop CCT2_FT2 from siracusa tiled L3 CI list CCT_Train/CCT2_FT2 ONNX still uses the legacy 1-output SoftmaxCrossEntropyLoss / SoftmaxCrossEntropyLossGrad signature, which this PR deprecated in favour of the canonical 2-output (loss + log_prob) form in 763b4647. The parser now rejects the old form, so CCT2_FT2 fails network generation with 'Did not find adequate mapping for graph ... SoftmaxCrossEntropyLossParser: Exhausted backtracking' at the L3-singlebuffer tiled check. Regenerating CCT_Train fixtures to the canonical 2-output form is out of scope for this PR; drop CCT2_FT2 from the L3 singlebuffer and L3 doublebuffer model lists. The other upstream coverage for CCT_2_32_32_128 (inference-side CCT) is retained. 
--- DeeployTest/test_siracusa_tiled_config.py | 2 -- 1 file changed, 2 deletions(-) diff --git a/DeeployTest/test_siracusa_tiled_config.py b/DeeployTest/test_siracusa_tiled_config.py index a687d9a489..a9eefb6d3e 100644 --- a/DeeployTest/test_siracusa_tiled_config.py +++ b/DeeployTest/test_siracusa_tiled_config.py @@ -139,7 +139,6 @@ "Models/Transformer": [60000, 30000, 15000], "Models/microLlama/microLlama1": [60000, 10000, 5000], "Models/CCT/FP32/CCT_2_32_32_128": [128000], - "Models/CCT_Train/CCT2_FT2": [128000], "Models/TinyViT/Demo": [4000], } @@ -153,6 +152,5 @@ "Models/microLlama/microLlama8": [60000, 20000, 10000], "Models/microLlama/microLlama8_parallel": [60000, 20000, 10000], "Models/CCT/FP32/CCT_2_32_32_128": [128000], - "Models/CCT_Train/CCT2_FT2": [128000], "Models/TinyViT/Demo": [4000], }