[GPU] Fix implicit concat fusing blocked for feature-axis and multi-user predecessors#36032
Merged
p-durandin merged 10 commits intoJun 3, 2026
Merged
Conversation
Contributor
|
build_jenkins |
Contributor
|
build_jenkins |
Contributor
|
build_jenkins |
Lyamin-Roman
reviewed
Jun 1, 2026
Contributor
Author
|
build_jenkins |
Contributor
Author
|
build_jenkins |
3a66690 to
01bd5f8
Compare
Contributor
Author
|
build_jenkins |
…r batch>1 The batch=2 feature-axis case had is_implicit_concat=false, reflecting the old prepare_buffer_fusing behaviour where batch > 1 unconditionally disabled in-place fusing on the oneDNN path. After tightening that guard to apply only when concat_axis_index == 0 (batch-axis concat), feature-axis concat at batch=2 is correctly fused. Update is_implicit_concat to true to match the corrected behaviour; the existing diff_count == 0 assertion in the test body verifies the output values are bit-exact. Signed-off-by: S, Deepak <deepak.s@intel.com>
Replace NaN/Inf output checks with element-wise comparison against a reference network running an explicit (non-fused) concat over the same random inputs. This catches buffer aliasing bugs where the implicit path produces numerically valid but incorrect values. The batch>1 test switches to a two-network implicit-vs-explicit comparison to avoid hard-coding layout assumptions for block formats at batch>1. The multi-user test adds force_implementations to pin the oneDNN fusing code path regardless of the runtime heuristic. Signed-off-by: S, Deepak <deepak.s@intel.com>
…check The available_pred lambda guards output-padding support: whether a node type's kernel can write output with buffer gaps for in-place aliasing. The multi-user predecessor check in concat_in_place_optimization reused that lambda to validate non-concat consumers, but the question there is different: whether the consumer's kernel reads input by coordinate and therefore skips padding correctly. Introduce reads_padded_input_safely with that explicit semantic and use it for the multi-user consumer check. The list is narrower than available_pred: reorder and permute are excluded because some of their implementations copy over raw buffer byte ranges and would include padding bytes. The types retained (convolution, deconvolution, pooling, activation, eltwise, quantize) address input by explicit tensor coordinate in all GPU kernel variants. Signed-off-by: S, Deepak <deepak.s@intel.com>
01bd5f8 to
418cf02
Compare
Replace uniform constant fill (1.0f / 2.0f) with sequential 0..N-1 values so any buffer-overlap or aliasing regression produces a wrong value at a specific index rather than going undetected. Update output assertions to match: concat halves checked element-wise, relu output equals input, clamp output is 0.0f at index 0 and 1.0f elsewhere. Signed-off-by: S, Deepak <deepak.s@intel.com>
Replace uniform constant fill (1.0f / 2.0f) with sequential 0..N-1 values so any buffer-overlap or aliasing bug produces a mismatch at a specific index rather than going undetected. Update output assertions: concat halves and conv output checked element-wise against float(i); relu output is identity for all non-negative inputs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
3b151cd to
2970ebc
Compare
Force both shared_r and conv to b_fs_yx_fsv16 + oneDNN. Since their formats already match, reorder_inputs inserts no intermediate reorder, so oneDNN conv reads directly from the padded predecessor buffer — confirming the safety of reads_padded_input_safely for oneDNN convolution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
abs preserves the full natural-number range of the input (0..N-1), making buffer-overlap bugs immediately visible at every index. clamp(0,1) was masking all values above 1, reducing detectability. Also swaps concat input order (other_r first) and offsets d2 by 512 so the two halves of the concat output are unambiguously distinguishable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Swap concat input order (other_r first) so shared_r lands in the second slot, exercising a more asymmetric padding layout. Offset d2 by 512 so the two concat halves are unambiguously distinguishable — any buffer aliasing or ordering bug produces a clearly wrong value rather than going unnoticed when both halves held identical data. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reduce comments to a minimum, only for cases where the code is not obvious Signed-off-by: S, Deepak <deepak.s@intel.com>
Contributor
|
build_jenkins |
Lyamin-Roman
approved these changes
Jun 3, 2026
Contributor
|
build_jenkins |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
prepare_buffer_fusing was skipping in-place (zero-copy) concat fusing due to two overly conservative guards on the oneDNN static path.
First, concat_out_l.batch() > 1 was returning false unconditionally for any concat when the output batch size exceeded one. The check was intended for batch-axis concat, where block formats are not contiguous across batch elements. For feature-axis (axis=1) concat the 64-byte alignment check already in place is the correct correctness gate; the batch check is only needed when concat_axis_index == 0.
Second, get_users().size() > 2 rejected any predecessor with more than two consumers regardless of what those consumers were. The existing available_pred lambda already enumerates node types whose kernels never access padded regions (convolution, pooling, activation, eltwise, reorder, etc.). Fusing is safe whenever every non-concat user of the predecessor passes that check.
On YOLOv8s FP16 explicit concat kernel count drops from 15 to 2 at batch=8, reducing latency by ~36% at batch=8 on Intel iGPU.
Two unit tests are added to prepare_buffer_fusing_test.cpp covering the corrected code paths; all 37 existing tests continue to pass.
Tickets:
CVS-187256
AI Assistance:
*AI assistance used: yes
AI was used for test generation and Human reviewed and run all unit tests manually