[GPU] Fix implicit concat fusing blocked for feature-axis and multi-user predecessors by deepaks2 · Pull Request #36032 · openvinotoolkit/openvino

deepaks2 · 2026-05-21T15:16:33Z

prepare_buffer_fusing was skipping in-place (zero-copy) concat fusing due to two overly conservative guards on the oneDNN static path.

First, a concat_out_l.batch() > 1 check returned false unconditionally for any concat whose output batch size exceeded one. The check was intended for batch-axis concat, where block formats are not contiguous across batch elements. For feature-axis (axis=1) concat, the 64-byte alignment check already in place is the appropriate correctness gate; the batch check is needed only when concat_axis_index == 0.
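The narrowed guard can be illustrated with a minimal standalone sketch. The `Layout` struct and both functions below are hypothetical stand-ins, not the actual `prepare_buffer_fusing` code; they model only the batch condition and ignore the alignment check and every other guard.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the concat output layout (not the real plugin type).
struct Layout {
    int64_t batch;
};

// Old behaviour as described above: any batch > 1 blocks in-place fusing.
bool old_batch_guard_blocks(const Layout& concat_out) {
    return concat_out.batch > 1;
}

// Narrowed behaviour: batch > 1 blocks fusing only for batch-axis concat
// (axis index 0). Feature-axis concat is left to the other checks.
bool new_batch_guard_blocks(const Layout& concat_out, int concat_axis_index) {
    assert(concat_axis_index >= 0);
    return concat_axis_index == 0 && concat_out.batch > 1;
}
```

Under these assumptions, a feature-axis concat (axis index 1) at batch=8 is blocked by the old guard and passes the new one, while a batch-axis concat at batch=8 is still blocked.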

Second, get_users().size() > 2 rejected any predecessor with more than two consumers regardless of what those consumers were. The existing available_pred lambda already enumerates node types whose kernels never access padded regions (convolution, pooling, activation, eltwise, reorder, etc.). Fusing is safe whenever every non-concat user of the predecessor passes that check.
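One plausible reading of the relaxed consumer check is sketched below. The `NodeType` enum, `example_predicate`, and both guard functions are illustrative assumptions rather than the plugin's real types; the predicate is passed in as a parameter so the sketch does not commit to any particular list of safe node types.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical node kinds, for illustration only.
enum class NodeType { Concat, Convolution, Pooling, Eltwise, Other };

using SafePredicate = bool (*)(NodeType);

// Example predicate (an assumption): kinds treated as never touching padding.
bool example_predicate(NodeType t) {
    return t == NodeType::Convolution || t == NodeType::Pooling ||
           t == NodeType::Eltwise;
}

// Old guard in isolation: more than two users always rejects.
bool old_user_guard_allows(const std::vector<NodeType>& users) {
    return users.size() <= 2;
}

// Relaxed guard: every non-concat user must satisfy the predicate,
// regardless of how many users there are.
bool new_user_guard_allows(const std::vector<NodeType>& users,
                           SafePredicate is_safe) {
    assert(is_safe != nullptr);
    return std::all_of(users.begin(), users.end(), [is_safe](NodeType t) {
        return t == NodeType::Concat || is_safe(t);
    });
}
```

With this sketch, a predecessor feeding a concat, a convolution, and a pooling node is rejected by the count-based guard but accepted by the type-based one; a predecessor with a single unlisted consumer is rejected by the type-based guard even though it has only two users.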

On YOLOv8s FP16 at batch=8, the explicit concat kernel count drops from 15 to 2, reducing latency by ~36% on Intel iGPU.

Two unit tests are added to prepare_buffer_fusing_test.cpp covering the corrected code paths; all 37 existing tests continue to pass.

Tickets:
CVS-187256

AI Assistance:
AI assistance used: yes
AI was used for test generation; a human reviewed the tests and ran all unit tests manually.

clee30 · 2026-05-22T01:56:01Z

build_jenkins

p-durandin · 2026-05-22T12:28:23Z

build_jenkins

clee30 · 2026-05-26T03:55:01Z

build_jenkins

deepaks2 · 2026-06-02T08:56:29Z

build_jenkins

deepaks2 · 2026-06-02T12:59:14Z

build_jenkins

deepaks2 · 2026-06-02T14:12:46Z

build_jenkins

…r batch>1

The batch=2 feature-axis case had is_implicit_concat=false, reflecting the old prepare_buffer_fusing behaviour where batch > 1 unconditionally disabled in-place fusing on the oneDNN path. After tightening that guard to apply only when concat_axis_index == 0 (batch-axis concat), feature-axis concat at batch=2 is correctly fused. Update is_implicit_concat to true to match the corrected behaviour; the existing diff_count == 0 assertion in the test body verifies the output values are bit-exact. Signed-off-by: S, Deepak <deepak.s@intel.com>

Replace NaN/Inf output checks with element-wise comparison against a reference network running an explicit (non-fused) concat over the same random inputs. This catches buffer aliasing bugs where the implicit path produces numerically valid but incorrect values. The batch>1 test switches to a two-network implicit-vs-explicit comparison to avoid hard-coding layout assumptions for block formats at batch>1. The multi-user test adds force_implementations to pin the oneDNN fusing code path regardless of the runtime heuristic. Signed-off-by: S, Deepak <deepak.s@intel.com>
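The difference between a NaN/Inf check and a reference comparison can be shown with a small standalone sketch. The helper names `all_finite` and `count_mismatches` are hypothetical and are not taken from the test file; plain vectors stand in for network outputs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Weak check: passes as long as no value is NaN or Inf.
bool all_finite(const std::vector<float>& out) {
    for (float v : out) {
        if (!std::isfinite(v)) return false;
    }
    return true;
}

// Strong check: counts positions where the output differs from a reference
// produced by the explicit (non-fused) path over the same inputs.
std::size_t count_mismatches(const std::vector<float>& actual,
                             const std::vector<float>& reference) {
    assert(actual.size() == reference.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        if (actual[i] != reference[i]) ++mismatches;
    }
    return mismatches;
}
```

If an aliasing bug makes the second half of the output repeat the first half, every value is still finite, so `all_finite` passes, while `count_mismatches` reports the overwritten positions.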

…check

The available_pred lambda guards output-padding support: whether a node type's kernel can write output with buffer gaps for in-place aliasing. The multi-user predecessor check in concat_in_place_optimization reused that lambda to validate non-concat consumers, but the question there is different: whether the consumer's kernel reads input by coordinate and therefore skips padding correctly. Introduce reads_padded_input_safely with that explicit semantic and use it for the multi-user consumer check. The list is narrower than available_pred: reorder and permute are excluded because some of their implementations copy over raw buffer byte ranges and would include padding bytes. The types retained (convolution, deconvolution, pooling, activation, eltwise, quantize) address input by explicit tensor coordinate in all GPU kernel variants. Signed-off-by: S, Deepak <deepak.s@intel.com>
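A minimal sketch of such a predicate follows, assuming a hypothetical `NodeType` enum. The real `reads_padded_input_safely` operates on the plugin's program nodes and its signature is not shown here; only the retained and excluded type lists come from the commit message above.

```cpp
#include <cassert>

// Hypothetical node kinds, for illustration only.
enum class NodeType {
    Convolution, Deconvolution, Pooling, Activation, Eltwise, Quantize,
    Reorder, Permute
};

// Sketch: true for kinds that address input by tensor coordinate, so padding
// in the predecessor's buffer is skipped on read.
bool reads_padded_input_safely_sketch(NodeType t) {
    switch (t) {
        case NodeType::Convolution:
        case NodeType::Deconvolution:
        case NodeType::Pooling:
        case NodeType::Activation:
        case NodeType::Eltwise:
        case NodeType::Quantize:
            return true;
        default:
            // Reorder and permute are excluded: some implementations copy raw
            // byte ranges and would pick up padding bytes.
            return false;
    }
}

// Small usage check.
void demo_reads_padded_input_safely() {
    assert(reads_padded_input_safely_sketch(NodeType::Convolution));
    assert(!reads_padded_input_safely_sketch(NodeType::Reorder));
}
```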

Replace uniform constant fill (1.0f / 2.0f) with sequential 0..N-1 values so any buffer-overlap or aliasing regression produces a wrong value at a specific index rather than going undetected. Update output assertions to match: concat halves checked element-wise, relu output equals input, clamp output is 0.0f at index 0 and 1.0f elsewhere. Signed-off-by: S, Deepak <deepak.s@intel.com>
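Why a sequential fill is more sensitive than a constant fill can be sketched in isolation. Everything below is illustrative: `shifted_copy` is a made-up stand-in for a misplaced buffer offset, not a bug observed in the plugin.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Every element holds the same value.
std::vector<float> constant_fill(std::size_t n, float value) {
    return std::vector<float>(n, value);
}

// Element i holds float(i), i.e. 0..N-1.
std::vector<float> sequential_fill(std::size_t n) {
    std::vector<float> v(n);
    std::iota(v.begin(), v.end(), 0.0f);
    return v;
}

// Simulated defect: every read is off by one element (with wrap-around).
std::vector<float> shifted_copy(const std::vector<float>& in) {
    assert(!in.empty());
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = in[(i + 1) % in.size()];
    }
    return out;
}
```

With a constant fill, the shifted copy is indistinguishable from the input, so the defect goes unnoticed. With a sequential fill, index 0 holds 1.0f instead of 0.0f and the comparison fails at a specific position.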

Replace uniform constant fill (1.0f / 2.0f) with sequential 0..N-1 values so any buffer-overlap or aliasing bug produces a mismatch at a specific index rather than going undetected. Update output assertions: concat halves and conv output checked element-wise against float(i); relu output is identity for all non-negative inputs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Force both shared_r and conv to b_fs_yx_fsv16 + oneDNN. Since their formats already match, reorder_inputs inserts no intermediate reorder, so oneDNN conv reads directly from the padded predecessor buffer — confirming the safety of reads_padded_input_safely for oneDNN convolution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

abs preserves the full natural-number range of the input (0..N-1), making buffer-overlap bugs immediately visible at every index. clamp(0,1) was masking all values above 1, reducing detectability. Also swaps concat input order (other_r first) and offsets d2 by 512 so the two halves of the concat output are unambiguously distinguishable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
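The masking effect of clamp(0,1) compared with abs can be sketched on plain vectors. The helper names are hypothetical, and the swapped elements stand in for an arbitrary buffer-overlap defect.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// clamp(0, 1): every input above 1 collapses to 1.
std::vector<float> apply_clamp01(const std::vector<float>& in) {
    std::vector<float> out;
    out.reserve(in.size());
    for (float v : in) out.push_back(std::min(1.0f, std::max(0.0f, v)));
    return out;
}

// abs: identity on non-negative inputs, so 0..N-1 stays distinguishable.
std::vector<float> apply_abs(const std::vector<float>& in) {
    std::vector<float> out;
    out.reserve(in.size());
    for (float v : in) out.push_back(std::fabs(v));
    return out;
}

// Small usage check: swapping two elements above 1 is invisible after clamp
// but visible after abs.
void demo_abs_vs_clamp() {
    const std::vector<float> good = {0.0f, 1.0f, 2.0f, 3.0f};
    const std::vector<float> bad = {0.0f, 1.0f, 3.0f, 2.0f};
    assert(apply_clamp01(good) == apply_clamp01(bad));
    assert(apply_abs(good) != apply_abs(bad));
}
```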

Swap concat input order (other_r first) so shared_r lands in the second slot, exercising a more asymmetric padding layout. Offset d2 by 512 so the two concat halves are unambiguously distinguishable — any buffer aliasing or ordering bug produces a clearly wrong value rather than going unnoticed when both halves held identical data. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Reduce comments to a minimum, only for cases where the code is not obvious Signed-off-by: S, Deepak <deepak.s@intel.com>

p-durandin · 2026-06-03T07:21:09Z

build_jenkins

p-durandin · 2026-06-03T14:05:16Z

build_jenkins

deepaks2 requested review from a team as code owners May 21, 2026 15:16

github-actions Bot added the category: GPU OpenVINO GPU plugin label May 21, 2026

deepaks2 marked this pull request as draft May 21, 2026 15:16

sys-openvino-ci added the ExternalIntelPR External contributor from Intel label May 21, 2026

deepaks2 marked this pull request as ready for review May 22, 2026 05:25

Lyamin-Roman reviewed May 22, 2026

Comment thread src/plugins/intel_gpu/src/graph/graph_optimizer/prepare_buffer_fusing.cpp Outdated

Comment thread src/plugins/intel_gpu/src/graph/graph_optimizer/prepare_buffer_fusing.cpp

Lyamin-Roman reviewed May 22, 2026

Comment thread src/plugins/intel_gpu/tests/unit/passes/prepare_buffer_fusing_test.cpp Outdated

deepaks2 requested a review from Lyamin-Roman May 27, 2026 04:32

Lyamin-Roman reviewed Jun 1, 2026

deepaks2 requested a review from Lyamin-Roman June 2, 2026 08:56

deepaks2 force-pushed the gpu-optimize-concat-buffer-fusing branch 2 times, most recently from 3a66690 to 01bd5f8 Compare June 2, 2026 14:05

deepaks2 added 3 commits June 2, 2026 22:30

deepaks2 force-pushed the gpu-optimize-concat-buffer-fusing branch from 01bd5f8 to 418cf02 Compare June 2, 2026 14:39

deepaks2 and others added 2 commits June 2, 2026 22:51

deepaks2 force-pushed the gpu-optimize-concat-buffer-fusing branch from 3b151cd to 2970ebc Compare June 2, 2026 15:00

deepaks2 and others added 4 commits June 2, 2026 23:08

[GPU] Updating the comments in the source code

8dbc003

Reduce comments to a minimum, only for cases where the code is not obvious Signed-off-by: S, Deepak <deepak.s@intel.com>

Lyamin-Roman approved these changes Jun 3, 2026

p-durandin enabled auto-merge June 3, 2026 10:22

deepaks2 closed this Jun 3, 2026

auto-merge was automatically disabled June 3, 2026 12:40
Pull request was closed

deepaks2 deleted the gpu-optimize-concat-buffer-fusing branch June 3, 2026 12:40

deepaks2 restored the gpu-optimize-concat-buffer-fusing branch June 3, 2026 12:43

deepaks2 reopened this Jun 3, 2026

Merge branch 'master' into gpu-optimize-concat-buffer-fusing

255aeb0

p-durandin enabled auto-merge June 3, 2026 15:00

p-durandin added this pull request to the merge queue Jun 3, 2026

Merged via the queue into openvinotoolkit:master with commit d0940ab Jun 3, 2026
188 checks passed

p-durandin merged 10 commits into openvinotoolkit:master from deepaks2:gpu-optimize-concat-buffer-fusing
