[GPU] Fix implicit concat fusing blocked for feature-axis and multi-user predecessors #36032

Merged
p-durandin merged 10 commits into openvinotoolkit:master from deepaks2:gpu-optimize-concat-buffer-fusing on Jun 3, 2026
@deepaks2
Contributor

@deepaks2 deepaks2 commented May 21, 2026

prepare_buffer_fusing was skipping in-place (zero-copy) concat fusing due to two overly conservative guards on the oneDNN static path.

First, the concat_out_l.batch() > 1 check returned false for every concat whose output batch size exceeded one. The check was intended for batch-axis concat, where block formats are not contiguous across batch elements. For feature-axis (axis=1) concat, the existing 64-byte alignment check is the appropriate correctness gate; the batch check is needed only when concat_axis_index == 0.
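
The narrowed guard can be sketched roughly as follows. This is a simplified illustration, not the code in prepare_buffer_fusing.cpp: `ConcatInfo`, `old_batch_guard_blocks`, `new_batch_guard_blocks`, and `demo_batch_guard` are hypothetical names, and only the `concat_axis_index == 0` and batch > 1 conditions come from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of the values the guard inspects.
struct ConcatInfo {
    std::int64_t concat_axis_index;  // 0 = batch axis, 1 = feature axis
    std::int64_t output_batch;       // batch size of the concat output
};

// Old behaviour: any output batch > 1 blocked in-place fusing.
bool old_batch_guard_blocks(const ConcatInfo& c) {
    return c.output_batch > 1;
}

// Narrowed behaviour: batch > 1 blocks fusing only for batch-axis concat.
bool new_batch_guard_blocks(const ConcatInfo& c) {
    return c.concat_axis_index == 0 && c.output_batch > 1;
}

// Usage check: feature-axis concat at batch=8 is no longer blocked.
inline void demo_batch_guard() {
    const ConcatInfo feature_axis{1, 8};
    assert(old_batch_guard_blocks(feature_axis));
    assert(!new_batch_guard_blocks(feature_axis));
}
```

In this sketch the feature-axis case still has to pass the separate alignment check, which is not modelled here.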

Second, get_users().size() > 2 rejected any predecessor with more than two consumers regardless of what those consumers were. The existing available_pred lambda already enumerates node types whose kernels never access padded regions (convolution, pooling, activation, eltwise, reorder, etc.). Fusing is safe whenever every non-concat user of the predecessor passes that check.
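
A minimal sketch of the relaxed multi-user rule, assuming a toy `Node` type that carries only a primitive type name. `Node`, `old_guard_allows`, `new_guard_allows`, and the example predicate are hypothetical stand-ins, not the plugin's program_node API; the safety predicate is passed in rather than fixed, since the real list of safe types lives in the pass itself.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a graph node: only its primitive type name.
struct Node {
    std::string type;
};

// Old guard: reject on consumer count alone.
bool old_guard_allows(const std::vector<Node>& users) {
    return users.size() <= 2;
}

// Relaxed guard: every non-concat user must satisfy a safety predicate.
template <typename Pred>
bool new_guard_allows(const std::vector<Node>& users, Pred is_safe_user) {
    return std::all_of(users.begin(), users.end(), [&](const Node& u) {
        return u.type == "concatenation" || is_safe_user(u);
    });
}

// Usage check: three consumers, all acceptable under the relaxed rule.
inline void demo_multi_user_guard() {
    const auto is_safe = [](const Node& n) {
        return n.type == "convolution" || n.type == "eltwise";
    };
    const std::vector<Node> users = {Node{"concatenation"}, Node{"convolution"}, Node{"eltwise"}};
    assert(!old_guard_allows(users));
    assert(new_guard_allows(users, is_safe));
}
```

A single consumer that fails the predicate is enough to reject fusing, so the rule stays conservative for node types that are not explicitly listed.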

On YOLOv8s FP16 at batch=8, the explicit concat kernel count drops from 15 to 2, reducing latency by ~36% on Intel iGPU.

Two unit tests are added to prepare_buffer_fusing_test.cpp covering the corrected code paths; all 37 existing tests continue to pass.

Tickets:
CVS-187256

AI Assistance:
AI assistance used: yes
AI was used for test generation; a human reviewed the tests and ran all unit tests manually.

@deepaks2 deepaks2 requested review from a team as code owners May 21, 2026 15:16
@github-actions github-actions Bot added the category: GPU OpenVINO GPU plugin label May 21, 2026
@deepaks2 deepaks2 marked this pull request as draft May 21, 2026 15:16
@sys-openvino-ci sys-openvino-ci added the ExternalIntelPR External contributor from Intel label May 21, 2026
@clee30
Contributor

clee30 commented May 22, 2026

build_jenkins

@deepaks2 deepaks2 marked this pull request as ready for review May 22, 2026 05:25
@p-durandin
Contributor

build_jenkins

Comment thread src/plugins/intel_gpu/src/graph/graph_optimizer/prepare_buffer_fusing.cpp Outdated
Comment thread src/plugins/intel_gpu/tests/unit/passes/prepare_buffer_fusing_test.cpp Outdated
@clee30
Contributor

clee30 commented May 26, 2026

build_jenkins

@deepaks2 deepaks2 requested a review from Lyamin-Roman May 27, 2026 04:32
Comment thread src/plugins/intel_gpu/tests/unit/test_cases/concatenation_gpu_test.cpp Outdated
Comment thread src/plugins/intel_gpu/src/graph/graph_optimizer/prepare_buffer_fusing.cpp Outdated
Comment thread src/plugins/intel_gpu/tests/unit/passes/prepare_buffer_fusing_test.cpp Outdated
Comment thread src/plugins/intel_gpu/tests/unit/passes/prepare_buffer_fusing_test.cpp Outdated
@deepaks2
Contributor Author

deepaks2 commented Jun 2, 2026

build_jenkins

@deepaks2 deepaks2 requested a review from Lyamin-Roman June 2, 2026 08:56
@deepaks2
Contributor Author

deepaks2 commented Jun 2, 2026

build_jenkins

@deepaks2 deepaks2 force-pushed the gpu-optimize-concat-buffer-fusing branch 2 times, most recently from 3a66690 to 01bd5f8 Compare June 2, 2026 14:05
@deepaks2
Contributor Author

deepaks2 commented Jun 2, 2026

build_jenkins

deepaks2 added 3 commits June 2, 2026 22:30
…r batch>1

The batch=2 feature-axis case had is_implicit_concat=false, reflecting the
old prepare_buffer_fusing behaviour where batch > 1 unconditionally disabled
in-place fusing on the oneDNN path. After tightening that guard to apply only
when concat_axis_index == 0 (batch-axis concat), feature-axis concat at
batch=2 is correctly fused. Update is_implicit_concat to true to match the
corrected behaviour; the existing diff_count == 0 assertion in the test body
verifies the output values are bit-exact.

Signed-off-by: S, Deepak <deepak.s@intel.com>
Replace NaN/Inf output checks with element-wise comparison against a
reference network running an explicit (non-fused) concat over the same
random inputs. This catches buffer aliasing bugs where the implicit path
produces numerically valid but incorrect values.

The batch>1 test switches to a two-network implicit-vs-explicit comparison
to avoid hard-coding layout assumptions for block formats at batch>1. The
multi-user test adds force_implementations to pin the oneDNN fusing code
path regardless of the runtime heuristic.

Signed-off-by: S, Deepak <deepak.s@intel.com>
…check

The available_pred lambda guards output-padding support: whether a node
type's kernel can write output with buffer gaps for in-place aliasing. The
multi-user predecessor check in concat_in_place_optimization reused that
lambda to validate non-concat consumers, but the question there is different:
whether the consumer's kernel reads input by coordinate and therefore skips
padding correctly.

Introduce reads_padded_input_safely with that explicit semantic and use it
for the multi-user consumer check. The list is narrower than available_pred:
reorder and permute are excluded because some of their implementations copy
over raw buffer byte ranges and would include padding bytes. The types
retained (convolution, deconvolution, pooling, activation, eltwise, quantize)
address input by explicit tensor coordinate in all GPU kernel variants.

Signed-off-by: S, Deepak <deepak.s@intel.com>
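
The distinction drawn in this commit message, between reading input by coordinate and copying a raw byte range, can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `PaddedBuffer`, `read_by_coordinate`, `read_raw_range`, and `demo_padded_read` are hypothetical, the buffer is one-dimensional, and real GPU kernels are considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D buffer with padding elements around the valid data, as
// seen when a predecessor writes directly into a larger concat buffer.
struct PaddedBuffer {
    std::vector<float> storage;  // lower pad + valid data + upper pad
    std::size_t lower_pad;       // number of padding elements before the data
    std::size_t size;            // number of valid elements
};

// Coordinate-based read: logical index i maps to storage[lower_pad + i].
std::vector<float> read_by_coordinate(const PaddedBuffer& b) {
    std::vector<float> out(b.size);
    for (std::size_t i = 0; i < b.size; ++i) {
        out[i] = b.storage[b.lower_pad + i];
    }
    return out;
}

// Raw-range read: copies the first `size` elements and ignores the offset,
// so padding elements leak into the result.
std::vector<float> read_raw_range(const PaddedBuffer& b) {
    const auto count = static_cast<std::ptrdiff_t>(b.size);
    return std::vector<float>(b.storage.begin(), b.storage.begin() + count);
}

// Usage check: -1 marks padding; valid data is 0, 1, 2, 3.
inline void demo_padded_read() {
    const PaddedBuffer b{{-1.f, -1.f, 0.f, 1.f, 2.f, 3.f, -1.f}, 2, 4};
    const std::vector<float> expected{0.f, 1.f, 2.f, 3.f};
    assert(read_by_coordinate(b) == expected);
    assert(read_raw_range(b) != expected);
}
```
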
@deepaks2 deepaks2 force-pushed the gpu-optimize-concat-buffer-fusing branch from 01bd5f8 to 418cf02 Compare June 2, 2026 14:39
deepaks2 and others added 2 commits June 2, 2026 22:51
Replace uniform constant fill (1.0f / 2.0f) with sequential 0..N-1 values so
any buffer-overlap or aliasing regression produces a wrong value at a specific
index rather than going undetected.

Update output assertions to match: concat halves checked element-wise, relu
output equals input, clamp output is 0.0f at index 0 and 1.0f elsewhere.

Signed-off-by: S, Deepak <deepak.s@intel.com>
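
Why sequential values detect more than a constant fill can be seen in a small standalone sketch. The helper names (`sequential_fill`, `diff_count`, `alias_second_half`) are hypothetical and the aliasing bug is simulated on a plain vector; none of this is the actual test code.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Fill with 0, 1, ..., n-1 so every position carries a unique value.
std::vector<float> sequential_fill(std::size_t n) {
    std::vector<float> v(n);
    std::iota(v.begin(), v.end(), 0.0f);
    return v;
}

// Count positions where two equally sized buffers differ.
std::size_t diff_count(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::size_t n = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++n;
    }
    return n;
}

// Simulated aliasing bug: the second half is overwritten by the first half.
std::vector<float> alias_second_half(std::vector<float> v) {
    const std::size_t half = v.size() / 2;
    for (std::size_t i = 0; i < half; ++i) {
        v[half + i] = v[i];
    }
    return v;
}
```

With a constant fill the simulated bug leaves the buffer unchanged and `diff_count` reports zero; with `sequential_fill` every overwritten position shows up as a mismatch.
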
Replace uniform constant fill (1.0f / 2.0f) with sequential 0..N-1 values
so any buffer-overlap or aliasing bug produces a mismatch at a specific index
rather than going undetected.

Update output assertions: concat halves and conv output checked element-wise
against float(i); relu output is identity for all non-negative inputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@deepaks2 deepaks2 force-pushed the gpu-optimize-concat-buffer-fusing branch from 3b151cd to 2970ebc Compare June 2, 2026 15:00
deepaks2 and others added 4 commits June 2, 2026 23:08
Force both shared_r and conv to b_fs_yx_fsv16 + oneDNN. Since their formats
already match, reorder_inputs inserts no intermediate reorder, so oneDNN conv
reads directly from the padded predecessor buffer — confirming the safety of
reads_padded_input_safely for oneDNN convolution.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
abs preserves the full natural-number range of the input (0..N-1),
making buffer-overlap bugs immediately visible at every index.
clamp(0,1) was masking all values above 1, reducing detectability.

Also swaps concat input order (other_r first) and offsets d2 by 512
so the two halves of the concat output are unambiguously distinguishable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Swap concat input order (other_r first) so shared_r lands in the
second slot, exercising a more asymmetric padding layout. Offset d2
by 512 so the two concat halves are unambiguously distinguishable —
any buffer aliasing or ordering bug produces a clearly wrong value
rather than going unnoticed when both halves held identical data.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reduce comments to a minimum, only for cases where the code is not obvious

Signed-off-by: S, Deepak <deepak.s@intel.com>
@p-durandin
Contributor

build_jenkins

@p-durandin p-durandin enabled auto-merge June 3, 2026 10:22
@deepaks2 deepaks2 closed this Jun 3, 2026
auto-merge was automatically disabled June 3, 2026 12:40

Pull request was closed

@deepaks2 deepaks2 deleted the gpu-optimize-concat-buffer-fusing branch June 3, 2026 12:40
@deepaks2 deepaks2 restored the gpu-optimize-concat-buffer-fusing branch June 3, 2026 12:43
@deepaks2 deepaks2 reopened this Jun 3, 2026
@p-durandin
Contributor

build_jenkins

@p-durandin p-durandin enabled auto-merge June 3, 2026 15:00
@p-durandin p-durandin added this pull request to the merge queue Jun 3, 2026
Merged via the queue into openvinotoolkit:master with commit d0940ab Jun 3, 2026
188 checks passed
Labels

category: GPU OpenVINO GPU plugin ExternalIntelPR External contributor from Intel

5 participants