Deduplicate compute_number_of_edges CUB/Thrust kernels via explicit instantiation#5489
bdice wants to merge 12 commits into rapidsai:main
Conversation
…nstantiation

Move `edge_partition_device_view_t::compute_number_of_edges*` and `compute_local_degrees*` method bodies from inline definitions in `edge_partition_device_view.cuh` to a separate `_impl.cuh` file, with explicit template instantiations in dedicated `.cu` translation units. This prevents 40+ TUs from each compiling their own copy of the CUB `DeviceReduce` and Thrust `transform_reduce` kernels for these methods.

Two prims headers that required rare `cuda::transform_iterator` types (built from lambdas) had those lambdas replaced with named functors so the iterator types can be explicitly instantiated without leaking the heavy `_impl.cuh` back into broadly-included headers.

Binary size reduction vs. baseline:
- libcugraph.so: 475 MB -> 459 MB (-16 MB, -3.4%)
- libcugraph_mg.so: 576 MB -> 520 MB (-56 MB, -9.7%)
- Combined: 1051 MB -> 979 MB (-72 MB, -6.9%)
…via inline wrappers
…op_t and local_degree_with_mask_op_t into single functor
…st-cast overloads
seunghwak
left a comment
Thanks for the contribution; this looks like the right approach and will be very helpful in cutting the cuGraph binary size. I have some suggestions to better follow cuGraph coding conventions.
{
  return range_first + static_cast<vertex_t>(v_offset);
}
};
We already have shift_right_t, no need to re-define here.
See
https://github.com/rapidsai/cugraph/blob/main/cpp/include/cugraph/utilities/device_functors.cuh#L157
  return range_first + static_cast<vertex_t>(v_offset);
  }));
detail::bitmap_offset_to_vertex_op_t<vertex_t>{
  local_frontier_range_firsts[partition_idx]});
This should work better than re-defining a new functor.
#include <cugraph/utilities/device_functors.cuh>
...
detail::shift_right_t<vertex_t>{local_frontier_range_firsts[partition_idx]}
Modified to use shift_right_t
if (i < major_sparse_range_size) { return major_range_first + i; }
return *(dcs_nzd_vertices + (i - major_sparse_range_size));
}
};
This functor is pretty much the opposite of
https://github.com/rapidsai/cugraph/blob/main/cpp/include/cugraph/edge_partition_device_view.cuh#L36
(but the one above is a function, while this is a functor).
We may be better off co-locating these two in a single file.
One suggestion is to create
major_from_major_hypersparse_idx_nocheck_impl and create a functor using this function.
sh_iter_32_t,
sh_iter_32_t,
rmm::cuda_stream_view) const;
So, here we are explicitly instantiating for
MajorIterator = vertex_t const*
or
MajorIterator = thrust::counting_iterator<vertex_t>
and
MaskIterator = uint32_t const*
To make this more explicit,
what about creating
compute_(number_of_edges|local_degrees)(_with_mask)(_async) functions that take
raft::device_span<vertex_t const> majors,
raft::device_span<uint32_t const> masks
or
raft::device_span<uint32_t const> masks,
vertex_t major_first,
vertex_t major_last
and implement these functions using a detail-space function taking MaskIterator and MajorIterator?
This will make using the public standard functions more explicit (and better resemble cugraph conventions) while still allowing use of the detail-space function, which works with any valid thrust fancy iterator.
Maybe one complication this introduces is that
inside a primitive we often don't know whether a passed iterator is a pointer or a fancy iterator. So, we may need to do something like
if constexpr (std::is_pointer_v<MajorIterator>) {
...
}
else {
...
}
but the simplification in the edge_partition_device_view_t class's public interface might be worth more than the increase in the code complexity in primitive internals.
I modified the interface to use raft::device_span for the masks. I left the iterator approach in place for MajorIterator: extract_transform_if_v_frontier_e passes a transform iterator that has optimizations for a bitmask-compacted frontier. I didn't see a clean way to make an API that would accommodate that without making a very specific interface, so leaving MajorIterator as is seemed cleaner to me.
…ore consistent with other naming
… od_shortest_path
handle.get_stream());
auto edge_partition_e_mask_span =
  raft::device_span<uint32_t const>(edge_partition_e_mask->value_first(),
                                    static_cast<size_t>(edge_partition.number_of_edges()));
static_cast<size_t>(edge_partition.number_of_edges())
=>
packed_bool_size(static_cast<size_t>(edge_partition.number_of_edges()))
I think we are providing the wrong size in multiple places in this PR, we need to fix all of them.
edge_partition.compute_number_of_edges_with_mask_async(
auto edge_partition_e_mask_span = raft::device_span<uint32_t const>(
  edge_partition_e_mask->value_first(),
  static_cast<size_t>(edge_partition.number_of_edges()));

edge_partition.compute_number_of_edges_with_mask_async(
auto edge_partition_e_mask_span = raft::device_span<uint32_t const>(
  edge_partition_e_mask->value_first(),
  static_cast<size_t>(edge_partition.number_of_edges()));

(*edge_partition_e_mask).value_first(),
raft::device_span<uint32_t const>(
  (*edge_partition_e_mask).value_first(),
  static_cast<size_t>(edge_partition.number_of_edges())),

if (edge_partition_e_mask) {
  edge_partition_mask_span =
    raft::device_span<uint32_t const>((*edge_partition_e_mask).value_first(),
                                      static_cast<size_t>(edge_partition.number_of_edges()));

(*edge_partition_e_mask).value_first(), handle.get_stream());
auto edge_partition_mask_span =
  raft::device_span<uint32_t const>((*edge_partition_e_mask).value_first(),
                                    static_cast<size_t>(edge_partition.number_of_edges()));

if (edge_partition_e_mask) {
  auto edge_partition_mask_span =
    raft::device_span<uint32_t const>((*edge_partition_e_mask).value_first(),
                                      static_cast<size_t>(edge_partition.number_of_edges()));
template void view_t::compute_number_of_edges_with_mask_async<int32_t const*>(
  raft::device_span<uint32_t const>,
  int32_t const*,
  int32_t const*,
Taking both raft::device_span and raw pointers looks a bit odd here. Should we instead take both the edge mask and the vertex list as device_span?
Cleaned up in latest push
template void view_t::compute_number_of_edges_with_mask_async<thrust::counting_iterator<int32_t>>(
  raft::device_span<uint32_t const>,
  thrust::counting_iterator<int32_t>,
  thrust::counting_iterator<int32_t>,
Can we take local_vertex_partition_range_first and local_vertex_partition_range_last instead (and create counting iterators inside)? It will be more consistent with the rest of cugraph.
Made it a tuple of range_first/range_last; the compiler was having trouble disambiguating otherwise.
thrust_tuple_get_or_identity<decltype(frontier_key_last), 0>(frontier_key_last);
auto frontier_majors = raft::device_span<vertex_t const>{
  thrust_tuple_get_or_identity<decltype(frontier_key_first), 0>(frontier_key_first),
  static_cast<size_t>(cuda::std::distance(frontier_key_first, frontier_key_last))};
We now have repeated_vertex_bucket_view_t; this generates keys on the fly using transform_iterator.
https://github.com/rapidsai/cugraph/blob/main/cpp/include/cugraph/prims/vertex_frontier.cuh#L714
If repeated_vertex_bucket_view_t is used, this code won't work.
I think we should support this case as well.
if constexpr (std::is_pointer_v<decltype(thrust_tuple_get_or_identity<decltype(frontier_key_first), 0>(frontier_key_first))>) {
// call the common path function
}
else {
// call the detail path function
}
Fixed in next push.
packed_bool_size(static_cast<size_t>(edge_partition.number_of_edges())));
edge_partition.compute_number_of_edges_with_mask_async(
  edge_partition_e_mask_span,
  raft::device_span<vertex_t const>{major_first, num_keys},
Something very minor but should we mix (ptr, size) and {ptr, size}? Better stick to the first approach for consistency?
edge_partition.compute_number_of_edges_async(
  major_first,
  major_first + num_keys,
  raft::device_span<vertex_t const>{major_first, num_keys},
Something very minor but should we mix (ptr, size) and {ptr, size}? Better stick to the first approach for consistency?
packed_bool_size(static_cast<size_t>(edge_partition.number_of_edges())));
ret += edge_partition.compute_number_of_edges_with_mask(
  edge_partition_mask_span,
  raft::device_span<vertex_t const>{local_frontier_vertex_first, frontier.size()},
Something very minor but should we mix (ptr, size) and {ptr, size}? Better stick to the first approach for consistency?
local_frontier_vertex_first + frontier.size(),
handle.get_stream());
ret += edge_partition.compute_number_of_edges(
  raft::device_span<vertex_t const>{local_frontier_vertex_first, frontier.size()},
Something very minor but should we mix (ptr, size) and {ptr, size}? Better stick to the first approach for consistency?
* SPDX-FileCopyrightText: Copyright (c) 2023-2026, NVIDIA CORPORATION.
* SPDX-License-Identifier: Apache-2.0
*/
#include "structure/edge_partition_device_view_impl.cuh"

* SPDX-FileCopyrightText: Copyright (c) 2023-2026, NVIDIA CORPORATION.
* SPDX-License-Identifier: Apache-2.0
*/
#include "structure/edge_partition_device_view_impl.cuh"
using view_t = edge_partition_device_view_t<int64_t, int64_t, false>;
using od_extract_iter_t =
  cuda::transform_iterator<extract_v_t<int64_t, uint32_t, uint64_t>, uint64_t const*>;
template void view_t::compute_number_of_edges_with_mask_async<od_extract_iter_t>(
  raft::device_span<uint32_t const>,
  od_extract_iter_t,
  od_extract_iter_t,
  raft::device_span<size_t>,
  rmm::cuda_stream_view) const;
__device__ return_type_t compute_degree(edge_t offset, edge_t degree) const
{
  if (mask_first) {
I am debating whether we should be more explicit about the existence of the mask here.
Say MaskIterator is not a pointer: will if (mask_first) still be false for a default-constructed fancy iterator? Maybe yes, maybe not.
Should we be more explicit and use cuda::std::optional?
This mimics your original code, and we don't currently have any examples where it's not a raw pointer.
However, I'll make that change; it's simple enough.
template __host__ rmm::device_uvector<edge_t> compute_local_degrees_with_mask_sg(
  raft::device_span<uint32_t const> edge_mask,
  std::tuple<vertex_t, vertex_t> vertex_partition_range,
I am debating between using std::tuple here versus two separate vertex_t input parameters.
The former sounds better if we are freshly re-designing the API, but the latter looks a bit more consistent with the rest of the code base.
It seems like we use two variables for input parameters while using std::tuple for return values. What do you think?
(base) seunghwak@SK:~/RAPIDS/cugraph/cpp/include$ grep -rHn "range_last," .
./cugraph/edge_partition_view.hpp:51: vertex_t major_range_last,
./cugraph/edge_partition_view.hpp:53: vertex_t minor_range_last,
./cugraph/graph_view.hpp:644: major_range_last,
./cugraph/graph_view.hpp:646: minor_range_last,
./cugraph/detail/decompress_edge_partition.cuh:37: vertex_t major_range_last,
./cugraph/detail/decompress_edge_partition.cuh:67: vertex_t major_range_last,
./cugraph/prims/transform_reduce_e_by_src_dst_key.cuh:184: typename GraphViewType::vertex_type major_range_last,
./cugraph/prims/transform_reduce_e_by_src_dst_key.cuh:270: typename GraphViewType::vertex_type major_range_last,
./cugraph/prims/transform_reduce_e_by_src_dst_key.cuh:369: typename GraphViewType::vertex_type major_range_last,
./cugraph/prims/vertex_frontier.cuh:123: typename thrust::iterator_traits<VertexIterator>::value_type vertex_range_last,
./cugraph/prims/vertex_frontier.cuh:170: typename thrust::iterator_traits<InputVertexIterator>::value_type vertex_range_last,
./cugraph/prims/vertex_frontier.cuh:212: typename thrust::iterator_traits<OutputVertexIterator>::value_type vertex_range_last,
./cugraph/prims/detail/per_v_transform_reduce_e.cuh:2056: range_last = std::min(range_last, *(edge_partition.major_hypersparse_first()));
./cugraph/prims/detail/per_v_transform_reduce_e.cuh:2080: range_last,
./cugraph/prims/detail/extract_transform_if_v_frontier_e.cuh:1338: range_last,
./cugraph/prims/transform_reduce_e.cuh:148: typename GraphViewType::vertex_type major_range_last,
./cugraph/prims/transform_reduce_e.cuh:237: typename GraphViewType::vertex_type major_range_last,
./cugraph/prims/transform_reduce_e.cuh:315: typename GraphViewType::vertex_type major_range_last,
(base) seunghwak@SK:~/RAPIDS/cugraph/cpp/include$ grep -rHn "tuple<vertex_t, vertex_t>" .
./cugraph/src_dst_lookup_container.hpp:26: typename value_t = cuda::std::tuple<vertex_t, vertex_t>>
./cugraph/graph_view.hpp:116: std::tuple<vertex_t, vertex_t> local_vertex_partition_range() const
./cugraph/graph_view.hpp:137: std::tuple<vertex_t, vertex_t> vertex_partition_range(size_t partition_idx) const
./cugraph/graph_view.hpp:167: std::tuple<vertex_t, vertex_t> local_edge_partition_major_range(size_t partition_idx) const
./cugraph/graph_view.hpp:196: std::tuple<vertex_t, vertex_t> local_edge_partition_minor_range() const
./cugraph/graph_view.hpp:398: std::tuple<vertex_t, vertex_t> local_vertex_partition_range() const
./cugraph/graph_view.hpp:863: std::tuple<vertex_t, vertex_t> local_vertex_partition_range() const
./cugraph/prims/transform_reduce_dst_nbr_intersection_of_e_endpoints_by_v.cuh:50: __device__ int operator()(cuda::std::tuple<vertex_t, vertex_t> tup) const
./cugraph/prims/per_v_transform_reduce_dst_key_aggregated_outgoing_e.cuh:88: operator()(cuda::std::tuple<vertex_t, vertex_t> val /* major, minor key */) const
./cugraph/prims/per_v_transform_reduce_dst_key_aggregated_outgoing_e.cuh:97: __device__ bool operator()(cuda::std::tuple<vertex_t, vertex_t> pair) const
./cugraph/prims/per_v_transform_reduce_dst_key_aggregated_outgoing_e.cuh:137: operator()(cuda::std::tuple<vertex_t, vertex_t> val /* major, minor key */) const
./cugraph/prims/detail/nbr_intersection.cuh:274: __device__ edge_t operator()(cuda::std::tuple<vertex_t, vertex_t> pair) const
./cugraph/prims/detail/nbr_intersection.cuh:702: cuda::std::tuple<vertex_t, vertex_t>>);
./cugraph/prims/detail/nbr_intersection.cuh:760: vertex_pair_first, thrust_tuple_get<cuda::std::tuple<vertex_t, vertex_t>, size_t{1}>{});
./cugraph/prims/detail/nbr_intersection.cuh:1122: vertex_pair_first, thrust_tuple_get<cuda::std::tuple<vertex_t, vertex_t>, size_t{0}>{});
./cugraph/prims/detail/nbr_intersection.cuh:1181: auto vertex_pair_buffer = allocate_dataframe_buffer<cuda::std::tuple<vertex_t, vertex_t>>(
./cugraph/utilities/graph_partition_utils.cuh:117: cuda::std::tuple<vertex_t, vertex_t> pair /* major, minor */) const
./cugraph/utilities/graph_partition_utils.cuh:158: __device__ int operator()(cuda::std::tuple<vertex_t, vertex_t> pair /* major, minor */) const
./cugraph/utilities/graph_partition_utils.cuh:193: cuda::std::tuple<vertex_t, vertex_t> pair /* major, minor */) const
./cugraph/utilities/graph_partition_utils.cuh:225: __device__ int operator()(cuda::std::tuple<vertex_t, vertex_t> pair /* major, minor */) const
./cugraph/utilities/graph_partition_utils.cuh:259: cuda::std::tuple<vertex_t, vertex_t> pair /* major, minor */) const
./cugraph/utilities/graph_partition_utils.cuh:282: __device__ int operator()(cuda::std::tuple<vertex_t, vertex_t> pair /* major, minor */) const
./cugraph/utilities/error_check_utils.cuh:38: __device__ bool operator()(cuda::std::tuple<vertex_t, vertex_t> pair) const
./cugraph/graph_generators.hpp:206: std::vector<std::tuple<vertex_t, vertex_t>> const& component_parameters_v);
./cugraph/graph_generators.hpp:284: std::vector<std::tuple<vertex_t, vertex_t>> const& component_parameters_v);
I like the tuple because it makes it obvious that the two parameters are related. I agree it's an API change; we could either add a FIXME to update the rest of the API, or ripple the change through the rest of the API in this PR.
I can switch it back if you think that's better and we can grapple with it later. Switching it back will require SFINAE to disambiguate passing vertex_partition_range_first/vertex_partition_range_last from passing MajorIterator pairs, or combining those into one call and dispatching to the detail implementation via if constexpr.
Summary
- Moved `edge_partition_device_view_t::compute_number_of_edges*` and `compute_local_degrees*` method bodies out of the header `edge_partition_device_view.cuh` into a separate `_impl.cuh` file, with explicit template instantiations in 4 dedicated `.cu` translation units (SG/MG × v32/v64).
- Replaced `cuda::transform_iterator` types in prims headers with named functors (`bitmap_offset_to_vertex_op_t`, `sparse_hypersparse_major_op_t`) so those iterator types can also be explicitly instantiated without leaking the heavy `_impl.cuh` back into broadly-included headers.
- `compute_number_of_edges` and `compute_number_of_edges_with_mask` are now thin inline wrappers that delegate to their `_async` counterparts via `rmm::device_scalar`, eliminating separate Thrust `transform_reduce` kernel instantiations.
- Merged `local_degree_op_t` and `local_degree_with_mask_op_t` into a single functor with a nullable `uint32_t const*` mask parameter. No-mask methods become inline wrappers that pass `nullptr`, reusing the same with-mask instantiations.
- Halved `vertex_t*`/`vertex_t const*` instantiations: SFINAE overloads const-cast non-const pointer arguments to const, so only `vertex_t const*` instantiations are needed in the `.cu` files.
- Binary size: `libcugraph.so` 475 → 456 MB, `libcugraph_mg.so` 576 → 512 MB.

Background
The 8 `compute_number_of_edges*`/`compute_local_degrees*` template methods on `edge_partition_device_view_t` were defined inline in `edge_partition_device_view.cuh`. Since this header is transitively included by ~44 `.cu` translation units (via prims headers), each TU compiled its own copies of the CUB `DeviceReduce::Sum` and Thrust `transform_reduce` kernels for every iterator type used. This was the single largest source of CUDA device code duplication in the library.

Approach
Following the existing codebase pattern (e.g., `graph_view_impl.cuh` + `graph_view_sg_v32_e32.cu`):

- `compute_number_of_edges_impl.cuh`: contains the out-of-line method definitions for all methods on both MG and SG partial specializations.
- 4 `.cu` files: each includes `_impl.cuh` and provides explicit instantiations for common `MajorIterator` types (`vertex_t const*`, `counting_iterator<vertex_t>`).
- Two prims headers used `cuda::transform_iterator<lambda>` types that required the method definitions to be visible. Replacing the lambdas with named functors (`compute_number_of_edges_functors.cuh`) makes the iterator types nameable for explicit instantiation in the MG `.cu` files.
- `od_shortest_distances_impl.cuh`: retains the `_impl.cuh` include for its own rare iterator type, but this file is only compiled by 2 SG TUs, so the impact is negligible.
- Sync variants (`compute_number_of_edges`, `compute_number_of_edges_with_mask`) are inline wrappers that allocate an `rmm::device_scalar<size_t>`, call the `_async` variant, and return `scalar.value(stream)`. This eliminates the separate Thrust `transform_reduce` kernel codepaths; only CUB `DeviceReduce::Sum` kernels remain.
- `local_degree_op_t` and `local_degree_with_mask_op_t` are merged into a single `local_degree_op_t` with a `MaskIterator` template parameter (defaulting to `uint32_t const*`). A `compute_degree` helper uses a runtime null check on the mask pointer: `if (mask_first)` calls `count_set_bits`, otherwise it returns the raw degree. No-mask methods pass `nullptr`, reusing the exact same template instantiations as with-mask calls. The branch is uniform across all GPU threads (same pointer for all elements), so the prediction cost is negligible.
- SFINAE overloads accept non-const `MajorIterator` arguments (`T*` where `!std::is_const_v<T>`) and forward to the `T const*` variant. This halves the pointer-type instantiations; only `vertex_t const*` is explicitly instantiated in the `.cu` files.

Size reduction breakdown
[table: per-library size breakdown for libcugraph.so and libcugraph_mg.so; values not preserved in this copy]

Instantiation count: 80 → 24 across the 4 dedicated `.cu` files.

Latest edits, tested on CUDA 12: [table: libcugraph.so / libcugraph_mg.so sizes not preserved]

Latest edits, tested on CUDA 13: [table: libcugraph.so / libcugraph_mg.so sizes not preserved]

Testing
All 92 C++ tests pass (`test-cugraph-cpp -j10`).