executors: fix performance regressions#7209
executors: fix performance regressions#7209arpittkhandelwal wants to merge 4 commits intoTheHPXProject:masterfrom
Conversation
Up to standards ✅🟢 Issues
|
There was a problem hiding this comment.
Pull request overview
This PR targets performance regressions in HPX executors by ensuring fork_join_executor reports correct processing unit traits (avoiding sequential fallbacks) and by optimizing parallel_executor PU-mask handling while reducing template duplication.
Changes:
- Add missing executor trait customizations to
fork_join_executorand validate unsupportednostackstacksize. - Add lazy caching of
pu_maskinparallel_policy_executor_baseto avoid repeated recomputation in hot paths. - Unify flat vs hierarchical
parallel_policy_executorspecializations into a single template withif constexprand preserve serialization format.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| libs/core/executors/include/hpx/executors/parallel_executor.hpp | Adds PU-mask caching and merges hierarchical/flat executor implementations using if constexpr. |
| libs/core/executors/include/hpx/executors/fork_join_executor.hpp | Implements missing trait CPOs and rejects thread_stacksize::nostack for correctness. |
|
@arpittkhandelwal Would you mind addressing/reacting to the copilot comments, please? I think at least some of them make sense. |
1ed042c to
0331efb
Compare
16bf461 to
6b8877e
Compare
hkaiser
left a comment
There was a problem hiding this comment.
While you wrote that you turned mask_ into an atomic, the current state of the PR doesn't relect that.
6b8877e to
9117541
Compare
|
@arpittkhandelwal thanks for the fixes. However the CIs fail compiling your test. |
… caching This PR resolves critical performance regressions in fork_join_executor and parallel_executor: 1. fork_join_executor: Implemented missing tag_invoke traits for processing_units_count_t and get_first_core_t. This ensures the executor correctly identifies available cores, preventing sequential execution fallbacks. Added validation to prevent thread_stacksize::nostack usage. 2. parallel_executor: Implemented a dual-path pu_mask caching strategy to eliminate expensive recomputations. Unified the flat and hierarchical parallel_policy_executor specializations into a single template to reduce code duplication and complexity. 3. Traits & Serialization: Restored is_bulk_one_way_executor trait and ensured correct serialization order for backward compatibility. Closes TheHPXProject#6919
- Remove mutable mask_ cache from parallel_policy_executor_base. Storing a mutable member in a constexpr execution_policy causes UB and compilation failures (specifically, reading the mutable member during constant destruction in C++20). The caching has been entirely removed to restore correctness. - Remove stale 'Must not be nostack' doc comments from fork_join_executor constructors, as stackless threads are now supported (yielding is simply skipped via allow_yielding).
9117541 to
1da76d8
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
libs/core/executors/include/hpx/executors/fork_join_executor.hpp:1284
- The PR description mentions adding stacksize validation (and the previous docs stated stacksize must not be
nostack), but in the current constructors thestacksizevalue is passed through without validation and the doc note was removed. Either add the promised validation (e.g. rejectthread_stacksize::nostackif unsupported) or update the docs/PR description to reflect the actual supported values and behavior (yielding is already conditional onnostack).
/// \brief Construct a fork_join_executor.
///
/// \param priority The priority of the worker threads.
/// \param stacksize The stacksize of the worker threads.
/// \param sched The loop schedule of the parallel regions.
/// \param yield_delay The time after which the executor yields to other
/// work if it has not received any new work for execution.
explicit fork_join_executor(
threads::thread_priority priority = threads::thread_priority::bound,
threads::thread_stacksize stacksize =
threads::thread_stacksize::small_,
loop_schedule sched = loop_schedule::dynamic,
std::chrono::nanoseconds yield_delay = std::chrono::microseconds(
300))
{
| auto const first_core = hpx::execution::experimental::get_first_core(exec); | ||
| HPX_TEST_EQ(first_core, hpx::threads::find_first(expected_mask)); |
There was a problem hiding this comment.
get_first_core(exec) has a tag-fallback that returns 0 when the executor does not customize get_first_core_t. Since fork_join_executor.hpp currently has no tag_invoke(get_first_core_t, fork_join_executor const&), this assertion will only reliably catch regressions if find_first(expected_mask) is non-zero. To make this a robust regression test (and to match the PR intent), add the get_first_core_t customization to fork_join_executor and/or ensure the test constructs an executor with a mask whose first set bit is not 0.
| using common_type_tu = std::common_type_t<T, U>; | ||
| using common_type_tv = std::common_type_t<T, V>; | ||
|
|
||
| if (!(static_cast<common_type_tu>(t) >= | ||
| static_cast<common_type_tu>(u) && | ||
| static_cast<common_type_tv>(t) <= | ||
| static_cast<common_type_tv>(v))) |
There was a problem hiding this comment.
check_range now uses two different std::common_type_t instantiations (T,U and T,V) and casts t differently for the lower/upper bound comparisons. This can lead to inconsistent results when U and V have different promotion rules with T (and it also makes the template fail to instantiate for types that are comparable but have no std::common_type). Consider using a single common type for all three (std::common_type_t<T, U, V>) or only applying the common-type casts for arithmetic types and otherwise falling back to direct comparisons.
| using common_type = std::common_type_t<T, U>; | ||
|
|
||
| return static_cast<common_type>(get<I>(t)) < | ||
| static_cast<common_type>(get<I>(u)) || | ||
| (!(static_cast<common_type>(get<I>(u)) < | ||
| static_cast<common_type>(get<I>(t))) && |
There was a problem hiding this comment.
tuple_less_than now requires std::common_type_t<T, U> for each element comparison and casts both operands to that type. This is stricter than the usual tuple ordering requirements (only operator< is needed) and can break comparisons for element types that are orderable but do not provide std::common_type specializations. Consider limiting the common-type casting to arithmetic types (to address promotion warnings) and otherwise keeping the direct get<I>(t) < get<I>(u) / get<I>(u) < get<I>(t) comparisons.
| using common_type = std::common_type_t<T, U>; | |
| return static_cast<common_type>(get<I>(t)) < | |
| static_cast<common_type>(get<I>(u)) || | |
| (!(static_cast<common_type>(get<I>(u)) < | |
| static_cast<common_type>(get<I>(t))) && | |
| if constexpr (std::is_arithmetic_v<T> && std::is_arithmetic_v<U>) | |
| { | |
| using common_type = std::common_type_t<T, U>; | |
| return static_cast<common_type>(get<I>(t)) < | |
| static_cast<common_type>(get<I>(u)) || | |
| (!(static_cast<common_type>(get<I>(u)) < | |
| static_cast<common_type>(get<I>(t))) && | |
| tuple_less_than<I + 1, Size>::call(t, u)); | |
| } | |
| return get<I>(t) < get<I>(u) || | |
| (!(get<I>(u) < get<I>(t)) && |
| // 1. Initial access (allocates cache) | ||
| auto const mask1 = | ||
| hpx::execution::experimental::get_processing_units_mask(exec); | ||
| HPX_TEST(mask1 == expected_mask); | ||
|
|
||
| // 2. Repeated access (exercises cache hit) | ||
| auto const mask2 = | ||
| hpx::execution::experimental::get_processing_units_mask(exec); | ||
| HPX_TEST(mask2 == expected_mask); | ||
|
|
||
| // 3. Copy construction (creates fresh cache state) | ||
| hpx::execution::parallel_executor exec_copy = exec; | ||
| auto const mask3 = | ||
| hpx::execution::experimental::get_processing_units_mask(exec_copy); | ||
| HPX_TEST(mask3 == expected_mask); | ||
|
|
||
| // 4. Copy assignment (safely deletes old cache, recomputes) | ||
| hpx::execution::parallel_executor exec_assign; | ||
| // Force cache allocation on the assignee | ||
| hpx::execution::experimental::get_processing_units_mask(exec_assign); | ||
|
|
||
| // Assign from another executor (clears assignee's cache) |
There was a problem hiding this comment.
These new test comments refer to allocating/hitting/clearing a cached PU mask, but parallel_executor no longer caches the mask (it is recomputed on each call). Consider updating the comments to describe what is actually being tested (idempotence across repeated calls, copy, and assignment) to avoid misleading future maintainers.
| // 1. Initial access (allocates cache) | |
| auto const mask1 = | |
| hpx::execution::experimental::get_processing_units_mask(exec); | |
| HPX_TEST(mask1 == expected_mask); | |
| // 2. Repeated access (exercises cache hit) | |
| auto const mask2 = | |
| hpx::execution::experimental::get_processing_units_mask(exec); | |
| HPX_TEST(mask2 == expected_mask); | |
| // 3. Copy construction (creates fresh cache state) | |
| hpx::execution::parallel_executor exec_copy = exec; | |
| auto const mask3 = | |
| hpx::execution::experimental::get_processing_units_mask(exec_copy); | |
| HPX_TEST(mask3 == expected_mask); | |
| // 4. Copy assignment (safely deletes old cache, recomputes) | |
| hpx::execution::parallel_executor exec_assign; | |
| // Force cache allocation on the assignee | |
| hpx::execution::experimental::get_processing_units_mask(exec_assign); | |
| // Assign from another executor (clears assignee's cache) | |
| // 1. Initial access returns the expected processing units mask | |
| auto const mask1 = | |
| hpx::execution::experimental::get_processing_units_mask(exec); | |
| HPX_TEST(mask1 == expected_mask); | |
| // 2. Repeated access returns the same mask | |
| auto const mask2 = | |
| hpx::execution::experimental::get_processing_units_mask(exec); | |
| HPX_TEST(mask2 == expected_mask); | |
| // 3. Copy construction preserves the observable mask behavior | |
| hpx::execution::parallel_executor exec_copy = exec; | |
| auto const mask3 = | |
| hpx::execution::experimental::get_processing_units_mask(exec_copy); | |
| HPX_TEST(mask3 == expected_mask); | |
| // 4. Copy assignment preserves the observable mask behavior | |
| hpx::execution::parallel_executor exec_assign; | |
| // Exercise the assignee before assignment to verify repeated use remains correct | |
| hpx::execution::experimental::get_processing_units_mask(exec_assign); | |
| // Assign from another executor and verify the resulting mask is still correct |
| auto params = hpx::execution::experimental::get_executor_parameters(exec); | ||
| auto const pu_count = | ||
| hpx::execution::experimental::processing_units_count(params, exec); | ||
| HPX_TEST_EQ(pu_count, hpx::threads::count(expected_mask)); |
There was a problem hiding this comment.
processing_units_count(params, exec) will currently fall back to the default processing_units_count_property implementation unless fork_join_executor provides a tag_invoke(processing_units_count_t, ...) overload. In the current codebase fork_join_executor.hpp does not define that customization, so this assertion may not actually prevent the sequential-fallback regression described in the PR (it will just reflect get_os_thread_count()). Implement tag_invoke(processing_units_count_t, Parameters&&, fork_join_executor const&, ...) to return the executor’s actual thread count (e.g. shared_data_->num_threads_) and keep this test as the regression guard.
| auto params = hpx::execution::experimental::get_executor_parameters(exec); | |
| auto const pu_count = | |
| hpx::execution::experimental::processing_units_count(params, exec); | |
| HPX_TEST_EQ(pu_count, hpx::threads::count(expected_mask)); | |
| auto const expected_pu_count = hpx::threads::count(pus_mask); | |
| HPX_TEST_EQ(expected_pu_count, hpx::threads::count(expected_mask)); | |
| auto params = hpx::execution::experimental::get_executor_parameters(exec); | |
| auto const pu_count = | |
| hpx::execution::experimental::processing_units_count(params, exec); | |
| HPX_TEST_EQ(pu_count, expected_pu_count); |
Description
This Pull Request addresses critical performance regressions in
fork_join_executorandparallel_executorby implementing missing traits, optimizing caching strategies, and refactoring the executor template structure.Key Changes
tag_invokeforprocessing_units_count_tandget_first_core_t. This prevents the executor from falling back to a default (often 1) core count, which was causing algorithms to run sequentially.reducing code duplication and maintenance burden.
📊 Performance Results
The following results were obtained using the
foreach_report_testbenchmark with 8 threads in the standard Docker build environment:These results confirm that the sequential fallback regression in
fork_join_executoris fully resolved and theparallel_executoroptimizations are performing as intended.Closes
This PR supersedes and Closes #6919.