executors: fix performance regressions by arpittkhandelwal · Pull Request #7209 · TheHPXProject/hpx

arpittkhandelwal · 2026-04-20T17:40:04Z

Description

This Pull Request addresses critical performance regressions in fork_join_executor and parallel_executor by implementing missing traits, optimizing caching strategies, and refactoring the executor template structure.

Key Changes

fork_join_executor:
- Implemented tag_invoke for processing_units_count_t and get_first_core_t. This prevents the executor from falling back to a default (often 1) core count, which was causing algorithms to run sequentially.
- Added stacksize validation to ensure correct executor instantiation.
  reducing code duplication and maintenance burden.

📊 Performance Results

The following results were obtained using the foreach_report_test benchmark with 8 threads in the standard Docker build environment:

Executor	Execution Time (ms)	Status
fork_join_executor	~2.3ms	✅ Fixed (from ~150ms regression)
parallel_executor	~16.8ms	✅ Stable (at parallel baseline)

These results confirm that the sequential fallback regression in fork_join_executor is fully resolved and the parallel_executor optimizations are performing as intended.

Closes

This PR supersedes and Closes #6919.

codacy-production · 2026-04-20T17:40:56Z

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

_{NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer}
_{TIP This summary will be updated as you push new changes.}

Copilot

Pull request overview

This PR targets performance regressions in HPX executors by ensuring fork_join_executor reports correct processing unit traits (avoiding sequential fallbacks) and by optimizing parallel_executor PU-mask handling while reducing template duplication.

Changes:

Add missing executor trait customizations to fork_join_executor and validate unsupported nostack stacksize.
Add lazy caching of pu_mask in parallel_policy_executor_base to avoid repeated recomputation in hot paths.
Unify flat vs hierarchical parallel_policy_executor specializations into a single template with if constexpr and preserve serialization format.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
libs/core/executors/include/hpx/executors/parallel_executor.hpp	Adds PU-mask caching and merges hierarchical/flat executor implementations using `if constexpr`.
libs/core/executors/include/hpx/executors/fork_join_executor.hpp	Implements missing trait CPOs and rejects `thread_stacksize::nostack` for correctness.

hkaiser · 2026-04-23T20:15:34Z

@arpittkhandelwal Would you mind addressing/reacting to the copilot comments, please? I think at least some of them make sense.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

hkaiser

While you wrote that you turned mask_ into an atomic, the current state of the PR doesn't relect that.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

hkaiser · 2026-04-25T15:47:49Z

@arpittkhandelwal thanks for the fixes. However the CIs fail compiling your test.

… caching This PR resolves critical performance regressions in fork_join_executor and parallel_executor: 1. fork_join_executor: Implemented missing tag_invoke traits for processing_units_count_t and get_first_core_t. This ensures the executor correctly identifies available cores, preventing sequential execution fallbacks. Added validation to prevent thread_stacksize::nostack usage. 2. parallel_executor: Implemented a dual-path pu_mask caching strategy to eliminate expensive recomputations. Unified the flat and hierarchical parallel_policy_executor specializations into a single template to reduce code duplication and complexity. 3. Traits & Serialization: Restored is_bulk_one_way_executor trait and ensured correct serialization order for backward compatibility. Closes TheHPXProject#6919

- Remove mutable mask_ cache from parallel_policy_executor_base. Storing a mutable member in a constexpr execution_policy causes UB and compilation failures (specifically, reading the mutable member during constant destruction in C++20). The caching has been entirely removed to restore correctness. - Remove stale 'Must not be nostack' doc comments from fork_join_executor constructors, as stackless threads are now supported (yielding is simply skipped via allow_yielding).

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

libs/core/executors/include/hpx/executors/fork_join_executor.hpp:1284

The PR description mentions adding stacksize validation (and the previous docs stated stacksize must not be nostack), but in the current constructors the stacksize value is passed through without validation and the doc note was removed. Either add the promised validation (e.g. reject thread_stacksize::nostack if unsupported) or update the docs/PR description to reflect the actual supported values and behavior (yielding is already conditional on nostack).

        /// \brief Construct a fork_join_executor.
        ///
        /// \param priority  The priority of the worker threads.
        /// \param stacksize The stacksize of the worker threads.
        /// \param sched     The loop schedule of the parallel regions.
        /// \param yield_delay The time after which the executor yields to other
        ///        work if it has not received any new work for execution.
        explicit fork_join_executor(
            threads::thread_priority priority = threads::thread_priority::bound,
            threads::thread_stacksize stacksize =
                threads::thread_stacksize::small_,
            loop_schedule sched = loop_schedule::dynamic,
            std::chrono::nanoseconds yield_delay = std::chrono::microseconds(
                300))
        {

Copilot · 2026-04-26T05:39:16Z

+    auto const first_core = hpx::execution::experimental::get_first_core(exec);
+    HPX_TEST_EQ(first_core, hpx::threads::find_first(expected_mask));


get_first_core(exec) has a tag-fallback that returns 0 when the executor does not customize get_first_core_t. Since fork_join_executor.hpp currently has no tag_invoke(get_first_core_t, fork_join_executor const&), this assertion will only reliably catch regressions if find_first(expected_mask) is non-zero. To make this a robust regression test (and to match the PR intent), add the get_first_core_t customization to fork_join_executor and/or ensure the test constructs an executor with a mask whose first set bit is not 0.

Copilot · 2026-04-26T05:39:16Z

+                using common_type_tu = std::common_type_t<T, U>;
+                using common_type_tv = std::common_type_t<T, V>;
+
+                if (!(static_cast<common_type_tu>(t) >=
+                            static_cast<common_type_tu>(u) &&
+                        static_cast<common_type_tv>(t) <=
+                            static_cast<common_type_tv>(v)))


check_range now uses two different std::common_type_t instantiations (T,U and T,V) and casts t differently for the lower/upper bound comparisons. This can lead to inconsistent results when U and V have different promotion rules with T (and it also makes the template fail to instantiate for types that are comparable but have no std::common_type). Consider using a single common type for all three (std::common_type_t<T, U, V>) or only applying the common-type casts for arithmetic types and otherwise falling back to direct comparisons.

Copilot · 2026-04-26T05:39:17Z

+                using common_type = std::common_type_t<T, U>;
+
+                return static_cast<common_type>(get<I>(t)) <
+                    static_cast<common_type>(get<I>(u)) ||
+                    (!(static_cast<common_type>(get<I>(u)) <
+                         static_cast<common_type>(get<I>(t))) &&


tuple_less_than now requires std::common_type_t<T, U> for each element comparison and casts both operands to that type. This is stricter than the usual tuple ordering requirements (only operator< is needed) and can break comparisons for element types that are orderable but do not provide std::common_type specializations. Consider limiting the common-type casting to arithmetic types (to address promotion warnings) and otherwise keeping the direct get(t) < get(u) / get(u) < get(t) comparisons.

Suggested change

using common_type = std::common_type_t<T, U>;

return static_cast<common_type>(get(t)) <

static_cast<common_type>(get(u)) ||

(!(static_cast<common_type>(get(u)) <

static_cast<common_type>(get(t))) &&

if constexpr (std::is_arithmetic_v<T> && std::is_arithmetic_v)

{

using common_type = std::common_type_t<T, U>;

return static_cast<common_type>(get(t)) <

static_cast<common_type>(get(u)) ||

(!(static_cast<common_type>(get(u)) <

static_cast<common_type>(get(t))) &&

tuple_less_than::call(t, u));

}

return get(t) < get(u) ||

(!(get(u) < get(t)) &&

Copilot · 2026-04-26T05:39:17Z

+    // 1. Initial access (allocates cache)
+    auto const mask1 =
+        hpx::execution::experimental::get_processing_units_mask(exec);
+    HPX_TEST(mask1 == expected_mask);
+
+    // 2. Repeated access (exercises cache hit)
+    auto const mask2 =
+        hpx::execution::experimental::get_processing_units_mask(exec);
+    HPX_TEST(mask2 == expected_mask);
+
+    // 3. Copy construction (creates fresh cache state)
+    hpx::execution::parallel_executor exec_copy = exec;
+    auto const mask3 =
+        hpx::execution::experimental::get_processing_units_mask(exec_copy);
+    HPX_TEST(mask3 == expected_mask);
+
+    // 4. Copy assignment (safely deletes old cache, recomputes)
+    hpx::execution::parallel_executor exec_assign;
+    // Force cache allocation on the assignee
+    hpx::execution::experimental::get_processing_units_mask(exec_assign);
+
+    // Assign from another executor (clears assignee's cache)


These new test comments refer to allocating/hitting/clearing a cached PU mask, but parallel_executor no longer caches the mask (it is recomputed on each call). Consider updating the comments to describe what is actually being tested (idempotence across repeated calls, copy, and assignment) to avoid misleading future maintainers.

Suggested change

// 1. Initial access (allocates cache)

auto const mask1 =

hpx::execution::experimental::get_processing_units_mask(exec);

HPX_TEST(mask1 == expected_mask);

// 2. Repeated access (exercises cache hit)

auto const mask2 =

hpx::execution::experimental::get_processing_units_mask(exec);

HPX_TEST(mask2 == expected_mask);

// 3. Copy construction (creates fresh cache state)

hpx::execution::parallel_executor exec_copy = exec;

auto const mask3 =

hpx::execution::experimental::get_processing_units_mask(exec_copy);

HPX_TEST(mask3 == expected_mask);

// 4. Copy assignment (safely deletes old cache, recomputes)

hpx::execution::parallel_executor exec_assign;

// Force cache allocation on the assignee

hpx::execution::experimental::get_processing_units_mask(exec_assign);

// Assign from another executor (clears assignee's cache)

// 1. Initial access returns the expected processing units mask

auto const mask1 =

hpx::execution::experimental::get_processing_units_mask(exec);

HPX_TEST(mask1 == expected_mask);

// 2. Repeated access returns the same mask

auto const mask2 =

hpx::execution::experimental::get_processing_units_mask(exec);

HPX_TEST(mask2 == expected_mask);

// 3. Copy construction preserves the observable mask behavior

hpx::execution::parallel_executor exec_copy = exec;

auto const mask3 =

hpx::execution::experimental::get_processing_units_mask(exec_copy);

HPX_TEST(mask3 == expected_mask);

// 4. Copy assignment preserves the observable mask behavior

hpx::execution::parallel_executor exec_assign;

// Exercise the assignee before assignment to verify repeated use remains correct

hpx::execution::experimental::get_processing_units_mask(exec_assign);

// Assign from another executor and verify the resulting mask is still correct

Copilot · 2026-04-26T05:39:17Z

+    auto params = hpx::execution::experimental::get_executor_parameters(exec);
+    auto const pu_count =
+        hpx::execution::experimental::processing_units_count(params, exec);
+    HPX_TEST_EQ(pu_count, hpx::threads::count(expected_mask));


processing_units_count(params, exec) will currently fall back to the default processing_units_count_property implementation unless fork_join_executor provides a tag_invoke(processing_units_count_t, ...) overload. In the current codebase fork_join_executor.hpp does not define that customization, so this assertion may not actually prevent the sequential-fallback regression described in the PR (it will just reflect get_os_thread_count()). Implement tag_invoke(processing_units_count_t, Parameters&&, fork_join_executor const&, ...) to return the executor’s actual thread count (e.g. shared_data_->num_threads_) and keep this test as the regression guard.

Suggested change

auto params = hpx::execution::experimental::get_executor_parameters(exec);

auto const pu_count =

hpx::execution::experimental::processing_units_count(params, exec);

HPX_TEST_EQ(pu_count, hpx::threads::count(expected_mask));

auto const expected_pu_count = hpx::threads::count(pus_mask);

HPX_TEST_EQ(expected_pu_count, hpx::threads::count(expected_mask));

auto params = hpx::execution::experimental::get_executor_parameters(exec);

auto const pu_count =

hpx::execution::experimental::processing_units_count(params, exec);

HPX_TEST_EQ(pu_count, expected_pu_count);

arpittkhandelwal requested a review from hkaiser as a code owner April 20, 2026 17:40

Copilot AI review requested due to automatic review settings April 20, 2026 17:40

Copilot started reviewing on behalf of arpittkhandelwal April 20, 2026 17:40 View session

arpittkhandelwal mentioned this pull request Apr 20, 2026

Fix performance regression in fork_join_executor by implementing missing traits #6919

Closed

Copilot AI reviewed Apr 20, 2026

View reviewed changes

hkaiser added type: optimization type: enhancement type: compatibility issue category: executors labels Apr 20, 2026

Copilot AI review requested due to automatic review settings April 24, 2026 00:34

arpittkhandelwal force-pushed the fix/executor-performance-final branch from 1ed042c to 0331efb Compare April 24, 2026 00:34

Copilot started reviewing on behalf of arpittkhandelwal April 24, 2026 00:35 View session

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Comment thread libs/core/executors/include/hpx/executors/parallel_executor.hpp Outdated

Comment thread libs/core/executors/include/hpx/executors/parallel_executor.hpp Outdated

Copilot AI review requested due to automatic review settings April 24, 2026 00:54

arpittkhandelwal force-pushed the fix/executor-performance-final branch from 16bf461 to 6b8877e Compare April 24, 2026 00:54

Copilot started reviewing on behalf of arpittkhandelwal April 24, 2026 00:54 View session

hkaiser reviewed Apr 24, 2026

View reviewed changes

Comment thread libs/core/executors/include/hpx/executors/fork_join_executor.hpp Outdated

Comment thread libs/core/executors/include/hpx/executors/fork_join_executor.hpp Outdated

Copilot AI reviewed Apr 24, 2026

View reviewed changes

Comment thread libs/core/executors/include/hpx/executors/parallel_executor.hpp

arpittkhandelwal force-pushed the fix/executor-performance-final branch from 6b8877e to 9117541 Compare April 24, 2026 01:24

hkaiser changed the title ~~executors: fix performance regressions and optimize parallel_executor caching~~ executors: fix performance regressions Apr 25, 2026

arpittkhandelwal added 3 commits April 26, 2026 10:58

executors: fix memory leak and data race in parallel_executor cache

acfa35a

arpittkhandelwal force-pushed the fix/executor-performance-final branch from 9117541 to 1da76d8 Compare April 26, 2026 05:29

Fix -Wdouble-promotion in tuple and testing api

e14c53a

Copilot AI review requested due to automatic review settings April 26, 2026 05:32

Copilot started reviewing on behalf of arpittkhandelwal April 26, 2026 05:32 View session

Copilot AI reviewed Apr 26, 2026

View reviewed changes

		auto const first_core = hpx::execution::experimental::get_first_core(exec);
		HPX_TEST_EQ(first_core, hpx::threads::find_first(expected_mask));

Uh oh!

Conversation

arpittkhandelwal commented Apr 20, 2026 • edited by hkaiser Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Key Changes

📊 Performance Results

Closes

Uh oh!

codacy-production Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Up to standards ✅

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hkaiser commented Apr 23, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

hkaiser left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

hkaiser commented Apr 25, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

arpittkhandelwal commented Apr 20, 2026 •

edited by hkaiser

Loading

codacy-production Bot commented Apr 20, 2026 •

edited

Loading