From 373bb4ae1c42537f6cce46d01fd31bdb2e7430f6 Mon Sep 17 00:00:00 2001 From: daasmit07 Date: Sun, 19 Apr 2026 00:09:55 +0000 Subject: [PATCH 1/3] ... --- tasks/mityaeva_radix/omp/report.md | 268 +++++++++++++++++++++++ tasks/mityaeva_radix/tbb/report.md | 339 +++++++++++++++++++++++++++++ 2 files changed, 607 insertions(+) create mode 100644 tasks/mityaeva_radix/omp/report.md create mode 100644 tasks/mityaeva_radix/tbb/report.md diff --git a/tasks/mityaeva_radix/omp/report.md b/tasks/mityaeva_radix/omp/report.md new file mode 100644 index 000000000..f4ea701d2 --- /dev/null +++ b/tasks/mityaeva_radix/omp/report.md @@ -0,0 +1,268 @@ +# Radix sort of `double`s with simple merge (OpenMP parallelization) + +- Student: Митяева Дарья Викторовна, group 3823Б1ФИ2 +- Technology: OMP +- Variant: 19 + +## 1. Introduction + +Sorting is a fundamental operation in computer science with applications across +all domains of computing. This project implements a parallel Least Significant +Digit (LSD) radix sort specifically designed for double-precision floating-point +numbers using OpenMP for shared-memory parallelization. The algorithm leverages +the byte representation of doubles to achieve linear-time complexity while +utilizing multiple threads to accelerate the sorting process. A key enhancement +in this implementation is the transformation of IEEE 754 double representation +into a sortable unsigned integer format, eliminating the need for separate +handling of negative numbers. + +## 2. Problem Statement + +- **Input:** A vector of double-precision floating-point numbers of arbitrary + length `N`. +- **Output:** The same vector sorted in non-decreasing order. + +**Constraints:** The input vector must not be empty. The algorithm must handle +all possible double values including positive/negative zero, infinities, and NaN +values (though NaN handling follows IEEE 754 conventions). + +## 3. Baseline Algorithm (Sequential) + +The sequential LSD radix sort processes numbers digit by digit from least +significant to most significant. For double values (8 bytes = 64 bits), the +algorithm: + +1. Interprets each double as an array of 8 unsigned bytes. +2. Performs counting sort on each byte position (0 to 7). +3. Alternates between original and auxiliary arrays to avoid unnecessary + copying. +4. Maintains stability throughout all passes to ensure correct final ordering. + +The counting sort for each byte: + +- Builds a histogram of byte values (256 buckets). +- Computes prefix sums to determine final positions. +- Places elements in their sorted positions maintaining stability. + +Due to the `IEEE 754` representation, negative numbers require special handling: +their byte representation is inverted to maintain correct numerical order. + +## 4. Parallelization Scheme (OpenMP) + +The OpenMP implementation parallelizes the radix sort through the following +strategies: + +### 4.1 Bit Transformation + +Instead of handling negative numbers separately, this implementation uses a +transformation function that maps IEEE 754 doubles to a sortable unsigned +integer representation. For positive numbers, the sign bit is flipped to 1; for +negative numbers, all bits are inverted. This elegant approach ensures that the +integer order matches the floating-point order completely, eliminating the need +for separate negative/positive processing paths and simplifying the algorithm +significantly. + +### 4.2 Parallel Counting Pass + +Each counting sort pass is parallelized using a three-phase approach: + +**Phase 1 – Parallel histogram construction:** Threads process disjoint chunks +of the input array, with each thread building its own local histogram of byte +frequencies (256 buckets per thread). This avoids contention on shared counters. + +**Phase 2 – Sequential prefix sum aggregation:** The thread-local histograms are +combined into global prefix sums. Although this phase is sequential, it operates +on only 256 × T elements (where T is the number of threads), which is negligible +compared to the main data processing. + +**Phase 3 – Parallel scatter:** Using the computed prefix sums, threads write +elements to their final positions in parallel. Each thread maintains its own +position pointers, ensuring no write conflicts. + +### 4.3 Work Distribution + +For an array of N elements and T threads, the work is distributed by dividing +the array into T contiguous chunks of approximately N/T elements. The last +thread handles any remainder elements. This contiguous partitioning ensures good +cache locality and minimizes false sharing between threads. + +### 4.4 Memory Management + +The implementation uses double buffering with two arrays – a source and a +destination – alternating between them after each counting pass. This approach +avoids repeated memory allocations and uses pointer swapping instead of copying. +The number of threads is dynamically configured through the framework's utility +functions. + +## 5. Implementation Details + +### File Structure + +- `common/include/common.hpp` – Type aliases for input, output, and test data +- `common/include/test_generator.hpp` – Random double vector generation + utilities +- `omp/include/ops_omp.hpp` – Task class interface for framework integration +- `omp/include/sorter_omp.hpp` – Sorting algorithm interface declaration +- `omp/src/ops_omp.cpp` – Framework wrapper implementation (validation, + preprocessing, execution) +- `omp/src/sorter_omp.cpp` – Core sorting algorithm with OpenMP directives + +### Key Functions + +- **DoubleToSortable** – Transforms the bit representation of a double into a + sortable unsigned integer. This function checks the sign bit: for negative + numbers it returns the bitwise complement, for positive numbers it flips the + sign bit to 1. + +- **SortableToDouble** – Performs the inverse transformation, converting a + sortable unsigned integer back to a double's bit representation. + +- **CountingPass** – Executes a single parallel counting sort iteration for a + specific byte position. This function receives the current and next arrays, + the shift amount, radix size, thread count, and data size, then orchestrates + the three-phase parallel counting process. + +- **Sort** – The main sorting routine that orchestrates all passes. It + transforms the input doubles to sortable integers, performs eight counting + passes (one per byte), and then converts the sorted integers back to doubles. + +### Negative Numbers Handling + +Unlike the sequential version that required separate processing of negative and +positive numbers with different sort directions, this implementation handles all +values uniformly. The `DoubleToSortable` transformation ensures that the integer +representation preserves the correct ordering for all floating-point values, +including negative numbers, zero, subnormals, and special values. This +simplification reduces code complexity and improves performance by eliminating +conditional branches. + +## 6. Experimental Setup + +- **Hardware/OS:** Intel Core i7-1165G7 @ 2.80GHz (4 cores, 8 threads), 16GB + RAM, Ubuntu 22.04 via WSL2 under Windows 10 (build 2H22) +- **Toolchain:** GCC 14.2.0 x86-64-linux-gnu, build type Release +- **Environment:** OpenMP parallel execution, 8 threads +- **Data:** Random doubles uniformly distributed between -0.5 and 0.5, generated + with fixed seed for reproducibility + +## 7. Results and Discussion + +### 7.1 Correctness + +Correctness was verified through multiple validation approaches: + +- Comparison with `std::ranges::is_sorted` results across numerous random + datasets +- Edge case testing including single element, duplicate values, already sorted + arrays, and reverse sorted arrays +- Special value handling for positive and negative zero, as well as infinities +- Cross-validation ensuring that parallel execution produces bit-identical + results to sequential execution + +### 7.2 Performance + +The following table shows execution times for various input sizes compared to +the sequential baseline: + +| Mode | Count | Time (ms) | Speedup vs Seq | +| ---- | ----------- | --------- | -------------- | +| seq | 10 | 14 | 1.00x | +| omp | 10 | 4 | 3.50x | +| seq | 100 | 17 | 1.00x | +| omp | 100 | 5 | 3.40x | +| seq | 1,000 | 16 | 1.00x | +| omp | 1,000 | 4 | 4.00x | +| seq | 10,000 | 18 | 1.00x | +| omp | 10,000 | 5 | 3.60x | +| seq | 100,000 | 77 | 1.00x | +| omp | 100,000 | 18 | 4.28x | +| seq | 1,000,000 | 495 | 1.00x | +| omp | 1,000,000 | 108 | 4.58x | +| seq | 10,000,000 | 5138 | 1.00x | +| omp | 10,000,000 | 1072 | 4.79x | +| seq | 100,000,000 | 53375 | 1.00x | +| omp | 100,000,000 | 10984 | 4.86x | + +**Analysis:** The OpenMP parallel implementation demonstrates excellent scaling +with input size, achieving speedups between 3.4x and 4.86x compared to the +sequential baseline. The speedup improves with larger datasets as the parallel +overhead becomes amortized over more work. + +Key observations: + +- **Small arrays (under 10,000 elements):** Speedup is slightly lower + (3.4x–4.0x) due to OpenMP thread creation and synchronization overhead + dominating the execution time. + +- **Medium arrays (100,000 – 1,000,000 elements):** Speedup reaches 4.28x–4.58x + as work distribution becomes more efficient and the overhead becomes + negligible. + +- **Large arrays (10M – 100M elements):** Speedup stabilizes around 4.79x–4.86x, + approaching the theoretical maximum for 8 hardware threads. The primary + limiting factor is memory bandwidth, as each pass reads and writes the entire + dataset. + +The parallel efficiency (speedup divided by thread count) ranges from 42.5% to +60.8%, which is excellent for a memory-bound algorithm like radix sort. The +efficiency increases with problem size, reaching its peak at the largest +dataset. + +### 7.3 Scalability Analysis + +The algorithm demonstrates near-linear scaling with problem size. Using linear +approximation, the execution time on the test machine for the parallel version +can be estimated by the formula: `time_omp (ms) = 0.00011 × N + 3.85` + +Comparing with the sequential formula `time_seq (ms) = 0.0005 × N + 20.25`: + +- The parallel implementation achieves approximately 4.5 times lower slope + coefficient +- Constant overhead is reduced from roughly 20 milliseconds to about 4 + milliseconds due to the more efficient bit transformation approach that + eliminates separate negative number processing + +### 7.4 Comparison with Sequential Version + +The OpenMP version offers several advantages beyond raw speed: + +1. **Simplified negative number handling:** The bit transformation approach + eliminates the need for separate sorting of negative and positive numbers, + reducing code complexity and conditional branches. + +2. **Better memory locality:** Parallel threads access contiguous memory + regions, improving cache utilization and reducing cache misses. + +3. **Reduced constant factors:** The transformation approach uses a more uniform + processing path with less conditional logic, contributing to the lower + constant overhead observed in measurements. + +## 8. Conclusions + +A parallel LSD radix sort for double-precision numbers has been successfully +implemented and validated using OpenMP. The algorithm achieves speedups of 4.86x +on 8 hardware threads for large datasets (100 million elements), demonstrating +excellent parallel efficiency. The key innovations include: + +- A bit transformation technique that eliminates separate handling of negative + numbers, simplifying the algorithm +- Parallel histogram construction using thread-local counters to avoid + contention +- An efficient scatter phase using thread-local position pointers for + conflict-free writes + +The implementation serves as a strong baseline for further parallelization using +MPI for distributed memory systems or hybrid approaches combining OpenMP with +MPI. The main limitation remains the O(n) additional memory requirement, though +this is inherent to LSD radix sort implementations and could be addressed in +future work through in-place radix sort techniques. + +## 9. References + +1. [Сортировки. Из курса "Параллельные численые методы" Сиднев А.А., Сысоев А.В., Мееров И.Б.](http://www.hpcc.unn.ru/file.php?id=458) + +2. [Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to Algorithms (3rd ed.). MIT Press. (Chapter 8: Sorting in Linear Time)](https://ressources.unisciel.fr/algoprog/s00aaroot/aa00module1/res/%5BCormen-AL2011%5DIntroduction_To_Algorithms-A3.pdf) + +3. [Knuth, D. E. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley.]() + +4. [OpenMP Application Programming Interface Specification Version 5.0](https://www.openmp.org/spec-html/5.0/openmp50.html) diff --git a/tasks/mityaeva_radix/tbb/report.md b/tasks/mityaeva_radix/tbb/report.md new file mode 100644 index 000000000..2b13c7ce4 --- /dev/null +++ b/tasks/mityaeva_radix/tbb/report.md @@ -0,0 +1,339 @@ +# Radix sort of `double`s with simple merge (TBB parallelization) + +- Student: Митяева Дарья Викторовна, group 3823Б1ФИ2 +- Technology: TBB +- Variant: 19 + +## 1. Introduction + +Sorting is a fundamental operation in computer science with applications across +all domains of computing. This project implements a parallel Least Significant +Digit (LSD) radix sort specifically designed for double-precision floating-point +numbers using Intel Threading Building Blocks (TBB) for high-performance +shared-memory parallelization. TBB provides a higher-level abstraction for +parallelism compared to raw threading models, enabling automatic workload +balancing and scalable performance across different hardware configurations. The +algorithm leverages the byte representation of doubles to achieve linear-time +complexity while utilizing TBB's task-based parallelism to accelerate the +sorting process. A key enhancement in this implementation is the transformation +of IEEE 754 double representation into a sortable unsigned integer format, +eliminating the need for separate handling of negative numbers. + +## 2. Problem Statement + +- **Input:** A vector of double-precision floating-point numbers of arbitrary + length `N`. +- **Output:** The same vector sorted in non-decreasing order. + +**Constraints:** The input vector must not be empty. The algorithm must handle +all possible double values including positive/negative zero, infinities, and NaN +values (though NaN handling follows IEEE 754 conventions). + +## 3. Baseline Algorithm (Sequential) + +The sequential LSD radix sort processes numbers digit by digit from least +significant to most significant. For double values (8 bytes = 64 bits), the +algorithm: + +1. Interprets each double as an array of 8 unsigned bytes. +2. Performs counting sort on each byte position (0 to 7). +3. Alternates between original and auxiliary arrays to avoid unnecessary + copying. +4. Maintains stability throughout all passes to ensure correct final ordering. + +The counting sort for each byte: + +- Builds a histogram of byte values (256 buckets). +- Computes prefix sums to determine final positions. +- Places elements in their sorted positions maintaining stability. + +Due to the `IEEE 754` representation, negative numbers require special handling: +their byte representation is inverted to maintain correct numerical order. + +## 4. Parallelization Scheme (TBB) + +The TBB implementation parallelizes the radix sort using Intel's task-based +parallelism model, which offers several advantages over traditional OpenMP +approaches including automatic load balancing and nested parallelism support. + +### 4.1 Bit Transformation + +Instead of handling negative numbers separately, this implementation uses a +transformation function that maps IEEE 754 doubles to a sortable unsigned +integer representation. For positive numbers, the sign bit is flipped to 1; for +negative numbers, all bits are inverted. This elegant approach ensures that the +integer order matches the floating-point order completely, eliminating the need +for separate negative/positive processing paths and simplifying the algorithm +significantly. + +### 4.2 Parallel Counting Pass + +Each counting sort pass is parallelized using a three-phase approach with TBB's +`parallel_for` construct: + +**Phase 1 – Parallel histogram construction:** TBB partitions the iteration +space over the number of threads, with each thread building its own local +histogram of byte frequencies (256 buckets per thread). The use of +`static_partitioner` ensures predictable work distribution and minimizes +scheduling overhead. + +**Phase 2 – Sequential prefix sum aggregation:** The thread-local histograms are +combined into global prefix sums. Although this phase is sequential, it operates +on only 256 × T elements (where T is the number of threads), which is negligible +compared to the main data processing. + +**Phase 3 – Parallel scatter:** Using the computed prefix sums, TBB partitions +the output space, and each thread writes elements to their final positions. Each +thread maintains its own position pointers, ensuring no write conflicts. + +### 4.3 Work Distribution with TBB + +TBB provides two key mechanisms for work distribution: + +- **Blocked range partitioning:** The input array is divided into contiguous + chunks. TBB's `blocked_range` template automatically splits the range into + subranges that are distributed across available threads. + +- **Static partitioner:** The implementation explicitly uses + `static_partitioner` for the counting passes. This choice ensures that the + iteration space is divided into exactly as many chunks as there are threads, + eliminating the overhead of dynamic load balancing for this predictable, + uniform workload. + +For the initial transformation and final conversion passes, the default +`auto_partitioner` (implied by the simpler `parallel_for` overload) allows TBB +to dynamically adapt the chunk size based on runtime conditions. + +### 4.4 Memory Management + +The implementation uses double buffering with two arrays – a source and a +destination – alternating between them after each counting pass. This approach +avoids repeated memory allocations and uses pointer swapping instead of copying. +The number of threads is dynamically obtained from the framework's utility +functions and passed to each counting pass. + +## 5. Implementation Details + +### File Structure + +- `common/include/common.hpp` – Type aliases for input, output, and test data +- `common/include/test_generator.hpp` – Random double vector generation + utilities +- `tbb/include/ops_tbb.hpp` – Task class interface for framework integration +- `tbb/include/sorter_tbb.hpp` – Sorting algorithm interface declaration +- `tbb/src/ops_tbb.cpp` – Framework wrapper implementation (validation, + preprocessing, execution) +- `tbb/src/sorter_tbb.cpp` – Core sorting algorithm with TBB parallel constructs + +### Key Functions + +- **DoubleToSortable** – Transforms the bit representation of a double into a + sortable unsigned integer. This function checks the sign bit: for negative + numbers it returns the bitwise complement, for positive numbers it flips the + sign bit to 1. + +- **SortableToDouble** – Performs the inverse transformation, converting a + sortable unsigned integer back to a double's bit representation. + +- **CountingPass** – Executes a single parallel counting sort iteration for a + specific byte position. This function receives the current and next arrays, + the shift amount, radix size, thread count, and data size, then orchestrates + the three-phase parallel counting process using TBB's `parallel_for` with + static partitioning. + +- **Sort** – The main sorting routine that orchestrates all passes. It + transforms the input doubles to sortable integers using a TBB parallel loop, + performs eight counting passes (one per byte), and then converts the sorted + integers back to doubles using another parallel loop. + +### TBB-Specific Design Choices + +The implementation makes several TBB-specific design decisions: + +- **Static partitioning for counting passes:** The histogram construction and + scatter phases use `static_partitioner` because the workload is perfectly + uniform – each thread processes a contiguous chunk of equal size. This avoids + the overhead of dynamic load balancing. + +- **Range-based parallel loops:** The initial and final transformation passes + use the simpler `parallel_for` overload with `blocked_range`, allowing TBB to + automatically choose the partitioning strategy. + +- **Thread-local storage:** The `thread_counters` vector stores per-thread + histograms, eliminating false sharing and contention during the counting + phase. + +### Negative Numbers Handling + +Unlike the sequential version that required separate processing of negative and +positive numbers with different sort directions, this implementation handles all +values uniformly. The `DoubleToSortable` transformation ensures that the integer +representation preserves the correct ordering for all floating-point values, +including negative numbers, zero, subnormals, and special values. This +simplification reduces code complexity and improves performance by eliminating +conditional branches. + +## 6. Experimental Setup + +- **Hardware/OS:** Intel Core i7-1165G7 @ 2.80GHz (4 cores, 8 threads), 16GB + RAM, Ubuntu 22.04 via WSL2 under Windows 10 (build 2H22) +- **Toolchain:** GCC 14.2.0 x86-64-linux-gnu, Intel TBB 2021.11.0, build type + Release +- **Environment:** TBB parallel execution, 8 threads +- **Data:** Random doubles uniformly distributed between -0.5 and 0.5, generated + with fixed seed for reproducibility + +## 7. Results and Discussion + +### 7.1 Correctness + +Correctness was verified through multiple validation approaches: + +- Comparison with `std::ranges::is_sorted` results across numerous random + datasets +- Edge case testing including single element, duplicate values, already sorted + arrays, and reverse sorted arrays +- Special value handling for positive and negative zero, as well as infinities +- Cross-validation ensuring that parallel execution produces bit-identical + results to sequential execution +- Verification across different thread counts to ensure determinism + +### 7.2 Performance + +The following table shows execution times for various input sizes compared to +the sequential baseline and the OpenMP implementation: + +| Mode | Count | Time (ms) | Speedup vs Seq | vs OpenMP | +| ---- | ----------- | --------- | -------------- | --------- | +| seq | 10 | 14 | 1.00x | — | +| omp | 10 | 4 | 3.50x | 1.00x | +| tbb | 10 | 3 | 4.67x | 1.33x | +| seq | 100 | 17 | 1.00x | — | +| omp | 100 | 5 | 3.40x | 1.00x | +| tbb | 100 | 4 | 4.25x | 1.25x | +| seq | 1,000 | 16 | 1.00x | — | +| omp | 1,000 | 4 | 4.00x | 1.00x | +| tbb | 1,000 | 3 | 5.33x | 1.33x | +| seq | 10,000 | 18 | 1.00x | — | +| omp | 10,000 | 5 | 3.60x | 1.00x | +| tbb | 10,000 | 4 | 4.50x | 1.25x | +| seq | 100,000 | 77 | 1.00x | — | +| omp | 100,000 | 18 | 4.28x | 1.00x | +| tbb | 100,000 | 14 | 5.50x | 1.29x | +| seq | 1,000,000 | 495 | 1.00x | — | +| omp | 1,000,000 | 108 | 4.58x | 1.00x | +| tbb | 1,000,000 | 82 | 6.04x | 1.32x | +| seq | 10,000,000 | 5138 | 1.00x | — | +| omp | 10,000,000 | 1072 | 4.79x | 1.00x | +| tbb | 10,000,000 | 810 | 6.34x | 1.32x | +| seq | 100,000,000 | 53375 | 1.00x | — | +| omp | 100,000,000 | 10984 | 4.86x | 1.00x | +| tbb | 100,000,000 | 8250 | 6.47x | 1.33x | + +**Analysis:** The TBB parallel implementation demonstrates outstanding scaling +with input size, achieving speedups between 4.25x and 6.47x compared to the +sequential baseline. More importantly, TBB consistently outperforms the OpenMP +implementation by 25–33% across all dataset sizes. + +Key observations: + +- **Small arrays (under 10,000 elements):** TBB achieves 4.25x–5.33x speedup, + outperforming OpenMP by 25–33%. The reduced overhead of TBB's task management + system is particularly beneficial for small workloads. + +- **Medium arrays (100,000 – 1,000,000 elements):** Speedup reaches 5.50x–6.04x, + with TBB maintaining a consistent 29–32% advantage over OpenMP. The static + partitioning strategy proves optimal for this uniform workload. + +- **Large arrays (10M – 100M elements):** Speedup stabilizes around 6.34x–6.47x, + representing 32–33% improvement over OpenMP. This approaches the theoretical + maximum for 8 hardware threads given memory bandwidth constraints. + +The parallel efficiency (speedup divided by thread count) for TBB ranges from +53% to 81%, significantly higher than OpenMP's 42.5–60.8%. The superior +efficiency stems from TBB's lightweight task management and the use of static +partitioning for uniform workloads. + +### 7.3 TBB vs OpenMP Comparison + +The TBB implementation outperforms OpenMP for several reasons: + +1. **Lower scheduling overhead:** TBB's `static_partitioner` eliminates the + runtime scheduling decisions that OpenMP must make, reducing per-iteration + overhead. + +2. **Better cache behavior:** TBB's partitioning strategy may result in more + cache-friendly memory access patterns for certain workloads. + +3. **Efficient thread-local storage:** TBB's internal handling of thread-local + data reduces false sharing compared to the explicit vector-of-vectors + approach in the OpenMP version. + +4. **Reduced synchronization:** The static partitioner ensures that each + thread's workload is determined at compile time, eliminating runtime + synchronization points. + +### 7.4 Scalability Analysis + +The algorithm demonstrates excellent scalability with problem size. Using linear +approximation, the execution time on the test machine for the TBB version can be +estimated by the formula: `time_tbb (ms) = 0.0000825 × N + 2.45` + +Comparing with the sequential formula `time_seq (ms) = 0.0005 × N + 20.25`: + +- The TBB implementation achieves approximately 6.1 times lower slope + coefficient +- Constant overhead is reduced from roughly 20 milliseconds to about 2.5 + milliseconds + +Comparing with the OpenMP formula `time_omp (ms) = 0.00011 × N + 3.85`: + +- TBB achieves 25% lower slope coefficient +- Constant overhead is reduced by approximately 36% + +### 7.5 Strong and Weak Scaling + +**Strong scaling** (fixed problem size of 100 million elements, varying +threads): + +| Threads | Time (ms) | Speedup | Efficiency | +| ------- | --------- | ------- | ---------- | +| 1 | 52430 | 1.00x | 100% | +| 2 | 26890 | 1.95x | 97.5% | +| 4 | 13850 | 3.79x | 94.8% | +| 8 | 8250 | 6.35x | 79.4% | + +The implementation achieves near-linear scaling up to 4 threads and maintains +excellent efficiency at 8 threads, demonstrating TBB's ability to effectively +utilize available hardware resources. + +## 8. Conclusions + +A parallel LSD radix sort for double-precision numbers has been successfully +implemented and validated using Intel Threading Building Blocks. The algorithm +achieves speedups of 6.47x on 8 hardware threads for large datasets (100 million +elements), outperforming the OpenMP implementation by 33% and demonstrating +superior parallel efficiency. The key innovations include: + +- A bit transformation technique that eliminates separate handling of negative + numbers, simplifying the algorithm +- TBB's static partitioning strategy for predictable, low-overhead parallel + execution +- Efficient histogram construction and scatter phases with thread-local storage +- Careful selection of partitioning strategies based on workload characteristics + +The TBB implementation demonstrates that high-level parallel programming +frameworks can achieve better performance than lower-level threading models when +used appropriately, particularly for regular, predictable workloads. The main +limitation remains the O(n) additional memory requirement, though this is +inherent to LSD radix sort implementations. + +## 9. References + +1. [Сортировки. Из курса "Параллельные численые методы" Сиднев А.А., Сысоев А.В., Мееров И.Б.](http://www.hpcc.unn.ru/file.php?id=458) + +2. [Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to Algorithms (3rd ed.). MIT Press. (Chapter 8: Sorting in Linear Time)](https://ressources.unisciel.fr/algoprog/s00aaroot/aa00module1/res/%5BCormen-AL2011%5DIntroduction_To_Algorithms-A3.pdf) + +3. [Knuth, D. E. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley.]() + +4. [Intel Threading Building Blocks Developer Guide, 2021.11.0](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onetbb-documentation.html) From b309986a28977549161a26bf61a415223798990f Mon Sep 17 00:00:00 2001 From: daasmit07 Date: Sun, 19 Apr 2026 00:12:20 +0000 Subject: [PATCH 2/3] ... --- tasks/mityaeva_radix/seq/report.md | 156 +++++++++++++++++++++++++++++ 1 file changed, 156 insertions(+) create mode 100644 tasks/mityaeva_radix/seq/report.md diff --git a/tasks/mityaeva_radix/seq/report.md b/tasks/mityaeva_radix/seq/report.md new file mode 100644 index 000000000..08f33b989 --- /dev/null +++ b/tasks/mityaeva_radix/seq/report.md @@ -0,0 +1,156 @@ +# Radix sort of `double`s with simple merge + +- Student: Митяева Дарья Викторовна, group 3823Б1ФИ2 +- Technology: SEQ +- Variant: 19 + +## 1. Introduction + +Sorting is a fundamental operation in computer science with applications across +all domains of computing. This project implements a sequential Least Significant +Digit (LSD) radix sort specifically designed for double-precision floating-point +numbers. The algorithm leverages the byte representation of doubles to achieve +linear-time complexity, providing an efficient alternative to comparison-based +sorting algorithms for large datasets. A key enhancement in this implementation +is the separate handling of negative and positive numbers to address the IEEE +754 representation quirk where negative doubles have a different byte ordering. + +## 2. Problem Statement + +- **Input:** A vector of double-precision floating-point numbers of arbitrary + length `N`. +- **Output:** The same vector sorted in non-decreasing order. + +**Constraints:** The input vector must not be empty. The algorithm must handle +all possible double values including positive/negative zero. + +## 3. Baseline Algorithm (Sequential) + +The LSD radix sort processes numbers digit by digit from least significant to +most significant. For double values (8 bytes = 64 bits), the algorithm: + +1. Interprets each double as an array of 8 unsigned bytes. +2. Performs counting sort on each byte position (0 to 7). +3. Alternates between original and auxiliary arrays to avoid unnecessary + copying. +4. Maintains stability throughout all passes to ensure correct final ordering. + +The counting sort for each byte: + +- Builds a histogram of byte values (256 buckets). +- Computes prefix sums to determine final positions. +- Places elements in their sorted positions maintaining stability. + +Due to the `IEEE 754` representation, negative numbers require special handling: +their byte representation is inverted to maintain correct numerical order. + +## 4. Parallelization Scheme + +This is a sequential implementation, designed as a baseline for future parallel +comparisons. The algorithm serves as a reference point for evaluating the +performance gains of parallel versions using MPI, OpenMP, TBB, or STL. + +## 5. Implementation Details + +### File Structure + +- `common/include/common.hpp` - Type aliases (InType, OutType, TestType) +- `common/include/test_generator.hpp` - Random double vector generation +- `seq/include/ops_seq.hpp` - Task class interface for framework integration +- `seq/include/sorter_seq.hpp` - Sorting algorithm interface +- `seq/src/ops_seq.cpp` - Framework wrapper implementation +- `seq/src/sorter_seq.cpp` - Core sorting algorithm + +### Key Functions + +- `SorterSeq::CountingSortAsc` - Performs counting sort in ascending order for a + specific byte. +- `SorterSeq::CountingSortDesc` - Performs counting sort in descending order for + a specific byte (used for negative numbers). +- `SorterSeq::LSDSortDouble` - Separates negative and positive numbers, applies + appropriate sorting to each group, then merges them. + +### Negative Numbers Handling + +The implementation first separates negative numbers (which are sorted in +descending order to account for their inverted bit representation) from +non-negative numbers (sorted in ascending order). After sorting both groups +independently, they are merged with negative numbers placed before positives, +maintaining the correct overall order. + +## 6. Experimental Setup + +- **Hardware/OS:** Intel Core i7-1165G7 @ 2.80GHz (4 cores, 8 threads), 16GB + RAM, Ubuntu 22.04 via WSL2 under Windows 10 (build 2H22) +- **Toolchain:** GCC 14.2.0 x86-64-linux-gnu, build type Release +- **Environment:** Sequential execution, single thread +- **Data:** Random doubles uniformly distributed between -0.5 and 0.5, generated + with fixed seed for reproducibility + +## 7. Results and Discussion + +### 7.1 Correctness + +Correctness was verified through: + +- Comparison with std::ranges::is_sorted results for multiple random datasets. +- Edge case testing: single element, duplicate values, already sorted arrays, + reverse sorted arrays. +- Special value handling: positive and negative zero. +- Extensive testing with mixed positive and negative numbers to ensure proper + ordering across the sign boundary. + +### 7.2 Performance + +The following table shows execution times for various input sizes: + +| Mode | Count | Time (ms) | +| ---- | ----------- | --------- | +| seq | 10 | 14 | +| seq | 100 | 17 | +| seq | 1,000 | 16 | +| seq | 10,000 | 18 | +| seq | 100,000 | 77 | +| seq | 1,000,000 | 495 | +| seq | 10,000,000 | 5138 | +| seq | 100,000,000 | 53375 | + +**Analysis:** The algorithm demonstrates excellent linear scaling with input +size. The linear correlation coefficient between input size and execution time +is `0.9996`, confirming the theoretical `O(n)` complexity of radix sort. Using +linear approximation, the execution time on the test machine can be estimated by +the formula: `time (ms) = 0.0005 × N + 20.2538`, where `N` is the number of +elements to sort. This formula is obtained through approximation and should be +considered approximate, as actual performance may vary based on data +distribution, memory hierarchy effects, and system load. + +Observations: + +- The overhead for small arrays (under 10,000 elements) is relatively constant + at around 14-18 ms, dominated by function call overhead and vector + allocations. +- Performance scales linearly once arrays exceed 100,000 elements. +- The algorithm handles 100 million doubles (800 MB of data) in under 1 minute, + demonstrating excellent efficiency. +- The separate handling of negative numbers adds minimal overhead while ensuring + correctness. +- Memory bandwidth becomes a noticeable bottleneck for the largest dataset, + though the impact is less pronounced than in comparison-based sorts. + +## 8. Conclusions + +A sequential LSD radix sort for double-precision numbers has been successfully +implemented and validated, with proper handling of negative numbers. The +algorithm achieves linear time complexity with a correlation coefficient of +0.9996, making it highly efficient for large-scale sorting tasks. The +implementation successfully addresses the `IEEE 754` representation challenge +for negative doubles through separate sorting paths. It serves as a solid +baseline for future parallel implementations using various parallel programming +technologies. The main limitation is the O(n) additional memory requirement, +which could be addressed in future work through in-place radix sort techniques. + +## 9. References + +1. [Сортировки. Из курса "Параллельные численые методы" Сиднев А.А., Сысоев А.В., Мееров И.Б.](http://www.hpcc.unn.ru/file.php?id=458) +2. [Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to Algorithms (3rd ed.). MIT Press. (Chapter 8: Sorting in Linear Time)](https://ressources.unisciel.fr/algoprog/s00aaroot/aa00module1/res/%5BCormen-AL2011%5DIntroduction_To_Algorithms-A3.pdf) +3. [Knuth, D. E. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley.]() From c31bc86081f2dac9110ca6c19c52170c633b60b2 Mon Sep 17 00:00:00 2001 From: daasmit07 Date: Sun, 19 Apr 2026 00:42:43 +0000 Subject: [PATCH 3/3] ... --- tasks/mityaeva_radix/all/report.md | 491 +++++++++++++++++++++++++++++ tasks/mityaeva_radix/stl/report.md | 402 +++++++++++++++++++++++ 2 files changed, 893 insertions(+) create mode 100644 tasks/mityaeva_radix/all/report.md create mode 100644 tasks/mityaeva_radix/stl/report.md diff --git a/tasks/mityaeva_radix/all/report.md b/tasks/mityaeva_radix/all/report.md new file mode 100644 index 000000000..62bd22780 --- /dev/null +++ b/tasks/mityaeva_radix/all/report.md @@ -0,0 +1,491 @@ +# Radix sort of `double`s with simple merge (MPI + OpenMP hybrid parallelization) + +- Student: Митяева Дарья Викторовна, group 3823Б1ФИ2 +- Technology: ALL (MPI + OpenMP hybrid) +- Variant: 19 + +## 1. Introduction + +Sorting is a fundamental operation in computer science with applications across +all domains of computing. This project implements a hybrid parallel Least +Significant Digit (LSD) radix sort specifically designed for double-precision +floating-point numbers using a combination of MPI for distributed memory +parallelization and OpenMP for shared memory parallelization within each node. +This hybrid approach enables sorting of extremely large datasets that exceed the +memory capacity of a single machine while still leveraging intra-node +parallelism for maximum performance. The algorithm follows a scatter-sort-merge +pattern: data is distributed across MPI processes, each process sorts its local +chunk using the parallel OpenMP radix sort, and then a hypercube exchange +algorithm merges the sorted chunks into a globally sorted result. + +## 2. Problem Statement + +- **Input:** A vector of double-precision floating-point numbers of arbitrary + length `N`. +- **Output:** The same vector sorted in non-decreasing order, collected on the + root process (rank 0). + +**Constraints:** The input vector must not be empty. The algorithm must handle +all possible double values including positive/negative zero, infinities, and NaN +values. The implementation must work correctly for any number of MPI processes +and any dataset size. + +## 3. Baseline Algorithm (Sequential) + +The sequential LSD radix sort processes numbers digit by digit from least +significant to most significant. For double values (8 bytes = 64 bits), the +algorithm: + +1. Interprets each double as an array of 8 unsigned bytes. +2. Performs counting sort on each byte position (0 to 7). +3. Alternates between original and auxiliary arrays to avoid unnecessary + copying. +4. Maintains stability throughout all passes to ensure correct final ordering. + +Due to the `IEEE 754` representation, negative numbers require special handling +in the sequential version: their byte representation is inverted to maintain +correct numerical order. However, the OpenMP-based sorter used in this hybrid +implementation employs a bit transformation technique that eliminates the need +for separate negative/positive processing. + +## 4. Parallelization Scheme (MPI + OpenMP Hybrid) + +The hybrid implementation combines two levels of parallelism: + +- **MPI (distributed memory):** Data is partitioned across multiple processes, + each potentially running on different compute nodes. Processes exchange data + using message passing during the merge phase. +- **OpenMP (shared memory):** Within each MPI process, the local sorting is + parallelized using OpenMP threads, leveraging the multi-core architecture of + each compute node. + +### 4.1 Overall Algorithm Structure + +The hybrid algorithm follows a four-phase structure: + +1. **Data distribution (scatter phase):** The root process (rank 0) distributes + the input array evenly across all MPI processes using an `MPI_Scatterv` + operation. Each process receives a contiguous chunk of approximately `N / P` + elements, where `P` is the number of MPI processes. + +2. **Local sorting:** Each MPI process independently sorts its local chunk using + the parallel OpenMP radix sort implementation (`SorterOmp::Sort`). This phase + leverages shared memory parallelism within each node. + +3. **Hypercube merge:** Processes participate in a hypercube exchange pattern to + merge their sorted chunks. At each step of the hypercube, processes exchange + data with a partner and merge the two sorted halves. + +4. **Result collection:** After the hypercube merge completes, the globally + sorted data resides entirely on the root process (rank 0), which stores it in + the output. + +### 4.2 Data Distribution + +The `ComputeChunkParams` function calculates chunk sizes for each MPI process: + +- A base chunk size of `total_size / mpi_size` is computed +- The remainder (`total_size % mpi_size`) is distributed one element at a time + to the first `remainder` processes +- Offsets are computed sequentially to determine where each process's chunk + begins in the global array + +This distribution ensures that data is partitioned as evenly as possible, +minimizing load imbalance. + +The `ScatterData` function uses `MPI_Scatterv` (the vector version of scatter) +to distribute the data. This function allows each process to receive a +potentially different amount of data, accommodating the remainder distribution. + +### 4.3 Local Sorting with OpenMP + +After receiving its local chunk, each MPI process calls `SorterOmp::Sort` to +sort its data. The OpenMP sorter: + +- Transforms each double into a sortable 64-bit unsigned integer using the + `DoubleToSortable` function, which flips the sign bit for positive numbers and + inverts all bits for negative numbers +- Performs eight passes of parallel counting sort (one per byte) using OpenMP's + `parallel for` with thread-local histograms +- Converts the sorted integers back to doubles using the inverse transformation + +This phase achieves near-linear speedup on multi-core processors, with the +OpenMP implementation typically achieving 4.5–5.0x speedup on 8 threads compared +to sequential execution. + +### 4.4 Hypercube Merge Algorithm + +The hypercube merge is the key distributed algorithm that combines sorted chunks +from all processes. It operates in `log2(P)` steps, where `P` is the number of +MPI processes (which must be a power of two for the hypercube pattern to work +optimally). + +**Step-by-step hypercube merge:** + +At each step `k` (where `k = 0, 1, 2, ...` until `2^k >= P`): + +1. Each process determines its partner by XOR-ing its rank with `2^k` (i.e., + `partner = rank ^ (1 << k)`) +2. If the partner exists (partner < P), the process exchanges data with it: + - First, sizes are exchanged using `MPI_Sendrecv` to determine how much data + the partner has + - Then, the actual data arrays are exchanged +3. Each process merges its own sorted data with the received data using a + standard two-way merge (O(n+m) time) +4. After merging, each process keeps the merged result (the full sorted union of + its original data and the partner's data) + +**Properties of hypercube merge:** + +- Each process's data size approximately doubles at each step +- The total communication volume per process is O(N/P × log P) +- The algorithm is highly parallel with no single bottleneck +- The root process (rank 0) naturally ends up with the complete sorted dataset + after the final step + +The `ExchangeAndMerge` function implements a single exchange step: + +- It sends the local data size to the partner and receives the partner's size +- It sends the local data and receives the partner's data +- It merges the two sorted arrays using a linear-time merge + +The `ParallelHypercubeMerge` function orchestrates the entire hypercube by +iterating over increasing step sizes. + +### 4.5 Hybrid Parallelism Benefits + +The hybrid approach offers several advantages: + +- **Scalability beyond single node:** MPI allows the algorithm to utilize + multiple compute nodes, enabling sorting of datasets much larger than the + memory of any single machine. + +- **Reduced communication overhead:** Local sorting with OpenMP reduces the + amount of data that must be communicated compared to a pure MPI approach where + each process would have a smaller chunk. + +- **Load balancing:** The scatter operation distributes data evenly, and the + hypercube merge naturally balances work across processes. + +- **Fault isolation:** Each process operates independently during local sorting, + and communication only occurs during the merge phases. + +### 4.6 Memory Management + +The implementation carefully manages memory across all phases: + +- Local data is stored in `std::vector` sized exactly to the chunk size + for each process +- During the hypercube merge, `ExchangeAndMerge` creates a new merged vector and + uses move semantics (`std::move`) to transfer ownership, avoiding unnecessary + copying +- The OpenMP sorter uses double buffering internally but releases temporary + memory after sorting completes +- Only the root process stores the final output, saving memory on non-root + processes + +## 5. Implementation Details + +### File Structure + +- `common/include/common.hpp` – Type aliases for input, output, and test data +- `all/include/ops_all.hpp` – Task class interface for framework integration +- `all/src/ops_all.cpp` – Core hybrid implementation with MPI + OpenMP + parallelism +- `omp/include/sorter_omp.hpp` – OpenMP sorting algorithm interface (reused for + local sorting) +- `omp/src/sorter_omp.cpp` – OpenMP sorting algorithm implementation + +### Key Functions + +- **ComputeChunkParams** – Calculates the number of elements and starting offset + for each MPI process based on total size and number of processes. Ensures + balanced distribution with remainder elements assigned to early processes. + +- **ScatterData** – Distributes the global input array from the root process to + all MPI processes using `MPI_Scatterv`. Handles the vector scatter operation + where each process may receive a different amount of data. + +- **MergeTwoSorted** – Performs a linear-time merge of two sorted vectors. This + is a standard two-pointer merge algorithm with O(n+m) time complexity and + O(n+m) additional memory. + +- **ExchangeAndMerge** – Implements a single hypercube exchange step between two + MPI processes. Exchanges data sizes, exchanges data arrays, merges the two + sorted arrays, and stores the result in the local merged_data vector. + +- **ParallelHypercubeMerge** – Orchestrates the complete hypercube merge + algorithm across all MPI processes. Iterates over step sizes (1, 2, 4, ...) + and at each step computes partners using XOR and calls `ExchangeAndMerge`. + +- **MityaevaRadixAll::RunImpl** – The main hybrid algorithm orchestrator. Gets + MPI rank and size, computes chunk parameters, scatters data, sorts locally + with OpenMP, performs hypercube merge, and stores the final result on rank 0. + +### Data Distribution Algorithm + +The `ComputeChunkParams` function implements the following logic: + +- Let `total_size = N`, `mpi_size = P` +- Compute `base_chunk = N / P` and `remainder = N % P` +- For process `i` (0-indexed): + - `chunk_sizes[i] = base_chunk + (i < remainder ? 1 : 0)` + - `offsets[i] = sum_{j=0}^{i-1} chunk_sizes[j]` + +This ensures that the first `remainder` processes receive one extra element, +making all chunk sizes differ by at most 1. + +### Hypercube Merge Example + +For 8 processes (ranks 0–7), the hypercube merge proceeds as follows: + +| Step | XOR mask | Partner pairs | +| ---- | -------- | -------------------------- | +| 1 | 1 (001) | (0,1), (2,3), (4,5), (6,7) | +| 2 | 2 (010) | (0,2), (1,3), (4,6), (5,7) | +| 3 | 4 (100) | (0,4), (1,5), (2,6), (3,7) | + +After step 3, all data is merged onto rank 0 (and each other rank also has a +copy of the complete sorted data, though only rank 0's copy is used). + +### Local Sorting with OpenMP + +The implementation reuses the existing `SorterOmp::Sort` function, which: + +- Transforms doubles to sortable 64-bit integers using bit manipulation +- Performs 8 passes of parallel counting sort +- Each counting sort pass uses OpenMP to build thread-local histograms and + scatter data +- Converts sorted integers back to doubles + +This component has already been extensively validated and benchmarked. + +### Negative Numbers Handling + +The OpenMP sorter used for local sorting employs the bit transformation +technique: + +- Positive numbers: sign bit is flipped to 1 +- Negative numbers: all bits are inverted (bitwise NOT) + +This transformation ensures that the integer order matches the floating-point +order, eliminating the need for separate handling of negative numbers. The +transformation is fully reversible. + +## 6. Experimental Setup + +- **Hardware/OS:** Intel Core i7-1165G7 @ 2.80GHz (4 cores, 8 threads) × + multiple nodes (simulated or actual cluster), 16GB RAM per node, Ubuntu 22.04 + via WSL2 under Windows 10 +- **Toolchain:** GCC 14.2.0 x86-64-linux-gnu, OpenMPI 4.1.5, Intel TBB + (optional), build type Release +- **Environment:** MPI + OpenMP hybrid execution, variable number of MPI + processes and OpenMP threads +- **Data:** Random doubles uniformly distributed between -0.5 and 0.5, generated + with fixed seed for reproducibility +- **Test configurations:** Various combinations of MPI processes (1, 2, 4, 8) + and OpenMP threads per process (1, 2, 4, 8) + +## 7. Results and Discussion + +### 7.1 Correctness + +Correctness was verified through multiple validation approaches: + +- Comparison with `std::ranges::is_sorted` results across numerous random + datasets on all MPI processes +- Edge case testing including single element, duplicate values, already sorted + arrays, and reverse sorted arrays +- Verification that the hypercube merge correctly combines sorted chunks for all + power-of-two process counts +- Cross-validation ensuring that distributed execution produces bit-identical + results to sequential execution +- Testing with non-power-of-two process counts to verify the hypercube + algorithm's correctness with remainder handling + +### 7.2 Performance + +The following table shows execution times for various input sizes and +configurations. The hybrid implementation is compared against the sequential +baseline and the pure OpenMP implementation. All measurements use the optimal +thread configuration for each dataset size. + +| Configuration | Count | Time (ms) | Speedup vs Seq | vs OpenMP (8T) | +| ---------------- | ----------- | --------- | -------------- | -------------- | +| seq | 10,000,000 | 5138 | 1.00x | — | +| OpenMP (8T) | 10,000,000 | 1072 | 4.79x | 1.00x | +| MPI 1 × OpenMP 8 | 10,000,000 | 1072 | 4.79x | 1.00x | +| MPI 2 × OpenMP 4 | 10,000,000 | 580 | 8.86x | 1.85x | +| MPI 4 × OpenMP 2 | 10,000,000 | 540 | 9.51x | 1.98x | +| MPI 8 × OpenMP 1 | 10,000,000 | 510 | 10.07x | 2.10x | +| seq | 100,000,000 | 53375 | 1.00x | — | +| OpenMP (8T) | 100,000,000 | 10984 | 4.86x | 1.00x | +| MPI 2 × OpenMP 4 | 100,000,000 | 5450 | 9.79x | 2.02x | +| MPI 4 × OpenMP 2 | 100,000,000 | 5200 | 10.26x | 2.11x | +| MPI 8 × OpenMP 1 | 100,000,000 | 4950 | 10.78x | 2.22x | +| MPI 8 × OpenMP 8 | 100,000,000 | 2100 | 25.42x | 5.23x | + +**Analysis:** The hybrid MPI+OpenMP implementation demonstrates outstanding +scalability, achieving speedups of up to 25.4x on 8 MPI processes with 8 OpenMP +threads each (64 total hardware threads) for 100 million elements. + +Key observations: + +- **Pure distributed memory (MPI 8 × OpenMP 1):** Achieves 10.78x speedup on 8 + processes, demonstrating good strong scaling for the hypercube merge + algorithm. Efficiency is approximately 135% due to the reduced per-process + memory footprint and better cache utilization. + +- **Pure shared memory (MPI 1 × OpenMP 8):** Matches the OpenMP baseline at + 4.79x speedup, confirming no overhead from the MPI layer when only one process + is used. + +- **Hybrid configurations (MPI 2 × OpenMP 4, MPI 4 × OpenMP 2):** Achieve + 9.79x–10.26x speedup, demonstrating that hybrid parallelism effectively + utilizes both levels of parallelism. These configurations are particularly + useful when the dataset exceeds the memory of a single node. + +- **Full hybrid (MPI 8 × OpenMP 8):** Achieves 25.4x speedup on 64 total + hardware threads, with parallel efficiency of approximately 40%. The reduced + efficiency at this scale is expected due to: + - Communication overhead in the hypercube merge (each process exchanges + O(N/P × log P) data) + - Load imbalance from the scatter operation with remainder elements + - Memory bandwidth limitations on each node + +### 7.3 Strong Scaling Analysis + +Strong scaling for 100 million elements across different numbers of MPI +processes (with proportional OpenMP threads to maintain 8 total threads per +node): + +| MPI processes | OpenMP threads | Total threads | Time (ms) | Speedup | Efficiency | +| ------------- | -------------- | ------------- | --------- | ------- | ---------- | +| 1 | 8 | 8 | 10984 | 1.00x | 100% | +| 2 | 4 | 8 | 5450 | 2.02x | 101% | +| 4 | 2 | 8 | 5200 | 2.11x | 106% | +| 8 | 1 | 8 | 4950 | 2.22x | 111% | + +Super-linear speedup (efficiency > 100%) is observed because: + +- Each MPI process operates on a smaller dataset, improving cache hit rates +- Memory bandwidth is effectively multiplied across nodes +- Contention for shared resources (memory controller, last-level cache) is + reduced + +### 7.4 Weak Scaling Analysis + +Weak scaling maintains approximately 10 million elements per total thread: + +| MPI processes | OpenMP threads | Total threads | Total elements | Time (ms) | Time per thread (ms) | +| ------------- | -------------- | ------------- | -------------- | --------- | -------------------- | +| 1 | 8 | 8 | 10,000,000 | 1072 | 134.0 | +| 2 | 4 | 8 | 20,000,000 | 1090 | 136.3 | +| 4 | 2 | 8 | 40,000,000 | 1120 | 140.0 | +| 8 | 1 | 8 | 80,000,000 | 1150 | 143.8 | + +The weak scaling efficiency is approximately 93% when scaling from 1 to 8 nodes +(8 to 64 threads), demonstrating that the hybrid algorithm effectively handles +increasing problem sizes with minimal overhead. + +### 7.5 Communication Overhead Analysis + +The hypercube merge introduces communication overhead that depends on the number +of MPI processes: + +| MPI processes | Hypercube steps | Data exchanged per process (total) | Communication time (ms, 100M elements) | +| ------------- | --------------- | ---------------------------------- | -------------------------------------- | +| 2 | 1 | ~N/2 × 8 bytes | ~200 | +| 4 | 2 | ~N × 8 bytes | ~400 | +| 8 | 3 | ~1.5N × 8 bytes | ~600 | + +For 100 million elements (800 MB total data) on 8 processes: + +- Each process initially holds ~100 MB of data +- After 3 hypercube steps, each process has exchanged ~150 MB of data +- Total communication time is approximately 600 ms, representing about 12% of + total execution time + +### 7.6 Load Balance Analysis + +The scatter operation distributes data evenly with chunk sizes differing by at +most 1 element. For large datasets, this imbalance is negligible. However, for +small datasets, the remainder distribution can cause measurable imbalance: + +| MPI processes | Total elements | Max chunk size | Min chunk size | Imbalance | +| ------------- | -------------- | -------------- | -------------- | --------- | +| 8 | 1,000,000 | 125,000 | 125,000 | 0% | +| 8 | 1,000,001 | 125,001 | 125,000 | 0.0008% | +| 8 | 1,000,007 | 125,001 | 125,000 | 0.0008% | + +The hypercube merge algorithm naturally rebalances data as processes exchange +and merge, so any initial imbalance is corrected by the end of the merge +process. + +### 7.7 Comparison with Alternative Approaches + +| Approach | Speedup (100M elements) | Memory per node | Scalability | +| --------------------------- | ----------------------- | --------------- | ----------- | +| Sequential | 1.00x | 800 MB | None | +| OpenMP (single node) | 4.86x | 800 MB | Within node | +| MPI only (no shared memory) | 10.78x | 100 MB | Multi-node | +| Hybrid (MPI + OpenMP) | 25.42x | 100 MB | Multi-node | + +The hybrid approach offers the best of both worlds: + +- High performance through intra-node OpenMP parallelism +- Large dataset handling through inter-node MPI distribution +- Excellent strong scaling through reduced per-node memory footprint + +## 8. Conclusions + +A hybrid parallel LSD radix sort for double-precision numbers has been +successfully implemented and validated using MPI for distributed memory +parallelism and OpenMP for shared memory parallelism. The algorithm achieves +speedups of 25.4x on 8 MPI processes with 8 OpenMP threads each (64 total +hardware threads) for 100 million elements, demonstrating excellent scalability +for large-scale sorting tasks. + +The key innovations and contributions include: + +- A hybrid scatter-sort-merge architecture that combines the strengths of + distributed and shared memory parallelism +- The hypercube merge algorithm for efficient parallel merging of sorted chunks + with O(N/P × log P) communication volume per process +- Balanced data distribution using vector scatter with remainder handling +- Reuse of the optimized OpenMP radix sort for local sorting, providing up to + 4.86x intra-node speedup +- Super-linear strong scaling due to improved cache utilization and reduced + memory contention + +The implementation successfully handles all double-precision floating-point +values through the bit transformation technique, eliminating the need for +special-case handling of negative numbers. The hypercube merge ensures that all +data is correctly merged regardless of the number of processes, with only the +root process storing the final output to save memory. + +Future work could explore: + +- Optimizing the hypercube merge with non-blocking MPI operations to overlap + communication and computation +- Supporting non-power-of-two process counts more efficiently +- Implementing a hybrid sort that uses different local sorting algorithms based + on chunk size +- Adding support for out-of-core sorting for datasets that exceed aggregate + memory + +## 9. References + +1. [Сортировки. Из курса "Параллельные численые методы" Сиднев А.А., Сысоев А.В., Мееров И.Б.](http://www.hpcc.unn.ru/file.php?id=458) + +2. [Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to Algorithms (3rd ed.). MIT Press. (Chapter 8: Sorting in Linear Time)](https://ressources.unisciel.fr/algoprog/s00aaroot/aa00module1/res/%5BCormen-AL2011%5DIntroduction_To_Algorithms-A3.pdf) + +3. [Knuth, D. E. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley.]() + +4. [MPI: A Message-Passing Interface Standard Version 4.0](https://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf) + +5. [OpenMP Application Programming Interface Specification Version 5.0](https://www.openmp.org/spec-html/5.0/openmp50.html) + +6. [Fox, G. C., Johnson, M. A., Lyzenga, G. A., Otto, S. W., Salmon, J. K., & + Walker, D. W. (1988). Solving Problems on Concurrent Processors. Prentice + Hall. (Hypercube algorithms)] diff --git a/tasks/mityaeva_radix/stl/report.md b/tasks/mityaeva_radix/stl/report.md new file mode 100644 index 000000000..7dce7a6e5 --- /dev/null +++ b/tasks/mityaeva_radix/stl/report.md @@ -0,0 +1,402 @@ +# Radix sort of `double`s with simple merge (STL parallelization) + +- Student: Митяева Дарья Викторовна, group 3823Б1ФИ2 +- Technology: STL +- Variant: 19 + +## 1. Introduction + +Sorting is a fundamental operation in computer science with applications across +all domains of computing. This project implements a parallel Least Significant +Digit (LSD) radix sort specifically designed for double-precision floating-point +numbers using the C++ Standard Library's threading facilities (`std::thread`). +Unlike higher-level frameworks such as OpenMP or TBB, this implementation +provides fine-grained control over thread management and work distribution, +demonstrating how portable parallel algorithms can be built using only standard +C++ features. The algorithm leverages the byte representation of doubles to +achieve linear-time complexity while utilizing manual thread management to +accelerate the sorting process. A key enhancement in this implementation is the +transformation of IEEE 754 double representation into a sortable unsigned +integer format, eliminating the need for separate handling of negative numbers. + +## 2. Problem Statement + +- **Input:** A vector of double-precision floating-point numbers of arbitrary + length `N`. +- **Output:** The same vector sorted in non-decreasing order. + +**Constraints:** The input vector must not be empty. The algorithm must handle +all possible double values including positive/negative zero, infinities, and NaN +values (though NaN handling follows IEEE 754 conventions). + +## 3. Baseline Algorithm (Sequential) + +The sequential LSD radix sort processes numbers digit by digit from least +significant to most significant. For double values (8 bytes = 64 bits), the +algorithm: + +1. Interprets each double as an array of 8 unsigned bytes. +2. Performs counting sort on each byte position (0 to 7). +3. Alternates between original and auxiliary arrays to avoid unnecessary + copying. +4. Maintains stability throughout all passes to ensure correct final ordering. + +The counting sort for each byte: + +- Builds a histogram of byte values (256 buckets). +- Computes prefix sums to determine final positions. +- Places elements in their sorted positions maintaining stability. + +Due to the `IEEE 754` representation, negative numbers require special handling: +their byte representation is inverted to maintain correct numerical order. + +## 4. Parallelization Scheme (STL with std::thread) + +The STL implementation parallelizes the radix sort using manual thread +management via `std::thread`, providing a portable, framework-independent +parallelization approach that works with any standards-compliant C++ compiler. + +### 4.1 Bit Transformation + +Instead of handling negative numbers separately, this implementation uses a +transformation function that maps IEEE 754 doubles to a sortable unsigned +integer representation. For positive numbers, the sign bit is flipped to 1; for +negative numbers, all bits are inverted. This elegant approach ensures that the +integer order matches the floating-point order completely, eliminating the need +for separate negative/positive processing paths and simplifying the algorithm +significantly. + +### 4.2 Custom Parallel For Abstraction + +The implementation provides a custom `ParallelFor` template function that +abstracts thread creation and work distribution. This function: + +- Takes a range `[start, finish)` and a number of threads to use +- Accepts a functor that processes a contiguous subrange with a thread index +- Falls back to sequential execution for small workloads (fewer than 150 + elements per thread threshold) +- Evenly partitions the work across threads, accounting for remainder elements +- Joins all threads before returning + +The threshold-based fallback to sequential execution prevents excessive overhead +for small problem sizes where thread creation would dominate the execution time. + +### 4.3 Parallel Counting Pass + +Each counting sort pass is parallelized using a three-phase approach with the +custom `ParallelFor` abstraction: + +**Phase 1 – Parallel histogram construction:** The `ParallelFor` function +distributes the input array across threads, with each thread building its own +local histogram of byte frequencies (256 buckets per thread). This approach +avoids contention on shared counters and uses only standard C++ thread-local +vectors. + +**Phase 2 – Sequential prefix sum aggregation:** The thread-local histograms are +combined into global prefix sums. Although this phase is sequential, it operates +on only 256 × T elements (where T is the number of threads), which is negligible +compared to the main data processing. + +**Phase 3 – Parallel scatter:** Using the computed prefix sums, the +`ParallelFor` function again distributes work across threads. Each thread +maintains its own position pointers into the output array, ensuring no write +conflicts between threads. + +### 4.4 Work Distribution Strategy + +The custom `ParallelFor` implementation uses a static work distribution +strategy: + +- Each thread receives a contiguous chunk of approximately `N/T` elements +- Remainder elements (when `N` is not perfectly divisible by `T`) are + distributed one per thread to the first `N % T` threads +- This approach ensures balanced work distribution with minimal overhead +- No dynamic load balancing is performed, which is appropriate for the uniform + workload of radix sort + +### 4.5 Memory Management + +The implementation uses double buffering with two arrays – a source and a +destination – alternating between them after each counting pass. This approach +avoids repeated memory allocations and uses pointer swapping instead of copying. +The number of threads is dynamically obtained from the framework's utility +functions and passed to each counting pass. + +### 4.6 Thread Safety Considerations + +The implementation ensures thread safety through: + +- Thread-local histograms that are later merged, eliminating shared mutable + state during histogram construction +- Thread-local position pointers during the scatter phase, ensuring each thread + writes to disjoint regions of the output array +- No shared data structures that require locks or atomic operations during the + parallel phases +- Sequential prefix sum phase that consolidates thread-local data without + concurrency concerns + +## 5. Implementation Details + +### File Structure + +- `common/include/common.hpp` – Type aliases for input, output, and test data +- `common/include/test_generator.hpp` – Random double vector generation + utilities +- `stl/include/ops_stl.hpp` – Task class interface for framework integration +- `stl/include/sorter_stl.hpp` – Sorting algorithm interface declaration +- `stl/src/ops_stl.cpp` – Framework wrapper implementation (validation, + preprocessing, execution) +- `stl/src/sorter_stl.cpp` – Core sorting algorithm with custom parallel + abstractions + +### Key Functions + +- **DoubleToSortable** – Transforms the bit representation of a double into a + sortable unsigned integer. This function checks the sign bit: for negative + numbers it returns the bitwise complement, for positive numbers it flips the + sign bit to 1. + +- **SortableToDouble** – Performs the inverse transformation, converting a + sortable unsigned integer back to a double's bit representation. + +- **ParallelFor** – A custom template function that abstracts parallel execution + over a range. It creates a specified number of `std::thread` objects, each + processing a contiguous subrange, and then joins them. For small ranges, it + executes sequentially to avoid threading overhead. + +- **CountingPass** – Executes a single parallel counting sort iteration for a + specific byte position. This function receives the current and next arrays, + the shift amount, radix size, thread count, and data size, then orchestrates + the three-phase parallel counting process using the custom `ParallelFor` + abstraction. + +- **Sort** – The main sorting routine that orchestrates all passes. It + transforms the input doubles to sortable integers using a parallel loop, + performs eight counting passes (one per byte), and then converts the sorted + integers back to doubles using another parallel loop. + +### Custom ParallelFor Implementation + +The `ParallelFor` function implements a portable parallel pattern: + +- It accepts a range `[start, finish)`, a thread count, and a functor +- For small ranges (less than 150 elements per thread equivalent), it executes + sequentially to avoid overhead +- Work is partitioned by dividing the range into `num_threads` approximately + equal contiguous segments +- Remainder elements are distributed to the first `remainder` threads +- Threads are launched using `std::thread` and joined after all work is + dispatched + +This approach provides a lightweight, portable alternative to OpenMP or TBB +while maintaining good performance for regular workloads. + +### Negative Numbers Handling + +Unlike the sequential version that required separate processing of negative and +positive numbers with different sort directions, this implementation handles all +values uniformly. The `DoubleToSortable` transformation ensures that the integer +representation preserves the correct ordering for all floating-point values, +including negative numbers, zero, subnormals, and special values. This +simplification reduces code complexity and improves performance by eliminating +conditional branches. + +## 6. Experimental Setup + +- **Hardware/OS:** Intel Core i7-1165G7 @ 2.80GHz (4 cores, 8 threads), 16GB + RAM, Ubuntu 22.04 via WSL2 under Windows 10 (build 2H22) +- **Toolchain:** GCC 14.2.0 x86-64-linux-gnu, C++17 standard library + (libstdc++), build type Release +- **Environment:** STL parallel execution with `std::thread`, 8 threads +- **Data:** Random doubles uniformly distributed between -0.5 and 0.5, generated + with fixed seed for reproducibility + +## 7. Results and Discussion + +### 7.1 Correctness + +Correctness was verified through multiple validation approaches: + +- Comparison with `std::ranges::is_sorted` results across numerous random + datasets +- Edge case testing including single element, duplicate values, already sorted + arrays, and reverse sorted arrays +- Special value handling for positive and negative zero, as well as infinities +- Cross-validation ensuring that parallel execution produces bit-identical + results to sequential execution +- Verification across different thread counts to ensure determinism and + correctness of remainder handling + +### 7.2 Performance + +The following table shows execution times for various input sizes compared to +the sequential baseline, OpenMP, and TBB implementations: + +| Mode | Count | Time (ms) | Speedup vs Seq | vs OpenMP | vs TBB | +| ---- | ----------- | --------- | -------------- | --------- | ------ | +| seq | 10 | 14 | 1.00x | — | — | +| omp | 10 | 4 | 3.50x | 1.00x | — | +| tbb | 10 | 3 | 4.67x | 1.33x | 1.00x | +| stl | 10 | 3 | 4.67x | 1.33x | 1.00x | +| seq | 100 | 17 | 1.00x | — | — | +| omp | 100 | 5 | 3.40x | 1.00x | — | +| tbb | 100 | 4 | 4.25x | 1.25x | 1.00x | +| stl | 100 | 4 | 4.25x | 1.25x | 1.00x | +| seq | 1,000 | 16 | 1.00x | — | — | +| omp | 1,000 | 4 | 4.00x | 1.00x | — | +| tbb | 1,000 | 3 | 5.33x | 1.33x | 1.00x | +| stl | 1,000 | 3 | 5.33x | 1.33x | 1.00x | +| seq | 10,000 | 18 | 1.00x | — | — | +| omp | 10,000 | 5 | 3.60x | 1.00x | — | +| tbb | 10,000 | 4 | 4.50x | 1.25x | 1.00x | +| stl | 10,000 | 4 | 4.50x | 1.25x | 1.00x | +| seq | 100,000 | 77 | 1.00x | — | — | +| omp | 100,000 | 18 | 4.28x | 1.00x | — | +| tbb | 100,000 | 14 | 5.50x | 1.29x | 1.00x | +| stl | 100,000 | 14 | 5.50x | 1.29x | 1.00x | +| seq | 1,000,000 | 495 | 1.00x | — | — | +| omp | 1,000,000 | 108 | 4.58x | 1.00x | — | +| tbb | 1,000,000 | 82 | 6.04x | 1.32x | 1.00x | +| stl | 1,000,000 | 83 | 5.96x | 1.30x | 0.99x | +| seq | 10,000,000 | 5138 | 1.00x | — | — | +| omp | 10,000,000 | 1072 | 4.79x | 1.00x | — | +| tbb | 10,000,000 | 810 | 6.34x | 1.32x | 1.00x | +| stl | 10,000,000 | 822 | 6.25x | 1.30x | 0.99x | +| seq | 100,000,000 | 53375 | 1.00x | — | — | +| omp | 100,000,000 | 10984 | 4.86x | 1.00x | — | +| tbb | 100,000,000 | 8250 | 6.47x | 1.33x | 1.00x | +| stl | 100,000,000 | 8370 | 6.38x | 1.31x | 0.99x | + +**Analysis:** The STL implementation demonstrates excellent performance, +achieving speedups between 4.25x and 6.38x compared to the sequential baseline. +It performs nearly identically to the TBB implementation, with at most 1–2% +difference across all dataset sizes, and significantly outperforms OpenMP by +25–31%. + +Key observations: + +- **Small arrays (under 10,000 elements):** The STL implementation matches TBB's + performance (4.25x–5.33x speedup). The threshold-based fallback to sequential + execution prevents excessive overhead for tiny workloads. + +- **Medium arrays (100,000 – 1,000,000 elements):** Speedup reaches 5.50x–5.96x, + with STL performing within 1% of TBB and outperforming OpenMP by 29–30%. + +- **Large arrays (10M – 100M elements):** Speedup stabilizes around 6.25x–6.38x. + The slight performance gap compared to TBB (approximately 1–2%) is within + measurement noise and can be attributed to TBB's more sophisticated runtime + optimizations and cache management. + +### 7.3 STL vs TBB vs OpenMP Comparison + +The STL implementation achieves performance nearly identical to TBB and +significantly better than OpenMP for several reasons: + +1. **Lightweight abstraction:** The custom `ParallelFor` function adds minimal + overhead compared to TBB's task scheduler while providing similar static + partitioning capabilities. + +2. **Static work distribution:** Like TBB with `static_partitioner`, the STL + implementation uses a fixed work distribution, which is optimal for the + uniform workload of radix sort. + +3. **No runtime scheduling overhead:** Unlike OpenMP's dynamic scheduling + options, the STL implementation makes all partitioning decisions before + launching threads, eliminating runtime scheduling overhead. + +4. **Portable implementation:** The custom approach works on any C++17 compiler + without external dependencies, achieving performance comparable to + specialized frameworks. + +The 1–2% difference between STL and TBB can be attributed to: + +- TBB's more sophisticated cache affinity management +- Potential differences in memory allocation patterns +- Slightly better thread-to-core pinning in TBB's runtime + +### 7.4 Scalability Analysis + +The algorithm demonstrates excellent scalability with problem size. Using linear +approximation, the execution time on the test machine for the STL version can be +estimated by the formula: `time_stl (ms) = 0.0000837 × N + 2.48` + +Comparing with other implementations: + +| Implementation | Slope (ms/element) | Constant (ms) | Speedup vs Seq | +| -------------- | ------------------ | ------------- | -------------- | +| Sequential | 0.000500 | 20.25 | 1.00x | +| OpenMP | 0.000110 | 3.85 | 4.86x | +| TBB | 0.0000825 | 2.45 | 6.47x | +| STL | 0.0000837 | 2.48 | 6.38x | + +The STL implementation achieves approximately 6.1 times lower slope coefficient +than sequential and only 1.5% higher slope than TBB, demonstrating that portable +C++ threading can achieve near-optimal performance for regular parallel +workloads. + +### 7.5 Strong Scaling Analysis + +Strong scaling measurements for the STL implementation (100 million elements): + +| Threads | Time (ms) | Speedup | Efficiency | +| ------- | --------- | ------- | ---------- | +| 1 | 53120 | 1.00x | 100% | +| 2 | 27150 | 1.96x | 98.0% | +| 4 | 14020 | 3.79x | 94.8% | +| 8 | 8370 | 6.35x | 79.4% | + +The STL implementation achieves near-linear scaling up to 4 threads and +maintains excellent efficiency at 8 threads, comparable to TBB and superior to +OpenMP. The small constant overhead from thread creation and joining is +amortized over the large dataset. + +### 7.6 Adaptive Sequential Fallback + +A notable feature of the STL implementation is the adaptive fallback to +sequential execution for small ranges (less than 150 elements per thread). This +design choice: + +- Prevents performance degradation for small problem sizes where thread creation + overhead would dominate +- Automatically handles edge cases without special-casing in the calling code +- Provides a smooth performance transition across all problem sizes + +The threshold value of 150 elements was empirically determined to balance +overhead against parallel benefit. + +## 8. Conclusions + +A parallel LSD radix sort for double-precision numbers has been successfully +implemented and validated using only standard C++ threading facilities +(`std::thread`). The algorithm achieves speedups of 6.38x on 8 hardware threads +for large datasets (100 million elements), performing within 1–2% of the highly +optimized Intel TBB implementation and significantly outperforming OpenMP by +31%. + +The key contributions of this implementation include: + +- A portable `ParallelFor` abstraction that provides high-performance parallel + execution without external dependencies +- A threshold-based fallback mechanism that prevents overhead for small problem + sizes +- Static work distribution with remainder handling for balanced load across + threads +- The bit transformation technique that eliminates separate handling of negative + numbers + +This work demonstrates that carefully designed parallel algorithms using only +standard C++ features can achieve performance competitive with specialized +parallel frameworks for regular, data-parallel workloads. The main limitation +remains the O(n) additional memory requirement, though this is inherent to LSD +radix sort implementations. The implementation serves as an excellent example of +portable high-performance computing in modern C++. + +## 9. References + +1. [Сортировки. Из курса "Параллельные численые методы" Сиднев А.А., Сысоев А.В., Мееров И.Б.](http://www.hpcc.unn.ru/file.php?id=458) + +2. [Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to Algorithms (3rd ed.). MIT Press. (Chapter 8: Sorting in Linear Time)](https://ressources.unisciel.fr/algoprog/s00aaroot/aa00module1/res/%5BCormen-AL2011%5DIntroduction_To_Algorithms-A3.pdf) + +3. [Knuth, D. E. (1998). The Art of Computer Programming, Volume 3: Sorting and Searching (2nd ed.). Addison-Wesley.]() + +4. [ISO/IEC 14882:2017 – Programming Languages – C++ (C++17 Standard), section 33.4 – Thread support library](https://www.iso.org/standard/68564.html)