Performance optimization for time stepping in cylindrical coordinates for m ≠ 0#3158
Performance optimization for time stepping in cylindrical coordinates for m ≠ 0#3158oskooi wants to merge 2 commits intoNanoComp:masterfrom
Conversation
src/step_db.cpp
Outdated
| const int nr_loop = (gv.big_corner().in_direction(R) - is_r) / 2 + 1; | ||
| realnum *rinv_arr = new realnum[nr_loop]; | ||
| for (int ir = 0; ir < nr_loop; ++ir) | ||
| rinv_arr[ir] = the_m / (is_r + 2 * ir); |
There was a problem hiding this comment.
A suggested rewrite to guard against division by zero at r = 0:
for (int ir = 0; ir < nr_loop; ++ir) {
int r_idx = is_r + 2 * ir;
rinv_arr[ir] = (r_idx != 0) ? the_m / r_idx : 0;
}There was a problem hiding this comment.
I would rather leave it as-is.
We only divide by zero here in cases where we aren't actually going to use rinv_arr[0] (for the Z component, we have special update rules at Inf in that case — if we ever accidentally use it, we will immediately notice because the infinities will propagate. Whereas storing zero instead would cause such a bug to be silently hidden.
However, it might be nice to put a comment in the code:
// intentionally store Inf for rinv_arr[0] if !is_r, since we should never use that value|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #3158 +/- ##
==========================================
+ Coverage 73.81% 73.90% +0.09%
==========================================
Files 18 18
Lines 5423 5454 +31
==========================================
+ Hits 4003 4031 +28
- Misses 1420 1423 +3 🚀 New features to boost your workflow:
|
|
Have you benchmarked this change? What is the performance improvement? |
|
The changes in this PR do seem to provide a modest speedup. I ran an experiment on my local machine (Intel Xeon i7-7700K, serial Meep built from source: a474edd) for Benchmarking results are shown in the table below.
|
|
There is a decent speedup with the latest changes (8fb234a) using the same benchmarking test:
|
|
What is the speedup compared to the current |
|
Here are the benchmarking results for the master branch (a53a3ed). This PR does indeed provide a significant improvement in runtime even for a single thread. Note: benchmarking results were obtained using Meep compiled using single-precision floating point.
|



Context
The FDTD time-stepping hot path calls
step_db()andupdate_eh()twice per timestep each (once for B/H, once for D/E). The inner loop kernels instep_generic.cpp/step_generic_stride1.cppare already well-optimized with OpenMP (PLOOP_OVER_IVECS), SIMD vectorization (PS1LOOP),restrictpointers, and specialized code paths for each PML/conductivity combination.However, the cylindrical-coordinate (
Dcyl) m≠0 code instep_db.cpp(lines 178-293) is a major performance bottleneck:LOOP_OVER_VOL_OWNED0instead of parallelPLOOPvariants.ivecobject viaIVEC_LOOP_ILOC, calls.r(), and performs a floating-point division – yet the r-coordinate only changes with the outer (R) loop.KSTRIDE_DEFPML-index setup is redundantly placed inside the loop body instead of outside.These issues make cylindrical simulations with m≠0 significantly slower than they need to be.
Optimizations
1. Precompute reciprocal-r table for cylindrical m≠0 loops (HIGH IMPACT)
File:
src/step_db.cpp, lines 178-293Before the 8 special-case loops, precompute a 1D array of
the_m / r_intvalues indexed by the R loop dimension. Inside the loop, replace:with a direct lookup using the R loop index. This eliminates per-point ivec construction,
.r()method dispatch, and floating-point division. The R loop index maps toloop_i1for Dcyl (R is outermost, Z is innermost/stride-1).Implementation: after computing
the_m(line 200) andis(line 193), determine the R dimension's loop parameters fromgv.yucky_direction()and precomputerinv_arr[ir] = the_m / (is.in_direction(R) + 2 * ir)forir = 0..nr-1.2. Parallelize cylindrical m≠0 loops (HIGH IMPACT)
File:
src/step_db.cpp, lines 203-291Replace all 8 instances of serial
LOOP_OVER_VOL_OWNED0(gv, cc, i)withPLOOP_OVER_VOL_OWNED(gv, cc, i)(orPS1LOOP_OVER_VOL_OWNED0when stride-1). This enables OpenMP multi-threading for the cylindrical m≠0 field updates, which are currently the only serial loops on the hot path.With the precomputed rinv table (optimization 1), the loop body becomes pure array arithmetic, ideal for parallization and SIMD.
3. Host
KSTRIDE_DEFoutside loops (MEDIUM IMPACT)File:
src/step_db.cpp, lines 203-291Move
KSTRIDE_DEF(dsig, k, is, gv)andKSTRIDE_DEF(dsigu, ku, is, gv)from inside the loop body to before the loop, matching the pattern used instep_generic.cpp. Currently these defineconstvariables every iteration; hoisting them:step_generic.cpp4. Add SIMD hint to f_rderiv_int inner loop (MEDIUM IMPACT)
File:
src/step_db.cpp, lines 109-117The Z inner loop in the
f_rderiv_intcomputation has no cross-iteration dependencies within a given R row. AddIVDEPpragma before the inner loop to enable auto-vectorization.5. Use memset for f_rderiv_int initialization (LOW IMPACT)
Replace explicit zero-initialization loop with
memset.