Skip to content

Performance optimization for time stepping in cylindrical coordinates for m ≠ 0#3158

Open
oskooi wants to merge 2 commits intoNanoComp:masterfrom
oskooi:cyl_timestep_enhancements
Open

Performance optimization for time stepping in cylindrical coordinates for m ≠ 0#3158
oskooi wants to merge 2 commits intoNanoComp:masterfrom
oskooi:cyl_timestep_enhancements

Conversation

@oskooi
Copy link
Collaborator

@oskooi oskooi commented Mar 1, 2026

Context

The FDTD time-stepping hot path calls step_db() and update_eh() twice per timestep each (once for B/H, once for D/E). The inner loop kernels in step_generic.cpp / step_generic_stride1.cpp are already well-optimized with OpenMP (PLOOP_OVER_IVECS), SIMD vectorization (PS1LOOP), restrict pointers, and specialized code paths for each PML/conductivity combination.

However, the cylindrical-coordinate (Dcyl) m≠0 code in step_db.cpp (lines 178-293) is a major performance bottleneck:

  1. It uses serial LOOP_OVER_VOL_OWNED0 instead of parallel PLOOP variants.
  2. Every grid point constructs an ivec object via IVEC_LOOP_ILOC, calls .r(), and performs a floating-point division – yet the r-coordinate only changes with the outer (R) loop.
  3. The KSTRIDE_DEF PML-index setup is redundantly placed inside the loop body instead of outside.

These issues make cylindrical simulations with m≠0 significantly slower than they need to be.

Optimizations

1. Precompute reciprocal-r table for cylindrical m≠0 loops (HIGH IMPACT)

File: src/step_db.cpp, lines 178-293

Before the 8 special-case loops, precompute a 1D array of the_m / r_int values indexed by the R loop dimension. Inside the loop, replace:

IVEC_LOOP_ILOC(gv, here);
realnum rinv = the_m / here.r();

with a direct lookup using the R loop index. This eliminates per-point ivec construction, .r() method dispatch, and floating-point division. The R loop index maps to loop_i1 for Dcyl (R is outermost, Z is innermost/stride-1).

Implementation: after computing the_m (line 200) and is (line 193), determine the R dimension's loop parameters from gv.yucky_direction() and precompute rinv_arr[ir] = the_m / (is.in_direction(R) + 2 * ir) for ir = 0..nr-1.

2. Parallelize cylindrical m≠0 loops (HIGH IMPACT)

File: src/step_db.cpp, lines 203-291

Replace all 8 instances of serial LOOP_OVER_VOL_OWNED0(gv, cc, i) with PLOOP_OVER_VOL_OWNED(gv, cc, i) (or PS1LOOP_OVER_VOL_OWNED0 when stride-1). This enables OpenMP multi-threading for the cylindrical m≠0 field updates, which are currently the only serial loops on the hot path.

With the precomputed rinv table (optimization 1), the loop body becomes pure array arithmetic, ideal for parallization and SIMD.

3. Host KSTRIDE_DEF outside loops (MEDIUM IMPACT)

File: src/step_db.cpp, lines 203-291

Move KSTRIDE_DEF(dsig, k, is, gv) and KSTRIDE_DEF(dsigu, ku, is, gv) from inside the loop body to before the loop, matching the pattern used in step_generic.cpp. Currently these define const variables every iteration; hoisting them:

  • Reduces redundant computation (even if compiler may hoist them for serial loops)
  • Is required for correct OpenMP parallel behavior (avoids per-thread redundant setup)
  • Matches the established pattern from the well-optimized step_generic.cpp

4. Add SIMD hint to f_rderiv_int inner loop (MEDIUM IMPACT)

File: src/step_db.cpp, lines 109-117

The Z inner loop in the f_rderiv_int computation has no cross-iteration dependencies within a given R row. Add IVDEP pragma before the inner loop to enable auto-vectorization.

5. Use memset for f_rderiv_int initialization (LOW IMPACT)

Replace explicit zero-initialization loop with memset.

src/step_db.cpp Outdated
const int nr_loop = (gv.big_corner().in_direction(R) - is_r) / 2 + 1;
realnum *rinv_arr = new realnum[nr_loop];
for (int ir = 0; ir < nr_loop; ++ir)
rinv_arr[ir] = the_m / (is_r + 2 * ir);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A suggested rewrite to guard against division by zero at r = 0:

for (int ir = 0; ir < nr_loop; ++ir) {
  int r_idx = is_r + 2 * ir;
  rinv_arr[ir] = (r_idx != 0) ? the_m / r_idx : 0;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather leave it as-is.

We only divide by zero here in cases where we aren't actually going to use rinv_arr[0] (for the Z component, we have special update rules at $r=0$), so it is actually a good thing that we are storing Inf in that case — if we ever accidentally use it, we will immediately notice because the infinities will propagate. Whereas storing zero instead would cause such a bug to be silently hidden.

However, it might be nice to put a comment in the code:

// intentionally store Inf for rinv_arr[0] if !is_r, since we should never use that value

@codecov-commenter
Copy link

codecov-commenter commented Mar 1, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.90%. Comparing base (f29a8c7) to head (8fb234a).
⚠️ Report is 111 commits behind head on master.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3158      +/-   ##
==========================================
+ Coverage   73.81%   73.90%   +0.09%     
==========================================
  Files          18       18              
  Lines        5423     5454      +31     
==========================================
+ Hits         4003     4031      +28     
- Misses       1420     1423       +3     

see 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@stevengj
Copy link
Collaborator

stevengj commented Mar 1, 2026

Have you benchmarked this change? What is the performance improvement?

@oskooi
Copy link
Collaborator Author

oskooi commented Mar 1, 2026

The changes in this PR do seem to provide a modest speedup. I ran an experiment on my local machine (Intel Xeon i7-7700K, serial Meep built from source: a474edd) for OMP_NUM_THREADS of 1, 2, and 3 using the example of Tutorial/Nonaxisymmetric Dipole Sources involving a ring-current source positioned at $r = 3.5$ μm with $m = 6$. The script is here.

Benchmarking results are shown in the table below.

Screenshot from 2026-02-28 20-10-07

@oskooi
Copy link
Collaborator Author

oskooi commented Mar 7, 2026

There is a decent speedup with the latest changes (8fb234a) using the same benchmarking test:

Screenshot from 2026-03-07 11-03-50

@stevengj
Copy link
Collaborator

stevengj commented Mar 7, 2026

What is the speedup compared to the current master branch?

@oskooi
Copy link
Collaborator Author

oskooi commented Mar 8, 2026

Here are the benchmarking results for the master branch (a53a3ed). This PR does indeed provide a significant improvement in runtime even for a single thread.

Note: benchmarking results were obtained using Meep compiled using single-precision floating point.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants