Fix iterative recon: engine DC convention, position anchoring, iteration cap, probe warm-start by gwbischof · Pull Request #63 · NSLS2/holoptycho

Garrett Bischof (gwbischof) · 2026-06-12T19:38:18Z

Problem

The iterative (DM/ML) recon produced garbage objects, intermittent non-finite values, and cudaErrorIllegalAddress crashes on streaming runs. Diagnosis: eight independent defects stacked, which is why no single fix showed clean improvement. The method that cracked it: dump the engine's exact inputs, reproduce reconstruction offline (scripts/offline_epie.py), verify the inputs element-by-element, then run the actual engine class on the dumped inputs (scripts/offline_engine_test.py) to isolate solver behavior from streaming dynamics.

Fixes (in this PR)

GPU memory safety (three distinct `cudaErrorIllegalAddress` bugs)

Accumulate-kernel slice double-offset (the big one): accumulate_obj_mode/accumulate_prb_mode index product_d with the global point index (start + z), expecting the full array — the port passed a slice already offset to start_point. Every batch after the first read the wrong points' data (silently corrupting object updates), and reads ran out of bounds once start ≥ num_points_l/2 (the deterministic ~170 s crashes). Found with CUDA_LAUNCH_BLOCKING=1.
Uninitialized point_info_d: cp.empty left garbage windows in slots whose points hadn't arrived; the it==0 init kernel processes ALL slots → garbage int32s as object indices → nondeterministic crashes. Now cp.zeros.
Feed past engine capacity: engine buffers hold live_num_points_max points (default 8192), but the feed guards used the scan's nz (40000) — device copies wrote past the buffers at frame 8193. Feed now stops at capacity with a one-time warning.

Solver correctness

Engine DC convention: the kernels run plain unshifted FFTs with pointwise amplitude comparison, so the engine's diff_d copy must be fftshifted to DC-at-corner (as HXN_databroker.py always did) — lost in the ptychoml preprocess migration. ImageSendOp now applies it on the engine copy only.
Stale DM dual state: product_d was initialized once at iteration 0 when most streaming points hadn't arrived; later points carried garbage dual state for the whole run. init_product_range() now initializes each newly activated range.
0/0 NaN divide: the object normalization divided the whole array including probe-untouched pixels (prb_norm = 0) → NaNs that tripped write_live's finite check (the chronic "non-finite, skipping" log spam) and wrecked dashboard display normalization. Now a guarded divide.
Iteration cap: n_iterations doubled as snapshot ring depth AND termination cap, truncating streaming runs at ~6% of the scan. Decoupled into max_iterations (safety cap, default unlimited; runs end on data-complete + 30 refinement iterations).
Position anchoring: scan anchored with 2 px of undershoot margin → negative crop windows. Now centered in the over-allocated object array + clamp-and-warn.

Validated geometry (deterministic, no autodetect/empirics)

dp_orient_iterative unset (identity): DP and positions are both already global after detector_orientation. (An earlier rot90_cw finding was an artifact of a thin-band amplitude-only sweep compensating for the sign error below.)
x_direction_iterative = +1.0 (vs shared -1.0): wrong parity converges to the conjugate twin (amplitude fine, phase ramps/fringes). Validated by controlled x/y/both flip experiments.
Probe: real-space, centered, never shifted; warm-start via --probe-path (cold-start probe-from-diff also converges offline with correct geometry).

Tooling

scripts/offline_epie.py: pipeline-free ePIE harness (Tiled fetch + incremental cache, engine conventions, warm/cold probe, D4 + position swap/flip knobs, 6-panel snapshots).
scripts/offline_engine_test.py: runs the actual StreamingPtychoRecon on dumped engine inputs — isolates solver math from streaming.
HOLOPTYCHO_DUMP_ENGINE_INPUTS=<dir>: dumps engine diff_d/point_info_d + the exact engine-feed DPs.
Replay: beamline-validated engine params in iterative/both mode, --max-iterations (replay default 200), --probe-path, --tiled-workers doc fix (httpx pool is 100, not 10; 16 workers ≈ 700 fps).

Verification

Engine inputs vs offline harness: positions exact (0 px), DPs correlation 1.000000 (uniform normalization factor only).
Offline full-scan (40k frames) reconstruction with the validated geometry: clean specimen, flat phase, warm AND cold-start probes.
Engine port on dumped inputs (all points active): same quality class as ePIE on equal data — solver math confirmed after fixes.
Pipeline: full streaming runs complete cleanly (rc=0, zero non-finite warnings, no crashes).
pixi run python -m pytest tests/ → 124 passed, 2 pre-existing test_smoke failures (no local tiled server).

Resolution status

With all fixes in, the pipeline now reproduces the engine's computation faithfully: a three-way comparison at 8192 frames (pipeline streaming vs the engine class offline with all points active vs the offline ePIE harness, identical verified inputs) shows pipeline ≈ engine-offline. The remaining quality gap vs the beamline reference is a missing feature, not a bug: streaming_recon is a DM-only subset, while the reference runs ML_grad for ~90% of iterations (alg_percentage=0.1). Porting the ML_grad stage is #64.

Known limitation

On a 16 GB GPU the engine holds 8192 of 40000 points (~786 KB/point) — the dashboard shows a band, not the full scan. Beamline-class GPUs with live_num_points_max sized to the scan (GUI used 41000) cover the whole grid. Raising capacity within GPU memory: set live_num_points_max in the config.

🤖 Generated with Claude Code

…ion cap, probe warm-start The iterative engine had been broken by several stacked issues since the ptychoml preprocessing migration. Diagnosed by dumping the engine's exact inputs and reproducing the reconstruction offline (scripts/offline_epie.py, new) — validated end-to-end on the full 40k-frame scan 411993. 1. Engine DC convention (the critical one): the engine's CUDA kernels run plain unshifted FFTs and compare amplitudes pointwise, so its diff_d copy must be fftshifted to DC-at-corner. HXN_databroker's diffamp prep and the old holoscan-framework feed both applied this shift; it was lost in the ptychoml migration, scrambling the Fourier amplitude constraint (garbage objects, intermittent non-finite values, corrupted probe updates). ImageSendOp now applies it unconditionally on the engine copy only — the shared diff_amp (ViT, dashboard) keeps the beam-centered layout. 2. Position anchoring: the scan was anchored at the first chunk's minimum with only obj_pad//2 = 2 px of slack, so any position undershoot produced negative crop windows and out-of-bounds GPU reads (cudaErrorIllegalAddress in the gather/scatter kernels). PointProcessorOp now centers the scan in the over-allocated object array (~450 px margin each side) and clamps any remaining out-of-bounds window with a warning instead of crashing. 3. Iteration cap: n_iterations doubled as both the snapshot ring-buffer depth and the termination cap, killing streaming runs after ~6% of a 40k-frame scan. The cap is now a separate max_iterations (default effectively unlimited; runs end when data collection completes + 30 refinement iterations). n_iterations remains only the ring depth. 4. Probe warm-start: streaming mode only supported probe-from-diff; with a thin early-scan band that cold start is under-determined. initial_probe can now load a probe .npy from a prior reconstruction via prb_path + init_prb_flag=False (CLI: --probe-path). Replay additionally applies beamline-validated engine parameters (looser amp/pha clamps, ml_weight 1.0, no probe propagation, frozen probe, object canvas headroom) in iterative/both mode, and defaults max_iterations to 200 for quick validation cycles. New debug hooks: HOLOPTYCHO_DUMP_ENGINE_INPUTS dumps the engine's diff_d/point_info buffers and the exact engine-feed DPs. The offline-validated engine orientation for 411993 is rot90_cw (--dp-orient-iterative rot90_cw), matching the old branch's rot90 feed; kept as a flag until confirmed in-pipeline, then it should become the default.

Continued offline validation (full 40k-frame scan 411993, warm and cold-start probe) overturned the earlier rot90_cw finding: that rotation was an artifact of a thin-band amplitude-only sweep, compensating for a position-side convention error. The deterministic recipe is: - dp_orient_iterative unset (identity) — the DP and positions are both already in the global frame after detector_orientation - engine fftshift to DC-at-corner (unchanged from the previous commit) - x_direction_iterative = +1.0 (flipped vs the shared x_direction = -1.0): with the wrong parity the solver converges to the conjugate twin (amplitude fine, phase carries a residual ramp/fringes); the single x flip yields a flat clean phase, validated against y-flip and both-flip controls offline_epie.py grows the knobs used for this: --swap-pos / --flip-pos-x / --flip-pos-y position-geometry experiments, optional --probe (cold start via the engine's probe-from-diff formula, probe updates implied), and wrap-aware per-epoch probe re-centering (the amplitude constraint is invariant to circular probe shifts, so cold starts otherwise drift into a corner-wrapped degenerate state).

… CUDA crashes iter_once's it==0 init kernel processes ALL num_points_l slots of point_info_d, including ones whose windows haven't been written yet (points stream in over the whole scan). cp.empty left recycled GPU pool garbage in those slots, which the gather kernel used as object-array indices -> cudaErrorIllegalAddress, nondeterministically (runs survived whenever the recycled memory happened to be zeros). A zero window reads obj[0:nx, 0:ny], always in bounds.

obj_update / prb_norm divides the whole object array, but prb_norm is zero wherever no probe window has contributed yet — most of the object early in a streaming scan. The resulting 0/0 NaNs tripped write_live's global finite check (the long-standing 'non-finite values in probe/object' skips) and wrecked the dashboard display normalization mid-scan even when the covered region was reconstructing correctly. Uncovered pixels now keep their current value.

tiled does not configure httpx limits, so the connection pool is the httpx default max_connections=100, not ~10 — worker counts up to ~32 are safe. 16 workers reach ~700 fps against the NSLS2 tiled server when the network allows; pair high worker counts with a smaller --chunk-size to bound in-flight memory (workers x chunk_size x frame_bytes).

The engine's diff_d/point_info_d buffers hold live_num_points_max points (default 8192), but the ImageSendOp/PointProcessorOp feed guards used max_points = nz (the scan's frame count, e.g. 40000). Once the stream passed num_points_l frames, the device copies wrote past the end of the engine buffers -> cudaErrorIllegalAddress, deterministically at frame num_points_l+1 (the repeated ~172 s crashes). The feed now stops at the engine capacity with a one-time warning; ViT/tap branches are unaffected. Cover more of a scan by raising config live_num_points_max within GPU memory (~786 KB/point at 256x256).

…e of crashes + corrupted recon) The accumulate_obj_mode / accumulate_prb_mode CUDA kernels index product_d with the GLOBAL point index (start + z) — they expect the full array, as the original HXN code passed. The port passed a slice already offset to start_point, double-offsetting every read: batches after the first read the WRONG points' product data (silently corrupting the object updates — why the streaming recon looked wrong while the offline harness, which has no such path, reconstructed correctly), and once start_point exceeded num_points_l/2 the reads ran past the allocation entirely (cudaErrorIllegalAddress, deterministically at ~5k points received ~170s, the repeated crash signature). Found with CUDA_LAUNCH_BLOCKING=1 pointing at the kernel launch instead of the downstream memcpy.

Verified the pipeline engine's inputs are identical to the offline harness's (positions exact, DPs corr=1.0 up to a uniform normalization factor), isolating the remaining recon-quality gap to solver state: product_d (the DM dual) was only initialized once at iteration 0, when most of a streaming scan's points hadn't arrived and their windows were zeros. Every later-arriving point then carried a stale dual for the rest of the run, and the DM update (2*prb*obj - product) mixed real data with garbage state for most of the scan. ePIE (offline) has no dual state, which is why it converged on the same inputs. init_product_range() re-runs the same init kernel scoped to each newly activated [prev, ready) range as the stream advances.

…ed engine inputs Instantiates the actual engine class outside the pipeline, loads the HOLOPTYCHO_DUMP_ENGINE_INPUTS dumps (byte-identical inputs), activates all points from iteration 0, and runs the DM loop. Used to isolate solver behavior from streaming dynamics: with identical inputs the DM port and the ePIE harness now produce the same quality class of band reconstruction, confirming the engine math after the accumulate/dual-state fixes. Engine coverage on a 16 GB GPU is capped at live_num_points_max (default 8192) of a 40k-frame scan — full-scan quality requires the beamline-class GPU.

When the engine finishes it calls stop_execution() and ends the whole pipeline, so the replay's quick-validation max_iterations=200 default would kill the ViT stream mid-scan in --mode both. The default now applies only to iterative-only runs; both-mode keeps the pipeline default (unlimited; run ends when data collection completes) unless --max-iterations is passed explicitly.

… clear_region The 1.625x range scaling (object-canvas headroom copied from the beamline GUI config) is superseded by PointProcessorOp centering the scan in the over-allocated object array, and it was actively harmful: it pushed the canvas/points sparsity ratio over clear_region's >16 gate, so every streaming advance zeroed the object under the new points' 256 px windows — which overlap previously reconstructed rows. The object never accumulated cleanly (speckled amplitude, striped phase) even though the same engine on the same inputs with all points active reconstructs fine. With the original commanded range the ratio is ~7.5 and clear_region stays off. Also restores the proper ViT mosaic canvas size in both mode.

…nt configurable The 200-iteration max_iterations default truncated runs mid-stream once ingest got fast (16 tiled workers deliver 8192 frames in ~30 s while the engine manages ~3 iterations/s) — exactly the failure the cap decoupling was meant to prevent. Quick validation runs are bounded by --max-frames instead. it_ends_after (refinement iterations after data collection completes) is now a config key (default 30). The replay sets 300: with fast ingest nearly all of the engine's full-data iterations happen after the stream ends, and 30 left the reconstruction underdone.

clear_region zeroes the object under newly activated points' windows, which overlap previously reconstructed rows on dense fly scans — every streaming advance wiped a stripe of converged object. Its sparsity heuristic (canvas/points > 16) was the only gate. New config key clear_region_enabled (default True, preserving existing behavior) disables it outright; the replay sets False. The replay's 1.625x x/y_range headroom returns: the object array and the dashboard snapshot crop are sized from the range, so without headroom the scan band fills the array edge-to-edge and the display cuts it off. Its harmful side effect (tripping the clear_region gate) is now disabled explicitly rather than by shrinking the canvas.

Garrett Bischof (gwbischof) added 13 commits June 12, 2026 15:37

Garrett Bischof (gwbischof) mentioned this pull request Jun 13, 2026

Port the ML_grad refinement stage to streaming_recon #64

Closed

Garrett Bischof (gwbischof) merged commit 1343607 into main Jun 13, 2026
5 checks passed

Garrett Bischof (gwbischof) deleted the fix/iterative-recon-recipe branch June 13, 2026 17:13

This was referenced Jun 14, 2026

Iterative engine: drive the ptycho library recon (one_iter) instead of the reimplemented streaming_recon.py #65

Open

Iterative streaming: drive ptycho's ML_grad, match holoscan-framework (#65) #66

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix iterative recon: engine DC convention, position anchoring, iteration cap, probe warm-start#63

Fix iterative recon: engine DC convention, position anchoring, iteration cap, probe warm-start#63
Garrett Bischof (gwbischof) merged 13 commits into
mainfrom
fix/iterative-recon-recipe

Garrett Bischof (gwbischof) commented Jun 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Garrett Bischof (gwbischof) commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fixes (in this PR)

GPU memory safety (three distinct cudaErrorIllegalAddress bugs)

Solver correctness

Validated geometry (deterministic, no autodetect/empirics)

Tooling

Verification

Resolution status

Known limitation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Garrett Bischof (gwbischof) commented Jun 12, 2026 •

edited

Loading

GPU memory safety (three distinct `cudaErrorIllegalAddress` bugs)