Skip to content

Fix iterative recon: engine DC convention, position anchoring, iteration cap, probe warm-start#63

Merged
Garrett Bischof (gwbischof) merged 13 commits into
mainfrom
fix/iterative-recon-recipe
Jun 13, 2026
Merged

Fix iterative recon: engine DC convention, position anchoring, iteration cap, probe warm-start#63
Garrett Bischof (gwbischof) merged 13 commits into
mainfrom
fix/iterative-recon-recipe

Conversation

@gwbischof

@gwbischof Garrett Bischof (gwbischof) commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Problem

The iterative (DM/ML) recon produced garbage objects, intermittent non-finite values, and cudaErrorIllegalAddress crashes on streaming runs. Diagnosis: eight independent defects stacked, which is why no single fix showed clean improvement. The method that cracked it: dump the engine's exact inputs, reproduce reconstruction offline (scripts/offline_epie.py), verify the inputs element-by-element, then run the actual engine class on the dumped inputs (scripts/offline_engine_test.py) to isolate solver behavior from streaming dynamics.

Fixes (in this PR)

GPU memory safety (three distinct cudaErrorIllegalAddress bugs)

  1. Accumulate-kernel slice double-offset (the big one): accumulate_obj_mode/accumulate_prb_mode index product_d with the global point index (start + z), expecting the full array — the port passed a slice already offset to start_point. Every batch after the first read the wrong points' data (silently corrupting object updates), and reads ran out of bounds once start ≥ num_points_l/2 (the deterministic ~170 s crashes). Found with CUDA_LAUNCH_BLOCKING=1.
  2. Uninitialized point_info_d: cp.empty left garbage windows in slots whose points hadn't arrived; the it==0 init kernel processes ALL slots → garbage int32s as object indices → nondeterministic crashes. Now cp.zeros.
  3. Feed past engine capacity: engine buffers hold live_num_points_max points (default 8192), but the feed guards used the scan's nz (40000) — device copies wrote past the buffers at frame 8193. Feed now stops at capacity with a one-time warning.

Solver correctness

  1. Engine DC convention: the kernels run plain unshifted FFTs with pointwise amplitude comparison, so the engine's diff_d copy must be fftshifted to DC-at-corner (as HXN_databroker.py always did) — lost in the ptychoml preprocess migration. ImageSendOp now applies it on the engine copy only.
  2. Stale DM dual state: product_d was initialized once at iteration 0 when most streaming points hadn't arrived; later points carried garbage dual state for the whole run. init_product_range() now initializes each newly activated range.
  3. 0/0 NaN divide: the object normalization divided the whole array including probe-untouched pixels (prb_norm = 0) → NaNs that tripped write_live's finite check (the chronic "non-finite, skipping" log spam) and wrecked dashboard display normalization. Now a guarded divide.
  4. Iteration cap: n_iterations doubled as snapshot ring depth AND termination cap, truncating streaming runs at ~6% of the scan. Decoupled into max_iterations (safety cap, default unlimited; runs end on data-complete + 30 refinement iterations).
  5. Position anchoring: scan anchored with 2 px of undershoot margin → negative crop windows. Now centered in the over-allocated object array + clamp-and-warn.

Validated geometry (deterministic, no autodetect/empirics)

  • dp_orient_iterative unset (identity): DP and positions are both already global after detector_orientation. (An earlier rot90_cw finding was an artifact of a thin-band amplitude-only sweep compensating for the sign error below.)
  • x_direction_iterative = +1.0 (vs shared -1.0): wrong parity converges to the conjugate twin (amplitude fine, phase ramps/fringes). Validated by controlled x/y/both flip experiments.
  • Probe: real-space, centered, never shifted; warm-start via --probe-path (cold-start probe-from-diff also converges offline with correct geometry).

Tooling

  • scripts/offline_epie.py: pipeline-free ePIE harness (Tiled fetch + incremental cache, engine conventions, warm/cold probe, D4 + position swap/flip knobs, 6-panel snapshots).
  • scripts/offline_engine_test.py: runs the actual StreamingPtychoRecon on dumped engine inputs — isolates solver math from streaming.
  • HOLOPTYCHO_DUMP_ENGINE_INPUTS=<dir>: dumps engine diff_d/point_info_d + the exact engine-feed DPs.
  • Replay: beamline-validated engine params in iterative/both mode, --max-iterations (replay default 200), --probe-path, --tiled-workers doc fix (httpx pool is 100, not 10; 16 workers ≈ 700 fps).

Verification

  • Engine inputs vs offline harness: positions exact (0 px), DPs correlation 1.000000 (uniform normalization factor only).
  • Offline full-scan (40k frames) reconstruction with the validated geometry: clean specimen, flat phase, warm AND cold-start probes.
  • Engine port on dumped inputs (all points active): same quality class as ePIE on equal data — solver math confirmed after fixes.
  • Pipeline: full streaming runs complete cleanly (rc=0, zero non-finite warnings, no crashes).
  • pixi run python -m pytest tests/ → 124 passed, 2 pre-existing test_smoke failures (no local tiled server).

Resolution status

With all fixes in, the pipeline now reproduces the engine's computation faithfully: a three-way comparison at 8192 frames (pipeline streaming vs the engine class offline with all points active vs the offline ePIE harness, identical verified inputs) shows pipeline ≈ engine-offline. The remaining quality gap vs the beamline reference is a missing feature, not a bug: streaming_recon is a DM-only subset, while the reference runs ML_grad for ~90% of iterations (alg_percentage=0.1). Porting the ML_grad stage is #64.

Known limitation

On a 16 GB GPU the engine holds 8192 of 40000 points (~786 KB/point) — the dashboard shows a band, not the full scan. Beamline-class GPUs with live_num_points_max sized to the scan (GUI used 41000) cover the whole grid. Raising capacity within GPU memory: set live_num_points_max in the config.

🤖 Generated with Claude Code

…ion cap, probe warm-start

The iterative engine had been broken by several stacked issues since the
ptychoml preprocessing migration. Diagnosed by dumping the engine's exact
inputs and reproducing the reconstruction offline (scripts/offline_epie.py,
new) — validated end-to-end on the full 40k-frame scan 411993.

1. Engine DC convention (the critical one): the engine's CUDA kernels run
   plain unshifted FFTs and compare amplitudes pointwise, so its diff_d copy
   must be fftshifted to DC-at-corner. HXN_databroker's diffamp prep and the
   old holoscan-framework feed both applied this shift; it was lost in the
   ptychoml migration, scrambling the Fourier amplitude constraint (garbage
   objects, intermittent non-finite values, corrupted probe updates).
   ImageSendOp now applies it unconditionally on the engine copy only —
   the shared diff_amp (ViT, dashboard) keeps the beam-centered layout.

2. Position anchoring: the scan was anchored at the first chunk's minimum
   with only obj_pad//2 = 2 px of slack, so any position undershoot produced
   negative crop windows and out-of-bounds GPU reads (cudaErrorIllegalAddress
   in the gather/scatter kernels). PointProcessorOp now centers the scan in
   the over-allocated object array (~450 px margin each side) and clamps any
   remaining out-of-bounds window with a warning instead of crashing.

3. Iteration cap: n_iterations doubled as both the snapshot ring-buffer depth
   and the termination cap, killing streaming runs after ~6% of a 40k-frame
   scan. The cap is now a separate max_iterations (default effectively
   unlimited; runs end when data collection completes + 30 refinement
   iterations). n_iterations remains only the ring depth.

4. Probe warm-start: streaming mode only supported probe-from-diff; with a
   thin early-scan band that cold start is under-determined. initial_probe
   can now load a probe .npy from a prior reconstruction via prb_path +
   init_prb_flag=False (CLI: --probe-path).

Replay additionally applies beamline-validated engine parameters (looser
amp/pha clamps, ml_weight 1.0, no probe propagation, frozen probe, object
canvas headroom) in iterative/both mode, and defaults max_iterations to 200
for quick validation cycles. New debug hooks: HOLOPTYCHO_DUMP_ENGINE_INPUTS
dumps the engine's diff_d/point_info buffers and the exact engine-feed DPs.

The offline-validated engine orientation for 411993 is rot90_cw
(--dp-orient-iterative rot90_cw), matching the old branch's rot90 feed; kept
as a flag until confirmed in-pipeline, then it should become the default.
Continued offline validation (full 40k-frame scan 411993, warm and cold-start
probe) overturned the earlier rot90_cw finding: that rotation was an artifact
of a thin-band amplitude-only sweep, compensating for a position-side
convention error. The deterministic recipe is:

- dp_orient_iterative unset (identity) — the DP and positions are both
  already in the global frame after detector_orientation
- engine fftshift to DC-at-corner (unchanged from the previous commit)
- x_direction_iterative = +1.0 (flipped vs the shared x_direction = -1.0):
  with the wrong parity the solver converges to the conjugate twin
  (amplitude fine, phase carries a residual ramp/fringes); the single
  x flip yields a flat clean phase, validated against y-flip and
  both-flip controls

offline_epie.py grows the knobs used for this: --swap-pos /
--flip-pos-x / --flip-pos-y position-geometry experiments, optional
--probe (cold start via the engine's probe-from-diff formula, probe
updates implied), and wrap-aware per-epoch probe re-centering (the
amplitude constraint is invariant to circular probe shifts, so cold
starts otherwise drift into a corner-wrapped degenerate state).
… CUDA crashes

iter_once's it==0 init kernel processes ALL num_points_l slots of
point_info_d, including ones whose windows haven't been written yet
(points stream in over the whole scan). cp.empty left recycled GPU pool
garbage in those slots, which the gather kernel used as object-array
indices -> cudaErrorIllegalAddress, nondeterministically (runs survived
whenever the recycled memory happened to be zeros). A zero window reads
obj[0:nx, 0:ny], always in bounds.
obj_update / prb_norm divides the whole object array, but prb_norm is
zero wherever no probe window has contributed yet — most of the object
early in a streaming scan. The resulting 0/0 NaNs tripped write_live's
global finite check (the long-standing 'non-finite values in
probe/object' skips) and wrecked the dashboard display normalization
mid-scan even when the covered region was reconstructing correctly.
Uncovered pixels now keep their current value.
tiled does not configure httpx limits, so the connection pool is the httpx
default max_connections=100, not ~10 — worker counts up to ~32 are safe.
16 workers reach ~700 fps against the NSLS2 tiled server when the network
allows; pair high worker counts with a smaller --chunk-size to bound
in-flight memory (workers x chunk_size x frame_bytes).
The engine's diff_d/point_info_d buffers hold live_num_points_max points
(default 8192), but the ImageSendOp/PointProcessorOp feed guards used
max_points = nz (the scan's frame count, e.g. 40000). Once the stream
passed num_points_l frames, the device copies wrote past the end of the
engine buffers -> cudaErrorIllegalAddress, deterministically at frame
num_points_l+1 (the repeated ~172 s crashes). The feed now stops at the
engine capacity with a one-time warning; ViT/tap branches are unaffected.
Cover more of a scan by raising config live_num_points_max within GPU
memory (~786 KB/point at 256x256).
…e of crashes + corrupted recon)

The accumulate_obj_mode / accumulate_prb_mode CUDA kernels index product_d
with the GLOBAL point index (start + z) — they expect the full array, as
the original HXN code passed. The port passed a slice already offset to
start_point, double-offsetting every read: batches after the first read
the WRONG points' product data (silently corrupting the object updates —
why the streaming recon looked wrong while the offline harness, which has
no such path, reconstructed correctly), and once start_point exceeded
num_points_l/2 the reads ran past the allocation entirely
(cudaErrorIllegalAddress, deterministically at ~5k points received ~170s,
the repeated crash signature). Found with CUDA_LAUNCH_BLOCKING=1 pointing
at the kernel launch instead of the downstream memcpy.
Verified the pipeline engine's inputs are identical to the offline
harness's (positions exact, DPs corr=1.0 up to a uniform normalization
factor), isolating the remaining recon-quality gap to solver state:
product_d (the DM dual) was only initialized once at iteration 0, when
most of a streaming scan's points hadn't arrived and their windows were
zeros. Every later-arriving point then carried a stale dual for the rest
of the run, and the DM update (2*prb*obj - product) mixed real data with
garbage state for most of the scan. ePIE (offline) has no dual state,
which is why it converged on the same inputs.

init_product_range() re-runs the same init kernel scoped to each newly
activated [prev, ready) range as the stream advances.
…ed engine inputs

Instantiates the actual engine class outside the pipeline, loads the
HOLOPTYCHO_DUMP_ENGINE_INPUTS dumps (byte-identical inputs), activates all
points from iteration 0, and runs the DM loop. Used to isolate solver
behavior from streaming dynamics: with identical inputs the DM port and
the ePIE harness now produce the same quality class of band reconstruction,
confirming the engine math after the accumulate/dual-state fixes. Engine
coverage on a 16 GB GPU is capped at live_num_points_max (default 8192)
of a 40k-frame scan — full-scan quality requires the beamline-class GPU.
When the engine finishes it calls stop_execution() and ends the whole
pipeline, so the replay's quick-validation max_iterations=200 default
would kill the ViT stream mid-scan in --mode both. The default now
applies only to iterative-only runs; both-mode keeps the pipeline
default (unlimited; run ends when data collection completes) unless
--max-iterations is passed explicitly.
… clear_region

The 1.625x range scaling (object-canvas headroom copied from the beamline
GUI config) is superseded by PointProcessorOp centering the scan in the
over-allocated object array, and it was actively harmful: it pushed the
canvas/points sparsity ratio over clear_region's >16 gate, so every
streaming advance zeroed the object under the new points' 256 px windows
— which overlap previously reconstructed rows. The object never
accumulated cleanly (speckled amplitude, striped phase) even though the
same engine on the same inputs with all points active reconstructs fine.
With the original commanded range the ratio is ~7.5 and clear_region
stays off. Also restores the proper ViT mosaic canvas size in both mode.
…nt configurable

The 200-iteration max_iterations default truncated runs mid-stream once
ingest got fast (16 tiled workers deliver 8192 frames in ~30 s while the
engine manages ~3 iterations/s) — exactly the failure the cap decoupling
was meant to prevent. Quick validation runs are bounded by --max-frames
instead.

it_ends_after (refinement iterations after data collection completes) is
now a config key (default 30). The replay sets 300: with fast ingest
nearly all of the engine's full-data iterations happen after the stream
ends, and 30 left the reconstruction underdone.
clear_region zeroes the object under newly activated points' windows,
which overlap previously reconstructed rows on dense fly scans — every
streaming advance wiped a stripe of converged object. Its sparsity
heuristic (canvas/points > 16) was the only gate. New config key
clear_region_enabled (default True, preserving existing behavior)
disables it outright; the replay sets False.

The replay's 1.625x x/y_range headroom returns: the object array and the
dashboard snapshot crop are sized from the range, so without headroom the
scan band fills the array edge-to-edge and the display cuts it off. Its
harmful side effect (tripping the clear_region gate) is now disabled
explicitly rather than by shrinking the canvas.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant