Fix iterative recon: engine DC convention, position anchoring, iteration cap, probe warm-start#63
Merged
Conversation
…ion cap, probe warm-start The iterative engine had been broken by several stacked issues since the ptychoml preprocessing migration. Diagnosed by dumping the engine's exact inputs and reproducing the reconstruction offline (scripts/offline_epie.py, new) — validated end-to-end on the full 40k-frame scan 411993. 1. Engine DC convention (the critical one): the engine's CUDA kernels run plain unshifted FFTs and compare amplitudes pointwise, so its diff_d copy must be fftshifted to DC-at-corner. HXN_databroker's diffamp prep and the old holoscan-framework feed both applied this shift; it was lost in the ptychoml migration, scrambling the Fourier amplitude constraint (garbage objects, intermittent non-finite values, corrupted probe updates). ImageSendOp now applies it unconditionally on the engine copy only — the shared diff_amp (ViT, dashboard) keeps the beam-centered layout. 2. Position anchoring: the scan was anchored at the first chunk's minimum with only obj_pad//2 = 2 px of slack, so any position undershoot produced negative crop windows and out-of-bounds GPU reads (cudaErrorIllegalAddress in the gather/scatter kernels). PointProcessorOp now centers the scan in the over-allocated object array (~450 px margin each side) and clamps any remaining out-of-bounds window with a warning instead of crashing. 3. Iteration cap: n_iterations doubled as both the snapshot ring-buffer depth and the termination cap, killing streaming runs after ~6% of a 40k-frame scan. The cap is now a separate max_iterations (default effectively unlimited; runs end when data collection completes + 30 refinement iterations). n_iterations remains only the ring depth. 4. Probe warm-start: streaming mode only supported probe-from-diff; with a thin early-scan band that cold start is under-determined. initial_probe can now load a probe .npy from a prior reconstruction via prb_path + init_prb_flag=False (CLI: --probe-path). Replay additionally applies beamline-validated engine parameters (looser amp/pha clamps, ml_weight 1.0, no probe propagation, frozen probe, object canvas headroom) in iterative/both mode, and defaults max_iterations to 200 for quick validation cycles. New debug hooks: HOLOPTYCHO_DUMP_ENGINE_INPUTS dumps the engine's diff_d/point_info buffers and the exact engine-feed DPs. The offline-validated engine orientation for 411993 is rot90_cw (--dp-orient-iterative rot90_cw), matching the old branch's rot90 feed; kept as a flag until confirmed in-pipeline, then it should become the default.
Continued offline validation (full 40k-frame scan 411993, warm and cold-start probe) overturned the earlier rot90_cw finding: that rotation was an artifact of a thin-band amplitude-only sweep, compensating for a position-side convention error. The deterministic recipe is: - dp_orient_iterative unset (identity) — the DP and positions are both already in the global frame after detector_orientation - engine fftshift to DC-at-corner (unchanged from the previous commit) - x_direction_iterative = +1.0 (flipped vs the shared x_direction = -1.0): with the wrong parity the solver converges to the conjugate twin (amplitude fine, phase carries a residual ramp/fringes); the single x flip yields a flat clean phase, validated against y-flip and both-flip controls offline_epie.py grows the knobs used for this: --swap-pos / --flip-pos-x / --flip-pos-y position-geometry experiments, optional --probe (cold start via the engine's probe-from-diff formula, probe updates implied), and wrap-aware per-epoch probe re-centering (the amplitude constraint is invariant to circular probe shifts, so cold starts otherwise drift into a corner-wrapped degenerate state).
… CUDA crashes iter_once's it==0 init kernel processes ALL num_points_l slots of point_info_d, including ones whose windows haven't been written yet (points stream in over the whole scan). cp.empty left recycled GPU pool garbage in those slots, which the gather kernel used as object-array indices -> cudaErrorIllegalAddress, nondeterministically (runs survived whenever the recycled memory happened to be zeros). A zero window reads obj[0:nx, 0:ny], always in bounds.
obj_update / prb_norm divides the whole object array, but prb_norm is zero wherever no probe window has contributed yet — most of the object early in a streaming scan. The resulting 0/0 NaNs tripped write_live's global finite check (the long-standing 'non-finite values in probe/object' skips) and wrecked the dashboard display normalization mid-scan even when the covered region was reconstructing correctly. Uncovered pixels now keep their current value.
tiled does not configure httpx limits, so the connection pool is the httpx default max_connections=100, not ~10 — worker counts up to ~32 are safe. 16 workers reach ~700 fps against the NSLS2 tiled server when the network allows; pair high worker counts with a smaller --chunk-size to bound in-flight memory (workers x chunk_size x frame_bytes).
The engine's diff_d/point_info_d buffers hold live_num_points_max points (default 8192), but the ImageSendOp/PointProcessorOp feed guards used max_points = nz (the scan's frame count, e.g. 40000). Once the stream passed num_points_l frames, the device copies wrote past the end of the engine buffers -> cudaErrorIllegalAddress, deterministically at frame num_points_l+1 (the repeated ~172 s crashes). The feed now stops at the engine capacity with a one-time warning; ViT/tap branches are unaffected. Cover more of a scan by raising config live_num_points_max within GPU memory (~786 KB/point at 256x256).
…e of crashes + corrupted recon) The accumulate_obj_mode / accumulate_prb_mode CUDA kernels index product_d with the GLOBAL point index (start + z) — they expect the full array, as the original HXN code passed. The port passed a slice already offset to start_point, double-offsetting every read: batches after the first read the WRONG points' product data (silently corrupting the object updates — why the streaming recon looked wrong while the offline harness, which has no such path, reconstructed correctly), and once start_point exceeded num_points_l/2 the reads ran past the allocation entirely (cudaErrorIllegalAddress, deterministically at ~5k points received ~170s, the repeated crash signature). Found with CUDA_LAUNCH_BLOCKING=1 pointing at the kernel launch instead of the downstream memcpy.
Verified the pipeline engine's inputs are identical to the offline harness's (positions exact, DPs corr=1.0 up to a uniform normalization factor), isolating the remaining recon-quality gap to solver state: product_d (the DM dual) was only initialized once at iteration 0, when most of a streaming scan's points hadn't arrived and their windows were zeros. Every later-arriving point then carried a stale dual for the rest of the run, and the DM update (2*prb*obj - product) mixed real data with garbage state for most of the scan. ePIE (offline) has no dual state, which is why it converged on the same inputs. init_product_range() re-runs the same init kernel scoped to each newly activated [prev, ready) range as the stream advances.
…ed engine inputs Instantiates the actual engine class outside the pipeline, loads the HOLOPTYCHO_DUMP_ENGINE_INPUTS dumps (byte-identical inputs), activates all points from iteration 0, and runs the DM loop. Used to isolate solver behavior from streaming dynamics: with identical inputs the DM port and the ePIE harness now produce the same quality class of band reconstruction, confirming the engine math after the accumulate/dual-state fixes. Engine coverage on a 16 GB GPU is capped at live_num_points_max (default 8192) of a 40k-frame scan — full-scan quality requires the beamline-class GPU.
When the engine finishes it calls stop_execution() and ends the whole pipeline, so the replay's quick-validation max_iterations=200 default would kill the ViT stream mid-scan in --mode both. The default now applies only to iterative-only runs; both-mode keeps the pipeline default (unlimited; run ends when data collection completes) unless --max-iterations is passed explicitly.
… clear_region The 1.625x range scaling (object-canvas headroom copied from the beamline GUI config) is superseded by PointProcessorOp centering the scan in the over-allocated object array, and it was actively harmful: it pushed the canvas/points sparsity ratio over clear_region's >16 gate, so every streaming advance zeroed the object under the new points' 256 px windows — which overlap previously reconstructed rows. The object never accumulated cleanly (speckled amplitude, striped phase) even though the same engine on the same inputs with all points active reconstructs fine. With the original commanded range the ratio is ~7.5 and clear_region stays off. Also restores the proper ViT mosaic canvas size in both mode.
…nt configurable The 200-iteration max_iterations default truncated runs mid-stream once ingest got fast (16 tiled workers deliver 8192 frames in ~30 s while the engine manages ~3 iterations/s) — exactly the failure the cap decoupling was meant to prevent. Quick validation runs are bounded by --max-frames instead. it_ends_after (refinement iterations after data collection completes) is now a config key (default 30). The replay sets 300: with fast ingest nearly all of the engine's full-data iterations happen after the stream ends, and 30 left the reconstruction underdone.
clear_region zeroes the object under newly activated points' windows, which overlap previously reconstructed rows on dense fly scans — every streaming advance wiped a stripe of converged object. Its sparsity heuristic (canvas/points > 16) was the only gate. New config key clear_region_enabled (default True, preserving existing behavior) disables it outright; the replay sets False. The replay's 1.625x x/y_range headroom returns: the object array and the dashboard snapshot crop are sized from the range, so without headroom the scan band fills the array edge-to-edge and the display cuts it off. Its harmful side effect (tripping the clear_region gate) is now disabled explicitly rather than by shrinking the canvas.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The iterative (DM/ML) recon produced garbage objects, intermittent non-finite values, and
cudaErrorIllegalAddresscrashes on streaming runs. Diagnosis: eight independent defects stacked, which is why no single fix showed clean improvement. The method that cracked it: dump the engine's exact inputs, reproduce reconstruction offline (scripts/offline_epie.py), verify the inputs element-by-element, then run the actual engine class on the dumped inputs (scripts/offline_engine_test.py) to isolate solver behavior from streaming dynamics.Fixes (in this PR)
GPU memory safety (three distinct
cudaErrorIllegalAddressbugs)accumulate_obj_mode/accumulate_prb_modeindexproduct_dwith the global point index(start + z), expecting the full array — the port passed a slice already offset tostart_point. Every batch after the first read the wrong points' data (silently corrupting object updates), and reads ran out of bounds oncestart ≥ num_points_l/2(the deterministic ~170 s crashes). Found withCUDA_LAUNCH_BLOCKING=1.point_info_d:cp.emptyleft garbage windows in slots whose points hadn't arrived; the it==0 init kernel processes ALL slots → garbage int32s as object indices → nondeterministic crashes. Nowcp.zeros.live_num_points_maxpoints (default 8192), but the feed guards used the scan'snz(40000) — device copies wrote past the buffers at frame 8193. Feed now stops at capacity with a one-time warning.Solver correctness
diff_dcopy must be fftshifted to DC-at-corner (asHXN_databroker.pyalways did) — lost in the ptychoml preprocess migration.ImageSendOpnow applies it on the engine copy only.product_dwas initialized once at iteration 0 when most streaming points hadn't arrived; later points carried garbage dual state for the whole run.init_product_range()now initializes each newly activated range.prb_norm = 0) → NaNs that trippedwrite_live's finite check (the chronic "non-finite, skipping" log spam) and wrecked dashboard display normalization. Now a guarded divide.n_iterationsdoubled as snapshot ring depth AND termination cap, truncating streaming runs at ~6% of the scan. Decoupled intomax_iterations(safety cap, default unlimited; runs end on data-complete + 30 refinement iterations).Validated geometry (deterministic, no autodetect/empirics)
dp_orient_iterativeunset (identity): DP and positions are both already global afterdetector_orientation. (An earlierrot90_cwfinding was an artifact of a thin-band amplitude-only sweep compensating for the sign error below.)x_direction_iterative = +1.0(vs shared-1.0): wrong parity converges to the conjugate twin (amplitude fine, phase ramps/fringes). Validated by controlled x/y/both flip experiments.--probe-path(cold-start probe-from-diff also converges offline with correct geometry).Tooling
scripts/offline_epie.py: pipeline-free ePIE harness (Tiled fetch + incremental cache, engine conventions, warm/cold probe, D4 + position swap/flip knobs, 6-panel snapshots).scripts/offline_engine_test.py: runs the actualStreamingPtychoReconon dumped engine inputs — isolates solver math from streaming.HOLOPTYCHO_DUMP_ENGINE_INPUTS=<dir>: dumps enginediff_d/point_info_d+ the exact engine-feed DPs.--max-iterations(replay default 200),--probe-path,--tiled-workersdoc fix (httpx pool is 100, not 10; 16 workers ≈ 700 fps).Verification
pixi run python -m pytest tests/→ 124 passed, 2 pre-existingtest_smokefailures (no local tiled server).Resolution status
With all fixes in, the pipeline now reproduces the engine's computation faithfully: a three-way comparison at 8192 frames (pipeline streaming vs the engine class offline with all points active vs the offline ePIE harness, identical verified inputs) shows pipeline ≈ engine-offline. The remaining quality gap vs the beamline reference is a missing feature, not a bug: streaming_recon is a DM-only subset, while the reference runs ML_grad for ~90% of iterations (alg_percentage=0.1). Porting the ML_grad stage is #64.
Known limitation
On a 16 GB GPU the engine holds 8192 of 40000 points (~786 KB/point) — the dashboard shows a band, not the full scan. Beamline-class GPUs with
live_num_points_maxsized to the scan (GUI used 41000) cover the whole grid. Raising capacity within GPU memory: setlive_num_points_maxin the config.🤖 Generated with Claude Code