Enable the composite decoder in surface-code example. Depends on #536 by melody-ren · Pull Request #619 · NVIDIA/cudaqx

melody-ren · 2026-06-18T20:52:09Z

Summary

This wires the TRT + PyMatching composite decoder into the merged HOST_CALL realtime path and adds coverage for both the direct composite RPC path and the surface-code example.

- Add a TrtDecoderHostCallRpc unit test that builds the composite decoder through the standard decoder config path, routes it through the in-process HOST_CALL session, and validates observable corrections.
- Extend surface_code-1 to support trt_decoder configs with a TensorRT predecoder and nested PyMatching global decoder.
- Use the Z-sector DEM for the TRT/PyMatching surface-code path because the full X+Z DEM generated by this example is not graphlike for PyMatching.
- Add a small ONNX generator for split predecoder outputs [pre_L | residual].
- Add TRT surface-code test variants, including inproc_rpc, and make CI install onnx/pyyaml before configure so the tests are actually registered.
- Preserve readable syndrome dumps with ROUND_START markers while truncating packed syndrome bits to the real per-round syndrome width.
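
For context on the Z-sector choice: a matching decoder needs a graphlike error model, meaning every error mechanism flips at most two detectors so that it can be drawn as an edge (or a boundary half-edge). A minimal sketch of that check, assuming the DEM is available as a dense 0/1 check matrix with one column per error mechanism. The `is_graphlike` helper and the toy matrices are illustrative only and are not part of this PR:

```python
def is_graphlike(H):
    """Return True if every column of the 0/1 matrix H has weight <= 2.

    H is a list of rows (one per detector); each column is one error
    mechanism. This representation is an assumption for illustration.
    """
    if not H:
        return True
    num_cols = len(H[0])
    for col in range(num_cols):
        # Column weight = number of detectors this mechanism flips.
        weight = sum(row[col] for row in H)
        if weight > 2:
            return False
    return True


# Made-up example: every mechanism flips one or two detectors.
graphlike_H = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]

# Made-up example: column 0 flips three detectors (a hyperedge).
hyperedge_H = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
]

print(is_graphlike(graphlike_H))
print(is_graphlike(hyperedge_H))
```

A check like this could be run on a candidate check matrix before handing it to a matching decoder; it says nothing about which specific mechanisms make the full X+Z DEM non-graphlike in this example.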

Merge sequence

Depends on #536 and should be merged after it.

Testing

Ran in cudaqx-public-pr615-cu13-dev:

cmake --build /tmp/cudaqx-public-pr615-pr536-trt-build-mainrtlibs --target surface_code-1-local
ctest --output-on-failure -R "TrtDecoderHostCallRpc|app_examples.surface_code-1-local-test-distance-3-trt"
ctest --output-on-failure -R "app_examples.surface_code-1-local-test-distance-3($|-trt)"
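
The second `-R` pattern is meant to select the baseline distance-3 test together with its `-trt` variants. A small sketch of which names it would pick up, using Python's `re.search` as a stand-in for CTest's substring regex matching; apart from the pattern itself, the test names below are hypothetical and only illustrate the selection logic:

```python
import re

# Pattern copied from the ctest invocation above.
PATTERN = r"app_examples.surface_code-1-local-test-distance-3($|-trt)"

# Hypothetical registered test names, for illustration only.
names = [
    "app_examples.surface_code-1-local-test-distance-3",
    "app_examples.surface_code-1-local-test-distance-3-trt",
    "app_examples.surface_code-1-local-test-distance-3-trt-inproc_rpc",
    "app_examples.surface_code-1-local-test-distance-3-other",
    "app_examples.surface_code-1-local-test-distance-5",
]

# A name is selected if the pattern matches anywhere inside it.
selected = [n for n in names if re.search(PATTERN, n)]

for n in selected:
    print(n)
```

Under these assumptions the pattern keeps the bare `distance-3` name and anything continuing with `-trt`, and skips other suffixes and other distances.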

…output

Add a "predecoder" execution mode to the TensorRT decoder so it can be chained with a second decoder (e.g. PyMatching) and return logical-frame observables directly. The TRT model is assumed to emit a single output that concatenates [pre_L (num_observables entries), residual_dets (rest)].

New constructor parameters:
- "batch_size": required when the ONNX model has a dynamic batch dim. Used to size the optimization profile and pre-allocate I/O buffers.
- "global_decoder" + "global_decoder_params": optional decoder name and params for a follow-up decoder run on the residual_dets portion of the TRT output. Created with the same H passed to trt_decoder.
- "O": observables matrix (num_observables x block_size). Enables decode()/decode_batch() to return the predicted logical frame. Number of observables is inferred from O.shape()[0].

Decode behavior matrix:
- no global_decoder, no O -> raw TRT output (unchanged).
- no global_decoder, O -> return the pre_L prefix only.
- global_decoder, no O -> entire output -> global_decoder.result.
- global_decoder, O -> residual -> global_decoder; return pre_L XOR global_decoder.logical_frame.

Constructor validation when O is set:
- output_size_per_sample >= num_observables, and
- when global_decoder_ is set, output_size_per_sample == num_observables + global_decoder.syndrome_size.

Other changes:
- Dynamic batch support: setInputShape per call when the model's batch dim is -1; ONNX builder now installs a min/opt/max optimization profile when "batch_size" is provided.
- Split decode_batch into a typed decode_batch_impl<float|uint8_t> for cleaner dtype dispatch (engine I/O dtypes float32 / uint8 unchanged).
- Better INFO logging: total non-zero input vs residual detector counts per batch to help diagnose predecoder behavior.

Signed-off-by: Ben Howe <bhowe@nvidia.com>
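
The decode behavior matrix and the size checks in this commit message can be sketched as follows. This is a plain-Python illustration, not the decoder's code: it treats one sample's TRT output as a list of 0/1 bits, models the global decoder as a hypothetical callable that maps a syndrome to a bit list, and uses a toy parity function in place of PyMatching. The names `check_sizes`, `combine_outputs`, and `toy_global_decode` are made up for this sketch.

```python
def check_sizes(output_size, num_observables, global_syndrome_size=None):
    """Mirror the two constructor checks that apply when O is set."""
    if output_size < num_observables:
        raise ValueError("output is shorter than num_observables")
    if (global_syndrome_size is not None
            and output_size != num_observables + global_syndrome_size):
        raise ValueError("output size != num_observables + syndrome_size")


def combine_outputs(trt_output, num_observables=None, global_decode=None):
    """Apply the four-way decode matrix to one sample.

    num_observables=None stands for "no O"; global_decode=None stands
    for "no global_decoder". Both conventions are assumptions here.
    """
    if num_observables is None:
        if global_decode is None:
            return list(trt_output)             # raw TRT output
        return global_decode(list(trt_output))  # whole output -> global decoder
    check_sizes(len(trt_output), num_observables)
    pre_l = list(trt_output[:num_observables])
    residual = list(trt_output[num_observables:])
    if global_decode is None:
        return pre_l                            # pre_L prefix only
    frame = global_decode(residual)
    if len(frame) != num_observables:
        raise ValueError("global decoder frame has the wrong length")
    # pre_L XOR the global decoder's logical frame
    return [a ^ b for a, b in zip(pre_l, frame)]


def toy_global_decode(syndrome):
    """Stand-in for a real global decoder: one observable = syndrome parity."""
    return [sum(syndrome) % 2]


sample = [1, 0, 1, 1]  # [pre_L | residual] with one observable
print(combine_outputs(sample))
print(combine_outputs(sample, num_observables=1))
print(combine_outputs(sample, global_decode=toy_global_decode))
print(combine_outputs(sample, 1, toy_global_decode))
```

In the last call the residual `[0, 1, 1]` has even parity, so the toy frame is `[0]` and the returned value equals `pre_L`.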

Add a realtime test/demo that initializes the TensorRT decoder from an ONNX predecoder model with PyMatching configured as the global decoder. The driver loads detector, observable, parity-check, and prior data from the Stim export bundle, decodes samples through the composite TRT+PyMatching path, and reports latency, throughput, correctness, and residual-syndrome diagnostics.

Register the new test_trt_decoder_composite target when TensorRT, realtime, and the TRT decoder plugin are available.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>

Add YAML/config support for TRT decoder runtime options including batch size, CUDA graph execution, global decoder selection, and PyMatching-specific global decoder parameters.

Wire realtime decoder construction so TRT configs receive the top-level observable matrix from O_sparse, and pass the same O matrix into PyMatching global decoder params for composite observable decoding.

Expose the new config fields through Python bindings and heterogeneous_map round-tripping. Extend YAML tests for TRT config round-trip, runtime parameter conversion, and O_sparse-to-O injection.

Update test_trt_decoder_composite to support an optional --config-yaml path, allowing the existing composite demo to construct and run a real TRT+PyMatching decoder directly from YAML while preserving the original manual CLI path.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>
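
To illustrate the O_sparse-to-O step mentioned in this commit, here is a minimal sketch of expanding a sparse observable matrix into a dense 0/1 matrix. It assumes O_sparse is a flat list of nonzero column indices in which each row ends with -1; that encoding is an assumption of this sketch and should be checked against the config documentation. The helper name `sparse_rows_to_dense` is hypothetical.

```python
def sparse_rows_to_dense(sparse, num_cols):
    """Expand a flat, -1-terminated row list into dense 0/1 rows.

    Assumed encoding (not confirmed by this PR): column indices of the
    nonzero entries of each row, with -1 closing every row.
    """
    rows = []
    current = [0] * num_cols
    pending = False  # True while the current row has unterminated entries
    for idx in sparse:
        if idx == -1:
            rows.append(current)
            current = [0] * num_cols
            pending = False
        else:
            if not 0 <= idx < num_cols:
                raise ValueError(f"column index {idx} out of range")
            current[idx] = 1
            pending = True
    if pending:
        raise ValueError("last row is not terminated by -1")
    return rows


# Two observables over three columns: row 0 = {0, 2}, row 1 = {1}.
O = sparse_rows_to_dense([0, 2, -1, 1, -1], num_cols=3)
num_observables = len(O)  # analogous to inferring it from O.shape()[0]
print(O, num_observables)
```

The dense result has one row per observable, which matches how the TRT decoder commit describes inferring the number of observables from the first dimension of O.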

…yaml

# Conflicts:
# libs/qec/unittests/realtime/CMakeLists.txt
# libs/qec/unittests/realtime/test_trt_decoder_composite.cpp

Replace the TRT decoder's hardcoded optional PyMatching global decoder params with a tagged global_decoder_config variant. Preserve PyMatching as the current supported concrete config while using std::monostate for the unset case.

Update heterogeneous-map conversion, YAML mapping, and Python bindings so the existing PyMatching YAML/Python surface continues to round-trip. Extend the YAML unit test to verify the PyMatching variant arm is selected and still produces the expected runtime parameter map.

Signed-off-by: Scott Thornton <wsttiger@gmail.com>

…yaml

# Conflicts:
# libs/qec/python/bindings/py_decoding_config.cpp

Signed-off-by: Scott Thornton <wsttiger@gmail.com>