Skip to content

fix(core): resolve analyzer perf regression and harden pirx-core visibility#17

Merged
m2papierz merged 5 commits into
masterfrom
fix/analyzer-perf
Jun 13, 2026
Merged

fix(core): resolve analyzer perf regression and harden pirx-core visibility#17
m2papierz merged 5 commits into
masterfrom
fix/analyzer-perf

Conversation

@m2papierz

Copy link
Copy Markdown
Contributor

What

Fix ~20-25% performance regression in ProfileAnalyzer::analyze introduced by error budget tracking, apply the same fix to monte_carlo::extract_summary, and tighten module/type visibility across pirx-core.

Why

CodSpeed flagged analyze[500] at -25.49%, analyze[100] at -22.85%, analyze[10] at -19.7% comparing master
(71f67ed) against the error budget baseline (8bc4d1a). The regression was caused by commit 92cc496 restructuring the GateServed match arm from a guarded pattern to an unguarded one, making every zero-wait event pay for a
saturating_add + bounds-checked array write it previously skipped. The same redundant-counter pattern existed in monte_carlo::extract_summary. Several internal-only types and modules were unnecessarily pub, widening the API surface beyond what downstream crates consume.

How

Analyzer (analyzer.rs):

  • Restored the guarded match arm (GateServed { gate, wait } if *wait > 0) for the stall+counting path; added a
    separate GateServed { .. } arm for zero-wait bucket counting only.
  • Replaced the redundant total_magic_states loop counter with trace.magic_states_consumed (already computed by the engine).
  • Fused the post-loop cumulative_magic_states prefix-sum and cumulative_infidelity computation into a single
    pass, eliminating a separate .map().collect() allocation.

Monte Carlo (monte_carlo.rs):

  • Removed the redundant magic_states_consumed counter; reads trace.magic_states_consumed directly.
  • Restored the GateServed guard so zero-wait events fall through to the no-op arm.

Visibility (lib.rs, buffer.rs, events.rs, routing.rs):

  • buffer, events, routing modules: pubpub(crate).
  • ScalarRouting, ManhattanRouting, RoutingKind, build_position_index, from_config: pubpub(crate).
  • MagicStateBuffer, EngineEvent, TimedEvent, EventQueue: pubpub(crate).
  • MagicStateBuffer::is_full, is_empty, EventQueue::is_empty: gated behind #[cfg(test)] (test-only usage).

Testing

  • make ci passes locally (fmt + clippy + test + audit)
  • New behavior has tests (existing 206 tests cover all changed paths)
  • Hot-path changes have criterion benchmarks (trace_analysis/analyze/{10,100,500})

Checklist

  • PR description explains why, not just what
  • No new unwrap()/expect() in production code
  • No new allocations in the simulation hot loop
  • Crate boundaries respected (pirx-core never imports from pirx-adapters)
  • New dependencies justified (not "it's popular" — what does it replace?)

Error budget tracking (92cc496) changed GateServed from a guarded match arm (wait > 0, falling to no-op catch-all) to an unguarded arm processing every event. In warm-start benchmarks most gates are served with zero wait, so each paid a redundant saturating_add + array write.
@codspeed-hq

codspeed-hq Bot commented Jun 13, 2026

Copy link
Copy Markdown

Merging this PR will improve performance by 19.31%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 3 improved benchmarks
✅ 12 untouched benchmarks

Performance Changes

Benchmark BASE HEAD Efficiency
analyze[500] 376.4 µs 314.1 µs +19.84%
analyze[100] 85.8 µs 71.6 µs +19.77%
analyze[10] 16.3 µs 13.8 µs +18.32%

Tip

Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.


Comparing fix/analyzer-perf (3711630) with master (71f67ed)

Open in CodSpeed

Revert the separate post-loop pass over trace events — benchmarking
showed it was slower than inline counting. Replace pre-allocated
vec![0; n] + fill loop for cumulative_magic_states and
cumulative_infidelity with iterator collect, avoiding two allocator
calls and ~60 KB of unnecessary zero-initialization.
cumulative_magic_states and cumulative_infidelity were pre-computed
Vec fields on ExecutionProfile, adding 2 allocations + 2 O(n) passes
to every analyze() call. Both are trivially derivable from
magic_states_per_bucket and p_logical. Move them to computed methods
so the cost is paid only by callers that need the timeline curves.
@m2papierz m2papierz merged commit f3cd97b into master Jun 13, 2026
11 checks passed
@m2papierz m2papierz deleted the fix/analyzer-perf branch June 13, 2026 22:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant