
fix: trace store incremental dirty tracking to eliminate GVL stall #43

Merged
Esity merged 3 commits into main from fix/trace-store-incremental-dirty-tracking on Jun 17, 2026

@Esity Esity commented Jun 17, 2026


Summary

  • Replace boolean @traces_dirty with per-trace @dirty_trace_ids Set + @deleted_trace_ids Set
  • snapshot_dirty_state now only serializes traces in the dirty set — O(changed) not O(total)
  • load_from_local stores raw DB rows directly instead of re-serializing all traces on boot
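The PR page does not show the boot-path code, so the following is only one plausible reading of the third bullet: on load, each row is parsed into an in-memory trace and the row itself is kept as read, with no `JSON.generate` pass over the collection. The row shape (`:trace_id`, `:payload`), the `BootSketch` class, and `@persisted_rows` are assumptions for illustration, not names from the codebase.

```ruby
require "json"

# Hypothetical boot path. Only load_from_local is a name from the PR;
# the row shape and @persisted_rows are assumptions.
class BootSketch
  attr_reader :traces, :persisted_rows

  def initialize
    @traces = {}          # trace_id => parsed trace
    @persisted_rows = {}  # trace_id => row exactly as read from the DB
  end

  def load_from_local(rows)
    rows.each do |row|
      id = row.fetch(:trace_id)
      # Parsing is still needed to use the trace in memory...
      @traces[id] = JSON.parse(row.fetch(:payload), symbolize_names: true)
      # ...but the persisted form is kept as-is instead of being regenerated.
      @persisted_rows[id] = row
    end
  end
end

rows = [{ trace_id: "a", payload: '{"trace_id":"a","strength":0.5}' }]
boot = BootSketch.new
boot.load_from_local(rows)
```

Because nothing is marked dirty during load in this sketch, a freshly booted store would have nothing to flush until a trace is actually modified.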

Problem

The trace store's snapshot_dirty_state was calling @traces.transform_values { |t| serialize_trace_for_db(t) } on every flush — serializing every trace to JSON just to diff against the last-persisted state. With thousands of traces containing nested payloads with floats, this:

  1. Monopolized the Ruby GVL for seconds (JSON.generate in the C extension does not release the GVL)
  2. Triggered massive GC pressure from intermediate String allocations
  3. Starved all other threads — measured at 97.5% single-core CPU via macOS Activity Monitor sample
  4. Made the entire LegionIO process unresponsive

The decay_cycle actor (which fires every 60s) calls store.store(trace) on every trace with non-zero decay. Each call set @traces_dirty = true, so the full-serialize path was guaranteed to run on every cycle.
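A minimal sketch of the pre-fix behaviour described above, assuming traces are plain hashes keyed by `:trace_id`. Only `@traces_dirty`, `snapshot_dirty_state`, `serialize_trace_for_db`, and the `transform_values` call come from the PR; the class name, `@persisted`, and the `serialize_calls` counter are hypothetical and exist only to make the cost visible.

```ruby
require "json"

# Hypothetical stand-in for the old store, not the project's actual class.
class BooleanDirtyStore
  attr_reader :serialize_calls

  def initialize
    @traces = {}
    @persisted = {}        # trace_id => JSON string from the last flush (assumed)
    @traces_dirty = false
    @serialize_calls = 0   # instrumentation for this sketch only
  end

  def store(trace)
    @traces[trace[:trace_id]] = trace
    @traces_dirty = true   # a single flag: no record of *which* trace changed
  end

  def snapshot_dirty_state
    return {} unless @traces_dirty

    # Serializes every trace just to find out which ones differ.
    current = @traces.transform_values { |t| serialize_trace_for_db(t) }
    changed = current.reject { |id, json| @persisted[id] == json }
    @persisted = current
    @traces_dirty = false
    changed
  end

  private

  def serialize_trace_for_db(trace)
    @serialize_calls += 1
    JSON.generate(trace)
  end
end

old_store = BooleanDirtyStore.new
1_000.times { |i| old_store.store({ trace_id: "t#{i}", strength: 1.0 }) }
old_store.snapshot_dirty_state                        # initial flush
old_store.store({ trace_id: "t0", strength: 0.9 })    # one trace decays
changed = old_store.snapshot_dirty_state
puts "changed=#{changed.size} serialized=#{old_store.serialize_calls}"
```

In this sketch the second flush reports one changed trace but still serializes all 1,000, which is the O(total) cost the PR removes.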

Fix

Track dirty state per trace ID instead of as a single boolean. During flush, serialize only the traces that actually changed. A decay cycle that touches 500 of 10,000 traces now serializes only those 500.
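The approach can be sketched as follows. This is an illustration under assumptions, not the merged code: `@dirty_trace_ids`, `@deleted_trace_ids`, `snapshot_dirty_state`, and `serialize_trace_for_db` are names from the PR, while the class name, the `delete` method, the `{ upserts:, deletes: }` return shape, and the `serialize_calls` counter are hypothetical.

```ruby
require "json"
require "set"

# Hypothetical stand-in for the fixed store.
class IncrementalDirtyStore
  attr_reader :serialize_calls

  def initialize
    @traces = {}
    @dirty_trace_ids = Set.new    # ids written since the last flush
    @deleted_trace_ids = Set.new  # ids removed since the last flush
    @serialize_calls = 0          # instrumentation for this sketch only
  end

  def store(trace)
    id = trace[:trace_id]
    @traces[id] = trace
    @dirty_trace_ids << id
    @deleted_trace_ids.delete(id) # a re-stored trace is no longer deleted
  end

  def delete(id)
    return unless @traces.delete(id)

    @dirty_trace_ids.delete(id)
    @deleted_trace_ids << id
  end

  def snapshot_dirty_state
    # Only ids in the dirty set are serialized: O(changed), not O(total).
    upserts = @dirty_trace_ids.to_h { |id| [id, serialize_trace_for_db(@traces.fetch(id))] }
    deletes = @deleted_trace_ids.to_a
    @dirty_trace_ids = Set.new
    @deleted_trace_ids = Set.new
    { upserts: upserts, deletes: deletes }
  end

  private

  def serialize_trace_for_db(trace)
    @serialize_calls += 1
    JSON.generate(trace)
  end
end

new_store = IncrementalDirtyStore.new
10_000.times { |i| new_store.store({ trace_id: "t#{i}", strength: 1.0 }) }
new_store.snapshot_dirty_state                       # initial flush
calls_before = new_store.serialize_calls

500.times { |i| new_store.store({ trace_id: "t#{i}", strength: 0.9 }) }
new_store.delete("t9999")
snapshot = new_store.snapshot_dirty_state
calls_for_flush = new_store.serialize_calls - calls_before
puts "upserts=#{snapshot[:upserts].size} deletes=#{snapshot[:deletes].size} serialized=#{calls_for_flush}"
```

Keeping deletions in a separate set lets a flush remove rows without serializing anything, and resetting both sets at snapshot time means an idle store flushes nothing.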

Test plan

  • Full spec suite passes (2058 examples, 0 failures)
  • Deploy and verify CPU drops from 97% to near-idle between ticks
  • Verify traces persist correctly after decay cycle (spot check SQLite)

snapshot_dirty_state was serializing ALL traces to JSON on every flush
just to diff and find which ones changed. With thousands of traces this
monopolized the Ruby GVL for seconds, starving every other thread and
pegging a single core at 100%.

Replace the boolean @traces_dirty flag with a per-trace @dirty_trace_ids
Set. Now flush only serializes traces that were actually modified —
O(changed) instead of O(total). Also eliminates the redundant
full-collection serialize on boot (load_from_local).
@Esity Esity requested a review from a team as a code owner June 17, 2026 22:36
@Esity Esity merged commit 1da5ccb into main Jun 17, 2026
14 checks passed