
perf(GC): make per-object layout O(1)-loadable — kill per-operation thread-local layout tracking (umbrella: method_calls/array-downgrade/object-property) #5094

@TheHypnoo

Summary

Several of Perry's worst benchmark gaps share one systemic root cause, not separate ones: Perry tracks per-object / per-slot GC layout (which slots hold raw-f64 numbers vs NaN-boxed pointers) in thread-local hashmaps (TYPED_LAYOUTS, LAYOUT_SLOT_MASKS) and queries or updates them on every field/element access. On macOS each access is a _tlv_get_addr (TLS accessor) + a hashmap hash+lookup. The all-numeric "unboxed" fast paths bypass it, but any heterogeneous shape, downgraded array, or raw-f64 class field falls into the per-op hashmap path, which dominates.

| Benchmark | Gap vs Node | Hot path | Per-op cost |
|---|---|---|---|
| method_calls (#5093) | ~290× (3300 ms vs 11 ms) | this.field get/set in class methods | js_typed_feedback_class_field_{get,set}_guard → class_field_fast_contract → layout_typed_raw_f64_slot_for_user (TLS TYPED_LAYOUTS) |
| bench_numeric_array_downgrade | ~781×; ~21× over the same-shape numeric array | arr[i]=… on heterogeneous/any[] arrays | js_array_set_f64_extend → note_array_slot → layout_note_slot (TLS), per write |
| bench_object_property | ~17× after #5084 | dynamic property writes | same field-access guard family |
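
As an illustration of the per-op cost listed above, here is a minimal model of a thread-local layout side table. The names `TYPED_LAYOUTS_MODEL`, `model_register_layout`, and `model_slot_is_raw_f64` are hypothetical, and so are the key and value types (object address to a 64-bit raw-f64 mask); the real TYPED_LAYOUTS representation is not shown in this issue.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Hypothetical stand-in for the per-object side table. Assumption: keyed by
// object address, value is a bitmask where bit i set means slot i is a raw f64.
thread_local! {
    static TYPED_LAYOUTS_MODEL: RefCell<HashMap<usize, u64>> =
        RefCell::new(HashMap::new());
}

/// Record the raw-f64 slot mask for one object.
pub fn model_register_layout(obj_addr: usize, raw_f64_mask: u64) {
    TYPED_LAYOUTS_MODEL.with(|m| {
        m.borrow_mut().insert(obj_addr, raw_f64_mask);
    });
}

/// The per-access query: one TLS access plus one hash and probe,
/// repeated on every field or element operation.
pub fn model_slot_is_raw_f64(obj_addr: usize, slot: u32) -> bool {
    TYPED_LAYOUTS_MODEL.with(|m| {
        m.borrow()
            .get(&obj_addr)
            .copied()
            .map_or(false, |mask| slot < 64 && (mask >> slot) & 1 == 1)
    })
}
```

Every call goes through the thread-local accessor and then hashes the key, so a hot loop pays both costs once per access even when the answer never changes.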

Evidence (measured)

  • Profile (sample): the downgrade hot loop is a storm of _tlv_get_addr (TLS) calls.
  • method_calls: removing the per-access register_site call (#5084, "perf(codegen): make typed-feedback site registration opt-in (3.6x on dynamic property access)") gained only ~2%; an MRU cache over layout_typed_raw_f64_slot_for_user made it 2× slower (3300 → 7050 ms). Inlining the guard is blocked because the raw-f64 check is a TLS hashmap lookup, not an O(1) header field.
  • numeric_array_downgrade: an obj_type pre-filter that skips the per-element typedarray/buffer/set/map registry lookups gave only ~5%; a "skip the in-bounds RuntimeHandleScope rooting" fast path made it worse. So the cost is note_array_slot → layout_note_slot (the TLS layout write), per element.

There is no cheap shortcut. The fix is to make per-object layout O(1)-loadable from the object/GC header so the layout check/update is an inline bit-test/bit-set instead of a thread-local hashmap op.

Proposed fix — "layout is canonical" header bit + static class mask

GcHeader is 8 bytes with a free _reserved: u16 and spare gc_flags bits. Add a bit meaning "this object's slot layout still matches its declared/canonical shape" (no downgrade yet):

  • Class instances: when set, the authoritative raw-f64 mask is the compile-time perry_typed_shape_raw_f64_mask_<class> global codegen already emits — a constant codegen can bit-test inline (no TLS). With the method already inlined (perf(transform): inline small this-using methods on exact receivers (1.6x on method_calls) #5092), LLVM LICM hoists the loop-invariant check out of the hot loop.
  • Arrays: when set, skip the per-write layout_note_slot for scalar-over-scalar in-bounds stores.
  • On downgrade (a pointer/string written into a canonical slot): clear the bit and fall back to today's per-object TYPED_LAYOUTS/LAYOUT_SLOT_MASKS path (unchanged). The GC scanner consults the bit.

The TLS hashmap exists today only to track downgrades; the common (no-downgrade) case does not need it.
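
A minimal sketch of the guard described above, under stated assumptions: `GcHeaderModel`, its field order, and the `LAYOUT_CANONICAL` bit position are hypothetical (the issue only states that the header is 8 bytes with a free `_reserved: u16` and spare `gc_flags` bits). The `slow_path` closure stands in for today's TYPED_LAYOUTS / LAYOUT_SLOT_MASKS lookup.

```rust
/// Hypothetical 8-byte header model. Only `_reserved: u16` and the existence
/// of spare `gc_flags` bits come from the issue; the rest is assumed.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct GcHeaderModel {
    pub obj_type: u8,
    pub gc_flags: u8,
    pub _reserved: u16,
    pub size: u32,
}

/// Assumed bit position for "layout still matches the canonical shape".
pub const LAYOUT_CANONICAL: u8 = 1 << 7;

/// Decide whether `slot` holds a raw f64.
/// Fast path: header bit set, so the compile-time class mask is authoritative
/// and the check is a load plus two bit-tests.
/// Slow path: bit cleared, defer to the existing per-object lookup.
pub fn slot_is_raw_f64(
    header: &GcHeaderModel,
    static_class_mask: u64,
    slot: u32,
    slow_path: impl FnOnce() -> bool,
) -> bool {
    if header.gc_flags & LAYOUT_CANONICAL != 0 {
        slot < 64 && (static_class_mask >> slot) & 1 == 1
    } else {
        slow_path()
    }
}
```

Because the fast path depends only on the header byte and a constant, a compiler can in principle hoist it out of a loop whose body does not write the header, which is the effect the class-instance bullet relies on.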

Optional complement: store a compact inline mask in _reserved for small objects/arrays (≤16 slots) so even small downgraded shapes avoid the hashmap.
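
One way the optional inline mask could look, as a sketch only: the 16-bit `_reserved` field holds one bit per slot (bit set = raw f64). The function names and the convention that objects with more than 16 slots return `None` (meaning "ask the side table") are assumptions, as is the existence of some separate signal that `_reserved` currently holds a mask.

```rust
/// Largest object that fits a one-bit-per-slot mask in a u16.
pub const INLINE_MASK_MAX_SLOTS: u32 = 16;

/// Answer from the inline mask when possible.
/// Returns None when the object is too large or the slot is out of range,
/// in which case the caller would fall back to the per-object side table.
pub fn inline_slot_is_raw_f64(reserved: u16, slot_count: u32, slot: u32) -> Option<bool> {
    if slot_count > INLINE_MASK_MAX_SLOTS || slot >= slot_count {
        return None;
    }
    // slot < slot_count <= 16, so the shift cannot overflow.
    Some((reserved >> slot) & 1 == 1)
}

/// Downgrade a single slot: clear its raw-f64 bit, leave the others intact.
pub fn inline_mark_boxed(reserved: u16, slot: u32) -> u16 {
    debug_assert!(slot < INLINE_MASK_MAX_SLOTS);
    reserved & !(1u16 << slot)
}
```

With this shape, a small object that has one downgraded field still answers layout queries from its own header word.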

Correctness invariants (the crux — this is GC-internals, memory-corruption risk)

  1. GC scanner sees the truth. Per-slot pointer/raw-f64 determination after the change must equal today's (gc/trace.rs:758, gc/copying.rs:309,579 consult per-slot layout_kind). A wrong mask = trace a number as a pointer (crash) or miss a pointer (use-after-free).
  2. Representation. A slot read as raw double must hold raw f64; the canonical bit must be cleared before the first non-number is observable in a canonical slot (publish-order discipline, like descriptors_in_use).
  3. Downgrade is monotonic + complete; the fallback path stays byte-for-byte current behavior.
  4. GC moves transfer the bit + mask (gc/copying.rs:504 layout_transfer).
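
Invariants 1 to 3 can be modeled in a few lines. This is a single-threaded sketch with hypothetical names (`ObjModel`, `FallbackLayouts`, `store_non_number`, `scanner_slot_is_raw_f64`, `CANONICAL_BIT`); a plain map replaces the thread-local tables, and program order stands in for whatever fence or barrier the real runtime would need. Treating a cleared-bit object with no side-table entry as all-boxed is a choice of this model, not something the issue specifies.

```rust
use std::collections::HashMap;

/// Assumed flag position for the "layout is canonical" bit.
pub const CANONICAL_BIT: u8 = 1 << 7;

pub struct ObjModel {
    pub gc_flags: u8,
    /// Raw slot bits: either an f64 bit pattern or a NaN-boxed value.
    pub slots: Vec<u64>,
}

/// Stand-in for the existing per-object fallback tables.
pub struct FallbackLayouts {
    masks: HashMap<usize, u64>,
}

impl FallbackLayouts {
    pub fn new() -> Self {
        FallbackLayouts { masks: HashMap::new() }
    }
}

/// Store a non-number into `slot`, downgrading first if needed.
pub fn store_non_number(
    obj: &mut ObjModel,
    id: usize,
    class_mask: u64,
    slot: usize,
    boxed_bits: u64,
    fallback: &mut FallbackLayouts,
) {
    assert!(slot < 64 && slot < obj.slots.len());
    if obj.gc_flags & CANONICAL_BIT != 0 {
        // 1. Seed the per-object mask from the class mask, minus this slot.
        fallback.masks.insert(id, class_mask & !(1u64 << slot));
        // 2. Clear the bit. Nothing in this model ever sets it again.
        obj.gc_flags &= !CANONICAL_BIT;
    } else {
        *fallback.masks.entry(id).or_insert(0) &= !(1u64 << slot);
    }
    // 3. Only now does the non-number become observable in the slot.
    obj.slots[slot] = boxed_bits;
}

/// What the scanner would conclude about a slot after the change.
pub fn scanner_slot_is_raw_f64(
    obj: &ObjModel,
    id: usize,
    class_mask: u64,
    slot: usize,
    fallback: &FallbackLayouts,
) -> bool {
    let mask = if obj.gc_flags & CANONICAL_BIT != 0 {
        class_mask
    } else {
        fallback.masks.get(&id).copied().unwrap_or(0)
    };
    slot < 64 && (mask >> slot) & 1 == 1
}
```

The property to check under GC stress is that `scanner_slot_is_raw_f64` never reports raw f64 for a slot that already holds a boxed value, at any point during the store.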

Phasing (each independently shippable + verifiable)

  1. Spike + microbench harness; prototype the bit read-only and confirm it tracks downgrade under GC stress.
  2. Arrays first (lowest blast radius): wire the bit for arrays; skip layout_note_slot + write barrier for scalar-over-scalar in-bounds writes; scanner honors the bit.
  3. Class fields: emit the inline guard in codegen (expr/property_get.rs:1551, property_set.rs) using the header bit + static class mask, by-name fallback for the cleared-bit case.
  4. Object property path if the mechanism generalizes.
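
For phase 2, the array store path might be shaped like the following model. `ArrayModel`, `ARRAY_CANONICAL`, and `layout_note_slot_model` are hypothetical; the counter exists only so a microbenchmark or test can confirm that the fast path really skips the layout write. Filling new elements with 0.0 on extension is a simplification of this sketch.

```rust
/// Assumed flag position for the array "layout is canonical" bit.
pub const ARRAY_CANONICAL: u8 = 1 << 7;

pub struct ArrayModel {
    pub gc_flags: u8,
    pub elems: Vec<f64>,
    /// Counts how often the (modeled) per-write layout update ran.
    pub note_calls: usize,
}

impl ArrayModel {
    /// Stand-in for the per-write layout update on the slow path.
    fn layout_note_slot_model(&mut self, _index: usize) {
        self.note_calls += 1;
    }

    /// Numeric store. Scalar-over-scalar, in-bounds, canonical arrays take
    /// the fast path and never touch the layout table.
    pub fn set_f64(&mut self, index: usize, value: f64) {
        let in_bounds = index < self.elems.len();
        let canonical = self.gc_flags & ARRAY_CANONICAL != 0;
        if in_bounds && canonical {
            self.elems[index] = value;
            return;
        }
        // Slow path: extend if needed, then record the slot layout as today.
        if !in_bounds {
            self.elems.resize(index + 1, 0.0);
        }
        self.layout_note_slot_model(index);
        self.elems[index] = value;
    }
}
```

A harness built on this shape can assert that `note_calls` stays at zero across a canonical hot loop and becomes non-zero only after a downgrade or an out-of-bounds write.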

Verification

  • Full local parity (./run_parity_tests.sh): zero new regressions vs base, comparing combined stdout+stderr per file. Also run cargo test --release --workspace.
  • GC correctness: run under PERRY_GC_VERIFY_EVACUATION=1, PERRY_GC_FORCE_EVACUATE=1, PERRY_GC_DIAG=1, and PERRY_GEN_GC=0 (full mark-sweep) — these panic on the corruption modes a wrong mask causes.
  • Targeted tests: (a) write a non-number into a number-typed class field via an any alias, then read it back; (b) a numeric array that receives an object slot then is GC-evacuated mid-loop; (c) holey/sparse downgraded arrays.
  • Per-benchmark perf regression gate.

Risk: HIGH (memory-corruption class). Effort: L (GC-internals). Maintainer-driven; not for autonomous execution.

Prior groundwork on this line

Files

gc/types.rs (GcHeader), gc/layout.rs (TYPED_LAYOUTS/LAYOUT_SLOT_MASKS, TypedLayoutDescriptor, layout_note_slot, layout_typed_raw_f64_slot_for_user), gc/trace.rs + gc/copying.rs (scanner + layout_transfer), array/indexing.rs + array/header.rs (note_array_slot), typed_feedback/guards.rs (class_field_fast_contract), codegen expr/property_get.rs:1551, expr/property_set.rs, typed_shape.rs.
