Several of Perry's worst benchmark gaps share one systemic root cause, not separate ones: Perry tracks per-object / per-slot GC layout (which slots hold raw-f64 numbers vs NaN-boxed pointers) in thread-local hashmaps (TYPED_LAYOUTS, LAYOUT_SLOT_MASKS) and queries or updates them on every field/element access. On macOS each access is a _tlv_get_addr (TLS accessor) + a hashmap hash+lookup. The all-numeric "unboxed" fast paths bypass it, but any heterogeneous shape, downgraded array, or raw-f64 class field falls into the per-op hashmap path, which dominates.
numeric_array_downgrade: an obj_type pre-filter that skips the per-element typedarray/buffer/set/map registry lookups gave only ~5%; a "skip the in-bounds RuntimeHandleScope rooting" fast path made it worse. So the cost is note_array_slot → layout_note_slot (the TLS layout write), per element.
There is no cheap shortcut. The fix is to make per-object layout O(1)-loadable from the object/GC header so the layout check/update is an inline bit-test/bit-set instead of a thread-local hashmap op.
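To make the cost difference concrete, here is a minimal Rust model of the two shapes of check. It is a sketch under stated assumptions, not Perry's code: `TYPED_LAYOUTS_MODEL`, `LAYOUT_CANONICAL`, the function names, and the bit position are all hypothetical stand-ins, and slots are assumed to fit a 64-bit mask.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    // Hypothetical stand-in for a TYPED_LAYOUTS-style side table:
    // object address -> bitmask of slots that hold raw f64.
    static TYPED_LAYOUTS_MODEL: RefCell<HashMap<usize, u64>> = RefCell::new(HashMap::new());
}

/// Record a layout in the modelled side table.
pub fn register_layout(obj_addr: usize, raw_f64_mask: u64) {
    TYPED_LAYOUTS_MODEL.with(|t| {
        t.borrow_mut().insert(obj_addr, raw_f64_mask);
    });
}

/// Per-op path: a TLS access, then a hash + lookup, on every call.
pub fn is_raw_f64_via_tls(obj_addr: usize, slot: u32) -> bool {
    TYPED_LAYOUTS_MODEL.with(|t| {
        t.borrow()
            .get(&obj_addr)
            .map_or(false, |&mask| slot < 64 && (mask >> slot) & 1 == 1)
    })
}

/// Hypothetical header flag bit meaning "layout is still canonical".
pub const LAYOUT_CANONICAL: u8 = 1 << 6;

/// Header path: one flag test plus a bit test against a compile-time mask.
/// Returns None when the bit is clear, so the caller falls back to the TLS path.
pub fn is_raw_f64_via_header(gc_flags: u8, class_mask: u64, slot: u32) -> Option<bool> {
    if gc_flags & LAYOUT_CANONICAL == 0 {
        return None;
    }
    Some(slot < 64 && (class_mask >> slot) & 1 == 1)
}
```

The header variant takes only values the caller already has in registers or as constants, which is what would let codegen inline it and let the optimizer hoist it out of a loop.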
Proposed fix — "layout is canonical" header bit + static class mask
GcHeader is 8 bytes with a free _reserved: u16 and spare gc_flags bits. Add a bit meaning "this object's slot layout still matches its declared/canonical shape" (no downgrade yet):
Class instances: when set, the authoritative raw-f64 mask is the compile-time perry_typed_shape_raw_f64_mask_<class> global codegen already emits — a constant codegen can bit-test inline (no TLS). With the method already inlined (perf(transform): inline small this-using methods on exact receivers (1.6x on method_calls) #5092), LLVM LICM hoists the loop-invariant check out of the hot loop.
Arrays: when set, skip the per-write layout_note_slot for scalar-over-scalar in-bounds stores.
On downgrade (a pointer/string written into a canonical slot): clear the bit and fall back to today's per-object TYPED_LAYOUTS/LAYOUT_SLOT_MASKS path (unchanged). The GC scanner consults the bit.
The TLS hashmap exists today only to track downgrades; the common (no-downgrade) case does not need it.
Optional complement: store a compact inline mask in _reserved for small objects/arrays (≤16 slots) so even small downgraded shapes avoid the hashmap.
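The header change and the optional inline mask can be sketched as follows. Only `gc_flags` and `_reserved: u16` come from the description above; the other header fields, the flag bit positions, and the names `GcHeaderModel`, `LayoutSource`, `FLAG_LAYOUT_CANONICAL`, and `FLAG_INLINE_MASK` are assumptions made for illustration.

```rust
/// Modelled 8-byte header. The text names only `gc_flags` and `_reserved: u16`;
/// `obj_type`, `size`, and every bit position below are assumptions.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct GcHeaderModel {
    pub obj_type: u8,
    pub gc_flags: u8,
    pub _reserved: u16,
    pub size: u32,
}

pub const FLAG_LAYOUT_CANONICAL: u8 = 1 << 6; // hypothetical spare bit
pub const FLAG_INLINE_MASK: u8 = 1 << 7; // hypothetical, for the optional complement
pub const INLINE_MASK_MAX_SLOTS: usize = 16; // `_reserved` is 16 bits wide

/// Where the authoritative raw-f64 mask for an object comes from.
#[derive(Debug, PartialEq)]
pub enum LayoutSource {
    /// Bit set: use the static per-class mask (no TLS).
    StaticClassMask,
    /// Bit clear, small shape: mask stored inline in `_reserved`.
    Inline(u16),
    /// Bit clear, larger shape: today's per-object side-table path.
    SideTable,
}

impl GcHeaderModel {
    pub fn new_canonical(obj_type: u8, size: u32) -> Self {
        Self { obj_type, gc_flags: FLAG_LAYOUT_CANONICAL, _reserved: 0, size }
    }

    pub fn layout_source(&self) -> LayoutSource {
        if self.gc_flags & FLAG_LAYOUT_CANONICAL != 0 {
            LayoutSource::StaticClassMask
        } else if self.gc_flags & FLAG_INLINE_MASK != 0 {
            LayoutSource::Inline(self._reserved)
        } else {
            LayoutSource::SideTable
        }
    }

    /// Downgrade: clear the canonical bit (this model never sets it again).
    /// Small shapes keep their new mask inline; larger ones are left to the
    /// caller's side-table registration, which is not modelled here.
    pub fn downgrade(&mut self, slot_count: usize, new_raw_f64_mask: u64) {
        self.gc_flags &= !FLAG_LAYOUT_CANONICAL;
        if slot_count <= INLINE_MASK_MAX_SLOTS {
            self.gc_flags |= FLAG_INLINE_MASK;
            self._reserved = new_raw_f64_mask as u16;
        }
    }
}
```

Keeping the three cases in one enum makes it explicit that the side table is reached only after both header checks fail.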
Correctness invariants (the crux — this is GC-internals, memory-corruption risk)
GC scanner sees the truth. Per-slot pointer/raw-f64 determination after the change must equal today's (gc/trace.rs:758, gc/copying.rs:309,579 consult per-slot layout_kind). A wrong mask = trace a number as a pointer (crash) or miss a pointer (use-after-free).
Representation. A slot read as raw double must hold raw f64; the canonical bit must be cleared before the first non-number is observable in a canonical slot (publish-order discipline, like descriptors_in_use).
Downgrade is monotonic + complete; the fallback path stays byte-for-byte current behavior.
GC moves transfer the bit + mask (gc/copying.rs:504, layout_transfer).
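A single-threaded Rust model of these invariants might look like the sketch below. The tag constants, `Obj`, `SideTable`, and all function names are hypothetical. The real NaN-box encoding and the real fallback tables are not reproduced, and the real runtime would also need to get ordering right with respect to the collector, which a single-threaded model cannot show.

```rust
use std::collections::HashMap;

pub const CANONICAL: u8 = 1 << 6; // hypothetical header bit
pub const TAG_MASK: u64 = 0xFFFF << 48; // hypothetical NaN-box tag field
pub const POINTER_TAG: u64 = 0x7FFD << 48; // hypothetical pointer tag

pub struct Obj {
    pub id: usize, // stands in for the object's address
    pub gc_flags: u8,
    pub slots: Vec<u64>, // raw bit patterns; slot indices assumed < 64
}

/// Model of the per-object fallback: object id -> raw-f64 slot mask.
pub type SideTable = HashMap<usize, u64>;

fn raw_f64_mask(obj: &Obj, class_mask: u64, side: &SideTable) -> u64 {
    if obj.gc_flags & CANONICAL != 0 {
        class_mask
    } else {
        side.get(&obj.id).copied().unwrap_or(0)
    }
}

/// What the scanner decides for one slot: a raw-f64 slot is never a pointer,
/// even if its bits happen to look like one.
pub fn scanner_sees_pointer(obj: &Obj, class_mask: u64, side: &SideTable, slot: usize) -> bool {
    let is_raw = (raw_f64_mask(obj, class_mask, side) >> slot) & 1 == 1;
    !is_raw && (obj.slots[slot] & TAG_MASK) == POINTER_TAG
}

/// Publish the downgrade before the pointer becomes visible in the slot.
pub fn store_pointer(obj: &mut Obj, class_mask: u64, side: &mut SideTable, slot: usize, addr: u64) {
    let bit = 1u64 << slot;
    if obj.gc_flags & CANONICAL != 0 {
        side.insert(obj.id, class_mask & !bit); // seed fallback from the canonical mask
        obj.gc_flags &= !CANONICAL; // cleared once, never set again
    } else if let Some(mask) = side.get_mut(&obj.id) {
        *mask &= !bit;
    }
    obj.slots[slot] = POINTER_TAG | (addr & !TAG_MASK);
}

/// A move carries the bit and any fallback mask to the new location.
pub fn evacuate(obj: Obj, new_id: usize, side: &mut SideTable) -> Obj {
    if let Some(mask) = side.remove(&obj.id) {
        side.insert(new_id, mask);
    }
    Obj { id: new_id, gc_flags: obj.gc_flags, slots: obj.slots }
}
```

The first case worth checking is a raw-f64 slot whose bits collide with the pointer tag: with the canonical bit set, the scanner must still treat it as a number.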
Phasing (each independently shippable + verifiable)
1. Spike + microbench harness; prototype the bit read-only and confirm it tracks downgrade under GC stress.
2. Arrays first (lowest blast radius): wire the bit for arrays; skip layout_note_slot + write barrier for scalar-over-scalar in-bounds writes; scanner honors the bit.
3. Class fields: emit the inline guard in codegen (expr/property_get.rs:1551, property_set.rs) using the header bit + static class mask, with a by-name fallback for the cleared-bit case.
4. Object property path if the mechanism generalizes.
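For the arrays phase, the intended store split could be modelled roughly as below. `NumArrayModel`, `Elem`, and `HOLE_BITS` are hypothetical. The counter stands in for the layout_note_slot + write barrier work that the fast path would skip. Treating hole-creating writes as a downgrade is an assumption of this sketch, not something the description above states.

```rust
/// Hypothetical bit pattern for a hole; the real runtime's encoding is not specified here.
const HOLE_BITS: u64 = 0x7FFC_0000_0000_0001;

pub enum Elem {
    Number(f64),
    Boxed(u64), // any NaN-boxed non-number (pointer, string, ...)
}

pub struct NumArrayModel {
    pub canonical: bool,        // stands in for the header bit
    pub data: Vec<u64>,         // raw bit patterns
    pub slow_path_calls: usize, // counts modelled per-write layout/barrier work
}

impl NumArrayModel {
    pub fn new(len: usize) -> Self {
        Self { canonical: true, data: vec![0f64.to_bits(); len], slow_path_calls: 0 }
    }

    pub fn set(&mut self, i: usize, v: Elem) {
        match v {
            // Fast path: scalar over scalar, in bounds, layout still canonical.
            // No layout note and no write barrier are modelled here.
            Elem::Number(n) if self.canonical && i < self.data.len() => {
                self.data[i] = n.to_bits();
            }
            other => self.set_slow(i, other),
        }
    }

    fn set_slow(&mut self, i: usize, v: Elem) {
        self.slow_path_calls += 1;
        if i >= self.data.len() {
            if i > self.data.len() {
                self.canonical = false; // assumption: creating holes downgrades
            }
            self.data.resize(i + 1, HOLE_BITS);
        }
        let bits = match v {
            Elem::Number(n) => n.to_bits(),
            Elem::Boxed(b) => {
                self.canonical = false; // a non-number downgrades the array
                b
            }
        };
        self.data[i] = bits;
    }
}
```

Once the bit is cleared, every later write in this model takes the slow path, including numeric ones, which mirrors the "fallback path stays unchanged" requirement.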
Verification
Full local parity (./run_parity_tests.sh) — zero NEW regressions vs base (compare combined stdout+stderr per file). cargo test --release --workspace.
GC correctness: run under PERRY_GC_VERIFY_EVACUATION=1, PERRY_GC_FORCE_EVACUATE=1, PERRY_GC_DIAG=1, and PERRY_GEN_GC=0 (full mark-sweep) — these panic on the corruption modes a wrong mask causes.
Targeted tests: (a) write a non-number into a number-typed class field via an any alias, then read it back; (b) a numeric array that receives an object slot then is GC-evacuated mid-loop; (c) holey/sparse downgraded arrays.
Per-benchmark perf regression gate.
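One way to exercise the "scanner sees the truth" invariant outside the runtime is a small differential model: keep a reference mask that is updated on every store, and compare it with the mask the bit-based scheme would report. The sketch below is hypothetical throughout (`Models`, `run_differential`, the xorshift PRNG) and assumes that number stores never change a slot's kind, matching the monotonic-downgrade invariant.

```rust
/// Tiny deterministic PRNG (xorshift64) so the sketch needs no external crates.
/// The seed must be non-zero.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Reference: a per-slot mask updated on every store (today's behaviour, modelled).
/// Candidate: canonical bit + static class mask, with a fallback mask after downgrade.
pub struct Models {
    pub class_mask: u64,
    pub reference_mask: u64,
    pub canonical: bool,
    pub fallback_mask: u64,
}

impl Models {
    pub fn new(class_mask: u64) -> Self {
        Self { class_mask, reference_mask: class_mask, canonical: true, fallback_mask: 0 }
    }

    pub fn store(&mut self, slot: u32, is_number: bool) {
        if is_number {
            return; // assumption: a number store never changes a slot's kind
        }
        let bit = 1u64 << slot;
        self.reference_mask &= !bit;
        if self.canonical {
            // Only a non-number landing in a raw-f64 slot forces the downgrade.
            if self.class_mask & bit != 0 {
                self.fallback_mask = self.class_mask & !bit;
                self.canonical = false;
            }
        } else {
            self.fallback_mask &= !bit;
        }
    }

    pub fn candidate_mask(&self) -> u64 {
        if self.canonical { self.class_mask } else { self.fallback_mask }
    }
}

/// Returns true if both models agree after every step. `slots` must be in 1..=63.
pub fn run_differential(seed: u64, slots: u32, steps: usize) -> bool {
    let mut rng = XorShift(seed);
    let class_mask = rng.next_u64() & ((1u64 << slots) - 1);
    let mut m = Models::new(class_mask);
    for _ in 0..steps {
        let r = rng.next_u64();
        m.store((r % slots as u64) as u32, (r & (1u64 << 40)) != 0);
        if m.candidate_mask() != m.reference_mask {
            return false;
        }
    }
    true
}
```

A model like this does not replace the GC stress runs listed above; it only gives a cheap first signal that the downgrade bookkeeping is consistent before touching the scanner.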
Risk: HIGH (memory-corruption class). Effort: L (GC-internals). Maintainer-driven; not for autonomous execution.
Summary of affected paths (benchmark: hot operation; per-op call chain)
method_calls (#5093): this.field get/set in class methods; js_typed_feedback_class_field_{get,set}_guard → class_field_fast_contract → layout_typed_raw_f64_slot_for_user (TLS TYPED_LAYOUTS)
bench_numeric_array_downgrade: arr[i] = … on heterogeneous/any[] arrays; js_array_set_f64_extend → note_array_slot → layout_note_slot (TLS), per write
bench_object_property
Evidence (measured)
sample): the downgrade hot loop is a storm of _tlv_get_addr (TLS) calls.
register_site call (perf(codegen): make typed-feedback site registration opt-in (3.6x on dynamic property access) #5084) → ~2%; an MRU cache over layout_typed_raw_f64_slot_for_user made it 2× SLOWER (3300 → 7050 ms). Inlining the guard is blocked because the raw-f64 check is a TLS hashmap lookup, not an O(1) header field.
Prior groundwork on this line
register_site opt-in (object_property 3.6×).
this.field methods (method_calls 1.6×; prerequisite: the guard must be inlined into the loop for the phase-3 hoist to pay off).
Files
gc/types.rs (GcHeader), gc/layout.rs (TYPED_LAYOUTS/LAYOUT_SLOT_MASKS, TypedLayoutDescriptor, layout_note_slot, layout_typed_raw_f64_slot_for_user), gc/trace.rs + gc/copying.rs (scanner + layout_transfer), array/indexing.rs + array/header.rs (note_array_slot), typed_feedback/guards.rs (class_field_fast_contract), codegen expr/property_get.rs:1551, expr/property_set.rs, typed_shape.rs.