perf(GC): make per-object layout O(1)-loadable, kill per-operation thread-local layout tracking (umbrella: method_calls/array-downgrade/object-property) #5094

## Summary

Several of Perry's worst benchmark gaps share **one** systemic root cause rather than separate ones: Perry tracks **per-object / per-slot GC layout** (which slots hold raw-f64 numbers vs NaN-boxed pointers) in **thread-local hashmaps** (`TYPED_LAYOUTS`, `LAYOUT_SLOT_MASKS`) and **queries or updates them on every field/element access**. On macOS each access costs a `_tlv_get_addr` call (the TLS accessor) plus a hashmap hash and lookup. The all-numeric "unboxed" fast paths bypass this, but **any heterogeneous shape, downgraded array, or raw-f64 class field falls into the per-op hashmap path**, and that path dominates the run time.

| Benchmark | Gap vs Node | Hot path | Per-op cost |
|---|---|---|---|
| `method_calls` (#5093) | ~290× (3300 ms vs 11 ms) | `this.field` get/set in class methods | `js_typed_feedback_class_field_{get,set}_guard` → `class_field_fast_contract` → `layout_typed_raw_f64_slot_for_user` (TLS `TYPED_LAYOUTS`) |
| `bench_numeric_array_downgrade` | ~781×; ~21× over the same-shape numeric array | `arr[i]=…` on heterogeneous/`any[]` arrays | `js_array_set_f64_extend` → `note_array_slot` → `layout_note_slot` (TLS), **per write** |
| `bench_object_property` | ~17× after #5084 | dynamic property writes | same field-access guard family |
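
To make the per-op cost in the table concrete, here is a toy model of the pattern, not Perry's actual code. `LAYOUTS`, `note_raw_f64_slot`, and `is_raw_f64_slot` are hypothetical stand-ins for the `TYPED_LAYOUTS` family, and the sketch assumes at most 64 slots per object. The point is only that every query pays for a thread-local access and a hash lookup, even when the answer never changes inside the loop.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    // Hypothetical stand-in for a per-thread layout table:
    // object address -> bitmask of slots that hold a raw f64.
    static LAYOUTS: RefCell<HashMap<usize, u64>> = RefCell::new(HashMap::new());
}

/// Record that `slot` (< 64) of the object at `addr` holds a raw f64.
fn note_raw_f64_slot(addr: usize, slot: u32) {
    LAYOUTS.with(|t| *t.borrow_mut().entry(addr).or_insert(0) |= 1u64 << slot);
}

/// Per-access query: one TLS access plus one hash lookup on every call.
fn is_raw_f64_slot(addr: usize, slot: u32) -> bool {
    LAYOUTS.with(|t| t.borrow().get(&addr).map_or(false, |m| (m >> slot) & 1 == 1))
}

fn main() {
    let obj = [1.5f64, 2.5];
    let addr = obj.as_ptr() as usize;
    note_raw_f64_slot(addr, 0);
    note_raw_f64_slot(addr, 1);

    // The layout is loop-invariant, yet the guard is re-evaluated per access
    // because the compiler cannot hoist a lookup into thread-local state.
    let mut sum = 0.0;
    for i in 0..1000usize {
        let slot = (i % 2) as u32;
        if is_raw_f64_slot(addr, slot) {
            sum += obj[slot as usize];
        }
    }
    println!("sum = {sum}");
}
```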

## Evidence (measured)

- **Profile (`sample`)**: the downgrade hot loop is a storm of `_tlv_get_addr` (TLS) calls.
- **method_calls**: removing the per-access `register_site` call (#5084) yielded only ~2%; an MRU cache over `layout_typed_raw_f64_slot_for_user` made it **2× SLOWER** (3300 → 7050 ms). Inlining the guard is blocked because the raw-f64 check is a TLS hashmap lookup, not an O(1) header field.
- **numeric_array_downgrade**: an `obj_type` pre-filter that skips the per-element typedarray/buffer/set/map registry lookups gave only ~5%; a "skip the in-bounds `RuntimeHandleScope` rooting" fast path made it **worse**. So the cost is `note_array_slot` → `layout_note_slot` (the TLS layout write), per element.

There is no cheap shortcut. The fix is to make per-object layout **O(1)-loadable from the object/GC header** so the layout check/update is an inline bit-test/bit-set instead of a thread-local hashmap op.

## Proposed fix — "layout is canonical" header bit + static class mask

`GcHeader` is 8 bytes with a free `_reserved: u16` and spare `gc_flags` bits. Add a bit meaning **"this object's slot layout still matches its declared/canonical shape" (no downgrade yet)**:

- **Class instances**: when set, the authoritative raw-f64 mask is the compile-time `perry_typed_shape_raw_f64_mask_<class>` global codegen already emits — a constant codegen can bit-test **inline** (no TLS). With the method already inlined (#5092), LLVM LICM hoists the loop-invariant check out of the hot loop.
- **Arrays**: when set, skip the per-write `layout_note_slot` for scalar-over-scalar in-bounds stores.
- **On downgrade** (a pointer/string written into a canonical slot): clear the bit and fall back to today's per-object `TYPED_LAYOUTS`/`LAYOUT_SLOT_MASKS` path (unchanged). The GC scanner consults the bit.

The TLS hashmap exists today only to track **downgrades**; the common (no-downgrade) case does not need it.

Optional complement: store a compact inline mask in `_reserved` for small objects/arrays (≤16 slots) so even small downgraded shapes avoid the hashmap.
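
A minimal sketch of the proposal, under stated assumptions: the struct below is a hypothetical 8-byte header (only `gc_flags` and `_reserved` are named in this issue; the other fields, both flag positions, and all function names are invented for illustration), and `POINT_RAW_F64_MASK` stands in for a `perry_typed_shape_raw_f64_mask_<class>` constant. It shows the canonical-bit guard as an inline bit-test, plus the optional inline mask for objects with at most 16 slots.

```rust
/// Hypothetical 8-byte header. Field order and every name other than
/// `gc_flags` and `_reserved` are assumptions for this sketch.
#[allow(dead_code)]
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct HeaderSketch {
    obj_type: u8,
    gc_flags: u8,
    _reserved: u16,
    size: u32,
}

const FLAG_LAYOUT_CANONICAL: u8 = 1 << 6; // hypothetical bit position
const FLAG_INLINE_MASK: u8 = 1 << 7; // hypothetical bit position

/// Stand-in for a compile-time per-class mask: slots 0 and 1 are raw f64,
/// slot 2 holds a NaN-boxed value.
const POINT_RAW_F64_MASK: u64 = 0b011;

/// Fast path: if the object is still canonical, the static class mask is
/// authoritative. Returns None when the caller must take the fallback path.
#[inline(always)]
fn slot_is_raw_f64_fast(h: &HeaderSketch, class_mask: u64, slot: u32) -> Option<bool> {
    if h.gc_flags & FLAG_LAYOUT_CANONICAL != 0 {
        Some((class_mask >> slot) & 1 == 1) // assumes slot < 64
    } else {
        None
    }
}

/// Optional complement: a downgraded small object keeps its mask in `_reserved`.
#[inline(always)]
fn slot_is_raw_f64_inline(h: &HeaderSketch, slot: u32) -> Option<bool> {
    if h.gc_flags & FLAG_INLINE_MASK != 0 && slot < 16 {
        Some((h._reserved >> slot) & 1 == 1)
    } else {
        None
    }
}

/// Downgrade a small object: publish the new mask, then clear the canonical bit.
fn downgrade(h: &mut HeaderSketch, new_mask: u16) {
    h._reserved = new_mask;
    h.gc_flags = (h.gc_flags & !FLAG_LAYOUT_CANONICAL) | FLAG_INLINE_MASK;
}

fn main() {
    let mut h = HeaderSketch { obj_type: 1, gc_flags: FLAG_LAYOUT_CANONICAL, _reserved: 0, size: 3 };
    println!("canonical, slot 0: {:?}", slot_is_raw_f64_fast(&h, POINT_RAW_F64_MASK, 0));
    downgrade(&mut h, 0b001); // slot 1 now holds a boxed value
    println!("after downgrade, fast path: {:?}", slot_is_raw_f64_fast(&h, POINT_RAW_F64_MASK, 1));
    println!("after downgrade, inline mask: {:?}", slot_is_raw_f64_inline(&h, 1));
}
```

Because the guard reads only the header and a constant, nothing in it depends on thread-local state, which is what would let an optimizer treat it as loop-invariant once the method body is inlined.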

## Correctness invariants (the crux — this is GC-internals, memory-corruption risk)

1. **GC scanner sees the truth.** Per-slot pointer/raw-f64 determination after the change must equal today's (`gc/trace.rs:758`, `gc/copying.rs:309,579` consult per-slot `layout_kind`). A wrong mask = trace a number as a pointer (crash) or miss a pointer (use-after-free).
2. **Representation.** A slot read as raw `double` must hold raw f64; the canonical bit must be cleared **before** the first non-number is observable in a canonical slot (publish-order discipline, like `descriptors_in_use`).
3. **Downgrade is monotonic + complete**; the fallback path stays byte-for-byte current behavior.
4. **GC moves transfer the bit + mask** (`gc/copying.rs:504` `layout_transfer`).
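
The invariants above can be exercised on a small single-threaded model. Everything here is hypothetical (`ObjSketch`, `SlotKind`, `side_mask` as a stand-in for the per-object fallback entry); it is a sketch of the intended discipline, not the runtime's code. In this model "before" is plain statement order; in the real runtime it would have to hold relative to whatever points the GC can observe the object at.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum SlotKind {
    RawF64,
    Boxed,
}

#[derive(Clone, Debug)]
struct ObjSketch {
    canonical: bool,        // models the proposed header bit
    class_mask: u64,        // models the static per-class raw-f64 mask
    side_mask: Option<u64>, // models the per-object fallback entry
    slots: Vec<u64>,
}

impl ObjSketch {
    fn new(class_mask: u64, n: usize) -> Self {
        ObjSketch { canonical: true, class_mask, side_mask: None, slots: vec![0f64.to_bits(); n] }
    }

    /// Invariant 1: what the scanner would conclude for `slot` (assumes slot < 64).
    fn scan_kind(&self, slot: usize) -> SlotKind {
        let mask = if self.canonical { self.class_mask } else { self.side_mask.unwrap_or(0) };
        if (mask >> slot) & 1 == 1 { SlotKind::RawF64 } else { SlotKind::Boxed }
    }

    /// Invariants 2 and 3: a boxed value landing in a raw-f64 slot downgrades
    /// the object first, and the canonical bit is never set again.
    fn store_boxed(&mut self, slot: usize, boxed_bits: u64) {
        let bit = 1u64 << slot;
        if self.canonical {
            if self.class_mask & bit != 0 {
                // Seed the fallback mask, then clear the bit, then store.
                self.side_mask = Some(self.class_mask & !bit);
                self.canonical = false;
            }
        } else if let Some(m) = self.side_mask.as_mut() {
            *m &= !bit;
        }
        self.slots[slot] = boxed_bits;
    }

    /// Invariant 4: a moving collection carries the bit and mask with the object.
    fn evacuate(&self) -> ObjSketch {
        self.clone()
    }
}

fn main() {
    let mut o = ObjSketch::new(0b011, 3);
    o.store_boxed(1, 0x7FFD_0000_0000_1234); // arbitrary placeholder bits
    let moved = o.evacuate();
    for slot in 0..3 {
        assert_eq!(moved.scan_kind(slot), o.scan_kind(slot));
        println!("slot {slot}: {:?}", moved.scan_kind(slot));
    }
}
```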

## Phasing (each independently shippable + verifiable)

1. Spike + microbench harness; prototype the bit read-only and confirm it tracks downgrade under GC stress.
2. **Arrays first** (lowest blast radius): wire the bit for arrays; skip `layout_note_slot` + write barrier for scalar-over-scalar in-bounds writes; scanner honors the bit.
3. **Class fields**: emit the inline guard in codegen (`expr/property_get.rs:1551`, `property_set.rs`) using the header bit + static class mask, by-name fallback for the cleared-bit case.
4. **Object property** path if the mechanism generalizes.
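
As an illustration of what phase 2 is aiming for, the following sketch models an array whose in-bounds numeric stores skip layout bookkeeping while the canonical bit is set. `ArraySketch`, `note_slot`, and the `slow_notes` counter are hypothetical; `note_slot` only stands in for the `layout_note_slot` call, and growth, holes, and write barriers are deliberately left out.

```rust
struct ArraySketch {
    canonical: bool,      // models the proposed header bit (all slots raw f64)
    data: Vec<u64>,
    raw_flags: Vec<bool>, // fallback per-slot layout, populated on downgrade
    slow_notes: usize,    // how many times the slow bookkeeping path ran
}

impl ArraySketch {
    fn new(len: usize) -> Self {
        ArraySketch { canonical: true, data: vec![0f64.to_bits(); len], raw_flags: Vec::new(), slow_notes: 0 }
    }

    /// Stand-in for the per-write layout update on the fallback path.
    fn note_slot(&mut self, i: usize, raw: bool) {
        self.raw_flags[i] = raw;
        self.slow_notes += 1;
    }

    /// In-bounds numeric store: no bookkeeping while the array is canonical.
    fn store_number(&mut self, i: usize, v: f64) {
        if !self.canonical {
            self.note_slot(i, true);
        }
        self.data[i] = v.to_bits();
    }

    /// In-bounds boxed store: downgrade once, then behave like the slow path.
    fn store_boxed(&mut self, i: usize, bits: u64) {
        if self.canonical {
            // Materialize the fallback layout before the boxed value is visible.
            self.raw_flags = vec![true; self.data.len()];
            self.canonical = false;
        }
        self.note_slot(i, false);
        self.data[i] = bits;
    }

    fn is_raw(&self, i: usize) -> bool {
        self.canonical || self.raw_flags[i]
    }
}

fn main() {
    let mut a = ArraySketch::new(8);
    for i in 0..1000usize {
        a.store_number(i % 8, i as f64);
    }
    println!("notes while canonical: {}", a.slow_notes);
    a.store_boxed(3, 0xdead_beef); // arbitrary placeholder bits
    a.store_number(0, 1.0);
    println!("notes after downgrade: {}, slot 3 raw: {}", a.slow_notes, a.is_raw(3));
}
```

A microbench harness from phase 1 could compare the two regimes directly, since the only difference between them in this model is whether `note_slot` runs per write.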

## Verification

- Full local parity (`./run_parity_tests.sh`): zero NEW regressions vs base (compare combined stdout+stderr per file). Also run `cargo test --release --workspace`.
- **GC correctness**: run under `PERRY_GC_VERIFY_EVACUATION=1`, `PERRY_GC_FORCE_EVACUATE=1`, `PERRY_GC_DIAG=1`, and `PERRY_GEN_GC=0` (full mark-sweep) — these panic on the corruption modes a wrong mask causes.
- Targeted tests: (a) write a non-number into a `number`-typed class field via an `any` alias, then read it back; (b) a numeric array that receives an object slot then is GC-evacuated mid-loop; (c) holey/sparse downgraded arrays.
- Per-benchmark perf regression gate.

**Risk: HIGH (memory-corruption class). Effort: L (GC-internals).** Maintainer-driven; not for autonomous execution.

## Prior groundwork on this line

- #5084 — typed-feedback `register_site` opt-in (object_property 3.6×).
- #5092 — inline small `this.field` methods (method_calls 1.6×; prerequisite — the guard must be inlined into the loop for the phase-3 hoist to pay off).

## Files

`gc/types.rs` (GcHeader), `gc/layout.rs` (`TYPED_LAYOUTS`/`LAYOUT_SLOT_MASKS`, `TypedLayoutDescriptor`, `layout_note_slot`, `layout_typed_raw_f64_slot_for_user`), `gc/trace.rs` + `gc/copying.rs` (scanner + `layout_transfer`), `array/indexing.rs` + `array/header.rs` (`note_array_slot`), `typed_feedback/guards.rs` (`class_field_fast_contract`), codegen `expr/property_get.rs:1551`, `expr/property_set.rs`, `typed_shape.rs`.

