
perf(method dispatch): method_calls ~290× Node — remaining cost is per-field-access shape-guard calls (plan + standby) #5093

@TheHypnoo

Summary

benchmarks/suite/09_method_calls.ts (10M calls to a trivial monomorphic counter.increment(), where increment() is this.value = this.value + 1) runs in ~3300 ms versus Node's ~11 ms, which is still ~290× slower after the two improvements that have already landed or are open. The remaining cost is per-field-access shape-guard calls, and closing it cleanly requires a GC typed-shape-layout change (high risk). This issue records the full analysis so the work can resume from a known state. It is on standby for now.
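To put the headline numbers on a per-call basis, the arithmetic can be sketched as below. The helper names are made up for illustration; the inputs are the rounded figures quoted above, so the ratio comes out near 300×, in line with the ~290× in the title.

```rust
/// Average cost of one loop iteration, in nanoseconds.
fn ns_per_iteration(total_ms: f64, iterations: u64) -> f64 {
    // 1 ms = 1_000_000 ns
    total_ms * 1_000_000.0 / iterations as f64
}

/// How many times slower `ours_ms` is than `baseline_ms`.
fn slowdown(ours_ms: f64, baseline_ms: f64) -> f64 {
    ours_ms / baseline_ms
}
```

With 3300 ms over 10M iterations this gives about 330 ns per increment() call, against roughly 1.1 ns for Node.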

What already landed

Root cause of the remaining ~290×

Per counter.increment() iteration, after #5092 the body is inlined but each this.value read/write still goes through a typed-feedback class-field shape guard — a non-inlined cross-crate call:

  • read → js_typed_feedback_class_field_get_guard(...) (crates/perry-runtime/src/typed_feedback/guards.rs)
  • write → js_typed_feedback_class_field_set_guard(...)

When typed feedback is disabled (the default), each guard reduces to class_field_fast_contract (guards.rs:284): it validates class_id + keys_array + field_count, and for a number-typed (raw-f64) field additionally calls layout_typed_raw_f64_slot_for_user (crates/perry-runtime/src/gc/layout.rs:743), a thread-local PtrHashMap (TYPED_LAYOUTS) lookup. So per iteration: 2 non-inlined guard calls (+ the slot load/store the fast path already emits). ×10M ≈ the 290× gap.
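A minimal sketch of what such a contract does, as described above. This is not the code in guards.rs: the Header struct, the LAYOUTS map keyed by object address, the per-object bitmask, and the way field_count is checked are all assumptions made for illustration.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Hypothetical stand-in for the header fields the contract inspects.
struct Header {
    class_id: u32,
    keys_array: usize, // address of the keys array, compared by identity
    field_count: u32,
}

thread_local! {
    // Stand-in for TYPED_LAYOUTS: object address -> bitmask of raw-f64 slots.
    static LAYOUTS: RefCell<HashMap<usize, u64>> = RefCell::new(HashMap::new());
}

fn set_raw_f64_mask(obj_addr: usize, mask: u64) {
    LAYOUTS.with(|m| {
        m.borrow_mut().insert(obj_addr, mask);
    });
}

// Non-inlined on purpose, to mirror the cross-crate call the issue describes.
#[inline(never)]
fn class_field_contract(
    obj_addr: usize,
    hdr: &Header,
    expected_class_id: u32,
    expected_keys: usize,
    slot: u32,
    require_raw_f64: bool,
) -> bool {
    // Cheap part: three header comparisons.
    if hdr.class_id != expected_class_id
        || hdr.keys_array != expected_keys
        || slot >= hdr.field_count
    {
        return false;
    }
    if !require_raw_f64 {
        return true;
    }
    // Expensive part: thread-local hash lookup for the raw-f64 layout.
    LAYOUTS.with(|m| {
        m.borrow()
            .get(&obj_addr)
            .map_or(false, |mask| slot < 64 && (*mask >> slot) & 1 == 1)
    })
}
```

In this model, a number-typed field pays for the header comparisons and the thread-local lookup on every read and every write.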

Codegen emits this at crates/perry-codegen/src/expr/property_get.rs:1551 (class-field GET, known slot index) and the property_set.rs counterpart — both wrap a direct getelementptr+load/store (object header is 24 bytes; ObjectHeader #[repr(C)] at crates/perry-runtime/src/object/mod.rs:2293) behind the guard call.
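The direct slot access behind the guard amounts to a fixed-offset load or store. A rough model over a byte buffer follows; the 24-byte header comes from the text above, while the 8-byte slot width and the helper names are assumptions.

```rust
const HEADER_BYTES: usize = 24; // object header size stated in this issue
const SLOT_BYTES: usize = 8; // assumption: one 8-byte slot per field

/// Byte offset of a field slot from the start of the object.
fn field_offset(slot_index: usize) -> usize {
    HEADER_BYTES + slot_index * SLOT_BYTES
}

/// Models the getelementptr + load of a raw-f64 field.
fn load_raw_f64(object: &[u8], slot_index: usize) -> f64 {
    let off = field_offset(slot_index);
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&object[off..off + SLOT_BYTES]);
    f64::from_ne_bytes(bytes)
}

/// Models the getelementptr + store of a raw-f64 field.
fn store_raw_f64(object: &mut [u8], slot_index: usize, value: f64) {
    let off = field_offset(slot_index);
    object[off..off + SLOT_BYTES].copy_from_slice(&value.to_ne_bytes());
}
```

One increment() is then a load, an add, and a store at the same offset; everything else in the current hot path is guard overhead.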

Why the cheap/safe shortcuts do NOT work (measured)

  • Removing register_site (#5084, "perf(codegen): make typed-feedback site registration opt-in (3.6x on dynamic property access)"): ~2% gain on method_calls. register_site was not the cost here.
  • MRU-caching the raw-f64 layout lookup (1-entry thread-local cache in layout_typed_raw_f64_slot_for_user): measured ~3300 ms → ~7050 ms (2× SLOWER). The PtrHashMap is already fast; adding a second thread-local access costs more TLS overhead than it saves. The dominant cost is the guard CALL itself, not what's inside it.

Conclusion: there is no intermediate win. Either the guard call stays, or the guard is inlined.

The plan (A2) — inline the class-field shape guard

Replace the js_typed_feedback_class_field_{get,set}_guard call with inline LLVM at the emission sites (property_get.rs:1551, property_set.rs):

  1. Inline the cheap part of the contract: load class_id (i32 @ obj+4) and keys_array (ptr @ obj+16), compare to the expected compile-time constants; plus a plain (hoistable) load of the process-global descriptors flag. Keep the by-name fallback for the guard-fail edge.
  2. Once both the method (already done in #5092, "perf(transform): inline small this-using methods on exact receivers (1.6x on method_calls)") and the guard are inlined into the caller loop, all guard operands are loop-invariant, so LLVM's LICM can hoist the shape check out of the 10M-iteration loop, collapsing the body to a tight load/fadd/store.
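The two steps above can be sketched in Rust, with the hoisting that LICM would perform written out by hand. The struct layout follows the offsets quoted in step 1 on a 64-bit target; the placeholder fields, function names, and the stand-in slow path are hypothetical. The sketch deliberately leaves out the raw-f64 layout check discussed in the next section.

```rust
// Offsets assumed from this issue: class_id at +4, keys_array at +16,
// 24-byte header on a 64-bit target. The _unknown fields are placeholders.
#[repr(C)]
struct ObjectHeader {
    _unknown0: u32,
    class_id: u32,
    _unknown1: u64,
    keys_array: usize,
}

#[repr(C)]
struct Counter {
    header: ObjectHeader,
    value: f64, // slot 0
}

// The cheap, inlinable part of the contract.
#[inline(always)]
fn shape_ok(h: &ObjectHeader, class_id: u32, keys: usize, descriptors_in_use: bool) -> bool {
    !descriptors_in_use && h.class_id == class_id && h.keys_array == keys
}

// Stand-in for the by-name fallback taken on the guard-fail edge.
#[inline(never)]
fn increment_slow(c: &mut Counter) {
    c.value += 1.0;
}

// Step 1: the guard is inline but still evaluated on every iteration.
fn run_guard_per_iteration(c: &mut Counter, n: u64, class_id: u32, keys: usize, desc: bool) {
    for _ in 0..n {
        if shape_ok(&c.header, class_id, keys, desc) {
            c.value += 1.0;
        } else {
            increment_slow(c);
        }
    }
}

// Step 2: the loop body never changes the guard operands, so the check
// can run once and the loop collapses to load/fadd/store.
fn run_guard_hoisted(c: &mut Counter, n: u64, class_id: u32, keys: usize, desc: bool) {
    if shape_ok(&c.header, class_id, keys, desc) {
        for _ in 0..n {
            c.value += 1.0;
        }
    } else {
        for _ in 0..n {
            increment_slow(c);
        }
    }
}
```

The hoisted form is only valid because nothing inside the loop can change class_id, keys_array, or the descriptors flag. A call that stays opaque to the optimizer would block the hoist.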

Correctness trap — require_raw_f64

Counter.value: number is a raw-f64 candidate, so the guard passes require_raw_f64 = 1 and the contract additionally calls layout_typed_raw_f64_slot_for_user, which is a thread-local PtrHashMap lookup, not an O(1) header field, so it is NOT cheaply inline-able. A correct inline guard for raw-f64 fields requires the per-object raw-f64 slot mask to live somewhere O(1)-loadable (object header or GC header). Getting this wrong means reading a NaN-boxed value as a raw double: silent memory/value corruption. This is the crux and the reason this is high-risk:

  • The per-object raw-f64 layout can downgrade (a non-number written to a number-typed field via an any alias makes the slot non-raw); the guard's layout check is what catches that. An inline guard must preserve this, or skip the raw-f64 fast path for fields that can downgrade.
  • descriptors_in_use() (accessor descriptors) must also gate the fast path.
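The downgrade hazard can be illustrated with a toy NaN-boxing scheme. The tag and payload constants below are invented for the example and are not Perry's actual encoding; the point is only what happens when a boxed slot is read as a raw double.

```rust
// Hypothetical NaN-boxing layout: a quiet-NaN tag plus a 48-bit payload.
const BOX_TAG: u64 = 0x7FFC_0000_0000_0000;
const PAYLOAD_MASK: u64 = 0x0000_FFFF_FFFF_FFFF;

#[derive(Debug, PartialEq)]
enum Value {
    Number(f64),
    Boxed(u64),
}

/// Encodes a non-number payload into the slot's 64 bits.
fn nan_box(payload: u64) -> u64 {
    BOX_TAG | (payload & PAYLOAD_MASK)
}

/// What an inline fast path does if it skips the layout check.
fn read_unchecked_raw(bits: u64) -> f64 {
    f64::from_bits(bits)
}

/// What a read does when it consults the per-slot layout first.
fn read_checked(bits: u64, slot_is_raw_f64: bool) -> Value {
    if slot_is_raw_f64 {
        return Value::Number(f64::from_bits(bits));
    }
    if bits & BOX_TAG == BOX_TAG {
        Value::Boxed(bits & PAYLOAD_MASK)
    } else {
        Value::Number(f64::from_bits(bits))
    }
}
```

In this model, once a non-number has been written through an any alias, the unchecked read yields NaN and the increment silently keeps producing NaN, while the checked read recovers the boxed payload.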

Suggested approach

  • Move the per-object raw-f64 slot mask into an O(1)-loadable location (GC header / object header), updated wherever TYPED_LAYOUTS is mutated today (~20 sites in gc/layout.rs), so the inline guard can bit-test it cheaply.
  • Validate against the full local parity suite + cargo test workspace; add a targeted test that stores a non-number into a number-typed field via an any alias and reads it back (the downgrade case).
  • Target: method_calls ≤ ~50 ms (roughly 4-5× Node's ~11 ms).
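A minimal sketch of the header-resident mask, assuming a 64-bit field can be found for it. GcHeader, raw_f64_mask, and both helpers are hypothetical names; fields beyond slot 63 are conservatively treated as non-raw here.

```rust
// Hypothetical header carrying the per-object raw-f64 slot mask.
struct GcHeader {
    raw_f64_mask: u64, // bit i set => slot i currently holds a raw f64
}

/// The O(1) test an inline guard could emit: one load, one shift, one and.
#[inline(always)]
fn slot_is_raw_f64(h: &GcHeader, slot: u32) -> bool {
    slot < 64 && (h.raw_f64_mask >> slot) & 1 == 1
}

/// Must run wherever the layout is downgraded today, for example when a
/// non-number is stored into a number-typed field through an any alias.
fn downgrade_slot(h: &mut GcHeader, slot: u32) {
    if slot < 64 {
        h.raw_f64_mask &= !(1u64 << slot);
    }
}
```

The bit-test replaces the thread-local map lookup on the hot path, but only stays correct if every site that mutates the layout also updates the mask.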

Risk: HIGH (GC typed-shape layout change; memory-corruption class). Effort: L. Recommend a maintainer-driven change, not autonomous.

Files

  • Benchmark: benchmarks/suite/09_method_calls.ts
  • Guard emission: crates/perry-codegen/src/expr/property_get.rs:1551, crates/perry-codegen/src/expr/property_set.rs
  • Method dispatch: crates/perry-codegen/src/lower_call/property_get.rs:1314
  • Runtime guard + contract: crates/perry-runtime/src/typed_feedback/guards.rs:284,318
  • raw-f64 layout (the thread-local hashmap): crates/perry-runtime/src/gc/layout.rs:743
  • ObjectHeader layout: crates/perry-runtime/src/object/mod.rs:2293

This continues the method_calls line of work from #5084 and #5092.
