purego: add various performance optimizations #400

Draft
tmc wants to merge 5 commits into ebitengine:main from tmc:perf-internal

tmc commented Jan 28, 2026

What issue is this addressing?

#399

What type of issue is this addressing?

performance

What this PR does | solves

Summary

This PR reduces per-call overhead in RegisterFunc and callback dispatch through five targeted optimizations. All changes are internal: there are no public API changes and no behavioral changes. The largest wins come from eliminating repeated reflect.StructOf calls and heap allocations in the Darwin ARM64 stack-packing path.

  • Cache reflect.StructOf results and pool struct instances in bundleStackArgs to avoid recreating packed struct types on every call
  • Stack-allocate the []reflect.Value args slice in callbackWrap instead of heap-allocating via make, eliminating the single largest allocator identified in profiling (38.5% of total allocs)
  • Remove defer from the RegisterFunc hot path, replacing deferred runtime.KeepAlive and thePool.Put with explicit calls after the syscall returns
  • Pre-compute a per-function argKind enum slice at registration time so the per-call dispatch switches on a uint8 instead of calling reflect.Kind() repeatedly
  • Pre-compute Darwin ARM64 bundle info (struct cache key, field indices, sync.Pool) at registration time to eliminate per-call cache key construction and sync.Map lookups

Benchmark results

Apple M4 Max, darwin/arm64, benchstat over 6 runs at 100ms each:

                                               │  baseline  │             after              │
                                               │   sec/op   │   sec/op    vs base            │
RegisterFunc/CFunc/1args-16                       159.5n ±1%   154.6n ±2%   -3.10% (p=0.002)
RegisterFunc/CFunc/5args-16                       233.8n ±2%   222.5n ±2%   -4.85% (p=0.002)
RegisterFunc/CFunc/10args-16                      802.9n ±1%   451.6n ±2%  -43.76% (p=0.002)
RegisterFunc/CFunc/15args-16                     1845.0n ±3%   564.4n ±2%  -69.41% (p=0.002)
RegisterFunc/Callback/5args-16                    483.0n ±1%   455.0n ±1%   -5.81% (p=0.002)
RegisterFunc/Callback/10args-16                   714.9n ±1%   652.0n ±1%   -8.79% (p=0.002)
RegisterFunc/Callback/15args-16                   912.8n ±2%   829.6n ±1%   -9.12% (p=0.002)
SyscallN/Callback/5args-16                        298.8n ±1%   283.8n ±1%   -5.02% (p=0.002)
SyscallN/Callback/10args-16                       423.1n ±1%   386.8n ±1%   -8.57% (p=0.002)
SyscallN/Callback/15args-16                       539.0n ±2%   497.6n ±1%   -7.66% (p=0.002)
geomean                                           264.2n       233.0n      -11.83%

                                               │  baseline   │             after              │
                                               │    B/op     │   B/op     vs base             │
RegisterFunc/CFunc/10args-16                      896.0 ±0%    464.0 ±0%  -48.21% (p=0.002)
RegisterFunc/CFunc/15args-16                     3098.0 ±0%    648.0 ±0%  -79.08% (p=0.002)
RegisterFunc/Callback/5args-16                    336.0 ±0%    208.0 ±0%  -38.10% (p=0.002)
RegisterFunc/Callback/10args-16                   600.0 ±0%    360.0 ±0%  -40.00% (p=0.002)
RegisterFunc/Callback/15args-16                   928.0 ±0%    544.0 ±0%  -41.38% (p=0.002)

                                               │  baseline  │            after              │
                                               │ allocs/op  │ allocs/op  vs base            │
RegisterFunc/CFunc/10args-16                      23.00 ±0%   16.00 ±0%  -30.43% (p=0.002)
RegisterFunc/CFunc/15args-16                      44.00 ±0%   21.00 ±0%  -52.27% (p=0.002)
RegisterFunc/Callback/5args-16                    10.00 ±0%    9.00 ±0%  -10.00% (p=0.002)
RegisterFunc/Callback/10args-16                   15.00 ±0%   14.00 ±0%   -6.67% (p=0.002)
RegisterFunc/Callback/15args-16                   20.00 ±0%   19.00 ±0%   -5.00% (p=0.002)

Low-arg-count paths (1 and 5 args for CFunc) improve only modestly because they never reach the struct-packing path. The largest wins are on calls with 10 or more arguments, where Darwin ARM64 stack bundling dominates.

Test plan

  • go test ./... passes on darwin/arm64
  • Verify no regressions on amd64 (struct_amd64.go signature change is a no-op stub)
  • Review struct_arm64.go caching logic for correctness with concurrent callers (uses sync.Map + sync.Pool)
  • Confirm callback paths are unaffected (bundling is guarded by !isCallback)

tmc added 5 commits January 28, 2026 15:11
Cache the reflect.StructOf result and pool reflect.New instances for
Darwin ARM64 stack argument bundling. This avoids recreating the packed
struct type and allocating a new instance on every call when arguments
spill to the stack.

CFunc/10args: ~810 → ~560 ns/op (-31%), 23 → 21 allocs
CFunc/15args: ~1840 → ~810 ns/op (-56%), 44 → 30 allocs
Callbacks unchanged (bundling path is guarded by !isCallback).

Replace make([]reflect.Value, fnType.NumIn()) with a stack-allocated
fixed-size array, eliminating a heap allocation on every callback
invocation. The profiling report identified this as the single largest
allocator at 38.5% of total allocations (5.87 GB).

Replace defer-based runtime.KeepAlive and thePool.Put with explicit
calls after the syscall returns. This eliminates deferred closure
overhead on every RegisterFunc invocation.
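
The before and after shapes of this change can be sketched as follows. This is an illustration rather than the PR's code: `callArgs`, `argPool`, and `fakeSyscall` are hypothetical stand-ins for the real argument struct, pool, and foreign call.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// callArgs is a hypothetical stand-in for a pooled argument block.
type callArgs struct {
	a [4]uintptr
}

var argPool = sync.Pool{New: func() any { return new(callArgs) }}

// fakeSyscall stands in for the foreign call (an assumption of this sketch).
func fakeSyscall(args *callArgs) uintptr {
	return args.a[0] + args.a[1]
}

// callDeferred shows the original shape: cleanup registered with defer.
func callDeferred(x, y uintptr, keep *byte) uintptr {
	args := argPool.Get().(*callArgs)
	defer argPool.Put(args)
	defer runtime.KeepAlive(keep)
	args.a[0], args.a[1] = x, y
	return fakeSyscall(args)
}

// callExplicit shows the replacement shape: the same cleanup runs as
// plain statements once the call has returned.
func callExplicit(x, y uintptr, keep *byte) uintptr {
	args := argPool.Get().(*callArgs)
	args.a[0], args.a[1] = x, y
	ret := fakeSyscall(args)
	runtime.KeepAlive(keep) // keep stays reachable until this point
	argPool.Put(args)
	return ret
}

func main() {
	var b byte
	fmt.Println(callDeferred(2, 3, &b), callExplicit(2, 3, &b))
}
```

One trade-off to keep in mind: with explicit calls, the cleanup is skipped if the call panics, so the pooled value is simply not returned in that case.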

Build a per-function argKind slice during RegisterFunc instead of
type-switching via reflect.Kind on every call. This replaces the
generic addValue dispatch with a direct switch on pre-computed enum
values, reducing per-call reflect overhead.

…uffer

Pre-compute struct cache key and bundle info at RegisterFunc registration
time instead of rebuilding them on every call. This eliminates per-call
buildStructCacheKey allocations (521 MB in profiling) and sync.Map lookups
for Darwin ARM64 stack argument packing.

Also stack-allocate the collectStackArgs buffer by accepting a caller-
provided []reflect.Value slice, eliminating the per-call heap allocation
for stack argument collection (1.09 GB in profiling).

String arguments that spill to stack are mapped to *byte type in the
pre-computed info since collectStackArgs converts them via CString before
bundling. Variadic and struct-containing signatures fall back to runtime
computation.

RegisterFunc/CFunc/10args: -19% latency, -19% memory
RegisterFunc/CFunc/15args: -30% latency, -45% memory, -9 allocs