purego: add various performance optimizations by tmc · Pull Request #400 · ebitengine/purego

tmc · 2026-01-28T23:13:41Z

What issue is this addressing?

What type of issue is this addressing?

performance

What this PR does | solves

Summary

This reduces per-call overhead in RegisterFunc and callback dispatch through five targeted optimizations. All changes are internal: there are no public API changes and no behavioral changes. The heaviest wins come from eliminating repeated reflect.StructOf calls and heap allocations in the Darwin ARM64 stack-packing path.

Cache reflect.StructOf results and pool struct instances in bundleStackArgs to avoid recreating packed struct types on every call
Stack-allocate the []reflect.Value args slice in callbackWrap instead of heap-allocating via make, eliminating the single largest allocator identified in profiling (38.5% of total allocs)
Remove defer from the RegisterFunc hot path, replacing deferred runtime.KeepAlive and thePool.Put with explicit calls after the syscall returns
Pre-compute a per-function argKind enum slice at registration time so the per-call dispatch switches on a uint8 instead of calling reflect.Kind() repeatedly
Pre-compute Darwin ARM64 bundle info (struct cache key, field indices, sync.Pool) at registration time to eliminate per-call cache key construction and sync.Map lookups

Benchmark results

Apple M4 Max, darwin/arm64, benchstat over 6 runs at 100ms each:

                                               │  baseline  │             after              │
                                               │   sec/op   │   sec/op    vs base            │
RegisterFunc/CFunc/1args-16                       159.5n ±1%   154.6n ±2%   -3.10% (p=0.002)
RegisterFunc/CFunc/5args-16                       233.8n ±2%   222.5n ±2%   -4.85% (p=0.002)
RegisterFunc/CFunc/10args-16                      802.9n ±1%   451.6n ±2%  -43.76% (p=0.002)
RegisterFunc/CFunc/15args-16                     1845.0n ±3%   564.4n ±2%  -69.41% (p=0.002)
RegisterFunc/Callback/5args-16                    483.0n ±1%   455.0n ±1%   -5.81% (p=0.002)
RegisterFunc/Callback/10args-16                   714.9n ±1%   652.0n ±1%   -8.79% (p=0.002)
RegisterFunc/Callback/15args-16                   912.8n ±2%   829.6n ±1%   -9.12% (p=0.002)
SyscallN/Callback/5args-16                        298.8n ±1%   283.8n ±1%   -5.02% (p=0.002)
SyscallN/Callback/10args-16                       423.1n ±1%   386.8n ±1%   -8.57% (p=0.002)
SyscallN/Callback/15args-16                       539.0n ±2%   497.6n ±1%   -7.66% (p=0.002)
geomean                                           264.2n       233.0n      -11.83%

                                               │  baseline   │             after              │
                                               │    B/op     │   B/op     vs base             │
RegisterFunc/CFunc/10args-16                      896.0 ±0%    464.0 ±0%  -48.21% (p=0.002)
RegisterFunc/CFunc/15args-16                     3098.0 ±0%    648.0 ±0%  -79.08% (p=0.002)
RegisterFunc/Callback/5args-16                    336.0 ±0%    208.0 ±0%  -38.10% (p=0.002)
RegisterFunc/Callback/10args-16                   600.0 ±0%    360.0 ±0%  -40.00% (p=0.002)
RegisterFunc/Callback/15args-16                   928.0 ±0%    544.0 ±0%  -41.38% (p=0.002)

                                               │  baseline  │            after              │
                                               │ allocs/op  │ allocs/op  vs base            │
RegisterFunc/CFunc/10args-16                      23.00 ±0%   16.00 ±0%  -30.43% (p=0.002)
RegisterFunc/CFunc/15args-16                      44.00 ±0%   21.00 ±0%  -52.27% (p=0.002)
RegisterFunc/Callback/5args-16                    10.00 ±0%    9.00 ±0%  -10.00% (p=0.002)
RegisterFunc/Callback/10args-16                   15.00 ±0%   14.00 ±0%   -6.67% (p=0.002)
RegisterFunc/Callback/15args-16                   20.00 ±0%   19.00 ±0%   -5.00% (p=0.002)

Low-arg-count paths (1 and 5 args for CFunc) are only modestly faster because they never reach the struct-packing path. The largest gains are on calls with 10 or more arguments, where Darwin ARM64 stack bundling dominates.

Test plan

go test ./... passes on darwin/arm64
Verify no regressions on amd64 (struct_amd64.go signature change is a no-op stub)
Review struct_arm64.go caching logic for correctness with concurrent callers (uses sync.Map + sync.Pool)
Confirm callback paths are unaffected (bundling is guarded by !isCallback)

Cache the reflect.StructOf result and pool reflect.New instances for Darwin ARM64 stack argument bundling. This avoids recreating the packed struct type and allocating a new instance on every call when arguments spill to the stack.

CFunc/10args: ~810 → ~560 ns/op (-31%), 23 → 21 allocs
CFunc/15args: ~1840 → ~810 ns/op (-56%), 44 → 30 allocs

Callbacks unchanged (bundling path is guarded by !isCallback).
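The caching and pooling idea described above can be sketched as follows. This is a minimal illustration, not the code in struct_arm64.go: the names structEntry, structCache, cacheKey, and lookupStruct are hypothetical, and the string key built from Type.String() is a simplification that a real implementation would need to make collision-safe.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
	"sync"
)

// structEntry (hypothetical) pairs a cached packed-struct type with a
// pool of reusable pointers to instances of that type.
type structEntry struct {
	typ  reflect.Type
	pool sync.Pool
}

// structCache maps a signature key to its *structEntry.
var structCache sync.Map

// cacheKey (hypothetical) derives a key from the field types.
func cacheKey(types []reflect.Type) string {
	var sb strings.Builder
	for _, t := range types {
		sb.WriteString(t.String())
		sb.WriteByte(';')
	}
	return sb.String()
}

// lookupStruct returns the cached entry for the given field types and
// calls reflect.StructOf only on a cache miss.
func lookupStruct(types []reflect.Type) *structEntry {
	key := cacheKey(types)
	if e, ok := structCache.Load(key); ok {
		return e.(*structEntry)
	}
	fields := make([]reflect.StructField, len(types))
	for i, t := range types {
		fields[i] = reflect.StructField{Name: fmt.Sprintf("X%d", i), Type: t}
	}
	st := reflect.StructOf(fields)
	e := &structEntry{typ: st}
	// Store pointers in the pool so Get and Put do not box a reflect.Value.
	e.pool.New = func() any { return reflect.New(st).Interface() }
	actual, _ := structCache.LoadOrStore(key, e)
	return actual.(*structEntry)
}

// demoStructCache borrows an instance, writes a field, zeroes the
// instance, returns it to the pool, and checks that a second lookup
// yields the same cached entry.
func demoStructCache() (int64, bool) {
	types := []reflect.Type{reflect.TypeOf(int32(0)), reflect.TypeOf(int64(0))}
	e := lookupStruct(types)
	p := e.pool.Get()
	v := reflect.ValueOf(p).Elem()
	v.Field(0).SetInt(7)
	got := v.Field(0).Int()
	v.Set(reflect.Zero(e.typ)) // clear before reuse
	e.pool.Put(p)
	return got, lookupStruct(types) == e
}

func main() {
	fmt.Println(demoStructCache())
}
```

Zeroing the instance before returning it to the pool keeps stale argument values from leaking into a later call that reuses it.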

Replace make([]reflect.Value, fnType.NumIn()) with a stack-allocated fixed-size array, eliminating a heap allocation on every callback invocation. The profiling report identified this as the single largest allocator at 38.5% of total allocations (5.87 GB).
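A minimal sketch of the fixed-size array pattern, assuming a hypothetical upper bound maxArgs and a helper callInts that only handles functions whose parameters are all int. Whether the backing array actually stays on the stack is decided by the compiler's escape analysis, which can be inspected with `go build -gcflags=-m`; this sketch only shows the shape of the change.

```go
package main

import (
	"fmt"
	"reflect"
)

// maxArgs is a hypothetical upper bound on the number of arguments.
const maxArgs = 16

// callInts calls fn, which is assumed to take only int parameters,
// using a slice backed by a fixed-size local array instead of make.
func callInts(fn reflect.Value, raw []uintptr) []reflect.Value {
	n := fn.Type().NumIn()
	if n > maxArgs || n > len(raw) {
		panic("callInts: unsupported argument count")
	}
	var buf [maxArgs]reflect.Value // replaces make([]reflect.Value, n)
	args := buf[:n]
	for i := range args {
		args[i] = reflect.ValueOf(int(raw[i]))
	}
	return fn.Call(args)
}

func sum3(a, b, c int) int { return a + b + c }

// demoFixedArray runs the helper against a three-argument function.
func demoFixedArray() int64 {
	out := callInts(reflect.ValueOf(sum3), []uintptr{1, 2, 3})
	return out[0].Int()
}

func main() {
	fmt.Println(demoFixedArray())
}
```

The fixed bound trades generality for a predictable frame size, so a signature with more arguments than the bound has to be rejected or handled by a slower fallback.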

Replace defer-based runtime.KeepAlive and thePool.Put with explicit calls after the syscall returns. This eliminates deferred closure overhead on every RegisterFunc invocation.
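The before/after shape of this change might look like the sketch below. The pooled type callArgs, the pool argsPool, and fakeSyscall are stand-ins invented for illustration; they are not purego's thePool or its syscall entry point.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// callArgs is a hypothetical pooled argument block.
type callArgs struct {
	ints [8]uintptr
	n    int
}

var argsPool = sync.Pool{New: func() any { return new(callArgs) }}

// fakeSyscall stands in for the real foreign call; it sums the arguments.
func fakeSyscall(a *callArgs) uintptr {
	var sum uintptr
	for _, v := range a.ints[:a.n] {
		sum += v
	}
	return sum
}

// callDeferred shows the original shape: cleanup runs through defer.
func callDeferred(keep *byte, vals ...uintptr) uintptr {
	a := argsPool.Get().(*callArgs)
	defer argsPool.Put(a)
	defer runtime.KeepAlive(keep)
	a.n = copy(a.ints[:], vals)
	return fakeSyscall(a)
}

// callExplicit shows the optimized shape: the same cleanup is written
// out after the call returns, so no defer record is needed.
func callExplicit(keep *byte, vals ...uintptr) uintptr {
	a := argsPool.Get().(*callArgs)
	a.n = copy(a.ints[:], vals)
	ret := fakeSyscall(a)
	runtime.KeepAlive(keep)
	argsPool.Put(a)
	return ret
}

func main() {
	fmt.Println(callDeferred(nil, 1, 2, 3), callExplicit(nil, 1, 2, 3))
}
```

One trade-off to keep in mind: if the code between Get and Put panics, the explicit version skips the cleanup. For a sync.Pool that only means the object is left to the garbage collector rather than reused.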

Build a per-function argKind slice during RegisterFunc instead of type-switching via reflect.Kind on every call. This replaces the generic addValue dispatch with a direct switch on pre-computed enum values, reducing per-call reflect overhead.
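As a rough illustration of the pre-computed enum, the sketch below classifies each parameter once and then dispatches on a uint8 per call. The constants, classifyArgs, and packArgs are hypothetical; the sketch also assumes a 64-bit platform and places every argument in one uintptr slice, whereas a real calling convention routes floats to separate registers.

```go
package main

import (
	"fmt"
	"math"
	"reflect"
)

// argKind is a compact per-argument tag computed once at registration.
type argKind uint8

const (
	argInt argKind = iota
	argUint
	argFloat64
	argBool
)

// classifyArgs inspects the function type once and records a tag per parameter.
func classifyArgs(fnType reflect.Type) []argKind {
	kinds := make([]argKind, fnType.NumIn())
	for i := range kinds {
		switch fnType.In(i).Kind() {
		case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
			kinds[i] = argInt
		case reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64, reflect.Uintptr:
			kinds[i] = argUint
		case reflect.Float64:
			kinds[i] = argFloat64
		case reflect.Bool:
			kinds[i] = argBool
		default:
			panic("classifyArgs: unsupported kind in this sketch")
		}
	}
	return kinds
}

// packArgs is the per-call path: it switches on the stored tags and
// never asks reflect for the kind again.
func packArgs(kinds []argKind, args []reflect.Value, out []uintptr) {
	for i, k := range kinds {
		switch k {
		case argInt:
			out[i] = uintptr(args[i].Int())
		case argUint:
			out[i] = uintptr(args[i].Uint())
		case argFloat64:
			out[i] = uintptr(math.Float64bits(args[i].Float()))
		case argBool:
			if args[i].Bool() {
				out[i] = 1
			} else {
				out[i] = 0
			}
		}
	}
}

// demoArgKinds classifies a three-parameter function and packs one call.
func demoArgKinds() [3]uintptr {
	fn := func(a int32, b uint8, c bool) {}
	kinds := classifyArgs(reflect.TypeOf(fn))
	args := []reflect.Value{
		reflect.ValueOf(int32(5)),
		reflect.ValueOf(uint8(7)),
		reflect.ValueOf(true),
	}
	var out [3]uintptr
	packArgs(kinds, args, out[:])
	return out
}

func main() {
	fmt.Println(demoArgKinds())
}
```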

…uffer

Pre-compute struct cache key and bundle info at RegisterFunc registration time instead of rebuilding them on every call. This eliminates per-call buildStructCacheKey allocations (521 MB in profiling) and sync.Map lookups for Darwin ARM64 stack argument packing.

Also stack-allocate the collectStackArgs buffer by accepting a caller-provided []reflect.Value slice, eliminating the per-call heap allocation for stack argument collection (1.09 GB in profiling).

String arguments that spill to stack are mapped to *byte type in the pre-computed info since collectStackArgs converts them via CString before bundling. Variadic and struct-containing signatures fall back to runtime computation.

RegisterFunc/CFunc/10args: -19% latency, -19% memory
RegisterFunc/CFunc/15args: -30% latency, -45% memory, -9 allocs
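The caller-provided buffer pattern mentioned above can be sketched like this. The function collectSpilled and the register count of 8 are assumptions made for illustration; the real collectStackArgs signature is not shown in this PR description.

```go
package main

import (
	"fmt"
	"reflect"
)

// collectSpilled (hypothetical) appends the arguments that do not fit
// in registers to buf and returns the filled slice. Because the caller
// owns the backing storage, the function itself does not call make.
func collectSpilled(args []reflect.Value, numRegs int, buf []reflect.Value) []reflect.Value {
	buf = buf[:0]
	for i := numRegs; i < len(args); i++ {
		buf = append(buf, args[i])
	}
	return buf
}

// demoCollect passes ten int arguments with an assumed eight registers,
// so the last two spill into the caller's local array.
func demoCollect() (int, int64) {
	args := make([]reflect.Value, 10)
	for i := range args {
		args[i] = reflect.ValueOf(i)
	}
	var backing [16]reflect.Value // caller-owned storage
	spilled := collectSpilled(args, 8, backing[:])
	return len(spilled), spilled[0].Int()
}

func main() {
	fmt.Println(demoCollect())
}
```

If more arguments spill than the backing array can hold, append silently falls back to a heap allocation, so the pattern stays correct and only loses the optimization.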

tmc added 5 commits January 28, 2026 15:11

func: remove defer from RegisterFunc hot path

f2d7ee6


tmc wants to merge 5 commits into ebitengine:main from tmc:perf-internal
