purego: add various performance optimizations #400
Draft
tmc wants to merge 5 commits into ebitengine:main from …
Cache the reflect.StructOf result and pool reflect.New instances for Darwin ARM64 stack argument bundling. This avoids recreating the packed struct type and allocating a new instance on every call when arguments spill to the stack.

CFunc/10args: ~810 → ~560 ns/op (-31%), 23 → 21 allocs
CFunc/15args: ~1840 → ~810 ns/op (-56%), 44 → 30 allocs

Callbacks unchanged (bundling path is guarded by !isCallback).
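The caching and pooling idea described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names `packed`, `cache`, `keyFor`, `packedFor`, `withInstance`, and `demoTypes` are hypothetical, and the string key built from type names is an assumption made for brevity.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
	"sync"
)

// packed pairs a cached struct type with a pool of reusable instances.
// (Hypothetical type for this sketch.)
type packed struct {
	typ  reflect.Type
	pool sync.Pool // stores reflect.Value pointers to typ
}

// cache maps a signature key (string) to *packed.
var cache sync.Map

// keyFor is a hypothetical key builder that joins the field type names.
// Type names are not guaranteed unique across packages; good enough for a sketch.
func keyFor(types []reflect.Type) string {
	var b strings.Builder
	for _, t := range types {
		b.WriteString(t.String())
		b.WriteByte(';')
	}
	return b.String()
}

// packedFor returns the cached entry for types, building it on first use.
func packedFor(types []reflect.Type) *packed {
	key := keyFor(types)
	if v, ok := cache.Load(key); ok {
		return v.(*packed)
	}
	fields := make([]reflect.StructField, len(types))
	for i, t := range types {
		fields[i] = reflect.StructField{Name: fmt.Sprintf("X%d", i), Type: t}
	}
	p := &packed{typ: reflect.StructOf(fields)}
	p.pool.New = func() any { return reflect.New(p.typ) }
	// LoadOrStore keeps a single winner if two goroutines race here.
	v, _ := cache.LoadOrStore(key, p)
	return v.(*packed)
}

// withInstance borrows an instance, runs use, zeroes it, and returns it to the pool.
func (p *packed) withInstance(use func(ptr reflect.Value)) {
	inst := p.pool.Get().(reflect.Value)
	use(inst)
	inst.Elem().Set(reflect.Zero(p.typ)) // clear before reuse
	p.pool.Put(inst)
}

// demoTypes is a sample list of spilled argument types.
func demoTypes() []reflect.Type {
	return []reflect.Type{reflect.TypeOf(int32(0)), reflect.TypeOf(uint8(0))}
}

func main() {
	p := packedFor(demoTypes())
	p.withInstance(func(ptr reflect.Value) {
		ptr.Elem().Field(0).SetInt(7)
		fmt.Println(ptr.Elem().Interface())
	})
	fmt.Println(p == packedFor(demoTypes())) // second lookup reuses the cached entry
}
```

With this shape, the second lookup for the same signature skips reflect.StructOf, and the pool avoids a fresh reflect.New per call once it is warm.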
Replace make([]reflect.Value, fnType.NumIn()) with a stack-allocated fixed-size array, eliminating a heap allocation on every callback invocation. The profiling report identified this as the single largest allocator at 38.5% of total allocations (5.87 GB).
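A rough sketch of the fixed-size array technique is shown below. The cap `maxArgs` and the helpers `callFixed` and `addViaReflect` are assumptions made for illustration and are not taken from the PR. Whether the array really stays on the stack depends on the compiler's escape analysis, so the saving should be confirmed with a benchmark or `-gcflags=-m`.

```go
package main

import (
	"fmt"
	"reflect"
)

// maxArgs is a hypothetical upper bound on callback arguments.
const maxArgs = 16

// callFixed converts raw integer arguments and calls fn, slicing a
// fixed-size array instead of calling make for the argument slice.
// It only handles integer-kind parameters; this is a sketch.
func callFixed(fn reflect.Value, raw []uintptr) []reflect.Value {
	fnType := fn.Type()
	n := fnType.NumIn()

	var buf [maxArgs]reflect.Value
	var args []reflect.Value
	if n <= maxArgs {
		args = buf[:n] // no make call on the common path
	} else {
		args = make([]reflect.Value, n) // fallback for oversized signatures
	}
	for i := range args {
		args[i] = reflect.ValueOf(raw[i]).Convert(fnType.In(i))
	}
	return fn.Call(args)
}

// addViaReflect exercises callFixed with a two-argument function.
func addViaReflect(a, b uintptr) int64 {
	fn := reflect.ValueOf(func(x, y int) int { return x + y })
	return callFixed(fn, []uintptr{a, b})[0].Int()
}

func main() {
	fmt.Println(addViaReflect(2, 3)) // prints 5
}
```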
Replace defer-based runtime.KeepAlive and thePool.Put with explicit calls after the syscall returns. This eliminates deferred closure overhead on every RegisterFunc invocation.
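The before/after shape of this change might look like the following sketch. The names `callArgs`, `argPool`, `fakeSyscall`, `callDeferred`, and `callExplicit` are hypothetical stand-ins; the real code uses thePool and an actual foreign call.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// callArgs is a hypothetical argument block handed to the foreign call.
type callArgs struct{ a [4]uintptr }

var argPool = sync.Pool{New: func() any { return new(callArgs) }}

// fakeSyscall stands in for the real foreign function call.
func fakeSyscall(args *callArgs) uintptr { return args.a[0] + args.a[1] }

// callDeferred shows the defer-based shape: cleanup is registered up front.
func callDeferred(buf []byte, x, y uintptr) uintptr {
	args := argPool.Get().(*callArgs)
	defer argPool.Put(args)
	defer runtime.KeepAlive(buf)
	args.a[0], args.a[1] = x, y
	return fakeSyscall(args)
}

// callExplicit shows the explicit shape: cleanup runs inline after the call.
func callExplicit(buf []byte, x, y uintptr) uintptr {
	args := argPool.Get().(*callArgs)
	args.a[0], args.a[1] = x, y
	ret := fakeSyscall(args)
	runtime.KeepAlive(buf) // buf must stay reachable until the call has returned
	argPool.Put(args)
	return ret
}

func main() {
	buf := make([]byte, 8)
	fmt.Println(callDeferred(buf, 2, 3), callExplicit(buf, 2, 3))
}
```

One trade-off of the explicit form: if the call in the middle panics, the inline cleanup is skipped, whereas deferred calls would still run.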
Build a per-function argKind slice during RegisterFunc instead of type-switching via reflect.Kind on every call. This replaces the generic addValue dispatch with a direct switch on pre-computed enum values, reducing per-call reflect overhead.
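As an illustration of the pre-computed kind slice, here is a simplified sketch. Only the name `argKind` comes from the commit message; the enum values, the `registered` type, and the `register` and `marshal` helpers are assumptions, and only integer and pointer kinds are handled.

```go
package main

import (
	"fmt"
	"reflect"
)

// argKind is a compact classification computed once per function.
type argKind uint8

// Hypothetical enum values for this sketch.
const (
	kindInt argKind = iota
	kindUint
	kindPointer
)

// registered holds the per-function kinds built at registration time.
type registered struct{ kinds []argKind }

// register inspects the signature once and records one argKind per parameter.
func register(fn any) *registered {
	t := reflect.TypeOf(fn)
	r := &registered{kinds: make([]argKind, t.NumIn())}
	for i := range r.kinds {
		switch t.In(i).Kind() {
		case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
			r.kinds[i] = kindInt
		case reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64, reflect.Uintptr:
			r.kinds[i] = kindUint
		case reflect.Pointer, reflect.UnsafePointer:
			r.kinds[i] = kindPointer
		default:
			panic("unsupported argument kind: " + t.In(i).Kind().String())
		}
	}
	return r
}

// marshal is the per-call path: it switches on the stored uint8 enum
// rather than re-deriving each argument's reflect.Kind.
func (r *registered) marshal(args ...any) []uint64 {
	out := make([]uint64, len(r.kinds))
	for i, k := range r.kinds {
		v := reflect.ValueOf(args[i])
		switch k {
		case kindInt:
			out[i] = uint64(v.Int()) // sign-extended to 64 bits
		case kindUint:
			out[i] = v.Uint()
		case kindPointer:
			out[i] = uint64(v.Pointer())
		}
	}
	return out
}

func main() {
	r := register(func(int32, uint8) {})
	fmt.Println(r.marshal(int32(-1), uint8(7)))
}
```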
…uffer

Pre-compute struct cache key and bundle info at RegisterFunc registration time instead of rebuilding them on every call. This eliminates per-call buildStructCacheKey allocations (521 MB in profiling) and sync.Map lookups for Darwin ARM64 stack argument packing.

Also stack-allocate the collectStackArgs buffer by accepting a caller-provided []reflect.Value slice, eliminating the per-call heap allocation for stack argument collection (1.09 GB in profiling).

String arguments that spill to stack are mapped to *byte type in the pre-computed info since collectStackArgs converts them via CString before bundling. Variadic and struct-containing signatures fall back to runtime computation.

RegisterFunc/CFunc/10args: -19% latency, -19% memory
RegisterFunc/CFunc/15args: -30% latency, -45% memory, -9 allocs
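The registration-time pre-computation and the caller-provided buffer could be sketched like this. Everything here is a simplified assumption: `bundleInfo`, `precompute`, `collectInto`, and `spilledTypeNames` are hypothetical names, and the sketch pretends the first `numRegs` arguments go in registers, whereas a real ABI tracks integer and floating-point registers separately.

```go
package main

import (
	"fmt"
	"reflect"
)

// bundleInfo is a hypothetical record of the spilled argument types,
// computed once when the function is registered.
type bundleInfo struct {
	fieldTypes []reflect.Type
}

var bytePtrType = reflect.TypeOf((*byte)(nil))

// precompute returns nil when the signature must fall back to
// per-call computation (variadic or struct arguments).
func precompute(fnType reflect.Type, numRegs int) *bundleInfo {
	if fnType.IsVariadic() {
		return nil
	}
	info := &bundleInfo{}
	for i := numRegs; i < fnType.NumIn(); i++ {
		t := fnType.In(i)
		switch t.Kind() {
		case reflect.Struct:
			return nil // fall back to runtime computation
		case reflect.String:
			t = bytePtrType // strings are passed as C strings
		}
		info.fieldTypes = append(info.fieldTypes, t)
	}
	return info
}

// collectInto appends spilled arguments into the caller's buffer, so the
// callee allocates nothing as long as the buffer is large enough.
func collectInto(buf, args []reflect.Value, numRegs int) []reflect.Value {
	out := buf[:0]
	for i := numRegs; i < len(args); i++ {
		out = append(out, args[i])
	}
	return out
}

// spilledTypeNames is a demo helper that reports the pre-computed types.
func spilledTypeNames(fn any, numRegs int) []string {
	info := precompute(reflect.TypeOf(fn), numRegs)
	if info == nil {
		return nil
	}
	names := make([]string, len(info.fieldTypes))
	for i, t := range info.fieldTypes {
		names[i] = t.String()
	}
	return names
}

func main() {
	fmt.Println(spilledTypeNames(func(int, int, string, int32) {}, 2))

	var buf [8]reflect.Value // caller-owned buffer
	args := []reflect.Value{reflect.ValueOf(1), reflect.ValueOf(2), reflect.ValueOf(3)}
	fmt.Println(len(collectInto(buf[:], args, 2)))
}
```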
What issue is this addressing?
#399
What type of issue is this addressing?
performance
What this PR does | solves
Summary
This reduces per-call overhead in RegisterFunc and callback dispatch through five targeted optimizations. All changes are internal: no public API changes, no behavioral changes. The heaviest wins come from eliminating repeated reflect.StructOf calls and heap allocations in the Darwin ARM64 stack-packing path.

- Cache reflect.StructOf results and pool struct instances in bundleStackArgs to avoid recreating packed struct types on every call
- Stack-allocate the []reflect.Value args slice in callbackWrap instead of heap-allocating via make, eliminating the single largest allocator identified in profiling (38.5% of total allocs)
- Remove defer from the RegisterFunc hot path, replacing deferred runtime.KeepAlive and thePool.Put with explicit calls after the syscall returns
- Build an argKind enum slice at registration time so the per-call dispatch switches on a uint8 instead of calling reflect.Kind() repeatedly
- Pre-compute the struct cache key and bundle info at registration time, eliminating per-call cache-key allocations and sync.Map lookups

Benchmark results
Apple M4 Max, darwin/arm64, benchstat over 6 runs at 100ms each:

Low-arg-count paths (1 and 5 args for CFunc) are modestly faster since they don't hit the struct-packing path. The big wins are on 10+ arg calls where Darwin ARM64 stack bundling dominates.
Test plan
- go test ./... passes on darwin/arm64
- Review struct_arm64.go caching logic for correctness with concurrent callers (uses sync.Map + sync.Pool)
- Verify callbacks are unchanged (bundling path is guarded by !isCallback)