WIP: Interpreter performance optimizations (~16-24% speedup) #362

Merged
fglock merged 6 commits into master from perf/interpreter-optimization on Mar 24, 2026

Conversation

@fglock fglock (Owner) commented Mar 24, 2026

Summary

Cherry-picked interpreter optimization work from PR #353 to a clean branch.

Optimizations included:

  1. Phase 1: ThreadLocal Caching - Cache currentPackageScalar reference to avoid ThreadLocal lookups in hot loop

  2. Phase 2: Lazy CallerStack (~19% speedup) - Defer line number computation until caller() is actually called via pushLazy()

  3. Phase 3: Inline Apply Path (~2% speedup) - For InterpretedCode, bypass RuntimeCode.apply() and call BytecodeInterpreter.execute() directly

  4. Phase 4: Register Array Pooling (~4% speedup) - Cache register arrays per-code-object via getRegisters()

Benchmark results (from the original PR):

| Phase | Time (s) | Improvement |
| --- | --- | --- |
| Baseline | 127 | – |
| Phase 2 (Lazy CallerStack) | 103 | ~19% |
| Phase 3 (Inline Apply) | 101 | ~2% |
| Phase 4 (Register Pooling) | 97 | ~4% |
| Total | 97 | ~24% |

Each per-phase percentage is relative to the previous row's time; the total is 127 s → 97 s, i.e. ~24% relative to baseline. Phase 1 is not listed because it showed no measurable speedup on this benchmark (see the commit message below).

Files changed:

  • src/main/java/org/perlonjava/backend/bytecode/BytecodeInterpreter.java
  • src/main/java/org/perlonjava/backend/bytecode/InterpretedCode.java
  • src/main/java/org/perlonjava/runtime/runtimetypes/CallerStack.java
  • dev/design/INTERPRETER_OPTIMIZATION.md (design doc)

Status

  • All unit tests pass
  • Need more extensive testing with real-world workloads
  • Verify no regressions in caller() behavior

Generated with Devin

fglock and others added 6 commits March 24, 2026 10:18
Profile analysis of benchmark_closure.pl in interpreter mode:
- Identified ThreadLocal lookup overhead in CALL opcode
- CallerStack push/pop on every call even when caller() unused
- Deep call chain indirection for subroutine dispatch
- TreeMap lookup for line numbers

Optimization plan with 4 phases documented.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…erence

- Cache InterpreterState.currentPackage.get() at start of execute()
- Reuse cached RuntimeScalar for SET_PACKAGE opcode
- Avoid repeated ThreadLocal lookups in CALL_SUB opcodes

No measurable speedup on benchmark_closure.pl, but cleaner code.
Profile shows ~10% of time in getCallSiteInfo + getSourceLocationAccurate
for caller() support - Phase 2 target.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
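
As a rough illustration of the Phase 1 pattern described in this commit — hoisting a `ThreadLocal` lookup out of the dispatch loop — here is a minimal standalone sketch. The opcode value, the `String[]` slot, and the `currentPackage` field are hypothetical stand-ins, not the actual PerlOnJava `InterpreterState`/`RuntimeScalar` API:

```java
// Hypothetical stand-in for the interpreter's dispatch loop; the real
// PerlOnJava classes (InterpreterState, RuntimeScalar) look different.
final class ThreadLocalCachingSketch {
    // Stand-in for InterpreterState.currentPackage.
    static final ThreadLocal<String[]> currentPackage =
            ThreadLocal.withInitial(() -> new String[] { "main" });

    static final int SET_PACKAGE = 1; // illustrative opcode value

    static void execute(int[] bytecode) {
        // One ThreadLocal lookup per execute() call instead of one per opcode.
        String[] packageSlot = currentPackage.get();
        for (int op : bytecode) {
            switch (op) {
                case SET_PACKAGE:
                    packageSlot[0] = "Some::Package"; // reuse cached reference
                    break;
                default:
                    // ... other opcodes ...
                    break;
            }
        }
    }
}
```

`ThreadLocal.get()` performs a per-thread map lookup, so paying it once per `execute()` call rather than once per opcode removes it from the hot path even when, as here, the benchmark gain is below measurement noise.
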
Defer caller() info computation until actually needed:
- Add CallerStack.pushLazy() with lambda-based resolution
- CALL_SUB/CALL_METHOD now push lazy entries
- Line number computation only happens when caller() is called
- pop() skips resolution for unneeded entries

Benchmark improvement: 127s -> 103s = ~19% speedup on benchmark_closure.pl

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
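
A hedged sketch of the `pushLazy()` idea from this commit: wrap the expensive caller-site resolution in a `Supplier` and force it only when `caller()` actually reads the entry. The class shape and method names below are illustrative assumptions, not the real `CallerStack` signature:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Sketch: caller-site info (package/file/line) is computed lazily.
final class LazyCallerStackSketch {
    static final class Entry {
        private final Supplier<String> resolver; // e.g. () -> lookUpLineNumber(pc)
        private String resolved;                 // null until first use

        Entry(Supplier<String> resolver) { this.resolver = resolver; }

        String info() {                          // only caller() pays this cost
            if (resolved == null) resolved = resolver.get();
            return resolved;
        }
    }

    private final Deque<Entry> stack = new ArrayDeque<>();

    void pushLazy(Supplier<String> resolver) { stack.push(new Entry(resolver)); }

    // pop() never forces resolution, so the common "caller() unused" path
    // skips the line-number computation entirely.
    void pop() { stack.pop(); }

    String caller() { return stack.peek().info(); }
}
```

The design point: in the common case a subroutine never calls `caller()`, so its stack entry is pushed and popped without the TreeMap line-number lookup ever running.
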
- Add fast path in CALL_SUB for InterpretedCode: call execute() directly
- Bypass RuntimeCode.apply() indirection chain for interpreter-to-interpreter calls
- Pass null for subroutineName to enable InterpreterFrame caching
- Apply same optimization to TAILCALL handling

Small improvement (~2%) combined with previous optimizations.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
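
A minimal sketch of the CALL_SUB fast path described above, assuming a simplified `Code`/`InterpretedCode` hierarchy (the real PerlOnJava types differ): when the callee is interpreted code, re-enter the interpreter directly instead of going through the generic `apply()` chain.

```java
// Illustrative fast path for CALL_SUB, not the actual opcode handler.
final class InlineApplySketch {
    interface Code { Object apply(Object args); }

    static final class InterpretedCode implements Code {
        Object execute(Object args) { /* run bytecode */ return args; }
        @Override public Object apply(Object args) { return execute(args); }
    }

    static Object callSub(Code callee, Object args) {
        if (callee instanceof InterpretedCode ic) {
            // Fast path: interpreter-to-interpreter call, no apply() indirection.
            return ic.execute(args);
        }
        // Slow path: compiled or native code goes through the generic entry point.
        return callee.apply(args);
    }
}
```
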
- InterpretedCode.getRegisters() caches register arrays per-code-object
- Uses ThreadLocal for thread safety with recursion detection
- Recursive calls fall back to fresh allocation (no contention)
- BytecodeInterpreter.execute() releases registers in finally block

Benchmark: 101s → 97s (~4% improvement from reduced allocation)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
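
A sketch of per-code-object register pooling with a recursion guard, matching the bullets above in spirit; all field and method names (`pooledRegisters`, `inUse`, `releaseRegisters`) are assumptions for illustration, not the actual `InterpretedCode.getRegisters()` implementation:

```java
// Sketch: one pooled register array per (thread, code object), with a
// flag to detect recursion into the same code object.
final class RegisterPoolingSketch {
    private final int numRegisters = 16; // illustrative size

    private final ThreadLocal<Object[]> pooledRegisters =
            ThreadLocal.withInitial(() -> new Object[numRegisters]);
    private final ThreadLocal<Boolean> inUse =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    Object[] getRegisters() {
        if (inUse.get()) {
            // Recursive call: the pooled array is still live on this thread,
            // so fall back to a fresh allocation (no contention, just GC cost).
            return new Object[numRegisters];
        }
        inUse.set(Boolean.TRUE);
        return pooledRegisters.get();
    }

    void releaseRegisters(Object[] regs) {
        if (regs == pooledRegisters.get()) {
            java.util.Arrays.fill(regs, null); // avoid leaking references
            inUse.set(Boolean.FALSE);
        } // fresh arrays from recursive calls are simply left to the GC
    }

    Object execute() {
        Object[] regs = getRegisters();
        try {
            // ... dispatch loop using regs ...
            return null;
        } finally {
            releaseRegisters(regs); // release even when an exception unwinds
        }
    }
}
```

Releasing in `finally` is what makes the pool safe: the pooled array is returned on every exit path, so a thrown Perl exception cannot leave the flag stuck and force all later calls onto the allocation fallback.
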
Total improvement: 127s → 97s (~24% speedup)
- Phase 3: Inline apply path (2% speedup)
- Phase 4: Register pooling (4% speedup)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock marked this pull request as ready for review March 24, 2026 11:10
@fglock fglock merged commit 9d68652 into master Mar 24, 2026
2 checks passed
@fglock fglock deleted the perf/interpreter-optimization branch March 24, 2026 11:11