WIP: Interpreter performance optimizations (~16-24% speedup) #362

Merged
fglock merged 6 commits into master from perf/interpreter-optimization on Mar 24, 2026

Conversation

@fglock fglock (Owner) commented Mar 24, 2026

Summary

Cherry-picked interpreter optimization work from PR #353 to a clean branch.

Optimizations included:

  1. Phase 1: ThreadLocal Caching - Cache currentPackageScalar reference to avoid ThreadLocal lookups in hot loop

  2. Phase 2: Lazy CallerStack (~19% speedup) - Defer line number computation until caller() is actually called via pushLazy()

  3. Phase 3: Inline Apply Path (~2% speedup) - For InterpretedCode, bypass RuntimeCode.apply() and call BytecodeInterpreter.execute() directly

  4. Phase 4: Register Array Pooling (~4% speedup) - Cache register arrays per-code-object via getRegisters()

Benchmark results (from the original PR):

| Phase | Time (s) | Improvement |
| --- | --- | --- |
| Baseline | 127 | – |
| Phase 2 (Lazy CallerStack) | 103 | ~19% |
| Phase 3 (Inline Apply) | 101 | ~2% |
| Phase 4 (Register Pooling) | 97 | ~4% |
| Total | 97 | ~24% |

Each per-phase percentage is relative to the previous row's time; the total is 127 s → 97 s, i.e. ~24% relative to baseline. Phase 1 is not listed because it showed no measurable speedup on this benchmark (see the commit message below).

Files changed:

  • src/main/java/org/perlonjava/backend/bytecode/BytecodeInterpreter.java
  • src/main/java/org/perlonjava/backend/bytecode/InterpretedCode.java
  • src/main/java/org/perlonjava/runtime/runtimetypes/CallerStack.java
  • dev/design/INTERPRETER_OPTIMIZATION.md (design doc)

Status

  • All unit tests pass
  • Need more extensive testing with real-world workloads
  • Verify no regressions in caller() behavior

Generated with Devin

fglock and others added 6 commits March 24, 2026 10:18
Profile analysis of benchmark_closure.pl in interpreter mode:
- Identified ThreadLocal lookup overhead in CALL opcode
- CallerStack push/pop on every call even when caller() unused
- Deep call chain indirection for subroutine dispatch
- TreeMap lookup for line numbers

Optimization plan with 4 phases documented.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…erence

- Cache InterpreterState.currentPackage.get() at start of execute()
- Reuse cached RuntimeScalar for SET_PACKAGE opcode
- Avoid repeated ThreadLocal lookups in CALL_SUB opcodes

No measurable speedup on benchmark_closure.pl, but cleaner code.
Profile shows ~10% of time in getCallSiteInfo + getSourceLocationAccurate
for caller() support - Phase 2 target.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
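
As a rough illustration of the Phase 1 pattern described in this commit — hoisting a `ThreadLocal` lookup out of the dispatch loop — here is a minimal standalone sketch. The opcode value, the `String[]` slot, and the `currentPackage` field are hypothetical stand-ins, not the actual PerlOnJava `InterpreterState`/`RuntimeScalar` API:

```java
// Hypothetical stand-in for the interpreter's dispatch loop; the real
// PerlOnJava classes (InterpreterState, RuntimeScalar) look different.
final class ThreadLocalCachingSketch {
    // Stand-in for InterpreterState.currentPackage.
    static final ThreadLocal<String[]> currentPackage =
            ThreadLocal.withInitial(() -> new String[] { "main" });

    static final int SET_PACKAGE = 1; // illustrative opcode value

    static void execute(int[] bytecode) {
        // One ThreadLocal lookup per execute() call instead of one per opcode.
        String[] packageSlot = currentPackage.get();
        for (int op : bytecode) {
            switch (op) {
                case SET_PACKAGE:
                    packageSlot[0] = "Some::Package"; // reuse cached reference
                    break;
                default:
                    // ... other opcodes ...
                    break;
            }
        }
    }
}
```

`ThreadLocal.get()` performs a per-thread map lookup, so paying it once per `execute()` call rather than once per opcode removes it from the hot path even when, as here, the benchmark gain is below measurement noise.
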
Defer caller() info computation until actually needed:
- Add CallerStack.pushLazy() with lambda-based resolution
- CALL_SUB/CALL_METHOD now push lazy entries
- Line number computation only happens when caller() is called
- pop() skips resolution for unneeded entries

Benchmark improvement: 127s -> 103s = ~19% speedup on benchmark_closure.pl

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
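
A hedged sketch of the `pushLazy()` idea from this commit: wrap the expensive caller-site resolution in a `Supplier` and force it only when `caller()` actually reads the entry. The class shape and method names below are illustrative assumptions, not the real `CallerStack` signature:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Sketch: caller-site info (package/file/line) is computed lazily.
final class LazyCallerStackSketch {
    static final class Entry {
        private final Supplier<String> resolver; // e.g. () -> lookUpLineNumber(pc)
        private String resolved;                 // null until first use

        Entry(Supplier<String> resolver) { this.resolver = resolver; }

        String info() {                          // only caller() pays this cost
            if (resolved == null) resolved = resolver.get();
            return resolved;
        }
    }

    private final Deque<Entry> stack = new ArrayDeque<>();

    void pushLazy(Supplier<String> resolver) { stack.push(new Entry(resolver)); }

    // pop() never forces resolution, so the common "caller() unused" path
    // skips the line-number computation entirely.
    void pop() { stack.pop(); }

    String caller() { return stack.peek().info(); }
}
```

The design point: in the common case a subroutine never calls `caller()`, so its stack entry is pushed and popped without the TreeMap line-number lookup ever running.
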
- Add fast path in CALL_SUB for InterpretedCode: call execute() directly
- Bypass RuntimeCode.apply() indirection chain for interpreter-to-interpreter calls
- Pass null for subroutineName to enable InterpreterFrame caching
- Apply same optimization to TAILCALL handling

Small improvement (~2%) combined with previous optimizations.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
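
A minimal sketch of the CALL_SUB fast path described above, assuming a simplified `Code`/`InterpretedCode` hierarchy (the real PerlOnJava types differ): when the callee is interpreted code, re-enter the interpreter directly instead of going through the generic `apply()` chain.

```java
// Illustrative fast path for CALL_SUB, not the actual opcode handler.
final class InlineApplySketch {
    interface Code { Object apply(Object args); }

    static final class InterpretedCode implements Code {
        Object execute(Object args) { /* run bytecode */ return args; }
        @Override public Object apply(Object args) { return execute(args); }
    }

    static Object callSub(Code callee, Object args) {
        if (callee instanceof InterpretedCode ic) {
            // Fast path: interpreter-to-interpreter call, no apply() indirection.
            return ic.execute(args);
        }
        // Slow path: compiled or native code goes through the generic entry point.
        return callee.apply(args);
    }
}
```
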
- InterpretedCode.getRegisters() caches register arrays per-code-object
- Uses ThreadLocal for thread safety with recursion detection
- Recursive calls fall back to fresh allocation (no contention)
- BytecodeInterpreter.execute() releases registers in finally block

Benchmark: 101s → 97s (~4% improvement from reduced allocation)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
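
A sketch of per-code-object register pooling with a recursion guard, matching the bullets above in spirit; all field and method names (`pooledRegisters`, `inUse`, `releaseRegisters`) are assumptions for illustration, not the actual `InterpretedCode.getRegisters()` implementation:

```java
// Sketch: one pooled register array per (thread, code object), with a
// flag to detect recursion into the same code object.
final class RegisterPoolingSketch {
    private final int numRegisters = 16; // illustrative size

    private final ThreadLocal<Object[]> pooledRegisters =
            ThreadLocal.withInitial(() -> new Object[numRegisters]);
    private final ThreadLocal<Boolean> inUse =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    Object[] getRegisters() {
        if (inUse.get()) {
            // Recursive call: the pooled array is still live on this thread,
            // so fall back to a fresh allocation (no contention, just GC cost).
            return new Object[numRegisters];
        }
        inUse.set(Boolean.TRUE);
        return pooledRegisters.get();
    }

    void releaseRegisters(Object[] regs) {
        if (regs == pooledRegisters.get()) {
            java.util.Arrays.fill(regs, null); // avoid leaking references
            inUse.set(Boolean.FALSE);
        } // fresh arrays from recursive calls are simply left to the GC
    }

    Object execute() {
        Object[] regs = getRegisters();
        try {
            // ... dispatch loop using regs ...
            return null;
        } finally {
            releaseRegisters(regs); // release even when an exception unwinds
        }
    }
}
```

Releasing in `finally` is what makes the pool safe: the pooled array is returned on every exit path, so a thrown Perl exception cannot leave the flag stuck and force all later calls onto the allocation fallback.
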
Total improvement: 127s → 97s (~24% speedup)
- Phase 3: Inline apply path (2% speedup)
- Phase 4: Register pooling (4% speedup)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock marked this pull request as ready for review March 24, 2026 11:10
@fglock fglock merged commit 9d68652 into master Mar 24, 2026
2 checks passed
@fglock fglock deleted the perf/interpreter-optimization branch March 24, 2026 11:11