148 changes: 148 additions & 0 deletions dev/design/INTERPRETER_OPTIMIZATION.md
@@ -0,0 +1,148 @@
# Interpreter Performance Optimization

## Profile Analysis (2026-03-23)

**Benchmark:** `./jperl --interpreter dev/bench/benchmark_closure.pl`
- Interpreter mode: ~127 seconds
- JVM mode: ~4 seconds
- Ratio: ~32x (expected for bytecode interpreter vs JIT-compiled code)

## Top Hotspots by Sample Count

| Samples | Location | Description |
|---------|----------|-------------|
| 90 | `BytecodeInterpreter.execute` | Main interpreter loop |
| 54 | `RuntimeCode.apply` | Subroutine dispatch |
| 39 | `InterpretedCode.apply` | Delegation to interpreter |
| 7 | `getCallSiteInfo` | TreeMap lookup for caller() |
| 5 | `getSourceLocationAccurate` | Line number computation |

## Detailed Hotspot Analysis

### CALL Opcode Handling (BytecodeInterpreter.java lines 816-838)

```
Line 816 (6 samples): ThreadLocal lookup - InterpreterState.currentPackage.get()
Line 834 (7 samples): getCallSiteInfo - TreeMap.floorEntry()
Line 835 (4 samples): CallerStack.push
Line 838 (10 samples): RuntimeCode.apply - actual call
```

### Call Chain Overhead

The subroutine call dispatch has deep indirection:

```
CALL opcode (BytecodeInterpreter)
→ RuntimeCode.apply (54 samples)
→ InterpretedCode.apply (39 samples)
→ BytecodeInterpreter.execute (90 samples)
```

Each call goes through multiple layers before reaching the actual interpreter execution.

## Optimization Plan

### Phase 1: ThreadLocal Caching (High Impact, Low Risk)

**Problem:** `InterpreterState.currentPackage.get()` is called on every CALL opcode.

**Solution:** Do the `ThreadLocal` lookup once at the start of `execute()` and keep the result in a local variable (or pass it through) for use in the hot loop.

**Files:** `BytecodeInterpreter.java`
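
A minimal sketch of the idea, assuming the thread-local holds a mutable box (as `InterpreterState.currentPackage` appears to, given that the diff calls `.get().set(...)` on it). The class `ThreadLocalCachingSketch` and its `Box` type are hypothetical stand-ins, not PerlOnJava classes.

```java
// Hypothetical stand-in for a ThreadLocal that holds a mutable box.
final class ThreadLocalCachingSketch {
    static final class Box {
        String value = "main";
    }

    static final ThreadLocal<Box> CURRENT_PACKAGE = ThreadLocal.withInitial(Box::new);

    // Before: one ThreadLocal.get() per simulated CALL opcode.
    static int lookupEveryTime(int calls) {
        int total = 0;
        for (int i = 0; i < calls; i++) {
            total += CURRENT_PACKAGE.get().value.length();
        }
        return total;
    }

    // After: a single ThreadLocal.get() up front; the loop reads the cached box.
    static int lookupOnce(int calls) {
        Box pkg = CURRENT_PACKAGE.get();
        int total = 0;
        for (int i = 0; i < calls; i++) {
            total += pkg.value.length();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(lookupEveryTime(1000) == lookupOnce(1000));
    }
}
```

Caching the box rather than the string it contains matters here: a later update written through the same box (for example a `package Foo;` statement) stays visible to the cached reference.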

### Phase 2: Lazy CallerStack (High Impact, Medium Risk)

**Problem:** `CallerStack.push/pop` and `getCallSiteInfo` happen on EVERY subroutine call, even when `caller()` is never invoked.

**Solution:** Defer CallerStack operations until caller() is actually called:
1. Store call site info in a lightweight structure
2. Only populate CallerStack on-demand when caller() executes
3. Use a "dirty" flag to track if stack needs updating

**Files:** `BytecodeInterpreter.java`, `CallerStack.java`

### Phase 3: Inline Apply Path (Medium Impact, Medium Risk)

**Problem:** Call dispatch goes through multiple indirection layers.

**Solution:** For InterpretedCode, bypass RuntimeCode.apply and call BytecodeInterpreter.execute directly from the CALL opcode handler.

**Files:** `BytecodeInterpreter.java`

### Phase 5: Cache pcToTokenIndex Lookup (Low Impact, Low Risk; originally planned as Phase 4)

**Problem:** `TreeMap.floorEntry()` is O(log n) for line number lookups.

**Solution:** Cache last lookup result since sequential execution often hits nearby PCs.

**Files:** `BytecodeInterpreter.java`
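
One way this could look, as a sketch only: remember the PC range covered by the last `floorEntry()` hit and answer from it while lookups stay inside that range. The class `PcLookupCacheSketch`, its fields, and the `treeLookups` counter are hypothetical; the sketch assumes single-threaded access and that the map is not modified between lookups except through `put()`.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a last-hit cache in front of TreeMap.floorEntry().
final class PcLookupCacheSketch {
    private final TreeMap<Integer, Integer> pcToTokenIndex = new TreeMap<>();

    // Cached range [cachedLow, cachedHigh) and its value; empty range at start.
    private int cachedLow = 0;
    private int cachedHigh = 0;
    private Integer cachedValue = null;

    int treeLookups = 0; // counts cache misses, for illustration only

    void put(int pc, int tokenIndex) {
        pcToTokenIndex.put(pc, tokenIndex);
        cachedLow = 0;
        cachedHigh = 0; // invalidate the cached range
    }

    // Returns the token index for the greatest key <= pc, or null if none.
    Integer lookup(int pc) {
        if (pc >= cachedLow && pc < cachedHigh) {
            return cachedValue; // same range as the previous lookup
        }
        treeLookups++;
        Map.Entry<Integer, Integer> floor = pcToTokenIndex.floorEntry(pc);
        if (floor == null) {
            return null;
        }
        Integer next = pcToTokenIndex.higherKey(floor.getKey());
        cachedLow = floor.getKey();
        cachedHigh = (next == null) ? Integer.MAX_VALUE : next;
        cachedValue = floor.getValue();
        return cachedValue;
    }

    public static void main(String[] args) {
        PcLookupCacheSketch cache = new PcLookupCacheSketch();
        cache.put(0, 10);
        cache.put(8, 11);
        cache.lookup(3);
        cache.lookup(5);
        System.out.println("tree lookups: " + cache.treeLookups);
    }
}
```

A miss costs two tree operations (`floorEntry` plus `higherKey`) instead of one, so this only pays off when consecutive lookups tend to land in the same range.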

## Implementation Status

### Completed
- [x] Profile analysis (2026-03-23)
- [x] Phase 1: ThreadLocal Caching (2026-03-23) - caches the `RuntimeScalar` reference; no measurable speedup, but cleaner code
- [x] Phase 2: Lazy CallerStack (2026-03-23) - **~19% speedup** (127s → 103s)
- [x] Phase 3: Inline Apply Path (2026-03-23) - **~2% speedup** (103s → 101s)
- [x] Phase 4: Register Array Pooling (2026-03-23) - **~4% speedup** (101s → 97s)

### Pending
- [ ] Phase 5: Cache pcToTokenIndex Lookup (moved from Phase 4)

## Profile Results After Phase 1

A second profile showed `getCallSiteInfo` (16 samples) + `getSourceLocationAccurate` (15 samples) = ~10% overhead.
This time is spent computing call site info for `caller()` support on every subroutine call, whether or not `caller()` is ever invoked.

## Phase 2 Results

Implemented lazy CallerStack:
- `CallerStack.pushLazy()` stores a lambda that computes CallerInfo on demand
- Line number computation deferred until `caller()` is actually called
- `pop()` doesn't resolve lazy entries (no computation needed)

**Benchmark improvement:** 127s → 103s = **~19% speedup**
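
The lazy-entry idea can be sketched as follows. This is an illustration under assumptions, not the actual `CallerStack` code: the class `LazyCallerStackSketch`, its `Entry` type, and the single-argument `pushLazy` are hypothetical (the diff's `pushLazy` also takes the package name), and `CallerInfo` is modeled as a record with the three accessors the diff uses.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical sketch: stack entries resolve their CallerInfo only on demand.
final class LazyCallerStackSketch {
    record CallerInfo(String packageName, String filename, int line) {}

    private static final class Entry {
        private Supplier<CallerInfo> resolver;
        private CallerInfo resolved;

        Entry(Supplier<CallerInfo> resolver) {
            this.resolver = resolver;
        }

        CallerInfo get() {
            if (resolved == null) {
                resolved = resolver.get(); // expensive work happens here, once
                resolver = null;           // drop the captured state
            }
            return resolved;
        }
    }

    private final Deque<Entry> stack = new ArrayDeque<>();

    // Cheap: stores the supplier without running it.
    void pushLazy(Supplier<CallerInfo> resolver) {
        stack.push(new Entry(resolver));
    }

    // Cheap: discards the entry without resolving it.
    void pop() {
        stack.pop();
    }

    // Resolves the top entry if needed; returns null on an empty stack.
    CallerInfo peek() {
        Entry top = stack.peek();
        return top == null ? null : top.get();
    }

    public static void main(String[] args) {
        int[] resolutions = {0};
        LazyCallerStackSketch stack = new LazyCallerStackSketch();
        for (int i = 0; i < 1000; i++) {
            stack.pushLazy(() -> {
                resolutions[0]++;
                return new CallerInfo("main", "bench.pl", 1);
            });
            stack.pop();
        }
        System.out.println("resolutions after 1000 calls: " + resolutions[0]);
    }
}
```

Each call still allocates an entry and a lambda, so the saving comes from skipping the line-number computation, not from skipping the push itself.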

## Phase 3 Results

Inline InterpretedCode apply path in CALL_SUB:
- Check if code is `InterpretedCode` and call `BytecodeInterpreter.execute()` directly
- Bypasses `RuntimeCode.apply()` → `InterpretedCode.apply()` chain

**Benchmark improvement:** 103s → 101s = **~2% speedup**
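
A reduced sketch of the fast-path check, using Java 16+ pattern matching for `instanceof` as the diff does. All types here (`Code`, `InterpretedSub`, `CompiledSub`) and the `genericCalls` counter are hypothetical stand-ins for illustration; they are not the PerlOnJava classes.

```java
// Hypothetical sketch: test the concrete type at the call site and skip
// the generic dispatch layer when the target is interpreted code.
final class FastPathDispatchSketch {
    interface Code {}
    record InterpretedSub(int value) implements Code {}
    record CompiledSub(int value) implements Code {}

    static int genericCalls = 0; // counts trips through the generic layer

    // Stand-in for the interpreter entry point.
    static int execute(InterpretedSub sub) {
        return sub.value();
    }

    // Stand-in for the generic apply path with its extra indirection.
    static int applyGeneric(Code code) {
        genericCalls++;
        if (code instanceof InterpretedSub sub) {
            return execute(sub);
        }
        return ((CompiledSub) code).value();
    }

    // Call-site dispatch: fast path first, generic path as fallback.
    static int call(Code code) {
        if (code instanceof InterpretedSub sub) {
            return execute(sub);
        }
        return applyGeneric(code);
    }

    public static void main(String[] args) {
        call(new InterpretedSub(7));
        call(new CompiledSub(9));
        System.out.println("generic calls: " + genericCalls);
    }
}
```

The trade-off is that anything the generic path does besides forwarding (error handling, bookkeeping) has to be either unnecessary for the fast-path case or duplicated at the call site.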

## Phase 4 Results

Register array pooling in InterpretedCode:
- `InterpretedCode.getRegisters()` caches register arrays per-code-object
- Uses ThreadLocal for thread safety with recursion detection
- Recursive calls fall back to fresh allocation (no contention)
- `BytecodeInterpreter.execute()` releases registers in finally block

**Benchmark improvement:** 101s → 97s = **~4% speedup**
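
One point that may be worth checking: as the diff reads, `releaseRegisters()` takes no argument and clears the in-use flag unconditionally. If so, a recursive frame that was handed a fresh array would clear the flag on exit while the outer frame still holds the cached array, and a later recursive call could then receive that same array. Whether this can happen in practice depends on code outside this diff. The sketch below shows one possible guard, releasing only when the caller actually holds the cached array. `RegisterPoolSketch`, `acquire`, and `release` are hypothetical names, and clearing the array on release is an assumption about what the interpreter expects from a fresh register file.

```java
import java.util.Arrays;

// Hypothetical sketch of a per-thread register pool with an ownership check.
final class RegisterPoolSketch {
    private final int maxRegisters;
    private final ThreadLocal<Object[]> cached = new ThreadLocal<>();
    private final ThreadLocal<Boolean> inUse = ThreadLocal.withInitial(() -> false);

    RegisterPoolSketch(int maxRegisters) {
        this.maxRegisters = maxRegisters;
    }

    // Returns the cached array when it is free, otherwise a fresh one.
    Object[] acquire() {
        if (inUse.get()) {
            return new Object[maxRegisters]; // recursive call on this thread
        }
        Object[] regs = cached.get();
        if (regs == null) {
            regs = new Object[maxRegisters];
            cached.set(regs);
        }
        inUse.set(true);
        return regs;
    }

    // Frees the cached array only if the caller is the frame that holds it.
    void release(Object[] regs) {
        if (regs == cached.get()) {
            Arrays.fill(regs, null); // assumption: the next user expects empty registers
            inUse.set(false);
        }
    }

    public static void main(String[] args) {
        RegisterPoolSketch pool = new RegisterPoolSketch(4);
        Object[] outer = pool.acquire();
        Object[] inner = pool.acquire();
        pool.release(inner);
        Object[] second = pool.acquire();
        System.out.println("second recursive call shares outer array: " + (second == outer));
    }
}
```

Clearing on release costs O(maxRegisters) per call and would eat into the pooling gain, so it is only worth keeping if stale register contents are actually observable.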

## Total Performance Improvement

| Phase | Time (s) | Reduction vs. previous row |
|-------|----------|----------------------------|
| Baseline | 127 | - |
| Phase 2 (Lazy CallerStack) | 103 | ~19% |
| Phase 3 (Inline Apply) | 101 | ~2% |
| Phase 4 (Register Pooling) | 97 | ~4% |
| **Total** | **97** | **~24% vs. baseline** |
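
The per-phase figures are each measured against the preceding time, so they do not add up to the total; the total compares 97s directly with the 127s baseline. A small sketch of the arithmetic, where `SpeedupArithmeticSketch` and `reductionPercent` are hypothetical helper names:

```java
// Hypothetical helper that reproduces the percentages in the table.
final class SpeedupArithmeticSketch {
    // Percentage reduction in run time going from `before` to `after`.
    static double reductionPercent(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        System.out.printf("Phase 2: %.1f%%%n", reductionPercent(127, 103));
        System.out.printf("Phase 3: %.1f%%%n", reductionPercent(103, 101));
        System.out.printf("Phase 4: %.1f%%%n", reductionPercent(101, 97));
        System.out.printf("Total:   %.1f%%%n", reductionPercent(127, 97));
    }
}
```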

## Verification

After each optimization:
1. Run `make` to ensure no regressions
2. Re-run benchmark to measure improvement
3. Re-profile to confirm hotspot reduction

## Related Files

- `src/main/java/org/perlonjava/backend/bytecode/BytecodeInterpreter.java`
- `src/main/java/org/perlonjava/backend/bytecode/InterpretedCode.java`
- `src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeCode.java`
- `src/main/java/org/perlonjava/runtime/runtimetypes/CallerStack.java`
- `src/main/java/org/perlonjava/backend/bytecode/InterpreterState.java`
src/main/java/org/perlonjava/backend/bytecode/BytecodeInterpreter.java
@@ -74,8 +74,8 @@ public static RuntimeList execute(InterpretedCode code, RuntimeArray args, int c
     // Get PC holder for direct updates (avoids ThreadLocal lookups in hot loop)
     int[] pcHolder = InterpreterState.push(code, framePackageName, frameSubName);

-    // Pure register file (NOT stack-based - matches compiler for control flow correctness)
-    RuntimeBase[] registers = new RuntimeBase[code.maxRegisters];
+    // Get register array from cache (avoids allocation for non-recursive calls)
+    RuntimeBase[] registers = code.getRegisters();

     // Initialize special registers (same as compiler)
     registers[0] = code; // $this (for closures - register 0)
@@ -111,8 +111,10 @@ public static RuntimeList execute(InterpretedCode code, RuntimeArray args, int c
     // Record DVM level so the finally block can clean up everything pushed
     // by this subroutine (local variables AND regex state snapshot).
     int savedLocalLevel = usesLocalization ? DynamicVariableManager.getLocalLevel() : 0;
-    String savedPackage = InterpreterState.currentPackage.get().toString();
-    InterpreterState.currentPackage.get().set(framePackageName);
+    // Cache the currentPackage RuntimeScalar to avoid ThreadLocal lookups in hot loop
+    RuntimeScalar currentPackageScalar = InterpreterState.currentPackage.get();
+    String savedPackage = currentPackageScalar.toString();
+    currentPackageScalar.set(framePackageName);
     if (usesLocalization) {
         RegexState.save();
     }
@@ -813,9 +815,9 @@ public static RuntimeList execute(InterpretedCode code, RuntimeArray args, int c
     // This matches the JVM backend's call to codeDerefNonStrict()
     // Only call for STRING/BYTE_STRING types (symbolic references)
     // For CODE, REFERENCE, etc. let RuntimeCode.apply() handle errors
-    String currentPkg = InterpreterState.currentPackage.get().toString();
+    // Use cached RuntimeScalar to avoid ThreadLocal lookup
     if (codeRef.type == RuntimeScalarType.STRING || codeRef.type == RuntimeScalarType.BYTE_STRING) {
-        codeRef = codeRef.codeDerefNonStrict(currentPkg);
+        codeRef = codeRef.codeDerefNonStrict(currentPackageScalar.toString());
     }

     RuntimeBase argsBase = registers[argsReg];
@@ -830,12 +832,24 @@ public static RuntimeList execute(InterpretedCode code, RuntimeArray args, int c
         callArgs = new RuntimeArray((RuntimeScalar) argsBase);
     }

-    // Push call site info to CallerStack for caller() to see the correct location
-    CallerStack.CallerInfo callSiteInfo = getCallSiteInfo(code, callSitePc, currentPkg);
-    CallerStack.push(callSiteInfo.packageName(), callSiteInfo.filename(), callSiteInfo.line());
+    // Push lazy call site info to CallerStack for caller() to see the correct location
+    // The actual line number computation is deferred until caller() is called
+    // Capture variables needed for lazy resolution
+    final String lazyPkg = currentPackageScalar.toString();
+    final int lazyPc = callSitePc;
+    CallerStack.pushLazy(lazyPkg, () -> getCallSiteInfo(code, lazyPc, lazyPkg));
     RuntimeList result;
     try {
-        result = RuntimeCode.apply(codeRef, "", callArgs, context);
+        // Fast path for InterpretedCode: call execute() directly,
+        // bypassing RuntimeCode.apply() indirection chain
+        if (codeRef.type == RuntimeScalarType.CODE && codeRef.value instanceof InterpretedCode interpCode) {
+            // Direct call to interpreter - skip RuntimeCode.apply overhead
+            // Pass null for subroutineName to enable frame caching
+            result = BytecodeInterpreter.execute(interpCode, callArgs, context, null);
+        } else {
+            // Slow path for JVM-compiled code, symbolic references, etc.
+            result = RuntimeCode.apply(codeRef, "", callArgs, context);
+        }

         // Handle TAILCALL with trampoline loop (same as JVM backend)
         while (result.isNonLocalGoto()) {
@@ -844,7 +858,12 @@ public static RuntimeList execute(InterpretedCode code, RuntimeArray args, int c
         // Extract codeRef and args, call target
         codeRef = flow.getTailCallCodeRef();
         callArgs = flow.getTailCallArgs();
-        result = RuntimeCode.apply(codeRef, "tailcall", callArgs, context);
+        // Use fast path for InterpretedCode
+        if (codeRef.type == RuntimeScalarType.CODE && codeRef.value instanceof InterpretedCode interpCode) {
+            result = BytecodeInterpreter.execute(interpCode, callArgs, context, null);
+        } else {
+            result = RuntimeCode.apply(codeRef, "tailcall", callArgs, context);
+        }
         // Loop to handle chained tail calls
     } else {
         // Not TAILCALL - check labeled blocks or propagate
@@ -914,10 +933,11 @@ public static RuntimeList execute(InterpretedCode code, RuntimeArray args, int c
         callArgs = new RuntimeArray((RuntimeScalar) argsBase);
     }

-    // Push call site info to CallerStack for caller() to see the correct location
-    String currentPkg = InterpreterState.currentPackage.get().toString();
-    CallerStack.CallerInfo callSiteInfo = getCallSiteInfo(code, callSitePc, currentPkg);
-    CallerStack.push(callSiteInfo.packageName(), callSiteInfo.filename(), callSiteInfo.line());
+    // Push lazy call site info to CallerStack for caller() to see the correct location
+    // Capture variables needed for lazy resolution
+    final String lazyPkg = currentPackageScalar.toString();
+    final int lazyPc = callSitePc;
+    CallerStack.pushLazy(lazyPkg, () -> getCallSiteInfo(code, lazyPc, lazyPkg));
     RuntimeList result;
     try {
         result = RuntimeCode.call(invocant, method, currentSub, callArgs, context);
@@ -1005,9 +1025,9 @@ public static RuntimeList execute(InterpretedCode code, RuntimeArray args, int c
         : codeRefBase.scalar();

     // Dereference symbolic code references
+    // Use cached RuntimeScalar to avoid ThreadLocal lookup
     if (codeRef.type == RuntimeScalarType.STRING || codeRef.type == RuntimeScalarType.BYTE_STRING) {
-        String currentPkg = InterpreterState.currentPackage.get().toString();
-        codeRef = codeRef.codeDerefNonStrict(currentPkg);
+        codeRef = codeRef.codeDerefNonStrict(currentPackageScalar.toString());
     }

     // Get args
@@ -1611,8 +1631,9 @@ public static RuntimeList execute(InterpretedCode code, RuntimeArray args, int c
     case Opcodes.SET_PACKAGE -> {
         // Non-scoped package declaration: package Foo;
         // Update the runtime current-package tracker so caller() returns the right package.
+        // Uses cached RuntimeScalar reference to avoid ThreadLocal lookup
         int nameIdx = bytecode[pc++];
-        InterpreterState.currentPackage.get().set(code.stringPool[nameIdx]);
+        currentPackageScalar.set(code.stringPool[nameIdx]);
     }

     case Opcodes.PUSH_PACKAGE -> {
@@ -1876,8 +1897,10 @@ public static RuntimeList execute(InterpretedCode code, RuntimeArray args, int c
         if (usesLocalization) {
             DynamicVariableManager.popToLocalLevel(savedLocalLevel);
         }
-        InterpreterState.currentPackage.get().set(savedPackage);
+        currentPackageScalar.set(savedPackage);
         InterpreterState.pop();
+        // Release cached registers for reuse
+        code.releaseRegisters();
     }
 }

31 changes: 31 additions & 0 deletions src/main/java/org/perlonjava/backend/bytecode/InterpretedCode.java
@@ -40,6 +40,37 @@ public class InterpretedCode extends RuntimeCode implements PerlSubroutine {
     // Created lazily on first use (after packageName/subName are set)
     public volatile InterpreterState.InterpreterFrame cachedFrame;

+    // Cached register array for non-recursive calls (avoids allocation)
+    // Thread-safe via ThreadLocal for multi-threaded execution
+    private final ThreadLocal<RuntimeBase[]> cachedRegisters = new ThreadLocal<>();
+    // Flag to track if cached registers are currently in use (for recursion detection)
+    private final ThreadLocal<Boolean> registersInUse = ThreadLocal.withInitial(() -> false);
+
+    /**
+     * Get a register array for execution. Returns cached array if not in use (common case),
+     * otherwise allocates a new one (recursive call).
+     */
+    public RuntimeBase[] getRegisters() {
+        if (registersInUse.get()) {
+            // Recursive call - need fresh array
+            return new RuntimeBase[maxRegisters];
+        }
+        RuntimeBase[] regs = cachedRegisters.get();
+        if (regs == null || regs.length != maxRegisters) {
+            regs = new RuntimeBase[maxRegisters];
+            cachedRegisters.set(regs);
+        }
+        registersInUse.set(true);
+        return regs;
+    }
+
+    /**
+     * Release the register array after execution completes.
+     */
+    public void releaseRegisters() {
+        registersInUse.set(false);
+    }
+
     // Lexical pragma state (for eval STRING to inherit)
     public final int strictOptions; // Strict flags at compile time
     public final int featureFlags; // Feature flags at compile time
2 changes: 1 addition & 1 deletion src/main/java/org/perlonjava/core/Configuration.java
@@ -33,7 +33,7 @@ public final class Configuration {
      * Automatically populated by Gradle/Maven during build.
      * DO NOT EDIT MANUALLY - this value is replaced at build time.
      */
-    public static final String gitCommitId = "e029ad8f2";
+    public static final String gitCommitId = "b75552677";

     /**
      * Git commit date of the build (ISO format: YYYY-MM-DD).