fglock · Mar 23, 2026
diff --git a/.cognition/skills/debug-exiftool/SKILL.md b/.cognition/skills/debug-exiftool/SKILL.md
@@ -23,7 +23,7 @@ You are debugging failures in the Image::ExifTool test suite running under PerlO
 
 **IMPORTANT: Never push directly to master. Always use feature branches and PRs.**
 
-**IMPORTANT: Always commit or stash changes BEFORE switching branches.** If `git stash pop` has conflicts, uncommitted changes may be lost.
+**IMPORTANT: Always commit or save changes BEFORE switching branches.** Use `git diff > backup.patch` to save uncommitted work, or commit to a WIP branch.
 
 ```bash
 git checkout -b fix/exiftool-issue-name
@@ -42,7 +42,7 @@ gh pr create --title "Fix: description" --body "Details"
 - **ExifTool reference output**: `Image-ExifTool-13.44/t/<TestName>_N.out` (expected tag output per sub-test)
 - **PerlOnJava unit tests**: `src/test/resources/unit/*.t` (make suite, 154 tests)
 - **Perl5 core tests**: `perl5_t/t/` (Perl 5 compatibility suite, run via `make test-gradle`)
-- **Fat JAR**: `target/perlonjava-3.0.0.jar`
+- **Fat JAR**: `build/libs/perlonjava-*.jar` (version varies)
 - **Launcher script**: `./jperl` (resolves JAR path, sets `$^X`)
 
 ## Building PerlOnJava
@@ -64,16 +64,13 @@ make dev   # Quick build - compiles only, NO tests
 ### Single test
 ```bash
 cd Image-ExifTool-13.44
-java -jar ../target/perlonjava-3.0.0.jar -Ilib t/Writer.t
-# Or using the launcher:
-cd Image-ExifTool-13.44
 ../jperl -Ilib t/Writer.t
 ```
 
 ### Single test with timeout (prevents infinite loops)
 ```bash
 cd Image-ExifTool-13.44
-timeout 120 java -jar ../target/perlonjava-3.0.0.jar -Ilib t/XMP.t
+timeout 120 ../jperl -Ilib t/XMP.t
 ```
 
 ### All ExifTool tests in parallel with summary
@@ -82,7 +79,7 @@ cd Image-ExifTool-13.44
 mkdir -p /tmp/exiftool_results
 for t in t/*.t; do
     name=$(basename "$t" .t)
-    ( output=$(timeout 120 java -jar ../target/perlonjava-3.0.0.jar -Ilib "$t" 2>&1)
+    ( output=$(timeout 120 ../jperl -Ilib "$t" 2>&1)
       ec=$?
       if [ $ec -eq 124 ]; then echo "$name TIMEOUT"
       else
@@ -133,7 +130,7 @@ cd Image-ExifTool-13.44
 perl -Ilib t/Writer.t 2>&1 | grep -E '^(not )?ok ' > /tmp/perl_results.txt
 
 # Run with PerlOnJava
-java -jar ../target/perlonjava-3.0.0.jar -Ilib t/Writer.t 2>&1 | grep -E '^(not )?ok ' > /tmp/jperl_results.txt
+../jperl -Ilib t/Writer.t 2>&1 | grep -E '^(not )?ok ' > /tmp/jperl_results.txt
 
 # Diff
 diff /tmp/perl_results.txt /tmp/jperl_results.txt
@@ -145,7 +142,7 @@ For individual Perl constructs:
 perl -e 'my @a = (1,2,3); $_ *= 2 foreach @a; print "@a\n"'
 
 # PerlOnJava
-java -jar target/perlonjava-3.0.0.jar -e 'my @a = (1,2,3); $_ *= 2 foreach @a; print "@a\n"'
+./jperl -e 'my @a = (1,2,3); $_ *= 2 foreach @a; print "@a\n"'
 ```
 
 For comparing `.failed` output files against `.out` reference files:
@@ -188,8 +185,8 @@ diff t/Writer_11.out t/Writer_11.failed
 # Pass JVM options via JPERL_OPTS
 JPERL_OPTS="-Xmx512m" ./jperl script.pl
 
-# Combine env vars
-JPERL_SHOW_FALLBACK=1 JPERL_EVAL_TRACE=1 java -jar target/perlonjava-3.0.0.jar -Ilib t/Writer.t 2>&1
+# Combine env vars (inside ExifTool dir)
+JPERL_SHOW_FALLBACK=1 JPERL_EVAL_TRACE=1 ../jperl -Ilib t/Writer.t 2>&1
 ```
 
 ## Test File Anatomy
@@ -235,7 +232,7 @@ The `check()` function compares extracted tags against reference files `t/<TestN
 
 5. **Isolate the Perl construct** causing the failure. Write a minimal reproducer:
    ```bash
-   java -jar target/perlonjava-3.0.0.jar -e 'print pos("abc" =~ /b/g), "\n"'
+   ./jperl -e 'print pos("abc" =~ /b/g), "\n"'
    perl -e 'print pos("abc" =~ /b/g), "\n"'
    ```
 
@@ -348,7 +345,7 @@ All geotag tests except module loading and 2 others fail. All use `Time::Local`
 Writing non-default language entries to XMP lang-alt lists fails silently. Only `x-default` works. The write path in `WriteXMP.pl` uses `pos()` after `m//g` for path tracking. Test with:
 ```bash
 perl -e '"a/b/c" =~ m|/|g; print pos(), "\n"'  # should print 2
-java -jar target/perlonjava-3.0.0.jar -e '"a/b/c" =~ m|/|g; print pos(), "\n"'
+./jperl -e '"a/b/c" =~ m|/|g; print pos(), "\n"'
 ```
 
 #### P4: XMP lang-alt Bag index tracking (3 tests: XMP 36,38,50)
@@ -471,7 +468,7 @@ In PerlOnJava Java code (temporary, never commit):
 System.err.println("DEBUG: value=" + value);
 ```
 
-To trace which subs hit interpreter fallback:
+To trace which subs hit interpreter fallback (inside ExifTool dir):
 ```bash
-JPERL_SHOW_FALLBACK=1 java -jar target/perlonjava-3.0.0.jar -Ilib t/Writer.t 2>&1 | grep FALLBACK
+JPERL_SHOW_FALLBACK=1 ../jperl -Ilib t/Writer.t 2>&1 | grep FALLBACK
 ```
diff --git a/.cognition/skills/debug-perlonjava/SKILL.md b/.cognition/skills/debug-perlonjava/SKILL.md
@@ -35,7 +35,7 @@ gh pr create --title "Fix: description" --body "Details"
 - **PerlOnJava source**: `src/main/java/org/perlonjava/` (compiler, bytecode interpreter, runtime)
 - **Unit tests**: `src/test/resources/unit/*.t` (run via `make`)
 - **Perl5 core tests**: `perl5_t/t/` (Perl 5 compatibility suite)
-- **Fat JAR**: `target/perlonjava-3.0.0.jar`
+- **Fat JAR**: `build/libs/perlonjava-*.jar` (version varies)
 - **Launcher script**: `./jperl`
 
 ## Building
@@ -191,9 +191,10 @@ This helps identify operator precedence issues and incorrect parsing.
 
 ### 6. Profile with JFR (for performance issues)
 ```bash
-# Record profile
+# Record profile (locate the JAR first; the version varies)
+JAR=$(ls build/libs/perlonjava-*.jar | head -1)
 $JAVA_HOME/bin/java -XX:StartFlightRecording=duration=10s,filename=profile.jfr \
-  -jar target/perlonjava-3.0.0.jar script.pl
+  -jar "$JAR" script.pl
 
 # Analyze hotspots
 $JAVA_HOME/bin/jfr print --events jdk.ExecutionSample profile.jfr 2>&1 | \

diff --git a/.cognition/skills/interpreter-parity/SKILL.md b/.cognition/skills/interpreter-parity/SKILL.md
@@ -36,7 +36,7 @@ gh pr create --title "Fix interpreter: description" --body "Details"
 
 - **PerlOnJava source**: `src/main/java/org/perlonjava/` (compiler, bytecode interpreter, runtime)
 - **Unit tests**: `src/test/resources/unit/*.t` (155 tests, run via `make`)
-- **Fat JAR**: `target/perlonjava-3.0.0.jar`
+- **Fat JAR**: `build/libs/perlonjava-*.jar` (version varies)
 - **Launcher script**: `./jperl`
 
 ## Building

diff --git a/.cognition/skills/profile-perlonjava/SKILL.md b/.cognition/skills/profile-perlonjava/SKILL.md
@@ -14,7 +14,7 @@ Profile and optimize PerlOnJava runtime performance using Java Flight Recorder.
 
 **IMPORTANT: Never push directly to master. Always use feature branches and PRs.**
 
-**IMPORTANT: Always commit or stash changes BEFORE switching branches.** If `git stash pop` has conflicts, uncommitted changes may be lost.
+**IMPORTANT: Always commit or save changes BEFORE switching branches.** Use `git diff > backup.patch` to save uncommitted work, or commit to a WIP branch.
 
 ```bash
 git checkout -b perf/optimization-name
@@ -34,12 +34,23 @@ gh pr create --title "Perf: description" --body "Details"
 ### 1. Run with JFR Profiling
 
 ```bash
-cd /Users/fglock/projects/PerlOnJava2
+cd /Users/fglock/projects/PerlOnJava
+
+# Find the jar file (version changes with releases)
+JAR=$(ls build/libs/perlonjava-*.jar | head -1)
 
 # Profile a long-running script (adjust duration as needed)
 java -XX:+FlightRecorder \
   -XX:StartFlightRecording=duration=60s,filename=profile.jfr \
-  -jar target/perlonjava-3.0.0.jar <script.pl> [args...]
+  -jar "$JAR" <script.pl> [args...]
+
+# Or use the wrapper script with JFR options via JAVA_OPTS
+JAVA_OPTS="-XX:+FlightRecorder -XX:StartFlightRecording=duration=60s,filename=profile.jfr" \
+  ./jperl <script.pl> [args...]
+
+# For interpreter mode profiling
+JAVA_OPTS="-XX:+FlightRecorder -XX:StartFlightRecording=duration=60s,filename=profile.jfr" \
+  ./jperl --interpreter <script.pl> [args...]
 ```
 
 ### 2. Analyze with JFR Tools
@@ -65,22 +76,24 @@ $JFR print --events jdk.ExecutionSample profile.jfr 2>&1 | \
 
 | Category | Methods to Watch | Optimization Approach |
 |----------|------------------|----------------------|
-| **Number parsing** | `Long.parseLong`, `Double.parseDouble`, `NumberParser.parseNumber` | Cache numeric values, avoid string→number conversions |
+| **Number parsing** | `Long.parseLong`, `Double.parseDouble`, `NumberParser.parseNumber` | Cache numeric values, avoid string->number conversions |
 | **Type checking** | `ScalarUtils.looksLikeNumber`, `RuntimeScalar.getDefinedBoolean` | Fast-path for common types (INTEGER, DOUBLE) |
 | **Bitwise ops** | `BitwiseOperators.*` | Ensure values stay as INTEGER type |
 | **Regex** | `Pattern.match`, `Matcher.matches` | Reduce unnecessary regex checks |
 | **Loop control** | `RuntimeControlFlowRegistry.checkLoopAndGetAction` | ThreadLocal overhead |
 | **Array ops** | `ArrayList.grow`, `Arrays.copyOf` | Pre-size arrays, reduce allocations |
+| **Interpreter** | `BytecodeInterpreter.execute`, opcode handlers | Reduce dispatch overhead, inline hot paths |
 
 ### 4. Common Runtime Files
 
 | File | Purpose |
 |------|---------|
 | `src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java` | Scalar value representation, getLong/getDouble/getInt |
 | `src/main/java/org/perlonjava/runtime/runtimetypes/ScalarUtils.java` | Utility functions like looksLikeNumber |
-| `src/main/java/org/perlonjava/runtime/operators/BitwiseOperators.java` | Bitwise operations (&, |, ^, ~, <<, >>) |
+| `src/main/java/org/perlonjava/runtime/operators/BitwiseOperators.java` | Bitwise operations (&, \|, ^, ~, <<, >>) |
 | `src/main/java/org/perlonjava/runtime/operators/Operator.java` | General operators |
 | `src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeArray.java` | Array operations |
+| `src/main/java/org/perlonjava/backend/bytecode/BytecodeInterpreter.java` | Interpreter main loop |
 
 ### 5. Optimization Patterns
 
@@ -110,14 +123,20 @@ if (runtimeScalar.type == INTEGER) {
 ### 6. Benchmark Commands
 
 ```bash
+cd /Users/fglock/projects/PerlOnJava
+
+# Quick benchmark with closure test
+./jperl dev/bench/benchmark_closure.pl
+
+# Interpreter mode benchmark (slower, good for profiling interpreter)
+./jperl --interpreter dev/bench/benchmark_closure.pl
+
 # Quick benchmark with life_bitpacked.pl
-java -jar target/perlonjava-3.0.0.jar examples/life_bitpacked.pl \
-  -w 200 -h 200 -g 10000 -r none
+./jperl examples/life_bitpacked.pl -w 200 -h 200 -g 10000 -r none
 
 # Multiple runs for consistency
 for i in 1 2 3; do
-  java -jar target/perlonjava-3.0.0.jar examples/life_bitpacked.pl \
-    -w 200 -h 200 -g 10000 -r none 2>&1 | grep "per second"
+  ./jperl examples/life_bitpacked.pl -w 200 -h 200 -g 10000 -r none 2>&1 | grep "per second"
 done
 ```
 
@@ -147,3 +166,15 @@ make dev   # Quick build - compiles only, NO tests
 7. Profile again to verify improvement
 8. Run tests to ensure correctness
 ```
+
+## JVM vs Interpreter Performance
+
+The interpreter mode (`--interpreter`) is typically 20-40x slower than JVM-compiled mode.
+This slowdown is expected; interpreter mode remains useful for:
+- Testing interpreter-specific code paths
+- Debugging interpreter behavior
+- Profiling interpreter bottlenecks
+
+Typical timings:
+- JVM mode: ~4 seconds for benchmark_closure.pl
+- Interpreter mode: ~120-130 seconds for the same benchmark
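
The timings quoted above can be checked against the stated 20-40x range with a few lines of arithmetic. This is a minimal sketch; the class and method names are illustrative and not part of PerlOnJava.

```java
import java.util.Locale;

// Illustrative helper only: not part of the PerlOnJava code base.
class SlowdownRatio {

    /** How many times slower interpreter mode is than JVM-compiled mode. */
    static double slowdown(double interpreterSeconds, double jvmSeconds) {
        return interpreterSeconds / jvmSeconds;
    }

    public static void main(String[] args) {
        double jvmSeconds = 4.0; // JVM-mode timing quoted in the skill file
        // Lower and upper ends of the quoted interpreter timing (~120-130 s)
        System.out.printf(Locale.ROOT, "%.1fx%n", slowdown(120.0, jvmSeconds)); // prints 30.0x
        System.out.printf(Locale.ROOT, "%.1fx%n", slowdown(130.0, jvmSeconds)); // prints 32.5x
    }
}
```

Both results fall inside the 20-40x band, so the example timings are consistent with the general claim.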
diff --git a/dev/design/INTERPRETER_OPTIMIZATION.md b/dev/design/INTERPRETER_OPTIMIZATION.md
@@ -0,0 +1,148 @@
+# Interpreter Performance Optimization
+
+## Profile Analysis (2026-03-23)
+
+**Benchmark:** `./jperl --interpreter dev/bench/benchmark_closure.pl`
+- Interpreter mode: ~127 seconds
+- JVM mode: ~4 seconds
+- Ratio: ~32x (expected for bytecode interpreter vs JIT-compiled code)
+
+## Top Hotspots by Sample Count
+
+| Samples | Location | Description |
+|---------|----------|-------------|
+| 90 | `BytecodeInterpreter.execute` | Main interpreter loop |
+| 54 | `RuntimeCode.apply` | Subroutine dispatch |
+| 39 | `InterpretedCode.apply` | Delegation to interpreter |
+| 7 | `getCallSiteInfo` | TreeMap lookup for caller() |
+| 5 | `getSourceLocationAccurate` | Line number computation |
+
+## Detailed Hotspot Analysis
+
+### CALL Opcode Handling (BytecodeInterpreter.java lines 816-838)
+
+```
+Line 816 (6 samples): ThreadLocal lookup - InterpreterState.currentPackage.get()
+Line 834 (7 samples): getCallSiteInfo - TreeMap.floorEntry() 
+Line 835 (4 samples): CallerStack.push
+Line 838 (10 samples): RuntimeCode.apply - actual call
+```
+
+### Call Chain Overhead
+
+The subroutine call dispatch has deep indirection:
+
+```
+CALL opcode (BytecodeInterpreter)
+    → RuntimeCode.apply (54 samples)
+        → InterpretedCode.apply (39 samples)  
+            → BytecodeInterpreter.execute (90 samples)
+```
+
+Each call goes through multiple layers before reaching the actual interpreter execution.
+
+## Optimization Plan
+
+### Phase 1: ThreadLocal Caching (High Impact, Low Risk)
+
+**Problem:** `InterpreterState.currentPackage.get()` is called on every CALL opcode.
+
+**Solution:** Read the ThreadLocal once at the start of `execute()` and keep the result in a local variable (or pass it through) instead of calling `get()` on every CALL opcode.
+
+**Files:** `BytecodeInterpreter.java`
+
+### Phase 2: Lazy CallerStack (High Impact, Medium Risk)
+
+**Problem:** `CallerStack.push/pop` and `getCallSiteInfo` happen on EVERY subroutine call, even when `caller()` is never invoked.
+
+**Solution:** Defer CallerStack operations until caller() is actually called:
+1. Store call site info in a lightweight structure
+2. Only populate CallerStack on-demand when caller() executes
+3. Use a "dirty" flag to track if stack needs updating
+
+**Files:** `BytecodeInterpreter.java`, `CallerStack.java`
+
+### Phase 3: Inline Apply Path (Medium Impact, Medium Risk)
+
+**Problem:** Call dispatch goes through multiple indirection layers.
+
+**Solution:** For InterpretedCode, bypass RuntimeCode.apply and call BytecodeInterpreter.execute directly from the CALL opcode handler.
+
+**Files:** `BytecodeInterpreter.java`
+
+### Phase 5 (originally Phase 4): Cache pcToTokenIndex Lookup (Low Impact, Low Risk)
+
+**Problem:** `TreeMap.floorEntry()` is O(log n) for line number lookups.
+
+**Solution:** Cache last lookup result since sequential execution often hits nearby PCs.
+
+**Files:** `BytecodeInterpreter.java`
+
+## Implementation Status
+
+### Completed
+- [x] Profile analysis (2026-03-23)
+- [x] Phase 1: ThreadLocal Caching (2026-03-23) - Cache RuntimeScalar reference, no measurable speedup but cleaner code
+- [x] Phase 2: Lazy CallerStack (2026-03-23) - **~19% speedup** (127s → 103s)
+- [x] Phase 3: Inline Apply Path (2026-03-23) - **~2% speedup** (103s → 101s)
+- [x] Phase 4: Register Array Pooling (2026-03-23) - **~4% speedup** (101s → 97s)
+
+### Pending
+- [ ] Phase 5: Cache pcToTokenIndex Lookup (moved from Phase 4)
+
+## Profile Results After Phase 1
+
+A second profile showed `getCallSiteInfo` (16 samples) + `getSourceLocationAccurate` (15 samples) = ~10% overhead.
+This time goes to computing call-site info for `caller()` support on every subroutine call.
+
+## Phase 2 Results
+
+Implemented lazy CallerStack:
+- `CallerStack.pushLazy()` stores a lambda that computes CallerInfo on demand
+- Line number computation deferred until `caller()` is actually called
+- `pop()` doesn't resolve lazy entries (no computation needed)
+
+**Benchmark improvement:** 127s → 103s = **~19% speedup**
+
+## Phase 3 Results
+
+Inline InterpretedCode apply path in CALL_SUB:
+- Check if code is `InterpretedCode` and call `BytecodeInterpreter.execute()` directly
+- Bypasses `RuntimeCode.apply()` → `InterpretedCode.apply()` chain
+
+**Benchmark improvement:** 103s → 101s = **~2% speedup**
+
+## Phase 4 Results
+
+Register array pooling in InterpretedCode:
+- `InterpretedCode.getRegisters()` caches register arrays per-code-object
+- Uses ThreadLocal for thread safety with recursion detection
+- Recursive calls fall back to fresh allocation (no contention)
+- `BytecodeInterpreter.execute()` releases registers in finally block
+
+**Benchmark improvement:** 101s → 97s = **~4% speedup**
+
+## Total Performance Improvement
+
+| Phase | Time (s) | Improvement vs. previous |
+|-------|----------|--------------------------|
+| Baseline | 127 | - |
+| Phase 2 (Lazy CallerStack) | 103 | ~19% |
+| Phase 3 (Inline Apply) | 101 | ~2% |
+| Phase 4 (Register Pooling) | 97 | ~4% |
+| **Total (vs. baseline)** | **97** | **~24%** |
+
+## Verification
+
+After each optimization:
+1. Run `make` to ensure no regressions
+2. Re-run benchmark to measure improvement
+3. Re-profile to confirm hotspot reduction
+
+## Related Files
+
+- `src/main/java/org/perlonjava/backend/bytecode/BytecodeInterpreter.java`
+- `src/main/java/org/perlonjava/backend/bytecode/InterpretedCode.java`
+- `src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeCode.java`
+- `src/main/java/org/perlonjava/runtime/runtimetypes/CallerStack.java`
+- `src/main/java/org/perlonjava/backend/bytecode/InterpreterState.java`
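
The lazy CallerStack idea from Phase 2 (store a lambda on every call, compute the caller information only if `caller()` asks for it) can be sketched as follows. This is an illustration under stated assumptions, not the code from this commit: only the name `pushLazy` comes from the design doc, while the class layout, the `CallerInfo` fields, and the `caller(int)` accessor are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.function.Supplier;

// Hypothetical sketch: not PerlOnJava's actual CallerStack implementation.
class LazyCallerStackSketch {

    /** Stand-in for caller() data; the real CallerInfo may carry different fields. */
    static final class CallerInfo {
        final String pkg;
        final String file;
        final int line;

        CallerInfo(String pkg, String file, int line) {
            this.pkg = pkg;
            this.file = file;
            this.line = line;
        }
    }

    /** A frame that computes its CallerInfo at most once, on first access. */
    private static final class Frame {
        private Supplier<CallerInfo> pending;
        private CallerInfo resolved;

        Frame(Supplier<CallerInfo> pending) {
            this.pending = pending;
        }

        CallerInfo resolve() {
            if (pending != null) {
                resolved = pending.get();
                pending = null; // drop the lambda so captured state can be collected
            }
            return resolved;
        }
    }

    private final Deque<Frame> frames = new ArrayDeque<>();

    /** Cheap push on every call: only stores the lambda, computes nothing. */
    void pushLazy(Supplier<CallerInfo> supplier) {
        frames.push(new Frame(supplier));
    }

    /** Pop never resolves the entry, so unused frames cost no line-number work. */
    void pop() {
        frames.pop();
    }

    /** Resolve the frame `level` steps below the top (0 = innermost), or null if absent. */
    CallerInfo caller(int level) {
        Iterator<Frame> it = frames.iterator(); // iterates from the most recent push
        for (int i = 0; it.hasNext(); i++) {
            Frame frame = it.next();
            if (i == level) {
                return frame.resolve();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        LazyCallerStackSketch stack = new LazyCallerStackSketch();
        int[] lookups = {0}; // counts how often the "expensive" lookup runs

        // Simulate 1000 subroutine calls that never invoke caller()
        for (int pc = 0; pc < 1000; pc++) {
            final int line = pc + 1;
            stack.pushLazy(() -> { lookups[0]++; return new CallerInfo("main", "t.pl", line); });
            stack.pop();
        }
        System.out.println("lookups after 1000 calls: " + lookups[0]); // prints 0

        // One call that does invoke caller(): the lookup runs exactly once
        stack.pushLazy(() -> { lookups[0]++; return new CallerInfo("main", "t.pl", 42); });
        CallerInfo info = stack.caller(0);
        stack.caller(0); // second access reuses the cached result
        System.out.println("line " + info.line + ", lookups: " + lookups[0]); // prints line 42, lookups: 1
    }
}
```

The sketch trades one small allocation per call (the frame and its lambda) for skipping the line-number lookup entirely on calls that never reach `caller()`, which is the case the Phase 2 numbers suggest dominates the benchmark.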