27 changes: 12 additions & 15 deletions .cognition/skills/debug-exiftool/SKILL.md
@@ -23,7 +23,7 @@ You are debugging failures in the Image::ExifTool test suite running under PerlO

**IMPORTANT: Never push directly to master. Always use feature branches and PRs.**

**IMPORTANT: Always commit or stash changes BEFORE switching branches.** If `git stash pop` has conflicts, uncommitted changes may be lost.
**IMPORTANT: Always commit or save changes BEFORE switching branches.** Use `git diff > backup.patch` to save uncommitted work, or commit to a WIP branch.

```bash
git checkout -b fix/exiftool-issue-name
@@ -42,7 +42,7 @@ gh pr create --title "Fix: description" --body "Details"
- **ExifTool reference output**: `Image-ExifTool-13.44/t/<TestName>_N.out` (expected tag output per sub-test)
- **PerlOnJava unit tests**: `src/test/resources/unit/*.t` (make suite, 154 tests)
- **Perl5 core tests**: `perl5_t/t/` (Perl 5 compatibility suite, run via `make test-gradle`)
- **Fat JAR**: `target/perlonjava-3.0.0.jar`
- **Fat JAR**: `build/libs/perlonjava-*.jar` (version varies)
- **Launcher script**: `./jperl` (resolves JAR path, sets `$^X`)

## Building PerlOnJava
@@ -64,16 +64,13 @@ make dev # Quick build - compiles only, NO tests
### Single test
```bash
cd Image-ExifTool-13.44
java -jar ../target/perlonjava-3.0.0.jar -Ilib t/Writer.t
# Or using the launcher:
cd Image-ExifTool-13.44
../jperl -Ilib t/Writer.t
```

### Single test with timeout (prevents infinite loops)
```bash
cd Image-ExifTool-13.44
timeout 120 java -jar ../target/perlonjava-3.0.0.jar -Ilib t/XMP.t
timeout 120 ../jperl -Ilib t/XMP.t
```

### All ExifTool tests in parallel with summary
@@ -82,7 +79,7 @@ cd Image-ExifTool-13.44
mkdir -p /tmp/exiftool_results
for t in t/*.t; do
name=$(basename "$t" .t)
( output=$(timeout 120 java -jar ../target/perlonjava-3.0.0.jar -Ilib "$t" 2>&1)
( output=$(timeout 120 ../jperl -Ilib "$t" 2>&1)
ec=$?
if [ $ec -eq 124 ]; then echo "$name TIMEOUT"
else
@@ -133,7 +130,7 @@ cd Image-ExifTool-13.44
perl -Ilib t/Writer.t 2>&1 | grep -E '^(not )?ok ' > /tmp/perl_results.txt

# Run with PerlOnJava
java -jar ../target/perlonjava-3.0.0.jar -Ilib t/Writer.t 2>&1 | grep -E '^(not )?ok ' > /tmp/jperl_results.txt
../jperl -Ilib t/Writer.t 2>&1 | grep -E '^(not )?ok ' > /tmp/jperl_results.txt

# Diff
diff /tmp/perl_results.txt /tmp/jperl_results.txt
@@ -145,7 +142,7 @@ For individual Perl constructs:
perl -e 'my @a = (1,2,3); $_ *= 2 foreach @a; print "@a\n"'

# PerlOnJava
java -jar target/perlonjava-3.0.0.jar -e 'my @a = (1,2,3); $_ *= 2 foreach @a; print "@a\n"'
./jperl -e 'my @a = (1,2,3); $_ *= 2 foreach @a; print "@a\n"'
```

For comparing `.failed` output files against `.out` reference files:
@@ -188,8 +185,8 @@ diff t/Writer_11.out t/Writer_11.failed
# Pass JVM options via JPERL_OPTS
JPERL_OPTS="-Xmx512m" ./jperl script.pl

# Combine env vars
JPERL_SHOW_FALLBACK=1 JPERL_EVAL_TRACE=1 java -jar target/perlonjava-3.0.0.jar -Ilib t/Writer.t 2>&1
# Combine env vars (inside ExifTool dir)
JPERL_SHOW_FALLBACK=1 JPERL_EVAL_TRACE=1 ../jperl -Ilib t/Writer.t 2>&1
```

## Test File Anatomy
@@ -235,7 +232,7 @@ The `check()` function compares extracted tags against reference files `t/<TestN

5. **Isolate the Perl construct** causing the failure. Write a minimal reproducer:
```bash
java -jar target/perlonjava-3.0.0.jar -e '$_ = "abc"; /b/g; print pos(), "\n"'
./jperl -e '$_ = "abc"; /b/g; print pos(), "\n"'
perl -e '$_ = "abc"; /b/g; print pos(), "\n"'
```

@@ -348,7 +345,7 @@ All geotag tests except module loading and 2 others fail. All use `Time::Local`
Writing non-default language entries to XMP lang-alt lists fails silently. Only `x-default` works. The write path in `WriteXMP.pl` uses `pos()` after `m//g` for path tracking. Test with:
```bash
perl -e '$_ = "a/b/c"; m|/|g; print pos(), "\n"' # should print 2
java -jar target/perlonjava-3.0.0.jar -e '$_ = "a/b/c"; m|/|g; print pos(), "\n"'
./jperl -e '$_ = "a/b/c"; m|/|g; print pos(), "\n"'
```

#### P4: XMP lang-alt Bag index tracking (3 tests: XMP 36,38,50)
@@ -471,7 +468,7 @@ In PerlOnJava Java code (temporary, never commit):
System.err.println("DEBUG: value=" + value);
```

To trace which subs hit interpreter fallback:
To trace which subs hit interpreter fallback (inside ExifTool dir):
```bash
JPERL_SHOW_FALLBACK=1 java -jar target/perlonjava-3.0.0.jar -Ilib t/Writer.t 2>&1 | grep FALLBACK
JPERL_SHOW_FALLBACK=1 ../jperl -Ilib t/Writer.t 2>&1 | grep FALLBACK
```
7 changes: 4 additions & 3 deletions .cognition/skills/debug-perlonjava/SKILL.md
@@ -35,7 +35,7 @@ gh pr create --title "Fix: description" --body "Details"
- **PerlOnJava source**: `src/main/java/org/perlonjava/` (compiler, bytecode interpreter, runtime)
- **Unit tests**: `src/test/resources/unit/*.t` (run via `make`)
- **Perl5 core tests**: `perl5_t/t/` (Perl 5 compatibility suite)
- **Fat JAR**: `target/perlonjava-3.0.0.jar`
- **Fat JAR**: `build/libs/perlonjava-*.jar` (version varies)
- **Launcher script**: `./jperl`

## Building
@@ -191,9 +191,10 @@ This helps identify operator precedence issues and incorrect parsing.

### 6. Profile with JFR (for performance issues)
```bash
# Record profile
# Record profile (locate the JAR first; the version varies)
JAR=$(ls build/libs/perlonjava-*.jar | head -1)
$JAVA_HOME/bin/java -XX:StartFlightRecording=duration=10s,filename=profile.jfr \
-jar target/perlonjava-3.0.0.jar script.pl
  -jar "$JAR" script.pl

# Analyze hotspots
$JAVA_HOME/bin/jfr print --events jdk.ExecutionSample profile.jfr 2>&1 | \
2 changes: 1 addition & 1 deletion .cognition/skills/interpreter-parity/SKILL.md
@@ -36,7 +36,7 @@ gh pr create --title "Fix interpreter: description" --body "Details"

- **PerlOnJava source**: `src/main/java/org/perlonjava/` (compiler, bytecode interpreter, runtime)
- **Unit tests**: `src/test/resources/unit/*.t` (155 tests, run via `make`)
- **Fat JAR**: `target/perlonjava-3.0.0.jar`
- **Fat JAR**: `build/libs/perlonjava-*.jar` (version varies)
- **Launcher script**: `./jperl`

## Building
49 changes: 40 additions & 9 deletions .cognition/skills/profile-perlonjava/SKILL.md
@@ -14,7 +14,7 @@ Profile and optimize PerlOnJava runtime performance using Java Flight Recorder.

**IMPORTANT: Never push directly to master. Always use feature branches and PRs.**

**IMPORTANT: Always commit or stash changes BEFORE switching branches.** If `git stash pop` has conflicts, uncommitted changes may be lost.
**IMPORTANT: Always commit or save changes BEFORE switching branches.** Use `git diff > backup.patch` to save uncommitted work, or commit to a WIP branch.

```bash
git checkout -b perf/optimization-name
@@ -34,12 +34,23 @@ gh pr create --title "Perf: description" --body "Details"
### 1. Run with JFR Profiling

```bash
cd /Users/fglock/projects/PerlOnJava2
cd /Users/fglock/projects/PerlOnJava

# Find the jar file (version changes with releases)
JAR=$(ls build/libs/perlonjava-*.jar | head -1)

# Profile a long-running script (adjust duration as needed)
java -XX:+FlightRecorder \
-XX:StartFlightRecording=duration=60s,filename=profile.jfr \
-jar target/perlonjava-3.0.0.jar <script.pl> [args...]
  -jar "$JAR" <script.pl> [args...]

# Or use the wrapper script with JFR options via JAVA_OPTS
JAVA_OPTS="-XX:+FlightRecorder -XX:StartFlightRecording=duration=60s,filename=profile.jfr" \
./jperl <script.pl> [args...]

# For interpreter mode profiling
JAVA_OPTS="-XX:+FlightRecorder -XX:StartFlightRecording=duration=60s,filename=profile.jfr" \
./jperl --interpreter <script.pl> [args...]
```

### 2. Analyze with JFR Tools
@@ -65,22 +76,24 @@ $JFR print --events jdk.ExecutionSample profile.jfr 2>&1 | \

| Category | Methods to Watch | Optimization Approach |
|----------|------------------|----------------------|
| **Number parsing** | `Long.parseLong`, `Double.parseDouble`, `NumberParser.parseNumber` | Cache numeric values, avoid stringnumber conversions |
| **Number parsing** | `Long.parseLong`, `Double.parseDouble`, `NumberParser.parseNumber` | Cache numeric values, avoid string->number conversions |
| **Type checking** | `ScalarUtils.looksLikeNumber`, `RuntimeScalar.getDefinedBoolean` | Fast-path for common types (INTEGER, DOUBLE) |
| **Bitwise ops** | `BitwiseOperators.*` | Ensure values stay as INTEGER type |
| **Regex** | `Pattern.match`, `Matcher.matches` | Reduce unnecessary regex checks |
| **Loop control** | `RuntimeControlFlowRegistry.checkLoopAndGetAction` | ThreadLocal overhead |
| **Array ops** | `ArrayList.grow`, `Arrays.copyOf` | Pre-size arrays, reduce allocations |
| **Interpreter** | `BytecodeInterpreter.execute`, opcode handlers | Reduce dispatch overhead, inline hot paths |

### 4. Common Runtime Files

| File | Purpose |
|------|---------|
| `src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeScalar.java` | Scalar value representation, getLong/getDouble/getInt |
| `src/main/java/org/perlonjava/runtime/runtimetypes/ScalarUtils.java` | Utility functions like looksLikeNumber |
| `src/main/java/org/perlonjava/runtime/operators/BitwiseOperators.java` | Bitwise operations (&, |, ^, ~, <<, >>) |
| `src/main/java/org/perlonjava/runtime/operators/BitwiseOperators.java` | Bitwise operations (&, \|, ^, ~, <<, >>) |
| `src/main/java/org/perlonjava/runtime/operators/Operator.java` | General operators |
| `src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeArray.java` | Array operations |
| `src/main/java/org/perlonjava/backend/bytecode/BytecodeInterpreter.java` | Interpreter main loop |

### 5. Optimization Patterns

@@ -110,14 +123,20 @@ if (runtimeScalar.type == INTEGER) {
### 6. Benchmark Commands

```bash
cd /Users/fglock/projects/PerlOnJava

# Quick benchmark with closure test
./jperl dev/bench/benchmark_closure.pl

# Interpreter mode benchmark (slower, good for profiling interpreter)
./jperl --interpreter dev/bench/benchmark_closure.pl

# Quick benchmark with life_bitpacked.pl
java -jar target/perlonjava-3.0.0.jar examples/life_bitpacked.pl \
-w 200 -h 200 -g 10000 -r none
./jperl examples/life_bitpacked.pl -w 200 -h 200 -g 10000 -r none

# Multiple runs for consistency
for i in 1 2 3; do
java -jar target/perlonjava-3.0.0.jar examples/life_bitpacked.pl \
-w 200 -h 200 -g 10000 -r none 2>&1 | grep "per second"
./jperl examples/life_bitpacked.pl -w 200 -h 200 -g 10000 -r none 2>&1 | grep "per second"
done
```

@@ -147,3 +166,15 @@ make dev # Quick build - compiles only, NO tests
7. Profile again to verify improvement
8. Run tests to ensure correctness
```

## JVM vs Interpreter Performance

The interpreter mode (`--interpreter`) is typically 20-40x slower than JVM-compiled mode.
The slowdown is expected; interpreter mode is still useful for:
- Testing interpreter-specific code paths
- Debugging interpreter behavior
- Profiling interpreter bottlenecks

Typical timings:
- JVM mode: ~4 seconds for benchmark_closure.pl
- Interpreter mode: ~120-130 seconds for the same benchmark
148 changes: 148 additions & 0 deletions dev/design/INTERPRETER_OPTIMIZATION.md
@@ -0,0 +1,148 @@
# Interpreter Performance Optimization

## Profile Analysis (2026-03-23)

**Benchmark:** `./jperl --interpreter dev/bench/benchmark_closure.pl`
- Interpreter mode: ~127 seconds
- JVM mode: ~4 seconds
- Ratio: ~32x (expected for bytecode interpreter vs JIT-compiled code)

## Top Hotspots by Sample Count

| Samples | Location | Description |
|---------|----------|-------------|
| 90 | `BytecodeInterpreter.execute` | Main interpreter loop |
| 54 | `RuntimeCode.apply` | Subroutine dispatch |
| 39 | `InterpretedCode.apply` | Delegation to interpreter |
| 7 | `getCallSiteInfo` | TreeMap lookup for caller() |
| 5 | `getSourceLocationAccurate` | Line number computation |

## Detailed Hotspot Analysis

### CALL Opcode Handling (BytecodeInterpreter.java lines 816-838)

```
Line 816 (6 samples): ThreadLocal lookup - InterpreterState.currentPackage.get()
Line 834 (7 samples): getCallSiteInfo - TreeMap.floorEntry()
Line 835 (4 samples): CallerStack.push
Line 838 (10 samples): RuntimeCode.apply - actual call
```

### Call Chain Overhead

The subroutine call dispatch has deep indirection:

```
CALL opcode (BytecodeInterpreter)
→ RuntimeCode.apply (54 samples)
→ InterpretedCode.apply (39 samples)
→ BytecodeInterpreter.execute (90 samples)
```

Each call goes through multiple layers before reaching the actual interpreter execution.

## Optimization Plan

### Phase 1: ThreadLocal Caching (High Impact, Low Risk)

**Problem:** `InterpreterState.currentPackage.get()` is called on every CALL opcode.

**Solution:** Cache the package name at the start of execute() and pass it through or use a local variable.

**Files:** `BytecodeInterpreter.java`
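
The caching idea can be sketched in isolation. This is a minimal illustration, assuming a simplified opcode loop; `ThreadLocalCachingSketch`, `CURRENT_PACKAGE`, `OP_CALL`, and the `lookups` counter are hypothetical names, not PerlOnJava's actual code. It only shows that hoisting the `ThreadLocal` read out of the loop reduces one read per CALL to one read per `execute()`.

```java
// Hypothetical sketch: all names here are illustrative, not PerlOnJava's API.
class ThreadLocalCachingSketch {
    static final int OP_NOP = 0;
    static final int OP_CALL = 1;

    // Counts ThreadLocal reads so the two strategies can be compared.
    static int lookups = 0;

    static final ThreadLocal<String> CURRENT_PACKAGE =
            ThreadLocal.withInitial(() -> "main");

    static String currentPackage() {
        lookups++;
        return CURRENT_PACKAGE.get();
    }

    // Before: one ThreadLocal read for every CALL opcode.
    static int runUncached(int[] code) {
        int calls = 0;
        for (int op : code) {
            if (op == OP_CALL) {
                String pkg = currentPackage();
                if (pkg != null) calls++;
            }
        }
        return calls;
    }

    // After: read once at the top of the loop and reuse the local variable.
    static int runCached(int[] code) {
        String pkg = currentPackage();
        int calls = 0;
        for (int op : code) {
            if (op == OP_CALL && pkg != null) calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        int[] code = {OP_CALL, OP_NOP, OP_CALL, OP_CALL};
        lookups = 0;
        runUncached(code);
        System.out.println("uncached ThreadLocal reads: " + lookups);
        lookups = 0;
        runCached(code);
        System.out.println("cached ThreadLocal reads: " + lookups);
    }
}
```

A cached value is only safe if it cannot change while the loop runs. If the package can change mid-execution, caching a reference to a mutable holder rather than the value itself is one way around that.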

### Phase 2: Lazy CallerStack (High Impact, Medium Risk)

**Problem:** `CallerStack.push/pop` and `getCallSiteInfo` happen on EVERY subroutine call, even when `caller()` is never invoked.

**Solution:** Defer CallerStack operations until caller() is actually called:
1. Store call site info in a lightweight structure
2. Only populate CallerStack on-demand when caller() executes
3. Use a "dirty" flag to track if stack needs updating

**Files:** `BytecodeInterpreter.java`, `CallerStack.java`

### Phase 3: Inline Apply Path (Medium Impact, Medium Risk)

**Problem:** Call dispatch goes through multiple indirection layers.

**Solution:** For InterpretedCode, bypass RuntimeCode.apply and call BytecodeInterpreter.execute directly from the CALL opcode handler.

**Files:** `BytecodeInterpreter.java`

### Phase 5: Cache pcToTokenIndex Lookup (Low Impact, Low Risk; originally planned as Phase 4)

**Problem:** `TreeMap.floorEntry()` is O(log n) for line number lookups.

**Solution:** Cache last lookup result since sequential execution often hits nearby PCs.

**Files:** `BytecodeInterpreter.java`
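
One possible shape for the last-lookup cache, as a hedged sketch: remember the key range covered by the previous `floorEntry()` result and answer from it while the PC stays inside that range. The class `PcLookupCacheSketch`, its fields, and the `treeLookups` counter are assumptions made for illustration; the real map layout in PerlOnJava may differ.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a one-entry cache in front of TreeMap.floorEntry().
class PcLookupCacheSketch {
    private final TreeMap<Integer, Integer> pcToTokenIndex;

    // Cached half-open range [cachedLo, cachedHi) and the value it maps to.
    // Starts empty because cachedLo == cachedHi.
    private int cachedLo = 0;
    private int cachedHi = 0;
    private int cachedValue = -1;

    // Number of times the tree was actually consulted.
    int treeLookups = 0;

    PcLookupCacheSketch(TreeMap<Integer, Integer> pcToTokenIndex) {
        this.pcToTokenIndex = pcToTokenIndex;
    }

    // Returns the token index for the greatest key <= pc, or -1 if none.
    int tokenIndexFor(int pc) {
        if (pc >= cachedLo && pc < cachedHi) {
            return cachedValue; // cache hit: no tree traversal
        }
        treeLookups++;
        Map.Entry<Integer, Integer> floor = pcToTokenIndex.floorEntry(pc);
        if (floor == null) {
            return -1;
        }
        // The floor entry stays valid until the next larger key.
        Integer next = pcToTokenIndex.higherKey(pc);
        cachedLo = floor.getKey();
        cachedHi = (next == null) ? Integer.MAX_VALUE : next;
        cachedValue = floor.getValue();
        return cachedValue;
    }

    public static void main(String[] args) {
        TreeMap<Integer, Integer> map = new TreeMap<>();
        map.put(0, 10);
        map.put(5, 11);
        map.put(12, 12);
        PcLookupCacheSketch cache = new PcLookupCacheSketch(map);
        for (int pc = 0; pc < 20; pc++) {
            cache.tokenIndexFor(pc);
        }
        System.out.println("tree lookups for 20 sequential PCs: " + cache.treeLookups);
    }
}
```

A miss costs two tree operations (`floorEntry` plus `higherKey`), so this only pays off when consecutive lookups tend to land in the same range. The map must not be modified after the cache is populated.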

## Implementation Status

### Completed
- [x] Profile analysis (2026-03-23)
- [x] Phase 1: ThreadLocal Caching (2026-03-23) - Cache RuntimeScalar reference, no measurable speedup but cleaner code
- [x] Phase 2: Lazy CallerStack (2026-03-23) - **~19% speedup** (127s → 103s)
- [x] Phase 3: Inline Apply Path (2026-03-23) - **~2% speedup** (103s → 101s)
- [x] Phase 4: Register Array Pooling (2026-03-23) - **~4% speedup** (101s → 97s)

### Pending
- [ ] Phase 5: Cache pcToTokenIndex Lookup (moved from Phase 4)

## Profile Results After Phase 1

A second profile showed `getCallSiteInfo` (16 samples) plus `getSourceLocationAccurate` (15 samples), together ~10% overhead.
That time is spent computing call-site info for `caller()` support on every subroutine call.

## Phase 2 Results

Implemented lazy CallerStack:
- `CallerStack.pushLazy()` stores a lambda that computes CallerInfo on demand
- Line number computation deferred until `caller()` is actually called
- `pop()` doesn't resolve lazy entries (no computation needed)

**Benchmark improvement:** 127s → 103s = **~19% speedup**
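
The lazy-push pattern described above can be sketched as follows. This is an illustration under assumptions, not the actual `CallerStack` class: `LazyCallerStackSketch`, `CallerInfo`, `Entry`, and `peek` are hypothetical names, and only `pushLazy` and `pop` mirror names mentioned in the text.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical sketch of a lazily resolved caller stack.
// Names are illustrative; this is not PerlOnJava's CallerStack.
class LazyCallerStackSketch {
    static final class CallerInfo {
        final String pkg;
        final String file;
        final int line;

        CallerInfo(String pkg, String file, int line) {
            this.pkg = pkg;
            this.file = file;
            this.line = line;
        }
    }

    // Holds a supplier until the first read, then the computed value.
    private static final class Entry {
        private Supplier<CallerInfo> supplier;
        private CallerInfo value;

        Entry(Supplier<CallerInfo> supplier) {
            this.supplier = supplier;
        }

        CallerInfo resolve() {
            if (supplier != null) {
                value = supplier.get(); // expensive work happens here, at most once
                supplier = null;        // drop the lambda and whatever it captured
            }
            return value;
        }
    }

    private final Deque<Entry> stack = new ArrayDeque<>();

    // Cheap: stores the lambda and computes nothing.
    void pushLazy(Supplier<CallerInfo> supplier) {
        stack.push(new Entry(supplier));
    }

    // Cheap: discards the entry without resolving it.
    void pop() {
        stack.pop();
    }

    // Resolves only the requested frame; depth 0 is the most recent call.
    CallerInfo peek(int depth) {
        int i = 0;
        for (Entry e : stack) {
            if (i++ == depth) {
                return e.resolve();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        LazyCallerStackSketch stack = new LazyCallerStackSketch();
        int[] computed = {0};
        stack.pushLazy(() -> {
            computed[0]++;
            return new CallerInfo("main", "script.pl", 42);
        });
        stack.pop();
        System.out.println("computations after push/pop only: " + computed[0]);
    }
}
```

In this sketch a call that never triggers `caller()` pays only for allocating the entry and the lambda. The line-number computation runs when `peek` asks for that frame, and at most once per frame.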

## Phase 3 Results

Inline InterpretedCode apply path in CALL_SUB:
- Check if code is `InterpretedCode` and call `BytecodeInterpreter.execute()` directly
- Bypasses `RuntimeCode.apply()` → `InterpretedCode.apply()` chain

**Benchmark improvement:** 103s → 101s = **~2% speedup**
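
A stripped-down sketch of the fast-path check, assuming a much simpler code model than the real one: `InlineApplySketch`, `Code`, `Interpreted`, and the `fastPathHits` counter are hypothetical, and the "bytecode" here is just an integer increment. It only demonstrates the type test that skips the generic `apply()` hop.

```java
// Hypothetical sketch of the fast-path dispatch; names are illustrative only.
class InlineApplySketch {
    // Generic callable, standing in for the general code-reference type.
    interface Code {
        int apply(int arg);
    }

    // Stand-in for interpreter-backed code.
    static final class Interpreted implements Code {
        final int increment;

        Interpreted(int increment) {
            this.increment = increment;
        }

        // Slow path: generic entry point that delegates to the interpreter.
        @Override
        public int apply(int arg) {
            return execute(this, arg);
        }
    }

    // Counts how often the CALL handler skipped the generic apply() hop.
    static int fastPathHits = 0;

    // Stand-in for the interpreter's main entry point.
    static int execute(Interpreted code, int arg) {
        return arg + code.increment;
    }

    // CALL handler: go straight to execute() when the target is interpreted.
    static int call(Code code, int arg) {
        if (code instanceof Interpreted) {
            fastPathHits++;
            return execute((Interpreted) code, arg);
        }
        return code.apply(arg); // compiled or other code keeps the generic path
    }

    public static void main(String[] args) {
        System.out.println(call(new Interpreted(2), 40));
        System.out.println(call(x -> x * 2, 21));
        System.out.println("fast-path hits: " + fastPathHits);
    }
}
```

The trade-off is that any bookkeeping the generic path performs has to be reproduced, or shown to be unnecessary, on the fast path.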

## Phase 4 Results

Register array pooling in InterpretedCode:
- `InterpretedCode.getRegisters()` caches register arrays per-code-object
- Uses ThreadLocal for thread safety with recursion detection
- Recursive calls fall back to fresh allocation (no contention)
- `BytecodeInterpreter.execute()` releases registers in finally block

**Benchmark improvement:** 101s → 97s = **~4% speedup**
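
The pooling scheme can be illustrated with a small sketch. Everything here is an assumption made for illustration: `RegisterPoolSketch`, `CodeObject`, `releaseRegisters`, and the in-use flag are hypothetical and may not match how `InterpretedCode.getRegisters()` is actually written.

```java
import java.util.Arrays;

// Hypothetical sketch of per-code-object register pooling with recursion detection.
class RegisterPoolSketch {
    static final class CodeObject {
        final int registerCount;

        // One cached array per thread for this code object.
        private final ThreadLocal<Object[]> cached = new ThreadLocal<>();
        // True while the cached array is checked out on this thread.
        private final ThreadLocal<Boolean> inUse =
                ThreadLocal.withInitial(() -> Boolean.FALSE);

        CodeObject(int registerCount) {
            this.registerCount = registerCount;
        }

        Object[] getRegisters() {
            if (inUse.get()) {
                // Recursive call: the pooled array is busy, so allocate a fresh one.
                return new Object[registerCount];
            }
            Object[] regs = cached.get();
            if (regs == null) {
                regs = new Object[registerCount];
                cached.set(regs);
            }
            inUse.set(Boolean.TRUE);
            return regs;
        }

        void releaseRegisters(Object[] regs) {
            // Only the pooled array is returned; fresh arrays are left to the GC.
            if (regs == cached.get()) {
                Arrays.fill(regs, null); // avoid leaking values into the next call
                inUse.set(Boolean.FALSE);
            }
        }
    }

    // Stand-in for the interpreter entry point: recurses `depth` times.
    static int execute(CodeObject code, int depth) {
        Object[] regs = code.getRegisters();
        try {
            regs[0] = depth;
            int below = depth > 0 ? execute(code, depth - 1) : 0;
            int mine = (Integer) regs[0]; // still intact after the recursive call
            return mine + below;
        } finally {
            code.releaseRegisters(regs);
        }
    }

    public static void main(String[] args) {
        System.out.println(execute(new CodeObject(4), 3));
    }
}
```

Releasing in `finally` matters: if an exception escaped without resetting the flag, every later call on that thread would take the fresh-allocation path.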

## Total Performance Improvement

| Phase | Time (s) | Improvement vs. previous step |
|-------|----------|-------------------------------|
| Baseline | 127 | - |
| Phase 2 (Lazy CallerStack) | 103 | 19% |
| Phase 3 (Inline Apply) | 101 | 2% |
| Phase 4 (Register Pooling) | 97 | 4% |
| **Total** | **97** | **~24% vs. baseline** |
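
Each per-phase percentage in the table is relative to the preceding row, while the total is relative to the baseline, which is why the rows do not simply add up. A small sketch of the arithmetic, using only the timings from the table (the class and method names are illustrative):

```java
// Sketch of the arithmetic behind the table; names are illustrative.
class SpeedupTableSketch {
    // Percentage reduction in run time going from `before` to `after` seconds.
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Baseline, then the time after Phases 2, 3, and 4.
        double[] times = {127, 103, 101, 97};
        for (int i = 1; i < times.length; i++) {
            System.out.printf("step %d: %.1f%%%n", i,
                    reductionPercent(times[i - 1], times[i]));
        }
        System.out.printf("total: %.1f%%%n",
                reductionPercent(times[0], times[times.length - 1]));
    }
}
```

Rounded to whole percentages, these reproduce the 19%, 2%, 4%, and ~24% figures.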

## Verification

After each optimization:
1. Run `make` to ensure no regressions
2. Re-run benchmark to measure improvement
3. Re-profile to confirm hotspot reduction

## Related Files

- `src/main/java/org/perlonjava/backend/bytecode/BytecodeInterpreter.java`
- `src/main/java/org/perlonjava/backend/bytecode/InterpretedCode.java`
- `src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeCode.java`
- `src/main/java/org/perlonjava/runtime/runtimetypes/CallerStack.java`
- `src/main/java/org/perlonjava/backend/bytecode/InterpreterState.java`