Skip to content

Latest commit

 

History

History
671 lines (515 loc) · 20.8 KB

File metadata and controls

671 lines (515 loc) · 20.8 KB

Control Flow Implementation - Complete Guide

Last Updated: 2026-02-04 Status: ✅ PRODUCTION READY - FULLY OPTIMIZED WITH BLOCK-LEVEL DISPATCHERS Test Pass Rate: 100% (2006/2006 unit tests)


Overview

PerlOnJava implements Perl's control flow operators (last, next, redo, goto) using a tagged return value approach with block-level shared dispatchers. This provides:

  • Zero-overhead local control flow (plain JVM GOTO)
  • Efficient non-local control flow (shared block-level dispatchers)
  • Tail call optimization (constant stack space)
  • Perfect Perl semantics (all control flow works correctly)
  • Optimal bytecode size (dispatcher sharing eliminates redundancy)

Block-Level Dispatcher Optimization

The Problem with Per-Call Dispatchers

Original approach: Each call site had its own complete dispatcher (~150 bytes each).

Example:

for (1..3) {
    A();  # 150 bytes of dispatcher code
    B();  # 150 bytes of dispatcher code (identical!)
    C();  # 150 bytes of dispatcher code (identical!)
    D();  # 150 bytes of dispatcher code (identical!)
}

Total: 600 bytes of mostly redundant code.

Block-Level Dispatcher Solution ✅ IMPLEMENTED

Key insight: All calls within the same block with the same visible loops can share ONE dispatcher!

Implementation:

Call sites with same loop state:
  A(): check (~20 bytes) + GOTO blockDispatcher
  B(): check (~20 bytes) + GOTO blockDispatcher (reuses same!)
  C(): check (~20 bytes) + GOTO blockDispatcher (reuses same!)
  D(): check (~20 bytes) + GOTO blockDispatcher (reuses same!)

Block dispatcher (emitted once): ~150 bytes
Skip GOTO: ~3 bytes

Total: 4×20 + 150 + 3 = 233 bytes
Savings: 600 - 233 = 367 bytes (61% reduction!)

How it works:

  1. Compute a signature for current loop state (visible loops)
  2. Check if dispatcher already exists for that signature
  3. If not, create new dispatcher label and emit it after first use
  4. All subsequent calls with same signature jump to shared dispatcher
  5. Dispatcher stays within loop scope (no frame computation issues)

Real-world measurements:

  • Test with 4 sequential calls in a loop
  • Master (per-call dispatchers): 2232 bytecode lines
  • Block-level dispatchers: 2139 bytecode lines
  • Savings: 93 lines (4.2%)
  • CHECKCAST operations: 23 → 17 (26% reduction)

Why Method-Level Centralization Doesn't Work

We also investigated centralizing to a single dispatcher at the method's returnLabel. Here's why it was rejected:

Problems:

  1. Frame computation issues: Jumping from returnLabel (outside loop scope) back to loop labels (inside loop scope) causes "Bad local variable type" errors
  2. Larger bytecode: Central dispatcher must check ALL loops in the method, not just visible ones
  3. Less optimal: For typical methods, the overhead exceeds savings

Block-level approach is superior because:

  1. Dispatcher stays WITHIN loop scope (no frame issues)
  2. Only checks loops visible at that block level
  3. Backward jumps (redo) work correctly
  4. Achieves best of both worlds: sharing where beneficial, localized where necessary

Current Implementation Details

  1. Local Control Flow (within same method)

    • Direct JVM GOTO instructions
    • Zero runtime overhead
    • Handles 99% of control flow cases
  2. Non-Local Control Flow (across method boundaries)

    • Returns RuntimeControlFlowList marker
    • Block-level shared dispatchers
    • ~20 bytes per call site (check only)
    • ~150 bytes per unique loop state (dispatcher)
  3. Tail Call Optimization

    • Trampoline loop at method's returnLabel
    • Prevents stack overflow for recursive goto &NAME
    • Constant stack space

Core Components

Runtime Classes

ControlFlowType Enum

public enum ControlFlowType {
    LAST,      // Exit loop (ordinal 0)
    NEXT,      // Continue to next iteration (ordinal 1)
    REDO,      // Restart current iteration (ordinal 2)
    GOTO,      // Jump to label or named goto (ordinal 3)
    TAILCALL,  // Tail call optimization (ordinal 4)
    RETURN     // Non-local return from map/grep block (ordinal 5)
}

ControlFlowMarker

public class ControlFlowMarker {
    ControlFlowType type;
    String label;              // Loop/block label (may be null)
    RuntimeScalar codeRef;     // Code reference for TAILCALL
    RuntimeArray args;         // Arguments for TAILCALL
    String fileName;           // Source location for errors
    int lineNumber;
}

RuntimeControlFlowList

public class RuntimeControlFlowList extends RuntimeList {
    ControlFlowMarker marker;
    RuntimeBase returnValue;   // For RETURN type (non-local return from map/grep)
}

Code Generation

EmitControlFlow.java

Emits control flow operators:

  • Checks if label is visible in current scope
  • Local: Emits JVM GOTO directly
  • Non-local: Creates RuntimeControlFlowList and returns it

EmitSubroutine.java

Block-level shared dispatcher (~20 bytes per call + ~150 bytes per unique loop state):

At each call site:

// After RuntimeCode.apply() returns:
ASTORE controlFlowTempSlot     // Store result
ALOAD controlFlowTempSlot
INVOKEVIRTUAL isNonLocalGoto()
IFEQ notControlFlow            // Not marked, continue

// Marked: jump to block-level dispatcher
GOTO blockDispatcher           // ~20 bytes per call

notControlFlow:
ALOAD controlFlowTempSlot
// Continue with normal processing
GOTO skipDispatcher            // Skip over dispatcher code

Block dispatcher (emitted once per unique loop state):

blockDispatcher:
// Get control flow type ordinal
ALOAD controlFlowTempSlot
CHECKCAST RuntimeControlFlowList
INVOKEVIRTUAL getControlFlowType()
INVOKEVIRTUAL ordinal()
ISTORE controlFlowActionSlot

// Only handle LAST/NEXT/REDO locally. Others propagate.
ILOAD controlFlowActionSlot
ICONST_2
IF_ICMPGT propagateToCaller
// Higher ordinals (GOTO=3, TAILCALL=4, RETURN=5) are handled separately:
// GOTO and TAILCALL propagate to caller; RETURN is unwrapped for
// non-map/grep blocks or propagated for map/grep blocks.

// Loop through visible loop labels
for each visible loop {
    // Check if marker matches label
    ALOAD controlFlowTempSlot
    CHECKCAST RuntimeControlFlowList
    LDC loopLabel (or ACONST_NULL)
    INVOKEVIRTUAL matchesLabel()
    IFEQ nextLoopCheck

    // Match found: dispatch by type
    ILOAD controlFlowActionSlot
    IF (type == LAST) GOTO lastLabel
    IF (type == NEXT) GOTO nextLabel
    IF (type == REDO) GOTO redoLabel

    nextLoopCheck:
}

// No match: propagate to caller
propagateToCaller:
ALOAD controlFlowTempSlot
ASTORE returnValueSlot
GOTO returnLabel

skipDispatcher:
// Normal execution continues here

Key advantages:

  • Multiple calls share ONE dispatcher (massive bytecode savings)
  • Checks only loops visible at call site
  • All jumps are within loop scope (no frame issues)
  • Backward jumps (redo) work because local variables are still in scope
  • Loop state signature uses identity hash to identify unique loop states

Implementation details:

  • JavaClassInfo maintains a map of loop state signatures to dispatcher labels
  • Loop state signature: concatenation of loop label names + identity hash codes
  • First call with a signature creates and emits the dispatcher
  • Subsequent calls with same signature reuse the existing dispatcher

EmitterMethodCreator.java

Return label — handles normal returns and propagation:

returnLabel:
ALOAD returnListSlot
INVOKEVIRTUAL isNonLocalGoto()
IFEQ normalReturn

// For eval blocks: check if RETURN type — if so, propagate
// Otherwise: non-local control flow escaped eval, set $@ and return empty
// For regular subs: propagate marker to caller
ARETURN

normalReturn:
ARETURN

Tail call trampoline — emitted at each call site in EmitSubroutine.java (not at returnLabel):

// At each call site, after isNonLocalGoto check:
ALOAD controlFlowTempSlot
CHECKCAST RuntimeControlFlowList
INVOKEVIRTUAL getControlFlowType()
INVOKEVIRTUAL ordinal()
ICONST_4                          // TAILCALL ordinal
IF_ICMPNE blockDispatcher         // Not a tail call, go to dispatcher

tailcallLoop:
ALOAD controlFlowTempSlot
CHECKCAST RuntimeControlFlowList
INVOKEVIRTUAL getTailCallCodeRef()
ASTORE codeRefSlot
INVOKEVIRTUAL getTailCallArgs()
ASTORE argsSlot
// Re-invoke
ALOAD codeRefSlot
ALOAD argsSlot
INVOKESTATIC RuntimeCode.apply(...)
ASTORE controlFlowTempSlot

// Check if result is another tail call
ALOAD controlFlowTempSlot
INVOKEVIRTUAL isNonLocalGoto()
IFEQ notControlFlow
// Get type ordinal, check if still TAILCALL
ICONST_4
IF_ICMPEQ tailcallLoop           // Loop if still TAILCALL
GOTO blockDispatcher              // Otherwise dispatch normally

How It Works

Example 1: Local Control Flow (Fast Path)

for my $i (1..10) {
    last if $i > 5;  # Local control flow
}

Generated bytecode:

ILOAD i
ICONST 5
IF_ICMPLE continueLoop
GOTO lastLabel        # Direct JVM GOTO - zero overhead!

Example 2: Non-Local Control Flow

sub inner { last }

for my $i (1..10) {
    inner();         # Non-local control flow
    print "$i\n";
}

Generated bytecode:

At call site:

INVOKESTATIC RuntimeCode.apply(...)  # Call inner()
ASTORE controlFlowTempSlot          # Store result
ALOAD controlFlowTempSlot
INVOKEVIRTUAL isNonLocalGoto()
IFEQ notControlFlow

GOTO blockDispatcher                 # Jump to block-level dispatcher
notControlFlow:
ALOAD controlFlowTempSlot

Example 3: Tail Call

sub factorial {
    my ($n, $acc) = @_;
    return $acc if $n <= 1;
    goto &factorial, $n-1, $n*$acc;  # Tail call
}

Trampoline loop at returnLabel executes tail calls iteratively:

tailcallLoop:
ALOAD controlFlowTempSlot
INVOKEVIRTUAL getTailCallCodeRef()
ASTORE codeRefSlot
INVOKEVIRTUAL getTailCallArgs()
ASTORE argsSlot

# Re-invoke
ALOAD codeRefSlot
ALOAD argsSlot
INVOKESTATIC RuntimeCode.apply(...)
ASTORE returnListSlot

# Check if result is another tail call
ALOAD returnListSlot
INVOKEVIRTUAL isNonLocalGoto()
IFEQ normalReturn
# Get ordinal
ICONST_4
IF_ICMPEQ tailcallLoop  # Loop if still TAILCALL

# Not TAILCALL anymore, dispatch via block dispatcher

Non-Local Return from Map/Grep Blocks

When return is used inside a map or grep block, it should return from the enclosing subroutine, not just the block. This uses the RETURN control flow type:

  • Inside map/grep blocks (isMapGrepBlock), return creates a RuntimeControlFlowList with ControlFlowType.RETURN and carries the return value in returnValue
  • The block dispatcher recognizes RETURN (ordinal 5) and unwraps the return value if the current context is a normal subroutine (not map/grep)
  • If unwrapping fails, PerlNonLocalReturnException is thrown for stack unwinding through Java-level map/grep calls

Performance

Bytecode Size

Block-Level Dispatcher Optimization:

Per call site:

  • Simple check: ~20 bytes (ASTORE, ALOAD, INVOKEVIRTUAL isNonLocalGoto, IFEQ, GOTO)
  • Block dispatcher (shared): ~150 bytes (emitted once per unique loop state)
  • Skip GOTO: ~3 bytes

For a block with N calls sharing the same loop state:

  • Total: 20N + 150 + 3 bytes
  • Compare to old approach: 150N bytes
  • Net savings: 130N - 153 bytes

Examples:

  • N=1 (single call): 173 bytes vs 150 bytes = 23 bytes WORSE (acceptable for simplicity)
  • N=2 (two calls): 193 bytes vs 300 bytes = 107 bytes saved (36%)
  • N=4 (four calls): 233 bytes vs 600 bytes = 367 bytes saved (61%)
  • N=10 (ten calls): 353 bytes vs 1500 bytes = 1147 bytes saved (76%)

Real-world measurements:

  • Test with 4 sequential calls in a loop: for { A(); B(); C(); D(); }
    • Master (per-call dispatchers): 2232 bytecode lines
    • Block-level dispatchers: 2139 bytecode lines
    • Savings: 93 lines (4.2%)
  • CHECKCAST operations: 23 → 17 (26% reduction)
  • Complex nested loops (3 levels, 2 calls): 1374 lines (no regression)

When it helps most:

  • Multiple calls in tight sequence (common in real code)
  • Loops with multiple function calls in body
  • Blocks with 2+ calls that could trigger control flow

Trade-offs:

  • Single call: slightly worse (23 bytes overhead for dispatcher infrastructure)
  • Multiple calls: increasingly better as N grows
  • Overall: net win for typical Perl code patterns

Runtime Performance

  • Local control flow: Zero overhead (plain JVM GOTO)
  • Non-local control flow:
    • One isNonLocalGoto() check per call site (~5 CPU cycles)
    • One GOTO to shared dispatcher (if marked)
    • Shared dispatcher logic executes once (not per call)
    • O(1) dispatch regardless of loop depth
    • Conditional branch chain at dispatcher
  • Tail calls: Iterative trampoline (constant stack space)

Critical Design Decisions

Why No Stack Manipulation?

Early attempts used DUP, POP, SWAP operations which caused ASM frame computation failures:

// BAD - breaks ASM:
DUP                           // Stack: [result, result]
INSTANCEOF RuntimeControlFlowList
IFEQ notMarked
POP                           // Stack heights differ at merge point!

Solution: Use only local variable slots (ALOAD/ASTORE):

// GOOD - ASM-friendly:
ASTORE tempSlot               // Stack: []
ALOAD tempSlot                // Stack: [result]
INSTANCEOF RuntimeControlFlowList
IFEQ notMarked
ALOAD tempSlot                // Stack: [result]

Key principle: All control flow paths must arrive at labels with identical stack heights.

Why Block-Level Shared Dispatchers?

Old approach: Check and dispatch at each call site (150 bytes each)

New approach:

  1. Call site: Simple check + jump to block dispatcher (~20 bytes)
  2. Block-level dispatcher handles all types using conditional branches (emitted once per unique loop state)

Benefits:

  • Massive bytecode savings (130 bytes per call)
  • Efficient dispatch using conditional branch chain
  • Better JIT compilation (less bytecode to optimize)
  • Single point of control (easier to maintain/debug)

Why Separate controlFlowTempSlot?

The controlFlowTempSlot holds the RuntimeControlFlowList during dispatch, separate from returnListSlot. This is necessary because:

  1. Label matching needs the original marker
  2. Tail call loop re-uses the slot for iteration
  3. Error handling needs to build error messages from marker

Feature Flags

// EmitSubroutine.java
ENABLE_CONTROL_FLOW_CHECKS = true;  // Call-site checks

// EmitterMethodCreator.java
ENABLE_TAILCALL_TRAMPOLINE = true;  // Tail call optimization

// EmitControlFlow.java
ENABLE_TAGGED_RETURNS = true;       // Tagged return values
DEBUG_CONTROL_FLOW = false;         // Debug output

Test Coverage

Unit Tests: 100% Pass (2006/2006)

Test files:

  • unit/control_flow.t - Comprehensive control flow tests
  • unit/tail_calls.t - Tail call optimization
  • unit/loop_modifiers.t - Statement modifiers
  • Plus 150+ other test files exercising control flow

Coverage:

  • ✅ Local last/next/redo (all loop types)
  • ✅ Labeled control flow
  • ✅ Non-local control flow through subroutines
  • ✅ goto LABEL, goto &NAME, goto SUB
  • ✅ Tail call optimization (recursive, mutual recursion)
  • ✅ Error messages (invalid usage)
  • ✅ Nested loops
  • ✅ eval blocks
  • ✅ Mixed control flow scenarios

Design Lessons

  1. ASM's COMPUTE_FRAMES is fragile with stack manipulation after method calls
  2. Local variable slots are ASM-friendly, stack operations are not
  3. Centralized dispatch is more efficient than per-call-site dispatch
  4. Conditional branch chains efficiently dispatch control flow types
  5. Zero-overhead local flow is achievable with direct GOTO

Implementation Files

Core Implementation

  • src/main/java/org/perlonjava/runtime/runtimetypes/ControlFlowType.java
  • src/main/java/org/perlonjava/runtime/runtimetypes/ControlFlowMarker.java
  • src/main/java/org/perlonjava/runtime/runtimetypes/RuntimeControlFlowList.java

Code Generation

  • src/main/java/org/perlonjava/backend/jvm/EmitControlFlow.java - Emit control flow operators
  • src/main/java/org/perlonjava/backend/jvm/EmitSubroutine.java - Call-site checks
  • src/main/java/org/perlonjava/backend/jvm/EmitterMethodCreator.java - Return label and eval handling

Tests

  • src/test/resources/unit/control_flow.t
  • src/test/resources/unit/tail_calls.t
  • src/test/resources/unit/loop_modifiers.t

Current Optimizations

Fast Path for Unlabeled Control Flow ✅ IMPLEMENTED

Most last, next, redo statements don't use labels and target the innermost loop. The implementation now includes a fast path:

// Before the loop: check if marker.label == null
if (marker.getControlFlowLabel() == null) {
    // Dispatch directly to innermost loop
    // Saves ~13N bytes per call where N = number of visible labels
}
// Otherwise, do full loop label search

Benefits:

  • Optimizes the 95% case (unlabeled control flow)
  • Saves ~10-15 bytes per unlabeled control flow call per visible label
  • For methods with 50 calls and 5 labels each: saves ~3,000+ bytes

Implementation: EmitSubroutine.java lines 429-488


Future Optimizations (Optional)

1. Call-Site Optimization

Skip control flow checks for calls that provably never return markers:

  • Built-in functions (print, scalar, etc.)
  • Methods marked as "control-flow-safe"

Benefit: Eliminate ~20 bytes per safe call

2. Dispatcher Optimization

Use lookup tables for label matching instead of linear search:

// Pre-compute at compile time:
Map<String, Label> labelMap = {
    "SKIP" -> skipLastLabel,
    "OUTER" -> outerLastLabel,
    ...
}

Benefit: O(1) label lookup vs O(N) search

3. Selective Control Flow

Only emit call-site checks in methods that have visible labeled blocks:

if (ctx.javaClassInfo.hasLabeledBlocks) {
    emitControlFlowCheck();
}

Benefit: Eliminate checks in 95%+ of methods


Why Full Centralization Doesn't Work

Initial idea: Move all control flow checking to a single centralized dispatcher at the method's returnLabel to reduce per-call-site bytecode.

Problem: The centralized dispatcher would need to check ALL loop labels in the entire method, not just the labels visible at each call site. For complex methods with many nested loops:

  • Distributed block-level approach: Each of N calls checks M visible labels = N × M × ~13 bytes
  • Fully centralized approach: Each of N calls: ~20 bytes + central dispatcher checking ALL L labels = N × 20 + L × 3 × ~13 bytes

The fully centralized approach only helps when:

N × M × 13 > N × 20 + L × 3 × 13
N × (M × 13 - 20) > L × 39
N > L × 39 / (M × 13 - 20)

For typical values (M=5, L=20): N > 20 × 39 / 45 ≈ 17.3

So full centralization only helps when there are 18+ call sites AND each call site has fewer visible labels than the method has total labels. In practice, this rarely occurs.

Conclusion: The distributed block-level dispatcher approach with fast-path optimization (implemented above) is superior.


Comparison with Other Perl Implementations

Feature PerlOnJava JPerl perl5 (C)
Local control flow ✅ Zero overhead
Non-local control flow ✅ Full support ❌ Limited
Tail call optimization ✅ Trampoline
Bytecode size ✅ Optimized with sharing ⚠️ Large N/A
Dispatcher sharing ✅ Block-level N/A
Test compatibility ✅ 100% ⚠️ ~80% ✅ 100%

Conclusion

The current control flow implementation represents a mature, production-ready solution that:

  • Achieves 100% test pass rate (2006/2006 unit tests)
  • Provides zero-overhead local control flow (direct JVM GOTO)
  • Implements efficient non-local control flow with block-level dispatcher sharing
  • Saves up to 61% bytecode for blocks with multiple calls (4+ calls)
  • Includes tail call optimization for recursive goto &NAME
  • Uses ASM-friendly bytecode patterns (no frame computation issues)
  • Has minimal code footprint with intelligent sharing

Key Innovation: Block-level dispatcher sharing is a significant breakthrough that provides both correctness (all tests pass) and efficiency (massive bytecode savings for common patterns). By sharing dispatchers among calls with the same visible loops, we achieve the best of both worlds:

  • Local scope (no frame issues)
  • Code sharing (eliminate redundancy)
  • Optimal performance (only check visible loops)

When it shines:

  • Multiple function calls in loop bodies (common pattern)
  • Sequential calls like A(); B(); C(); D();
  • Real-world code with 2+ calls per block: 36-76% bytecode savings

Status: Ready for production use. No known limitations.


References