[Bug] Unicode property escapes (\p{L}, \p{N}, etc.) not supported in preg_* functions #181

@nahime0

Description

PCRE patterns containing Unicode property escapes (\p{L}, \p{N}, \p{Lu}, \p{Ll}, \p{...} etc.) with the /u modifier do not work correctly. The pcre-to-POSIX translator appears to ignore or strip these escapes, leading to incorrect match results and replacement behavior.

Reproduction

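<?php
// Each result is printed on its own line so the two outputs do not run together.
echo preg_match('/\p{L}+/u', '日本語123'), PHP_EOL;            // Expected: 1
echo preg_replace('/\p{N}+/u', 'X', 'abc123def456'), PHP_EOL; // Expected: abcXdefX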

Expected behavior (PHP 8.4)

  • preg_match returns 1 (the Japanese characters match \p{L}+)
  • preg_replace returns abcXdefX

Actual behavior (elephc)

  • preg_match returns 0
  • preg_replace returns abc123def456 (no replacement performed)

Environment

  • elephc 0.21.8
  • PHP 8.4.20

Possible root cause

The PCRE compatibility layer (src/codegen/runtime/system/pcre_to_posix.rs and related files such as preg_match.rs and preg_replace.rs) appears not to implement any translation for \p{...} Unicode property escapes.

When the translator encounters these sequences it likely either drops them or treats the backslash literally, breaking the intended regex semantics.
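To illustrate the kind of translation that seems to be missing, here is a minimal sketch in Rust. It is not taken from the elephc source: the function names `posix_class_for_property` and `translate_property_escapes` are hypothetical, and the mapping to POSIX classes is only an approximation (for example, \p{N} covers more than `[:digit:]` usually does, and whether `[:alpha:]` matches non-ASCII letters depends on the regex engine and locale in use). It also ignores \P{...} negation and escapes that appear inside an existing bracket expression.

```rust
/// Hypothetical mapping from a Unicode property name to an approximate
/// POSIX character class name. Assumption: not the elephc implementation.
fn posix_class_for_property(name: &str) -> Option<&'static str> {
    match name {
        "L" => Some("[:alpha:]"),
        "Lu" => Some("[:upper:]"),
        "Ll" => Some("[:lower:]"),
        "N" | "Nd" => Some("[:digit:]"),
        _ => None,
    }
}

/// Rewrites `\p{Name}` escapes in `pattern` as POSIX bracket expressions.
/// Returns Err carrying the property name when no mapping is known, so the
/// caller can report an unsupported pattern instead of silently mis-matching.
fn translate_property_escapes(pattern: &str) -> Result<String, String> {
    let mut out = String::with_capacity(pattern.len());
    let mut rest = pattern;
    while let Some(pos) = rest.find('\\') {
        // Copy everything before the backslash unchanged.
        out.push_str(&rest[..pos]);
        let after = &rest[pos + 1..];
        if let Some(body) = after.strip_prefix("p{") {
            let end = body
                .find('}')
                .ok_or_else(|| "unterminated \\p{".to_string())?;
            let name = &body[..end];
            let class = posix_class_for_property(name)
                .ok_or_else(|| name.to_string())?;
            out.push('[');
            out.push_str(class);
            out.push(']');
            rest = &body[end + 1..];
        } else {
            // Any other escape: keep the backslash and the next character.
            out.push('\\');
            let mut chars = after.chars();
            if let Some(c) = chars.next() {
                out.push(c);
            }
            rest = chars.as_str();
        }
    }
    out.push_str(rest);
    Ok(out)
}

fn main() {
    println!("{:?}", translate_property_escapes(r"\p{L}+"));
    println!("{:?}", translate_property_escapes(r"\p{N}+"));
}
```

Returning an error for unmapped properties is a deliberate choice in this sketch: a compile-time or runtime diagnostic seems preferable to the silent "no match" behavior reported above.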

Additional context

Found during Round 2 stress testing focused on complex preg_* usage.

This is a PHP compatibility gap that affects anyone using modern Unicode-aware regular expressions.

Suggested labels: bug, php-compatibility, regex, runtime

Metadata

Labels: bug, php-compatibility, regex, runtime
