fix(regex): \cX inside character class consumes next char even if special #558

Merged: fglock merged 1 commit into master from fix/regex-control-char-class on Apr 25, 2026
@fglock fglock commented Apr 25, 2026

Summary

Fixes the regex preprocessor's handling of \cX inside character classes when X is a regex-significant character (], \, [, ^, etc.).

In Perl, \cX is a control-character escape that always consumes the next source character, regardless of what it is — so \c] is U+001D (control-]), \c\ is U+001C (control-\), and so on. Our preprocessor treated \c as a plain one-char escape, which caused two problems:

  • Bracket-end finder: scanning for the closing ] of a character class, \c] was misread as the class terminator.
  • Class-content emitter: \cX was emitted to Java as-is, so Java's Pattern re-parsed sequences like \c\[ as control-backslash followed by an opening bracket — yielding a misleading Unclosed character class error.

Fix

  • RegexPreprocessor.handleCharacterClass: skip the character following a \c escape when locating the closing ].
  • RegexPreprocessorHelper.handleRegexCharacterClassEscape: detect \cX and convert it to \x{HH} using Perl semantics (uc(X) XOR 0x40, low byte). Java then sees a plain hex escape with no special semantics to misinterpret.
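
As a rough illustration of the conversion described above, the following sketch maps the X of a \cX escape to a \x{HH} escape using the stated rule (uc(X) XOR 0x40, low byte). The class and method names are hypothetical and are not taken from RegexPreprocessorHelper; the actual implementation may differ.

```java
// Hypothetical sketch, not the project's code: maps the X in \cX to \x{HH}.
final class ControlEscapeSketch {

    // Perl rule from the description: uc(X) XOR 0x40, keeping the low byte.
    static int controlCodePoint(char x) {
        return (Character.toUpperCase(x) ^ 0x40) & 0xFF;
    }

    // Renders the escape as a plain hex escape that Java's Pattern accepts.
    static String toHexEscape(char x) {
        return String.format("\\x{%02X}", controlCodePoint(x));
    }

    public static void main(String[] args) {
        // The regex-significant cases named in the summary, plus \ca and \c?.
        for (char x : new char[] {'a', '[', '\\', ']', '^', '_', '?'}) {
            System.out.println("\\c" + x + " -> " + toHexEscape(x));
        }
    }
}
```

Because the output is an ordinary hex escape, nothing after the \c is left for Java to reinterpret as a bracket or a backslash.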

Motivation

XML/Dumper.pm:685 strips control chars with:

s/[\0\ca\cb\cc...\cz\c[\c\\c]\c^\c_]//g;

This regex previously failed with "Regex compilation failed: Unclosed character class near index 89". Each parent test script dies on a regex compilation failure, so 8 of the 32 test files in jcpan -t Data::Serializer aborted mid-run with a non-zero exit status, and the subtests after the failure were never reported.

Impact on jcpan -t Data::Serializer

|                          | Before | After |
|--------------------------|--------|-------|
| Failed test programs     | 10/32  | 9/32  |
| Subtests run             | 1250   | 1842  |
| Subtests reported failed | 13     | 13    |
| Test files aborted       | 8      | 0     |

The remaining 9 failing programs are unrelated logic issues (Config::General uses (?(...)...) conditional patterns we don't implement; several files share a PHP::Serialization issue).

Test plan

  • make (all unit tests pass)
  • Added manual coverage: \ca..\cZ, \c[, \c\, \c], \c^, \c_, \c? all match the documented Perl values.
  • jcpan -t Data::Serializer: XML::Dumper now passes; cascading regex crashes eliminated.
  • jcpan -t Math::Base::Convert — still 5350/5350 passing (no regression).
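
The manual coverage listed above could be approximated by a small check like the one below. It is only a sketch under assumptions: the class name, the helper methods, and the choice to build one combined character class are all hypothetical, and it exercises java.util.regex.Pattern directly rather than the project's preprocessor.

```java
import java.util.regex.Pattern;

// Hypothetical check, not part of the project's test suite.
final class ControlClassCheckSketch {

    // The X characters from the test plan: a..z plus [ \ ] ^ _ ?
    static final String XS = "abcdefghijklmnopqrstuvwxyz[\\]^_?";

    // Same rule as in the PR description: uc(X) XOR 0x40, low byte.
    static int controlCodePoint(char x) {
        return (Character.toUpperCase(x) ^ 0x40) & 0xFF;
    }

    // Builds a Java character class such as [\x{01}\x{02}...\x{7F}].
    static Pattern buildClass() {
        StringBuilder sb = new StringBuilder("[");
        for (char x : XS.toCharArray()) {
            sb.append(String.format("\\x{%02X}", controlCodePoint(x)));
        }
        return Pattern.compile(sb.append(']').toString());
    }

    // Removes every character matched by the class built above.
    static String strip(String input) {
        return buildClass().matcher(input).replaceAll("");
    }

    // True if each expected control character is matched by the class.
    static boolean allCovered() {
        Pattern p = buildClass();
        for (char x : XS.toCharArray()) {
            String s = String.valueOf((char) controlCodePoint(x));
            if (!p.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

The point of the sketch is that a class made only of hex escapes compiles in Java without any bracket or backslash ambiguity, whichever X the original \cX used.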

Generated with Devin

…cial

In a Perl regex character class, \cX is the control-character escape that
consumes the next source character regardless of what it is, including
regex-significant ones like ], \, [, ^, _ (e.g. \c] = U+001D, \c\ = U+001C).

The preprocessor's character-class bracket-end finder treated `\c` as a
plain one-char escape, so `\c]` ended the character class prematurely.
The class-content emitter likewise emitted `\c` literally, leaving Java's
regex engine to reinterpret sequences like `\c\[` as control-backslash
followed by a (nested) opening bracket — producing a misleading
"Unclosed character class" error from Pattern.compile.

Two fixes:
- handleCharacterClass(): when scanning for the closing ], skip the
  character following a `\c` escape so `\c]` is not mistaken for a class
  terminator.
- handleRegexCharacterClassEscape(): when emitting an escape, recognise
  `\cX` and convert it to `\x{HH}` (uc(X) XOR 0x40, low byte) so the Java
  regex engine sees a plain hex escape and never has to reason about the
  Perl-specific consume-next-char semantics.
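
The first fix can be pictured with a simplified scanner like the one below. This is a minimal sketch, not the code in handleCharacterClass(): the class and method names are hypothetical, and it deliberately ignores POSIX classes such as [:alpha:] and other constructs the real preprocessor has to handle.

```java
// Hypothetical sketch of a bracket-end finder that honours \cX.
final class ClassEndFinderSketch {

    // Returns the index of the ']' closing the class opened at `open`, or -1.
    static int findClassEnd(String re, int open) {
        int i = open + 1;
        // A leading '^' negates the class.
        if (i < re.length() && re.charAt(i) == '^') {
            i++;
        }
        // A ']' in first position is a literal member, not the terminator.
        if (i < re.length() && re.charAt(i) == ']') {
            i++;
        }
        while (i < re.length()) {
            char ch = re.charAt(i);
            if (ch == '\\') {
                boolean isControl = i + 1 < re.length() && re.charAt(i + 1) == 'c';
                // \cX spans three characters; other escapes are treated as two.
                i += isControl ? 3 : 2;
                continue;
            }
            if (ch == ']') {
                return i;
            }
            i++;
        }
        return -1; // unterminated class
    }
}
```

With this rule, the class in `[\c]]` ends at the second `]`; a scanner that treats `\c` as a two-character escape would stop at the first one.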

This unblocks XML::Dumper's strip-control-chars regex on line 685 of
XML/Dumper.pm and removes a cascading failure that aborted 8 of the 32
test files in `jcpan -t Data::Serializer` mid-run.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock merged commit 1d5def2 into master Apr 25, 2026
2 checks passed
@fglock fglock deleted the fix/regex-control-char-class branch April 25, 2026 08:34