fix(regex): \cX inside character class consumes next char even if special #558
Merged
In a Perl regex character class, \cX is the control-character escape that
consumes the next source character regardless of what it is, including
regex-significant ones like ], \, [, ^, _ (e.g. \c] = U+001D, \c\ = U+001C).
The preprocessor's character-class bracket-end finder treated `\c` as a
plain one-char escape, so `\c]` ended the character class prematurely.
The class-content emitter likewise emitted `\c` literally, leaving Java's
regex engine to reinterpret sequences like `\c\[` as control-backslash
followed by a (nested) opening bracket — producing a misleading
"Unclosed character class" error from Pattern.compile.
Two fixes:
- handleCharacterClass(): when scanning for the closing ], skip the
character following a `\c` escape so `\c]` is not mistaken for a class
terminator.
- handleRegexCharacterClassEscape(): when emitting an escape, recognise
`\cX` and convert it to `\x{HH}` (uc(X) XOR 0x40, low byte) so the Java
regex engine sees a plain hex escape and never has to reason about the
Perl-specific consume-next-char semantics.
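The two fixes above can be illustrated with a minimal standalone sketch. It is not the project's code: the class name `ControlEscapeSketch` and its methods are hypothetical, and the scanner is deliberately simplified (for instance, it ignores POSIX classes such as `[:alpha:]`). It only shows the two ideas: treating `\cX` as a three-character unit while looking for the closing `]`, and rewriting `\cX` as `\x{HH}` with uc(X) XOR 0x40.

```java
// Hypothetical sketch, not the RegexPreprocessor implementation.
final class ControlEscapeSketch {

    /** Value of Perl's \cX: uppercase X, flip bit 0x40, keep the low byte. */
    static int controlCodePoint(char x) {
        return (Character.toUpperCase(x) ^ 0x40) & 0xFF;
    }

    /** Renders \cX as a plain hex escape, e.g. \c] becomes \x{1D}. */
    static String toHexEscape(char x) {
        return String.format("\\x{%02X}", controlCodePoint(x));
    }

    /**
     * Returns the index of the ']' that closes the class opened at 'open',
     * or -1 if the class is unterminated. Simplified: no POSIX classes.
     */
    static int findClassEnd(String pattern, int open) {
        int i = open + 1;
        if (i < pattern.length() && pattern.charAt(i) == '^') i++; // negation
        if (i < pattern.length() && pattern.charAt(i) == ']') i++; // leading ']' is literal
        while (i < pattern.length()) {
            char c = pattern.charAt(i);
            if (c == ']') return i;
            if (c == '\\' && i + 1 < pattern.length()) {
                boolean control = pattern.charAt(i + 1) == 'c' && i + 2 < pattern.length();
                i += control ? 3 : 2; // \cX spans three chars, other escapes two
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(findClassEnd("[\\c]]", 0)); // index of the second ']'
        System.out.println(toHexEscape(']'));
    }
}
```

With this scanner, the class in `[\c]]` ends at the second `]`, because the first one is consumed as the operand of `\c`.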
This unblocks XML::Dumper's strip-control-chars regex on line 685 of
XML/Dumper.pm and removes a cascading failure that aborted 8 of the 32
test files in `jcpan -t Data::Serializer` mid-run.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Summary

Fixes the regex preprocessor's handling of `\cX` inside character classes when `X` is a regex-significant character (`]`, `\`, `[`, `^`, etc.).

In Perl, `\cX` is a control-character escape that always consumes the next source character, regardless of what it is, so `\c]` is U+001D (control-]), `\c\` is U+001C (control-\), and so on. Our preprocessor treated `\c` as a plain one-char escape, which caused two problems:
- When scanning for the closing `]` of a character class, `\c]` was misread as the class terminator.
- `\cX` was emitted to Java as-is, so Java's `Pattern` re-parsed sequences like `\c\[` as control-backslash followed by an opening bracket, yielding a misleading `Unclosed character class` error.

Fix

- `RegexPreprocessor.handleCharacterClass`: skip the character following a `\c` escape when locating the closing `]`.
- `RegexPreprocessorHelper.handleRegexCharacterClassEscape`: detect `\cX` and convert it to `\x{HH}` using Perl semantics (`uc(X) XOR 0x40`, low byte). Java then sees a plain hex escape with no special semantics to misinterpret.

Motivation

`XML/Dumper.pm:685` strips control chars with a regex that previously failed with `Regex compilation failed: Unclosed character class near index 89`, taking down 8 of the 32 test files in `jcpan -t Data::Serializer` mid-run with non-zero exit status (the parent test scripts each `die` on regex compilation failure, so subtests after the failure were never reported).

Impact on `jcpan -t Data::Serializer`

The remaining 9 failing programs are unrelated logic issues (`Config::General` uses `(?(...)...)` conditional patterns we don't implement; several files share a `PHP::Serialization` issue).

Test plan

- `make` (all unit tests pass)
- `\ca..\cZ`, `\c[`, `\c\`, `\c]`, `\c^`, `\c_`, `\c?` all match the documented Perl values.
- `jcpan -t Data::Serializer`: `XML::Dumper` now passes; cascading regex crashes eliminated.
- `jcpan -t Math::Base::Convert`: still 5350/5350 passing (no regression).
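
The control-character values in the test plan can also be spot-checked directly against `java.util.regex`. The sketch below is illustrative only and is not part of the project's test suite; the class `HexEscapeCheck` and its methods are hypothetical names. It builds a one-element character class from a `\x{HH}` escape, the form the fix emits, and checks that Java matches exactly the expected control character.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical check harness, not the project's unit tests.
final class HexEscapeCheck {

    /** Perl control escapes mapped to the code points Perl documents for ASCII. */
    static Map<String, Integer> expected() {
        Map<String, Integer> m = new LinkedHashMap<>();
        for (char x = 'a'; x <= 'z'; x++) m.put("\\c" + x, x - 'a' + 1); // \ca..\cz
        for (char x = 'A'; x <= 'Z'; x++) m.put("\\c" + x, x - 'A' + 1); // \cA..\cZ
        m.put("\\c[", 0x1B);
        m.put("\\c\\", 0x1C);
        m.put("\\c]", 0x1D);
        m.put("\\c^", 0x1E);
        m.put("\\c_", 0x1F);
        m.put("\\c?", 0x7F);
        return m;
    }

    /** True if the class [\x{HH}] built for codePoint matches the whole input. */
    static boolean classMatches(int codePoint, String input) {
        String regex = String.format("[\\x{%02X}]", codePoint);
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> e : expected().entrySet()) {
            String ch = new String(Character.toChars(e.getValue()));
            System.out.printf("%s -> U+%04X matches: %b%n",
                    e.getKey(), e.getValue(), classMatches(e.getValue(), ch));
        }
    }
}
```

Because the hex form contains no `]`, `\`, or `[` after the escape introducer, Java's parser has nothing left to reinterpret inside the class.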