Fix: invisible Unicode and asterisk censoring bypasses #49
…ring
Two bypass vectors were reported where profanity went undetected:
1. Invisible Unicode characters between letters (e.g. f\u{2063}uck)
2. Asterisk-censored words (e.g. f*g, s**t)
Fixes: strip \p{Cf} format characters before processing, add '*' as a
universal letter substitution, and use \x01 for internal masking to
prevent re-matching masked text.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
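The first two fixes in this commit message can be sketched as follows. This is a minimal illustration, not the library's actual code: the function names `stripFormatChars` and `buildAsteriskTolerantPattern` are hypothetical, and the neutral word `darn` stands in for a word-list entry.

```php
<?php
declare(strict_types=1);

// Hypothetical helper names; the library's real internals may differ.

/** Remove Unicode format (Cf) characters such as U+200B and U+2063. */
function stripFormatChars(string $text): string
{
    // preg_replace returns null on invalid UTF-8; keep the input in that case.
    return preg_replace('/\p{Cf}/u', '', $text) ?? $text;
}

/** Build a case-insensitive pattern in which every letter may also be '*'. */
function buildAsteriskTolerantPattern(string $word): string
{
    $classes = [];
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        // Inside a character class '*' is a literal, not a quantifier.
        $classes[] = '[' . preg_quote($char, '/') . '*]';
    }
    return '/' . implode('', $classes) . '/iu';
}

$pattern = buildAsteriskTolerantPattern('darn');
var_dump(preg_match($pattern, stripFormatChars("d\u{2063}arn")) === 1);
var_dump(preg_match($pattern, 'd**n') === 1);
```

A pattern built this way also matches a run consisting only of asterisks, which is why the masking character has to be something other than `*`.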
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
Preprocessing now strips Unicode format (invisible) characters before detection; the substitution mapping in config adds …
Sequence Diagram(s)
sequenceDiagram
participant User as Input Text
participant Analyzer as Analyzer (preprocess)
participant Driver as RegexDriver (normalize/subst)
participant Regex as Regex Matcher
participant Mask as Masking (SOH)
participant Result as Detection Result
User->>Analyzer: submit raw text (may include \p{Cf}, asterisks)
Analyzer->>Analyzer: remove \p{Cf} chars (preg_replace)
Analyzer->>Driver: pass cleaned text
Driver->>Driver: normalize & apply substitution mapping (includes '*')
Driver->>Regex: perform pattern matching
Regex->>Mask: report matched spans
Mask->>Mask: replace matched spans with SOH ("\\x01") to avoid re-detection
Mask->>Result: return occurrences and cleaned output
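The masking step at the end of the diagram can be sketched like this. It is an illustrative loop under stated assumptions, not the RegexDriver implementation: `detectAndMask` is a hypothetical name, and the pattern is a hand-written stand-in for one that accepts `*` in place of any letter.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical detection loop: find a match, record it, mask it, repeat.
 *
 * @return array{0: string[], 1: string} matched strings and the masked text
 */
function detectAndMask(string $text, string $pattern, string $maskChar = "\x01"): array
{
    $found = [];
    while (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE) === 1) {
        [$match, $offset] = $m[0];
        if ($match === '') {
            break; // guard against patterns that can match the empty string
        }
        $found[] = $match;
        // Replace the matched bytes with the same number of mask bytes,
        // so byte offsets of later matches stay stable.
        $length = strlen($match);
        $text = substr_replace($text, str_repeat($maskChar, $length), $offset, $length);
    }
    return [$found, $text];
}

[$found, $masked] = detectAndMask('darn d**n', '/[d*][a*][r*][n*]/i');
```

If `$maskChar` were `*` here, the masked span `****` would itself satisfy the pattern and the loop would keep matching the same position, which is the re-matching problem SOH avoids.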
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/Drivers/RegexDriver.php`:
- Around line 30-31: The removal of Unicode format chars via
preg_replace('/\p{Cf}/u', '', $text) shifts offsets so MatchedWord ranges no
longer map to the original input; either build and use a normalized→original
index map or use a position-stable normalization before creating MatchedWord
ranges. Concretely: in the code around the $text preprocessing in RegexDriver
(the preg_replace call), replace the destructive deletion with one of two fixes
— (A) construct an array mapping each character index in the normalized text
back to its original index (track removed positions while iterating the original
string) and after matches are produced translate MatchedWord start/length back
to original offsets, or (B) keep text length stable by replacing each \p{Cf}
with a placeholder of the same byte/char width (e.g., a space or neutral
codepoint) so existing match offsets remain correct — then ensure MatchedWord
uses the mapped/original offsets when creating ranges.
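Option (A) from the comment above might look roughly like the sketch below. The helper names `stripWithIndexMap` and `toOriginalRange` are hypothetical, and the sketch assumes offsets are counted in characters; `preg_match` reports byte offsets, so a real integration would need to convert between the two.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: strip \p{Cf} characters and record, for each kept
 * character, its character index in the original string.
 *
 * @return array{0: string, 1: int[]} cleaned text and normalized-to-original map
 */
function stripWithIndexMap(string $text): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $clean = '';
    $map = [];
    foreach ($chars as $index => $char) {
        if (preg_match('/^\p{Cf}$/u', $char) === 1) {
            continue; // dropped character: no entry in the map
        }
        $clean .= $char;
        $map[] = $index;
    }
    return [$clean, $map];
}

/**
 * Translate a (start, length) range in the cleaned text back to the original.
 * Assumes $length >= 1 and that the range lies inside the cleaned text.
 *
 * @param int[] $map
 * @return array{0: int, 1: int} original start and length
 */
function toOriginalRange(array $map, int $start, int $length): array
{
    $origStart = $map[$start];
    $origEnd = $map[$start + $length - 1];
    return [$origStart, $origEnd - $origStart + 1];
}
```

The translated range covers any stripped characters that sit between the first and last matched character, so masking the original input also removes the invisible characters inside the word.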
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3f3b1801-1122-4202-a2d8-217e9e4d53e5
📒 Files selected for processing (3)
- config/blasp.php
- src/Drivers/RegexDriver.php
- tests/BypassVulnerabilityTest.php
…sition drift
Moves the \p{Cf} stripping from RegexDriver to Analyzer so all drivers
in a pipeline receive already-stripped text. This prevents position
misalignment when PipelineDriver applies RegexDriver positions to the
original (unstripped) input.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/Core/Analyzer.php`:
- Line 22: preg_replace('/\p{Cf}/u', '', $text) can return null on malformed
UTF-8 and currently silently replaces $text with null/empty; update the
Analyzer.php logic to assign the preg_replace result to a temporary (e.g. $clean
= preg_replace(...)); check if $clean === null and if so preserve the original
$text (or optionally attempt to normalize/repair encoding first), otherwise set
$text = $clean; include a brief comment or logging where appropriate to indicate
a UTF-8 parse error.
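A minimal sketch of the suggested guard, assuming a standalone function rather than the actual Analyzer method (`stripFormatCharsSafely` is a hypothetical name):

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: strip \p{Cf} characters, but never lose the input.
 */
function stripFormatCharsSafely(string $text): string
{
    $clean = preg_replace('/\p{Cf}/u', '', $text);

    if ($clean === null) {
        // With the /u flag, PCRE rejects malformed UTF-8 and preg_replace
        // returns null; preg_last_error() then reports PREG_BAD_UTF8_ERROR.
        // A real implementation might log this before falling back.
        return $text;
    }

    return $clean;
}
```

Falling back to the original text means malformed input is still analyzed, at the cost of leaving any format characters in place for that one input.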
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b5367f6b-0da3-40c0-8dd0-83bcbef49e62
📒 Files selected for processing (2)
- src/Core/Analyzer.php
- src/Drivers/RegexDriver.php
🚧 Files skipped from review as they are similar to previous changes (1)
- src/Drivers/RegexDriver.php
preg_replace with /u flag returns null on invalid UTF-8 input. Fall back
to the original text to avoid silently losing input.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Strip \p{Cf} format characters (zero-width spaces, invisible separators like U+2063) from input before processing, so f\u{2063}uck is correctly detected
- Add * as a universal letter substitution so censored profanity like f*g, s**t, f**k is detected
- Use \x01 instead of * for internal masking during the detection loop, preventing re-matching of already-masked text when * is a valid substitution character
Test plan
- BypassVulnerabilityTest.php covering both bypass vectors, false positive guards, and combined scenarios
🤖 Generated with Claude Code