Fix: invisible unicode and asterisk censoring bypasses#49

Merged: deemonic merged 3 commits into main from fix/unicode-and-wildcard-bypasses on Mar 27, 2026

Conversation

@deemonic (Collaborator) commented Mar 26, 2026

Summary

  • Invisible Unicode bypass: Strip \p{Cf} format characters (zero-width spaces, invisible separators like U+2063) from input before processing, so f⁣uck is correctly detected
  • Asterisk censoring bypass: Add * as a universal letter substitution so censored profanity like f*g, s**t, f**k is detected
  • Internal masking fix: Use \x01 instead of * for internal masking during the detection loop, preventing re-matching of already-masked text when * is a valid substitution character
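The asterisk substitution in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the library's code: `buildWildcardPattern` is a hypothetical helper, the word `darn` is a mild stand-in for a listed word, and the sketch assumes words made of ASCII letters only.

```php
<?php
// Hypothetical helper, not taken from the Blasp source.
// Builds a regex in which every letter of $word may also be written as '*'.
// Assumes $word contains ASCII letters only.
function buildWildcardPattern(string $word): string
{
    $parts = [];
    foreach (str_split($word) as $letter) {
        // Character class: the letter itself or a literal asterisk.
        $parts[] = '[' . preg_quote($letter, '/') . '*]';
    }

    // Lookarounds keep the match from starting or ending inside a longer word.
    return '/(?<![a-z])' . implode('', $parts) . '(?![a-z])/i';
}

$pattern = buildWildcardPattern('darn');

var_dump(preg_match($pattern, 'well d**n it')); // censored form is matched
var_dump(preg_match($pattern, 'darned'));       // longer word is not matched
```

A pattern built this way would also match a bare run of four asterisks, which is presumably one reason the PR pairs the change with false-positive tests.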

Test plan

  • 11 new tests in BypassVulnerabilityTest.php covering both bypass vectors, false positive guards, and combined scenarios
  • All 293 existing tests pass with zero regressions
  • No existing tests were modified

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Improved profanity detection for obfuscated inputs (invisible Unicode characters and asterisk-masked words); cleaned output masks detected profanity while preserving surrounding text.
  • Bug Fixes

    • Enhanced preprocessing to strip invisible/formatting characters and prevent bypasses; more robust masking to avoid missed detections.
  • Tests

    • Added tests covering invisible-character bypasses, asterisk-censoring scenarios, and false‑positive checks.

…ring

Two bypass vectors were reported where profanity went undetected:
1. Invisible Unicode characters between letters (e.g. f\u{2063}uck)
2. Asterisk-censored words (e.g. f*g, s**t)

Fixes: strip \p{Cf} format characters before processing, add '*' as a
universal letter substitution, and use \x01 for internal masking to
prevent re-matching masked text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
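To see why the mask character matters once `*` is a valid substitution, consider a simplified detect-and-mask loop. The function name, the pass limit, and the pattern below are assumptions made for illustration; the real RegexDriver loop may be structured differently.

```php
<?php
// Hypothetical sketch of an iterative detect-and-mask loop.
// $maskChar is the character written over each matched span.
function detectAndMask(string $text, string $pattern, string $maskChar, int $maxPasses = 10): array
{
    $offsets = [];
    for ($pass = 0; $pass < $maxPasses; $pass++) {
        if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE) !== 1) {
            break; // nothing left to detect
        }
        [$match, $offset] = $m[0];
        $offsets[] = $offset;
        // Overwrite the matched span so the next pass should skip it.
        $mask = str_repeat($maskChar, strlen($match));
        $text = substr_replace($text, $mask, $offset, strlen($match));
    }

    // Convert the internal mask to '*' only for the final, user-facing output.
    return ['offsets' => $offsets, 'clean' => strtr($text, "\x01", '*')];
}

// Pattern in which each letter may also be written as '*'.
$pattern = '/[d*][a*][r*][n*]/i';

$withSoh  = detectAndMask('darn it', $pattern, "\x01");
$withStar = detectAndMask('darn it', $pattern, '*');

echo count($withSoh['offsets']), "\n";  // 1: the masked span is not matched again
echo count($withStar['offsets']), "\n"; // 10: '****' keeps matching until the pass limit
```

With `*` as the mask, the masked span itself satisfies the pattern, so the loop only stops at the pass limit. A control character such as `\x01` cannot appear in any substitution class, so the second pass finds nothing.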

coderabbitai Bot commented Mar 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 38367302-6180-42b6-baea-dbc5f242327c

📥 Commits

Reviewing files that changed from the base of the PR and between bcc0d89 and ea1de49.

📒 Files selected for processing (1)
  • src/Core/Analyzer.php
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Core/Analyzer.php

📝 Walkthrough

Walkthrough

Preprocessing now strips Unicode format (invisible) characters before detection; the substitution mapping in config adds '*' as a substitution candidate for many letters; RegexDriver masks matched spans with SOH (\x01) instead of *; new tests cover invisible-char and asterisk bypasses.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration (config/blasp.php) | Added '*' as an additional substitution candidate for many letter patterns (a–z), adjusting the substitutions map used during normalization. |
| Preprocessing (src/Core/Analyzer.php) | Strip all Unicode format characters (\p{Cf}) from input text before passing to the detection driver (fallback to original text on null). |
| Detection / Masking (src/Drivers/RegexDriver.php) | Mask detected profanity spans using SOH ("\x01") repeated for the match length instead of '*' to prevent * from being treated as a substitution during iterative detection. |
| Tests (tests/BypassVulnerabilityTest.php) | Added new test class with 11 tests verifying detection against invisible Unicode separators, multiple invisible chars, asterisk-censored profanity, masking output, and false-positive cases. |

Sequence Diagram(s)

sequenceDiagram
    participant User as Input Text
    participant Analyzer as Analyzer (preprocess)
    participant Driver as RegexDriver (normalize/subst)
    participant Regex as Regex Matcher
    participant Mask as Masking (SOH)
    participant Result as Detection Result

    User->>Analyzer: submit raw text (may include \p{Cf}, asterisks)
    Analyzer->>Analyzer: remove \p{Cf} chars (preg_replace)
    Analyzer->>Driver: pass cleaned text
    Driver->>Driver: normalize & apply substitution mapping (includes '*')
    Driver->>Regex: perform pattern matching
    Regex->>Mask: report matched spans
    Mask->>Mask: replace matched spans with SOH ("\\x01") to avoid re-detection
    Mask->>Result: return occurrences and cleaned output

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰

I hopped through text both thin and thick,
I found the ghosts and every little trick,
I wiped the whispers, hugged SOH tight,
Now masked words fade from sight,
Asterisk and shadow—no more flight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.08%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes in the pull request: fixing bypasses related to invisible Unicode characters and asterisk censoring. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Drivers/RegexDriver.php`:
- Around line 30-31: The removal of Unicode format chars via
preg_replace('/\p{Cf}/u', '', $text) shifts offsets so MatchedWord ranges no
longer map to the original input; either build and use a normalized→original
index map or use a position-stable normalization before creating MatchedWord
ranges. Concretely: in the code around the $text preprocessing in RegexDriver
(the preg_replace call), replace the destructive deletion with one of two fixes
— (A) construct an array mapping each character index in the normalized text
back to its original index (track removed positions while iterating the original
string) and after matches are produced translate MatchedWord start/length back
to original offsets, or (B) keep text length stable by replacing each \p{Cf}
with a placeholder of the same byte/char width (e.g., a space or neutral
codepoint) so existing match offsets remain correct — then ensure MatchedWord
uses the mapped/original offsets when creating ranges.
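Option (A) from the comment above could look something like the following sketch. Both function names are hypothetical, and offsets here are counted in characters (code points); a real driver working with byte offsets from `preg_match` would need to convert them first.

```php
<?php
// Hypothetical sketch of option (A): strip \p{Cf} while recording, for each
// character kept, its index in the original string.
function stripWithIndexMap(string $text): array
{
    // Split into code points; returns false on invalid UTF-8.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return [$text, null]; // leave the input untouched, no map available
    }

    $clean = '';
    $map = []; // $map[normalizedIndex] = originalIndex
    foreach ($chars as $i => $ch) {
        if (preg_match('/^\p{Cf}$/u', $ch) === 1) {
            continue; // drop the format character, record nothing
        }
        $map[] = $i;
        $clean .= $ch;
    }

    return [$clean, $map];
}

// Translate a [start, length] range in the stripped text back to the original.
function toOriginalRange(array $map, int $start, int $length): array
{
    $origStart = $map[$start];
    $origEnd = $map[$start + $length - 1];

    return [$origStart, $origEnd - $origStart + 1];
}

[$clean, $map] = stripWithIndexMap("a f\u{2063}oo b");
// "foo" sits at character 2 of the stripped text, with length 3.
print_r(toOriginalRange($map, 2, 3)); // original range covers the invisible char too
```

The translated range is longer than the normalized one because it spans the removed character, which is what a caller masking the original input would want.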
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3f3b1801-1122-4202-a2d8-217e9e4d53e5

📥 Commits

Reviewing files that changed from the base of the PR and between a6758c9 and 32d6399.

📒 Files selected for processing (3)
  • config/blasp.php
  • src/Drivers/RegexDriver.php
  • tests/BypassVulnerabilityTest.php

Comment thread src/Drivers/RegexDriver.php Outdated
…sition drift

Moves the \p{Cf} stripping from RegexDriver to Analyzer so all drivers
in a pipeline receive already-stripped text. This prevents position
misalignment when PipelineDriver applies RegexDriver positions to the
original (unstripped) input.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Core/Analyzer.php`:
- Line 22: preg_replace('/\p{Cf}/u', '', $text) can return null on malformed
UTF-8 and currently silently replaces $text with null/empty; update the
Analyzer.php logic to assign the preg_replace result to a temporary (e.g. $clean
= preg_replace(...)); check if $clean === null and if so preserve the original
$text (or optionally attempt to normalize/repair encoding first), otherwise set
$text = $clean; include a brief comment or logging where appropriate to indicate
a UTF-8 parse error.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5367f6b-0da3-40c0-8dd0-83bcbef49e62

📥 Commits

Reviewing files that changed from the base of the PR and between 32d6399 and bcc0d89.

📒 Files selected for processing (2)
  • src/Core/Analyzer.php
  • src/Drivers/RegexDriver.php
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/Drivers/RegexDriver.php

Comment thread src/Core/Analyzer.php Outdated
preg_replace with /u flag returns null on invalid UTF-8 input. Fall
back to the original text to avoid silently losing input.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
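The null fallback described in this commit can be sketched as below. The function name is hypothetical; the behaviour of `preg_replace` returning `null` on malformed UTF-8 under the `/u` flag is standard PHP.

```php
<?php
// Hypothetical sketch of the null-safe stripping step.
function stripFormatCharsSafe(string $text): string
{
    // With the /u flag, preg_replace returns null if $text is not valid UTF-8.
    $clean = preg_replace('/\p{Cf}/u', '', $text);

    // Fall back to the original input rather than losing it.
    return $clean === null ? $text : $clean;
}

var_dump(stripFormatCharsSafe("a\u{200B}b") === 'ab');        // zero-width space removed
var_dump(stripFormatCharsSafe("ab\xFFcd") === "ab\xFFcd");    // invalid UTF-8 kept as-is
```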
@deemonic deemonic merged commit e0a2ea5 into main Mar 27, 2026
3 checks passed
@deemonic deemonic deleted the fix/unicode-and-wildcard-bypasses branch March 27, 2026 07:00