Fix: invisible unicode and asterisk censoring bypasses by deemonic · Pull Request #49 · Blaspsoft/blasp

deemonic · 2026-03-26T21:36:08Z

Summary

- Invisible Unicode bypass: Strip \p{Cf} format characters (zero-width spaces and invisible separators such as U+2063) from input before processing, so that input such as f\u{2063}uck is correctly detected.
- Asterisk censoring bypass: Add * as a universal letter substitution so that censored profanity such as f*g, s**t, f**k is detected.
- Internal masking fix: Use \x01 instead of * for internal masking during the detection loop. This prevents already-masked text from being re-matched now that * is a valid substitution character.
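
As a rough illustration of the first item, the sketch below shows why a hidden format character defeats a plain match and how stripping `\p{Cf}` restores it. The word "darn" is a stand-in blocklist entry chosen for this example, not something taken from the library's word lists.

```php
<?php
// "darn" is a stand-in blocklist word; U+2063 INVISIBLE SEPARATOR hides inside it.
$obfuscated = "d\u{2063}arn";

// A plain substring check misses the word because of the hidden character.
var_dump(strpos($obfuscated, 'darn') !== false);   // bool(false)

// \p{Cf} matches Unicode format characters (U+200B, U+2063, U+FEFF, ...).
$stripped = preg_replace('/\p{Cf}/u', '', $obfuscated);
var_dump($stripped === 'darn');                     // bool(true)
```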

Test plan

- 11 new tests in BypassVulnerabilityTest.php covering both bypass vectors, false-positive guards, and combined scenarios
- All 293 existing tests pass with zero regressions
- No existing tests were modified

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Improved profanity detection for obfuscated inputs (invisible Unicode characters and asterisk-masked words); cleaned output masks detected profanity while preserving surrounding text.
Bug Fixes
- Enhanced preprocessing to strip invisible/formatting characters and prevent bypasses; more robust masking to avoid missed detections.
Tests
- Added tests covering invisible-character bypasses, asterisk-censoring scenarios, and false‑positive checks.

…ring

Two bypass vectors were reported where profanity went undetected:
1. Invisible Unicode characters between letters (e.g. f\u{2063}uck)
2. Asterisk-censored words (e.g. f*g, s**t)

Fixes: strip \p{Cf} format characters before processing, add '*' as a universal letter substitution, and use \x01 for internal masking to prevent re-matching masked text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
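
To make the "universal letter substitution" idea concrete, here is a minimal sketch of how a per-letter substitution map could be turned into a regular expression. The map contents, the `buildPattern` helper, and the stand-in word "darn" are all assumptions made for illustration; the real map lives in `config/blasp.php` and the library's pattern generation may differ.

```php
<?php
// Hypothetical, simplified substitution map: each letter lists the characters
// that may stand in for it. '*' is added as a candidate for every letter.
$substitutions = [
    'd' => ['d', '*'],
    'a' => ['a', '@', '4', '*'],
    'r' => ['r', '*'],
    'n' => ['n', '*'],
];

// Hypothetical helper: builds one alternation group per letter of the word.
function buildPattern(string $word, array $substitutions): string
{
    $parts = [];
    foreach (str_split($word) as $letter) {
        $candidates = $substitutions[$letter] ?? [$letter];
        // preg_quote escapes '*' so it is matched literally.
        $quoted = array_map(fn (string $c): string => preg_quote($c, '/'), $candidates);
        $parts[] = '(?:' . implode('|', $quoted) . ')';
    }
    return '/' . implode('', $parts) . '/i';
}

$pattern = buildPattern('darn', $substitutions);
var_dump(preg_match($pattern, 'well d**n it'));  // int(1)
var_dump(preg_match($pattern, 'a garden'));      // int(0)
```

In this naive form a bare run of four asterisks would also match, so a real implementation needs some guard against that; how the library handles it is not shown in this PR text.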

coderabbitai · 2026-03-26T21:36:23Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 38367302-6180-42b6-baea-dbc5f242327c

📥 Commits

Reviewing files that changed from the base of the PR and between bcc0d89 and ea1de49.

📒 Files selected for processing (1)

src/Core/Analyzer.php

🚧 Files skipped from review as they are similar to previous changes (1)

src/Core/Analyzer.php

📝 Walkthrough

Walkthrough

Preprocessing now strips Unicode format (invisible) characters before detection; the substitution mapping in config adds '*' as a substitution candidate for many letters; RegexDriver masks matched spans with SOH (\x01) instead of *; new tests cover invisible-char and asterisk bypasses.
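
The masking change can be illustrated with a small, hypothetical detection loop. It is a sketch under the assumption that detection repeatedly searches the text and masks each hit in place; the pattern, function name, and pass limit are invented for the example and are not the code in `RegexDriver`.

```php
<?php
// Illustrative pattern: '*' may stand in for any letter of the stand-in word "darn".
const PATTERN = '/[d*][a*][r*][n*]/i';

// Hypothetical loop: find a match, mask it in place, search again.
function countDetections(string $text, string $maskChar, int $maxPasses = 10): int
{
    $count = 0;
    // Cap the passes so a mask that re-matches cannot loop forever.
    while ($count < $maxPasses
        && preg_match(PATTERN, $text, $m, PREG_OFFSET_CAPTURE) === 1) {
        [$matched, $offset] = $m[0];
        $mask = str_repeat($maskChar, strlen($matched));
        $text = substr_replace($text, $mask, $offset, strlen($matched));
        $count++;
    }
    return $count;
}

var_dump(countDetections('oh d**n', '*'));     // int(10): the '*' mask keeps re-matching
var_dump(countDetections('oh d**n', "\x01"));  // int(1): SOH never matches the pattern
```

SOH is a reasonable choice here because it is a control character that is unlikely to appear in a substitution map or in ordinary user input.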

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration `config/blasp.php` | Added `'*'` as an additional substitution candidate for many letter patterns (a–z), adjusting the substitutions map used during normalization. |
| Preprocessing `src/Core/Analyzer.php` | Strip all Unicode format characters (`\p{Cf}`) from input text before passing it to the detection driver (falls back to the original text if `preg_replace` returns null). |
| Detection / Masking `src/Drivers/RegexDriver.php` | Mask detected profanity spans using SOH (`"\x01"`) repeated for the match length instead of `'*'`, so that `*` is not treated as a substitution character during iterative detection. |
| Tests `tests/BypassVulnerabilityTest.php` | Added a new test class with 11 tests verifying detection against invisible Unicode separators, multiple invisible characters, asterisk-censored profanity, masking output, and false-positive cases. |

Sequence Diagram(s)

sequenceDiagram
    participant User as Input Text
    participant Analyzer as Analyzer (preprocess)
    participant Driver as RegexDriver (normalize/subst)
    participant Regex as Regex Matcher
    participant Mask as Masking (SOH)
    participant Result as Detection Result

    User->>Analyzer: submit raw text (may include \p{Cf}, asterisks)
    Analyzer->>Analyzer: remove \p{Cf} chars (preg_replace)
    Analyzer->>Driver: pass cleaned text
    Driver->>Driver: normalize & apply substitution mapping (includes '*')
    Driver->>Regex: perform pattern matching
    Regex->>Mask: report matched spans
    Mask->>Mask: replace matched spans with SOH ("\\x01") to avoid re-detection
    Mask->>Result: return occurrences and cleaned output

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

fix: load and merge language-specific substitutions (#35) #39: Related changes to how substitution maps are loaded/merged and substitution expression generation.
Blasp v4: Driver-based architecture rewrite #48: Related masking/detection changes in RegexDriver::detect() that previously used '*' for masking.

Poem

🐰

I hopped through text both thin and thick,
I found the ghosts and every little trick,
I wiped the whispers, hugged SOH tight,
Now masked words fade from sight,
Asterisk and shadow—no more flight.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.08%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes in the pull request: fixing bypasses related to invisible Unicode characters and asterisk censoring. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Drivers/RegexDriver.php`:
- Around line 30-31: The removal of Unicode format chars via
preg_replace('/\p{Cf}/u', '', $text) shifts offsets so MatchedWord ranges no
longer map to the original input; either build and use a normalized→original
index map or use a position-stable normalization before creating MatchedWord
ranges. Concretely: in the code around the $text preprocessing in RegexDriver
(the preg_replace call), replace the destructive deletion with one of two fixes
— (A) construct an array mapping each character index in the normalized text
back to its original index (track removed positions while iterating the original
string) and after matches are produced translate MatchedWord start/length back
to original offsets, or (B) keep text length stable by replacing each \p{Cf}
with a placeholder of the same byte/char width (e.g., a space or neutral
codepoint) so existing match offsets remain correct — then ensure MatchedWord
uses the mapped/original offsets when creating ranges.
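
Option (A) from the comment above could look roughly like the following. This is a sketch only: `stripWithOffsetMap` is a hypothetical helper, offsets are byte offsets, and nothing here is taken from the PR's actual code (the PR ultimately moved the stripping into `Analyzer` instead).

```php
<?php
// Hypothetical helper: returns the stripped text plus a map from each byte
// offset in the stripped text to the matching byte offset in the original.
function stripWithOffsetMap(string $text): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // Invalid UTF-8: leave the text untouched and use an identity map.
        return [$text, $text === '' ? [] : range(0, strlen($text) - 1)];
    }
    $clean = '';
    $map = [];
    $origOffset = 0;
    foreach ($chars as $char) {
        $len = strlen($char);
        if (preg_match('/^\p{Cf}$/u', $char) !== 1) {
            // Kept character: record where each of its bytes came from.
            for ($i = 0; $i < $len; $i++) {
                $map[] = $origOffset + $i;
            }
            $clean .= $char;
        }
        $origOffset += $len;
    }
    return [$clean, $map];
}

$original = "say d\u{2063}arn";   // "darn" is a stand-in blocklist word
[$clean, $map] = stripWithOffsetMap($original);

// Match on the stripped text, then translate the span back to the original.
preg_match('/darn/', $clean, $m, PREG_OFFSET_CAPTURE);
$start = $map[$m[0][1]];
$end = $map[$m[0][1] + strlen($m[0][0]) - 1];
var_dump(substr($original, $start, $end - $start + 1) === "d\u{2063}arn"); // bool(true)
```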


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3f3b1801-1122-4202-a2d8-217e9e4d53e5

📥 Commits

Reviewing files that changed from the base of the PR and between a6758c9 and 32d6399.

📒 Files selected for processing (3)

config/blasp.php
src/Drivers/RegexDriver.php
tests/BypassVulnerabilityTest.php

…sition drift

Moves the \p{Cf} stripping from RegexDriver to Analyzer so all drivers in a pipeline receive already-stripped text. This prevents position misalignment when PipelineDriver applies RegexDriver positions to the original (unstripped) input.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/Core/Analyzer.php`:
- Line 22: preg_replace('/\p{Cf}/u', '', $text) can return null on malformed
UTF-8 and currently silently replaces $text with null/empty; update the
Analyzer.php logic to assign the preg_replace result to a temporary (e.g. $clean
= preg_replace(...)); check if $clean === null and if so preserve the original
$text (or optionally attempt to normalize/repair encoding first), otherwise set
$text = $clean; include a brief comment or logging where appropriate to indicate
a UTF-8 parse error.
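
The guard described above might be sketched as follows. The function name is hypothetical and the surrounding `Analyzer` code is not shown in this PR text, so treat this as an illustration of the null check rather than the merged implementation.

```php
<?php
// Hypothetical function name; the null check follows the review comment.
function stripFormatCharsSafely(string $text): string
{
    $clean = preg_replace('/\p{Cf}/u', '', $text);
    if ($clean === null) {
        // With the /u modifier PCRE rejects invalid UTF-8, so keep the input as-is.
        return $text;
    }
    return $clean;
}

$invalid = "bad \xFF bytes";   // \xFF can never appear in valid UTF-8
var_dump(preg_replace('/\p{Cf}/u', '', $invalid));       // NULL
var_dump(stripFormatCharsSafely($invalid) === $invalid);  // bool(true)
```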


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5367f6b-0da3-40c0-8dd0-83bcbef49e62

📥 Commits

Reviewing files that changed from the base of the PR and between 32d6399 and bcc0d89.

📒 Files selected for processing (2)

src/Core/Analyzer.php
src/Drivers/RegexDriver.php

🚧 Files skipped from review as they are similar to previous changes (1)

src/Drivers/RegexDriver.php

preg_replace with /u flag returns null on invalid UTF-8 input. Fall back to the original text to avoid silently losing input. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed Mar 26, 2026

Comment thread src/Drivers/RegexDriver.php Outdated

coderabbitai Bot reviewed Mar 26, 2026

Comment thread src/Core/Analyzer.php Outdated

fix: guard against preg_replace returning null on malformed UTF-8

ea1de49

deemonic merged commit e0a2ea5 into main Mar 27, 2026
3 checks passed

deemonic deleted the fix/unicode-and-wildcard-bypasses branch March 27, 2026 07:00

deemonic merged 3 commits into main from fix/unicode-and-wildcard-bypasses