Fix: Chunked regex with overlapping Matcher for boundary-safe data masking#959
Fix: Chunked regex with overlapping Matcher for boundary-safe data masking#959anthonygiuliano wants to merge 2 commits into
Conversation
…jongpie#639) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jongpie
left a comment
There was a problem hiding this comment.
@anthonygiuliano thanks for working on this! Conceptually, these changes seem like they'll help with the data masking issues, but some of the specific code changes are a little difficult to understand. I added a few questions to the PR - would you mind taking a look whenever you have some time?
| @TestVisible | ||
| private static final Integer DATA_MASK_REGEX_CHUNK_SIZE = 4000; | ||
| @TestVisible | ||
| private static final Integer DATA_MASK_REGEX_OVERLAP_SIZE = 20; |
There was a problem hiding this comment.
@anthonygiuliano could you share some insights on why these 2 values (4000 & 20) were chosen for the 2 constants? I think it's worth also adding some comment(s) here to explain the intent of these 2 constants & why these particular values are being used.
There was a problem hiding this comment.
Good question — digging in, the honest answer is there's no universal correct value, so I've added detailed comments on both constants and made them overridable.
Overlap (20): must be ≥ the longest value any enabled rule can match (else a value straddling a chunk boundary is missed). 20 covers the built-in rules (SSN ~11, credit card ~19 w/ separators). It can't be derived from a regex (max match length isn't generally computable), so for orgs with custom rules matching longer values it's overridable via an optional LoggerParameter__mdt.DataMaskRegexOverlapSize record.
Chunk size (4000): I benchmarked this (reproduced on a scratch org + a sandbox). With all shipped rules applied single-pass, the uncatchable Regex too complicated cliff is ~110K chars (realistic/diluted log shape) to ~220K (dense). It's a regex-engine step budget, not a char count — and notably the single worst rule (Mastercard's alternation + \3 backref) sets the cliff; the number of rules doesn't. Chunk size turned out to be a safety knob, not a perf lever: CPU was flat across 1K–64K. So 4000 is deliberately conservative (~27× under the worst-case cliff) at zero perf cost, and also overridable via LoggerParameter__mdt.DataMaskRegexChunkSize for orgs whose custom rules push the failure point down. Full rationale + measurements are in the constant comments and the CMDT Description__c fields.
| if (text == null || text.length() <= DATA_MASK_REGEX_CHUNK_SIZE) { | ||
| return text == null ? text : text.replaceAll(sensitiveDataRegEx, replacementRegEx); | ||
| } |
There was a problem hiding this comment.
I think this would read a little better if there were 2 separate/distinct if blocks:
| if (text == null || text.length() <= DATA_MASK_REGEX_CHUNK_SIZE) { | |
| return text == null ? text : text.replaceAll(sensitiveDataRegEx, replacementRegEx); | |
| } | |
| if (text == null) { | |
| return text null; | |
| } | |
| if (text.length() <= DATA_MASK_REGEX_CHUNK_SIZE) { | |
| return text.replaceAll(sensitiveDataRegEx, replacementRegEx); | |
| } |
There was a problem hiding this comment.
Done — split into two distinct if blocks as suggested (used return text; for the null case).
| private static String expandReplacement(String replacement, List<String> groups) { | ||
| String result = ''; | ||
| for (Integer i = 0; i < replacement.length(); i++) { | ||
| if (replacement.substring(i, i + 1) == '$' && i + 1 < replacement.length()) { | ||
| // Parse the group number following '$' | ||
| Integer j = i + 1; | ||
| while (j < replacement.length() && replacement.substring(j, j + 1) >= '0' && replacement.substring(j, j + 1) <= '9') { | ||
| j++; | ||
| } | ||
| if (j > i + 1) { | ||
| Integer groupNum = Integer.valueOf(replacement.substring(i + 1, j)); | ||
| if (groupNum >= 1 && groupNum < groups.size() && groups[groupNum] != null) { | ||
| result += groups[groupNum]; | ||
| } else { | ||
| result += replacement.substring(i, j); | ||
| } | ||
| i = j - 1; // -1 because the for loop increments | ||
| continue; | ||
| } | ||
| } | ||
| result += replacement.substring(i, i + 1); | ||
| } | ||
| return result; | ||
| } |
There was a problem hiding this comment.
This method is also difficult to read/understand - there are a lot of loops with index variables being manipulated, and it's not clear what the intent is for a lot of the logic. Can we include some explanatory comments, along with examples of what this method is trying to do?
There was a problem hiding this comment.
Addressed. Added a method docblock with a worked example (SSN straddling a chunk boundary) and inline comments explaining each step. The biggest offender for "index variables being manipulated" was expandReplacement — I replaced its hand-rolled i/j character scanner with a precompiled Matcher on \$(\d+), which is much clearer and preserves the key property that a $N sequence inside a captured value is not re-expanded. Logic-equivalent and covered by existing + new tests.
| // Pass 1: Find all matches using overlapping chunks, deduplicating by start position. | ||
| // When the same start position is found by multiple chunks, keep the longest match | ||
| // (the chunk with more trailing context produces the most accurate match). | ||
| Map<Integer, Integer> endByStart = new Map<Integer, Integer>(); | ||
| Map<Integer, List<String>> groupsByStart = new Map<Integer, List<String>>(); | ||
|
|
||
| for (Integer i = 0; i < line.length(); i += step) { | ||
| Integer chunkEnd = Math.min(i + DATA_MASK_REGEX_CHUNK_SIZE, line.length()); | ||
| System.Matcher m = regex.matcher(line.substring(i, chunkEnd)); | ||
| while (m.find()) { | ||
| Integer absStart = i + m.start(); | ||
| Integer absEnd = i + m.end(); | ||
| if (!endByStart.containsKey(absStart) || absEnd > endByStart.get(absStart)) { | ||
| endByStart.put(absStart, absEnd); | ||
| List<String> groups = new List<String>(); | ||
| for (Integer g = 0; g <= m.groupCount(); g++) { | ||
| groups.add(m.group(g)); | ||
| } | ||
| groupsByStart.put(absStart, groups); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
This code is a bit difficult to understand why it's doing certain things. Can we include some explanatory comments, along with examples of what this method is trying to do?
There was a problem hiding this comment.
Added explanatory comments covering why chunks overlap, why matches are keyed by absolute start position, the longest-match dedup, the explicit sort (Apex Map.keySet() has no guaranteed iteration order), and the start < pos skip — with the worked example in the method docblock.
…king - Document DATA_MASK_REGEX_CHUNK_SIZE / OVERLAP_SIZE constants with the measured regex-limit findings (un-chunked cliff ~110K-220K chars depending on content shape; chunk size is a safety knob, not a performance lever; the LimitException is uncatchable). - Split the null / length guard in applyDataMaskRuleToChunkedText into two distinct if blocks. - Add explanatory docblocks + worked examples to applyDataMaskRuleToLongLine and inline comments for the overlapping-chunk match-finding logic. - Rewrite expandReplacement using a precompiled Matcher on $(\d+) instead of a hand-rolled index scanner, preserving the property that $N inside captured group values is not re-expanded. - Make overlap and chunk size overridable at runtime via optional LoggerParameter__mdt records (DataMaskRegexOverlapSize, DataMaskRegexChunkSize) with tradeoff guidance in their descriptions. - Add a test for the overlap override; rename the serialized-record-JSON regression test to match what it actually exercises. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Thanks for the review. Summary of changes pushed:
Happy to split any of this out into a separate PR if you'd prefer to keep this one narrowly scoped to the comment/readability changes. |
Summary
Fixes #639 —
System.LimitException: Regex too complicatedwhenapplyDataMaskRulesprocesses long strings (~35K+ chars).The existing chunking logic in
applyDataMaskRuleToLongLinehad three correctness issues that this PR addresses:expandReplacement: The old iterativeString.replace('$' + i, groups[i])reinterpreted$Npatterns inside captured group values (e.g., if group 1 capturedPRICE:$3, the$3would be replaced with group 3's value). The new implementation scans the replacement template left-to-right, only interpreting$Nin the template itself — matching Java'sMatcher.appendReplacementbehavior.Test plan
start < posbranch)$Ntokens inside captured group values are preserved literally (not re-expanded as group references)🤖 Generated with Claude Code