Skip to content

Fix: Chunked regex with overlapping Matcher for boundary-safe data masking#959

Open
anthonygiuliano wants to merge 2 commits into
jongpie:mainfrom
anthonygiuliano:fix/639-overlapping-chunk-data-masking
Open

Fix: Chunked regex with overlapping Matcher for boundary-safe data masking#959
anthonygiuliano wants to merge 2 commits into
jongpie:mainfrom
anthonygiuliano:fix/639-overlapping-chunk-data-masking

Conversation

@anthonygiuliano
Copy link
Copy Markdown

Summary

Fixes #639System.LimitException: Regex too complicated when applyDataMaskRules processes long strings (~35K+ chars).

The existing chunking logic in applyDataMaskRuleToLongLine had three correctness issues that this PR addresses:

  • Longest-match deduplication: When two overlapping chunks find a match at the same start position, the old code kept whichever was found first (shorter context). Now keeps the longest match, since the chunk with more trailing context produces the most accurate result.
  • Sorted left-to-right processing: The old code processed matches in discovery order across chunks, which isn't guaranteed to be left-to-right. Now explicitly sorts start positions and skips matches consumed by previous replacements.
  • Single-pass expandReplacement: The old iterative String.replace('$' + i, groups[i]) reinterpreted $N patterns inside captured group values (e.g., if group 1 captured PRICE:$3, the $3 would be replaced with group 3's value). The new implementation scans the replacement template left-to-right, only interpreting $N in the template itself — matching Java's Matcher.appendReplacement behavior.

Test plan

  • 13 new targeted tests covering:
    • SSN in the overlap zone (no double-masking)
    • Multiple SSNs near the same chunk boundary
    • SSN at start/end of long strings
    • SSN exactly at the chunk step position
    • Credit card straddling a chunk boundary
    • Exact chunk-size and chunk-size+1 boundary conditions
    • Multiline where one line is exactly chunk size
    • Longer match wins when two chunks find same start
    • Left-to-right ordering across chunks (Map.keySet() has no guaranteed order in Apex)
    • Overlapping match skip (start < pos branch)
    • $N tokens inside captured group values are preserved literally (not re-expanded as group references)

🤖 Generated with Claude Code

Copy link
Copy Markdown
Owner

@jongpie jongpie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anthonygiuliano thanks for working on this! Conceptually, these changes seem like they'll help with the data masking issues, but some of the specific code changes are a little difficult to understand. I added a few questions to the PR - would you mind taking a look whenever you have some time?

Comment on lines +23 to +26
@TestVisible
private static final Integer DATA_MASK_REGEX_CHUNK_SIZE = 4000;
@TestVisible
private static final Integer DATA_MASK_REGEX_OVERLAP_SIZE = 20;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anthonygiuliano could you share some insights on why these 2 values (4000 & 20) were chosen for the 2 constants? I think it's worth also adding some comment(s) here to explain the intent of these 2 constants & why these particular values are being used.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question — digging in, the honest answer is there's no universal correct value, so I've added detailed comments on both constants and made them overridable.

Overlap (20): must be ≥ the longest value any enabled rule can match (else a value straddling a chunk boundary is missed). 20 covers the built-in rules (SSN ~11, credit card ~19 w/ separators). It can't be derived from a regex (max match length isn't generally computable), so for orgs with custom rules matching longer values it's overridable via an optional LoggerParameter__mdt.DataMaskRegexOverlapSize record.

Chunk size (4000): I benchmarked this (reproduced on a scratch org + a sandbox). With all shipped rules applied single-pass, the uncatchable Regex too complicated cliff is ~110K chars (realistic/diluted log shape) to ~220K (dense). It's a regex-engine step budget, not a char count — and notably the single worst rule (Mastercard's alternation + \3 backref) sets the cliff; the number of rules doesn't. Chunk size turned out to be a safety knob, not a perf lever: CPU was flat across 1K–64K. So 4000 is deliberately conservative (~27× under the worst-case cliff) at zero perf cost, and also overridable via LoggerParameter__mdt.DataMaskRegexChunkSize for orgs whose custom rules push the failure point down. Full rationale + measurements are in the constant comments and the CMDT Description__c fields.

Comment on lines +1165 to +1167
if (text == null || text.length() <= DATA_MASK_REGEX_CHUNK_SIZE) {
return text == null ? text : text.replaceAll(sensitiveDataRegEx, replacementRegEx);
}
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would read a little better if there were 2 separate/distinct if blocks:

Suggested change
if (text == null || text.length() <= DATA_MASK_REGEX_CHUNK_SIZE) {
return text == null ? text : text.replaceAll(sensitiveDataRegEx, replacementRegEx);
}
if (text == null) {
return text null;
}
if (text.length() <= DATA_MASK_REGEX_CHUNK_SIZE) {
return text.replaceAll(sensitiveDataRegEx, replacementRegEx);
}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — split into two distinct if blocks as suggested (used return text; for the null case).

Comment on lines +1235 to +1258
private static String expandReplacement(String replacement, List<String> groups) {
String result = '';
for (Integer i = 0; i < replacement.length(); i++) {
if (replacement.substring(i, i + 1) == '$' && i + 1 < replacement.length()) {
// Parse the group number following '$'
Integer j = i + 1;
while (j < replacement.length() && replacement.substring(j, j + 1) >= '0' && replacement.substring(j, j + 1) <= '9') {
j++;
}
if (j > i + 1) {
Integer groupNum = Integer.valueOf(replacement.substring(i + 1, j));
if (groupNum >= 1 && groupNum < groups.size() && groups[groupNum] != null) {
result += groups[groupNum];
} else {
result += replacement.substring(i, j);
}
i = j - 1; // -1 because the for loop increments
continue;
}
}
result += replacement.substring(i, i + 1);
}
return result;
}
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is also difficult to read/understand - there are a lot of loops with index variables being manipulated, and it's not clear what the intent is for a lot of the logic. Can we include some explanatory comments, along with examples of what this method is trying to do?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. Added a method docblock with a worked example (SSN straddling a chunk boundary) and inline comments explaining each step. The biggest offender for "index variables being manipulated" was expandReplacement — I replaced its hand-rolled i/j character scanner with a precompiled Matcher on \$(\d+), which is much clearer and preserves the key property that a $N sequence inside a captured value is not re-expanded. Logic-equivalent and covered by existing + new tests.

Comment on lines +1189 to +1210
// Pass 1: Find all matches using overlapping chunks, deduplicating by start position.
// When the same start position is found by multiple chunks, keep the longest match
// (the chunk with more trailing context produces the most accurate match).
Map<Integer, Integer> endByStart = new Map<Integer, Integer>();
Map<Integer, List<String>> groupsByStart = new Map<Integer, List<String>>();

for (Integer i = 0; i < line.length(); i += step) {
Integer chunkEnd = Math.min(i + DATA_MASK_REGEX_CHUNK_SIZE, line.length());
System.Matcher m = regex.matcher(line.substring(i, chunkEnd));
while (m.find()) {
Integer absStart = i + m.start();
Integer absEnd = i + m.end();
if (!endByStart.containsKey(absStart) || absEnd > endByStart.get(absStart)) {
endByStart.put(absStart, absEnd);
List<String> groups = new List<String>();
for (Integer g = 0; g <= m.groupCount(); g++) {
groups.add(m.group(g));
}
groupsByStart.put(absStart, groups);
}
}
}
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is a bit difficult to understand why it's doing certain things. Can we include some explanatory comments, along with examples of what this method is trying to do?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added explanatory comments covering why chunks overlap, why matches are keyed by absolute start position, the longest-match dedup, the explicit sort (Apex Map.keySet() has no guaranteed iteration order), and the start < pos skip — with the worked example in the method docblock.

…king

- Document DATA_MASK_REGEX_CHUNK_SIZE / OVERLAP_SIZE constants with the
  measured regex-limit findings (un-chunked cliff ~110K-220K chars
  depending on content shape; chunk size is a safety knob, not a
  performance lever; the LimitException is uncatchable).
- Split the null / length guard in applyDataMaskRuleToChunkedText into
  two distinct if blocks.
- Add explanatory docblocks + worked examples to
  applyDataMaskRuleToLongLine and inline comments for the
  overlapping-chunk match-finding logic.
- Rewrite expandReplacement using a precompiled Matcher on $(\d+)
  instead of a hand-rolled index scanner, preserving the property that
  $N inside captured group values is not re-expanded.
- Make overlap and chunk size overridable at runtime via optional
  LoggerParameter__mdt records (DataMaskRegexOverlapSize,
  DataMaskRegexChunkSize) with tradeoff guidance in their descriptions.
- Add a test for the overlap override; rename the serialized-record-JSON
  regression test to match what it actually exercises.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@anthonygiuliano
Copy link
Copy Markdown
Author

Thanks for the review. Summary of changes pushed:

  • Constants: documented in detail + made overridable via LoggerParameter__mdt records (DataMaskRegexOverlapSize, DataMaskRegexChunkSize). Rationale: the right values depend on an org's custom rule match-lengths and payload shapes, so hard-coding wasn't appropriate — conservative measured defaults plus an escape hatch. The "why 4000 / why 20" detail is in the L26 thread.
  • Readability: docblocks + worked examples on applyDataMaskRuleToLongLine; expandReplacement rewritten from the hand-rolled index scanner to a Matcher (logic-equivalent, covered by tests).
  • Tests: added an overlap-override test; renamed it_should_apply_data_mask_rule_to_digit_dense_string_without_limit_exceptionit_should_mask_serialized_record_json_without_limit_exception (the input is serialized-record JSON, and benchmarking showed dense input is actually the safe case — diluted/realistic text trips the limit sooner).
  • All LogEntryEventBuilder_Tests pass (138).

Happy to split any of this out into a separate PR if you'd prefer to keep this one narrowly scoped to the comment/readability changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

System.LimitException: Regex too complicated from LogEntryEventBuilder.applyDataMaskRules:

2 participants