Fix: Chunked regex with overlapping Matcher for boundary-safe data masking by anthonygiuliano · Pull Request #959 · jongpie/NebulaLogger

anthonygiuliano · 2026-03-03T23:31:40Z

Summary

Fixes #639 — System.LimitException: Regex too complicated when applyDataMaskRules processes long strings (~35K+ chars).

The existing chunking logic in applyDataMaskRuleToLongLine had three correctness issues that this PR addresses:

Longest-match deduplication: When two overlapping chunks find a match at the same start position, the old code kept whichever was found first (shorter context). Now keeps the longest match, since the chunk with more trailing context produces the most accurate result.
Sorted left-to-right processing: The old code processed matches in discovery order across chunks, which isn't guaranteed to be left-to-right. Now explicitly sorts start positions and skips matches consumed by previous replacements.
Single-pass expandReplacement: The old iterative String.replace('$' + i, groups[i]) reinterpreted $N patterns inside captured group values (e.g., if group 1 captured PRICE:$3, the $3 would be replaced with group 3's value). The new implementation scans the replacement template left-to-right, only interpreting $N in the template itself — matching Java's Matcher.appendReplacement behavior.

Test plan

13 new targeted tests covering:
- SSN in the overlap zone (no double-masking)
- Multiple SSNs near the same chunk boundary
- SSN at start/end of long strings
- SSN exactly at the chunk step position
- Credit card straddling a chunk boundary
- Exact chunk-size and chunk-size+1 boundary conditions
- Multiline where one line is exactly chunk size
- Longer match wins when two chunks find same start
- Left-to-right ordering across chunks (Map.keySet() has no guaranteed order in Apex)
- Overlapping match skip (start < pos branch)
- $N tokens inside captured group values are preserved literally (not re-expanded as group references)

🤖 Generated with Claude Code

…jongpie#639) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jongpie

@anthonygiuliano thanks for working on this! Conceptually, these changes seem like they'll help with the data masking issues, but some of the specific code changes are a little difficult to understand. I added a few questions to the PR - would you mind taking a look whenever you have some time?

jongpie · 2026-05-08T16:46:47Z

+  @TestVisible
+  private static final Integer DATA_MASK_REGEX_CHUNK_SIZE = 4000;
+  @TestVisible
+  private static final Integer DATA_MASK_REGEX_OVERLAP_SIZE = 20;


@anthonygiuliano could you share some insights on why these 2 values (4000 & 20) were chosen for the 2 constants? I think it's worth also adding some comment(s) here to explain the intent of these 2 constants & why these particular values are being used.

Good question — digging in, the honest answer is there's no universal correct value, so I've added detailed comments on both constants and made them overridable.

Overlap (20): must be ≥ the longest value any enabled rule can match (else a value straddling a chunk boundary is missed). 20 covers the built-in rules (SSN ~11, credit card ~19 w/ separators). It can't be derived from a regex (max match length isn't generally computable), so for orgs with custom rules matching longer values it's overridable via an optional LoggerParameter__mdt.DataMaskRegexOverlapSize record.

Chunk size (4000): I benchmarked this (reproduced on a scratch org + a sandbox). With all shipped rules applied single-pass, the uncatchable Regex too complicated cliff is ~110K chars (realistic/diluted log shape) to ~220K (dense). It's a regex-engine step budget, not a char count — and notably the single worst rule (Mastercard's alternation + \3 backref) sets the cliff; the number of rules doesn't. Chunk size turned out to be a safety knob, not a perf lever: CPU was flat across 1K–64K. So 4000 is deliberately conservative (~27× under the worst-case cliff) at zero perf cost, and also overridable via LoggerParameter__mdt.DataMaskRegexChunkSize for orgs whose custom rules push the failure point down. Full rationale + measurements are in the constant comments and the CMDT Description__c fields.

jongpie · 2026-05-08T16:49:07Z

+    if (text == null || text.length() <= DATA_MASK_REGEX_CHUNK_SIZE) {
+      return text == null ? text : text.replaceAll(sensitiveDataRegEx, replacementRegEx);
+    }


I think this would read a little better if there were 2 separate/distinct if blocks:

Suggested change

if (text == null || text.length() <= DATA_MASK_REGEX_CHUNK_SIZE) {

return text == null ? text : text.replaceAll(sensitiveDataRegEx, replacementRegEx);

}

if (text == null) {

return text null;

}

if (text.length() <= DATA_MASK_REGEX_CHUNK_SIZE) {

return text.replaceAll(sensitiveDataRegEx, replacementRegEx);

}

Done — split into two distinct if blocks as suggested (used return text; for the null case).

jongpie · 2026-05-08T17:03:27Z

+  private static String expandReplacement(String replacement, List<String> groups) {
+    String result = '';
+    for (Integer i = 0; i < replacement.length(); i++) {
+      if (replacement.substring(i, i + 1) == '$' && i + 1 < replacement.length()) {
+        // Parse the group number following '$'
+        Integer j = i + 1;
+        while (j < replacement.length() && replacement.substring(j, j + 1) >= '0' && replacement.substring(j, j + 1) <= '9') {
+          j++;
+        }
+        if (j > i + 1) {
+          Integer groupNum = Integer.valueOf(replacement.substring(i + 1, j));
+          if (groupNum >= 1 && groupNum < groups.size() && groups[groupNum] != null) {
+            result += groups[groupNum];
+          } else {
+            result += replacement.substring(i, j);
+          }
+          i = j - 1; // -1 because the for loop increments
+          continue;
+        }
+      }
+      result += replacement.substring(i, i + 1);
+    }
+    return result;
+  }


This method is also difficult to read/understand - there are a lot of loops with index variables being manipulated, and it's not clear what the intent is for a lot of the logic. Can we include some explanatory comments, along with examples of what this method is trying to do?

Addressed. Added a method docblock with a worked example (SSN straddling a chunk boundary) and inline comments explaining each step. The biggest offender for "index variables being manipulated" was expandReplacement — I replaced its hand-rolled i/j character scanner with a precompiled Matcher on \$(\d+), which is much clearer and preserves the key property that a $N sequence inside a captured value is not re-expanded. Logic-equivalent and covered by existing + new tests.

jongpie · 2026-05-08T17:14:01Z

+    // Pass 1: Find all matches using overlapping chunks, deduplicating by start position.
+    // When the same start position is found by multiple chunks, keep the longest match
+    // (the chunk with more trailing context produces the most accurate match).
+    Map<Integer, Integer> endByStart = new Map<Integer, Integer>();
+    Map<Integer, List<String>> groupsByStart = new Map<Integer, List<String>>();
+
+    for (Integer i = 0; i < line.length(); i += step) {
+      Integer chunkEnd = Math.min(i + DATA_MASK_REGEX_CHUNK_SIZE, line.length());
+      System.Matcher m = regex.matcher(line.substring(i, chunkEnd));
+      while (m.find()) {
+        Integer absStart = i + m.start();
+        Integer absEnd = i + m.end();
+        if (!endByStart.containsKey(absStart) || absEnd > endByStart.get(absStart)) {
+          endByStart.put(absStart, absEnd);
+          List<String> groups = new List<String>();
+          for (Integer g = 0; g <= m.groupCount(); g++) {
+            groups.add(m.group(g));
+          }
+          groupsByStart.put(absStart, groups);
+        }
+      }
+    }


This code is a bit difficult to understand why it's doing certain things. Can we include some explanatory comments, along with examples of what this method is trying to do?

Added explanatory comments covering why chunks overlap, why matches are keyed by absolute start position, the longest-match dedup, the explicit sort (Apex Map.keySet() has no guaranteed iteration order), and the start < pos skip — with the worked example in the method docblock.

…king - Document DATA_MASK_REGEX_CHUNK_SIZE / OVERLAP_SIZE constants with the measured regex-limit findings (un-chunked cliff ~110K-220K chars depending on content shape; chunk size is a safety knob, not a performance lever; the LimitException is uncatchable). - Split the null / length guard in applyDataMaskRuleToChunkedText into two distinct if blocks. - Add explanatory docblocks + worked examples to applyDataMaskRuleToLongLine and inline comments for the overlapping-chunk match-finding logic. - Rewrite expandReplacement using a precompiled Matcher on $(\d+) instead of a hand-rolled index scanner, preserving the property that $N inside captured group values is not re-expanded. - Make overlap and chunk size overridable at runtime via optional LoggerParameter__mdt records (DataMaskRegexOverlapSize, DataMaskRegexChunkSize) with tradeoff guidance in their descriptions. - Add a test for the overlap override; rename the serialized-record-JSON regression test to match what it actually exercises. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

anthonygiuliano · 2026-05-18T00:26:30Z

Thanks for the review. Summary of changes pushed:

Constants: documented in detail + made overridable via LoggerParameter__mdt records (DataMaskRegexOverlapSize, DataMaskRegexChunkSize). Rationale: the right values depend on an org's custom rule match-lengths and payload shapes, so hard-coding wasn't appropriate — conservative measured defaults plus an escape hatch. The "why 4000 / why 20" detail is in the L26 thread.
Readability: docblocks + worked examples on applyDataMaskRuleToLongLine; expandReplacement rewritten from the hand-rolled index scanner to a Matcher (logic-equivalent, covered by tests).
Tests: added an overlap-override test; renamed it_should_apply_data_mask_rule_to_digit_dense_string_without_limit_exception → it_should_mask_serialized_record_json_without_limit_exception (the input is serialized-record JSON, and benchmarking showed dense input is actually the safe case — diluted/realistic text trips the limit sooner).
All LogEntryEventBuilder_Tests pass (138).

Happy to split any of this out into a separate PR if you'd prefer to keep this one narrowly scoped to the comment/readability changes.

Chunked regex with overlapping Matcher for boundary-safe data masking (…

6fd74d8

…jongpie#639) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

anthonygiuliano mentioned this pull request Mar 6, 2026

System.LimitException: Regex too complicated from LogEntryEventBuilder.applyDataMaskRules: #639

Open

anthonygiuliano marked this pull request as ready for review March 12, 2026 03:22

anthonygiuliano requested a review from jongpie as a code owner March 12, 2026 03:22

jongpie mentioned this pull request Mar 27, 2026

Error : System.LimitException: Regex too complicated #968

Closed

jongpie added this to the Version 4.19.0: Spring '26 Release milestone May 6, 2026

jongpie reviewed May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix: Chunked regex with overlapping Matcher for boundary-safe data masking#959

Fix: Chunked regex with overlapping Matcher for boundary-safe data masking#959
anthonygiuliano wants to merge 2 commits into
jongpie:mainfrom
anthonygiuliano:fix/639-overlapping-chunk-data-masking

anthonygiuliano commented Mar 3, 2026

Uh oh!

jongpie left a comment

Uh oh!

jongpie May 8, 2026

Uh oh!

anthonygiuliano May 18, 2026

Uh oh!

jongpie May 8, 2026

Uh oh!

anthonygiuliano May 18, 2026

Uh oh!

jongpie May 8, 2026

Uh oh!

anthonygiuliano May 18, 2026

Uh oh!

jongpie May 8, 2026

Uh oh!

anthonygiuliano May 18, 2026

Uh oh!

anthonygiuliano commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-    if (text == null || text.length() <= DATA_MASK_REGEX_CHUNK_SIZE) {
-      return text == null ? text : text.replaceAll(sensitiveDataRegEx, replacementRegEx);
-    }
+    if (text == null) {
+      return text null;
+    }
+    if (text.length() <= DATA_MASK_REGEX_CHUNK_SIZE) {
+      return text.replaceAll(sensitiveDataRegEx, replacementRegEx);
+    }

Uh oh!

Conversation

anthonygiuliano commented Mar 3, 2026

Summary

Test plan

Uh oh!

jongpie left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anthonygiuliano commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants