Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
<?xml version="1.0" encoding="UTF-8"?>
<CustomMetadata xmlns="http://soap.sforce.com/2006/04/metadata" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<label>Data Mask Regex: Chunk Size</label>
<protected>false</protected>
<values>
<field>Comments__c</field>
<value xsi:nil="true"/>
</values>
<values>
<field>Description__c</field>
<value xsi:type="xsd:string">When data masking is applied to a very long string, the value is processed in chunks of this many characters to avoid the Apex &apos;System.LimitException: Regex too complicated&apos; error (which Salesforce raises when a single regex evaluation is too expensive). Tradeoffs: a LARGER chunk size means fewer chunks and slightly less overlap re-scanning, but each regex evaluation runs against more text and is therefore more likely to hit the LimitException; a SMALLER chunk size is safer against the limit but increases chunk count and overlap re-scan overhead. The chunk size must also be larger than DataMaskRegexOverlapSize plus the longest value any enabled rule can match, or boundary values can be missed. The default (4000) is a deliberately conservative value, roughly 27x below even the worst-case measured failure point. With all four shipped rules applied single-pass (no chunking), the limit was hit between ~110K characters (realistic log-shaped text) and ~220K characters (dense structured input) — diluting matches with ordinary text makes it fail sooner, not later, because the limit is a regex-engine step budget rather than a character count. Note: that LimitException is uncatchable, so without chunking a single oversized log message fails the whole logging call. Benchmarking found chunk size to be a safety knob rather than a performance lever: processing CPU was effectively flat across chunk sizes from 1K to 64K, so raising this value yields no measurable speedup while moving closer to the failure point. Lower this if a custom rule still throws &apos;Regex too complicated&apos; at the default; only raise it after testing your specific rule regexes against representative data. When no record is configured, Nebula Logger falls back to its built-in default of 4000.</value>
</values>
<values>
<field>Value__c</field>
<value xsi:type="xsd:string">4000</value>
</values>
</CustomMetadata>
@@ -0,0 +1,17 @@
<?xml version="1.0" encoding="UTF-8"?>
<CustomMetadata xmlns="http://soap.sforce.com/2006/04/metadata" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<label>Data Mask Regex: Overlap Size</label>
<protected>false</protected>
<values>
<field>Comments__c</field>
<value xsi:nil="true"/>
</values>
<values>
<field>Description__c</field>
<value xsi:type="xsd:string">When data masking is applied to a very long string, the value is processed in overlapping chunks to avoid the Apex &apos;System.LimitException: Regex too complicated&apos; error. This integer controls how many characters adjacent chunks overlap, which guarantees that a sensitive value sitting on a chunk boundary is still fully contained within at least one chunk. This value MUST be greater than or equal to the longest value that any enabled LogEntryDataMaskRule__mdt regex can match. The default (20) covers the built-in rules (SSN ~11 chars, credit card ~19 chars with separators); increase it if you add custom rules that match longer values. When no record is configured, Nebula Logger falls back to its built-in default of 20.</value>
</values>
<values>
<field>Value__c</field>
<value xsi:type="xsd:string">20</value>
</values>
</CustomMetadata>
@@ -20,6 +20,59 @@ global with sharing class LogEntryEventBuilder {
private static final String HTTP_HEADER_FORMAT = '{0}: {1}';
private static final String NEW_LINE_DELIMITER = '\n';

// Data-masking regex is applied in overlapping chunks to avoid Apex's
// `System.LimitException: Regex too complicated`, which Salesforce throws when a single
// regex evaluation exceeds an internal step budget. See issue #639. Salesforce does not
// document the threshold; it depends on the input length and the specific rule's
// pattern. Each enabled rule is an independent `replaceAll` with its own step budget,
// so the single most expensive rule sets the cliff — running more rules does not lower
// it (rule count only adds cumulative CPU, a separate limit). Of the four shipped rules
// the Mastercard pattern (long alternation + `\3` backreference) is by far the worst;
// measured single-pass it alone throws at the same size as all four together.
// Critically, this LimitException is UNCATCHABLE — a try/catch around `replaceAll`
// does not trap it — so without chunking a single large log message makes the entire
// logging call fail unrecoverably.
//
// DATA_MASK_REGEX_CHUNK_SIZE: the max number of characters fed to a single
// `replaceAll`/`Matcher` evaluation. 4000 is a deliberately conservative fixed value,
// trading a higher chunk count for a wide safety margin below the measured failure point.
//
// Measured (Nebula Logger v4.17.3, current Apex regex engine, all four shipped rules
// applied single-pass to the whole blob — i.e. the pre-chunking `applyDataMaskRules`
// path; reproduced identically on a scratch org and a sandbox). The un-chunked cliff
// depends on content shape, not just length, because the limit is a regex-engine STEP
// budget (CPU at the cliff was a steady ~25-45 ms across every shape tested):
// - dense, structured near-miss tokens (best case): throws at ~220K chars
// - tokens diluted with inert text (realistic log shape, worst case): throws at ~110K
// i.e. diluting matches with ordinary text makes it fail SOONER (more engine steps per
// character), not later. The original #639 report at ~35K came from an older,
// lower-threshold engine. The default chunk size of 4000 is ~27x below even the
// worst-case cliff.
//
// Chunk size is a SAFETY knob, not a performance lever: with chunking enabled, CPU was
// flat (<6 ms variance) across chunk sizes 1K-64K and roughly linear in input length
// (200K chars masked in single-digit ms). Raising the chunk size yields no measurable
// speedup and only moves toward the cliff; lowering it adds margin at negligible cost.
// Lower it (via the override below) if custom rules push the failure point down.
// Overridable at runtime via the optional `LoggerParameter__mdt.DataMaskRegexChunkSize`
// record (no deploy required); the constant below is only the default.
//
// DATA_MASK_REGEX_OVERLAP_SIZE: adjacent chunks overlap by this many characters so a
// sensitive value that straddles a chunk boundary is still fully contained within at
// least one chunk. This value MUST be >= the longest sensitive value any data-mask rule
// can match; 20 covers the built-in rules (SSN ~11 chars, credit card ~19 chars with
// separators). It cannot be derived from the rule regexes (a pattern's max match length
// is not generally computable), so for orgs whose custom rules match longer values it is
// overridable at runtime via the optional `LoggerParameter__mdt.DataMaskRegexOverlapSize`
// record (no deploy required); the constant below is only the default.
@TestVisible
private static final Integer DATA_MASK_REGEX_CHUNK_SIZE = 4000;
@TestVisible
private static final Integer DATA_MASK_REGEX_OVERLAP_SIZE = 20;
Comment on lines +67 to +70
Owner


@anthonygiuliano could you share some insights on why these 2 values (4000 & 20) were chosen for the 2 constants? I think it's worth also adding some comment(s) here to explain the intent of these 2 constants & why these particular values are being used.

Author


Good question — digging in, the honest answer is there's no universal correct value, so I've added detailed comments on both constants and made them overridable.

Overlap (20): must be ≥ the longest value any enabled rule can match (else a value straddling a chunk boundary is missed). 20 covers the built-in rules (SSN ~11, credit card ~19 w/ separators). It can't be derived from a regex (max match length isn't generally computable), so for orgs with custom rules matching longer values it's overridable via an optional LoggerParameter__mdt.DataMaskRegexOverlapSize record.

Chunk size (4000): I benchmarked this (reproduced on a scratch org + a sandbox). With all shipped rules applied single-pass, the uncatchable Regex too complicated cliff is ~110K chars (realistic/diluted log shape) to ~220K (dense). It's a regex-engine step budget, not a char count — and notably the single worst rule (Mastercard's alternation + \3 backref) sets the cliff; the number of rules doesn't. Chunk size turned out to be a safety knob, not a perf lever: CPU was flat across 1K–64K. So 4000 is deliberately conservative (~27× under the worst-case cliff) at zero perf cost, and also overridable via LoggerParameter__mdt.DataMaskRegexChunkSize for orgs whose custom rules push the failure point down. Full rationale + measurements are in the constant comments and the CMDT Description__c fields.
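
The overlap invariant described above can be illustrated with a minimal sketch. It is written in plain Java rather than Apex (Apex's `Pattern` and `Matcher` mirror `java.util.regex`) so that it can be run outside an org. The class name `OverlapWindowSketch`, the method `maskInWindows`, and the literal-only `mask` argument are hypothetical simplifications, not the PR's implementation, and the SSN regex used in the usage note is an assumed stand-in for the shipped rule.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Nebula Logger.
final class OverlapWindowSketch {

    // Replaces every match of `regex` in `line` with the literal `mask`, scanning
    // windows of `chunkSize` characters that overlap by `overlap` characters.
    static String maskInWindows(String line, String regex, String mask, int chunkSize, int overlap) {
        if (overlap < 0 || overlap >= chunkSize) {
            // A non-positive step would make the scan loop below never advance.
            throw new IllegalArgumentException("overlap must be in [0, chunkSize)");
        }
        Pattern pattern = Pattern.compile(regex);
        int step = chunkSize - overlap;

        // Pass 1: absolute start -> absolute end. TreeMap keeps starts sorted.
        TreeMap<Integer, Integer> endByStart = new TreeMap<>();
        for (int i = 0; i < line.length(); i += step) {
            int windowEnd = Math.min(i + chunkSize, line.length());
            Matcher m = pattern.matcher(line.substring(i, windowEnd));
            while (m.find()) {
                // Matcher offsets are window-relative, so shift by the window start.
                // merge keeps the longer match when two windows agree on the start.
                endByStart.merge(i + m.start(), i + m.end(), Integer::max);
            }
        }

        // Pass 2: copy the gaps verbatim and substitute each non-overlapping match.
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Map.Entry<Integer, Integer> match : endByStart.entrySet()) {
            if (match.getKey() < pos) {
                continue; // starts inside a region that was already replaced
            }
            out.append(line, pos, match.getKey()).append(mask);
            pos = match.getValue();
        }
        return out.append(line.substring(pos)).toString();
    }
}
```

With the line `"user name 123-45-6789 end"`, the regex `\d{3}-\d{2}-\d{4}`, and a chunk size of 16, an overlap of 11 (the SSN's length) yields `"user name *** end"`, while an overlap of 4 leaves the line unchanged because no window contains the whole SSN.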

// Matches a `$N` capture-group reference (N = one or more digits) inside a replacement
// template. Used by expandReplacement(); safe to regex directly since replacement
// templates are short config values, never the long log payload.
private static final System.Pattern DATA_MASK_REPLACEMENT_TOKEN_PATTERN = System.Pattern.compile('\\$([0-9]+)');

private static String cachedOrganizationEnvironmentType;

@TestVisible
@@ -1150,12 +1203,176 @@ global with sharing class LogEntryEventBuilder {

for (LogEntryDataMaskRule__mdt dataMaskRule : CACHED_DATA_MASK_RULES.values()) {
if (dataMaskRule.IsEnabled__c) {
- dataInput = dataInput.replaceAll(dataMaskRule.SensitiveDataRegEx__c, dataMaskRule.ReplacementRegEx__c);
+ dataInput = applyDataMaskRuleToChunkedText(dataInput, dataMaskRule.SensitiveDataRegEx__c, dataMaskRule.ReplacementRegEx__c);
}
}
return dataInput;
}

// Chunk size defaults to DATA_MASK_REGEX_CHUNK_SIZE but can be tuned without a deploy via
// the optional LoggerParameter__mdt.DataMaskRegexChunkSize record. Lower it if a custom
// rule's regex still throws `Regex too complicated` at the default. Raising it only reduces
// chunk count (benchmarks showed no measurable CPU gain) and moves closer to the failure
// point, so test against representative data first. Resolved once per masking pass and
// threaded through so a single consistent value is used for every boundary calculation
// in that pass.
private static Integer getDataMaskRegexChunkSize() {
return LoggerParameter.getInteger('DataMaskRegexChunkSize', DATA_MASK_REGEX_CHUNK_SIZE);
}

private static String applyDataMaskRuleToChunkedText(String text, String sensitiveDataRegEx, String replacementRegEx) {
if (text == null) {
return text;
}

Integer chunkSize = getDataMaskRegexChunkSize();

// Short enough to mask in a single pass — no chunking needed.
if (text.length() <= chunkSize) {
return text.replaceAll(sensitiveDataRegEx, replacementRegEx);
}

List<String> lines = text.split('\n', -1);
if (lines.size() > 1) {
List<String> processedLines = new List<String>();
for (String line : lines) {
if (line.length() <= chunkSize) {
processedLines.add(line.replaceAll(sensitiveDataRegEx, replacementRegEx));
} else {
processedLines.add(applyDataMaskRuleToLongLine(line, sensitiveDataRegEx, replacementRegEx, chunkSize));
}
}
return String.join(processedLines, '\n');
}

return applyDataMaskRuleToLongLine(text, sensitiveDataRegEx, replacementRegEx, chunkSize);
}

/**
* Applies a single data-mask rule to one line that is too long to regex in a single pass.
*
* `String.replaceAll` cannot be called on the whole line (it would throw the
* `Regex too complicated` LimitException), so the line is scanned in overlapping
* windows of `chunkSize` characters (the caller-resolved value of
* DATA_MASK_REGEX_CHUNK_SIZE / its LoggerParameter override), advancing by `step`
* (= chunk size - overlap) each iteration. The overlap guarantees that any sensitive
* value sitting on a chunk boundary is fully visible in at least one window.
*
* Because windows overlap, the same match can be discovered more than once, and
* `Matcher` indexes are window-relative — so matches are collected with absolute
* positions, deduplicated, sorted, then applied left-to-right in a second pass.
*
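* Worked example (chunk size 16, overlap 11, step 5) masking the SSN `123-45-6789`
* (11 characters, so the overlap must be at least 11) with replacement `***`:
*
* line = "user name 123-45-6789 end" (length 25, SSN at [10..21))
* chunk0 = line[0..16) = "user name 123-45" -> SSN truncated, no match
* chunk1 = line[5..21) = "name 123-45-6789" -> matches at window 5 => absStart 10
* chunk2 = line[10..25) = "123-45-6789 end" -> matches at window 0 => absStart 10 (duplicate)
* chunk3 = line[15..25) = "5-6789 end" -> no match
* chunk4 = line[20..25) = "9 end" -> no match
* collected: { start 10 -> end 21 }
* result = line[0..10) + "***" + line[21..25) = "user name *** end"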
*
* Keeping the *longest* match for a given start (rather than the first one found)
* matters because an earlier window may truncate the value at its right edge,
* yielding a shorter, less accurate match than a later window with more context.
*/
private static String applyDataMaskRuleToLongLine(String line, String sensitiveDataRegEx, String replacementRegEx, Integer chunkSize) {
System.Pattern regex = System.Pattern.compile(sensitiveDataRegEx);
// Overlap defaults to DATA_MASK_REGEX_OVERLAP_SIZE but can be raised without a deploy
// via the optional LoggerParameter__mdt.DataMaskRegexOverlapSize record, for orgs whose
// custom data-mask rules match values longer than the built-in rules.
Integer overlapSize = LoggerParameter.getInteger('DataMaskRegexOverlapSize', DATA_MASK_REGEX_OVERLAP_SIZE);
Integer step = chunkSize - overlapSize;

// Pass 1: scan overlapping windows and record every match by its ABSOLUTE start
// position. endByStart maps an absolute start index -> absolute end index; groupsByStart
// keeps that match's capture groups (group 0 = full match) so the replacement template
// can be expanded later without re-running the regex.
Map<Integer, Integer> endByStart = new Map<Integer, Integer>();
Map<Integer, List<String>> groupsByStart = new Map<Integer, List<String>>();

for (Integer i = 0; i < line.length(); i += step) {
Integer chunkEnd = Math.min(i + chunkSize, line.length());
System.Matcher m = regex.matcher(line.substring(i, chunkEnd));
while (m.find()) {
// Matcher indexes are window-relative; add the window offset `i` to get
// absolute positions within the full line.
Integer absStart = i + m.start();
Integer absEnd = i + m.end();
// First time we see this start, OR a later (overlapping) window found a longer
// match starting at the same place — keep the longer one, it has more context.
if (!endByStart.containsKey(absStart) || absEnd > endByStart.get(absStart)) {
endByStart.put(absStart, absEnd);
List<String> groups = new List<String>();
for (Integer g = 0; g <= m.groupCount(); g++) {
groups.add(m.group(g));
}
groupsByStart.put(absStart, groups);
}
}
}

if (endByStart.isEmpty()) {
return line;
}

// Apex Map.keySet() has no guaranteed iteration order, so explicitly sort the start
// positions to process matches strictly left-to-right in Pass 2.
List<Integer> sortedStarts = new List<Integer>(endByStart.keySet());
sortedStarts.sort();

// Pass 2: walk the matches left-to-right, copying the untouched text between matches
// ("gaps") verbatim and substituting each match with its expanded replacement.
// `pos` tracks how far into the original line has been consumed.
String result = '';
Integer pos = 0;
for (Integer start : sortedStarts) {
// This match starts inside a region already replaced by an earlier (longer)
// match — skip it to avoid double-masking overlapping hits.
if (start < pos) {
continue;
}
result += line.substring(pos, start);
result += expandReplacement(replacementRegEx, groupsByStart.get(start));
pos = endByStart.get(start);
}
result += line.substring(pos);
return result;
}

/**
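* Expands `$N` capture-group references in a replacement template, similar in spirit to
* Java's `Matcher.appendReplacement` but not identical: Java expands `$0` to the whole
* match and throws on an out-of-range group, whereas this method leaves such tokens as text.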
*
* Only `$N` tokens that appear in the original `replacement` template are expanded;
* a `$N` sequence that happens to occur *inside a captured group's value* is copied
* through verbatim (this is why the result is built from the template, not produced by
* `String.replace` on the group values). An unresolvable token (`$0`, an out-of-range
* group, or a null group) is left as the literal text `$N`.
*
* Example: replacement `"[$1]-$2"`, groups [full, "A", "B"] -> `"[A]-B"`.
* Example: replacement `"$1"`, group 1 = `"price=$3"` -> `"price=$3"` (the `$3` in the
* captured value is NOT re-expanded).
*/
private static String expandReplacement(String replacement, List<String> groups) {
System.Matcher tokenMatcher = DATA_MASK_REPLACEMENT_TOKEN_PATTERN.matcher(replacement);
String result = '';
Integer pos = 0;
while (tokenMatcher.find()) {
// Copy the literal template text preceding this `$N` token.
result += replacement.substring(pos, tokenMatcher.start());
Integer groupNum = Integer.valueOf(tokenMatcher.group(1));
if (groupNum >= 1 && groupNum < groups.size() && groups[groupNum] != null) {
result += groups[groupNum];
} else {
// Not a resolvable group reference — preserve the token text literally.
result += tokenMatcher.group();
}
pos = tokenMatcher.end();
}
// Copy any literal template text after the last token.
result += replacement.substring(pos);
return result;
}
Comment on lines +1355 to +1374
Owner


This method is also difficult to read/understand - there are a lot of loops with index variables being manipulated, and it's not clear what the intent is for a lot of the logic. Can we include some explanatory comments, along with examples of what this method is trying to do?

Author


Addressed. Added a method docblock with a worked example (SSN straddling a chunk boundary) and inline comments explaining each step. The biggest offender for "index variables being manipulated" was expandReplacement — I replaced its hand-rolled i/j character scanner with a precompiled Matcher on \$(\d+), which is much clearer and preserves the key property that a $N sequence inside a captured value is not re-expanded. Logic-equivalent and covered by existing + new tests.
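
The "not re-expanded" property mentioned above can be shown with a small sketch. It is written in plain Java (whose `java.util.regex` classes Apex's `Pattern` and `Matcher` mirror) so it can be run outside an org. The class name `ReplacementTemplateSketch` and the method `expand` are hypothetical, and this illustrates the technique only, not the PR's exact code.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Nebula Logger.
final class ReplacementTemplateSketch {

    // `$` followed by one or more digits, e.g. `$1` or `$12`.
    private static final Pattern TOKEN = Pattern.compile("\\$([0-9]+)");

    // Expands `$N` tokens found in `template` only. Group values are appended as
    // literal text, so a `$N` inside a captured value is never re-expanded.
    // groups.get(0) is the full match; capture groups start at index 1.
    static String expand(String template, List<String> groups) {
        Matcher token = TOKEN.matcher(template);
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (token.find()) {
            // Copy the literal template text that precedes this token.
            out.append(template, pos, token.start());
            int n = Integer.parseInt(token.group(1));
            boolean resolvable = n >= 1 && n < groups.size() && groups.get(n) != null;
            // Unresolvable tokens ($0, out-of-range, null group) stay as literal text.
            out.append(resolvable ? groups.get(n) : token.group());
            pos = token.end();
        }
        // Copy any literal template text after the last token.
        return out.append(template.substring(pos)).toString();
    }
}
```

Because the scan runs over the template and never over the output, `expand("$1", List.of("price=$3", "price=$3"))` returns `"price=$3"` with the inner `$3` intact, and `expand("[$1]-$2", List.of("A-B", "A", "B"))` returns `"[A]-B"`.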


private static String getJson(SObject record, Boolean isRecordFieldStrippingEnabled) {
List<SObject> records = new List<SObject>{ record };
records = isRecordFieldStrippingEnabled == false ? records : stripInaccessible(records);
@@ -1404,7 +1621,7 @@ global with sharing class LogEntryEventBuilder {
String maskedTextValue = textValueToMask;
for (LogEntryDataMaskRule__mdt dataMaskRule : CACHED_DATA_MASK_RULES.values()) {
if (dataMaskRule.IsEnabled__c) {
- maskedTextValue = maskedTextValue.replaceAll(dataMaskRule.SensitiveDataRegEx__c, dataMaskRule.ReplacementRegEx__c);
+ maskedTextValue = applyDataMaskRuleToChunkedText(maskedTextValue, dataMaskRule.SensitiveDataRegEx__c, dataMaskRule.ReplacementRegEx__c);
}
}
