fix: escapeStringJson escapes DEL and C1 control characters #1005

Open
He-Pin wants to merge 1 commit into databricks:master from He-Pin:worktree-fix-escape-del-c1

He-Pin commented Jun 20, 2026

Motivation

std.escapeStringJson and std.escapeStringPython passed DEL (0x7F)
and C1 control characters (0x80-0x9F) through unescaped. Among control
characters, RFC 8259 only requires U+0000-U+001F to be escaped, so the
output was still valid JSON, but it diverged from go-jsonnet v0.22.0 and
C++ jsonnet, which both emit \uXXXX for this range.
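
To illustrate why the old output was still valid JSON, here is a small sketch using Python's standard `json` module rather than sjsonnet itself. With `ensure_ascii=False`, Python's encoder escapes only the quotation mark, the backslash, and U+0000-U+001F, so it serves as a stand-in for the previous pass-through behavior. The variable names are illustrative only.

```python
import json

# DEL (U+007F) followed by NEL (U+0085), a C1 control character.
raw = "\x7f\x85"

# With ensure_ascii=False, Python escapes only '"', '\\' and U+0000-U+001F,
# so DEL and C1 characters pass through literally.
literal_form = json.dumps(raw, ensure_ascii=False)

# The escaped spelling that, per the description above, go-jsonnet and
# C++ jsonnet emit for the same input.
escaped_form = '"\\u007f\\u0085"'

# Both spellings are valid JSON and decode to the same string.
assert json.loads(literal_form) == raw
assert json.loads(escaped_form) == raw
```

The two forms differ only in their byte-for-byte text, which is exactly the kind of divergence between implementations that this change is meant to remove.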

Modification

  • Extend the unsafe-character predicate in BaseRenderer.escape,
    BaseCharRenderer, and BaseByteRenderer to include 0x7F-0x9F
    (always escaped regardless of unicode flag).
  • Add DEL detection to JVM / JS / Native CharSWAR implementations:
    byte-level SWAR detects DEL via XOR with HOLE; 16-bit SWAR adds
    U16_DEL; scalar paths add range check.
  • Replace upstream RenderUtils.escapeByte / escapeChar fallback
    calls with local escape helpers that consistently emit \uXXXX for
    DEL / C1.
  • Tighten isAsciiJsonSafe threshold from >= 0x80 to >= 0x7F to
    exclude DEL from ASCII-safe classification (fixes std.base64Decode
    path).
  • Add renderer tests: DEL / C1 byte and char paths, long C1 strings,
    non-ASCII preservation.
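
The byte-level DEL detection mentioned in the list above can be sketched with the classic SWAR "has zero byte" trick: XOR every byte lane with 0x7F so that DEL becomes zero, then test for a zero lane. This is a minimal Python illustration, not the PR's Scala code; the constant and function names below are hypothetical, and the PR's own `HOLE` constant may be defined differently.

```python
ONES = 0x0101010101010101         # 0x01 in each of the 8 byte lanes
HIGHS = 0x8080808080808080        # 0x80 in each of the 8 byte lanes
DEL_PATTERN = 0x7F * ONES         # 0x7F in each lane (hypothetical name)


def has_zero_byte(word: int) -> bool:
    """True if any of the 8 byte lanes of a 64-bit word is 0x00."""
    # Masking with HIGHS keeps only the low 64 bits, so Python's unbounded
    # integers behave like a wrapping 64-bit subtraction here.
    return ((word - ONES) & ~word & HIGHS) != 0


def has_del_byte(word: int) -> bool:
    """True if any byte lane equals 0x7F: XOR maps DEL to zero."""
    return has_zero_byte(word ^ DEL_PATTERN)


def contains_del(data: bytes) -> bool:
    """Scan 8 bytes at a time, then fall back to a scalar check for the tail."""
    i, n = 0, len(data)
    while i + 8 <= n:
        if has_del_byte(int.from_bytes(data[i:i + 8], "little")):
            return True
        i += 8
    return 0x7F in data[i:]
```

Because every byte of a multi-byte UTF-8 sequence is 0x80 or higher, a 0x7F byte can only ever be a standalone DEL, which is why this check is safe on encoded byte arrays.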

Result

escapeStringJson and escapeStringPython now escape DEL and C1
control characters as \uXXXX, matching go-jsonnet v0.22.0 and C++
jsonnet behavior. Characters above 0x9F (NBSP, accented letters, CJK,
emoji) remain literal when unicode=false. All tests pass on JVM /
JS / Native with Scala 3.3.7 and 2.13.18.
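
The behavior described above can be modelled with a short reference sketch. It is written in Python for illustration and is not the sjsonnet implementation: the function names, the `escape_unicode` parameter, and the lowercase hex digits are all assumptions made for this example.

```python
SHORT_ESCAPES = {
    '"': '\\"', '\\': '\\\\', '\b': '\\b', '\f': '\\f',
    '\n': '\\n', '\r': '\\r', '\t': '\\t',
}


def needs_escape(cp: int) -> bool:
    """C0 controls, DEL and C1 controls are escaped regardless of the flag."""
    return cp < 0x20 or 0x7F <= cp <= 0x9F


def u_escape(cp: int) -> str:
    """Emit \\uXXXX, using a UTF-16 surrogate pair above U+FFFF."""
    if cp > 0xFFFF:
        cp -= 0x10000
        return "\\u%04x\\u%04x" % (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF))
    return "\\u%04x" % cp


def escape_string_json(s: str, escape_unicode: bool = False) -> str:
    out = ['"']
    for ch in s:
        cp = ord(ch)
        if ch in SHORT_ESCAPES:
            out.append(SHORT_ESCAPES[ch])
        elif needs_escape(cp) or (escape_unicode and cp > 0x9F):
            out.append(u_escape(cp))
        else:
            out.append(ch)  # printable ASCII, or non-ASCII left literal
    out.append('"')
    return "".join(out)


print(escape_string_json("a\x7fb"))   # "a\u007fb"
print(escape_string_json("\xa0é中"))  # NBSP, é and 中 stay literal
```

Round-tripping the result through any JSON parser should return the original string, which is a convenient property to assert in regression tests.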

He-Pin force-pushed the worktree-fix-escape-del-c1 branch from 8d5cb43 to 33c87a2 on June 20, 2026 09:42
He-Pin marked this pull request as draft on June 20, 2026 12:14
He-Pin marked this pull request as ready for review on June 20, 2026 12:49

Motivation:
std.escapeStringJson and std.escapeStringPython passed DEL (0x7F) and C1 control characters (0x80-0x9F) through literally. go-jsonnet and C++ jsonnet render these values as \uXXXX escapes, so matching them avoids downstream divergence.

The renderer fast paths also need to preserve UTF-8 semantics: byte-array scans must not treat UTF-8 continuation bytes in 0x80-0x9F as standalone C1 controls.
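
A small Python sketch of the hazard described above: the character `À` (U+00C0) is not a control character, yet its UTF-8 encoding contains the continuation byte 0x80. A byte-level scan that flags 0x80-0x9F would therefore misfire on it. The function names are hypothetical and only illustrate the distinction between byte-level and char-level scanning.

```python
def naive_byte_scan(data: bytes) -> bool:
    """Incorrect on UTF-8 input: treats any 0x7F-0x9F byte as a control."""
    return any(0x7F <= b <= 0x9F for b in data)


def byte_scan_del_only(data: bytes) -> bool:
    """Safe on UTF-8 input: 0x7F never occurs inside a multi-byte sequence."""
    return 0x7F in data


def char_scan(text: str) -> bool:
    """Works on code points, so DEL and real C1 controls are both detected."""
    return any(0x7F <= ord(c) <= 0x9F for c in text)


latin = "À"        # U+00C0, encoded as the bytes C3 80
nel = "\x85"       # U+0085 (a real C1 control), encoded as the bytes C2 85

assert latin.encode("utf-8") == b"\xc3\x80"
assert nel.encode("utf-8") == b"\xc2\x85"

assert naive_byte_scan(latin.encode("utf-8"))         # false positive
assert not byte_scan_del_only(latin.encode("utf-8"))  # no false positive
assert not char_scan(latin)                           # correct: not a control
assert char_scan(nel)                                 # correct: C1 detected
```

At the byte level, a real C1 control and an ordinary accented letter can both contain a byte in 0x80-0x9F, so C1 detection has to happen on code points before encoding.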

Modification:
- Escape DEL and C1 controls in BaseRenderer/BaseCharRenderer/BaseByteRenderer char-level paths.
- Keep byte-level CharSWAR scans limited to JSON-significant bytes plus DEL, while String/char[] scans detect DEL/C1 before UTF-8 encoding.
- Add JVM, JS, and Native CharSWAR updates and regression tests for DEL, C1, long strings, and non-ASCII passthrough.

Result:
DEL and C1 controls now render as \uXXXX without corrupting UTF-8 encoded non-ASCII data. Characters above U+009F still remain literal when escapeUnicode is false.
He-Pin force-pushed the worktree-fix-escape-del-c1 branch from 80912b5 to d8aaab4 on June 20, 2026 17:51