feat(doc): strip server-rejected unsafe chars on markdown write path #465

Merged
PeterGuy326 merged 1 commit into DingTalk-Real-AI:main from PeterGuy326:feat/doc-strip-input-unsafe-chars on Jun 16, 2026

@PeterGuy326 (Collaborator)

What

Harden the dws doc create / dws doc update write boundary so that control characters and dangerous Unicode codepoints are stripped from content before it is sent, instead of causing the server to reject the request.

Why

Today the doc write path only strips a fixed dangerous-Unicode set, and only on the JSONML branch. The Markdown branch sends raw content straight through. When content carries:

  • C0 control characters (anything < 0x20 other than \t / \n) or DEL (0x7F), or
  • a few zero-width / line-separator codepoints (U+200D, U+2028, U+2029)

…the server-side RejectControlChars validator rejects the request and the command fails. This is common with LLM-generated or copy-pasted text (e.g. a stray \x00, a zero-width joiner, or a Windows \r).
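To make the rejected range concrete, here is a minimal sketch of a check covering the characters listed above. The helper name firstRejectedRune is hypothetical, and this is not the server's RejectControlChars implementation, only an approximation built from this description.

```go
package main

import "fmt"

// firstRejectedRune is a hypothetical helper (not part of dws or the server).
// It reports the first rune that, per the description above, the server-side
// validator would reject, along with its byte offset in s.
func firstRejectedRune(s string) (rune, int, bool) {
	for i, r := range s {
		switch {
		case r == '\t' || r == '\n':
			// Tab and newline are legitimate document text.
			continue
		case r < 0x20 || r == 0x7F:
			// Remaining C0 controls (including \r) and DEL.
			return r, i, true
		case r == 0x200D || r == 0x2028 || r == 0x2029:
			// Zero-width joiner, line separator, paragraph separator.
			return r, i, true
		}
	}
	return 0, -1, false
}

func main() {
	samples := []string{"ok\tline\n", "nul\x00byte", "crlf\r\n", "zwj\u200dhere"}
	for _, s := range samples {
		if r, i, bad := firstRejectedRune(s); bad {
			fmt.Printf("%q: rejected U+%04X at byte %d\n", s, r, i)
		} else {
			fmt.Printf("%q: clean\n", s)
		}
	}
}
```

Note that a Windows line ending trips the check on \r alone; the \n that follows it is allowed.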

Changes

  • Rename stripDocDangerousUnicode → stripDocInputUnsafe and extend it to also drop C0 controls (except \t / \n) and DEL, matching the existing authoritative apiclient.rejectDangerousChars definition (so the CLI strips exactly what the API layer would reject).
  • Add U+200D, U+2028, U+2029 to the dangerous-Unicode set to cover the full server-rejected range.
  • Apply the strip on the Markdown write path (doc create / doc update) and the JSONML node path — previously only the JSONML body path was covered.
  • Add unit tests (TestStripDocInputUnsafe) using explicit \u / \x escapes so the offending codepoints are unambiguous in source.

Tab and newline are intentionally preserved as legitimate document text.
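The stripping behaviour described in the list above could look roughly like the following. This is a sketch under assumptions, not the merged code: stripUnsafeSketch and extraUnsafe are illustrative names, and the set holds only the three codepoints named in this PR, since the pre-existing dangerous-Unicode set is not enumerated here.

```go
package main

import (
	"fmt"
	"strings"
)

// extraUnsafe is illustrative: it contains only the three codepoints this PR
// names. The real dangerous-Unicode set has more entries that are not listed
// in the PR text.
var extraUnsafe = map[rune]bool{
	0x200D: true, // zero-width joiner
	0x2028: true, // line separator
	0x2029: true, // paragraph separator
}

// stripUnsafeSketch drops C0 controls (except tab and newline), DEL, and the
// codepoints in extraUnsafe. Returning -1 from the strings.Map callback
// removes the rune from the output.
func stripUnsafeSketch(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r == '\t' || r == '\n':
			return r
		case r < 0x20 || r == 0x7F:
			return -1
		case extraUnsafe[r]:
			return -1
		}
		return r
	}, s)
}

func main() {
	// Explicit \x and \u escapes keep the offending codepoints visible in source.
	in := "a\x00b\r\n\tc\u200dd\u2028e\x7f"
	fmt.Printf("%q\n", stripUnsafeSketch(in)) // prints "ab\n\tcde"
}
```

Stripping rather than rejecting means a CRLF line ending is silently normalised to LF, which is the trade-off the PR accepts in exchange for the command not failing.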

Test

  • go test ./internal/helpers/ — pass
  • go test ./internal/helpers/docjsonml/ — pass
  • go build ./... — clean
  • go vet ./internal/helpers/ — clean

Ported from dws-wukong (feat: add input safe-character filtering).

The doc write boundary only stripped a fixed dangerous-Unicode set, and only
on the JSONML path. C0 control characters (except tab/newline), DEL (0x7F),
and a few zero-width / line-separator codepoints still reached the server,
where RejectControlChars rejects them, so doc create/update failed on content
containing such characters (common with LLM-generated or copy-pasted text).

- Rename stripDocDangerousUnicode -> stripDocInputUnsafe and extend it to drop
  C0 controls (except \t and \n) and DEL, matching apiclient.rejectDangerousChars.
- Add U+200D, U+2028, U+2029 to the dangerous-Unicode set so it covers the
  full server-rejected range.
- Apply the strip on the markdown write path (doc create/update) and the JSONML
  node path, not just the JSONML body.
- Add unit tests for stripDocInputUnsafe.

Ported from dws-wukong (feat: add input safe-character filtering).
PeterGuy326 merged commit ff33114 into DingTalk-Real-AI:main on Jun 16, 2026
6 checks passed
PeterGuy326 deleted the feat/doc-strip-input-unsafe-chars branch on June 16, 2026 02:57