Move HTML→Markdown and Unicode-abuse sanitization into Post #21

Merged: odrobnik merged 3 commits into main from add-swifttext-sanitizer on Apr 6, 2026
@odrobnik (Contributor) commented Apr 5, 2026

Summary

  • Bumps SwiftText minimum version to 1.1.7 (required for SwiftTextCore product)
  • Adds SwiftTextCore to the PostServer target
  • Moves Unicode-abuse sanitization fully into Post
  • Sanitizes subject and body output across CLI, MCP, and IDLE hook payloads
  • Adds optional unicodeAbuse description to JSON/MCP/hook outputs when sanitization modified content
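
For orientation, the two manifest changes above might look roughly like the following Package.swift fragment. This is a sketch, not the actual diff: the repository URL and the package identifier `SwiftText` are assumptions, and the target's other dependencies are left out.

```swift
// Package.swift (fragment, illustrative only)
dependencies: [
    // Assumed URL; 1.1.7 is the minimum stated above, needed for the SwiftTextCore product.
    .package(url: "https://github.com/Cocoanetics/SwiftText.git", from: "1.1.7"),
],
targets: [
    .target(
        name: "PostServer",
        dependencies: [
            // Existing dependencies unchanged (not shown here).
            .product(name: "SwiftTextHTML", package: "SwiftText"),
            // New in this PR: provides UnicodeAbuseSanitizer.
            .product(name: "SwiftTextCore", package: "SwiftText"),
        ]
    ),
]
```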

Motivation

SwiftMail's MessagePart.markdownContent() (which performed HTML→Markdown + sanitization) is being removed in the companion PR Cocoanetics/SwiftMail#remove-swifttext. Post already owned HTMLToMarkdown via SwiftTextHTML; this PR completes the move by also preserving Unicode-abuse sanitization inside Post itself.

Oliver also wanted sanitization to apply consistently anywhere Post exposes extracted content — not just markdown conversion — and to surface a simple machine-readable/human-readable hint when content had to be cleaned.

Behavior

Sanitization

UnicodeAbuseSanitizer now removes abusive Unicode from exported content, including:

  • bidirectional override/isolate control characters (e.g. U+202E)
  • excessive combining marks (zalgo-style clusters)
  • suspiciously long ZWJ sequences
  • abusive Unicode tag sequences
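
To make the first two categories concrete, here is a minimal standalone sketch of this kind of filtering. It is not the SwiftTextCore implementation: the names `SanitizedText`, `sanitize`, and `maxMarksPerCluster`, the exact code point ranges, and the threshold of 4 marks are all assumptions chosen for illustration. ZWJ and tag-sequence handling are omitted.

```swift
// Hypothetical sketch; UnicodeAbuseSanitizer's real API and rules may differ.
struct SanitizedText {
    let text: String
    let notes: [String]   // human-readable descriptions of what was changed
}

/// Bidirectional embedding/override controls (U+202A...U+202E)
/// and isolate controls (U+2066...U+2069). Ranges are an assumption.
func isBidiControl(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.value {
    case 0x202A...0x202E, 0x2066...0x2069: return true
    default: return false
    }
}

/// Combining marks are the general categories Mn, Mc, and Me.
func isCombiningMark(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.properties.generalCategory {
    case .nonspacingMark, .spacingMark, .enclosingMark: return true
    default: return false
    }
}

func sanitize(_ input: String, maxMarksPerCluster: Int = 4) -> SanitizedText {
    var output = String.UnicodeScalarView()
    var removedBidi = false
    var trimmedMarks = false
    var markRun = 0   // consecutive combining marks seen since the last base scalar

    for scalar in input.unicodeScalars {
        if isBidiControl(scalar) {
            removedBidi = true
            continue   // drop the control character entirely
        }
        if isCombiningMark(scalar) {
            markRun += 1
            if markRun > maxMarksPerCluster {
                trimmedMarks = true
                continue   // drop marks beyond the per-cluster limit
            }
        } else {
            markRun = 0
        }
        output.append(scalar)
    }

    var notes: [String] = []
    if removedBidi { notes.append("Removed bidirectional control characters") }
    if trimmedMarks { notes.append("Trimmed excessive combining marks") }
    return SanitizedText(text: String(output), notes: notes)
}
```

Returning the notes alongside the cleaned text lets a caller decide whether anything needs reporting; an empty `notes` array means the input passed through unchanged.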

Covered outputs

This now applies to:

  • CLI fetch (text, html, and markdown body modes)
  • CLI eml (text, html, and markdown body modes)
  • CLI list --json and search --json
  • CLI plain-text subject/header display
  • IDLE hook payload subject + markdown
  • MCP listMessages
  • MCP searchMessages
  • MCP fetchMessage
  • reply quoting via draft --replying-to

New metadata

When sanitization modifies subject and/or body, outputs may include:

"unicodeAbuse": "Subject: Removed bidirectional control characters; Body: Trimmed excessive combining marks"

No field is emitted when nothing had to be changed.
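
One way to get this "omit when unchanged" behavior is an optional property on an Encodable type, since Swift's synthesized encoding skips nil optionals. The sketch below is illustrative only: `MessageSummary`, its `subject` and `markdown` fields, and `unicodeAbuseDescription` are hypothetical names, and joining multiple notes for one part with ", " is an assumption. Only the `unicodeAbuse` key and the "Subject: …; Body: …" format come from the example above.

```swift
import Foundation

// Hypothetical output shape; only the `unicodeAbuse` key is taken from this PR.
struct MessageSummary: Encodable {
    let subject: String
    let markdown: String
    let unicodeAbuse: String?   // nil is left out of the JSON by the synthesized encoder
}

/// Joins per-part notes into one description, or returns nil when nothing changed.
func unicodeAbuseDescription(subjectNotes: [String], bodyNotes: [String]) -> String? {
    var parts: [String] = []
    if !subjectNotes.isEmpty {
        parts.append("Subject: " + subjectNotes.joined(separator: ", "))
    }
    if !bodyNotes.isEmpty {
        parts.append("Body: " + bodyNotes.joined(separator: ", "))
    }
    return parts.isEmpty ? nil : parts.joined(separator: "; ")
}
```

Because the property is optional rather than an empty string, consumers can test for the key's presence instead of parsing its value.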

Notes

  • Raw RFC822 fetch remains raw by design
  • Reporting sanitization through the separate unicodeAbuse field keeps IMAP flags semantically separate from local content-analysis results
  • Existing output shapes remain largely intact; the main additive schema change is optional unicodeAbuse

Build / test

swift build -c release
swift test ✅ (8 tests, 0 failures)

odrobnik and others added 3 commits April 5, 2026 19:40
SwiftMail no longer performs HTML→Markdown conversion; Post now owns
that responsibility end-to-end.

Changes:
- Bump SwiftText minimum to 1.1.7 (required for SwiftTextCore product)
- Add SwiftTextCore to PostServer target (provides UnicodeAbuseSanitizer)
- MessageDetail.markdown(): apply UnicodeAbuseSanitizer to the converted
  markdown (and to plain-text fallback), matching the sanitization that
  SwiftMail's now-removed markdownContent() used to perform
- Update Package.resolved (SwiftText 1.1.6 → 1.1.7)

Behavior is equivalent or better: bidi-override scalars and
zalgo-style combining-mark clusters are stripped before the string is
returned to callers/AI agents.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@odrobnik odrobnik merged commit f6c4719 into main Apr 6, 2026
2 checks passed
@odrobnik odrobnik deleted the add-swifttext-sanitizer branch April 6, 2026 05:28