Skip to content

fix: stop corrupting Office Open XML files in security validation#41

Closed
pagatino-afk wants to merge 1 commit into
trsdn:mainfrom
pagatino-afk:fix/office-openxml-sanitizer-corruption
Closed

fix: stop corrupting Office Open XML files in security validation#41
pagatino-afk wants to merge 1 commit into
trsdn:mainfrom
pagatino-afk:fix/office-openxml-sanitizer-corruption

Conversation

@pagatino-afk

Copy link
Copy Markdown

Fixes #40.

What

  • Root fix: validate_file_content_security() no longer matches the loose "xml" in mime_type substring. XML validation is now gated on real .xml/.xhtml extensions or exact XML MIME types (application/xml, text/xml, application/xhtml+xml), with an explicit exclusion list for ZIP-based formats (.docx/.xlsx/.pptx/.docm/.xlsm/.pptm/.epub/.zip).
  • Diagnosability: convert_file_tool now logs the full traceback (logger.exception) and surfaces the real error type/message (e.g. Conversion failed (BadZipFile): ...) instead of an opaque Conversion failed. Path/permission/dependency sanitization is preserved.
  • Tests: tests/security/test_office_openxml_not_corrupted.py asserts .docx/.xlsx/.pptx stay valid ZIPs through validation, and that genuine .xml files are still sanitized.

Verification

.docx: unchanged=True still_zip=True -> PASS
.xlsx: unchanged=True still_zip=True -> PASS
.pptx: unchanged=True still_zip=True -> PASS
.xml regression: rerouted=True       -> PASS

Full markitdown conversion of a real 5.4 MB .docx now returns ~25 KB of Markdown through the MCP convert_file path (previously BadZipFile).

Notes for maintainers

  • KNOWN_ISSUES.md lists PptxConverter threw BadZipFile... under "Missing Dependencies" — this PR likely resolves that; worth re-testing and updating the doc.
  • Heads-up: .gitignore line 133 (test_*.py) ignores all test files, so the new test was added with git add -f.

🤖 Generated with Claude Code

@pagatino-afk pagatino-afk requested a review from trsdn as a code owner June 7, 2026 15:41
@github-actions

github-actions Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

DOCX/XLSX/PPTX files are ZIP archives whose MIME type contains the
substring "xml" ("application/vnd.openxmlformats-..."). The format
dispatch in validate_file_content_security() matched this substring and
routed the binary ZIP through validate_xml_security(), which reads the
file as UTF-8 text (errors="ignore") and writes a sanitized .xml temp
copy. That destroys the ZIP structure, so MarkItDown's DocxConverter
then fails with "BadZipFile: File is not a zip file" — surfaced to
clients only as a generic "Tool execution failed".

Fix:
- Gate XML validation on real .xml/.xhtml extensions or exact XML MIME
  types, and explicitly exclude ZIP-based office formats.
- Log the full traceback and surface the real error type/message in
  convert_file_tool instead of an opaque "Conversion failed", so future
  conversion errors are diagnosable.

Add regression tests asserting OOXML files stay valid ZIPs through
validation while genuine .xml files are still sanitized.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@pagatino-afk pagatino-afk force-pushed the fix/office-openxml-sanitizer-corruption branch from 65fb867 to 763a835 Compare June 7, 2026 15:44
trsdn added a commit that referenced this pull request Jun 10, 2026
## Summary
- fix OOXML validation so DOCX/XLSX/PPTX ZIP containers are not sent through XML sanitization
- add `MARKITDOWN_SAFE_DIRS` support and make safe-directory initialization robust when home cannot be resolved
- make `tools/list` schema compatible with Claude/Anthropic clients and stop responding to JSON-RPC notifications
- configure stdio for UTF-8/LF startup behavior and stabilize fragile performance thresholds

## Validation
- `python3 -m ruff check .`
- `PATH="/Library/Frameworks/Python.framework/Versions/3.12/bin:$PATH" python3 -m pytest -q`

Refs #36
Refs #38
Closes #40

## Acknowledgements
Thanks to @kdjkdjkdj for opening #37 and #39 with the Windows/MCP protocol fixes and configurable safe-directory work, and to @pagatino-afk for opening #41 with the Office Open XML corruption fix. Those contributions helped identify and validate the issues consolidated in this PR.
@trsdn

trsdn commented Jun 10, 2026

Copy link
Copy Markdown
Owner

Thanks again @pagatino-afk for the Office Open XML corruption fix in this PR. We consolidated and merged this fix via #44, including acknowledgement there, so I'm closing this PR as superseded by #44.

@trsdn trsdn closed this Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

convert_file corrupts DOCX/XLSX/PPTX: security validation reroutes ZIP-based OOXML through the XML sanitizer (BadZipFile)

2 participants