fix: stop corrupting Office Open XML files in security validation#41
Closed
pagatino-afk wants to merge 1 commit into
Closed
fix: stop corrupting Office Open XML files in security validation#41pagatino-afk wants to merge 1 commit into
pagatino-afk wants to merge 1 commit into
Conversation
Contributor
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
DOCX/XLSX/PPTX files are ZIP archives whose MIME type contains the
substring "xml" ("application/vnd.openxmlformats-..."). The format
dispatch in validate_file_content_security() matched this substring and
routed the binary ZIP through validate_xml_security(), which reads the
file as UTF-8 text (errors="ignore") and writes a sanitized .xml temp
copy. That destroys the ZIP structure, so MarkItDown's DocxConverter
then fails with "BadZipFile: File is not a zip file" — surfaced to
clients only as a generic "Tool execution failed".
Fix:
- Gate XML validation on real .xml/.xhtml extensions or exact XML MIME
types, and explicitly exclude ZIP-based office formats.
- Log the full traceback and surface the real error type/message in
convert_file_tool instead of an opaque "Conversion failed", so future
conversion errors are diagnosable.
Add regression tests asserting OOXML files stay valid ZIPs through
validation while genuine .xml files are still sanitized.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
65fb867 to
763a835
Compare
trsdn
added a commit
that referenced
this pull request
Jun 10, 2026
## Summary - fix OOXML validation so DOCX/XLSX/PPTX ZIP containers are not sent through XML sanitization - add `MARKITDOWN_SAFE_DIRS` support and make safe-directory initialization robust when home cannot be resolved - make `tools/list` schema compatible with Claude/Anthropic clients and stop responding to JSON-RPC notifications - configure stdio for UTF-8/LF startup behavior and stabilize fragile performance thresholds ## Validation - `python3 -m ruff check .` - `PATH="/Library/Frameworks/Python.framework/Versions/3.12/bin:$PATH" python3 -m pytest -q` Refs #36 Refs #38 Closes #40 ## Acknowledgements Thanks to @kdjkdjkdj for opening #37 and #39 with the Windows/MCP protocol fixes and configurable safe-directory work, and to @pagatino-afk for opening #41 with the Office Open XML corruption fix. Those contributions helped identify and validate the issues consolidated in this PR.
Owner
|
Thanks again @pagatino-afk for the Office Open XML corruption fix in this PR. We consolidated and merged this fix via #44, including acknowledgement there, so I'm closing this PR as superseded by #44. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #40.
What
validate_file_content_security()no longer matches the loose"xml" in mime_typesubstring. XML validation is now gated on real.xml/.xhtmlextensions or exact XML MIME types (application/xml,text/xml,application/xhtml+xml), with an explicit exclusion list for ZIP-based formats (.docx/.xlsx/.pptx/.docm/.xlsm/.pptm/.epub/.zip).convert_file_toolnow logs the full traceback (logger.exception) and surfaces the real error type/message (e.g.Conversion failed (BadZipFile): ...) instead of an opaqueConversion failed. Path/permission/dependency sanitization is preserved.tests/security/test_office_openxml_not_corrupted.pyasserts.docx/.xlsx/.pptxstay valid ZIPs through validation, and that genuine.xmlfiles are still sanitized.Verification
Full
markitdownconversion of a real 5.4 MB.docxnow returns ~25 KB of Markdown through the MCPconvert_filepath (previouslyBadZipFile).Notes for maintainers
KNOWN_ISSUES.mdlistsPptxConverter threw BadZipFile...under "Missing Dependencies" — this PR likely resolves that; worth re-testing and updating the doc..gitignoreline 133 (test_*.py) ignores all test files, so the new test was added withgit add -f.🤖 Generated with Claude Code