Security Vulnerability Report
Type: XML External Entity (XXE) Injection
Severity: High
File: packages/markitdown/src/markitdown/converter_utils/docx/pre_process.py, line 45
Commit tested: 4a5340f
Description
The function _convert_omath_to_latex() in pre_process.py uses ET.fromstring() to parse XML content extracted from user-supplied DOCX files:
math_root = ET.fromstring(MATH_ROOT_TEMPLATE.format(str(tag)))
The tag variable originates from BeautifulSoup parsing of the DOCX document.xml, which is user-supplied content. A DOCX file is a ZIP archive containing XML — an attacker can craft a DOCX with malicious Office Math Markup (OMML) tags containing XML external entity declarations.
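To make the "ZIP archive containing XML" point concrete, here is a minimal sketch of how the main document part can be pulled out of a DOCX payload with the standard library. The helper names `read_document_xml` and `make_fake_docx` are hypothetical and are not part of markitdown; the second one only builds a stand-in archive for illustration.

```python
import io
import zipfile


def read_document_xml(docx_bytes: bytes) -> str:
    """Return the raw word/document.xml text from a DOCX (ZIP) payload."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def make_fake_docx(document_xml: str) -> bytes:
    """Build a stand-in archive holding only word/document.xml (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    return buffer.getvalue()
```

Whatever bytes the uploader places in that archive member reach the XML handling code unchanged, which is why the content has to be treated as untrusted.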
Python's xml.etree.ElementTree is documented as "not secure against maliciously constructed data". CPython's expat-based parser has a smaller XXE surface than lxml, but depending on the Python and Expat versions in use it can still be exposed to:
- Billion Laughs (exponential entity expansion) causing denial of service; Expat 2.4.1 and later include mitigations for this
- External entity resolution, depending on parser configuration
- DTD processing attacks
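The entity-expansion mechanism behind the first item can be sketched with a deliberately tiny, harmless payload. This is a standalone illustration of how ElementTree expands internal entities in general; it does not reproduce the markitdown template or show an exploit against it. `PAYLOAD` and `expanded_text` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Two nested internal entities: "b" expands to three copies of "a".
# A real Billion Laughs payload chains many such levels to grow exponentially.
PAYLOAD = (
    "<!DOCTYPE r ["
    '<!ENTITY a "ha">'
    '<!ENTITY b "&a;&a;&a;">'
    "]>"
    "<r>&b;</r>"
)


def expanded_text(xml_text: str) -> str:
    """Parse the document and return the root element's expanded text."""
    return ET.fromstring(xml_text).text or ""
```

Here `expanded_text(PAYLOAD)` returns `"hahaha"`: a 3x amplification from one level of nesting, which is the growth pattern a malicious document would try to scale up.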
Secondary Finding (Medium)
The exiftool_path parameter reaches subprocess.run() in packages/markitdown/src/markitdown/converters/_exiftool.py (lines 22 and 41) without validation. Although the call uses list-style invocation (not shell=True), the path is not checked for path traversal or symlink attacks.
Recommended Fix
For the XXE:
# Replace:
from xml.etree import ElementTree as ET
# With:
import defusedxml.ElementTree as ET
Or call defusedxml.defuse_stdlib() at module initialization.
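Where adding defusedxml as a dependency is not practical, a stdlib-only guard is one possible stopgap. The sketch below is an assumption about how such a guard could look, not the project's code: it runs a throwaway expat parser whose handlers reject any DOCTYPE or entity declaration, and only then hands the text to ElementTree. `ForbiddenXML` and `parse_untrusted` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class ForbiddenXML(ValueError):
    """Raised when the input carries a DOCTYPE or entity declaration."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    raise ForbiddenXML(f"DOCTYPE declaration not allowed: {name!r}")


def _reject_entity(name, *_details):
    raise ForbiddenXML(f"entity declaration not allowed: {name!r}")


def parse_untrusted(xml_text: str) -> ET.Element:
    """Parse xml_text only if it declares no DTD and no entities."""
    probe = expat.ParserCreate()
    probe.StartDoctypeDeclHandler = _reject_doctype
    probe.EntityDeclHandler = _reject_entity
    # Exceptions raised inside the handlers propagate out of Parse().
    # Malformed input raises expat.ExpatError here instead.
    probe.Parse(xml_text, True)
    return ET.fromstring(xml_text)
```

The input is parsed twice, which costs some time but keeps the check independent of ElementTree internals. defusedxml remains the more complete option because it is maintained specifically for this purpose.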
For the subprocess issue:
- Validate exiftool_path against an allowlist or verify it resolves to a known binary using shutil.which()
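One way the validation could look, as a rough sketch: resolve the candidate with shutil.which(), follow symlinks with os.path.realpath(), and accept the result only if it sits in an explicitly trusted directory. The function name `resolve_trusted_binary` and the `allowed_dirs` parameter are hypothetical, and the right set of directories depends on the deployment.

```python
import os
import shutil


def resolve_trusted_binary(candidate: str, allowed_dirs: list[str]) -> str:
    """Return the real path of candidate if it is an executable in a trusted directory."""
    found = shutil.which(candidate)
    if found is None:
        raise ValueError(f"not an executable on this system: {candidate!r}")

    # Follow symlinks so the check applies to the file that would actually run.
    resolved = os.path.realpath(found)
    trusted = {os.path.realpath(directory) for directory in allowed_dirs}
    if os.path.dirname(resolved) not in trusted:
        raise ValueError(f"binary outside trusted directories: {resolved!r}")
    return resolved
```

The returned path, rather than the caller-supplied string, would then be the value passed to subprocess.run(). A check like this still leaves a window between validation and execution, so it reduces the risk rather than eliminating it.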
Impact
markitdown is widely used for converting documents to Markdown. Any application processing untrusted DOCX files is potentially vulnerable, including:
- Web services accepting document uploads
- CI/CD pipelines processing documentation
- AI/LLM pipelines using markitdown for document ingestion
- The markitdown MCP server (markitdown-mcp)
Disclosure Process
We attempted to report this through secure@microsoft.com (bounced — no longer accepted) and the MSRC Researcher Portal. This issue was discovered during an automated scan using the Colosseum deep code analysis platform — 51 gauntlets × 2 platforms × 7 rounds, with 98% cross-platform agreement.
Full scan report: https://battleharden.dev/reports/markitdown