Validate OOXML (PPTX/DOCX/XLSX) and ODF files in pure Python — no .NET required.
A Python port of Microsoft's Open XML SDK validation logic. Check whether generated or modified Office files will open cleanly, directly from Python scripts, CI pipelines, or anywhere .NET isn't practical.
Also supports OASIS OpenDocument Format (ODT/ODS/ODP) with staged conformance levels.
Validation is the floor tier. Whether a file survives depends on more than ECMA legality — it also has to load in the target app, survive a save, behave correctly at runtime, and ideally match what the app itself would author. openxml-audit organizes this as an evidence ladder (openxml_audit.EvidenceTier):
1. schema-valid: parses against ECMA/OASIS schemas (this is what openxml-audit validate checks)
2. loadable: the target app opens the file without repair
3. roundtrip-preserved: the app's save does not rewrite the intent
4. slideshow-verified: runtime behavior matches intent
5. ui-authored: the app itself produced this structure
Tiers 2–5 are backed by curated corpora of target-app-authored XML. The first corpus lives at docs/pptx_oracle/ — PowerPoint animation/timing, where "schema-valid but silently rewritten" is the dominant failure mode. DOCX and XLSX corpora can follow the same layout when the research starts.
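As a rough illustration of why an ordered ladder makes "minimum tier" checks simple, the sketch below models the five tiers as an IntEnum. The `Tier` class and the `meets_minimum` helper are hypothetical names invented for this example; they are not the library's `EvidenceTier` implementation, whose internals are not described here.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical stand-in for the five rungs of the evidence ladder."""

    SCHEMA_VALID = 1
    LOADABLE = 2
    ROUNDTRIP_PRESERVED = 3
    SLIDESHOW_VERIFIED = 4
    UI_AUTHORED = 5


def meets_minimum(observed: Tier, minimum: Tier) -> bool:
    """Return True when the observed evidence is at or above the required rung."""
    return observed >= minimum


# A structure with only schema-level evidence does not satisfy LOADABLE,
# while one with roundtrip evidence does.
schema_only_ok = meets_minimum(Tier.SCHEMA_VALID, Tier.LOADABLE)
roundtrip_ok = meets_minimum(Tier.ROUNDTRIP_PRESERVED, Tier.LOADABLE)
```

Because the rungs are ordered integers, any higher tier automatically satisfies a lower minimum.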
from openxml_audit import EvidenceTier
from openxml_audit.pptx import check_capability
check_capability("pptx.anim.effect.entr.fade", minimum_tier=EvidenceTier.LOADABLE)

- OOXML Validation: Package structure, schema, semantic, properties, and format-specific checks for PPTX/DOCX/XLSX, with 100% parity with Open XML SDK v3.4.1 and no .NET dependency
- ODF Validation: Staged conformance levels (foundation, schema-core with Relax NG, semantic-core, and security-core) for ODT/ODS/ODP
- Evidence ladder: Validation is the floor tier. Curated PPTX corpora (docs/pptx_oracle/) verify loadability, roundtrip preservation, and runtime behavior above it, for features like animation/timing where "schema-valid" isn't enough
- Fast: about 1.2x the .NET SDK's runtime cold and 2.2x warm; validates a 798K DOCX in 101ms
- pytest Plugin: assert_valid_pptx, assert_valid_docx, assert_valid_xlsx, assert_valid_odf, with zero config
- CI Ready: GitHub Action, pre-commit hook, and parallel batch validation
- Multiple Output Formats: Text, JSON, and XML output
Libraries that generate Office files routinely produce corrupt output — python-pptx has 12+ open corruption issues, docxtpl has 7, XlsxWriter 25+. These surface as "PowerPoint found a problem" dialogs for end users or silent failures in CI. With AI agents now generating slides and reports, the problem is getting worse.
openxml-audit catches these before your users do — same checks Microsoft's SDK runs, in pure Python.
| Ecosystem | Examples | How openxml-audit helps |
|---|---|---|
| File generators | python-pptx, python-docx, openpyxl, XlsxWriter | Validate output in tests and CI — catch corruption before release |
| Template engines | docxtpl, pptx-template | Jinja2 rendering can break XML structure — validate after render |
| Data pipelines | pandas to_excel, tablib, django-import-export | Assert valid exports in pipeline tests |
| AI/LLM agents | Auto-PPT, GenFilesMCP, Docling | AI-generated Office files are unreliable — validate and retry |
| Government / ODF | Suite Numerique, odfpy | ODF conformance for EU regulatory requirements |
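The "validate and retry" pattern from the AI/LLM row can be sketched as a small loop. This is a generic illustration, not part of openxml-audit: `generate_until_valid` and its parameters are hypothetical, and the `is_valid` callable is where a check such as `is_valid_pptx` would presumably be plugged in.

```python
from typing import Callable


def generate_until_valid(
    generate: Callable[[int], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> str:
    """Call generate(attempt) until is_valid(path) passes or attempts run out.

    generate receives the 1-based attempt number and returns the output path.
    """
    for attempt in range(1, max_attempts + 1):
        path = generate(attempt)
        if is_valid(path):
            return path
    raise RuntimeError(f"no valid file after {max_attempts} attempts")
```

Capping the number of attempts keeps a misbehaving generator from looping forever; the caller decides what to do when the cap is reached.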
Pure Python, but close to .NET — lxml does the heavy XML lifting in C.
| Benchmark | .NET SDK | openxml-audit | Ratio |
|---|---|---|---|
| Cold start (6 files, mixed formats) | 994ms | 1,175ms | 1.2x |
| Warm (798K DOCX) | 46ms | 101ms | 2.2x |
| Warm (1.4MB PPTX) | — | 83ms | — |
| Warm (114K XLSX) | — | 29ms | — |
Batch validation supports --parallel N for multiprocess speedup. The pytest plugin uses session-scoped fixtures so schema loading happens once per test run.
pip install openxml-audit

Or install from source:
git clone https://github.com/BramAlkema/openxml-audit.git
cd openxml-audit
pip install -e .

# Validate a single file
openxml-audit presentation.pptx
# Validate an OASIS OpenDocument file
openxml-audit document.odt
# Validate with JSON output
openxml-audit presentation.pptx --output json
# Validate with XML output
openxml-audit presentation.pptx --output xml
# Validate all matching files in a directory
openxml-audit ./presentations/ --recursive
# Validate against a specific Office version
openxml-audit presentation.pptx --format Office2007
# Limit maximum errors reported
openxml-audit presentation.pptx --max-errors 10

from openxml_audit import validate_pptx, is_valid_pptx, OpenXmlValidator

# Quick check
if is_valid_pptx("presentation.pptx"):
    print("File is valid!")

# Detailed validation
result = validate_pptx("presentation.pptx")
if not result.is_valid:
    print(f"Found {result.error_count} errors, {result.warning_count} warnings")
    for error in result.errors:
        print(f" [{error.severity.value}] {error.description}")
# With custom options
from openxml_audit import FileFormat
validator = OpenXmlValidator(
    file_format=FileFormat.OFFICE_2019,
    max_errors=100,
    schema_validation=True,
    semantic_validation=True,
)
result = validator.validate("presentation.pptx")

- ADRs: evidence-ladder mission and PPTX evidence ownership
- PPTX oracle corpus: curated PowerPoint timing fixtures and XML-first methodology
- Parity contract: SDK calibration and drift rules
ODF validation is staged by explicit conformance level.
| Level | Includes | Does not include |
|---|---|---|
| foundation | package/manifest integrity + XML parse sweep | Relax NG schema-core routing, semantic-core rules, security-core checks |
| schema-core | foundation + Relax NG validation for routed XML members | semantic-core and security-core checks |
| semantic-core | foundation + semantic-core rule families (ODFSEM*) | Relax NG schema-core routing, security-core checks |
| security-core | semantic-core + signature/encryption structural checks (ODFSEC*) | full cryptographic trust guarantees unless crypto verification backend is configured |
Rule registry and policy references:
- semantic rule IDs: openxml_audit.odf.get_odf_semantic_rules()
- security policy: docs/odf_security_policy.md
- reference calibration/drift contract: docs/odf_validation_contract.md
Use --odf-level when validating ODF files:
# foundation
openxml-audit file.odt --validator odf --odf-level foundation
# semantic-core (default)
openxml-audit file.odt --validator odf --odf-level semantic-core
# security-core
openxml-audit file.odt --validator odf --odf-level security-core

Schema-core uses bundled OASIS Relax NG schemas by default:
openxml-audit file.odt \
--validator odf \
--odf-level schema-core

Pass --odf-schema-routes only when you want to override or extend routing. It accepts either shape:
- versioned mapping: {"1.3": {"content.xml": "schemas/odf/1.3/content.rng"}}
- flat legacy mapping: {"content.xml": "schemas/odf/content.rng"}
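To make the two accepted shapes concrete, here is a minimal sketch of how a caller might normalize either one into the versioned form before passing it along. `normalize_routes` and its `default_version` parameter are hypothetical; how the library itself resolves a flat legacy mapping is not stated here, so wrapping it under "1.3" is an assumption.

```python
from collections.abc import Mapping


def normalize_routes(routes: Mapping, default_version: str = "1.3") -> dict:
    """Return routes in versioned form: {version: {member: schema_path}}.

    A flat mapping ({member: schema_path}) is wrapped under default_version
    (an assumption for this sketch); a versioned mapping is copied as-is.
    """
    if not routes:
        return {}
    values = list(routes.values())
    if all(isinstance(value, Mapping) for value in values):
        return {version: dict(members) for version, members in routes.items()}
    if all(isinstance(value, str) for value in values):
        return {default_version: dict(routes)}
    raise ValueError("routes must be either fully versioned or fully flat")


flat = {"content.xml": "schemas/odf/content.rng"}
versioned = {"1.3": {"content.xml": "schemas/odf/1.3/content.rng"}}
normalized_flat = normalize_routes(flat)
normalized_versioned = normalize_routes(versioned)
```

Rejecting mixed input early avoids silently guessing which shape was intended.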
Security-core crypto verification hook:
openxml-audit file.odt \
--validator odf \
--odf-level security-core \
--odf-verify-cryptography

from openxml_audit import FileFormat
from openxml_audit.odf import OdfValidator
# foundation
foundation = OdfValidator(
file_format=FileFormat.ODF_1_3,
schema_validation=False,
semantic_validation=False,
security_validation=False,
)
# schema-core (bundled schemas by default)
schema_core = OdfValidator(
file_format=FileFormat.ODF_1_3,
schema_validation=True,
semantic_validation=False,
security_validation=False,
relaxng_validation=True,
)
# schema-core with custom routes
schema_core_custom = OdfValidator(
file_format=FileFormat.ODF_1_3,
schema_validation=True,
semantic_validation=False,
security_validation=False,
relaxng_validation=True,
schema_routes={"1.3": {"content.xml": "schemas/odf/1.3/content.rng"}},
)
# semantic-core
semantic_core = OdfValidator(
file_format=FileFormat.ODF_1_3,
schema_validation=True,
semantic_validation=True,
security_validation=False,
)
# security-core
security_core = OdfValidator(
file_format=FileFormat.ODF_1_3,
schema_validation=True,
semantic_validation=True,
security_validation=True,
verify_cryptography=False, # set True when crypto backend is available
)

# Benchmark an ODF file (5 iterations by default)
python scripts/odf/benchmark_validation.py document.odt
# More iterations, with security checks
python scripts/odf/benchmark_validation.py document.odt --iterations 20 --security
# Foundation-only (skip schema/semantic)
python scripts/odf/benchmark_validation.py document.odt --no-schema --no-semantic

Reports avg/min/max/P95 with per-phase breakdown (package_structure, xml_parse, schema, semantic, security).
OOXML benchmark: python scripts/benchmark_validation.py presentation.pptx
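For readers who want to reproduce similar statistics around their own calls, the following is a minimal sketch of an avg/min/max/P95 summary. It is not the benchmark script's code: `time_calls` and `summarize` are hypothetical helpers, and the nearest-rank percentile method is an assumption, since the script's exact P95 definition is not documented here.

```python
import statistics
import time
from typing import Callable


def time_calls(fn: Callable[[], object], iterations: int = 5) -> list:
    """Run fn repeatedly and return wall-clock durations in milliseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples: list) -> dict:
    """Summarize timings as avg/min/max/p95 using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based nearest rank: ceil(0.95 * n), computed with integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "avg": statistics.fmean(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "p95": ordered[rank - 1],
    }
```

With only a handful of iterations the P95 collapses to the maximum, so larger iteration counts give a more meaningful tail figure.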
- Schema-core validates bundled routed members by default; use schema_routes to extend or override routing for additional XML parts.
- Security-core validates structure/policy, not full cryptographic trust by default.
- CLI --odf-level only applies when the selected/auto-detected validator is ODF.
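The relationship between the CLI level names and the validator keyword arguments shown in the Python examples above can be summarized as a lookup table. This is a sketch only: `ODF_LEVEL_FLAGS` and `flags_for_level` are hypothetical names, and the flag combinations are copied from the examples rather than taken from the CLI's actual implementation. The optional `verify_cryptography` argument is left out.

```python
# Flag combinations mirror the OdfValidator examples; the table itself is hypothetical.
ODF_LEVEL_FLAGS = {
    "foundation": {
        "schema_validation": False,
        "semantic_validation": False,
        "security_validation": False,
    },
    "schema-core": {
        "schema_validation": True,
        "semantic_validation": False,
        "security_validation": False,
        "relaxng_validation": True,
    },
    "semantic-core": {
        "schema_validation": True,
        "semantic_validation": True,
        "security_validation": False,
    },
    "security-core": {
        "schema_validation": True,
        "semantic_validation": True,
        "security_validation": True,
    },
}


def flags_for_level(level: str) -> dict:
    """Return a copy of the keyword arguments associated with an ODF level name."""
    try:
        return dict(ODF_LEVEL_FLAGS[level])
    except KeyError:
        known = ", ".join(ODF_LEVEL_FLAGS)
        raise ValueError(f"unknown ODF level {level!r}; expected one of: {known}") from None
```

Under these assumptions, the result could be splatted into a constructor call such as `OdfValidator(file_format=FileFormat.ODF_1_3, **flags_for_level("schema-core"))`.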
Compare Python results against external validators (ODF Toolkit, OPF) using the scripts in scripts/odf/:
| Script | Purpose |
|---|---|
| run_reference_validators.py | Run Python + external validators on pinned corpus |
| compare_reference_results.py | Diff results into mismatch families |
| check_reference_drift.py | Enforce drift policy against baseline |
| bootstrap_reference_validators.py | Auto-build external validator commands |
CI workflow: .github/workflows/odf-reference-calibration.yml — builds ODF Toolkit and OPF at runtime via Maven/Docker.
Set command templates via --odf-toolkit-cmd / --opf-cmd or env vars ODF_TOOLKIT_CMD / OPF_ODF_VALIDATOR_CMD. Placeholders: {file}, {file_dir}, {file_name}, {file_stem}, {file_suffix}.
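The placeholder expansion can be pictured with a short sketch. `expand_command` is a hypothetical helper, not the scripts' actual code, and the meaning assigned to each placeholder (for example, that {file_suffix} includes the leading dot) is an assumption based on the names. `PurePosixPath` is used so the example behaves the same on every platform.

```python
import shlex
from pathlib import PurePosixPath


def expand_command(template: str, file_path: str) -> list:
    """Expand {file}-style placeholders in a command template into an argv list."""
    path = PurePosixPath(file_path)
    values = {
        "file": str(path),
        "file_dir": str(path.parent),
        "file_name": path.name,      # name with suffix, e.g. report.odt
        "file_stem": path.stem,      # name without suffix, e.g. report
        "file_suffix": path.suffix,  # suffix with dot, e.g. .odt
    }
    # Split the template first, then substitute, so that a path containing
    # spaces stays a single argument.
    return [token.format(**values) for token in shlex.split(template)]


argv = expand_command("java -jar validator.jar {file}", "/data/my docs/report.odt")
```

Returning an argv list rather than a single string means the command can be handed to `subprocess.run` without a shell.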
The gsuite engine in the oracle dispatcher round-trips OOXML files
through Google's import/export pipeline (upload → convert to native
Google Slides → export back to .pptx → diff) and classifies what
GSuite drops, transforms, or normalizes. See
specs/031-gsuite-roundtrip-oracle.md
for the full design.
GSuite uploads require domain-wide delegation because service accounts have had zero storage quota since Google's 2024 policy change. The setup is a one-time, per-Workspace ceremony:
- Create a GCP project at https://console.cloud.google.com (e.g., openxml-audit-oracle).
- APIs & Services → Library, enable Google Drive API.
- IAM & Admin → Service Accounts, create one (e.g., oracle-roundtrip); skip the project IAM role grant.
- On the new SA → Keys → Add key → JSON. Save to ~/.config/openxml-audit/google_service_account.json and chmod 600 it.
- Note the SA's OAuth client ID (in Show domain-wide delegation on the SA page).
- In Google Workspace Admin Console (https://admin.google.com) → Security → Access and data control → API controls → Domain-wide Delegation → Add new. Paste the OAuth client ID; scope: https://www.googleapis.com/auth/drive. Requires Workspace super-admin rights; one-time per Workspace.
- In Drive, create a folder owned by the impersonation subject (e.g., openxml-audit-oracle-staging) to hold in-flight oracle uploads. Copy its folder ID from the URL.
Install the optional dependency group:
pip install -e ".[gsuite]"

Three env vars wire it up:
export GSUITE_ORACLE_CREDS=~/.config/openxml-audit/google_service_account.json # default; override only if elsewhere
export GSUITE_ORACLE_SUBJECT=info@yourdomain.example # the user the SA impersonates
export GSUITE_ORACLE_FOLDER_ID=1abcDEFghijKLM... # the staging folder ID

Then:
python -m openxml_audit.oracle gsuite presentation.pptx
python -m openxml_audit.oracle gsuite ./corpus/ --output gsuite-report.json

The report classifies each roundtrip across a LossClass taxonomy:
theme_loss, master_loss, style_loss, font_loss,
media_re_encoded, metadata_churn, structural_normalization
(parts GSuite added), content_preserved_lossy,
content_changed, unmapped. Multiple classes may fire per file.
Drive uploads are deleted in a finally block after each roundtrip, so the
oracle never leaves files in your account.
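The first step of such a diff can be sketched by comparing the part names of the two packages. This is only an illustration with hypothetical helpers (`part_names`, `diff_parts`); the real oracle presumably inspects part contents as well, and no attempt is made here to reproduce its LossClass rules. The only link drawn to the taxonomy is that parts present only after the roundtrip correspond to what the text calls structural_normalization.

```python
import io
import zipfile


def part_names(package: bytes) -> set:
    """Return the set of part names inside an OOXML package given as bytes."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        return set(archive.namelist())


def diff_parts(original: bytes, roundtripped: bytes) -> tuple:
    """Return (dropped, added): parts missing after, and parts new after, a roundtrip."""
    before = part_names(original)
    after = part_names(roundtripped)
    return before - after, after - before
```

A dropped part such as a theme or slide master is a hint about where to look, but deciding which loss class applies still requires comparing the XML itself.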
Run the .NET SDK validator separately (requires .NET SDK 8.x or Docker):
dotnet run --project scripts/sdk_check/sdk_check.csproj -- /path/to/file.pptx
dotnet run --project scripts/sdk_compare/OpenXmlSdkValidator.csproj -- /path/to/file.pptx # JSON
# Via Docker
docker run --rm -v "$PWD:/work" -w /work mcr.microsoft.com/dotnet/sdk:8.0 \
  dotnet run --project scripts/sdk_check/sdk_check.csproj -- /work/path/to/file.pptx

Supports PPTX/DOCX/XLSX and variants. Configured for Office 2019.
Validate Office files in your PRs automatically:
# .github/workflows/validate-office-files.yml
name: Validate Office Files
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - uses: BramAlkema/openxml-audit@main
        with:
          changed-only: "true"  # only validate files changed in the PR

Options:
| Input | Default | Description |
|---|---|---|
| path | . | Directory or file to validate |
| format | Office2019 | Office version to validate against |
| changed-only | false | Only validate files changed in the PR |
| recursive | true | Search subdirectories |
| max-errors | 100 | Maximum errors per file |
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/BramAlkema/openxml-audit
    rev: v0.5.0
    hooks:
      - id: openxml-audit

Validates any .pptx, .docx, .xlsx, .odt, .ods, or .odp file before commit.
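To show which staged files a hook like this would pick up, here is a small filter over the six extensions listed above. `office_files` is a hypothetical helper written for this example; the hook's real selection logic is not shown here, and treating extensions case-insensitively is an assumption.

```python
from pathlib import PurePath

# The extensions named in the text above.
OFFICE_SUFFIXES = {".pptx", ".docx", ".xlsx", ".odt", ".ods", ".odp"}


def office_files(paths: list) -> list:
    """Keep only paths whose extension is one of the supported Office formats."""
    return [p for p in paths if PurePath(p).suffix.lower() in OFFICE_SUFFIXES]


selected = office_files(["README.md", "slides/deck.pptx", "data/Report.XLSX", "notes.odt.bak"])
```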
Ready-to-run scripts in examples/:
| Script | Description |
|---|---|
| validate_python_pptx.py | Generate a PPTX with python-pptx and validate it |
| validate_openpyxl.py | Generate an XLSX with openpyxl and validate it |
| validate_odf.py | Validate an ODF file (ODT/ODS/ODP) |
| ci_validation.py | Validate all Office files in a directory (CI-ready, OOXML + ODF) |
| Workflow | Trigger | Purpose |
|---|---|---|
| parity-gate.yml | PR / push | Enforce OOXML parity + perf budget against SDK baseline |
| calibrate-parity.yml | Weekly / dispatch | Calibrate against Open XML SDK upstream |
| sdk-update.yml | Quarterly / dispatch | Track upstream SDK version changes |
| odf-reference-calibration.yml | Dispatch | Run ODF reference validators and drift checks |
| validate-inputs.yml | Push to inputs/ | Validate dropped files with both Python and .NET SDK |
| release.yml | Tag push (v*) | Build and publish to PyPI |
| pages.yml | Push to main | Deploy documentation site |
OOXML parity details: docs/parity_contract.md. ODF reference contract: docs/odf_validation_contract.md.
Fixtures are registered automatically — just pip install openxml-audit and use them:
def test_my_presentation(assert_valid_pptx, tmp_path):
    output = tmp_path / "output.pptx"
    generate_pptx(output)
    assert_valid_pptx(output)  # fails with detailed errors if invalid

def test_my_document(assert_valid_docx, tmp_path):
    output = tmp_path / "output.docx"
    generate_docx(output)
    assert_valid_docx(output)

def test_my_spreadsheet(assert_valid_xlsx, tmp_path):
    output = tmp_path / "output.xlsx"
    generate_xlsx(output)
    assert_valid_xlsx(output)

def test_odf_file(assert_valid_odf, tmp_path):
    output = tmp_path / "output.odt"
    generate_odt(output)
    assert_valid_odf(output)

CLI options:
# Validate against a specific Office version
pytest --openxml-format Office2007
# Limit errors collected per file
pytest --openxml-max-errors 50

Available fixtures: openxml_validator, assert_valid_pptx, assert_valid_docx, assert_valid_xlsx, assert_valid_odf.
# Context manager
from openxml_audit import validation_context

with validation_context(raise_on_invalid=True) as validator:
    result = validator.validate("presentation.pptx")

# Decorator: validate after save
from pptx import Presentation  # python-pptx, needed only for this example
from openxml_audit import validate_on_save

@validate_on_save(raise_on_invalid=True)
def create_presentation(output_path: str) -> None:
    Presentation().save(output_path)

# Decorator: require valid input
from openxml_audit import require_valid_pptx

@require_valid_pptx()
def process(input_path: str) -> dict: ...

OpenXmlValidator(file_format=FileFormat.OFFICE_2019, max_errors=1000,
                 schema_validation=True, semantic_validation=True)
OdfValidator(file_format=FileFormat.ODF_1_3, max_errors=1000,
             schema_validation=True, semantic_validation=True,
             security_validation=False, strict=True)

Both expose:
- validate(path) -> ValidationResult
- validate_with_timings(path) -> (ValidationResult, dict[str, float])
- is_valid(path) -> bool
| Property | Type | Description |
|---|---|---|
| is_valid | bool | No ERROR-severity issues |
| errors | list[ValidationError] | All errors and warnings |
| error_count / warning_count | int | Counts by severity |
| file_path | str | Validated file path |
| file_format | FileFormat | Version validated against |
| Property | Type | Description |
|---|---|---|
| error_type | ValidationErrorType | PACKAGE, BINARY, SCHEMA, SEMANTIC, RELATIONSHIP, MARKUP_COMPATIBILITY |
| severity | ValidationSeverity | ERROR, WARNING, INFO |
| description | str | Human-readable message |
| part_uri | str \| None | Affected part URI |
| path | str \| None | XPath to affected element |
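The way these fields combine into the result-level properties can be sketched with stand-in types. Everything below is hypothetical: `Severity`, `Issue`, and `summarize_issues` are local stand-ins that mirror the field names in the tables, not the library's classes, and the enum string values are an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    """Stand-in for ValidationSeverity; the string values are assumed."""

    ERROR = "error"
    WARNING = "warning"
    INFO = "info"


@dataclass
class Issue:
    """Stand-in for ValidationError, using the field names from the table."""

    error_type: str
    severity: Severity
    description: str
    part_uri: Optional[str] = None
    path: Optional[str] = None


def summarize_issues(issues: list) -> dict:
    """Derive is_valid and per-severity counts, plus a per-part tally."""
    by_severity = Counter(issue.severity for issue in issues)
    by_part = Counter(issue.part_uri or "(package)" for issue in issues)
    return {
        "is_valid": by_severity[Severity.ERROR] == 0,  # no ERROR-severity issues
        "error_count": by_severity[Severity.ERROR],
        "warning_count": by_severity[Severity.WARNING],
        "by_part": dict(by_part),
    }
```

Grouping by part URI is often the quickest way to see whether failures cluster in a single slide, sheet, or document part.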
| OOXML | ODF |
|---|---|
| OFFICE_2007 through MICROSOFT_365 (default: OFFICE_2019) | ODF_1_2, ODF_1_3 (default: ODF_1_3) |
- validate_pptx(path) -> ValidationResult
- is_valid_pptx(path) -> bool
These libraries create Office files — openxml-audit checks them:
| Library | Format | Description |
|---|---|---|
| python-pptx | PPTX | Create and update PowerPoint files |
| python-docx | DOCX | Create and update Word files |
| openpyxl | XLSX | Create and update Excel files |
from pptx import Presentation
from openxml_audit import validate_pptx

Presentation().save("output.pptx")
result = validate_pptx("output.pptx")
if not result.is_valid:
    print(f"{result.error_count} issues found")

Contributions are welcome! See CONTRIBUTING.md for dev setup and guidelines.
This project is actively looking for co-maintainers — especially people working with:
- Office file generation pipelines (python-pptx, python-docx, openpyxl)
- ODF tooling and OASIS conformance
- Open XML SDK internals
If you're interested, open an issue or reach out.
If this project saves you time, consider sponsoring its development.
See CHANGELOG.md for a full list of changes by version.
Based on the validation logic from Microsoft's Open XML SDK for .NET.
