Fix XLSX ingestion memory spikes with streaming parser #1

Merged: plasma16 merged 1 commit into main from fix/xlsx-memory-spike on May 7, 2026
Conversation

plasma16 (Owner) commented May 7, 2026

Summary

  • Route .xlsx conversion through a streaming openpyxl reader (read_only=True, data_only=True).
  • Cap scan bounds (max_rows=5000, max_cols=64) so pathological worksheet ranges cannot exhaust memory.
  • Stop scanning once a sustained empty tail is reached, so sparse sheets do not cause runaway processing.
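
For readers without the diff at hand, the three points above could look roughly like the sketch below. This is not the merged code: the function names `bounded_rows` and `iter_xlsx_rows` and the `MAX_EMPTY_TAIL` threshold are assumptions made for illustration, and only `max_rows=5000`, `max_cols=64`, `read_only=True` and `data_only=True` come from this PR description.

```python
from typing import Any, Iterable, Iterator, Tuple

MAX_ROWS = 5000        # row cap stated in the PR summary
MAX_COLS = 64          # column cap stated in the PR summary
MAX_EMPTY_TAIL = 200   # hypothetical threshold; the PR does not state a value


def bounded_rows(
    rows: Iterable[Tuple[Any, ...]],
    max_rows: int = MAX_ROWS,
    max_cols: int = MAX_COLS,
    max_empty_tail: int = MAX_EMPTY_TAIL,
) -> Iterator[Tuple[Any, ...]]:
    """Yield non-empty rows trimmed to max_cols.

    Stops after max_rows input rows, or once max_empty_tail consecutive
    empty rows have been seen. Empty rows are skipped, not yielded.
    """
    empty_run = 0
    for index, row in enumerate(rows):
        if index >= max_rows:
            break
        cells = tuple(row[:max_cols])
        if all(c is None or c == "" for c in cells):
            empty_run += 1
            if empty_run >= max_empty_tail:
                break  # sustained empty tail: assume no more real data
            continue
        empty_run = 0
        yield cells


def iter_xlsx_rows(path: str) -> Iterator[Tuple[str, Tuple[Any, ...]]]:
    """Stream (sheet title, row values) pairs from a workbook."""
    # Imported lazily so bounded_rows can be used without openpyxl installed.
    from openpyxl import load_workbook

    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        for ws in wb.worksheets:
            # Passing the caps to iter_rows keeps openpyxl from walking an
            # inflated used range reported by the workbook.
            rows = ws.iter_rows(
                max_row=MAX_ROWS, max_col=MAX_COLS, values_only=True
            )
            for cells in bounded_rows(rows):
                yield ws.title, cells
    finally:
        wb.close()  # read-only workbooks keep the file handle open
```

Keeping the bounding logic in a plain generator that accepts any iterable of rows means the caps and the empty-tail cutoff can be exercised with ordinary tuples, without building a workbook.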

Why

Some workbooks report huge used ranges (e.g. max_row=1048571) despite having very little real data, which can cause generic converters to consume excessive RAM.

Result

Significantly lower memory use during XLSX ingest while preserving useful sheet content for KB compilation.

plasma16 force-pushed the fix/xlsx-memory-spike branch from 9075e4c to c3a1f11 on May 7, 2026, 07:30
plasma16 merged commit 554d93b into main on May 7, 2026
1 check passed