SheetReader

SheetReader is a fast, memory-efficient spreadsheet parser for tabular data from Excel OOXML (.xlsx) files, implemented in C++. Unlike many existing spreadsheet loaders, which rely on general-purpose XML parsers, SheetReader is designed specifically for spreadsheet data. By exploiting the fixed structure of .xlsx files, using parallelism at multiple levels, and managing memory carefully, it avoids unnecessary XML overhead and enables efficient ingestion of spreadsheet data.
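To make the "fixed structure" point concrete: an .xlsx file is a ZIP archive whose worksheet data and shared strings live in separate XML parts with a predictable layout. The Python sketch below is not SheetReader's implementation (which is C++ and far more elaborate); it only illustrates the idea of scanning for cell elements directly instead of building a general XML tree. The demo workbook, the part name `xl/worksheets/sheet1.xml`, the helper names, and the simplified regular expressions are assumptions chosen for illustration; real files carry extra attributes and cell types that this sketch ignores.

```python
import io
import re
import zipfile

# Hypothetical minimal workbook parts, just enough for this illustration.
SHARED_STRINGS = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    b'<si><t>name</t></si><si><t>score</t></si><si><t>ada</t></si>'
    b'</sst>'
)
SHEET = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    b'<sheetData>'
    b'<row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c></row>'
    b'<row r="2"><c r="A2" t="s"><v>2</v></c><c r="B2"><v>9.5</v></c></row>'
    b'</sheetData></worksheet>'
)


def build_demo_xlsx() -> bytes:
    """Pack the two parts into an in-memory ZIP, mimicking an .xlsx container."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("xl/sharedStrings.xml", SHARED_STRINGS)
        zf.writestr("xl/worksheets/sheet1.xml", SHEET)
    return buf.getvalue()


# Simplified patterns: they assume cells look exactly like the demo data above.
CELL_RE = re.compile(rb'<c r="([A-Z]+)(\d+)"(?: t="(\w+)")?><v>([^<]*)</v></c>')
STRING_RE = re.compile(rb'<si><t>([^<]*)</t></si>')


def read_cells(xlsx_bytes: bytes) -> dict:
    """Scan the worksheet part for cell elements without a general XML parser."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as zf:
        raw_strings = zf.read("xl/sharedStrings.xml")
        sheet = zf.read("xl/worksheets/sheet1.xml")
    strings = [s.decode() for s in STRING_RE.findall(raw_strings)]

    cells = {}
    for col, row, cell_type, raw in CELL_RE.findall(sheet):
        ref = (col + row).decode()
        if cell_type == b"s":
            # t="s" means the value is an index into the shared string table.
            cells[ref] = strings[int(raw)]
        else:
            # Cells without a type attribute hold numbers.
            cells[ref] = float(raw)
    return cells


cells = read_cells(build_demo_xlsx())
```

Here `cells` maps cell references such as `A1` to decoded values. Because the scan only looks for the few element shapes a worksheet can contain, it skips the bookkeeping a general-purpose XML parser performs, which is the kind of overhead the paragraph above refers to.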

Bindings

We provide bindings for several environments:

  • R: load spreadsheets into data frames. Also available on CRAN.
  • Python: load spreadsheets into pandas DataFrames. Also available on PyPI.
  • PostgreSQL FDW: access spreadsheets from PostgreSQL through a foreign data wrapper and combine them with other PostgreSQL tables. Also available on PGXN.
  • DuckDB extension: access spreadsheets from DuckDB and combine them with other DuckDB tables. Also available as a community extension.

Scientific Background

SheetReader was introduced in the PolyDB research project (polydbms.org). The initial design and evaluation were published in the journal Information Systems. If you use SheetReader in your research, please consider citing the following paper:

@article{GavriilidisHZM23,
  author       = {Haralampos Gavriilidis and Felix Henze and Eleni Tzirita Zacharatou and Volker Markl},
  title        = {SheetReader: Efficient Specialized Spreadsheet Parsing},
  journal      = {Inf. Syst.},
  volume       = {115},
  pages        = {102183},
  year         = {2023},
  url          = {https://doi.org/10.1016/j.is.2023.102183}
}

Acknowledgements

SheetReader includes and uses the following C/C++ libraries:

Logo design by Stefanie Lenk.