Skip to content

Conversation

@luccadibe
Copy link
Contributor

This PR aims to fix and optimize the HDF5Reader implementation from systemds with the goal of being able to correctly read the So2Sat LCZ42 dataset (https://mediatum.ub.tum.de/1454690) .

For this I added support for the filter pipeline and attribute message types from HDF5 ;
n dimensional matrices with n>2 are flattened into 2d .
I also added support for inferring hdf5 from the .h5 file extension.

Apologies for the massive PR 🤕 .
I benchmarked the performance of the new implementation and shared results in this repo:
https://github.com/luccadibe/systemds-hdf5-reader-benchmark

The code still needs some work regarding code style and formatting ( I am not sure if I set up the fomatter correctly as mentioned in the CONTRIBUTING.md ; in some files I was getting a huge diff so I tried to format only what I touched).

I am unsure about how to best split this into multiple PRs , or if that is wanted even.
I would appreciate some general feedback on this.

@codecov
Copy link

codecov bot commented Jan 17, 2026

Codecov Report

❌ Patch coverage is 39.55659% with 518 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.52%. Comparing base (79122eb) to head (c83b072).
⚠️ Report is 27 commits behind head on main.

Files with missing lines Patch % Lines
...n/java/org/apache/sysds/runtime/io/ReaderHDF5.java 37.78% 167 Missing and 24 partials ⚠️
...rg/apache/sysds/runtime/io/ReaderHDF5Parallel.java 19.25% 95 Missing and 14 partials ⚠️
...che/sysds/runtime/io/hdf5/H5ContiguousDataset.java 31.49% 69 Missing and 18 partials ⚠️
...ava/org/apache/sysds/runtime/meta/MetaDataAll.java 53.19% 18 Missing and 26 partials ⚠️
...org/apache/sysds/runtime/io/hdf5/H5RootObject.java 53.03% 20 Missing and 11 partials ⚠️
...ntime/io/hdf5/message/H5FilterPipelineMessage.java 0.00% 23 Missing ⚠️
...main/java/org/apache/sysds/runtime/io/hdf5/H5.java 70.37% 10 Missing and 6 partials ⚠️
...ds/runtime/io/hdf5/message/H5AttributeMessage.java 50.00% 2 Missing and 2 partials ⚠️
...ds/runtime/io/hdf5/message/H5DataSpaceMessage.java 42.85% 3 Missing and 1 partial ⚠️
...s/runtime/io/hdf5/message/H5DataLayoutMessage.java 62.50% 2 Missing and 1 partial ⚠️
... and 4 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2394      +/-   ##
============================================
- Coverage     72.29%   71.52%   -0.78%     
- Complexity    46937    47446     +509     
============================================
  Files          1513     1539      +26     
  Lines        178421   182620    +4199     
  Branches      35036    35913     +877     
============================================
+ Hits         128993   130611    +1618     
- Misses        39665    42015    +2350     
- Partials       9763     9994     +231     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@mboehm7
Copy link
Contributor

mboehm7 commented Jan 17, 2026

LGTM - thanks for the patch @luccadibe. Overall, this generalization is great. During the merge, I resolve the merge conflicts, reverted the format changes in MetaDataAll, disabled the last test cases (group1/subgroup/data2) which was consistently failing, and replaced the stdout tracing with log tracing.

@mboehm7 mboehm7 closed this in 0ab05a5 Jan 17, 2026
@github-project-automation github-project-automation bot moved this from In Progress to Done in SystemDS PR Queue Jan 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants