Improve memory usage of unaligned dataset joins in an RDataFrame analysis #21859

@TomasDado

Description

Explain what you would like to see improved and how.

I had a short discussion with @vepadulano regarding an issue we see in RDataFrame.

We run on TTrees that we need to "join" with BuildIndex(). The problem is that in our case the number of events to match is around 500M, so the in-memory hash map needed for the matching is huge. That would be manageable on its own, but each thread keeps its own copy of the map. In our case, 40 threads use more than 120 GB of memory (it would probably need much more, but that is the limit of our hardware). As you can see, this is quite restrictive, since the only workarounds are to:

  • Use fewer threads
  • Somehow split the files so that no single index needs 500M entries
  • "Just get more RAM"

These are not very compelling options.
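For scale, here is a minimal back-of-envelope sketch of the numbers above. It assumes roughly 16 bytes per index entry (two 64-bit words, one for the index value and one for the entry number, which is the order of magnitude of TTreeIndex's internal arrays); the exact per-entry constant is an assumption, not a measured figure.

```python
def index_memory_gb(n_entries, n_threads, bytes_per_entry=16):
    """Total resident memory (GB) if every thread holds its own index copy."""
    return n_entries * bytes_per_entry * n_threads / 1e9

# One copy of a 500M-entry index vs. 40 per-thread copies.
per_thread = index_memory_gb(500_000_000, 1)
total = index_memory_gb(500_000_000, 40)

print(f"per-thread copy: {per_thread:.0f} GB")  # → 8 GB
print(f"40 threads:      {total:.0f} GB")       # → 320 GB
```

Even under this conservative assumption, the 40-thread total far exceeds the 120 GB available on our machines, which is consistent with the jobs hitting the hardware limit.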

We understand that this is probably beyond the scope of TTree and RDF support, but perhaps it is something that could be improved for RNTuple and RDF, as the current situation with TTrees is not sustainable.

ROOT version

Any

Installation method

Any

Operating system

Any

Additional context

No response
