
Memory error in hdfeos5_2json_mbtiles.py for large files. #8

@falkamelung

Description

It turns out this is an open issue:
#4

It worked on the PVC queue using --num-workers 12. Adding a queue-switch-on-timeout option for the PVC queue would solve it, but we should do that only once it is clear that there is no easier solution.

I am trying to ingest a 29GB file. For --num-workers 3 I get this memory error:

hdfeos5_2json_mbtiles.py miaplpy_201410_202603/network_delaunay_4/S1_asc_149_miaplpy_20141027_XXXXXXXX_S2343W06835_S2338W06810_S2306W06819_S2311W06843_filtDel4DS.he5 miaplpy_201410_202603/network_delaunay_4/JSON --num-workers 3
reading displacement data from file: miaplpy_201410_202603/network_delaunay_4/S1_asc_149_miaplpy_20141027_XXXXXXXX_S2343W06835_S2338W06810_S2306W06819_S2311W06843_filtDel4DS.he5 ...
reading mask data from file: miaplpy_201410_202603/network_delaunay_4/S1_asc_149_miaplpy_20141027_XXXXXXXX_S2343W06835_S2338W06810_S2306W06819_S2311W06843_filtDel4DS.he5 ...
Masking displacement
Creating shared memory for multiple processes
miaplpy_201410_202603/network_delaunay_4/JSON already exists
columns: 6669
rows: 2599
converted chunk 0
converted chunk 1
converted chunk 2
converted chunk 3
converted chunk 4
converted chunk 5
converted chunk 6
converted chunk 7
converted chunk 8
Process ForkPoolWorker-3:
Traceback (most recent call last):
  File "/work2/05861/tg851601/stampede2/code/minsar/tools/miniforge3/envs/minsar/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/work2/05861/tg851601/stampede2/code/minsar/tools/miniforge3/envs/minsar/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/work2/05861/tg851601/stampede2/code/minsar/tools/miniforge3/envs/minsar/lib/python3.10/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/work2/05861/tg851601/stampede2/code/minsar/tools/miniforge3/envs/minsar/lib/python3.10/multiprocessing/queues.py", line 367, in get
    return _ForkingPickler.loads(res)
MemoryError

For --num-workers 1 I don't get a memory error, but it does not finish: it only got through chunk 90 after 2 hours of walltime. Memory usage was 49% of the 198 GB RAM available and peaked at 63%:

NODES        CPU    LOAD   L/CPU   IDLE%     WA%       MEM%    GB/core  STATUS
-------------------------------------------------------------------------------
c454-024      48     1.5    0.03    95.8     0.0      63.1%       2.46  UNDERUSED

NODES        CPU    LOAD   L/CPU   IDLE%     WA%       MEM%    GB/core  STATUS
-------------------------------------------------------------------------------
c454-024      48     1.5    0.03    97.6     0.0      49.7%       1.94  UNDERUSED

NODES        CPU    LOAD   L/CPU   IDLE%     WA%       MEM%    GB/core  STATUS
-------------------------------------------------------------------------------
c454-024      48     1.3    0.03    97.7     0.0      49.7%       1.94  UNDERUSED

I then used the PVC queue (1 TB memory) with --num-workers 12. Memory usage went up to 22% (~220 GB), and it finished in 30 minutes.

Here is what Cursor says:

How parallelization works
The .he5 cube is loaded into shared memory; timeseries_datasets holds one 2D array per date (slices of that cube).
The grid (WIDTH × LENGTH) is split into chunks of 20,000 flat indices; each chunk is one task.
multiprocessing.Pool(...).starmap(create_json, worker_args) runs those tasks with --num-workers processes.
After all chunk_*.json files exist, tippecanoe is run once via os.system on *.json (not parallelized in Python).
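The flow above can be sketched roughly like this (`create_json`'s signature, `CHUNK_SIZE`, and the chunking details are assumptions based on the description, not the script's exact code):

```python
# Sketch of the chunking + Pool.starmap pattern described above.
from multiprocessing import Pool

CHUNK_SIZE = 20_000  # flat indices per task, per the description

def create_json(chunk_id, start, stop):
    # Stand-in for the real worker: the actual script slices the shared
    # arrays and writes JSON/chunk_<chunk_id>.json for these bounds.
    return chunk_id, start, stop

def make_worker_args(width, length):
    # Split WIDTH x LENGTH flat indices into (chunk_id, start, stop) tasks.
    total = width * length
    return [(i, start, min(start + CHUNK_SIZE, total))
            for i, start in enumerate(range(0, total, CHUNK_SIZE))]

if __name__ == "__main__":
    args = make_worker_args(6669, 2599)   # columns x rows from the log
    with Pool(processes=3) as pool:       # --num-workers 3
        results = pool.starmap(create_json, args)
    print(len(args), "chunks")
```

With the grid from the log (6669 × 2599), this yields 867 chunks, which matches the "converted chunk N" progression in the output.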
Why --num-workers 3 can OOM
starmap pickles every task’s arguments to send them to workers.
Each task’s tuple still includes the full timeseries_datasets dict and large lats / lons arrays (same for every chunk), so there are hundreds of near-duplicate large payloads in the queue.
The MemoryError in ForkPoolWorker-3 at _ForkingPickler.loads matches unpickling those queued tasks under memory pressure.
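A stdlib-only illustration of that duplication (the 69 MB stand-in mimics one float32 slice of this grid; the real payload also carries the full timeseries_datasets dict, so actual per-task sizes are larger):

```python
# Demonstrate that starmap serializes each task's arguments independently:
# if every tuple carries the same big object, the queue holds many
# near-duplicate copies of it.
import pickle

big = bytes(2599 * 6669 * 4)              # ~69 MB, one float32 slice's worth
per_task = len(pickle.dumps((0, big)))    # bytes pickled for a single task

n_tasks = 867                             # 20,000-index chunks for this grid
print(f"~{per_task / 1e6:.0f} MB per task, "
      f"~{per_task * n_tasks / 1e9:.0f} GB serialized across the queue")
```

Even though workers consume tasks as they go, several of these payloads being buffered and unpickled concurrently is enough to exhaust RAM.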

Choosing --num-workers
Use the largest value that does not OOM on your node; if 2 works and 3 fails, 2 is the right setting for that dataset.
There is no reliable formula from pixels/dates/file size alone because multiprocessing overhead dominates, not raw array size.
Pixels vs dates
Pixels drive the size of each 2D slice and how many points you iterate.
Dates multiply the depth of the 3D cube and the length of each feature’s d array in JSON.
Both matter for peak RAM; workers mainly add serialization/queue pressure on top.
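A back-of-envelope for the base footprint implied by those bullets (float32 assumed; the number of dates here is a hypothetical placeholder, not taken from this dataset):

```python
def cube_bytes(width, length, dates, itemsize=4):
    # 3D cube: one float32 value per pixel per date
    return width * length * dates * itemsize

width, length = 6669, 2599   # columns x rows from the log
dates = 300                  # hypothetical acquisition count
print(f"{cube_bytes(width, length, dates) / 1e9:.1f} GB for the cube alone")
```

Pixels scale both terms of the product; dates scale only the depth, but also lengthen every feature's d array in the JSON output.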
Tippecanoe
It is often a real bottleneck (single external process, large input).
Running it only in a second job would require skipping it in the first step (e.g. a --skip-tippecanoe flag). The current script always runs it; the doc records the exact commands so you can mirror them in a second step if you later add a skip path or use a wrapper.
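If a skip path were added, it could be as small as one extra argparse flag (hypothetical: --skip-tippecanoe does not exist in the current script):

```python
# Hypothetical --skip-tippecanoe flag so JSON generation and tippecanoe
# can run as separate jobs.
import argparse

def build_parser():
    p = argparse.ArgumentParser()
    p.add_argument("--skip-tippecanoe", action="store_true",
                   help="write chunk_*.json only; run tippecanoe in a later job")
    return p

args = build_parser().parse_args(["--skip-tippecanoe"])
if args.skip_tippecanoe:
    print("skipping tippecanoe; run it later on the chunk_*.json files")
```

The second job would then invoke tippecanoe on the existing *.json files with the exact command the doc records.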
Code was not modified; only MemoryIssues.md was added.

Does memory go up with more workers?
Usually yes for this script, but not as “N workers ⇒ N × full dataset in RAM.”

Base usage — Shared cube, timeseries_datasets, lat/lon, and task list depend mainly on pixels × dates, not strongly on worker count.

More worker processes — With fork, memory is shared at first (copy-on-write). Each worker that runs create_json touches data and builds Python/JSON, which can duplicate pages per process. More workers ⇒ more of that at once, so RSS can grow.

Main effect here — starmap pickles every chunk’s arguments, and each payload still includes the full timeseries_datasets and large arrays. Several workers unpickling in parallel and queue buffers raise peak memory when you increase --num-workers.

So: higher --num-workers tends to increase peak memory (especially from IPC/unpickling), which matches seeing OOM at 3 but not at 2. It is not a simple linear rule; if you are tight on RAM, fewer workers is safer.
