Environment
Python 3.13.13
Platform: Linux HPC (SLURM)
Single GPU inference
Description
When running classpose-predict-wsi on large WSI files (~8600 tiles), the process consistently crashes after tile loading has completed, while the worker is still processing tiles. The crash produces no Python traceback and exits with code 143 (SIGTERM) or 1, always within the same progress range (~76-85% of predicted tiles).
Root cause
The SlideLoader runs as a subprocess sharing a multiprocessing.Manager server with the main process and PostProcessor. When SlideLoader finishes filling the queue and its subprocess exits naturally, Python cleanup of the Manager proxies it holds destabilizes the shared Manager server. The worker (running in the main process) is still actively using Manager-proxied objects (slide.n, predicted_tiles_value, pp.q) at this point, causing a silent crash.
Workaround
Pre-loading all tiles into a local list before starting inference eliminates the concurrency between SlideLoader and the worker:
# After slide initialization, drain the queue fully before starting inference
all_tiles = []
while True:
    item = slide.q.get()
    if item[0] is None:
        break
    all_tiles.append(item)
slide.p.join()  # SlideLoader is fully done before worker starts
Then feed all_tiles into a plain queue.Queue for the worker. This ensures the SlideLoader subprocess is completely finished before inference begins, so its teardown cannot affect the Manager.
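For illustration, the hand-off into a plain queue.Queue could look roughly like this. It is a minimal sketch, not the project's code: to_local_queue, consume, and SENTINEL are hypothetical names, and the sentinel shape (a tuple whose first element is None) is only assumed from the item[0] is None check in the drain loop above.

```python
import queue

# Hypothetical sentinel: the drain loop stops when item[0] is None,
# so a one-element tuple holding None mimics that end-of-stream marker.
SENTINEL = (None,)


def to_local_queue(all_tiles):
    """Copy pre-loaded tiles into an in-process queue.Queue, ending with a sentinel."""
    local_q = queue.Queue()
    for item in all_tiles:
        local_q.put(item)
    local_q.put(SENTINEL)
    return local_q


def consume(local_q):
    """Stand-in for the worker loop: yield tiles until the sentinel appears."""
    while True:
        item = local_q.get()
        if item[0] is None:
            break
        yield item
```

Because queue.Queue lives entirely in the main process, the worker's tile reads no longer go through the Manager. The trade-off is memory: all tiles are held in RAM at once.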
Additional notes
The pp.polygons.empty() check in the polygon collection loop is also unreliable for managed queues on large slides (can return True prematurely with ~250k cells). Draining by count (pp.value.value) or writing directly to a file from the PostProcessor subprocess is more robust.
Reproducible across multiple different SVS files and hardware configurations.
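The count-based drain suggested in the notes above might be sketched as follows. drain_by_count is a hypothetical helper, and it assumes that the expected count is final when it is called (i.e. the PostProcessor has finished producing) and that it equals the number of items placed on the queue.

```python
import queue


def drain_by_count(q, expected, timeout=30.0):
    """Collect exactly `expected` items from q without relying on q.empty().

    Works with queue.Queue, multiprocessing.Queue, and Manager queue proxies,
    all of which raise queue.Empty when get() times out.
    """
    items = []
    while len(items) < expected:
        try:
            items.append(q.get(timeout=timeout))
        except queue.Empty:
            # Fail loudly instead of silently returning a short list.
            raise RuntimeError(
                f"got {len(items)} of {expected} items before timing out"
            ) from None
    return items
```

Usage would be something like polygons = drain_by_count(pp.polygons, pp.value.value), assuming pp.value.value holds the number of polygons queued. A timeout that is too short for a slow producer will raise rather than truncate the result.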