Fix extractor sandbox hang when the extract function raises an error#492
Fix extractor sandbox hang when the extract function raises an error#492discobot wants to merge 1 commit into
Conversation
A raising extract function killed the sandbox worker during warmup or mid-document, so every document burned the full timeout, was misreported as a timeout, and the real error never surfaced. The worker now treats warmup as best-effort and pipes per-document exceptions back to the parent, where they are counted as clean_error and the worker is reused. The worker is also a staticmethod now, since respawning previously crashed under the spawn start method when pickling the sandbox with already-started processes. The parent additionally closes its copy of the child pipe end so a dead worker is detected immediately via EOF instead of after the timeout, and a worker that dies during warmup raises EOFError (broken_process) instead of TimeoutError.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit f3f5f42. Configure here.
| else: | ||
| # Timeout occurred before the worker process signaled it's ready | ||
| self._cleanup_process() | ||
| raise TimeoutError( |
There was a problem hiding this comment.
Extract errors misclassified as timeout
Low Severity
When the worker sends a raised TimeoutError or EOFError from extract_fn over the pipe, process_document re-raises it and the enclosing except (TimeoutError, EOFError) treats it like a sandbox timeout or dead worker: the worker is torn down and stats record timeout or broken_process instead of clean_error, contradicting per-document error reuse.
Reviewed by Cursor Bugbot for commit f3f5f42. Configure here.


Fixes #339.
An extractor that raises (as in the issue's warmup traceback) kills the sandbox worker, but the parent doesn't notice the death: it sits in
poll(self.timeout)and burns the full timeout for every document, reporting each one as a generic timeout — the real exception never surfaces. The same applies to per-document errors, since the worker loop only caughtEOFError, which also made theisinstance(result, Exception)branch inprocess_documentdead code and forced a respawn per failing document.There are two more latent crashes on the same path: under the spawn start method the respawn fails with
cannot pickle 'weakref.ReferenceType' object(the boundself._workerdragsself.all_processes, which now holds a startedProcess, into the pickle), and ifstart()raises,__exit__crashes withAttributeErrortrying toterminate()a never-started process.This PR makes warmup best-effort (some extractors raise on the empty warmup text but work on real documents), pipes per-document exceptions back to the parent — so they're counted as
clean_error, logged once, and the worker is reused — and turns_workerinto a staticmethod so spawn never pickles the sandbox. The parent now also closes its copy of the child pipe end, so a dead worker is detected immediately via EOF instead of after the timeout, and a death during warmup raisesEOFError(broken_process) instead of being misreported as a timeout.On the issue's scenario (extractor raising on every document, 3 docs, timeout=3) main takes 9s and reports
{'timeout': 3}under fork (and additionally crashes in__exit__under spawn); with this PR it takes 0.1s, reports{'clean_error': 3}, and logs the realValueError. Addedtests/pipeline/test_extractor_sandbox.pycovering the error/reuse/timeout/crash/warmup paths (verified under both fork and spawn); existing extractor tests and a Trafilatura end-to-end run through the sandbox still pass.Note
Medium Risk
Changes multiprocessing lifecycle and error handling on a core extraction path; behavior shifts from timeout to clean_error/broken_process, but timeouts and cleanup paths are covered by new tests.
Overview
Fixes extractor sandbox hangs when
extractraises: the worker no longer dies silently on per-document errors, so the parent stops burning the full per-document timeout and mis-reportingtimeoutinstead of the real failure.ExtractorSandboxnow makes warmup best-effort (empty warmup can fail without blocking real docs), sends exceptions back over the pipe soBaseExtractorcan countclean_errorand reuse the worker, and uses a static_worker(withwarmup_textpassed in) so macOS spawn does not pickle the sandbox/all_processes. The parent closes its child pipe end and polls warmup readiness like document extraction, so crashes/EOF surface quickly asbroken_process/EOFErrorrather than warmup timeouts; failedProcess.start()no longer leaves a half-initialized process for cleanup.Adds
tests/pipeline/test_extractor_sandbox.pyfor always-failing extractors, worker reuse after errors, timeouts, hard crashes, warmup crashes, and warmup-only failures.Reviewed by Cursor Bugbot for commit f3f5f42. Bugbot is set up for automated code reviews on this repo. Configure here.