Skip to content

Fix extractor sandbox hang when the extract function raises an error#492

Open
discobot wants to merge 1 commit into
huggingface:mainfrom
discobot:fix/339-extractor-sandbox
Open

Fix extractor sandbox hang when the extract function raises an error#492
discobot wants to merge 1 commit into
huggingface:mainfrom
discobot:fix/339-extractor-sandbox

Conversation

@discobot

@discobot discobot commented Jun 13, 2026

Copy link
Copy Markdown

Fixes #339.

An extractor that raises (as in the issue's warmup traceback) kills the sandbox worker, but the parent doesn't notice the death: it sits in poll(self.timeout) and burns the full timeout for every document, reporting each one as a generic timeout — the real exception never surfaces. The same applies to per-document errors, since the worker loop only caught EOFError, which also made the isinstance(result, Exception) branch in process_document dead code and forced a respawn per failing document.

There are two more latent crashes on the same path: under the spawn start method the respawn fails with cannot pickle 'weakref.ReferenceType' object (the bound self._worker drags self.all_processes, which now holds a started Process, into the pickle), and if start() raises, __exit__ crashes with AttributeError trying to terminate() a never-started process.

This PR makes warmup best-effort (some extractors raise on the empty warmup text but work on real documents), pipes per-document exceptions back to the parent — so they're counted as clean_error, logged once, and the worker is reused — and turns _worker into a staticmethod so spawn never pickles the sandbox. The parent now also closes its copy of the child pipe end, so a dead worker is detected immediately via EOF instead of after the timeout, and a death during warmup raises EOFError (broken_process) instead of being misreported as a timeout.

On the issue's scenario (extractor raising on every document, 3 docs, timeout=3) main takes 9s and reports {'timeout': 3} under fork (and additionally crashes in __exit__ under spawn); with this PR it takes 0.1s, reports {'clean_error': 3}, and logs the real ValueError. Added tests/pipeline/test_extractor_sandbox.py covering the error/reuse/timeout/crash/warmup paths (verified under both fork and spawn); existing extractor tests and a Trafilatura end-to-end run through the sandbox still pass.


Note

Medium Risk
Changes multiprocessing lifecycle and error handling on a core extraction path; behavior shifts from timeout to clean_error/broken_process, but timeouts and cleanup paths are covered by new tests.

Overview
Fixes extractor sandbox hangs when extract raises: the worker no longer dies silently on per-document errors, so the parent stops burning the full per-document timeout and mis-reporting timeout instead of the real failure.

ExtractorSandbox now makes warmup best-effort (empty warmup can fail without blocking real docs), sends exceptions back over the pipe so BaseExtractor can count clean_error and reuse the worker, and uses a static _worker (with warmup_text passed in) so macOS spawn does not pickle the sandbox/all_processes. The parent closes its child pipe end and polls warmup readiness like document extraction, so crashes/EOF surface quickly as broken_process / EOFError rather than warmup timeouts; failed Process.start() no longer leaves a half-initialized process for cleanup.

Adds tests/pipeline/test_extractor_sandbox.py for always-failing extractors, worker reuse after errors, timeouts, hard crashes, warmup crashes, and warmup-only failures.

Reviewed by Cursor Bugbot for commit f3f5f42. Bugbot is set up for automated code reviews on this repo. Configure here.

A raising extract function killed the sandbox worker during warmup or mid-document, so every document burned the full timeout, was misreported as a timeout, and the real error never surfaced. The worker now treats warmup as best-effort and pipes per-document exceptions back to the parent, where they are counted as clean_error and the worker is reused. The worker is also a staticmethod now, since respawning previously crashed under the spawn start method when pickling the sandbox with already-started processes. The parent additionally closes its copy of the child pipe end so a dead worker is detected immediately via EOF instead of after the timeout, and a worker that dies during warmup raises EOFError (broken_process) instead of TimeoutError.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit f3f5f42. Configure here.

else:
# Timeout occurred before the worker process signaled it's ready
self._cleanup_process()
raise TimeoutError(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract errors misclassified as timeout

Low Severity

When the worker sends a raised TimeoutError or EOFError from extract_fn over the pipe, process_document re-raises it and the enclosing except (TimeoutError, EOFError) treats it like a sandbox timeout or dead worker: the worker is torn down and stats record timeout or broken_process instead of clean_error, contradicting per-document error reuse.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit f3f5f42. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

task hang when extraction raises error

1 participant