UN-3085 Structure tool made into a celery task #1779
Deepak-Kesavan wants to merge 1 commit into main from
Conversation
Summary by CodeRabbit
Release Notes
Walkthrough
This pull request introduces a new Celery-based structure extraction worker service to the system. The changes add a worker-structure service to Docker Compose, extend the workflow execution layer to route structure tools through Celery based on a feature flag, and implement a comprehensive structure extraction worker with task orchestration, configuration helpers, and health checks.
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant WE as Workflow<br/>Execution
    participant Tools as ToolsUtils
    participant Celery as Celery<br/>Broker
    participant SW as Structure<br/>Worker
    participant Storage as File<br/>Storage
    participant PT as PromptTool<br/>Service

    WE->>Tools: run_tool(structure_tool)
    Tools->>Tools: _should_use_celery_for_structure()
    alt Feature Flag Enabled
        Tools->>Tools: _dispatch_celery_task()
        Tools->>Storage: Gather execution context<br/>& metadata
        Tools->>Celery: Send task with kwargs
        Note over Celery: structure_extraction queue
        Celery->>SW: Dispatch to worker
        SW->>Storage: Read settings & inputs
        SW->>PT: Create PromptTool client
        alt Agentic Tool
            SW->>PT: _run_agentic_extraction()
        else Prompt-Studio Tool
            SW->>Storage: dynamic_extraction()
            PT->>PT: Execute extraction
            SW->>Storage: dynamic_indexing()
        end
        SW->>Storage: Write structured_output
        SW->>Celery: Return result
        Celery->>Tools: Task result
        Tools->>WE: RunnerContainerRunResponse
    else Feature Flag Disabled
        Tools->>WE: run_tool_with_retry()<br/>(Docker path)
    end
```
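To make the flag-gated routing concrete, here is a minimal sketch of the dispatch shape described in the walkthrough and diagram. The task path "structure.execute_extraction" and queue "structure_extraction" come from this PR; the broker URLs, the boolean flag argument, and the run_tool_with_retry stub are assumptions for illustration, not the PR's actual code.

```python
# Minimal sketch of flag-gated dispatch; not the PR's exact implementation.
from celery import Celery

# Broker/backend URLs are placeholders for illustration.
celery_app = Celery(
    "unstract",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)


def run_structure_tool(tool_kwargs: dict, celery_enabled: bool) -> dict:
    """Route a structure tool run through Celery or the Docker path."""
    if celery_enabled:
        # Queue name and task path as introduced by this PR.
        async_result = celery_app.send_task(
            "structure.execute_extraction",
            kwargs=tool_kwargs,
            queue="structure_extraction",
        )
        # Wait synchronously so callers keep the same contract as the
        # Docker runner path.
        return async_result.get(timeout=600)
    return run_tool_with_retry(tool_kwargs)


def run_tool_with_retry(tool_kwargs: dict) -> dict:
    # Stand-in for the existing Docker runner path in the real codebase.
    raise NotImplementedError
```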
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Test Results
Summary
Runner Tests - Full Report
SDK1 Tests - Full Report
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@workers/structure/__init__.py`:
- Line 3: Replace the implicit top-level import "from worker import app" with an
explicit relative import so the package-local module is resolved correctly;
change the import in workers.structure.__init__.py to import the app symbol from
the sibling module named worker (i.e., use a relative import of worker and
import app).
In `@workers/structure/tasks.py`:
- Around line 269-280: The code currently reads PROMPT_HOST and PROMPT_PORT into
local variables (prompt_host, prompt_port) then constructs PromptTool, but
STHelper.dynamic_extraction and STHelper.dynamic_indexing later call
tool.get_env_or_die which reads os.environ directly and will fail if the keys
are not present; before creating PromptTool (or before any STHelper calls) call
os.environ.setdefault(SettingsKeys.PROMPT_HOST, prompt_host) and
os.environ.setdefault(SettingsKeys.PROMPT_PORT, prompt_port) so the defaults
persist in the environment, ensuring tool.get_env_or_die (and
STHelper.dynamic_extraction/dynamic_indexing) can find them.
- Around line 296-326: Replace the broad except around get_prompt_studio_tool
with a specific catch for RequestException and only fall back to
get_agentic_studio_tool when the caught RequestException indicates a 404 from
the service; for any other RequestException (no response or non-404 status)
re-raise it so auth/network errors propagate. Concretely, change the try/except
that calls get_prompt_studio_tool (and references prompt_registry_id,
exported_tool, SettingsKeys.TOOL_METADATA, logger) to except RequestException as
e: check e.response exists and e.response.status_code == 404 to proceed to the
agentic lookup, otherwise raise e; keep the existing SdkError handling for
agentic lookup failures.
In `@workers/structure/utils.py`:
- Around line 1-3: The workers package is missing the json_repair dependency
used by repair_json_with_best_structure() in workers/structure/utils.py; add
"json_repair" to the dependencies list in workers/pyproject.toml (under
[tool.poetry.dependencies] or the equivalent dependencies section) so the import
from json_repair and usage in repair_json_with_best_structure() will not fail at
runtime.
In `@workers/structure/worker.py`:
- Around line 61-69: Replace the hardcoded "healthy" status in the Celery task
healthcheck with the real result from check_structure_health(): call
check_structure_health() inside the healthcheck(self) task, use its returned
status (or entire result) for the "status" field (and include any diagnostic
info it returns) while preserving existing fields like "worker_type", "task_id"
(self.request.id) and "worker_name" (config.worker_name fallback); update
references to the function name check_structure_health and the healthcheck task
to ensure the task reflects DEGRADED/unhealthy states from the API client.
🧹 Nitpick comments (7)
unstract/workflow-execution/src/unstract/workflow_execution/tools_utils.py (3)
263-266: Consider using constants for queue and task names.
The queue name "structure_extraction" and task path "structure.execute_extraction" are hardcoded. For consistency and maintainability, consider using the QueueName.STRUCTURE constant from workers/shared/enums/worker_enums_base.py.
♻️ Suggested refactor
```diff
+from workers.shared.enums.worker_enums_base import QueueName
+
 # Map task_name to full Celery task path
 task_map = {
-    "structure": "structure.execute_extraction",
+    "structure": "structure.tasks.execute_structure_extraction",
     # Future: Add other migrated tasks here
 }
 ...
 # Determine queue name based on task
-queue_name = "structure_extraction" if task_name == "structure" else "celery"
+queue_name = QueueName.STRUCTURE.value if task_name == "structure" else QueueName.GENERAL.value
```
Also applies to: 335-336
360-370: Use logger.exception to preserve stack trace on task failure.
When logging the Celery task failure, using logger.exception instead of logger.error will automatically include the full stack trace, which aids debugging.
♻️ Proposed fix
```diff
 except Exception as e:
-    logger.error(
+    logger.exception(
         f"Celery task {full_task_name} failed for "
         f"file_execution_id={file_execution_id}: {e}"
     )
```
285-295: Metadata fallback catches overly broad exception.
Catching a bare Exception when loading metadata may mask unexpected errors. Consider catching the specific FileMetadataJsonNotFound exception (referenced in the ExecutionFileHandler.get_workflow_metadata snippet) or at minimum log the exception for debugging.
♻️ Proposed improvement
```diff
+from unstract.workflow_execution.exceptions import FileMetadataJsonNotFound
+
 # Get metadata to extract execution context
 try:
     metadata = file_handler.get_workflow_metadata()
-except Exception:
+except FileMetadataJsonNotFound:
     # If metadata doesn't exist yet, create minimal metadata
+    logger.debug("Metadata file not found, using minimal metadata")
     metadata = {
         "source_name": "INFILE",
         "source_hash": "",
         "tags": [],
         "llm_profile_id": None,
         "custom_data": {},
     }
```
workers/structure/utils.py (1)
22-27: Naive plural handling may produce incorrect labels.
The logic that strips a trailing 's' to singularize labels works for simple cases like "items" → "item" but will produce incorrect results for words like "status" → "statu", "address" → "addres", or "analysis" → "analysi".
Given that the TODO comment acknowledges this limitation, consider whether this is acceptable for the expected data or whether a more robust approach (e.g., using a library like inflect, sketched below) would be worthwhile.
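For comparison, a minimal sketch of the inflect-based approach mentioned above; inflect is an assumption here, not a dependency of this PR.

```python
# Sketch: library-backed singularization instead of stripping a trailing "s".
# Requires `pip install inflect`; shown for illustration only.
import inflect

_engine = inflect.engine()


def singular_label(label: str) -> str:
    # singular_noun() returns the singular form, or False when the
    # word is not a recognized plural.
    singular = _engine.singular_noun(label)
    return singular if singular else label


for word in ("items", "statuses", "addresses", "analyses"):
    print(word, "->", singular_label(word))
# Expected: item, status, address, analysis
```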
workers/structure/constants.py (2)
8-41: Remove duplicate SettingsKeys entries.
OUTPUTS, TOOL_ID, and NAME are defined twice; the later assignments override earlier ones and add noise.
♻️ Suggested cleanup
```diff
@@
-    OUTPUTS = "outputs"
@@
-    TOOL_ID = "tool_id"
@@
-    NAME = "name"
```
Also applies to: 67-67
103-106: Fix typo in TOOL_EXECUTION_METADATA constant name.
TOOL_EXECUTION_METATADA is misspelled; correcting now avoids spreading the typo.
✏️ Rename constant
```diff
-    TOOL_EXECUTION_METATADA = "tool_execution_metadata"
+    TOOL_EXECUTION_METADATA = "tool_execution_metadata"
```
🔧 Update usages (outside this file)
```diff
--- a/workers/structure/helpers.py
+++ b/workers/structure/helpers.py
@@
-    IKeys.TOOL_EXECUTION_METATADA: tool.get_exec_metadata,
+    IKeys.TOOL_EXECUTION_METADATA: tool.get_exec_metadata,
@@
-    IKeys.TOOL_EXECUTION_METATADA: tool.get_exec_metadata,
+    IKeys.TOOL_EXECUTION_METADATA: tool.get_exec_metadata,
```
workers/structure/tasks.py (1)
400-402: Use a randomized temp dir to avoid predictable /tmp paths.
Predictable temp paths can collide across retries and are flagged as unsafe; prefer tempfile with a prefix that includes the execution id.
♻️ Suggested fix
```diff
@@
-import os
+import os
+import tempfile
@@
-    tool_data_dir = Path(f"/tmp/structure_{file_execution_id}")
+    tool_data_dir = Path(
+        tempfile.mkdtemp(prefix=f"structure_{file_execution_id}_")
+    )
```
```diff
@@ -0,0 +1,5 @@
+"""Structure extraction worker module."""
+
+from worker import app
```
Use explicit relative import for worker module.
The import from worker import app uses an implicit import which may fail in Python 3 if worker is not in sys.path. Since worker.py is in the same package (workers/structure/), use an explicit relative import.
🐛 Proposed fix
```diff
-from worker import app
+from .worker import app
```
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
from .worker import app
```
🤖 Prompt for AI Agents
In `@workers/structure/__init__.py` at line 3, Replace the implicit top-level
import "from worker import app" with an explicit relative import so the
package-local module is resolved correctly; change the import in
workers.structure.__init__.py to import the app symbol from the sibling module
named worker (i.e., use a relative import of worker and import app).
```python
# Create SDK1 service clients using WorkerToolContext
prompt_host = os.getenv(SettingsKeys.PROMPT_HOST, "prompt-service")
prompt_port = os.getenv(SettingsKeys.PROMPT_PORT, "3003")
platform_host = os.getenv(ToolEnv.PLATFORM_HOST, "backend")
platform_port = os.getenv(ToolEnv.PLATFORM_PORT, "8000")

responder: PromptTool = PromptTool(
    tool=tool_context,
    prompt_host=prompt_host,
    prompt_port=prompt_port,
    request_id=file_execution_id,
)
```
🧩 Analysis chain

🏁 Script executed:
```bash
#!/bin/bash
# First, let's check what the provided search finds
rg -n "PROMPT_HOST|PROMPT_PORT" -g "*.yml" -g "*.yaml" -g "*.env*" -g "*.py" | head -50
```
Repository: Zipstack/unstract
Length of output: 3918

🏁 Script executed:
```bash
#!/bin/bash
# Let's also find the StructureToolHelper and understand how it uses get_env_or_die
rg -n "StructureToolHelper|get_env_or_die" --type py -A 3 -B 1 | head -100
```
Repository: Zipstack/unstract
Length of output: 6937

🏁 Script executed:
```bash
#!/bin/bash
# Let's look at the actual file in question to understand the context better
wc -l workers/structure/tasks.py
```
Repository: Zipstack/unstract
Length of output: 91

🏁 Script executed:
```bash
#!/bin/bash
# Read the full context around lines 269-280 in tasks.py
sed -n '260,310p' workers/structure/tasks.py
```
Repository: Zipstack/unstract
Length of output: 2069

🏁 Script executed:
```bash
#!/bin/bash
# Also check where STHelper is called to understand the code flow
rg -n "STHelper|StructureToolHelper" workers/structure/tasks.py -A 5 -B 2
```
Repository: Zipstack/unstract
Length of output: 1119

🏁 Script executed:
```bash
#!/bin/bash
# Check the get_env_or_die implementation in the tool/stream.py
sed -n '130,160p' unstract/sdk1/src/unstract/sdk1/tool/stream.py
```
Repository: Zipstack/unstract
Length of output: 917
Set PROMPT_HOST/PROMPT_PORT defaults into os.environ before STHelper calls.
Defaults are obtained locally via os.getenv (lines 270–271), but STHelper.dynamic_extraction (line 438) and STHelper.dynamic_indexing (line 492) later call tool.get_env_or_die for the same keys, which checks os.environ directly. If these environment variables aren't set at runtime, get_env_or_die will raise an error, even though local defaults exist. Use os.environ.setdefault() to persist the defaults.
Suggested fix
```diff
 # Create SDK1 service clients using WorkerToolContext
 prompt_host = os.getenv(SettingsKeys.PROMPT_HOST, "prompt-service")
 prompt_port = os.getenv(SettingsKeys.PROMPT_PORT, "3003")
+os.environ.setdefault(SettingsKeys.PROMPT_HOST, prompt_host)
+os.environ.setdefault(SettingsKeys.PROMPT_PORT, prompt_port)
 platform_host = os.getenv(ToolEnv.PLATFORM_HOST, "backend")
 platform_port = os.getenv(ToolEnv.PLATFORM_PORT, "8000")
```
🤖 Prompt for AI Agents
In `@workers/structure/tasks.py` around lines 269 - 280, The code currently reads
PROMPT_HOST and PROMPT_PORT into local variables (prompt_host, prompt_port) then
constructs PromptTool, but STHelper.dynamic_extraction and
STHelper.dynamic_indexing later call tool.get_env_or_die which reads os.environ
directly and will fail if the keys are not present; before creating PromptTool
(or before any STHelper calls) call
os.environ.setdefault(SettingsKeys.PROMPT_HOST, prompt_host) and
os.environ.setdefault(SettingsKeys.PROMPT_PORT, prompt_port) so the defaults
persist in the environment, ensuring tool.get_env_or_die (and
STHelper.dynamic_extraction/dynamic_indexing) can find them.
```python
try:
    exported_tool = platform_helper.get_prompt_studio_tool(
        prompt_registry_id=prompt_registry_id
    )
except Exception as e:
    logger.info(f"Not found as prompt studio project, trying agentic registry: {e}")

if exported_tool and SettingsKeys.TOOL_METADATA in exported_tool:
    tool_metadata = exported_tool[SettingsKeys.TOOL_METADATA]
    is_agentic = False
    tool_metadata["is_agentic"] = False
else:
    # Try agentic registry as fallback
    try:
        agentic_tool = platform_helper.get_agentic_studio_tool(
            agentic_registry_id=prompt_registry_id
        )
        if not agentic_tool or SettingsKeys.TOOL_METADATA not in agentic_tool:
            raise SdkError(
                f"Error fetching project: Registry returned empty response for {prompt_registry_id}"
            )
        tool_metadata = agentic_tool[SettingsKeys.TOOL_METADATA]
        is_agentic = True
        tool_metadata["is_agentic"] = True
        logger.info(
            f"Retrieved agentic project: {tool_metadata.get('name', prompt_registry_id)}"
        )
    except Exception as agentic_error:
        raise SdkError(
            f"Error fetching project from both registries for ID '{prompt_registry_id}': {agentic_error}"
        )
```
🧩 Analysis chain

🏁 Script executed:
```bash
#!/bin/bash
# First, let's find the relevant files and understand the exception handling
rg -l "get_prompt_studio_tool" --type py | head -5
```
Repository: Zipstack/unstract
Length of output: 236

🏁 Script executed:
```bash
#!/bin/bash
# Look for the actual implementation of get_prompt_studio_tool
rg -n "def get_prompt_studio_tool" --type py -A 15
```
Repository: Zipstack/unstract
Length of output: 2654

🏁 Script executed:
```bash
#!/bin/bash
# Check for _call_service and how it handles errors
rg -n "def _call_service" --type py -A 20
```
Repository: Zipstack/unstract
Length of output: 3706

🏁 Script executed:
```bash
#!/bin/bash
# Look for SdkError definition and how status_code is set
rg -n "class SdkError" --type py -A 10
```
Repository: Zipstack/unstract
Length of output: 918

🏁 Script executed:
```bash
#!/bin/bash
# Get the full implementation of _call_service in platform.py
sed -n '218,280p' unstract/sdk1/src/unstract/sdk1/platform.py
```
Repository: Zipstack/unstract
Length of output: 2719

🏁 Script executed:
```bash
#!/bin/bash
# Check if there are more lines in the _call_service implementation
sed -n '218,350p' unstract/sdk1/src/unstract/sdk1/platform.py
```
Repository: Zipstack/unstract
Length of output: 5374

🏁 Script executed:
```bash
#!/bin/bash
# Check what exception types are actually raised and how they're caught in the codebase
rg -n "RequestException" unstract/sdk1/src/unstract/sdk1/platform.py -A 2 -B 2
```
Repository: Zipstack/unstract
Length of output: 727

🏁 Script executed:
```bash
#!/bin/bash
# Check if there's a wrapper or custom exception that carries status_code
rg -n "raise.*RequestException\|raise.*SdkError" unstract/sdk1/src/unstract/sdk1/platform.py
```
Repository: Zipstack/unstract
Length of output: 43

🏁 Script executed:
```bash
#!/bin/bash
# Look at the actual tasks.py code to see what happens when exception is caught
sed -n '296,330p' workers/structure/tasks.py
```
Repository: Zipstack/unstract
Length of output: 1537

🏁 Script executed:
```bash
#!/bin/bash
# Check what stream_error_and_exit does - does it raise or exit?
rg -n "def stream_error_and_exit" unstract/sdk1/src/unstract/sdk1 -A 5
```
Repository: Zipstack/unstract
Length of output: 609

🏁 Script executed:
```bash
#!/bin/bash
# Check the imports in tasks.py to understand what exceptions should be caught
head -50 workers/structure/tasks.py | grep -E "^import|^from"
```
Repository: Zipstack/unstract
Length of output: 720

🏁 Script executed:
```bash
#!/bin/bash
# Let's check if RequestException carries status_code attribute
python3 << 'EOF'
from requests.exceptions import RequestException
import inspect

# Check RequestException attributes
print("RequestException attributes:")
print([attr for attr in dir(RequestException('test')) if not attr.startswith('_')])

# Check the signature
e = RequestException('test')
print("\nHas status_code:", hasattr(e, 'status_code'))
EOF
```
Repository: Zipstack/unstract
Length of output: 243

🏁 Script executed:
```bash
#!/bin/bash
# Check what happens in stream_error_and_exit - does it raise or call sys.exit?
sed -n '118,135p' unstract/sdk1/src/unstract/sdk1/tool/stream.py
```
Repository: Zipstack/unstract
Length of output: 646
Avoid falling back to agentic on non-404 errors.
Catching all exceptions treats auth/network failures as "not found," which can route to agentic incorrectly. However, the suggested fix is incorrect: _call_service raises RequestException (not SdkError) for 404 errors on tool lookups, and RequestException does not have a status_code attribute.
The correct approach is to catch RequestException specifically instead of all exceptions, since that's what _call_service raises for 404 on tool lookups:
✅ Corrected fix
```diff
 try:
     exported_tool = platform_helper.get_prompt_studio_tool(
         prompt_registry_id=prompt_registry_id
     )
-except Exception as e:
+except RequestException as e:
     logger.info(f"Not found as prompt studio project, trying agentic registry: {e}")
```
This narrows the fallback to request-level failures from the tool lookup; combined with an explicit 404 check (as the AI-agent prompt below describes), auth errors, timeouts, and other non-404 failures propagate properly instead of silently routing to the agentic registry.
🧰 Tools
🪛 Ruff (0.14.14)
[warning] 300-300: Do not catch blind exception: Exception
(BLE001)
[warning] 314-316: Abstract raise to an inner function
(TRY301)
[warning] 314-316: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 323-323: Do not catch blind exception: Exception
(BLE001)
[warning] 324-326: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
[warning] 324-326: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@workers/structure/tasks.py` around lines 296 - 326, Replace the broad except
around get_prompt_studio_tool with a specific catch for RequestException and
only fall back to get_agentic_studio_tool when the caught RequestException
indicates a 404 from the service; for any other RequestException (no response or
non-404 status) re-raise it so auth/network errors propagate. Concretely, change
the try/except that calls get_prompt_studio_tool (and references
prompt_registry_id, exported_tool, SettingsKeys.TOOL_METADATA, logger) to except
RequestException as e: check e.response exists and e.response.status_code == 404
to proceed to the agentic lookup, otherwise raise e; keep the existing SdkError
handling for agentic lookup failures.
```python
from typing import Any

from json_repair import repair_json
```
🧩 Analysis chain

🏁 Script executed:
```bash
#!/bin/bash
# Check for json_repair in various dependency files
echo "=== Searching for json_repair in dependency files ==="

# List common dependency files in the repo
echo "=== Looking for dependency declaration files ==="
fd -t f '(requirements|setup|pyproject|poetry|Pipfile)'

echo ""
echo "=== Searching for json_repair in all files ==="
rg 'json.repair|json_repair' --type py --type txt --type toml --type cfg --type yaml
```
Repository: Zipstack/unstract
Length of output: 1958

🏁 Script executed:
```bash
#!/bin/bash
echo "=== Content of workers/pyproject.toml ==="
cat workers/pyproject.toml
echo ""
echo "=== Content of root pyproject.toml ==="
cat pyproject.toml
```
Repository: Zipstack/unstract
Length of output: 13247

🏁 Script executed:
```bash
#!/bin/bash
echo "=== Searching for json_repair in unstract packages ==="
find unstract -name "pyproject.toml" -exec grep -l "json.repair\|json_repair" {} \;
echo ""
echo "=== Checking where repair_json is imported/used ==="
rg "from json_repair import|import json_repair|repair_json" --type py
echo ""
echo "=== Check if repair_json is actually used in workers/structure/utils.py ==="
cat workers/structure/utils.py
```
Repository: Zipstack/unstract
Length of output: 7142
Add json_repair to workers dependencies.
The json_repair library is imported in workers/structure/utils.py and actively used in repair_json_with_best_structure(), but it is not declared as a dependency in workers/pyproject.toml. Add it to the dependencies list to prevent runtime import failures.
🤖 Prompt for AI Agents
In `@workers/structure/utils.py` around lines 1 - 3, The workers package is
missing the json_repair dependency used by repair_json_with_best_structure() in
workers/structure/utils.py; add "json_repair" to the dependencies list in
workers/pyproject.toml (under [tool.poetry.dependencies] or the equivalent
dependencies section) so the import from json_repair and usage in
repair_json_with_best_structure() will not fail at runtime.
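To illustrate why the missing declaration only surfaces at runtime, a minimal usage sketch of the import in question (the sample input is hypothetical):

```python
# Raises ModuleNotFoundError at import time if json_repair is not installed
# in the worker image, i.e. not declared in workers/pyproject.toml.
from json_repair import repair_json

# repair_json fixes common defects in LLM-produced JSON, such as
# trailing commas, and returns a valid JSON string.
broken = '{"fields": [{"name": "total", "value": 42,},]}'
print(repair_json(broken))  # expected: {"fields": [{"name": "total", "value": 42}]}
```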
```python
@app.task(bind=True)
def healthcheck(self):
    """Health check task for monitoring systems."""
    return {
        "status": "healthy",
        "worker_type": "structure",
        "task_id": self.request.id,
        "worker_name": config.worker_name if config else "structure-worker",
    }
```
🧩 Analysis chain

🏁 Script executed:
```bash
#!/bin/bash
# First, let's find the healthcheck task and understand its context
cd workers/structure || exit
cat -n worker.py | head -80
```
Repository: Zipstack/unstract
Length of output: 2759

🏁 Script executed:
```bash
#!/bin/bash
# Search for check_structure_health function
rg -n "check_structure_health" --type py
```
Repository: Zipstack/unstract
Length of output: 221

🏁 Script executed:
```bash
#!/bin/bash
# Look for healthcheck references in the codebase
rg -n "healthcheck" --type py
```
Repository: Zipstack/unstract
Length of output: 506

🏁 Script executed:
```bash
#!/bin/bash
# Check the overall structure of the workers directory
find . -name "*.py" -type f | grep -E "(worker|health)" | head -20
```
Repository: Zipstack/unstract
Length of output: 1054

🏁 Script executed:
```bash
#!/bin/bash
# Find HealthCheckResult class definition
rg -n "class HealthCheckResult" --type py
```
Repository: Zipstack/unstract
Length of output: 139

🏁 Script executed:
```bash
#!/bin/bash
# Check the HealthStatus enum
rg -n "class HealthStatus" --type py
```
Repository: Zipstack/unstract
Length of output: 140

🏁 Script executed:
```bash
#!/bin/bash
# Look at one of the other worker files to see if they have similar pattern
cat -n workers/general/worker.py | sed -n '50,80p'
```
Repository: Zipstack/unstract
Length of output: 740

🏁 Script executed:
```bash
#!/bin/bash
# Read the HealthCheckResult and HealthStatus definitions
cat -n workers/shared/infrastructure/monitoring/health.py
```
Repository: Zipstack/unstract
Length of output: 21437
Use actual health check result in healthcheck task instead of hardcoding "healthy".
The healthcheck task currently always reports "status": "healthy" regardless of actual worker state. The check_structure_health() function can return DEGRADED status when the API client is unhealthy, but this is masked by the Celery task. If this task is used by liveness/readiness probes, it will incorrectly report the worker as healthy even during degradation.
✅ Suggested fix
```diff
 @app.task(bind=True)
 def healthcheck(self):
     """Health check task for monitoring systems."""
+    result = check_structure_health()
     return {
-        "status": "healthy",
+        "status": result.status.value,
         "worker_type": "structure",
         "task_id": self.request.id,
         "worker_name": config.worker_name if config else "structure-worker",
+        "details": result.details,
+        "message": result.message,
     }
```
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
@app.task(bind=True)
def healthcheck(self):
    """Health check task for monitoring systems."""
    result = check_structure_health()
    return {
        "status": result.status.value,
        "worker_type": "structure",
        "task_id": self.request.id,
        "worker_name": config.worker_name if config else "structure-worker",
        "details": result.details,
        "message": result.message,
    }
```
🤖 Prompt for AI Agents
In `@workers/structure/worker.py` around lines 61 - 69, Replace the hardcoded
"healthy" status in the Celery task healthcheck with the real result from
check_structure_health(): call check_structure_health() inside the
healthcheck(self) task, use its returned status (or entire result) for the
"status" field (and include any diagnostic info it returns) while preserving
existing fields like "worker_type", "task_id" (self.request.id) and
"worker_name" (config.worker_name fallback); update references to the function
name check_structure_health and the healthcheck task to ensure the task reflects
DEGRADED/unhealthy states from the API client.
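Once the task reports real status, it can back a monitoring probe. A hypothetical probe script for illustration — the registered task name, queue, and broker URLs are assumptions, not values confirmed by this PR:

```python
# Hypothetical probe: dispatch the healthcheck task and fail unless the
# worker itself reports a healthy status.
import sys

from celery import Celery

app = Celery(
    "probe",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

result = app.send_task(
    "structure.worker.healthcheck",  # assumed registered task name
    queue="structure_extraction",    # queue introduced in this PR
)
payload = result.get(timeout=10)  # raises if the task fails or times out

if payload.get("status") != "healthy":
    print(f"Structure worker degraded: {payload}")
    sys.exit(1)
print(f"Structure worker OK: {payload.get('worker_name')}")
```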


What
Why
How
Can this PR break any existing features? If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
Database Migrations
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
Screenshots
Checklist
I have read and understood the Contribution Guidelines.