Skip to content

fix: silent error path causes integration test to hang#1346

Merged
hmzakhalid merged 5 commits into
mainfrom
fix/multithread-error-path
Feb 18, 2026
Merged

fix: silent error path causes integration test to hang#1346
hmzakhalid merged 5 commits into
mainfrom
fix/multithread-error-path

Conversation

@hmzakhalid

@hmzakhalid hmzakhalid commented Feb 18, 2026

Copy link
Copy Markdown
Collaborator

if the rayon thread panics, it causes the ? operator to return early, before publishing either ComputeResponse or ComputeRequestError.

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling and reporting of task failures—including propagation of worker panics and early reporting of pool errors.
    • Added error logging for unmatched compute responses to aid diagnosis.
  • Chores

    • Switched CI jobs to run on self-hosted runners.
    • Reduced several info-level logs to debug to lower log noise.

@vercel

vercel Bot commented Feb 18, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
crisp Ready Ready Preview, Comment Feb 18, 2026 1:11pm
enclave-docs Ready Ready Preview, Comment Feb 18, 2026 1:11pm

Request Review

@coderabbitai

coderabbitai Bot commented Feb 18, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Capture of compute requests before spawn and explicit TaskPool panic/spawn error propagation added; pool spawn errors are mapped to ComputeRequestError and published on the bus with early return. Unmatched correlation_id arrivals in zk-prover now log errors. CI runner targets changed to self-hosted; various net logs reduced to debug.

Changes

Cohort / File(s) Summary
Multithread actor
crates/multithread/src/multithread.rs
Capture ComputeRequest snapshot before task spawn; replace direct await with pool_result handling; on pool.spawn error log, map error to ComputeRequestErrorKind (Zk vs TrBFV), publish ComputeRequestError to bus, and return early; tracing imports consolidated.
Task pool & panic propagation
crates/multithread/src/pool.rs
Add TaskPoolError::Panic(String) variant; wrap worker execution in catch_unwind, convert panics into Panic(...) and send through channel; send Ok(result) on normal completion; adjust receiver logic to unpack nested Result and propagate panic info.
ZK prover response handling
crates/zk-prover/src/actors/proof_request.rs
Add error-level logging when ComputeResponse arrives with a correlation_id that has no matching pending request in handle_pk_generation_response and handle_pk_bfv_response (behavior otherwise unchanged).
Net logging verbosity
crates/net/src/net_interface.rs
Lowered several info! logs to debug! for swarm, put/get record, and sync events; no control-flow changes.
CI runners
.github/workflows/ci.yml
All CI jobs switched from ubuntu-latest to self-hosted runners; job steps unchanged.

Sequence Diagram(s)

sequenceDiagram
    participant Requester
    participant MultithreadActor as Multithread
    participant TaskPool
    participant Bus

    Requester->>Multithread: Submit ComputeRequest
    Multithread->>Multithread: request_snapshot = clone(request)
    Multithread->>TaskPool: spawn(task with request_snapshot)
    alt spawn OK
        TaskPool->>TaskPool: execute task (catch_unwind)
        alt task panics
            TaskPool->>Multithread: return Err(Panic(msg))
            Multithread->>Bus: publish ComputeRequestError(kind, msg)
            Multithread-->>Requester: return Ok(())
        else task completes
            TaskPool->>Multithread: return Ok(result)
            Multithread->>Bus: publish ComputeResponse/result
            Multithread-->>Requester: return Ok(())
        end
    else spawn Err(pool_err)
        TaskPool-->>Multithread: Err(pool_err)
        Multithread->>Bus: publish ComputeRequestError(mapped_kind, pool_err)
        Multithread-->>Requester: return Ok(())
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • 0xjei
  • ctrlc03

Poem

🐰 I held the snapshot tight before the leap,
When worker panics stir from tunnels deep,
I publish woes upon the bus with care,
And log the stray IDs lost in the air,
Hoppity fixes, small but bright and sweet 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: fixing an error path that causes integration test hangs due to silent failures in the multithread error handling.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/multithread-error-path

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/multithread/src/multithread.rs`:
- Around line 208-210: The code currently maps any TrBFV pool spawn error to the
hardcoded e3_trbfv::TrBFVError::GenPkShareAndSkSss variant, which is misleading;
update the match in the ComputeRequestKind::TrBFV arm to inspect
request_snapshot.request (the actual TrBFV variant) and map pool errors to the
corresponding TrBFV error variant for each operation (e.g., Convert to
GenEsiSss, CalculateDecryptionKey, CalculateDecryptionShare,
CalculateThresholdDecryption, or GenPkShareAndSkSss) instead of always using
GenPkShareAndSkSss so diagnostics reflect the real failing TrBFV operation.

In `@crates/zk-prover/src/actors/proof_request.rs`:
- Around line 137-140: In handle_pk_generation_response (the branch that looks
up pending_threshold), update the error log message that currently says
"Received PkBfv ComputeResponse..." to the correct wording "Received
PkGeneration ComputeResponse..." so the log matches the handler and pending
request type; modify the error! call that references correlation_id in that
function to use "PkGeneration" instead of "PkBfv".

Comment thread crates/multithread/src/multithread.rs Outdated
Comment thread crates/zk-prover/src/actors/proof_request.rs

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
.github/workflows/ci.yml (1)

36-36: Verify your self-hosted runner policy for public-repo PRs.

Running CI on self-hosted runners is efficient, but for a public repository the pull_request trigger means external contributors' PRs execute on your own infrastructure. If runners are configured to auto-approve workflows from all contributors (not just maintainers), a malicious PR can exfiltrate secrets and modify the runner environment. Additionally, the contrib-readme-job runs with permissions: contents: write on the self-hosted runner — ensure GITHUB_TOKEN scoping is tightly controlled.

Recommended actions:

  • Confirm GitHub's "Fork pull request workflows" setting is set to require approval from maintainers (Settings → Actions → Fork pull request workflows from outside collaborators).
  • Consider keeping security-sensitive jobs (those with secrets.GITHUB_TOKEN write access) isolated or guarded with an if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository condition.

Also applies to: 79-79, 122-122, 151-151, 188-188, 228-228, 263-263, 276-276, 341-341, 403-403, 434-434, 507-507, 615-615, 641-641, 670-670, 719-719, 763-763, 823-823, 860-860

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml at line 36, The workflow currently runs jobs on
self-hosted runners (lines showing "runs-on: self-hosted") including the job
named contrib-readme-job which has permissions: contents: write and may run for
pull_request events; update those job definitions to guard against untrusted
external PRs by adding an if condition such as verifying the event is not a
forked PR (e.g., if: github.event_name != 'pull_request' ||
github.event.pull_request.head.repo.full_name == github.repository) and tighten
permissions/GITHUB_TOKEN scope (remove write-level access where not strictly
needed or use minimal permissions) so self-hosted runners do not execute
privileged steps for external contributors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In @.github/workflows/ci.yml:
- Line 36: The workflow currently runs jobs on self-hosted runners (lines
showing "runs-on: self-hosted") including the job named contrib-readme-job which
has permissions: contents: write and may run for pull_request events; update
those job definitions to guard against untrusted external PRs by adding an if
condition such as verifying the event is not a forked PR (e.g., if:
github.event_name != 'pull_request' ||
github.event.pull_request.head.repo.full_name == github.repository) and tighten
permissions/GITHUB_TOKEN scope (remove write-level access where not strictly
needed or use minimal permissions) so self-hosted runners do not execute
privileged steps for external contributors.

@hmzakhalid hmzakhalid merged commit 26119c5 into main Feb 18, 2026
26 checks passed
@github-actions github-actions Bot deleted the fix/multithread-error-path branch February 26, 2026 03:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants