Embed original traceback in Hydra's exception report. #2863

Open
gleize wants to merge 4 commits into facebookresearch:main from gleize:main

gleize commented Mar 1, 2024 (Contributor)

Motivation

When Hydra runs jobs in multirun mode and a job crashes, only the exception is reported and re-raised; the original traceback is lost. Without that traceback, debugging the original error is slower and harder.
This PR embeds the original traceback in Hydra's exception report.

Have you read the Contributing Guidelines on pull requests?

No

Test Plan

Tested with the submitit plugin, which pickles the exception information properly.

Related Issues and PRs

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 1, 2024
meta-cla Bot commented Sep 13, 2025


Hi @gleize!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@kirillbobyrev

Hi! Is it possible to merge this so that SAM 3D Objects setup is slightly less convoluted?

omry left a comment (Collaborator)

Thanks for working on this. This PR is addressing the problem tracked in #2664, but it needs more work before it can be merged.

The core issue is real: failed remote/multirun jobs can lose the useful original traceback after the JobReturn crosses a serialization boundary. However, this implementation is not complete enough as-is:

  • Please link this PR explicitly to #2664.
  • Preserve the existing JobReturn.return_value behavior as much as possible. Replacing the stored exception with a TracebackException and then raising JobException changes the exception type and can break callers that expect the original exception object/type.
  • The traceback representation needs to survive the serialization boundaries used by launchers. In particular, a stdlib pickle round-trip of JobReturn containing traceback.TracebackException(*sys.exc_info()) currently fails with TypeError: cannot pickle code objects.
  • Add tests that reproduce the lost traceback after serialization and verify the fixed behavior. The tests should cover the relevant launcher boundary, not only the local in-process path.
  • Add a news fragment for the user-visible behavior change.
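
To illustrate the second and third points, here is a minimal sketch, not the PR's code and not Hydra's API. It assumes a hypothetical plain-dict record in place of JobReturn and hypothetical helpers `capture_failure` and `pickle_error`. It first probes whether a `traceback.TracebackException` survives stdlib pickling (the outcome may depend on the Python version, so the sketch only reports it), then shows one possible alternative: keep the original exception object and carry the traceback as pre-formatted text, which is an ordinary picklable string.

```python
import pickle
import sys
import traceback


def capture_failure(task):
    """Run task; on failure return a picklable record of what went wrong.

    The record layout (a plain dict) is hypothetical, not Hydra's JobReturn.
    """
    try:
        return {"return_value": task(), "traceback_text": None}
    except Exception as exc:
        # Render the traceback to plain text while the frames still exist.
        text = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        # Keep the original exception object so its type is unchanged.
        return {"return_value": exc, "traceback_text": text}


def pickle_error(obj):
    """Return None if obj pickles, else a short description of the error."""
    try:
        pickle.dumps(obj)
        return None
    except Exception as err:
        return f"{type(err).__name__}: {err}"


def failing_task():
    raise ValueError("bad config value")


# Approach questioned in the review: store a TracebackException directly.
try:
    failing_task()
except ValueError:
    tb_exc = traceback.TracebackException(*sys.exc_info())
print("TracebackException pickle error:", pickle_error(tb_exc))

# Alternative sketch: original exception plus pre-formatted traceback text.
record = capture_failure(failing_task)
restored = pickle.loads(pickle.dumps(record))
print(type(restored["return_value"]).__name__)
print(restored["traceback_text"])
```

After the round-trip, the exception keeps its original type but no longer carries a live `__traceback__`; the text field is what preserves the frames for the report. A real fix would still need tests against the actual launcher boundary, as requested above.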

If this does not get completed before the upcoming 1.4 release, I will handle this fix separately.

@omry omry added the awaiting_response Awaiting response label Jun 10, 2026
Labels

awaiting_response Awaiting response CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

4 participants