Embed original traceback in Hydra's exception report.#2863
Conversation
Embed original traceback in Hydra's exception report.
|
Hi @gleize! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
|
Hi! Is it possible to merge this so that SAM 3D Objects setup is slightly less convoluted? |
omry
left a comment
There was a problem hiding this comment.
Thanks for working on this. This PR is addressing the problem tracked in #2664, but it needs more work before it can be merged.
The core issue is real: failed remote/multirun jobs can lose the useful original traceback after the JobReturn crosses a serialization boundary. However, this implementation is not complete enough as-is:
- Please link this PR explicitly to #2664.
- Preserve the existing
JobReturn.return_valuebehavior as much as possible. Replacing the stored exception with aTracebackExceptionand then raisingJobExceptionchanges the exception type and can break callers that expect the original exception object/type. - The traceback representation needs to survive the serialization boundaries used by launchers. In particular, a stdlib
pickleround-trip ofJobReturncontainingtraceback.TracebackException(*sys.exc_info())currently fails withTypeError: cannot pickle code objects. - Add tests that reproduce the lost traceback after serialization and verify the fixed behavior. The tests should cover the relevant launcher boundary, not only the local in-process path.
- Add a news fragment for the user-visible behavior change.
If this does not get completed before the upcoming 1.4 release, I will handle this fix separately.
Motivation
When Hydra runs jobs (multirun mode), and the job crashes, only the exception gets reported and raised (but not the original traceback). This is missing for proper / faster debugging of the original error.
This PR solves that missing feature.
Have you read the Contributing Guidelines on pull requests?
No
Test Plan
Tested with the submitit plugin, which pickles the exception information properly.
Related Issues and PRs