Guard save_compiled_model with try-catch to handle serialization failures#227
aditya-dl wants to merge 1 commit into ROCm:wml-main from
Conversation
@aditya-dl appreciate the patch to this. Curious where are you seeing this blow up? How large of a file? What limits are you seeing this fail on, and is that anywhere we can easily reproduce?
```cpp
  save_compiled_model(prog, model_cache_file);
} catch (const std::exception& e) {
  LOGS_DEFAULT(WARNING) << ">>>>>> exception caught at Compile::save_compiled_model: " << e.what();
} catch (...) {
```
Don't do a catch-all here, as we want to be explicit about which exceptions we're expecting and which we aren't. If there's an MIGraphX API failure we're hitting, since that's what save_compiled_model uses, I don't think we should catch and mask that but instead report it.
TedThemistokleous left a comment
Few questions/comments on this one. Let me know what errors you're seeing so we can be explicit in catching certain exceptions. We want to be able to signal API failures if something unexpected happens.
This is a good fix to stop a higher-level application from crashing, but I want to be clear on what cases we're seeing this in, and dig further if you're hitting an MIGraphX failure and there's a bug somewhere else right now.
Description
Wrap only the `save_compiled_model` call (in both the initial compile and recompile paths) with a try-catch that logs a warning instead of propagating the exception. All other operations (`SerializeToString`, `parse_onnx_buffer`, `compile_program`) are confirmed to succeed and are left unwrapped.

Motivation and Context
`save_compiled_model` can throw `std::bad_alloc` during MXR cache serialization when the compiled model is large. This crashes the entire inference session even though the model compiled and runs correctly; the cache file is purely an optimization for subsequent loads. Inference should not fail because of a cache write failure.