Add error checking and status propagation to ROCm profiler#671
Add error checking and status propagation to ROCm profiler#671magaonka-amd wants to merge 1 commit intoROCm:mainfrom
Conversation
943a9fc to
e30dd6f
Compare
| #include <time.h> | ||
| #include <unistd.h> |
There was a problem hiding this comment.
nit: C system headers (<time.h>, <unistd.h>) should appear in the system-header block near the top of the file, before third-party headers, per Google C++ style include ordering. They're currently placed after the rocm/include/rocprofiler-sdk/ headers.
Also, <time.h> appears to be unused — no symbol from it (time_t, clock_gettime, timespec, etc.) is referenced anywhere in this file. Consider removing it entirely and moving <unistd.h> back to the system-header group.
| #include <time.h> | |
| #include <unistd.h> | |
| #include <unistd.h> |
There was a problem hiding this comment.
Resolved - addressed in this revision.
|
|
||
| #define RETURN_IF_ROCPROFILER_ERROR(expr) \ | ||
| do { \ | ||
| rocprofiler_status_t _status = (expr); \ | ||
| if (ABSL_PREDICT_FALSE(_status != ROCPROFILER_STATUS_SUCCESS)) { \ | ||
| const char* _errstr = rocprofiler_get_status_string(_status); \ | ||
| LOG(ERROR) << #expr << " failed: " << _errstr; \ | ||
| return -1; \ | ||
| } \ |
There was a problem hiding this comment.
nit: The macro-local identifiers _status and _errstr start with a leading underscore. While identifiers like _lowercase are only reserved at file/global scope (not inside a function block), this is still a common footgun that linters flag. Consider using non-underscore-prefixed names (e.g., rocprofiler_macro_status / rocprofiler_macro_errstr) to avoid any ambiguity.
| #define RETURN_IF_ROCPROFILER_ERROR(expr) \ | |
| do { \ | |
| rocprofiler_status_t _status = (expr); \ | |
| if (ABSL_PREDICT_FALSE(_status != ROCPROFILER_STATUS_SUCCESS)) { \ | |
| const char* _errstr = rocprofiler_get_status_string(_status); \ | |
| LOG(ERROR) << #expr << " failed: " << _errstr; \ | |
| return -1; \ | |
| } \ | |
| #define RETURN_IF_ROCPROFILER_ERROR(expr) \ | |
| do { \ | |
| rocprofiler_status_t rocprofiler_macro_status = (expr); \ | |
| if (ABSL_PREDICT_FALSE(rocprofiler_macro_status != \ | |
| ROCPROFILER_STATUS_SUCCESS)) { \ | |
| const char* rocprofiler_macro_errstr = \ | |
| rocprofiler_get_status_string(rocprofiler_macro_status); \ | |
| LOG(ERROR) << #expr << " failed: " << rocprofiler_macro_errstr;\ | |
| return -1; \ | |
| } \ | |
| } while (false) |
There was a problem hiding this comment.
Resolved - addressed in this revision.
|
|
||
| rocprofiler_status_t rc = rocprofiler_start_context(context_); | ||
| if (rc != ROCPROFILER_STATUS_SUCCESS) { | ||
| const char* errstr = rocprofiler_get_status_string(rc); | ||
| options_ = {}; | ||
| collector_ = nullptr; | ||
| return absl::InternalError( | ||
| absl::StrCat("rocprofiler_start_context failed: ", errstr)); | ||
| } | ||
| api_tracing_enabled_ = true; |
There was a problem hiding this comment.
suggestion: The rollback on rocprofiler_start_context failure resets options_ and collector_ but does not undo the annotation_map_.Clear() from line 147. This means a failed Enable() leaves the annotation map cleared — if there was useful state from a prior session it's now lost.
Consider moving annotation_map_.Clear() to after the success check, so state is only modified once the context is confirmed started:
rocprofiler_status_t rc = rocprofiler_start_context(context_);
if (rc != ROCPROFILER_STATUS_SUCCESS) {
const char* errstr = rocprofiler_get_status_string(rc);
options_ = {};
collector_ = nullptr;
return absl::InternalError(
absl::StrCat("rocprofiler_start_context failed: ", errstr));
}
annotation_map_.Clear();There was a problem hiding this comment.
Resolved - addressed in this revision.
| RETURN_IF_ROCPROFILER_ERROR( | ||
| rocprofiler_assign_callback_thread(buffer_, client_thread)); |
There was a problem hiding this comment.
nit: Every other rocprofiler API call in toolInit() is now wrapped with RETURN_IF_ROCPROFILER_ERROR, but rocprofiler_context_is_valid() is left unwrapped. While the isValid == 0 fallthrough produces the same -1 return, wrapping it would be more consistent and would log the actual rocprofiler error string if the call itself fails.
| RETURN_IF_ROCPROFILER_ERROR( | |
| rocprofiler_assign_callback_thread(buffer_, client_thread)); | |
| RETURN_IF_ROCPROFILER_ERROR( | |
| rocprofiler_context_is_valid(context_, &isValid)); |
Review SummaryGood improvement to error handling in the ROCm profiler. Wrapping rocprofiler-sdk calls with A few items flagged inline:
See inline comments for details and suggested fixes. |
| #include <time.h> | ||
| #include <unistd.h> |
There was a problem hiding this comment.
Include order violation: <time.h> and <unistd.h> are placed after the rocprofiler-sdk third-party headers. Per XLA conventions (and .clang-format enforcement), C/C++ system headers must appear in the second group, immediately after the corresponding header — not interspersed with third-party includes.
Additionally, <time.h> appears to have been newly added — if it is actually needed, it should be <ctime> (C++ style). If it's not needed, it should be removed. <unistd.h> was previously included higher up and was moved here; it should be placed back in the system-headers group.
| #include <time.h> | |
| #include <unistd.h> | |
| #include <ctime> | |
| #include <unistd.h> |
(These lines should be moved up to sit with the other C/C++ system headers around line 24, before the third-party absl/ includes.)
There was a problem hiding this comment.
Resolved - addressed in this revision.
| #define RETURN_IF_ROCPROFILER_ERROR(expr) \ | ||
| do { \ | ||
| rocprofiler_status_t _status = (expr); \ | ||
| if (ABSL_PREDICT_FALSE(_status != ROCPROFILER_STATUS_SUCCESS)) { \ | ||
| const char* _errstr = rocprofiler_get_status_string(_status); \ | ||
| LOG(ERROR) << #expr << " failed: " << _errstr; \ | ||
| return -1; \ | ||
| } \ | ||
| } while (false) |
There was a problem hiding this comment.
Nit: macro variable names use leading-underscore identifiers. While _status and _errstr are fine inside a do-while block scope, they could shadow outer variables and the convention is slightly unusual for XLA code. Consider a more distinctive prefix:
| #define RETURN_IF_ROCPROFILER_ERROR(expr) \ | |
| do { \ | |
| rocprofiler_status_t _status = (expr); \ | |
| if (ABSL_PREDICT_FALSE(_status != ROCPROFILER_STATUS_SUCCESS)) { \ | |
| const char* _errstr = rocprofiler_get_status_string(_status); \ | |
| LOG(ERROR) << #expr << " failed: " << _errstr; \ | |
| return -1; \ | |
| } \ | |
| } while (false) | |
| #define RETURN_IF_ROCPROFILER_ERROR(expr) \ | |
| do { \ | |
| rocprofiler_status_t rocprof_status_ = (expr); \ | |
| if (ABSL_PREDICT_FALSE(rocprof_status_ != ROCPROFILER_STATUS_SUCCESS)) { \ | |
| const char* rocprof_errstr_ = rocprofiler_get_status_string(rocprof_status_); \ | |
| LOG(ERROR) << #expr << " failed: " << rocprof_errstr_; \ | |
| return -1; \ | |
| } \ | |
| } while (false) |
There was a problem hiding this comment.
Resolved - addressed in this revision.
| rocprofiler_assign_callback_thread(buffer_, client_thread)); | ||
|
|
||
| int isValid = 0; | ||
| rocprofiler_context_is_valid(context_, &isValid); |
There was a problem hiding this comment.
Unchecked rocprofiler_context_is_valid return value. After wrapping all other rocprofiler-sdk calls with RETURN_IF_ROCPROFILER_ERROR, this call is left unchecked. If it fails (returns non-success), the error is silently ignored and only the isValid output is inspected. For consistency with the rest of the PR's approach, this should also be wrapped:
| rocprofiler_context_is_valid(context_, &isValid); | |
| RETURN_IF_ROCPROFILER_ERROR(rocprofiler_context_is_valid(context_, &isValid)); |
| } | ||
|
|
||
| // Utility context to gather code‑object info | ||
| rocprofiler_create_context(&utility_context_); | ||
| RETURN_IF_ROCPROFILER_ERROR(rocprofiler_create_context(&utility_context_)); |
There was a problem hiding this comment.
Partial initialization leak on mid-function failure. If a RETURN_IF_ROCPROFILER_ERROR triggers partway through toolInit(), previously created resources (contexts, buffers, threads) are not cleaned up. For example, if rocprofiler_create_buffer succeeds at line 457 but a subsequent rocprofiler_configure_buffer_tracing_service fails, both utility_context_ and context_ have been created but neither is destroyed before returning -1.
Consider adding a cleanup path (e.g., a helper that destroys created resources on failure) or using an RAII-style guard so the singleton isn't left in a half-initialized state.
| } | ||
| options_ = options; | ||
| collector_ = collector; | ||
| annotation_map_.Clear(); |
There was a problem hiding this comment.
Minor: annotation_map_.Clear() not rolled back on failure. If rocprofiler_start_context fails at line 151, options_ and collector_ are correctly reset (lines 154-155), but the annotation map has already been cleared. For symmetry in the cleanup path, consider moving annotation_map_.Clear() after the success check, or adding it to the error cleanup block.
There was a problem hiding this comment.
Resolved - addressed in this revision.
| "@com_google_absl//absl/strings", | ||
| "@com_google_absl//absl/strings:str_format", | ||
| "@com_google_absl//absl/synchronization", | ||
| "@com_google_absl//absl/types:optional", |
There was a problem hiding this comment.
Pre-existing: absl/types:optional is a prohibited XLA API. Since the code at rocm_tracer.h:84 already uses std::optional, both this BUILD dep and the #include "absl/types/optional.h" in rocm_tracer.h:23 appear to be dead references. This PR could opportunistically clean them up.
| "@com_google_absl//absl/types:optional", |
Review SummaryGood improvement — adding error checking to the ROCm profiler's Key findings (details in inline comments):
🤖 Generated with Claude Code |
e30dd6f to
ad776c6
Compare
| rocprofiler_assign_callback_thread(buffer_, client_thread)); | ||
|
|
||
| int isValid = 0; | ||
| rocprofiler_context_is_valid(context_, &isValid); |
There was a problem hiding this comment.
rocprofiler_context_is_valid is the only rocprofiler-sdk API call in toolInit() that is not wrapped with RETURN_IF_ROCPROFILER_ERROR. Its return status is silently discarded. Should this be wrapped consistently with every other call in this function?
| rocprofiler_context_is_valid(context_, &isValid); | |
| RETURN_IF_ROCPROFILER_ERROR(rocprofiler_context_is_valid(context_, &isValid)); |
| @@ -431,29 +452,30 @@ int RocmTracer::toolInit(rocprofiler_client_finalize_t fini_func, | |||
| constexpr auto buffer_watermark_bytes = 40 * 4096; | |||
|
|
|||
| // Utility context to gather code‑object info | |||
| rocprofiler_create_context(&context_); | |||
| RETURN_IF_ROCPROFILER_ERROR(rocprofiler_create_context(&context_)); | |||
There was a problem hiding this comment.
If any RETURN_IF_ROCPROFILER_ERROR call fails partway through toolInit(), previously created and started contexts are not cleaned up. For example, if rocprofiler_create_context(&context_) at the bottom of this block fails, utility_context_ (created and started above) is leaked.
Consider adding cleanup logic (e.g., an absl::Cleanup or explicit rollback) so that on early return, already-created resources are torn down.
There was a problem hiding this comment.
This I feel too much engineering for the return we get.
|
|
||
| #include "absl/container/flat_hash_map.h" | ||
| #include "absl/container/node_hash_set.h" | ||
| #include "absl/status/status.h" |
There was a problem hiding this comment.
Good addition. Note that the existing #include "absl/types/optional.h" on line 23 is a prohibited API in XLA (the project convention requires std::optional). Since std::optional is already used on line 84, the absl/types/optional.h include and its BUILD dep (@com_google_absl//absl/types:optional) appear to be dead code. Consider removing them in this PR as a cleanup.
There was a problem hiding this comment.
unrelated to my change if the fix needs to go it should be seperate PR
Re-review Summary3 of the original findings have been addressed in the latest revision:
4 prior findings remain open (no change in latest revision):
Overall the error-checking and status propagation changes are solid. The remaining items are minor hardening opportunities. 🤖 Re-review by Claude |
ad776c6 to
5866850
Compare
Re-review SummaryRe-reviewed the latest revision. Previous issues around include ordering, leading-underscore macro variables, and The remaining open items from prior review are unchanged (no new code addressing them, no developer replies):
No new issues found in this revision. The error-checking and |
- Wrap all rocprofiler-sdk API calls in toolInit() with RETURN_IF_ROCPROFILER_ERROR macro that logs and returns -1 on failure - Change RocmTracer::Enable() to return absl::Status instead of void, propagating rocprofiler_start_context errors to callers - Use TF_RETURN_IF_ERROR in GpuTracer::DoStart() to surface Enable failures - Add required absl/status, absl/strings, and tsl/platform:errors deps
5866850 to
1e2b3ac
Compare
Submission Checklist