Skip to content

Add error checking and status propagation to ROCm profiler#671

Draft
magaonka-amd wants to merge 1 commit intoROCm:mainfrom
magaonka-amd:fix/rocm-profiler-error-status
Draft

Add error checking and status propagation to ROCm profiler#671
magaonka-amd wants to merge 1 commit intoROCm:mainfrom
magaonka-amd:fix/rocm-profiler-error-status

Conversation

@magaonka-amd
Copy link

  • Wrap all rocprofiler-sdk API calls in toolInit() with RETURN_IF_ROCPROFILER_ERROR macro that logs and returns -1 on failure
  • Change RocmTracer::Enable() to return absl::Status instead of void, propagating rocprofiler_start_context errors to callers
  • Use TF_RETURN_IF_ERROR in GpuTracer::DoStart() to surface Enable failures
  • Add required absl/status, absl/strings, and tsl/platform:errors deps
  • This adds no measurable overhead to profiler.

Submission Checklist

@magaonka-amd magaonka-amd force-pushed the fix/rocm-profiler-error-status branch from 943a9fc to e30dd6f Compare March 17, 2026 05:14
Comment on lines +47 to +48
#include <time.h>
#include <unistd.h>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: C system headers (<time.h>, <unistd.h>) should appear in the system-header block near the top of the file, before third-party headers, per Google C++ style include ordering. They're currently placed after the rocm/include/rocprofiler-sdk/ headers.

Also, <time.h> appears to be unused — no symbol from it (time_t, clock_gettime, timespec, etc.) is referenced anywhere in this file. Consider removing it entirely and moving <unistd.h> back to the system-header group.

Suggested change
#include <time.h>
#include <unistd.h>
#include <unistd.h>

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved - addressed in this revision.

Comment on lines +53 to +61

#define RETURN_IF_ROCPROFILER_ERROR(expr) \
do { \
rocprofiler_status_t _status = (expr); \
if (ABSL_PREDICT_FALSE(_status != ROCPROFILER_STATUS_SUCCESS)) { \
const char* _errstr = rocprofiler_get_status_string(_status); \
LOG(ERROR) << #expr << " failed: " << _errstr; \
return -1; \
} \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The macro-local identifiers _status and _errstr start with a leading underscore. While identifiers like _lowercase are only reserved at file/global scope (not inside a function block), this is still a common footgun that linters flag. Consider using non-underscore-prefixed names (e.g., rocprofiler_macro_status / rocprofiler_macro_errstr) to avoid any ambiguity.

Suggested change
#define RETURN_IF_ROCPROFILER_ERROR(expr) \
do { \
rocprofiler_status_t _status = (expr); \
if (ABSL_PREDICT_FALSE(_status != ROCPROFILER_STATUS_SUCCESS)) { \
const char* _errstr = rocprofiler_get_status_string(_status); \
LOG(ERROR) << #expr << " failed: " << _errstr; \
return -1; \
} \
#define RETURN_IF_ROCPROFILER_ERROR(expr) \
do { \
rocprofiler_status_t rocprofiler_macro_status = (expr); \
if (ABSL_PREDICT_FALSE(rocprofiler_macro_status != \
ROCPROFILER_STATUS_SUCCESS)) { \
const char* rocprofiler_macro_errstr = \
rocprofiler_get_status_string(rocprofiler_macro_status); \
LOG(ERROR) << #expr << " failed: " << rocprofiler_macro_errstr;\
return -1; \
} \
} while (false)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved - addressed in this revision.

Comment on lines +150 to 159

rocprofiler_status_t rc = rocprofiler_start_context(context_);
if (rc != ROCPROFILER_STATUS_SUCCESS) {
const char* errstr = rocprofiler_get_status_string(rc);
options_ = {};
collector_ = nullptr;
return absl::InternalError(
absl::StrCat("rocprofiler_start_context failed: ", errstr));
}
api_tracing_enabled_ = true;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: The rollback on rocprofiler_start_context failure resets options_ and collector_ but does not undo the annotation_map_.Clear() from line 147. This means a failed Enable() leaves the annotation map cleared — if there was useful state from a prior session it's now lost.

Consider moving annotation_map_.Clear() to after the success check, so state is only modified once the context is confirmed started:

  rocprofiler_status_t rc = rocprofiler_start_context(context_);
  if (rc != ROCPROFILER_STATUS_SUCCESS) {
    const char* errstr = rocprofiler_get_status_string(rc);
    options_ = {};
    collector_ = nullptr;
    return absl::InternalError(
        absl::StrCat("rocprofiler_start_context failed: ", errstr));
  }
  annotation_map_.Clear();

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved - addressed in this revision.

Comment on lines +498 to +499
RETURN_IF_ROCPROFILER_ERROR(
rocprofiler_assign_callback_thread(buffer_, client_thread));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Every other rocprofiler API call in toolInit() is now wrapped with RETURN_IF_ROCPROFILER_ERROR, but rocprofiler_context_is_valid() is left unwrapped. While the isValid == 0 fallthrough produces the same -1 return, wrapping it would be more consistent and would log the actual rocprofiler error string if the call itself fails.

Suggested change
RETURN_IF_ROCPROFILER_ERROR(
rocprofiler_assign_callback_thread(buffer_, client_thread));
RETURN_IF_ROCPROFILER_ERROR(
rocprofiler_context_is_valid(context_, &isValid));

@claude
Copy link

claude bot commented Mar 17, 2026

Review Summary

Good improvement to error handling in the ROCm profiler. Wrapping rocprofiler-sdk calls with RETURN_IF_ROCPROFILER_ERROR in toolInit() and changing Enable() to return absl::Status are both solid changes that will make failures much easier to diagnose.

A few items flagged inline:

  • <time.h> appears unused; both C system headers need to move up per include ordering
  • Macro-local identifiers use leading underscores (minor style nit)
  • annotation_map_.Clear() runs before the rocprofiler_start_context check — consider reordering for cleaner rollback
  • rocprofiler_context_is_valid() is the only unwrapped API call in toolInit()

See inline comments for details and suggested fixes.

Comment on lines +47 to +48
#include <time.h>
#include <unistd.h>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include order violation: <time.h> and <unistd.h> are placed after the rocprofiler-sdk third-party headers. Per XLA conventions (and .clang-format enforcement), C/C++ system headers must appear in the second group, immediately after the corresponding header — not interspersed with third-party includes.

Additionally, <time.h> appears to have been newly added — if it is actually needed, it should be <ctime> (C++ style). If it's not needed, it should be removed. <unistd.h> was previously included higher up and was moved here; it should be placed back in the system-headers group.

Suggested change
#include <time.h>
#include <unistd.h>
#include <ctime>
#include <unistd.h>

(These lines should be moved up to sit with the other C/C++ system headers around line 24, before the third-party absl/ includes.)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved - addressed in this revision.

Comment on lines +54 to +62
#define RETURN_IF_ROCPROFILER_ERROR(expr) \
do { \
rocprofiler_status_t _status = (expr); \
if (ABSL_PREDICT_FALSE(_status != ROCPROFILER_STATUS_SUCCESS)) { \
const char* _errstr = rocprofiler_get_status_string(_status); \
LOG(ERROR) << #expr << " failed: " << _errstr; \
return -1; \
} \
} while (false)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: macro variable names use leading-underscore identifiers. While _status and _errstr are fine inside a do-while block scope, they could shadow outer variables and the convention is slightly unusual for XLA code. Consider a more distinctive prefix:

Suggested change
#define RETURN_IF_ROCPROFILER_ERROR(expr) \
do { \
rocprofiler_status_t _status = (expr); \
if (ABSL_PREDICT_FALSE(_status != ROCPROFILER_STATUS_SUCCESS)) { \
const char* _errstr = rocprofiler_get_status_string(_status); \
LOG(ERROR) << #expr << " failed: " << _errstr; \
return -1; \
} \
} while (false)
#define RETURN_IF_ROCPROFILER_ERROR(expr) \
do { \
rocprofiler_status_t rocprof_status_ = (expr); \
if (ABSL_PREDICT_FALSE(rocprof_status_ != ROCPROFILER_STATUS_SUCCESS)) { \
const char* rocprof_errstr_ = rocprofiler_get_status_string(rocprof_status_); \
LOG(ERROR) << #expr << " failed: " << rocprof_errstr_; \
return -1; \
} \
} while (false)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved - addressed in this revision.

rocprofiler_assign_callback_thread(buffer_, client_thread));

int isValid = 0;
rocprofiler_context_is_valid(context_, &isValid);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unchecked rocprofiler_context_is_valid return value. After wrapping all other rocprofiler-sdk calls with RETURN_IF_ROCPROFILER_ERROR, this call is left unchecked. If it fails (returns non-success), the error is silently ignored and only the isValid output is inspected. For consistency with the rest of the PR's approach, this should also be wrapped:

Suggested change
rocprofiler_context_is_valid(context_, &isValid);
RETURN_IF_ROCPROFILER_ERROR(rocprofiler_context_is_valid(context_, &isValid));

Comment on lines 432 to +435
}

// Utility context to gather code‑object info
rocprofiler_create_context(&utility_context_);
RETURN_IF_ROCPROFILER_ERROR(rocprofiler_create_context(&utility_context_));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial initialization leak on mid-function failure. If a RETURN_IF_ROCPROFILER_ERROR triggers partway through toolInit(), previously created resources (contexts, buffers, threads) are not cleaned up. For example, if rocprofiler_create_buffer succeeds at line 457 but a subsequent rocprofiler_configure_buffer_tracing_service fails, both utility_context_ and context_ have been created but neither is destroyed before returning -1.

Consider adding a cleanup path (e.g., a helper that destroys created resources on failure) or using an RAII-style guard so the singleton isn't left in a half-initialized state.

}
options_ = options;
collector_ = collector;
annotation_map_.Clear();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: annotation_map_.Clear() not rolled back on failure. If rocprofiler_start_context fails at line 151, options_ and collector_ are correctly reset (lines 154-155), but the annotation map has already been cleared. For symmetry in the cleanup path, consider moving annotation_map_.Clear() after the success check, or adding it to the error cleanup block.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved - addressed in this revision.

"@com_google_absl//absl/strings",
"@com_google_absl//absl/strings:str_format",
"@com_google_absl//absl/synchronization",
"@com_google_absl//absl/types:optional",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing: absl/types:optional is a prohibited XLA API. Since the code at rocm_tracer.h:84 already uses std::optional, both this BUILD dep and the #include "absl/types/optional.h" in rocm_tracer.h:23 appear to be dead references. This PR could opportunistically clean them up.

Suggested change
"@com_google_absl//absl/types:optional",

@claude
Copy link

claude bot commented Mar 17, 2026

Review Summary

Good improvement — adding error checking to the ROCm profiler's toolInit() path and propagating absl::Status from Enable() are meaningful reliability wins.

Key findings (details in inline comments):

  • Include order: <time.h> / <unistd.h> placed after third-party headers; should be in the system-headers group (and <time.h><ctime>)
  • rocprofiler_context_is_valid is the one rocprofiler call not wrapped with the new error-checking macro
  • Partial initialization leak: if toolInit() fails mid-way, previously created contexts/buffers are not cleaned up
  • Minor: annotation_map_.Clear() not rolled back on Enable() failure
  • Pre-existing: absl/types/optional.h include and BUILD dep are prohibited XLA APIs and appear unused

🤖 Generated with Claude Code

@magaonka-amd magaonka-amd marked this pull request as draft March 17, 2026 05:24
@magaonka-amd magaonka-amd force-pushed the fix/rocm-profiler-error-status branch from e30dd6f to ad776c6 Compare March 17, 2026 05:40
rocprofiler_assign_callback_thread(buffer_, client_thread));

int isValid = 0;
rocprofiler_context_is_valid(context_, &isValid);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rocprofiler_context_is_valid is the only rocprofiler-sdk API call in toolInit() that is not wrapped with RETURN_IF_ROCPROFILER_ERROR. Its return status is silently discarded. Should this be wrapped consistently with every other call in this function?

Suggested change
rocprofiler_context_is_valid(context_, &isValid);
RETURN_IF_ROCPROFILER_ERROR(rocprofiler_context_is_valid(context_, &isValid));

Comment on lines 435 to +455
@@ -431,29 +452,30 @@ int RocmTracer::toolInit(rocprofiler_client_finalize_t fini_func,
constexpr auto buffer_watermark_bytes = 40 * 4096;

// Utility context to gather code‑object info
rocprofiler_create_context(&context_);
RETURN_IF_ROCPROFILER_ERROR(rocprofiler_create_context(&context_));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If any RETURN_IF_ROCPROFILER_ERROR call fails partway through toolInit(), previously created and started contexts are not cleaned up. For example, if rocprofiler_create_context(&context_) at the bottom of this block fails, utility_context_ (created and started above) is leaked.

Consider adding cleanup logic (e.g., an absl::Cleanup or explicit rollback) so that on early return, already-created resources are torn down.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This I feel too much engineering for the return we get.


#include "absl/container/flat_hash_map.h"
#include "absl/container/node_hash_set.h"
#include "absl/status/status.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good addition. Note that the existing #include "absl/types/optional.h" on line 23 is a prohibited API in XLA (the project convention requires std::optional). Since std::optional is already used on line 84, the absl/types/optional.h include and its BUILD dep (@com_google_absl//absl/types:optional) appear to be dead code. Consider removing them in this PR as a cleanup.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated to my change if the fix needs to go it should be seperate PR

@claude
Copy link

claude bot commented Mar 17, 2026

Re-review Summary

3 of the original findings have been addressed in the latest revision:

  • Include ordering (<time.h>/<unistd.h>) — fixed (headers removed)
  • Leading-underscore macro variables (_status/_errstr) — renamed to rocprofiler_macro_status/rocprofiler_macro_errstr
  • annotation_map_.Clear() rollback — fixed by reordering so rocprofiler_start_context is called before Clear()

4 prior findings remain open (no change in latest revision):

  • rocprofiler_context_is_valid() not wrapped with RETURN_IF_ROCPROFILER_ERROR in toolInit()
  • Partial initialization leak in toolInit() (no cleanup on mid-function failure)
  • Unused absl/types/optional.h include in rocm_tracer.h
  • Unused absl/types:optional BUILD dep

Overall the error-checking and status propagation changes are solid. The remaining items are minor hardening opportunities.

🤖 Re-review by Claude

@magaonka-amd magaonka-amd force-pushed the fix/rocm-profiler-error-status branch from ad776c6 to 5866850 Compare March 18, 2026 16:43
@claude
Copy link

claude bot commented Mar 18, 2026

Re-review Summary

Re-reviewed the latest revision. Previous issues around include ordering, leading-underscore macro variables, and annotation_map_.Clear() rollback have been fixed — nice cleanup.

The remaining open items from prior review are unchanged (no new code addressing them, no developer replies):

  • rocprofiler_context_is_valid() return value unchecked in toolInit()
  • Partial resource leak if toolInit() fails midway
  • absl/types/optional.h include and BUILD dep (prohibited API — std::optional is already used directly)

No new issues found in this revision. The error-checking and absl::Status propagation logic looks correct.

- Wrap all rocprofiler-sdk API calls in toolInit() with
  RETURN_IF_ROCPROFILER_ERROR macro that logs and returns -1 on failure
- Change RocmTracer::Enable() to return absl::Status instead of void,
  propagating rocprofiler_start_context errors to callers
- Use TF_RETURN_IF_ERROR in GpuTracer::DoStart() to surface Enable failures
- Add required absl/status, absl/strings, and tsl/platform:errors deps
@magaonka-amd magaonka-amd force-pushed the fix/rocm-profiler-error-status branch from 5866850 to 1e2b3ac Compare March 18, 2026 20:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant