Skip to content

Fix AnnotationMap memory leak in ROCm profiler across profiling sessions#656

Open
magaonka-amd wants to merge 1 commit intoROCm:rocm-jaxlib-v0.9.0from
magaonka-amd:fix/annotation-map-leak
Open

Fix AnnotationMap memory leak in ROCm profiler across profiling sessions#656
magaonka-amd wants to merge 1 commit intoROCm:rocm-jaxlib-v0.9.0from
magaonka-amd:fix/annotation-map-leak

Conversation

@magaonka-amd
Copy link

@magaonka-amd magaonka-amd commented Mar 4, 2026

The AnnotationMap in RocmTracer is a singleton member that accumulates annotation strings (correlation_id -> XLA op name mappings) during each profiling session. These entries were never cleared between sessions.

The existing line in Enable():
AnnotationMap(options_->max_annotation_strings);
was a no-op -- it constructed a temporary AnnotationMap on the stack that was immediately destroyed, never touching the singleton member.

This fix:

  • Removes the no-op temporary construction in Enable()
  • Adds AnnotationMap::Clear() to release all stored annotations
  • Calls Clear() in Enable()

Submission Checklist

@magaonka-amd magaonka-amd force-pushed the fix/annotation-map-leak branch from ae28715 to f1ffa69 Compare March 4, 2026 16:08
explicit AnnotationMap(uint64_t max_size) : max_size_(max_size) {}
void Add(uint32_t correlation_id, const std::string& annotation);
absl::string_view LookUp(uint32_t correlation_id);
void Clear();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Consider adding ABSL_LOCKS_EXCLUDED(map_.mutex) annotation to Clear() for compile-time lock-ordering checks. The existing Add() and LookUp() methods are also missing these annotations (pre-existing issue), but since this is new code, it's a good opportunity to start.

Suggested change
void Clear();
void Clear() ABSL_LOCKS_EXCLUDED(map_.mutex);

@magaonka-amd magaonka-amd force-pushed the fix/annotation-map-leak branch 2 times, most recently from e58a3ae to 43a778d Compare March 4, 2026 16:22
The AnnotationMap in RocmTracer is a singleton member that accumulates
annotation strings (correlation_id -> XLA op name mappings) during each
profiling session. These entries were never cleared between sessions,
causing memory to grow linearly with the number of start_trace/stop_trace
cycles.

The existing line in Enable():
  AnnotationMap(options_->max_annotation_strings);
was a no-op -- it constructed a temporary AnnotationMap on the stack that
was immediately destroyed, never touching the singleton member.

This fix:
- Adds AnnotationMap::Clear() to release all stored annotations
- Replaces the no-op temporary construction in Enable() with
  annotation_map_.Clear(), which safely clears stale data from the
  previous session before new data is collected
- Clear is placed in Enable() rather than Disable() because
  RocmTracerEvent::annotation fields are string_views into the
  AnnotationMap storage, and Export() reads them after Disable()
  returns

Measured impact: reduces profiler memory overhead from ~7MB to ~2.4MB
per profiling session in repeated-cycle tests.
@magaonka-amd magaonka-amd force-pushed the fix/annotation-map-leak branch from 43a778d to c82815c Compare March 4, 2026 16:25
@claude
Copy link

claude bot commented Mar 4, 2026

Re-review Summary: The AnnotationMap memory leak fix looks correct. Clear() is correctly placed in Enable() (not Disable()), ensuring annotations remain valid for deferred consumers. Thread safety is sound - Clear() properly acquires map_.mutex, and nested lock ordering with collector_mutex_ is consistent. New AnnotationMapClear test adequately covers the functionality. No new issues found; previous review comments updated accordingly.

@magaonka-amd magaonka-amd marked this pull request as ready for review March 4, 2026 17:58
@magaonka-amd
Copy link
Author

magaonka-amd commented Mar 4, 2026

Hi @cj401-amd since you have done significant contribution to profiler code in the past. I would love to hear feedback from you for this fix.

@cj401-amd
Copy link

Hi @cj401-amd since you have done significant contribution to profiler code in the past. I would love to hear feedback from you for this fix.

Hey @magaonka-amd, LGTM and thanks for the work.

@i-chaochen i-chaochen added open-upstream Tag when you want a copy of this PR to be opened on upstream rocm-jaxlib-v0.9.0 labels Mar 5, 2026
@i-chaochen
Copy link
Collaborator

Hi @magaonka-amd thanks for this clean up, could you just upstream it first? We can cherry-pick it back once it's merged to upstream.

Thanks!

@magaonka-amd
Copy link
Author

okay sure @i-chaochen , moving forward . pushing to upstream XLA and then cherry-pick it back to downstream is standard procedure for submitting changes XLA?

just making sure so that I follow right procedure in coming PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

open-upstream Tag when you want a copy of this PR to be opened on upstream rocm-jaxlib-v0.9.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants