Fix AnnotationMap memory leak in ROCm profiler across profiling sessions#656
Fix AnnotationMap memory leak in ROCm profiler across profiling sessions#656magaonka-amd wants to merge 1 commit intoROCm:rocm-jaxlib-v0.9.0from
Conversation
ae28715 to
f1ffa69
Compare
| explicit AnnotationMap(uint64_t max_size) : max_size_(max_size) {} | ||
| void Add(uint32_t correlation_id, const std::string& annotation); | ||
| absl::string_view LookUp(uint32_t correlation_id); | ||
| void Clear(); |
There was a problem hiding this comment.
Nit: Consider adding ABSL_LOCKS_EXCLUDED(map_.mutex) annotation to Clear() for compile-time lock-ordering checks. The existing Add() and LookUp() methods are also missing these annotations (pre-existing issue), but since this is new code, it's a good opportunity to start.
| void Clear(); | |
| void Clear() ABSL_LOCKS_EXCLUDED(map_.mutex); |
e58a3ae to
43a778d
Compare
The AnnotationMap in RocmTracer is a singleton member that accumulates annotation strings (correlation_id -> XLA op name mappings) during each profiling session. These entries were never cleared between sessions, causing memory to grow linearly with the number of start_trace/stop_trace cycles. The existing line in Enable(): AnnotationMap(options_->max_annotation_strings); was a no-op -- it constructed a temporary AnnotationMap on the stack that was immediately destroyed, never touching the singleton member. This fix: - Adds AnnotationMap::Clear() to release all stored annotations - Replaces the no-op temporary construction in Enable() with annotation_map_.Clear(), which safely clears stale data from the previous session before new data is collected - Clear is placed in Enable() rather than Disable() because RocmTracerEvent::annotation fields are string_views into the AnnotationMap storage, and Export() reads them after Disable() returns Measured impact: reduces profiler memory overhead from ~7MB to ~2.4MB per profiling session in repeated-cycle tests.
43a778d to
c82815c
Compare
|
Re-review Summary: The AnnotationMap memory leak fix looks correct. Clear() is correctly placed in Enable() (not Disable()), ensuring annotations remain valid for deferred consumers. Thread safety is sound - Clear() properly acquires map_.mutex, and nested lock ordering with collector_mutex_ is consistent. New AnnotationMapClear test adequately covers the functionality. No new issues found; previous review comments updated accordingly. |
|
Hi @cj401-amd since you have done significant contribution to profiler code in the past. I would love to hear feedback from you for this fix. |
Hey @magaonka-amd, LGTM and thanks for the work. |
|
Hi @magaonka-amd thanks for this clean up, could you just upstream it first? We can cherry-pick it back once it's merged to upstream. Thanks! |
|
okay sure @i-chaochen , moving forward . pushing to upstream XLA and then cherry-pick it back to downstream is standard procedure for submitting changes XLA? just making sure so that I follow right procedure in coming PRs |
The AnnotationMap in RocmTracer is a singleton member that accumulates annotation strings (correlation_id -> XLA op name mappings) during each profiling session. These entries were never cleared between sessions.
The existing line in Enable():
AnnotationMap(options_->max_annotation_strings);
was a no-op -- it constructed a temporary AnnotationMap on the stack that was immediately destroyed, never touching the singleton member.
This fix:
Submission Checklist