add support for callgraph profiling #686

Open

seemk wants to merge 9 commits into main from callgraphs

Conversation


seemk commented on Jan 23, 2026

New environment variables:

  • SPLUNK_SNAPSHOT_PROFILER_ENABLED default: false
  • SPLUNK_SNAPSHOT_SELECTION_PROBABILITY default: 0.01
  • SPLUNK_SNAPSHOT_SAMPLING_INTERVAL default: 10 (milliseconds)

Profiling is triggered by CallgraphsSpanProcessor, which filters the stacktraces based on active traces (i.e. it only keeps stacktraces matching a trace that has been selected for profiling).
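As a rough illustration of per-trace selection driven by SPLUNK_SNAPSHOT_SELECTION_PROBABILITY, a processor might decide once per trace whether to keep its stacktraces. This is only a sketch; the function and variable names below are illustrative, not the PR's actual implementation:

```python
import os
import random

# Default of 0.01 matches the documented env var default.
SELECTION_PROBABILITY = float(
    os.environ.get("SPLUNK_SNAPSHOT_SELECTION_PROBABILITY", "0.01")
)

def select_for_profiling(
    trace_id: int, selected: set, probability: float = SELECTION_PROBABILITY
) -> bool:
    """Decide (once per trace) whether this trace's stacktraces are kept."""
    if trace_id in selected:
        # The decision is sticky: a selected trace stays selected.
        return True
    if random.random() < probability:
        selected.add(trace_id)
        return True
    return False
```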

Changes to profiling:

  • Add the missing profiling.instrumentation.source attribute.
  • Use time.monotonic instead of time.time for consistent sleeping.
  • Add the ability to pause profiling, which hibernates the profiler thread; it can be resumed via start. This is used for callgraphs when no traces have been selected for a minute.
  • Add support for multiple profiler instances. Context attach/detach wrapping is still done only once, meaning the profiler instances share the thread state mapping.

@seemk seemk requested review from a team as code owners January 23, 2026 12:55
pmcollins (Contributor) left a comment


Hi Siim! This is great. Added some comments.

service_name, sampling_interval, self._filter_stacktraces, instrumentation_source="snapshot"
)

def on_start(self, span: Span, parent_context: Optional[Context] = None) -> None:
pmcollins (Contributor):

on_start and on_end can be called by multiple threads, so we have to be careful about shared state. I think the span_id_to_trace_id and active_traces members need their access synchronized behind a lock in both methods.

seemk (Author):

I noticed we don't actually need active_traces at all since span_id_to_trace_id already has all the info - updated the PR.

The lock should no longer be necessary, as the operation now only modifies span_id_to_trace_id (which is atomic).

pmcollins (Contributor):

Nice, but I'm wondering if we should still synchronize access to the self.span_id_to_trace_id field. An example race condition (as threads take turns executing instructions): on_start starts the profiler, then on_end immediately stops it, because on_end checked the length of the dictionary before on_start added a new trace id to it, but stopped the profiler right after on_start started it. Also, we're iterating over the values of this dict in _filter_stacktraces while potentially modifying it from other threads, which could raise an exception. The dict and its lock could be encapsulated in their own class that handles its own synchronization.

seemk (Author):

I guess it can happen, but the profiler would linger on in this case.

The .values() iteration is an issue indeed, I've added a lock to span_id_to_trace_id now 🙏
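A minimal sketch of the lock-encapsulated mapping pmcollins suggested; the class and method names here are illustrative, only span_id_to_trace_id comes from the PR:

```python
import threading

class ActiveTraces:
    """Synchronizes all access to the span_id -> trace_id mapping."""

    def __init__(self):
        self._lock = threading.Lock()
        self._span_id_to_trace_id = {}

    def add(self, span_id: int, trace_id: int) -> None:
        with self._lock:
            self._span_id_to_trace_id[span_id] = trace_id

    def remove(self, span_id: int) -> None:
        with self._lock:
            self._span_id_to_trace_id.pop(span_id, None)

    def trace_ids(self) -> set:
        # Copy under the lock so callers (e.g. a stacktrace filter) can
        # iterate safely while on_start/on_end keep mutating the mapping.
        with self._lock:
            return set(self._span_id_to_trace_id.values())

    def empty(self) -> bool:
        with self._lock:
            return not self._span_id_to_trace_id
```

Because trace_ids() returns a snapshot, iteration never races with concurrent inserts or removals.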

self.wakeup_event.set()
self.thread.join()

def pause_after(self, seconds: float):
pmcollins (Contributor):

I'm a little confused by the names. Should this method be called pause_for and should the pause_at field be called resume_at?

seemk (Author):

The method is used to schedule a pause after seconds have passed, the pause itself is indefinite. I assume pause_for would have different semantics - pausing right now until seconds have passed.

pmcollins (Contributor):

The reason I mention it is because it looks like the pause logic in the _loop method says: if a pause has been scheduled and it's in the future, then sleep now until the wakeup time (start time + pause_at), no?

seemk (Author):

Good catch, the logic should've been an indefinite sleep until woken up again. I've hopefully simplified and fixed it now and added an additional comment.
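A minimal sketch of the intended semantics after the fix, using the wakeup_event and pause_after names from the diff (everything else is hypothetical): pause_after only schedules a pause, and once the deadline is reached the loop blocks indefinitely until woken.

```python
import threading
import time

class PausableProfiler:
    """Sketch: pause_after schedules a pause; the pause itself is
    an indefinite wait on wakeup_event until resumed."""

    def __init__(self):
        self.wakeup_event = threading.Event()
        self.pause_at = None  # monotonic deadline; None = no pause scheduled
        self.running = True

    def pause_after(self, seconds: float) -> None:
        # Schedule a pause `seconds` from now, e.g. one minute after
        # the last trace was selected.
        self.pause_at = time.monotonic() + seconds

    def resume(self) -> None:
        self.pause_at = None
        self.wakeup_event.set()

    def _tick(self) -> bool:
        """One profiler loop iteration (sampling elided)."""
        if self.pause_at is not None and time.monotonic() >= self.pause_at:
            # Hibernate indefinitely until resume() wakes the thread.
            self.wakeup_event.clear()
            self.wakeup_event.wait()
        return self.running
```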


_timer = None
_pylogger = logging.getLogger(__name__)
_thread_states = {}
pmcollins (Contributor):
I think we'll need to synchronize access to this behind a lock because it's accessed by multiple threads. And since we'll need a lock, maybe it would be better to encapsulate both in a class, e.g. ThreadStates or ThreadStateManager.

seemk (Author):

The writes and reads to _thread_states are atomic (both with the GIL and under the future/experimental PEP 703), so it should be fine as is, unless I'm missing something.

pmcollins (Contributor):

Yeah, this seems correct. But 3.13+ has an experimental mode that runs without the GIL (and this is expected to become the default mode eventually). It may be a good time to future-proof this.

seemk (Author):

According to PEP 703 it will have thread safety:

This PEP proposes using per-object locks to provide many of the same protections that the GIL provides.
For example, every list, dictionary, and set will have an associated lightweight lock.
All operations that modify the object must hold the object’s lock.
Most operations that read from the object should acquire the object’s lock as well;
the few read operations that can proceed without holding a lock are described below.
