add support for callgraph profiling #686

Open

seemk wants to merge 9 commits into main from callgraphs

Conversation


seemk commented on Jan 23, 2026

New environment variables:

  • SPLUNK_SNAPSHOT_PROFILER_ENABLED default: false
  • SPLUNK_SNAPSHOT_SELECTION_PROBABILITY default: 0.01
  • SPLUNK_SNAPSHOT_SAMPLING_INTERVAL default: 10 (milliseconds)

Profiling is triggered by CallgraphsSpanProcessor, which filters the stacktraces based on active traces (i.e. it only keeps stacktraces matching a trace that has been selected for profiling).
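As a rough illustration of per-trace selection driven by SPLUNK_SNAPSHOT_SELECTION_PROBABILITY, a processor might decide once per trace whether to keep its stacktraces. This is only a sketch; the function and variable names below are illustrative, not the PR's actual implementation:

```python
import os
import random

# Default of 0.01 matches the documented env var default.
SELECTION_PROBABILITY = float(
    os.environ.get("SPLUNK_SNAPSHOT_SELECTION_PROBABILITY", "0.01")
)

def select_for_profiling(
    trace_id: int, selected: set, probability: float = SELECTION_PROBABILITY
) -> bool:
    """Decide (once per trace) whether this trace's stacktraces are kept."""
    if trace_id in selected:
        # The decision is sticky: a selected trace stays selected.
        return True
    if random.random() < probability:
        selected.add(trace_id)
        return True
    return False
```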

Changes to profiling:

  • Add the missing profiling.instrumentation.source attribute.
  • Use time.monotonic instead of time.time for consistent sleeping.
  • Add the ability to pause profiling, which hibernates the profiler thread; it can be resumed via start. This is used for callgraphs when no traces have been selected for a minute.
  • Add support for multiple profiler instances. Context attach/detach wrapping is still done only once, meaning the profiler instances share the thread state mapping.

@seemk seemk requested review from a team as code owners January 23, 2026 12:55
pmcollins (Contributor) left a comment


Hi Siim! This is great. Added some comments.

service_name, sampling_interval, self._filter_stacktraces, instrumentation_source="snapshot"
)

def on_start(self, span: Span, parent_context: Optional[Context] = None) -> None:
pmcollins (Contributor):

on_start and on_end can be called by multiple threads, so we have to be careful about shared state. I think the span_id_to_trace_id and active_traces members need their access synchronized behind a lock in both methods.

seemk (Author):

I noticed we don't actually need active_traces at all since span_id_to_trace_id already has all the info - updated the PR.

The lock should no longer be necessary, as the operation now only modifies span_id_to_trace_id (which is atomic).

pmcollins (Contributor):

Nice, but I'm wondering if we should still synchronize access to the self.span_id_to_trace_id field. An example race condition (as threads take turns executing instructions): on_start starts the profiler, then on_end immediately stops it, because on_end checked the length of the dictionary before on_start added a new trace id to it, but stopped the profiler right after on_start started it. Also, we're iterating over the values of this dict in _filter_stacktraces while potentially modifying it from other threads, which could raise an exception. The dict and its lock could be encapsulated in their own class that handles its own synchronization.

seemk (Author):

I guess it can happen, but the profiler would linger on in this case.

The .values() iteration is an issue indeed, I've added a lock to span_id_to_trace_id now 🙏
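A minimal sketch of the lock-encapsulated mapping pmcollins suggested; the class and method names here are illustrative, only span_id_to_trace_id comes from the PR:

```python
import threading

class ActiveTraces:
    """Synchronizes all access to the span_id -> trace_id mapping."""

    def __init__(self):
        self._lock = threading.Lock()
        self._span_id_to_trace_id = {}

    def add(self, span_id: int, trace_id: int) -> None:
        with self._lock:
            self._span_id_to_trace_id[span_id] = trace_id

    def remove(self, span_id: int) -> None:
        with self._lock:
            self._span_id_to_trace_id.pop(span_id, None)

    def trace_ids(self) -> set:
        # Copy under the lock so callers (e.g. a stacktrace filter) can
        # iterate safely while on_start/on_end keep mutating the mapping.
        with self._lock:
            return set(self._span_id_to_trace_id.values())

    def empty(self) -> bool:
        with self._lock:
            return not self._span_id_to_trace_id
```

Because trace_ids() returns a snapshot, iteration never races with concurrent inserts or removals.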

self.wakeup_event.set()
self.thread.join()

def pause_after(self, seconds: float):
pmcollins (Contributor):

I'm a little confused by the names. Should this method be called pause_for and should the pause_at field be called resume_at?

seemk (Author):

The method is used to schedule a pause after seconds have passed, the pause itself is indefinite. I assume pause_for would have different semantics - pausing right now until seconds have passed.

pmcollins (Contributor):

The reason I mention it is because it looks like the pause logic in the _loop method says: if a pause has been scheduled and it's in the future, then sleep now until the wakeup time (start time + pause_at), no?

seemk (Author):

Good catch, the logic should've been an indefinite sleep until woken up again. I've hopefully simplified and fixed it now and added an additional comment.
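A minimal sketch of the intended semantics after the fix, using the wakeup_event and pause_after names from the diff (everything else is hypothetical): pause_after only schedules a pause, and once the deadline is reached the loop blocks indefinitely until woken.

```python
import threading
import time

class PausableProfiler:
    """Sketch: pause_after schedules a pause; the pause itself is
    an indefinite wait on wakeup_event until resumed."""

    def __init__(self):
        self.wakeup_event = threading.Event()
        self.pause_at = None  # monotonic deadline; None = no pause scheduled
        self.running = True

    def pause_after(self, seconds: float) -> None:
        # Schedule a pause `seconds` from now, e.g. one minute after
        # the last trace was selected.
        self.pause_at = time.monotonic() + seconds

    def resume(self) -> None:
        self.pause_at = None
        self.wakeup_event.set()

    def _tick(self) -> bool:
        """One profiler loop iteration (sampling elided)."""
        if self.pause_at is not None and time.monotonic() >= self.pause_at:
            # Hibernate indefinitely until resume() wakes the thread.
            self.wakeup_event.clear()
            self.wakeup_event.wait()
        return self.running
```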


_timer = None
_pylogger = logging.getLogger(__name__)
_thread_states = {}
pmcollins (Contributor):
I think we'll need to synchronize access to this behind a lock because it's accessed by multiple threads. And since we'll need a lock, maybe it would be better to encapsulate both in a class, e.g. ThreadStates or ThreadStateManager.

seemk (Author):

The writes and reads to _thread_states are atomic (both with the GIL and under the future/experimental PEP 703), so it should be fine as is, unless I'm missing something.

pmcollins (Contributor):

Yeah, this seems correct. But 3.13+ has an experimental mode that runs without the GIL (and this is expected to become the default mode eventually). It may be a good time to future-proof this.

seemk (Author):

According to PEP 703 it will have thread safety:

This PEP proposes using per-object locks to provide many of the same protections that the GIL provides.
For example, every list, dictionary, and set will have an associated lightweight lock.
All operations that modify the object must hold the object’s lock.
Most operations that read from the object should acquire the object’s lock as well;
the few read operations that can proceed without holding a lock are described below.
