Conversation
pmcollins
left a comment
There was a problem hiding this comment.
Hi Siim! This is great. Added some comments.
| service_name, sampling_interval, self._filter_stacktraces, instrumentation_source="snapshot" | ||
| ) | ||
|
|
||
| def on_start(self, span: Span, parent_context: Optional[Context] = None) -> None: |
There was a problem hiding this comment.
on_start and on_end can be called by multiple threads so have to be careful about shared state. I think span_id_to_trace_id and active_traces members need to have access synchronized behind a lock in both methods.
There was a problem hiding this comment.
I noticed we don't actually need active_traces at all since span_id_to_trace_id already has all the info - updated the PR.
The lock should no longer be necessary as the operation now only modifies span_id_to_trace_id (which is atomic)
There was a problem hiding this comment.
Nice, but I'm wondering if we still should synchronize access to the self.span_id_to_trace_id field. An example race condition is (as threads take turns executing instructions) on_start starts the profiler, then on_end immediately ends it because it checked the length of the dictionary before on_start added a new traceid to it, but it stopped the profiler immediately after on_start started it. Also, we're iterating over the values of this dict in _filter_stacktraces but potentially modifying it while iteration is still occurring (as threads take turns executing instructions) which could cause an exception. The dict and its lock could be encapsulated in their own class and it could handle its own synchronization.
There was a problem hiding this comment.
I guess it can happen, but the profiler would linger on in this case.
The .values() iteration is an issue indeed, I've added a lock to span_id_to_trace_id now 🙏
| self.wakeup_event.set() | ||
| self.thread.join() | ||
|
|
||
| def pause_after(self, seconds: float): |
There was a problem hiding this comment.
I'm a little confused by the names. Should this method be called pause_for and should the pause_at field be called resume_at?
There was a problem hiding this comment.
The method is used to schedule a pause after seconds have passed, the pause itself is indefinite. I assume pause_for would have different semantics - pausing right now until seconds have passed.
There was a problem hiding this comment.
The reason I mention it is because it looks like the pause logic in the _loop method says: if a pause has been scheduled and it's in the future, then sleep now until the wakeup time (start time + pause_at), no?
There was a problem hiding this comment.
Good catch, the logic should've been an indefinite sleep until woken up again. I've hopefully simplified and fixed it now and added an additional comment.
|
|
||
| _timer = None | ||
| _pylogger = logging.getLogger(__name__) | ||
| _thread_states = {} |
There was a problem hiding this comment.
I think we'll need to synchronize access to this behind a lock because it's accessed by multiple threads. And since we'll need a lock, maybe it would better to encapsulate both in a class e.g. ThreadStates or ThreadStateManager.
There was a problem hiding this comment.
The writes and reads to _thread_states are atomic (both with GIL and the future / experimental PEP 703) - should be fine as is unless I'm missing something
There was a problem hiding this comment.
Yeah, this seems correct. But 3.13+ have an experimental mode to run without the GIL (and this is expected to be the default mode eventually). May be a good time to future proof this.
There was a problem hiding this comment.
According to PEP 703 it will have thread safety:
This PEP proposes using per-object locks to provide many of the same protections that the GIL provides.
For example, every list, dictionary, and set will have an associated lightweight lock.
All operations that modify the object must hold the object’s lock.
Most operations that read from the object should acquire the object’s lock as well;
the few read operations that can proceed without holding a lock are described below.
New environment variables:
SPLUNK_SNAPSHOT_PROFILER_ENABLEDdefault:falseSPLUNK_SNAPSHOT_SELECTION_PROBABILITYdefault:0.01SPLUNK_SNAPSHOT_SAMPLING_INTERVALdefault:10(milliseconds)The profiling is triggered in
CallgraphsSpanProcessorwhich filters the stacktraces based on active traces (i.e. it only keeps stacktraces matching a trace that has been selected for profiling).Changes to profiling:
profiling.instrumentation.sourceattribute.time.monotonicinstead oftime.timefor consistent sleeping.start. This is used for callgraphs when no traces have been selected for a minute.