Reduce the impact of metrics on throughput #304

Closed

fpacifici wants to merge 2 commits into main from fpacifici/metrics_perf

Conversation

@fpacifici (Collaborator) commented Apr 26, 2026

Recording metrics at every step still has a major impact on throughput.

In this profile we ran the protobuf parsing of a message + metrics + message instantiation 1M times:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    1.957    0.000   12.888    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/adapters/arroyo/steps_chain.py:55(fake_transform)
  1000000    0.867    0.000    5.923    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/adapters/arroyo/steps_chain.py:34(output_metrics)
  1199970    0.442    0.000    2.962    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:380(increment)
  1000000    0.510    0.000    2.678    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/pipeline/msg_codecs.py:45(msg_parser)
  2199970    0.779    0.000    2.567    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:302(__add_to_buffer)
  1000000    0.363    0.000    2.430    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:397(timing)
  1199970    0.450    0.000    2.192    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:326(increment)
  1000000    0.374    0.000    1.800    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:339(timing)
  2199970    1.190    0.000    1.787    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:319(__hash_tags)
  1000000    0.325    0.000    1.294    0.000 /Users/filippopacifici/code/streams/sentry_streams/.venv/lib/python3.11/site-packages/sentry_kafka_schemas/codecs/protobuf.py:34(decode)
  1000000    0.970    0.000    0.970    0.000 {method 'ParseFromString' of 'google._upb._message.Message' objects}
  1000000    0.279    0.000    0.853    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/adapters/arroyo/steps_chain.py:20(input_metrics)
  1000000    0.503    0.000    0.805    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/pipeline/msg_codecs.py:32(_get_codec_from_msg)
  2000000    0.493    0.000    0.744    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:488(get_size)
  2199970    0.454    0.000    0.601    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:343(__throttled_flush)
  2199970    0.450    0.000    0.595    0.000 /opt/homebrew/Cellar/python@3.11/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/enum.py:193(__get__)
  2199970    0.370    0.000    0.370    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:320(<listcomp>)
  3000555    0.271    0.000    0.271    0.000 {built-in method builtins.isinstance}
  1000000    0.177    0.000    0.249    0.000 {built-in method builtins.hasattr}
  3299957    0.222    0.000    0.222    0.000 {built-in method time.time}
  3000000    0.209    0.000    0.209    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/pipeline/message.py:201(payload)
  1000000    0.173    0.000    0.173    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/pipeline/message.py:105(__init__)
  2000000    0.166    0.000    0.166    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/pipeline/message.py:213(schema)
  1000000    0.161    0.000    0.163    0.000 /Users/filippopacifici/code/streams/sentry_streams/.venv/lib/python3.11/site-packages/sentry_kafka_schemas/sentry_kafka_schemas.py:147(get_codec)
  2199970    0.145    0.000    0.145    0.000 /opt/homebrew/Cellar/python@3.11/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/enum.py:12
```

Roughly 3/4 of the time is spent recording metrics in the buffered backend.
This does not include sending the metrics, just the buffering logic.
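
For reference, numbers like the table above can be reproduced with `cProfile`; a minimal sketch, where the body of `fake_transform` is a hypothetical stand-in for the parse + metrics steps exercised in this profile:

```python
import cProfile
import pstats


def fake_transform() -> None:
    # Stand-in for: protobuf parse + input/output metrics + message instantiation.
    pass


profiler = cProfile.Profile()
profiler.enable()
for _ in range(1_000_000):
    fake_transform()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumtime").print_stats(25)
```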

This is not unexpected:

  • we record metrics at each step, and step logic can be minimal, sometimes cheaper than recording the metrics themselves
  • we record 5 metrics per step

In this PR:

  • Reduced the number of metrics we record at each step. We can reintroduce some of them later if needed.
  • Introduced sampling so that the metrics logic runs far less often (see the sketch below).
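
As a rough illustration of the sampling approach (a minimal sketch, not the exact code in this PR; the `sample_rate` value and the metric name are assumptions):

```python
import random

sample_rate = 0.1  # assumed default: record metrics for ~10% of messages


def input_metrics(metrics, name: str) -> None:
    # Skip the entire buffering path most of the time.
    if random.random() > sample_rate:
        return
    # Scale the value so aggregate counts stay unbiased: each sampled
    # event stands in for 1 / sample_rate real events.
    metrics.increment("input_messages", value=1 / sample_rate, tags={"step": name})
```

With N messages, the expected recorded total is still N, while the buffering logic runs only about N * sample_rate times.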

@fpacifici fpacifici requested a review from a team as a code owner April 26, 2026 05:02

```python
metrics.increment(Metric.OUTPUT_MESSAGES, tags=tags)
if message_size is not None:
    metrics.increment(Metric.OUTPUT_BYTES, tags=tags, value=message_size)
```
Error counter not scaled to compensate for sampling

High Severity

The Metric.ERRORS counter in output_metrics is not scaled by 1 / sample_rate, unlike Metric.INPUT_MESSAGES in input_metrics which correctly uses value=1 / sample_rate. Since output_metrics returns early based on random.random() > sample_rate, errors are both sampled down and recorded as a value of 1, leading to systematic under-reporting by a factor of 1/sample_rate (e.g., 10x with the default 0.1 rate). This makes error counts unreliable in monitoring dashboards.
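
A quick way to see the undercount (a standalone sketch with hypothetical numbers, not the PR's code):

```python
import random

random.seed(0)
sample_rate = 0.1
true_errors = 10_000

unscaled = 0.0
scaled = 0.0
for _ in range(true_errors):
    if random.random() > sample_rate:
        continue  # sampled out, nothing recorded
    unscaled += 1               # current behavior for Metric.ERRORS
    scaled += 1 / sample_rate   # behavior of Metric.INPUT_MESSAGES

print(unscaled, scaled)  # roughly 1000 vs. roughly 10000
```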


Reviewed by Cursor Bugbot for commit 6e7d58f.

```python
metrics = get_metrics()
tags = {"step": name}
if error:
    tags["error"] = error
```

Bug: The ERRORS metric is not scaled by 1/sample_rate to account for sampling, unlike INPUT_MESSAGES, leading to undercounted errors.
Severity: HIGH

Suggested Fix

In the output_metrics block, scale the ERRORS metric by 1 / sample_rate to match the handling of INPUT_MESSAGES. Change metrics.increment(Metric.ERRORS, tags=tags) to metrics.increment(Metric.ERRORS, value=1 / sample_rate, tags=tags).

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: sentry_streams/sentry_streams/adapters/arroyo/rust_arroyo.py#L141

Potential issue: The `ERRORS` metric in `output_metrics` is not scaled by
`1/sample_rate`, while the `INPUT_MESSAGES` metric is. This inconsistency leads to a
systematic undercounting of errors whenever sampling is active. The calculated error
rate (`ERRORS`/`INPUT_MESSAGES`) will be lower than the true rate by a factor of
`sample_rate`. This can cause monitoring dashboards and alerts to miss real production
issues, making the system appear more reliable than it actually is. For instance, with a
`sample_rate` of 0.1, only 10% of the actual errors will be reported.


@cursor (Bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit f2ca949.

```python
if _metrics is None:
    _metrics = Metrics(_dummy_metrics_backend)

return _metrics
```

Caching dummy metrics breaks later configuration call

Medium Severity

get_metrics() now mutates the global _metrics by caching a dummy-backed Metrics instance when none is configured. Previously it returned a fresh throwaway object without side effects. If get_metrics() is ever called before configure_metrics(), the cached dummy will cause configure_metrics(force=False) to hit the assert _metrics is None guard and crash.
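
The ordering hazard, as a minimal self-contained sketch (the globals and classes here are simplified stand-ins for the real module):

```python
_metrics = None


def get_metrics():
    global _metrics
    if _metrics is None:
        _metrics = object()  # stand-in for Metrics(_dummy_metrics_backend)
    return _metrics


def configure_metrics(backend, force=False):
    global _metrics
    if not force:
        assert _metrics is None  # crashes if the dummy was already cached
    _metrics = backend


get_metrics()                # caches a dummy-backed instance as a side effect
configure_metrics(object())  # AssertionError
```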


Reviewed by Cursor Bugbot for commit f2ca949.

Comment on lines 474 to +479

```diff
 def get_metrics() -> Metrics:
-    if _metrics_backend is None:
-        return Metrics(_dummy_metrics_backend)
-    return Metrics(_metrics_backend)
+    global _metrics
+    if _metrics is None:
+        _metrics = Metrics(_dummy_metrics_backend)
+
+    return _metrics
```

Bug: The test fixture reset_metrics_backend was not updated after _metrics_backend was renamed to _metrics, causing state to leak between tests.
Severity: MEDIUM

Suggested Fix

In the reset_metrics_backend fixture located in test_metrics.py, update the line that resets the global variable to target the new name. Change metrics_module._metrics_backend = None to metrics_module._metrics = None.
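
As a sketch, assuming the fixture has roughly this shape in test_metrics.py (the exact body is not shown in this PR):

```python
import pytest

from sentry_streams.metrics import metrics as metrics_module


@pytest.fixture(autouse=True)
def reset_metrics_backend():
    yield
    # Reset the renamed global; the old line targeted _metrics_backend,
    # which no longer exists after this PR.
    metrics_module._metrics = None
```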

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: sentry_streams/sentry_streams/metrics/metrics.py#L474-L479

Potential issue: The global variable `_metrics_backend` was renamed to `_metrics`, but
the test fixture `reset_metrics_backend` was not updated to reflect this change. The
fixture continues to reset the old, non-existent `_metrics_backend` variable, causing
the state of the new `_metrics` variable to leak between tests. This leads to test
failures, particularly in functions like `configure_metrics()` which assert that
`_metrics` is `None` before initialization. The broken test fixture compromises the
integrity of the test suite, as tests will run with stale data from previous runs.

@fpacifici (Collaborator, Author)

Replaced by #305

@fpacifici fpacifici closed this Apr 26, 2026