Reduce the impact of metrics on throughput #304

Closed

fpacifici wants to merge 2 commits into main from fpacifici/metrics_perf

Conversation

@fpacifici (Collaborator) commented Apr 26, 2026

Recording metrics at every step still has a major impact on throughput.

In this profile we ran the protobuf parsing of a message + metrics + message instantiation 1M times:

```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1000000    1.957    0.000   12.888    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/adapters/arroyo/steps_chain.py:55(fake_transform)
  1000000    0.867    0.000    5.923    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/adapters/arroyo/steps_chain.py:34(output_metrics)
  1199970    0.442    0.000    2.962    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:380(increment)
  1000000    0.510    0.000    2.678    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/pipeline/msg_codecs.py:45(msg_parser)
  2199970    0.779    0.000    2.567    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:302(__add_to_buffer)
  1000000    0.363    0.000    2.430    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:397(timing)
  1199970    0.450    0.000    2.192    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:326(increment)
  1000000    0.374    0.000    1.800    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:339(timing)
  2199970    1.190    0.000    1.787    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:319(__hash_tags)
  1000000    0.325    0.000    1.294    0.000 /Users/filippopacifici/code/streams/sentry_streams/.venv/lib/python3.11/site-packages/sentry_kafka_schemas/codecs/protobuf.py:34(decode)
  1000000    0.970    0.000    0.970    0.000 {method 'ParseFromString' of 'google._upb._message.Message' objects}
  1000000    0.279    0.000    0.853    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/adapters/arroyo/steps_chain.py:20(input_metrics)
  1000000    0.503    0.000    0.805    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/pipeline/msg_codecs.py:32(_get_codec_from_msg)
  2000000    0.493    0.000    0.744    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:488(get_size)
  2199970    0.454    0.000    0.601    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:343(__throttled_flush)
  2199970    0.450    0.000    0.595    0.000 /opt/homebrew/Cellar/python@3.11/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/enum.py:193(__get__)
  2199970    0.370    0.000    0.370    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/metrics/metrics.py:320(<listcomp>)
  3000555    0.271    0.000    0.271    0.000 {built-in method builtins.isinstance}
  1000000    0.177    0.000    0.249    0.000 {built-in method builtins.hasattr}
  3299957    0.222    0.000    0.222    0.000 {built-in method time.time}
  3000000    0.209    0.000    0.209    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/pipeline/message.py:201(payload)
  1000000    0.173    0.000    0.173    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/pipeline/message.py:105(__init__)
  2000000    0.166    0.000    0.166    0.000 /Users/filippopacifici/code/streams/sentry_streams/sentry_streams/pipeline/message.py:213(schema)
  1000000    0.161    0.000    0.163    0.000 /Users/filippopacifici/code/streams/sentry_streams/.venv/lib/python3.11/site-packages/sentry_kafka_schemas/sentry_kafka_schemas.py:147(get_codec)
  2199970    0.145    0.000    0.145    0.000 /opt/homebrew/Cellar/python@3.11/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/enum.py:12
```

Roughly 3/4 of the time is spent recording metrics in the buffered backend.
This does not include sending the metrics, just the buffering logic.
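
For reference, numbers like the table above can be reproduced with `cProfile`; a minimal sketch, where the body of `fake_transform` is a hypothetical stand-in for the parse + metrics steps exercised in this profile:

```python
import cProfile
import pstats


def fake_transform() -> None:
    # Stand-in for: protobuf parse + input/output metrics + message instantiation.
    pass


profiler = cProfile.Profile()
profiler.enable()
for _ in range(1_000_000):
    fake_transform()
profiler.disable()
pstats.Stats(profiler).sort_stats("cumtime").print_stats(25)
```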

This is not unexpected:

  • we record metrics at each step, and step logic can be minimal, sometimes cheaper than recording the metrics themselves
  • we record 5 metrics per step

In this PR:

  • Reduced the number of metrics we record at each step. We can reintroduce some of them later if needed.
  • Introduced sampling so that the metrics logic runs far less often (see the sketch below).
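
As a rough illustration of the sampling approach (a minimal sketch, not the exact code in this PR; the `sample_rate` value and the metric name are assumptions):

```python
import random

sample_rate = 0.1  # assumed default: record metrics for ~10% of messages


def input_metrics(metrics, name: str) -> None:
    # Skip the entire buffering path most of the time.
    if random.random() > sample_rate:
        return
    # Scale the value so aggregate counts stay unbiased: each sampled
    # event stands in for 1 / sample_rate real events.
    metrics.increment("input_messages", value=1 / sample_rate, tags={"step": name})
```

With N messages, the expected recorded total is still N, while the buffering logic runs only about N * sample_rate times.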

@fpacifici fpacifici requested a review from a team as a code owner April 26, 2026 05:02

```python
metrics.increment(Metric.OUTPUT_MESSAGES, tags=tags)
if message_size is not None:
    metrics.increment(Metric.OUTPUT_BYTES, tags=tags, value=message_size)
```
Error counter not scaled to compensate for sampling

High Severity

The Metric.ERRORS counter in output_metrics is not scaled by 1 / sample_rate, unlike Metric.INPUT_MESSAGES in input_metrics which correctly uses value=1 / sample_rate. Since output_metrics returns early based on random.random() > sample_rate, errors are both sampled down and recorded as a value of 1, leading to systematic under-reporting by a factor of 1/sample_rate (e.g., 10x with the default 0.1 rate). This makes error counts unreliable in monitoring dashboards.
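
A quick way to see the undercount (a standalone sketch with hypothetical numbers, not the PR's code):

```python
import random

random.seed(0)
sample_rate = 0.1
true_errors = 10_000

unscaled = 0.0
scaled = 0.0
for _ in range(true_errors):
    if random.random() > sample_rate:
        continue  # sampled out, nothing recorded
    unscaled += 1               # current behavior for Metric.ERRORS
    scaled += 1 / sample_rate   # behavior of Metric.INPUT_MESSAGES

print(unscaled, scaled)  # roughly 1000 vs. roughly 10000
```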


Reviewed by Cursor Bugbot for commit 6e7d58f.

```python
metrics = get_metrics()
tags = {"step": name}
if error:
    tags["error"] = error
```

Bug: The ERRORS metric is not scaled by 1/sample_rate to account for sampling, unlike INPUT_MESSAGES, leading to undercounted errors.
Severity: HIGH

Suggested Fix

In the output_metrics block, scale the ERRORS metric by 1 / sample_rate to match the handling of INPUT_MESSAGES. Change metrics.increment(Metric.ERRORS, tags=tags) to metrics.increment(Metric.ERRORS, value=1 / sample_rate, tags=tags).

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: sentry_streams/sentry_streams/adapters/arroyo/rust_arroyo.py#L141

Potential issue: The `ERRORS` metric in `output_metrics` is not scaled by
`1/sample_rate`, while the `INPUT_MESSAGES` metric is. This inconsistency leads to a
systematic undercounting of errors whenever sampling is active. The calculated error
rate (`ERRORS`/`INPUT_MESSAGES`) will be lower than the true rate by a factor of
`sample_rate`. This can cause monitoring dashboards and alerts to miss real production
issues, making the system appear more reliable than it actually is. For instance, with a
`sample_rate` of 0.1, only 10% of the actual errors will be reported.


@cursor (Bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit f2ca949.

```python
if _metrics is None:
    _metrics = Metrics(_dummy_metrics_backend)

return _metrics
```

Caching dummy metrics breaks later configuration call

Medium Severity

get_metrics() now mutates the global _metrics by caching a dummy-backed Metrics instance when none is configured. Previously it returned a fresh throwaway object without side effects. If get_metrics() is ever called before configure_metrics(), the cached dummy will cause configure_metrics(force=False) to hit the assert _metrics is None guard and crash.
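
The ordering hazard, as a minimal self-contained sketch (the globals and classes here are simplified stand-ins for the real module):

```python
_metrics = None


def get_metrics():
    global _metrics
    if _metrics is None:
        _metrics = object()  # stand-in for Metrics(_dummy_metrics_backend)
    return _metrics


def configure_metrics(backend, force=False):
    global _metrics
    if not force:
        assert _metrics is None  # crashes if the dummy was already cached
    _metrics = backend


get_metrics()                # caches a dummy-backed instance as a side effect
configure_metrics(object())  # AssertionError
```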


Reviewed by Cursor Bugbot for commit f2ca949.

Comment on lines 474 to +479

```diff
 def get_metrics() -> Metrics:
-    if _metrics_backend is None:
-        return Metrics(_dummy_metrics_backend)
-    return Metrics(_metrics_backend)
+    global _metrics
+    if _metrics is None:
+        _metrics = Metrics(_dummy_metrics_backend)
+
+    return _metrics
```

Bug: The test fixture reset_metrics_backend was not updated after _metrics_backend was renamed to _metrics, causing state to leak between tests.
Severity: MEDIUM

Suggested Fix

In the reset_metrics_backend fixture located in test_metrics.py, update the line that resets the global variable to target the new name. Change metrics_module._metrics_backend = None to metrics_module._metrics = None.
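
As a sketch, assuming the fixture has roughly this shape in test_metrics.py (the exact body is not shown in this PR):

```python
import pytest

from sentry_streams.metrics import metrics as metrics_module


@pytest.fixture(autouse=True)
def reset_metrics_backend():
    yield
    # Reset the renamed global; the old line targeted _metrics_backend,
    # which no longer exists after this PR.
    metrics_module._metrics = None
```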

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: sentry_streams/sentry_streams/metrics/metrics.py#L474-L479

Potential issue: The global variable `_metrics_backend` was renamed to `_metrics`, but
the test fixture `reset_metrics_backend` was not updated to reflect this change. The
fixture continues to reset the old, non-existent `_metrics_backend` variable, causing
the state of the new `_metrics` variable to leak between tests. This leads to test
failures, particularly in functions like `configure_metrics()` which assert that
`_metrics` is `None` before initialization. The broken test fixture compromises the
integrity of the test suite, as tests will run with stale data from previous runs.

@fpacifici (Collaborator, Author)

Replaced by #305

@fpacifici fpacifici closed this Apr 26, 2026