Cache hashCode on UTF8BytesString #11444

Open
dougqh wants to merge 1 commit into master from dougqh/utf8bytesstring-cache-hashcode

@dougqh dougqh commented May 22, 2026

What Does This Do

Caches the computed hashCode on UTF8BytesString rather than re-delegating through string.hashCode() on every call.

Motivation

UTF8BytesString.hashCode() currently looks like:

@Override
public int hashCode() {
  return this.string.hashCode();
}

String already caches its own hash, so subsequent calls don't recompute the value. But every call out of UTF8BytesString still pays a virtual dispatch, the cached-hash field check inside String.hashCode(), and a branch. For a class that ends up as a hash key in tag caches, metric label sets, and the new cardinality-handler probe tables, that overhead adds up.

This change caches the hash on UTF8BytesString itself, so subsequent calls return immediately via a single field read.

Implementation notes

Benign-race pattern, identical to the existing utf8Bytes lazy initializer in this class:

  • The cache field is private int cachedHashCode — initialised to 0 by JVM default.
  • Two threads computing the same hash in parallel produce identical results; no synchronization required.
  • int writes are atomic per JLS, so a reader can't observe a partial value.
  • If the actual hashCode is 0 (rare), the cache never takes effect and we'll recompute it on every call. String itself made the same trade-off until JDK 13 added a separate hashIsZero flag. Not worth a separate "is-zero" flag here.
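The pattern in the list above could look roughly like the following. This is a minimal sketch, not the diff from this PR: the class name CachedHashString is a hypothetical stand-in for UTF8BytesString, and only the field name cachedHashCode is taken from the description.

```java
// Hypothetical stand-in for UTF8BytesString; only the hash caching is sketched.
final class CachedHashString {
  private final String string;

  // Deliberately non-volatile. 0 (the JVM default) means "not computed yet".
  private int cachedHashCode;

  CachedHashString(String string) {
    this.string = string;
  }

  @Override
  public int hashCode() {
    // Read the field once into a local so the value tested is the value returned.
    int h = this.cachedHashCode;
    if (h == 0) {
      // Racing threads compute the same value, so a duplicate write is harmless.
      h = this.string.hashCode();
      this.cachedHashCode = h;
    }
    return h;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof CachedHashString
        && this.string.equals(((CachedHashString) o).string);
  }
}
```

The single read into the local h matters for the benign race: with two separate reads of a non-volatile field, the Java memory model would allow the second read to observe 0 even after the first observed a non-zero value. A string whose hash really is 0 (the empty string, for example) simply takes the slow path each time.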

Benchmark

Measured on AdversarialMetricsBenchmark (8 producer threads, high-cardinality unique-per-op labels saturating every cardinality cap in the metrics subsystem, 2×15s warmup + 5×15s measurement):

| Configuration | Throughput avg (ops/s) | Per-iter (ops/s) |
|---|---|---|
| Baseline | 5,165,149 ± 1,036,100 | 5.03M → 5.64M → 5.02M → 5.03M → 5.10M |
| With hashCode cache | 5,776,653 ± 1,215,399 | 5.60M → 5.47M → 5.71M → 5.81M → 6.29M |

~12% throughput improvement. Every per-iteration value with the cache (5.47M to 6.29M) is above four of the five baseline iterations; only the baseline's 5.64M iteration exceeds the two lowest cached ones (5.47M and 5.60M). The CIs overlap at one fork each, so the result is suggestive rather than conclusive, but the upward shift is consistent across the measurement iterations.
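As a quick check of the arithmetic, the relative gain can be recomputed from the figures reported above. This is only an illustrative sketch: the class and method names (ThroughputGain, gainPercent, mean) are made up for the example, and the inputs are copied from the benchmark table.

```java
import java.util.Arrays;

// Hypothetical helper for re-deriving the reported improvement.
final class ThroughputGain {
  // Relative gain of `after` over `before`, in percent.
  static double gainPercent(double before, double after) {
    return (after / before - 1.0) * 100.0;
  }

  static double mean(double[] values) {
    return Arrays.stream(values).average().orElse(Double.NaN);
  }

  public static void main(String[] args) {
    // Per-iteration throughput in millions of ops/s, as reported.
    double[] baseline = {5.03, 5.64, 5.02, 5.03, 5.10};
    double[] cached = {5.60, 5.47, 5.71, 5.81, 6.29};

    System.out.printf(
        "gain from reported averages: %.2f%%%n", gainPercent(5_165_149, 5_776_653));
    System.out.printf(
        "gain from per-iteration means: %.2f%%%n", gainPercent(mean(baseline), mean(cached)));
  }
}
```

Both calculations land a little under 12%, which matches the rounded figure quoted in the description.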

The bench is adversarial in the sense that every op uses a unique label combination, which defeats UTF8 reuse, so this is likely close to a lower bound on the gain. Production workloads with hot-key skew should benefit more, because the cardinality-handler intern pool means the same UTF8BytesString instance gets hashed repeatedly in subsequent reporting cycles.

Test plan

  • :internal-api:test --tests 'datadog.trace.bootstrap.instrumentation.api.Utf8ByteStringTest' — all 17 cases pass (existing tests already assert utf8String.hashCode() == str.hashCode())
  • :internal-api:spotlessCheck clean
  • CI muzzle / integration suites

🤖 Generated with Claude Code

UTF8BytesString.hashCode() currently delegates straight through to
String.hashCode() on every call. String already caches its own hash,
but the trip out of UTF8BytesString and through String's hash-field
check still costs a virtual dispatch + field read + branch on every
invocation.

Caches the hash on UTF8BytesString itself once computed. Benign-race
pattern, identical to the existing utf8Bytes lazy initializer: two
threads computing the same value produce identical results, and int
writes are atomic per JLS so a reader can't observe a partial value.

Measured on the metrics subsystem's adversarial JMH bench (8 producer
threads, high-cardinality unique-per-op labels), this lifts aggregate
throughput from 5.17M to 5.78M ops/s -- ~12% improvement, with the
per-iteration distribution shifting systematically upward across all
five measurement iterations. The win is bigger in production-like
workloads with repeated keys, since the cardinality-handler intern
pool means the same UTF8BytesString instance gets hashed repeatedly
in subsequent reporting cycles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dougqh dougqh requested a review from a team as a code owner May 22, 2026 04:54
@dougqh dougqh requested review from mhlidd and removed request for a team May 22, 2026 04:54

dd-octo-sts Bot commented May 22, 2026

Hi! 👋 Thanks for your pull request! 🎉

To help us review it, please make sure to:

  • Add at least one type, and one component or instrumentation label to the pull request

If you need help, please check our contributing guidelines.

@dd-octo-sts dd-octo-sts Bot added the tag: ai generated Largely based on code generated by an AI or LLM label May 22, 2026

datadog-official Bot commented May 22, 2026

Pipelines

⚠️ Warnings

🚦 3 Pipeline jobs failed

Check pull requests | Check pull requests   View in Datadog   GitHub Actions

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. Please add at least one type, and one component or instrumentation label to the pull request.

Run system tests | Check system tests success   View in Datadog   GitHub Actions

Run system tests | main / End-to-end #9 / akka-http 9   View in Datadog   GitHub Actions

See error Test failure in test_blocking_addresses.py:591 - WAF attack assertion failed.

Commit SHA: 8babbbf
