Update client-side stats to use light weight Hashtable by dougqh · Pull Request #11382 · DataDog/dd-trace-java

dougqh · 2026-05-15T19:11:06Z

What Does This Do

Replaces the MetricKey based HashMap with a new AggregateTable based on the light Hashtable

Motivation

By using the light Hashtable, I'm able to avoid the biggest source of allocation in client-side stats: MetricKey

Hashtable provides utilities for searching the entries without constructing a new composite key object

First, the components are hashed together to find the corresponding bucket
Then the bucket can be traversed to see if the entries match the key components

And the custom Entry can hold multiple fields that comprise the data / metadata needed for metric aggregation and eviction policy

The end result is that both MetricKey and AggregateMetric can be merged into a single class AggregateEntry that is only constructed when there's no existing matching entry

Additional Notes

Stacked on top of #11381 (dougqh/conflating-metrics-background-work). #11409 — the Hashtable utility this PR builds on — landed in master separately, so the diff shown here is only the AggregateTable / AggregateEntry / ClearSignal work plus follow-up cleanups.

Restructures the consumer-side aggregate store. Three logical commits, intended to be reviewed in order:

1. Add `AggregateTable` + `AggregateEntry` backed by `Hashtable`

Introduces a multi-key hash table that lets the consumer thread look up the {labels → counters} entry directly from a SpanSnapshot's raw fields — no MetricKey allocation per snapshot, no per-snapshot UTF8 cache lookups, no CHM operations. Hot-path lookup is keyHash compute → Hashtable.Support.bucket → bucket walk → matches(keyHash, snapshot) → returned entry has the counters to mutate in place.

This commit is standalone — no call sites yet, only the new classes + unit tests for hit/miss/cap-overrun/expunge/clear behavior.

2. Swap `Aggregator` to use `AggregateTable` + route `disable()` clear through a `ClearSignal`

Replaces LRUCache<MetricKey, AggregateMetric> with AggregateTable in Aggregator. Drops the AggregateExpiry listener — drop reporting (onStatsAggregateDropped) moves to the cap-overrun path inside Drainer.accept.

Threading fix bundled here: ConflatingMetricsAggregator.disable() used to call aggregator.clearAggregates() and inbox.clear() directly from the Sink's IO callback thread, racing with the aggregator thread. That race was tolerable for LinkedHashMap (worst case = corrupted internal state right before everything got cleared anyway); it's not tolerable for Hashtable (chain corruption can NPE or loop). disable() now offers a ClearSignal to the inbox so the aggregator thread itself performs the clear — preserves the single-writer invariant for AggregateTable end-to-end. The offer is best-effort; the system self-heals on a subsequent downgrade cycle if the inbox happens to be full (commented at the call site).

Cap-overrun semantic change: the old LRUCache evicted least-recently-used in O(1). AggregateTable instead scans for a hitCount==0 entry to recycle (O(N) worst-case), and drops the new key if none exists. Practical impact: in steady state, an unrelated burst of new keys gets dropped (and reported via onStatsAggregateDropped) rather than evicting established keys. The cost trade-off is commented at the eviction site — eviction is expected rare because the cap is sized to the working set; cursor-caching is the future option if a workload runs persistently at cap. The existing test that asserted "service0 evicted in favor of service10" is updated to assert the new semantics; the other cap-related test ("evicted entry was already flushed") still passes unchanged.

3. Fold `MetricKey` + `AggregateMetric` into `AggregateEntry`

MetricKey existed for two reasons — being the LRUCache key (replaced by AggregateTable's Hashtable mechanics) and being the labels arg to MetricWriter.add (the only thing left). AggregateMetric was the counter/histogram counterpart. Folds both onto a single AggregateEntry (10 UTF8 label fields + 3 primitives + counters + histograms), changes MetricWriter.add(MetricKey, AggregateMetric) → add(AggregateEntry), and deletes MetricKey.java + MetricKeys.java + AggregateMetric.java.

The 12 UTF8 caches that used to be split between MetricKey (9) and ConflatingMetricsAggregator (3, with overlap) are consolidated on AggregateEntry. One cache per field type now.

Latent bug fix: the prior matches(SpanSnapshot) used Objects.equals on raw fields. If the same logical key was delivered once as String and once as UTF8BytesString (different CharSequence impls of identical content), Objects.equals returned false and the table would split into two entries for the same key. The new matches uses content-equality (UTF8BytesString.toString() returns the underlying String in O(1)), collapsing them correctly.

Test impact: AggregateEntry.of(...) mirrors the prior new MetricKey(...) positional args, so test diffs are mostly mechanical. About 56 test sites migrated across ConflatingMetricAggregatorTest, SerializingMetricWriterTest, and MetricsIntegrationTest.

Review polish

Follow-up commits address review feedback:

Use Hashtable.Support.create(maxAggregates, Support.MAX_RATIO) + Support.bucket + Support.insertHeadEntry(buckets, keyHash, entry) + Support.mutatingTableIterator to delegate to the helpers added on Add Hashtable and LongHashingUtils utilities #11409 — drops ~50 lines of bespoke bucket-array code.
Inline the report-time forEach lambda instead of a static BiConsumer constant (the JIT reuses non-capturing lambdas).
AggregateEntry.matches(long keyHash, SpanSnapshot) overload that pre-checks the hash, so chain walks read as one call.
@Nullable (javax.annotation) annotations on the four nullable label fields + their getters + of(...) parameters.
Objects.equals import in AggregateEntry.equals() (no more fully-qualified refs).
Design-trade-off comments on evictOneStale (O(N) scan rationale) and disable() (best-effort offer rationale).

Additional cleanups

A second round of review surfacing landed these:

MetricsIntegrationTest .aggregate runtime bug — the legacy entry.aggregate.recordDurations(...) form compiled under Groovy's dynamic dispatch but would have thrown MissingPropertyException at runtime. Fixed to call recordDurations directly on the entry. (Bot-flagged.)
Spock >> { closure } no-op assertions in ConflatingMetricAggregatorTest — the >> operator stubs a return value, so the closures verifying e.getHitCount() == X && ... were being evaluated and discarded. Wrapped 31 sites in assert so Groovy power-assert surfaces mismatches. All 41 tests still pass, so the previously-unverified assertions happened to hold. (Bot-flagged.)
Drop dead recordDurations(int, AtomicLongArray) batch API — vestige of master's Batch design. Production now only calls recordOneDuration(long). Migrated the three remaining test callers (AggregateEntryTest, SerializingMetricWriterTest, MetricsIntegrationTest) to loops of recordOneDuration calls, then deleted the batched method and its AtomicLongArray imports.
AggregateEntry.of() colon-split Javadoc warning — the test factory recovers (name, value) pairs from "name:value" strings by splitting at the first :, which is brittle if a peer-tag value contains a colon (URLs, IPv6, service:env). Added an explicit warning so callers know to keep test data colon-free in values.
ConflatingMetricsAggregatorDisableTest — new JUnit 5 coverage for the disable() → ClearSignal threading routing. The test fires DOWNGRADED from the test thread, waits for the no-flush window, then publishes a marker span with a distinct resource name and asserts the next flush captures only the marker — proving CLEAR actually wiped the original entry from the table. Catches both the missing-clear regression and the bucket-chain-corruption regression that the original threading race could produce.

Benchmarks

Producer `publish()` latency (single-threaded, 2 forks × 5 iter × 15s)

	Prior commit (stacked base)	This PR
SimpleSpan bench	3.116 µs/op	3.123 µs/op
DDSpan bench	2.506 µs/op	2.412 µs/op

All within noise on the producer side — this PR is a consumer-side refactor, so producer publish() shouldn't move much. The structural wins (one less class, no per-snapshot MetricKey allocation on the consumer, no double-cache lookups, smaller per-entry footprint) only become visible when the consumer is hammered hard enough that snapshot processing rate matters. That's exactly what the adversarial bench measures.

`AdversarialMetricsBenchmark` (8 producer threads, 2×15s warmup + 5×15s, 1 fork)

Same benchmark used on #11381 (high-cardinality (service, operation, resource, peer.hostname) per op, random durations across 1ns–1s, random error/topLevel flags — designed to saturate every capacity bound at once).

	master	#11381 (parent)	this PR
Throughput avg (ops/s)	395,806 ± 2,619,133	4,889,660 ± 390,175	27,915,800 ± 1,219,470
Per-iteration progression (ops/s)	warmup 2,536,145 → 205,314 → 95,888 → 47,301 → 24,378	4,886,778 → 4,875,195 → 4,731,827 → 4,959,992 → 4,994,511	28,043,909 → 28,112,828 → 27,354,721 → 28,074,395 → 27,993,147
Stdev (ops/s)	680,180	101,327	316,692
`onStatsInboxFull` (drops at handoff)	n/a (no inbox on master)	199,862,634	2,893,855,052
`onStatsAggregateDropped`	11,642,039	84,002,323	16,301,696

~70× faster than master, ~5.7× faster than #11381 alone. The producer fast-path drop check (inbox.size() >= inbox.capacity()) returns immediately when the inbox is saturated, so when the consumer can drain ~5× faster (this PR's consumer-side win), the producer spins through publish() at the fast-path rate and inbox-full drops climb correspondingly (200M on #11381 → 2.9 B here). That's the design working as intended: under attack, load shedding happens at the cheapest place in the pipeline.

The aggregate-cache drop count actually fell (84M → 16M) because the faster consumer keeps the table from staying at cap as often — fewer snapshots arrive at a cap-overrun state with no stale entry to recycle.

Per-iteration shape stays flat (27.4M–28.1M ops/s, no warmup → measurement degradation), confirming the design holds steady-state under sustained load.

Cardinality-isolation companions (8 producer threads, 2×15s warmup + 5×15s, 1 fork)

HighCardinalityResourceMetricsBenchmark and HighCardinalityPeerMetricsBenchmark (added in #11381) pin every dimension except one to attribute throughput deltas to a specific axis.

HighCardinalityResourceMetricsBenchmark — only resource varies (~1M distinct), service/operation/peer.hostname pinned:

	master	#11381 (parent)	this PR
Throughput avg (ops/s)	5,089,853 ± 587,996	8,983,726 ± 2,677,454	28,173,068 ± 485,923
Per-iter ops/s (5 measurement)	5,277K → 5,014K → 4,948K → 4,978K → 5,232K	8,043K → 8,844K → 9,962K → 9,217K → 8,852K	28,147K → 28,391K → 28,094K → 28,079K → 28,155K
`onStatsInboxFull`	0 (no inbox)	611,612,206	2,904,235,942
`onStatsAggregateDropped`	12,379,590	325,582,103	17,033,533

HighCardinalityPeerMetricsBenchmark — only peer.hostname varies (~32K distinct), service/operation/resource pinned:

	master	#11381 (parent)	this PR
Throughput avg (ops/s)	6,335,821 ± 1,068,632	9,255,788 ± 2,933,935	30,651,802 ± 1,108,285
Per-iter ops/s (5 measurement)	6,552K → 6,510K → 6,429K → 6,323K → 5,865K	9,137K → 10,258K → 9,752K → 8,794K → 8,337K	30,876K → 30,722K → 30,852K → 30,644K → 30,166K
`onStatsInboxFull`	0	744,795,870	3,107,668,314
`onStatsAggregateDropped`	7,039,180	208,556,622	17,532,646

+454 % over master / +214 % over #11381 on the resource axis; +384 % / +231 % on the peer axis. The pattern is consistent with the adversarial run: master is sensitive to which dimension is wide (resource at ~1M hits pending/keys CHM contention harder than peer.hostname at ~32K); this PR runs at ~28–30 M ops/s on either bench because the consumer is no longer the bottleneck. Variance also collapses (±486K / ±1.1M here vs ±2.7M / ±2.9M on #11381), reflecting that the new hashtable hits steady-state immediately with no JIT warmup tail.

The drop-counter shape on this PR flips even further toward inbox-full (170× ratio against aggregate-dropped on both benches), confirming that the consumer drains so fast the LRU aggregate cache barely needs to evict.

Net code delta: +1280 / −903 = +377 lines across 16 files. The growth is dominated by new test coverage (AggregateTableTest, AggregateEntryTest) plus the consolidated UTF8 caches landing on AggregateEntry; the production-code core (less MetricKey + MetricKeys + AggregateMetric minus AggregateEntry's additions) is roughly flat.

Test plan

./gradlew :dd-trace-core:test --tests 'datadog.trace.common.metrics.*' passes (incl. the new AggregateTableTest and AggregateEntryTest)
./gradlew :dd-trace-core:compileJava :dd-trace-core:compileTestGroovy :dd-trace-core:compileJmhJava :dd-trace-core:compileTraceAgentTestGroovy all green
./gradlew spotlessCheck clean
CI muzzle / integration suites
Validate stats.dropped_aggregates semantics at high cardinality (especially the new "drop new on cap overrun" path vs. the old "evict LRU" path)

🤖 Generated with Claude Code

Two general-purpose utilities used by the client-side stats aggregator work (PR #11382 and follow-ups), extracted into their own change so the metrics-specific PRs can build on a smaller, reviewable foundation. - Hashtable: a generic open-addressed-ish bucket table abstraction keyed by a 64-bit hash, with a public abstract Entry type so client code can subclass it for higher-arity keys. The metrics aggregator uses it to back its AggregateTable. - LongHashingUtils: chained 64-bit hash combiners with primitive overloads (boolean, short, int, long, Object). Used in place of varargs combiners to avoid Object[] allocation and boxing on the hot path. No callers within internal-api itself yet -- the metrics aggregator PR will introduce the first usages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Standalone classes for swapping the consumer-side LRUCache<MetricKey, AggregateMetric> with a multi-key Hashtable in the next commit. No call sites use them yet. - AggregateEntry extends Hashtable.Entry, holds the canonical MetricKey, the mutable AggregateMetric, and copies of the 13 raw SpanSnapshot fields for matches(). The 64-bit lookup hash is computed via chained LongHashingUtils.addToHash calls (no varargs, no boxing of short/boolean). - AggregateTable wraps a Hashtable.Entry[] from Hashtable.Support.create. findOrInsert(SpanSnapshot) walks the bucket comparing raw fields, falling back to MetricKeys.fromSnapshot on a true miss. On cap overrun, it scans for an entry with hitCount==0 and unlinks it; if none, it returns null and the caller drops the data point. - MetricKeys.fromSnapshot extracts the canonicalization logic (DDCache lookups + UTF8 encoding) from Aggregator.buildMetricKey, so the helper can be called from AggregateTable on miss. This also commits Hashtable and LongHashingUtils (added earlier, previously uncommitted) and lifts Hashtable.Entry / Hashtable.Support visibility so client code outside datadog.trace.util can build higher-arity tables -- the case the javadoc describes but the original visibility didn't actually support. Specifically: Entry is now public abstract with a protected ctor; keyHash, next(), and setNext() are public; Support's create / clear / bucketIndex / bucketIterator / mutatingBucketIterator methods are public. Tests: AggregateTableTest covers hit, miss, distinct-by-spanKind, peer-tag identity (including null vs non-null), cap overrun with stale victim, cap overrun with no victim (returns null), expungeStaleAggregates, forEach, clear, and that the canonical MetricKey is built at insert. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace LRUCache<MetricKey, AggregateMetric> with the AggregateTable added in the prior commit. The hot path in Drainer.accept becomes: AggregateMetric aggregate = aggregates.findOrInsert(snapshot); if (aggregate != null) { aggregate.recordOneDuration(snapshot.tagAndDuration); dirty = true; } else { healthMetrics.onStatsAggregateDropped(); } On the steady-state hit path the lookup is a 64-bit hash compute + bucket walk + matches(snapshot) -- no MetricKey allocation, no SERVICE_NAMES / SPAN_KINDS / PEER_TAGS_CACHE lookups. The canonical MetricKey is now built once per unique key at insert time, in MetricKeys.fromSnapshot. Behavioral change in the cap-overrun path ----------------------------------------- The old LRUCache evicted least-recently-used: at cap, a new insert would push out the oldest entry regardless of whether it was live or stale. AggregateTable instead scans for a hitCount==0 entry to recycle, and drops the new key if none exists. Practical impact: in the common case where the table holds a stable set of recurring keys, an unrelated burst of new keys is dropped (and reported via onStatsAggregateDropped) rather than evicting the established keys. The existing test that asserted "service0 evicted in favor of service10" is updated to assert the new semantics. The other cap-related test ("should not report dropped aggregate when evicted entry was already flushed") still passes unchanged: after report() clears all entries to hitCount=0, the next wave of inserts recycles them. Threading fix ------------- ConflatingMetricsAggregator.disable() used to call aggregator.clearAggregates() and inbox.clear() directly from the Sink's IO event thread, racing with the aggregator thread mid-write. The race was tolerable for LinkedHashMap; it is not for AggregateTable (chain corruption can NPE or loop). disable() now offers a ClearSignal to the inbox so the aggregator thread itself performs the table clear and the inbox.clear(). Adds one SignalItem subclass + one branch in Drainer.accept; preserves the single-writer invariant for AggregateTable end-to-end. Removed: LRUCache import, AggregateExpiry inner class, the static buildMetricKey / materializePeerTags / encodePeerTag helpers (now in MetricKeys). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MetricKey existed for two reasons -- the prior LRUCache key role (now handled by AggregateTable's Hashtable.Entry mechanics) and as the labels argument to MetricWriter.add. The first is gone; the second is the only thing keeping MetricKey alive. Fold its UTF8-encoded label fields onto AggregateEntry, change MetricWriter.add to take AggregateEntry directly, and delete MetricKey + MetricKeys. What AggregateEntry now holds ----------------------------- - 10 UTF8BytesString label fields (resource, service, operationName, serviceSource, type, spanKind, httpMethod, httpEndpoint, grpcStatusCode, and a List<UTF8BytesString> peerTags for serialization). - 3 primitives (httpStatusCode, synthetic, traceRoot). - AggregateMetric (the value being accumulated). - The raw String[] peerTagPairs is retained alongside the encoded peerTags -- matches() compares it positionally against the snapshot's pairs; the encoded form is only consumed by the writer. matches(SpanSnapshot) compares the entry's UTF8 forms to the snapshot's raw String / CharSequence fields via content-equality (UTF8BytesString.toString() returns the underlying String in O(1)). This closes a latent bug in the prior raw-vs-raw matches(): if one snapshot delivered a tag value as String and a later snapshot delivered the same content as UTF8BytesString, the old Objects.equals would return false and the table would split into two entries. Content-equality matching collapses them into one. Consolidated caches ------------------- The static UTF8 caches that used to live partly on MetricKey (RESOURCE_CACHE, OPERATION_CACHE, SERVICE_SOURCE_CACHE, TYPE_CACHE, KIND_CACHE, HTTP_METHOD_CACHE, HTTP_ENDPOINT_CACHE, GRPC_STATUS_CODE_CACHE, SERVICE_CACHE) and partly on ConflatingMetricsAggregator (SERVICE_NAMES, SPAN_KINDS, PEER_TAGS_CACHE) are all now on AggregateEntry. The split was duplicating work -- SERVICE_NAMES and SERVICE_CACHE both cached service-name to UTF8BytesString. One cache per field now. API change: MetricWriter.add ---------------------------- Was: add(MetricKey key, AggregateMetric aggregate) Now: add(AggregateEntry entry) The aggregate lives on the entry. Single-arg. SerializingMetricWriter reads the same UTF8 fields off AggregateEntry that it previously read off MetricKey; the wire format is byte-identical. Test impact ----------- AggregateEntry.of(...) takes the same 13 positional args new MetricKey(...) took, so test diffs are mostly mechanical: new MetricKey(args) -> AggregateEntry.of(args) writer.add(key, _) -> writer.add(entry) ValidatingSink in SerializingMetricWriterTest now iterates List<AggregateEntry> directly. ConflatingMetricAggregatorTest's Spock matchers (~36 sites) rely on AggregateEntry.equals comparing the 13 label fields (not the aggregate) so the mock matches by labels regardless of the aggregate state at call time; post-invocation closures verify aggregate state. Benchmarks (2 forks x 5 iter x 15s) ----------------------------------- The change is consumer-thread only; producer publish() is unchanged. SimpleSpan bench: 3.123 +- 0.025 us/op (prior: 3.119 +- 0.018) DDSpan bench: 2.412 +- 0.022 us/op (prior: 2.463 +- 0.041) Both within noise -- the win is structural (one less class, one less allocation per miss, one fewer cache layer) rather than benchmarked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

LongHashingUtilsTest (14 cases): - hashCodeX null sentinel + non-null pass-through - all primitive hash() overloads match the boxed Java hashCodes - hash(Object...) 2/3/4/5-arg overloads match the chained addToHash formula they are documented to constant-fold to - addToHash(long, primitive) overloads match the Object-version - linear-accumulation invariant (31 * h + v) holds across a sequence - iterable / deprecated int[] / deprecated Object[] variants match chained addToHash - intHash treats null as 0 (observable via hash(null, "x")) HashtableTest (24 cases across 5 nested classes): - D1: insert/get/remove/insertOrReplace/clear/forEach, in-place value mutation, null-key handling, hash-collision chaining with disambig- uating equals, remove-from-collided-chain leaves siblings intact - D2: pair-key identity, remove(pair), insertOrReplace matches on both parts, forEach - Support: capacity rounds up to a power of two, bucketIndex stays in range across a wide hash sample, clear nulls every slot - BucketIterator: walks only matching-hash entries in a chain, throws NoSuchElementException when exhausted - MutatingBucketIterator: remove from head-of-chain unlinks, replace swaps the entry while preserving chain, remove() without prior next() throws IllegalStateException Tests live in internal-api/src/test/java/datadog/trace/util and use the already-present JUnit 5 setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Bring the new util/ files in line with google-java-format (tabs → spaces, line wrapping, javadoc list markup) so spotlessCheck passes in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@threads

Compares Hashtable.D1 and Hashtable.D2 against equivalent HashMap usage for add, update, and iterate operations. Each benchmark thread owns its own map (Scope.Thread), but @threads(8) is used so the allocation/GC pressure that Hashtable is designed to avoid surfaces in the throughput numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Guard Support.sizeFor against overflow and use Integer.highestOneBit; reject capacities above 1 << 30 instead of looping forever. - Add braces around single-statement while bodies in BucketIterator. - Split HashtableBenchmark into HashtableD1Benchmark / HashtableD2Benchmark. - Add regression tests for Support.sizeFor bounds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The 5-arg Object overload was forwarding only obj0..obj3 to the int overload, silently dropping obj4. Also align LongHashingUtils.hash 3-arg signature with its 2/4/5-arg siblings (int parameters) and strengthen the 5-arg HashingUtilsTest to detect the missing-arg regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Split D1Tests and D2Tests into HashtableD1Test and HashtableD2Test; extract shared test entry classes into HashtableTestEntries. - Reduce visibility of LongHashingUtils.hash(int...) chaining overloads to package-private; they are internal building blocks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The iterator tests need a populated Hashtable.Entry[] to drive Support.bucketIterator / mutatingBucketIterator. Relaxing D1.buckets from private to package-private lets the same-package tests read it directly, removing the reflection helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The label fields and the mutable counters/histograms are 1:1 with each entry; carrying them on a separate object meant one extra allocation per unique key plus an indirection on every hot-path update. Merging them puts the counters directly on AggregateEntry, drops the entry.aggregate hop, and consolidates ERROR_TAG / TOP_LEVEL_TAG onto the same class the consumer uses to decode them. AggregateTable.findOrInsert now returns AggregateEntry. Callers in Aggregator and SerializingMetricWriter updated. Migrated AggregateMetricTest.groovy to AggregateEntryTest.java per project policy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a context-passing forEach(T, BiConsumer) overload to AggregateTable, mirroring TagMap's pattern. Aggregator.report now hands the writer in as context to a static BiConsumer so no fresh Consumer is allocated each report cycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the TagMap pattern: pairs the existing forEach(Consumer) with a forEach(T context, BiConsumer<T, TEntry>) overload so callers can hand side-band state to a non-capturing lambda and avoid the fresh-Consumer-per-call allocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

datadog-datadog-prod-us1-2 · 2026-05-19T17:52:05Z

✨ Fix all issues with BitsAI

⚠️ Warnings

🚦 2 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-java | agent_integration_tests

🔧 Fix in code (Fix with Cursor).
4 failed tests due to java.lang.IllegalAccessError at MetricsIntegrationTest.groovy:45.

DataDog/apm-reliability/dd-trace-java | check_base

🔧 Fix in code (Fix with Cursor).
CodeNarc rule violations found. Exceeded maximum number of priority 3 violations during task ':dd-trace-core:codenarcTraceAgentTest'.

Useful? React with 👍 / 👎

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 4a4b72e | Docs | Datadog PR Page | Give us feedback!}

Factors the unchecked (TEntry) cast out of D1.forEach / D2.forEach (and the BiConsumer variants) into Support.forEach(buckets, ...). The cast now lives in one place, mirroring how Entry.next() handles it, and the D1/D2 methods become one-liners. Downstream higher-arity tables built on Support gain the same helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Now that Hashtable.Support exposes the parameterized forEach helpers, AggregateTable's own forEach methods can drop their duplicated loop body and the (AggregateEntry) cast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This method was a vestige of master's Batch design where multiple producer threads wrote into an AtomicLongArray slot concurrently and the aggregator drained ~64 durations per Batch in one call. The new producer/consumer split publishes one SpanSnapshot per span, so production only ever calls recordOneDuration(long). Migrate the three remaining callers (AggregateEntryTest, SerializingMetricWriterTest, MetricsIntegrationTest) to a loop of recordOneDuration(long) calls, then delete the batched method and its AtomicLongArray imports. Drops the recordDurationsIgnoresTrailingZeros test -- that behavior was a specific quirk of the batched API (count parameter shorter than the array length) and doesn't apply to recordOneDuration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The factory recovers (name, value) pairs from pre-encoded "name:value" strings by splitting at the FIRST colon. Test-only, but worth being explicit so callers don't hand it a peer-tag value containing a colon (URLs, IPv6, service:env) and get a silently wrong (name, value) pair. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The bundled fix in this PR routes the agent-downgrade clear through the inbox so the aggregator thread stays the sole writer to AggregateTable. Prior to this test, there was no regression coverage for that routing. The test fires DOWNGRADED from the test thread (production-like OkHttpSink callback path), waits for the immediate no-flush window, then publishes a marker span with a distinct resource name. The subsequent report's writer.add captor must see only the marker -- if CLEAR didn't actually wipe the original entry, the original "resource" would still be present and the assertion would catch it. Cannot directly verify thread identity of the clear from inside this test (CLEAR's inbox.clear() drops any latch signal we'd queue behind it), so this is an observable-contract test rather than a strict thread-id test. Still catches both the missing-clear regression and the bucket-chain-corruption regression that the original threading race could produce. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…optimize-metric-key

…dinality Resolved conflicts: - AggregateEntry.java: kept #11387's cardinality-handler constructor shape; dropped the dead recordDurations(int, AtomicLongArray) batch API and its AtomicLongArray import (the #11382 cleanup). The of() colon-split warning from #11382 doesn't apply because #11387's of() takes peerTags as a pre-built list (no colon-splitting on test reconstruction). - ClientStatsAggregator.java: kept #11387's PeerTagSchema.of(..., healthMetrics) 3-arg signature; preserved #11382's read-order race fix (lastTimeDiscovered read before peerTags); preserved the resetCardinalityHandlers method from HEAD. - ClientStatsAggregatorTest.groovy: kept HEAD's cardinality-focused tests; applied the #11382 Spock-assert fix (wrap '>> { closure }' body expressions in assert) to all 31 sites so power-assert surfaces mismatches. - ClientStatsAggregatorBootstrapTest.java: renamed the reconcileSwapsSchemaWhenTagSetChanges test's ConflatingMetricsAggregator references to ClientStatsAggregator. - ConflatingMetricsAggregatorDisableTest.java: renamed file + references to ClientStatsAggregatorDisableTest with corresponding class-name change. - AdversarialMetricsBenchmark.java: updated ConflatingMetricsAggregator + ConflatingMetricsAggregatorBenchmark references to ClientStatsAggregator / ClientStatsAggregatorBenchmark. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ory-efficiency Resolved conflicts: - AdversarialMetricsBenchmark.java: kept #11389's Javadoc (specific to AggregateTable + cardinality handlers), applied #11382's LongAdder + concise tearDown cleanup. - AggregateEntry.java: dropped recordDurations(int, AtomicLongArray) per the #11382 cleanup (no longer used after migrating the test callers to recordOneDuration). - MetricsIntegrationTest.groovy: migrated to per-duration recordOneDuration calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

amarziali · 2026-05-22T08:21:08Z

+  private AggregateEntry(SpanSnapshot s, long keyHash) {
+    super(keyHash);
+    this.resource = canonicalize(RESOURCE_CACHE, s.resourceName);
+    this.service = SERVICE_CACHE.computeIfAbsent(s.serviceName, UTF8_ENCODE);


before the service name was also set to EMPTY when null. I think this in reality cannot happen but perhaps it's worth going through canonicalize also for this

Yes, I need to revisit the null vs EMPTY story. As I recall the null handling was already a mix of two approaches before, but we should fix it.

From Claude (posted on dougqh's behalf):

Done in d1749389b9. service and spanKind now both go through canonicalize like every other label field — null maps to EMPTY instead of NPE'ing at the cache call. Dropped the Functions.UTF8_ENCODE import too since the canonicalize helper handles the create.

Pinned by AggregateTableTest.nullServiceAndSpanKindDoNotNpeAndCollapseWithEmpty which publishes (null, null), (null, null) again, and ("", ""), asserts a single entry and length-0 service/spanKind. Without the fix, the first publish would NPE'd at .computeIfAbsent.

amarziali · 2026-05-22T08:22:10Z

-    }
-  }
-
  public void clearAggregates() {


this looks now a dead method. Given that is not thread safe wrt the underlying hashtable I would remove it or reduce its visibility

From Claude (posted on dougqh's behalf):

Deleted in 21e75452d6 — the method had no callers after the disable() rewrite, and you're right that exposing a thread-unsafe mutator from outside the inbox routing would invite the exact race that motivated the ClearSignal path. The comment in ConflatingMetricsAggregator that named it has been updated to describe the inbox-routed clear instead.

amarziali · 2026-05-22T08:26:47Z

+    h = LongHashingUtils.addToHash(h, s.synthetic);
+    h = LongHashingUtils.addToHash(h, s.traceRoot);
+    h = LongHashingUtils.addToHash(h, s.spanKind);
+    if (s.peerTagSchema != null && s.peerTagValues != null) {


Minor: this null check makes the equals not symmetrical since we're skipping here the field from including into the hash - also combinations with different nulls on peerTagSchema/peerTagValues can produce the same hash theoretically

I think at this point I should just get rid of equals & hashCode. They aren't really used by the new table structure.

As for the nulls in the hash code, yes, that can lead to hash collisions, but that's fine.
That said, I do think I'll rework the hashOf routine to be a bit lighter weight. I may or may not change the null handling when I do so.

From Claude (posted on dougqh's behalf):

Both halves of this are now fixed:

Hash side (80778c4e87): dropped the conditional null-skip. hashOf now unconditionally mixes in LongHashingUtils.addToHash(h, s.peerTagSchema) and Arrays.hashCode(s.peerTagValues). PeerTagSchema.hashCode() is content-based on names (lazily cached), so distinct schema instances with identical names still produce a stable hash. Different-null combinations on the two fields can no longer collide.

Equals side (2536aa2e76): AggregateEntry.equals now compares the raw peerTagNames / peerTagValues arrays via Arrays.equals, mirroring exactly what hashOf hashes. Same for be13431744 on PeerTagSchema (symmetric equals/hashCode pair on names).

Two regression tests pin the contract: AggregateEntryTest.equalsConsistentWithHashCodeAcrossDifferentSchemaLayouts (entries with ["a","b"]/[null,"x"] vs ["b","c"]/["x",null] — same encoded peer-tags ["b:x"] but different raw layouts → not equal, different hashCodes) and equalEntriesHaveEqualHashCodes (the positive case).

amarziali · 2026-05-22T08:38:52Z

+   * persistently at cap, this is the place to consider caching a cursor across calls so the scan
+   * resumes where it left off.
+   */
+  private boolean evictOneStale() {


What highlighted in the comment is true. However if I'm not wrong when the aggregator start dropping (meaning that capacity is full and badly dimensioned) this will translate in a perf kill. I'm wondering if it's worth adding a field for the last scanned bucket to have the cost amortized right now

Can you clarify "perf kill"? Yes, this will end up doing an O(n) scan of the table, but the table is small.
Yes, that will slow down the metrics thread, but it won't kill the application.

I've primarily focused on making sure things don't fall over under an adversarial workload. In those cases, such last used caches don't work. Although, I suppose I could check a more realistic accidental case where only a single component has out of control cardinality.

was about needing to scan the full table for each span inserted that at the end cannot insert since will overflow. but perhaps I've misinterpret something

Okay, I'll double check the eviction logic. That is worth being careful about.
Although, the next part that introduces cardinality limiters largely renders the table limit unnecessary.

I updated the eviction to track the last bucket from one scan to the next.
To do that, I added a variation of the Hashtable iterator that scans a range of buckets.

Since only this code is using Hashtable right now, I went ahead and included that in this PR

amarziali · 2026-05-22T08:40:15Z

+          aggregates.clear();
+          inbox.clear();
+        }
+        ((SignalItem) item).complete();


should the dirty flag be also cleared up here?

Good question. I think I need to go back and review all the signaling a bit more.
There were some latent problems that already existed in the code that have been exposed during the reviews.

Added in the clearing of the dirty flag.

I wonder if that was a latent bug because I mostly intended just to shuffle around the existing code. It might have been that we were just getting a spurious reporting cycle previously.

Anyway, it is fixed now.

The constructor canonicalizes null fields through canonicalize() which returns UTF8BytesString.EMPTY for null inputs (or a cached UTF8BytesString("") for empty-string inputs). But matches() compared those entries against subsequent snapshots via contentEquals(...) / stringContentEquals(...), which treated non-null UTF8BytesString vs null CharSequence as inequal. Result: two snapshots with the same null-valued resource/operation/ type/serviceSource hashed to the same bucket (intHash(null) == 0 == "".hashCode()), but matches() returned false on the EMPTY-vs-null field comparison, so the second snapshot inserted a *duplicate* entry into the table. Same path for empty-string vs null. Unify the semantics: null and length-zero are treated as equivalent on either side of contentEquals/stringContentEquals. The hash already agreed (intHash(null) == "".hashCode() == 0), so this restores the matches() contract to match the existing hash contract. Adds AggregateTableTest.nullAndEmptyOptionalFieldsCollapseToOneEntry to pin the contract: two null-fielded and one empty-string-fielded snapshot must all hit the same entry. Test would have failed before the fix (a duplicate insert) but the existing 10 cases still pass. Resolves sarahchen6's review comment on AggregateEntry.java:113 and amarziali's related concern on AggregateEntry.java:114. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After CLEAR runs the table is empty but dirty would still carry over from any prior SpanSnapshot insert. The next report() would see dirty=true, expunge no-op the empty table, find isEmpty(), and log "skipped metrics reporting because no points have changed" -- same observable outcome, but resetting dirty here keeps the invariant "dirty implies there's data to flush" honest. Resolves amarziali's review comment on Aggregator.java:121. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously hashOf wrapped the peer-tag contribution in `if (s.peerTagSchema != null && s.peerTagValues != null)`. That meant two snapshots with different null arrangements (schema-null vs values-null) collapsed to the same hash, getting resolved only by the field-by-field matches() fallback at the bucket walk -- wasteful, and the asymmetry hurt hash quality generally. Replace with unconditional contributions: - PeerTagSchema now overrides hashCode() to be content-based on names (lazy + cached, benign-race pattern matching UTF8BytesString / utf8Bytes elsewhere). addToHash(h, schema) routes through that. - For the String[] values, pass Arrays.hashCode(values) through the int overload -- Object[].hashCode() is identity-based by default, so we have to compute content hash explicitly. Null arrays hash to 0 via Arrays.hashCode's contract. Null inputs on either side now hash to 0 distinctly from any real schema or non-empty values array, so all four null combinations are distinguishable. Same final hash for content-equal inputs across schema replacements (the reconcile path), which preserves the entry- hit invariant after the aggregator rebuilds the schema. Resolves amarziali's review comment on AggregateEntry.java:309 and dougqh's suggestion on AggregateEntry.java:310. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Once the ClearSignal routing replaced the direct disable()-to-table mutation, clearAggregates() lost all its call sites -- no production code, no test code. Worse, leaving it public invited future callers to bypass the ClearSignal contract and race against Drainer.accept on the aggregator thread. Drop the method outright. Update the inline comment in ConflatingMetricsAggregator.disable() to not name the deleted method. Resolves amarziali's review comment on Aggregator.java:82. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…terator Previously AggregateTable.evictOneStale walked the bucket array from bucket 0 on every call. Under sustained cap pressure with mostly-hot entries clustered in low buckets, every eviction re-scanned the same hot prefix before finding a cold entry. amarziali's review concern. Add a cursor: after a successful eviction, remember the bucket where it landed. The next call resumes from there. Worst case for a single call is still O(N) when nearly every entry is hot, but a sustained eviction stream amortizes to O(1) per call -- the hot prefix is never re-scanned more than twice across N evictions. Implemented as two iterators driving [cursor, length) then [0, cursor), which required a small Hashtable.Support API addition: - New `mutatingTableIterator(buckets, startBucket, endBucket)` overload for walking a half-open bucket range. The existing zero-arg overload is kept; it now delegates to the new ctor with [0, buckets.length). - New `MutatingTableIterator.currentBucket()` accessor exposing the bucket index of the entry last returned by next() (or -1 before any next/after a remove). AggregateTable saves this as the new cursor. - The empty-range case (startBucket == endBucket) yields an immediately-exhausted iterator -- this is what makes the wrap-around pass [0, cursor) naturally produce nothing when cursor == 0, so the two-pass driver in evictOneStale needs no special case. Tests: - 4 new HashtableTest cases covering the half-open API, empty ranges, out-of-range bounds, and currentBucket() behavior before/after next. - 2 new AggregateTableTest cases: backToBackEvictionsAllSucceed (drives 3x capacity worth of cap-overrun inserts; each must succeed, which only holds if the cursor advances correctly) and clearResetsCursorForSubsequentEvictions (clear() also resets the cursor so subsequent eviction passes start from bucket 0). Resolves amarziali's review comment on AggregateTable.java:75. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dougqh's review comment on AggregateEntry.java:153 asked to keep test code out of the production class. Move the factory to a new AggregateEntries helper in src/test/java/datadog/trace/common/metrics. Same package so it can call the package-private forSnapshot(); delegating to forSnapshot also means no need to widen the AggregateEntry constructor visibility. The 37 src/test/groovy call sites get a mechanical rewrite of AggregateEntry.of(...) -> AggregateEntries.of(...) (36 in ConflatingMetricAggregatorTest, 1 in SerializingMetricWriterTest). src/traceAgentTest is a separate source set without compile-time visibility into src/test, so its 2 MetricsIntegrationTest.groovy call sites can't use AggregateEntries. Migrated those to construct a SpanSnapshot inline + call AggregateEntry.forSnapshot(snapshot). Groovy's permissive package-private access makes this work from the default package the integration test currently sits in. Resolves dougqh's review comment on AggregateEntry.java:153. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

equals compared the pre-encoded peerTags List<UTF8BytesString> while hashCode (via hashOf) mixes in the raw peerTagSchema + values arrays. Two entries built from different schema layouts can collapse to the same encoded form -- e.g. tag "b" at index 1 in schema {a,b} with values {null,"x"} produces the same encoded ["b:x"] as schema {b,c} with values {"x",null}. equals returned true; hashCodes differed. Hashcode contract violated. Switch equals to compare the raw peerTagNames + peerTagValues arrays, mirroring matches(SpanSnapshot) and hashOf(SpanSnapshot). The production lookup path (AggregateTable.findOrInsert) already uses those, so this just brings equals in line with the rest of the class. Adds two regression tests on AggregateEntryTest: - equalsConsistentWithHashCodeAcrossDifferentSchemaLayouts: the failing-case shape above. Pre-fix, the encoded-list equals returned true while hashCodes differed; now equals returns false and the hashCodes differ in agreement. - equalEntriesHaveEqualHashCodes: positive case -- two entries from identical snapshots must equal and share hashCode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Prior CLEAR handler called inbox.clear() as belt-and-suspenders cleanup of in-flight snapshots. That would also erase any STOP signal queued behind CLEAR -- a real concern in disable() -> close() sequences, where the trampled STOP leaves the aggregator thread spinning until thread.join's timeout. sarahchen6 surfaced this from a Codex pass on the CLEAR logic; dougqh confirmed it's worth fixing. The CLEAR handler now clears only the aggregates table. Queued snapshots will drain naturally into the just-cleared table -- but since features.supportsMetrics() is already false by the time CLEAR was offered, producers have stopped publishing; the inbox drains and empties on its own. Worst case: one extra reporting cycle of wasted work on stale snapshots that the agent rejects, which triggers another DOWNGRADED -> disable() -> CLEAR. Self-healing, same as before. Adds ConflatingMetricsAggregatorDisableTest.clearDoesNotTrampleQueuedStopSignal: publish a snapshot, fire DOWNGRADED, call close(); the test bounds close() with its own 2s timeout and asserts the thread exits within it. Pre-fix this would have hung out THREAD_JOIN_TIMEOUT_MS; post-fix it returns in milliseconds. Resolves sarahchen6/Codex's CLEAR-trampling-STOP review comment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The prior commit added a content-based hashCode() but left equals falling back to Object.equals (reference identity). That violates the hashCode contract for any caller that compares two distinct schema instances built from the same tag list -- e.g. before/after a reconcile rebuilds the cached schema with an unchanged tag set. equals() now mirrors hashCode(): content-equal on names. The reconcile- timing field lastTimeDiscovered is intentionally excluded from both -- it's bookkeeping for the aggregator's discovery-version compare, not part of schema identity. Tests: - equalsIsContentBasedOnNames -- same names, two instances, equal + matching hashCode. - equalsIgnoresLastTimeDiscovered -- pins that the bookkeeping field doesn't leak into identity. - equalsDistinguishesByOrder -- names is positional (pairs with SpanSnapshot.peerTagValues by index), so reordered schemas are not interchangeable. - equalsHandlesNullAndOtherTypes -- contract corners. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AggregateEntry's constructor canonicalized resource, operationName, type, and serviceSource (mapping null -> EMPTY via the canonicalize helper) but called SERVICE_CACHE.computeIfAbsent / SPAN_KIND_CACHE .computeIfAbsent directly for service and spanKind. Inputs of null would NPE on the cache call. Production paths never pass null for these -- DDSpan always supplies a service, and the producer defaults spanKind to "" via unsafeGetTag(SPAN_KIND, (CharSequence) "") -- so this is a latent- defense fix, not a live bug. But the matches/contentEquals logic already treats null and length-zero as equal on both sides, and every other label field in the constructor defends via canonicalize. Two unprotected outliers are an inconsistency that bites the next person who reaches for a new code path. Drops the Functions.UTF8_ENCODE import (its sole use was the service cache line) -- canonicalize internally creates the UTF8BytesString. Test: AggregateTableTest.nullServiceAndSpanKindDoNotNpeAndCollapseWithEmpty publishes (null, null), (null, null) again, and ("", ""); asserts a single entry results and that getService()/getSpanKind() are length-0. Without the fix, the first publish would have NPE'd at the .computeIfAbsent call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ound-work' into dougqh/optimize-metric-key # Conflicts: # dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java # dd-trace-core/src/test/java/datadog/trace/common/metrics/PeerTagSchemaTest.java

sarahchen6 · 2026-05-22T19:14:31Z

+    return aStr.contentEquals(b);
+  }
+
+  private static boolean stringContentEquals(UTF8BytesString a, String b) {


can remove this method and just use contentEquals above since String is a CharSequence?

sarahchen6 · 2026-05-22T19:17:29Z

+   * subsequent snapshot whose field is still null. {@code intHash(null) == 0 == "".hashCode()}, so
+   * the hash already agrees with this view.
+   */
+  private static boolean contentEquals(UTF8BytesString a, CharSequence b) {


nit: this method could be simplified to something like

private static boolean contentEquals(UTF8BytesString a, CharSequence b) { if (a == null || a.length() == 0) { return b == null || b.length() == 0; } return b != null && a.toString().contentEquals(b); }

sarahchen6 · 2026-05-22T19:19:06Z

+ * reach {@link AggregateEntry#forSnapshot(SpanSnapshot)} and the package-private {@link
+ * SpanSnapshot} constructor.
+ */
+public final class AggregateEntries {


since this is now test-specific, it would make sense to me to have a more test-related name, instead of reading nearly identical to AggregateEntry?

sarahchen6 · 2026-05-22T19:34:37Z

+    // ""), but the prior contentEquals impl treated `non-null vs null` as not-equal -- so a second
+    // snapshot with the same null fields hashed to the same bucket but failed matches(), causing a
+    // spurious duplicate insert. The fix unifies null and length-zero on both sides of
+    // contentEquals/stringContentEquals.


nit: I think this comment could be removed. It seems to explain a fix implemented during the PR iteration process and how this test case addresses that fix, but the test case can stand on its own, i.e. future code readers only need to know what the final code expected behavior is.

this may apply to the comment below too for nullServiceAndSpanKindDoNotNpeAndCollapseWithEmpty()

sarahchen6 · 2026-05-22T19:38:49Z

+    // Regression: prior CLEAR handler called inbox.clear(), which would erase any STOP signal
+    // queued behind it. close() then waited out thread.join's timeout because Drainer never saw
+    // the STOP and `stopped` was never set. Now the CLEAR handler clears only the aggregates
+    // table; queued signals (STOP, REPORT) survive and get processed normally.


same here... maybe keeping the comment as just:

CLEAR handler clears only the aggregates table; queued signals (STOP, REPORT) survive and get processed normally.

basically removing references to the PR iteration process and keeping just the expected behavior / objectively helpful context in comments

sarahchen6 · 2026-05-22T19:42:44Z

+    PeerTagSchema b = PeerTagSchema.testSchema(new String[] {"peer.hostname", "peer.service"});
+
+    assertEquals(a, b);
+    assertEquals(b, a);


nit: this line seems unnecessary given 100 and 102 🤔

sarahchen6

Left a few nits, but otherwise looks reasonable!!

Also this was Claude's advice for the failing agent integration test:

Fix options (in order of invasiveness):
1. Add a public test-support module — extract a small AggregateEntryTestSupport factory into a test-support
jar that traceAgentTest can depend on, keeping the production classes package-private.
2. Make testSchema + forSnapshot public on their classes — simplest, but widens the API surface of
intentionally-internal classes.
3. Rewrite the test to not reference these classes directly — drive the test entirely via
ConflatingMetricsAggregator publishing real DDSpans, which avoids any dependency on the internal
AggregateEntry/PeerTagSchema types.

Option 3 is the cleanest because it also makes the integration test more realistic (it exercises the full
pipeline instead of bypassing the producer path).

dougqh added type: enhancement Enhancements and improvements comp: core Tracer core tag: performance Performance related changes tag: no release notes Changes to exclude from release notes comp: metrics Metrics tag: ai generated Largely based on code generated by an AI or LLM labels May 15, 2026

dougqh mentioned this pull request May 15, 2026

Per-component / tag cardinality limits in client-side stats #11387

Draft

3 tasks

dougqh mentioned this pull request May 18, 2026

Add Hashtable and LongHashingUtils utilities #11409

Merged

2 tasks

dougqh and others added 3 commits May 18, 2026 15:18

dougqh force-pushed the dougqh/optimize-metric-key branch from 050a998 to 3738c85 Compare May 18, 2026 19:20

dougqh changed the base branch from dougqh/conflating-metrics-background-work to dougqh/util-hashtable May 18, 2026 19:21

dougqh and others added 8 commits May 18, 2026 15:40

Apply spotless formatting to Hashtable and LongHashingUtils

031dc89

Bring the new util/ files in line with google-java-format (tabs → spaces, line wrapping, javadoc list markup) so spotlessCheck passes in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add benchmark results to HashtableBenchmark header

a534e4f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dougqh commented May 19, 2026

View reviewed changes

Comment thread dd-trace-core/src/main/java/datadog/trace/common/metrics/Aggregator.java

dougqh and others added 3 commits May 19, 2026 13:41

dougqh and others added 3 commits May 19, 2026 13:58

Merge branch 'dougqh/util-hashtable' into dougqh/optimize-metric-key

7ab47f7

dougqh and others added 4 commits May 21, 2026 15:41

Merge branch 'dougqh/conflating-metrics-background-work' into dougqh/…

200028d

…optimize-metric-key

sarahchen6 reviewed May 22, 2026

View reviewed changes

Comment thread dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java

amarziali reviewed May 22, 2026

View reviewed changes

dougqh commented May 22, 2026

View reviewed changes

Comment thread dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java Outdated

dougqh commented May 22, 2026

View reviewed changes

Comment thread dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java Outdated

dougqh and others added 6 commits May 22, 2026 08:42

dougqh commented May 22, 2026

View reviewed changes

Comment thread dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java

dougqh and others added 5 commits May 22, 2026 09:33

dougqh mentioned this pull request May 22, 2026

Memory-efficiency pass on ClientStatsAggregator + adversarial benchmark #11389

Draft

5 tasks

sarahchen6 reviewed May 22, 2026

View reviewed changes

sarahchen6 approved these changes May 22, 2026

View reviewed changes

Conversation

dougqh commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What Does This Do

Motivation

Additional Notes

1. Add AggregateTable + AggregateEntry backed by Hashtable

2. Swap Aggregator to use AggregateTable + route disable() clear through a ClearSignal

3. Fold MetricKey + AggregateMetric into AggregateEntry

Review polish

Additional cleanups

Benchmarks

Producer publish() latency (single-threaded, 2 forks × 5 iter × 15s)

AdversarialMetricsBenchmark (8 producer threads, 2×15s warmup + 5×15s, 1 fork)

Cardinality-isolation companions (8 producer threads, 2×15s warmup + 5×15s, 1 fork)

Test plan

Uh oh!

Uh oh!

datadog-datadog-prod-us1-2 Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Warnings

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dougqh May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dougqh May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dougqh May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dougqh May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sarahchen6 May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sarahchen6 May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dougqh commented May 15, 2026 •

edited

Loading

1. Add `AggregateTable` + `AggregateEntry` backed by `Hashtable`

2. Swap `Aggregator` to use `AggregateTable` + route `disable()` clear through a `ClearSignal`

3. Fold `MetricKey` + `AggregateMetric` into `AggregateEntry`

Producer `publish()` latency (single-threaded, 2 forks × 5 iter × 15s)

`AdversarialMetricsBenchmark` (8 producer threads, 2×15s warmup + 5×15s, 1 fork)

datadog-datadog-prod-us1-2 Bot commented May 19, 2026 •

edited

Loading

dougqh May 22, 2026 •

edited

Loading

dougqh May 22, 2026 •

edited

Loading

dougqh May 22, 2026 •

edited

Loading

dougqh May 22, 2026 •

edited

Loading

sarahchen6 May 22, 2026 •

edited

Loading

sarahchen6 May 22, 2026 •

edited

Loading

sarahchen6 May 22, 2026 •

edited

Loading

sarahchen6 left a comment •

edited

Loading