Skip to content

Update client-side stats to use light weight Hashtable#11382

Open
dougqh wants to merge 89 commits into
dougqh/conflating-metrics-background-workfrom
dougqh/optimize-metric-key
Open

Update client-side stats to use light weight Hashtable#11382
dougqh wants to merge 89 commits into
dougqh/conflating-metrics-background-workfrom
dougqh/optimize-metric-key

Conversation

@dougqh
Copy link
Copy Markdown
Contributor

@dougqh dougqh commented May 15, 2026

What Does This Do

Replaces the MetricKey based HashMap with a new AggregateTable based on the light Hashtable

Motivation

By using the light Hashtable, I'm able to avoid the biggest source of allocation in client-side stats: MetricKey

Hashtable provides utilities for searching the entries without constructing a new composite key object

First, the components are hashed together to find the corresponding bucket
Then the bucket can be traversed to see if the entries match the key components

And the custom Entry can hold multiple fields that comprise the data / metadata needed for metric aggregation and eviction policy

The end result is that both MetricKey and AggregateMetric can be merged into a single class AggregateEntry that is only constructed when there's no existing matching entry

Additional Notes

Stacked on top of #11381 (dougqh/conflating-metrics-background-work). #11409 — the Hashtable utility this PR builds on — landed in master separately, so the diff shown here is only the AggregateTable / AggregateEntry / ClearSignal work plus follow-up cleanups.

Restructures the consumer-side aggregate store. Three logical commits, intended to be reviewed in order:

1. Add AggregateTable + AggregateEntry backed by Hashtable

Introduces a multi-key hash table that lets the consumer thread look up the {labels → counters} entry directly from a SpanSnapshot's raw fields — no MetricKey allocation per snapshot, no per-snapshot UTF8 cache lookups, no CHM operations. Hot-path lookup is keyHash computeHashtable.Support.bucket → bucket walk → matches(keyHash, snapshot) → returned entry has the counters to mutate in place.

This commit is standalone — no call sites yet, only the new classes + unit tests for hit/miss/cap-overrun/expunge/clear behavior.

2. Swap Aggregator to use AggregateTable + route disable() clear through a ClearSignal

Replaces LRUCache<MetricKey, AggregateMetric> with AggregateTable in Aggregator. Drops the AggregateExpiry listener — drop reporting (onStatsAggregateDropped) moves to the cap-overrun path inside Drainer.accept.

Threading fix bundled here: ConflatingMetricsAggregator.disable() used to call aggregator.clearAggregates() and inbox.clear() directly from the Sink's IO callback thread, racing with the aggregator thread. That race was tolerable for LinkedHashMap (worst case = corrupted internal state right before everything got cleared anyway); it's not tolerable for Hashtable (chain corruption can NPE or loop). disable() now offers a ClearSignal to the inbox so the aggregator thread itself performs the clear — preserves the single-writer invariant for AggregateTable end-to-end. The offer is best-effort; the system self-heals on a subsequent downgrade cycle if the inbox happens to be full (commented at the call site).

Cap-overrun semantic change: the old LRUCache evicted least-recently-used in O(1). AggregateTable instead scans for a hitCount==0 entry to recycle (O(N) worst-case), and drops the new key if none exists. Practical impact: in steady state, an unrelated burst of new keys gets dropped (and reported via onStatsAggregateDropped) rather than evicting established keys. The cost trade-off is commented at the eviction site — eviction is expected rare because the cap is sized to the working set; cursor-caching is the future option if a workload runs persistently at cap. The existing test that asserted "service0 evicted in favor of service10" is updated to assert the new semantics; the other cap-related test ("evicted entry was already flushed") still passes unchanged.

3. Fold MetricKey + AggregateMetric into AggregateEntry

MetricKey existed for two reasons — being the LRUCache key (replaced by AggregateTable's Hashtable mechanics) and being the labels arg to MetricWriter.add (the only thing left). AggregateMetric was the counter/histogram counterpart. Folds both onto a single AggregateEntry (10 UTF8 label fields + 3 primitives + counters + histograms), changes MetricWriter.add(MetricKey, AggregateMetric)add(AggregateEntry), and deletes MetricKey.java + MetricKeys.java + AggregateMetric.java.

The 12 UTF8 caches that used to be split between MetricKey (9) and ConflatingMetricsAggregator (3, with overlap) are consolidated on AggregateEntry. One cache per field type now.

Latent bug fix: the prior matches(SpanSnapshot) used Objects.equals on raw fields. If the same logical key was delivered once as String and once as UTF8BytesString (different CharSequence impls of identical content), Objects.equals returned false and the table would split into two entries for the same key. The new matches uses content-equality (UTF8BytesString.toString() returns the underlying String in O(1)), collapsing them correctly.

Test impact: AggregateEntry.of(...) mirrors the prior new MetricKey(...) positional args, so test diffs are mostly mechanical. About 56 test sites migrated across ConflatingMetricAggregatorTest, SerializingMetricWriterTest, and MetricsIntegrationTest.

Review polish

Follow-up commits address review feedback:

  • Use Hashtable.Support.create(maxAggregates, Support.MAX_RATIO) + Support.bucket + Support.insertHeadEntry(buckets, keyHash, entry) + Support.mutatingTableIterator to delegate to the helpers added on Add Hashtable and LongHashingUtils utilities #11409 — drops ~50 lines of bespoke bucket-array code.
  • Inline the report-time forEach lambda instead of a static BiConsumer constant (the JIT reuses non-capturing lambdas).
  • AggregateEntry.matches(long keyHash, SpanSnapshot) overload that pre-checks the hash, so chain walks read as one call.
  • @Nullable (javax.annotation) annotations on the four nullable label fields + their getters + of(...) parameters.
  • Objects.equals import in AggregateEntry.equals() (no more fully-qualified refs).
  • Design-trade-off comments on evictOneStale (O(N) scan rationale) and disable() (best-effort offer rationale).

Additional cleanups

A second round of review surfacing landed these:

  • MetricsIntegrationTest .aggregate runtime bug — the legacy entry.aggregate.recordDurations(...) form compiled under Groovy's dynamic dispatch but would have thrown MissingPropertyException at runtime. Fixed to call recordDurations directly on the entry. (Bot-flagged.)
  • Spock >> { closure } no-op assertions in ConflatingMetricAggregatorTest — the >> operator stubs a return value, so the closures verifying e.getHitCount() == X && ... were being evaluated and discarded. Wrapped 31 sites in assert so Groovy power-assert surfaces mismatches. All 41 tests still pass, so the previously-unverified assertions happened to hold. (Bot-flagged.)
  • Drop dead recordDurations(int, AtomicLongArray) batch API — vestige of master's Batch design. Production now only calls recordOneDuration(long). Migrated the three remaining test callers (AggregateEntryTest, SerializingMetricWriterTest, MetricsIntegrationTest) to loops of recordOneDuration calls, then deleted the batched method and its AtomicLongArray imports.
  • AggregateEntry.of() colon-split Javadoc warning — the test factory recovers (name, value) pairs from "name:value" strings by splitting at the first :, which is brittle if a peer-tag value contains a colon (URLs, IPv6, service:env). Added an explicit warning so callers know to keep test data colon-free in values.
  • ConflatingMetricsAggregatorDisableTest — new JUnit 5 coverage for the disable() → ClearSignal threading routing. The test fires DOWNGRADED from the test thread, waits for the no-flush window, then publishes a marker span with a distinct resource name and asserts the next flush captures only the marker — proving CLEAR actually wiped the original entry from the table. Catches both the missing-clear regression and the bucket-chain-corruption regression that the original threading race could produce.

Benchmarks

Producer publish() latency (single-threaded, 2 forks × 5 iter × 15s)

Prior commit (stacked base) This PR
SimpleSpan bench 3.116 µs/op 3.123 µs/op
DDSpan bench 2.506 µs/op 2.412 µs/op

All within noise on the producer side — this PR is a consumer-side refactor, so producer publish() shouldn't move much. The structural wins (one less class, no per-snapshot MetricKey allocation on the consumer, no double-cache lookups, smaller per-entry footprint) only become visible when the consumer is hammered hard enough that snapshot processing rate matters. That's exactly what the adversarial bench measures.

AdversarialMetricsBenchmark (8 producer threads, 2×15s warmup + 5×15s, 1 fork)

Same benchmark used on #11381 (high-cardinality (service, operation, resource, peer.hostname) per op, random durations across 1ns–1s, random error/topLevel flags — designed to saturate every capacity bound at once).

master #11381 (parent) this PR
Throughput avg (ops/s) 395,806 ± 2,619,133 4,889,660 ± 390,175 27,915,800 ± 1,219,470
Per-iteration progression (ops/s) warmup 2,536,145 → 205,314 → 95,888 → 47,301 → 24,378 4,886,778 → 4,875,195 → 4,731,827 → 4,959,992 → 4,994,511 28,043,909 → 28,112,828 → 27,354,721 → 28,074,395 → 27,993,147
Stdev (ops/s) 680,180 101,327 316,692
onStatsInboxFull (drops at handoff) n/a (no inbox on master) 199,862,634 2,893,855,052
onStatsAggregateDropped 11,642,039 84,002,323 16,301,696

~70× faster than master, ~5.7× faster than #11381 alone. The producer fast-path drop check (inbox.size() >= inbox.capacity()) returns immediately when the inbox is saturated, so when the consumer can drain ~5× faster (this PR's consumer-side win), the producer spins through publish() at the fast-path rate and inbox-full drops climb correspondingly (200M on #11381 → 2.9 B here). That's the design working as intended: under attack, load shedding happens at the cheapest place in the pipeline.

The aggregate-cache drop count actually fell (84M → 16M) because the faster consumer keeps the table from staying at cap as often — fewer snapshots arrive at a cap-overrun state with no stale entry to recycle.

Per-iteration shape stays flat (27.4M–28.1M ops/s, no warmup → measurement degradation), confirming the design holds steady-state under sustained load.

Cardinality-isolation companions (8 producer threads, 2×15s warmup + 5×15s, 1 fork)

HighCardinalityResourceMetricsBenchmark and HighCardinalityPeerMetricsBenchmark (added in #11381) pin every dimension except one to attribute throughput deltas to a specific axis.

HighCardinalityResourceMetricsBenchmark — only resource varies (~1M distinct), service/operation/peer.hostname pinned:

master #11381 (parent) this PR
Throughput avg (ops/s) 5,089,853 ± 587,996 8,983,726 ± 2,677,454 28,173,068 ± 485,923
Per-iter ops/s (5 measurement) 5,277K → 5,014K → 4,948K → 4,978K → 5,232K 8,043K → 8,844K → 9,962K → 9,217K → 8,852K 28,147K → 28,391K → 28,094K → 28,079K → 28,155K
onStatsInboxFull 0 (no inbox) 611,612,206 2,904,235,942
onStatsAggregateDropped 12,379,590 325,582,103 17,033,533

HighCardinalityPeerMetricsBenchmark — only peer.hostname varies (~32K distinct), service/operation/resource pinned:

master #11381 (parent) this PR
Throughput avg (ops/s) 6,335,821 ± 1,068,632 9,255,788 ± 2,933,935 30,651,802 ± 1,108,285
Per-iter ops/s (5 measurement) 6,552K → 6,510K → 6,429K → 6,323K → 5,865K 9,137K → 10,258K → 9,752K → 8,794K → 8,337K 30,876K → 30,722K → 30,852K → 30,644K → 30,166K
onStatsInboxFull 0 744,795,870 3,107,668,314
onStatsAggregateDropped 7,039,180 208,556,622 17,532,646

+454 % over master / +214 % over #11381 on the resource axis; +384 % / +231 % on the peer axis. The pattern is consistent with the adversarial run: master is sensitive to which dimension is wide (resource at ~1M hits pending/keys CHM contention harder than peer.hostname at ~32K); this PR runs at ~28–30 M ops/s on either bench because the consumer is no longer the bottleneck. Variance also collapses (±486K / ±1.1M here vs ±2.7M / ±2.9M on #11381), reflecting that the new hashtable hits steady-state immediately with no JIT warmup tail.

The drop-counter shape on this PR flips even further toward inbox-full (170× ratio against aggregate-dropped on both benches), confirming that the consumer drains so fast the LRU aggregate cache barely needs to evict.

Net code delta: +1280 / −903 = +377 lines across 16 files. The growth is dominated by new test coverage (AggregateTableTest, AggregateEntryTest) plus the consolidated UTF8 caches landing on AggregateEntry; the production-code core (less MetricKey + MetricKeys + AggregateMetric minus AggregateEntry's additions) is roughly flat.

Test plan

  • ./gradlew :dd-trace-core:test --tests 'datadog.trace.common.metrics.*' passes (incl. the new AggregateTableTest and AggregateEntryTest)
  • ./gradlew :dd-trace-core:compileJava :dd-trace-core:compileTestGroovy :dd-trace-core:compileJmhJava :dd-trace-core:compileTraceAgentTestGroovy all green
  • ./gradlew spotlessCheck clean
  • CI muzzle / integration suites
  • Validate stats.dropped_aggregates semantics at high cardinality (especially the new "drop new on cap overrun" path vs. the old "evict LRU" path)

🤖 Generated with Claude Code

@dougqh dougqh added type: enhancement Enhancements and improvements comp: core Tracer core tag: performance Performance related changes tag: no release notes Changes to exclude from release notes comp: metrics Metrics tag: ai generated Largely based on code generated by an AI or LLM labels May 15, 2026
Two general-purpose utilities used by the client-side stats aggregator
work (PR #11382 and follow-ups), extracted into their own change so the
metrics-specific PRs can build on a smaller, reviewable foundation.

  - Hashtable: a generic open-addressed-ish bucket table abstraction
    keyed by a 64-bit hash, with a public abstract Entry type so client
    code can subclass it for higher-arity keys. The metrics aggregator
    uses it to back its AggregateTable.

  - LongHashingUtils: chained 64-bit hash combiners with primitive
    overloads (boolean, short, int, long, Object). Used in place of
    varargs combiners to avoid Object[] allocation and boxing on the
    hot path.

No callers within internal-api itself yet -- the metrics aggregator PR
will introduce the first usages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dougqh and others added 3 commits May 18, 2026 15:18
Standalone classes for swapping the consumer-side LRUCache<MetricKey,
AggregateMetric> with a multi-key Hashtable in the next commit. No call sites
use them yet.

- AggregateEntry extends Hashtable.Entry, holds the canonical MetricKey, the
  mutable AggregateMetric, and copies of the 13 raw SpanSnapshot fields for
  matches(). The 64-bit lookup hash is computed via chained
  LongHashingUtils.addToHash calls (no varargs, no boxing of short/boolean).
- AggregateTable wraps a Hashtable.Entry[] from Hashtable.Support.create.
  findOrInsert(SpanSnapshot) walks the bucket comparing raw fields, falling
  back to MetricKeys.fromSnapshot on a true miss. On cap overrun, it scans
  for an entry with hitCount==0 and unlinks it; if none, it returns null and
  the caller drops the data point.
- MetricKeys.fromSnapshot extracts the canonicalization logic (DDCache
  lookups + UTF8 encoding) from Aggregator.buildMetricKey, so the helper can
  be called from AggregateTable on miss.

This also commits Hashtable and LongHashingUtils (added earlier, previously
uncommitted) and lifts Hashtable.Entry / Hashtable.Support visibility so
client code outside datadog.trace.util can build higher-arity tables -- the
case the javadoc describes but the original visibility didn't actually
support. Specifically: Entry is now public abstract with a protected ctor;
keyHash, next(), and setNext() are public; Support's create / clear /
bucketIndex / bucketIterator / mutatingBucketIterator methods are public.

Tests: AggregateTableTest covers hit, miss, distinct-by-spanKind, peer-tag
identity (including null vs non-null), cap overrun with stale victim, cap
overrun with no victim (returns null), expungeStaleAggregates, forEach,
clear, and that the canonical MetricKey is built at insert.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace LRUCache<MetricKey, AggregateMetric> with the AggregateTable added
in the prior commit. The hot path in Drainer.accept becomes:

  AggregateMetric aggregate = aggregates.findOrInsert(snapshot);
  if (aggregate != null) {
      aggregate.recordOneDuration(snapshot.tagAndDuration);
      dirty = true;
  } else {
      healthMetrics.onStatsAggregateDropped();
  }

On the steady-state hit path the lookup is a 64-bit hash compute + bucket
walk + matches(snapshot) -- no MetricKey allocation, no SERVICE_NAMES /
SPAN_KINDS / PEER_TAGS_CACHE lookups. The canonical MetricKey is now built
once per unique key at insert time, in MetricKeys.fromSnapshot.

Behavioral change in the cap-overrun path
-----------------------------------------

The old LRUCache evicted least-recently-used: at cap, a new insert would
push out the oldest entry regardless of whether it was live or stale.
AggregateTable instead scans for a hitCount==0 entry to recycle, and drops
the new key if none exists. Practical impact: in the common case where
the table holds a stable set of recurring keys, an unrelated burst of new
keys is dropped (and reported via onStatsAggregateDropped) rather than
evicting the established keys. The existing test that asserted "service0
evicted in favor of service10" is updated to assert the new semantics.
The other cap-related test ("should not report dropped aggregate when
evicted entry was already flushed") still passes unchanged: after report()
clears all entries to hitCount=0, the next wave of inserts recycles them.

Threading fix
-------------

ConflatingMetricsAggregator.disable() used to call aggregator.clearAggregates()
and inbox.clear() directly from the Sink's IO event thread, racing with the
aggregator thread mid-write. The race was tolerable for LinkedHashMap; it
is not for AggregateTable (chain corruption can NPE or loop). disable()
now offers a ClearSignal to the inbox so the aggregator thread itself
performs the table clear and the inbox.clear(). Adds one SignalItem
subclass + one branch in Drainer.accept; preserves the single-writer
invariant for AggregateTable end-to-end.

Removed: LRUCache import, AggregateExpiry inner class, the static
buildMetricKey / materializePeerTags / encodePeerTag helpers (now in
MetricKeys).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MetricKey existed for two reasons -- the prior LRUCache key role (now handled
by AggregateTable's Hashtable.Entry mechanics) and as the labels argument
to MetricWriter.add. The first is gone; the second is the only thing keeping
MetricKey alive. Fold its UTF8-encoded label fields onto AggregateEntry,
change MetricWriter.add to take AggregateEntry directly, and delete
MetricKey + MetricKeys.

What AggregateEntry now holds
-----------------------------

- 10 UTF8BytesString label fields (resource, service, operationName,
  serviceSource, type, spanKind, httpMethod, httpEndpoint, grpcStatusCode,
  and a List<UTF8BytesString> peerTags for serialization).
- 3 primitives (httpStatusCode, synthetic, traceRoot).
- AggregateMetric (the value being accumulated).
- The raw String[] peerTagPairs is retained alongside the encoded peerTags
  -- matches() compares it positionally against the snapshot's pairs; the
  encoded form is only consumed by the writer.

matches(SpanSnapshot) compares the entry's UTF8 forms to the snapshot's raw
String / CharSequence fields via content-equality (UTF8BytesString.toString()
returns the underlying String in O(1)). This closes a latent bug in the
prior raw-vs-raw matches(): if one snapshot delivered a tag value as String
and a later snapshot delivered the same content as UTF8BytesString, the old
Objects.equals would return false and the table would split into two
entries. Content-equality matching collapses them into one.

Consolidated caches
-------------------

The static UTF8 caches that used to live partly on MetricKey (RESOURCE_CACHE,
OPERATION_CACHE, SERVICE_SOURCE_CACHE, TYPE_CACHE, KIND_CACHE,
HTTP_METHOD_CACHE, HTTP_ENDPOINT_CACHE, GRPC_STATUS_CODE_CACHE, SERVICE_CACHE)
and partly on ConflatingMetricsAggregator (SERVICE_NAMES, SPAN_KINDS,
PEER_TAGS_CACHE) are all now on AggregateEntry. The split was duplicating
work -- SERVICE_NAMES and SERVICE_CACHE both cached service-name to
UTF8BytesString. One cache per field now.

API change: MetricWriter.add
----------------------------

Was: add(MetricKey key, AggregateMetric aggregate)
Now: add(AggregateEntry entry)

The aggregate lives on the entry. Single-arg.

SerializingMetricWriter reads the same UTF8 fields off AggregateEntry that it
previously read off MetricKey; the wire format is byte-identical.

Test impact
-----------

AggregateEntry.of(...) takes the same 13 positional args new MetricKey(...)
took, so test diffs are mostly mechanical:
  new MetricKey(args) -> AggregateEntry.of(args)
  writer.add(key, _)  -> writer.add(entry)

ValidatingSink in SerializingMetricWriterTest now iterates List<AggregateEntry>
directly. ConflatingMetricAggregatorTest's Spock matchers (~36 sites) rely
on AggregateEntry.equals comparing the 13 label fields (not the aggregate)
so the mock matches by labels regardless of the aggregate state at call time;
post-invocation closures verify aggregate state.

Benchmarks (2 forks x 5 iter x 15s)
-----------------------------------

The change is consumer-thread only; producer publish() is unchanged.

  SimpleSpan bench:   3.123 +- 0.025 us/op   (prior: 3.119 +- 0.018)
  DDSpan bench:       2.412 +- 0.022 us/op   (prior: 2.463 +- 0.041)

Both within noise -- the win is structural (one less class, one less
allocation per miss, one fewer cache layer) rather than benchmarked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dougqh dougqh force-pushed the dougqh/optimize-metric-key branch from 050a998 to 3738c85 Compare May 18, 2026 19:20
@dougqh dougqh changed the base branch from dougqh/conflating-metrics-background-work to dougqh/util-hashtable May 18, 2026 19:21
dougqh and others added 8 commits May 18, 2026 15:40
LongHashingUtilsTest (14 cases):
  - hashCodeX null sentinel + non-null pass-through
  - all primitive hash() overloads match the boxed Java hashCodes
  - hash(Object...) 2/3/4/5-arg overloads match the chained addToHash
    formula they are documented to constant-fold to
  - addToHash(long, primitive) overloads match the Object-version
  - linear-accumulation invariant (31 * h + v) holds across a sequence
  - iterable / deprecated int[] / deprecated Object[] variants match
    chained addToHash
  - intHash treats null as 0 (observable via hash(null, "x"))

HashtableTest (24 cases across 5 nested classes):
  - D1: insert/get/remove/insertOrReplace/clear/forEach, in-place value
    mutation, null-key handling, hash-collision chaining with disambig-
    uating equals, remove-from-collided-chain leaves siblings intact
  - D2: pair-key identity, remove(pair), insertOrReplace matches on
    both parts, forEach
  - Support: capacity rounds up to a power of two, bucketIndex stays
    in range across a wide hash sample, clear nulls every slot
  - BucketIterator: walks only matching-hash entries in a chain, throws
    NoSuchElementException when exhausted
  - MutatingBucketIterator: remove from head-of-chain unlinks, replace
    swaps the entry while preserving chain, remove() without prior
    next() throws IllegalStateException

Tests live in internal-api/src/test/java/datadog/trace/util and use the
already-present JUnit 5 setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bring the new util/ files in line with google-java-format
(tabs → spaces, line wrapping, javadoc list markup) so
spotlessCheck passes in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Compares Hashtable.D1 and Hashtable.D2 against equivalent HashMap
usage for add, update, and iterate operations. Each benchmark thread
owns its own map (Scope.Thread), but @threads(8) is used so the
allocation/GC pressure that Hashtable is designed to avoid surfaces
in the throughput numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Guard Support.sizeFor against overflow and use Integer.highestOneBit;
  reject capacities above 1 << 30 instead of looping forever.
- Add braces around single-statement while bodies in BucketIterator.
- Split HashtableBenchmark into HashtableD1Benchmark / HashtableD2Benchmark.
- Add regression tests for Support.sizeFor bounds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 5-arg Object overload was forwarding only obj0..obj3 to the int
overload, silently dropping obj4. Also align LongHashingUtils.hash 3-arg
signature with its 2/4/5-arg siblings (int parameters) and strengthen
the 5-arg HashingUtilsTest to detect the missing-arg regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Split D1Tests and D2Tests into HashtableD1Test and HashtableD2Test;
  extract shared test entry classes into HashtableTestEntries.
- Reduce visibility of LongHashingUtils.hash(int...) chaining overloads
  to package-private; they are internal building blocks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The iterator tests need a populated Hashtable.Entry[] to drive
Support.bucketIterator / mutatingBucketIterator. Relaxing D1.buckets
from private to package-private lets the same-package tests read it
directly, removing the reflection helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dougqh and others added 3 commits May 19, 2026 13:41
The label fields and the mutable counters/histograms are 1:1 with each
entry; carrying them on a separate object meant one extra allocation per
unique key plus an indirection on every hot-path update. Merging them
puts the counters directly on AggregateEntry, drops the entry.aggregate
hop, and consolidates ERROR_TAG / TOP_LEVEL_TAG onto the same class the
consumer uses to decode them.

AggregateTable.findOrInsert now returns AggregateEntry. Callers in
Aggregator and SerializingMetricWriter updated. Migrated
AggregateMetricTest.groovy to AggregateEntryTest.java per project policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a context-passing forEach(T, BiConsumer) overload to AggregateTable,
mirroring TagMap's pattern. Aggregator.report now hands the writer in as
context to a static BiConsumer so no fresh Consumer is allocated each
report cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the TagMap pattern: pairs the existing forEach(Consumer) with a
forEach(T context, BiConsumer<T, TEntry>) overload so callers can hand
side-band state to a non-capturing lambda and avoid the
fresh-Consumer-per-call allocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@datadog-datadog-prod-us1-2
Copy link
Copy Markdown
Contributor

datadog-datadog-prod-us1-2 Bot commented May 19, 2026

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 2 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-java | agent_integration_tests   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). 4 failed tests due to java.lang.IllegalAccessError at MetricsIntegrationTest.groovy:45.

DataDog/apm-reliability/dd-trace-java | check_base   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). CodeNarc rule violations found. Exceeded maximum number of priority 3 violations during task ':dd-trace-core:codenarcTraceAgentTest'.

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 4a4b72e | Docs | Datadog PR Page | Give us feedback!

dougqh and others added 3 commits May 19, 2026 13:58
Factors the unchecked (TEntry) cast out of D1.forEach / D2.forEach (and
the BiConsumer variants) into Support.forEach(buckets, ...). The cast
now lives in one place, mirroring how Entry.next() handles it, and the
D1/D2 methods become one-liners. Downstream higher-arity tables built
on Support gain the same helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that Hashtable.Support exposes the parameterized forEach helpers,
AggregateTable's own forEach methods can drop their duplicated loop body
and the (AggregateEntry) cast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dougqh and others added 4 commits May 21, 2026 15:41
This method was a vestige of master's Batch design where multiple
producer threads wrote into an AtomicLongArray slot concurrently and
the aggregator drained ~64 durations per Batch in one call. The new
producer/consumer split publishes one SpanSnapshot per span, so
production only ever calls recordOneDuration(long).

Migrate the three remaining callers (AggregateEntryTest,
SerializingMetricWriterTest, MetricsIntegrationTest) to a loop of
recordOneDuration(long) calls, then delete the batched method and its
AtomicLongArray imports.

Drops the recordDurationsIgnoresTrailingZeros test -- that behavior
was a specific quirk of the batched API (count parameter shorter than
the array length) and doesn't apply to recordOneDuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The factory recovers (name, value) pairs from pre-encoded "name:value"
strings by splitting at the FIRST colon. Test-only, but worth being
explicit so callers don't hand it a peer-tag value containing a colon
(URLs, IPv6, service:env) and get a silently wrong (name, value) pair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bundled fix in this PR routes the agent-downgrade clear through
the inbox so the aggregator thread stays the sole writer to
AggregateTable. Prior to this test, there was no regression coverage
for that routing.

The test fires DOWNGRADED from the test thread (production-like
OkHttpSink callback path), waits for the immediate no-flush window,
then publishes a marker span with a distinct resource name. The
subsequent report's writer.add captor must see only the marker -- if
CLEAR didn't actually wipe the original entry, the original
"resource" would still be present and the assertion would catch it.

Cannot directly verify thread identity of the clear from inside this
test (CLEAR's inbox.clear() drops any latch signal we'd queue behind
it), so this is an observable-contract test rather than a strict
thread-id test. Still catches both the missing-clear regression and
the bucket-chain-corruption regression that the original threading
race could produce.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dougqh added a commit that referenced this pull request May 21, 2026
…dinality

Resolved conflicts:
- AggregateEntry.java: kept #11387's cardinality-handler constructor shape;
  dropped the dead recordDurations(int, AtomicLongArray) batch API and its
  AtomicLongArray import (the #11382 cleanup). The of() colon-split
  warning from #11382 doesn't apply because #11387's of() takes peerTags
  as a pre-built list (no colon-splitting on test reconstruction).
- ClientStatsAggregator.java: kept #11387's PeerTagSchema.of(...,
  healthMetrics) 3-arg signature; preserved #11382's read-order race fix
  (lastTimeDiscovered read before peerTags); preserved the
  resetCardinalityHandlers method from HEAD.
- ClientStatsAggregatorTest.groovy: kept HEAD's cardinality-focused
  tests; applied the #11382 Spock-assert fix (wrap '>> { closure }'
  body expressions in assert) to all 31 sites so power-assert surfaces
  mismatches.
- ClientStatsAggregatorBootstrapTest.java: renamed the
  reconcileSwapsSchemaWhenTagSetChanges test's ConflatingMetricsAggregator
  references to ClientStatsAggregator.
- ConflatingMetricsAggregatorDisableTest.java: renamed file +
  references to ClientStatsAggregatorDisableTest with corresponding
  class-name change.
- AdversarialMetricsBenchmark.java: updated ConflatingMetricsAggregator
  + ConflatingMetricsAggregatorBenchmark references to ClientStatsAggregator
  / ClientStatsAggregatorBenchmark.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dougqh added a commit that referenced this pull request May 21, 2026
…ory-efficiency

Resolved conflicts:
- AdversarialMetricsBenchmark.java: kept #11389's Javadoc (specific to
  AggregateTable + cardinality handlers), applied #11382's LongAdder
  + concise tearDown cleanup.
- AggregateEntry.java: dropped recordDurations(int, AtomicLongArray)
  per the #11382 cleanup (no longer used after migrating the test
  callers to recordOneDuration).
- MetricsIntegrationTest.groovy: migrated to per-duration
  recordOneDuration calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
private AggregateEntry(SpanSnapshot s, long keyHash) {
super(keyHash);
this.resource = canonicalize(RESOURCE_CACHE, s.resourceName);
this.service = SERVICE_CACHE.computeIfAbsent(s.serviceName, UTF8_ENCODE);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before the service name was also set to EMPTY when null. I think this in reality cannot happen but perhaps it's worth going through canonicalize also for this

Copy link
Copy Markdown
Contributor Author

@dougqh dougqh May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I need to revisit the null vs EMPTY story. As I recall the null handling was already a mix of two approaches before, but we should fix it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Claude (posted on dougqh's behalf):

Done in d1749389b9. service and spanKind now both go through canonicalize like every other label field — null maps to EMPTY instead of NPE'ing at the cache call. Dropped the Functions.UTF8_ENCODE import too since the canonicalize helper handles the create.

Pinned by AggregateTableTest.nullServiceAndSpanKindDoNotNpeAndCollapseWithEmpty which publishes (null, null), (null, null) again, and ("", ""), asserts a single entry and length-0 service/spanKind. Without the fix, the first publish would NPE'd at .computeIfAbsent.

}
}

public void clearAggregates() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks now a dead method. Given that is not thread safe wrt the underlying hashtable I would remove it or reduce its visibility

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Claude (posted on dougqh's behalf):

Deleted in 21e75452d6 — the method had no callers after the disable() rewrite, and you're right that exposing a thread-unsafe mutator from outside the inbox routing would invite the exact race that motivated the ClearSignal path. The comment in ConflatingMetricsAggregator that named it has been updated to describe the inbox-routed clear instead.

h = LongHashingUtils.addToHash(h, s.synthetic);
h = LongHashingUtils.addToHash(h, s.traceRoot);
h = LongHashingUtils.addToHash(h, s.spanKind);
if (s.peerTagSchema != null && s.peerTagValues != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: this null check makes the equals not symmetrical since we're skipping here the field from including into the hash - also combinations with different nulls on peerTagSchema/peerTagValues can produce the same hash theoretically

Copy link
Copy Markdown
Contributor Author

@dougqh dougqh May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at this point I should just get rid of equals & hashCode. They aren't really used by the new table structure.

As for the nulls in the hash code, yes, that can lead to hash collisions, but that's fine.
That said, I do think I'll rework the hashOf routine to be a bit lighter weight. I may or may not change the null handling when I do so.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Claude (posted on dougqh's behalf):

Both halves of this are now fixed:

  • Hash side (80778c4e87): dropped the conditional null-skip. hashOf now unconditionally mixes in LongHashingUtils.addToHash(h, s.peerTagSchema) and Arrays.hashCode(s.peerTagValues). PeerTagSchema.hashCode() is content-based on names (lazily cached), so distinct schema instances with identical names still produce a stable hash. Different-null combinations on the two fields can no longer collide.

  • Equals side (2536aa2e76): AggregateEntry.equals now compares the raw peerTagNames / peerTagValues arrays via Arrays.equals, mirroring exactly what hashOf hashes. Same for be13431744 on PeerTagSchema (symmetric equals/hashCode pair on names).

Two regression tests pin the contract: AggregateEntryTest.equalsConsistentWithHashCodeAcrossDifferentSchemaLayouts (entries with ["a","b"]/[null,"x"] vs ["b","c"]/["x",null] — same encoded peer-tags ["b:x"] but different raw layouts → not equal, different hashCodes) and equalEntriesHaveEqualHashCodes (the positive case).

* persistently at cap, this is the place to consider caching a cursor across calls so the scan
* resumes where it left off.
*/
private boolean evictOneStale() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What highlighted in the comment is true. However if I'm not wrong when the aggregator start dropping (meaning that capacity is full and badly dimensioned) this will translate in a perf kill. I'm wondering if it's worth adding a field for the last scanned bucket to have the cost amortized right now

Copy link
Copy Markdown
Contributor Author

@dougqh dougqh May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify "perf kill"? Yes, this will end up doing an O(n) scan of the table, but the table is small.
Yes, that will slow down the metrics thread, but it won't kill the application.

I've primarily focused on making sure things don't fall over under an adversarial workload. In those cases, such last used caches don't work. Although, I suppose I could check a more realistic accidental case where only a single component has out of control cardinality.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was about needing to scan the full table for each span inserted that at the end cannot insert since will overflow. but perhaps I've misinterpret something

Copy link
Copy Markdown
Contributor Author

@dougqh dougqh May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll double check the eviction logic. That is worth being careful about.
Although, the next part that introduces cardinality limiters largely renders the table limit unnecessary.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the eviction to track the last bucket from one scan to the next.
To do that, I added a variation of the Hashtable iterator that scans a range of buckets.

Since only this code is using Hashtable right now, I went ahead and included that in this PR

aggregates.clear();
inbox.clear();
}
((SignalItem) item).complete();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the dirty flag be also cleared up here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I think I need to go back and review all the signaling a bit more.
There were some latent problems that already existed in the code that have been exposed during the reviews.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in the clearing of the dirty flag.

I wonder if that was a latent bug because I mostly intended just to shuffle around the existing code. It might have been that we were just getting a spurious reporting cycle previously.

Anyway, it is fixed now.

Comment thread dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java Outdated
Comment thread dd-trace-core/src/main/java/datadog/trace/common/metrics/AggregateEntry.java Outdated
dougqh and others added 6 commits May 22, 2026 08:42
The constructor canonicalizes null fields through canonicalize() which
returns UTF8BytesString.EMPTY for null inputs (or a cached
UTF8BytesString("") for empty-string inputs). But matches() compared
those entries against subsequent snapshots via contentEquals(...) /
stringContentEquals(...), which treated non-null UTF8BytesString vs
null CharSequence as inequal.

Result: two snapshots with the same null-valued resource/operation/
type/serviceSource hashed to the same bucket (intHash(null) == 0 ==
"".hashCode()), but matches() returned false on the EMPTY-vs-null
field comparison, so the second snapshot inserted a *duplicate*
entry into the table. Same path for empty-string vs null.

Unify the semantics: null and length-zero are treated as equivalent
on either side of contentEquals/stringContentEquals. The hash already
agreed (intHash(null) == "".hashCode() == 0), so this restores the
matches() contract to match the existing hash contract.

Adds AggregateTableTest.nullAndEmptyOptionalFieldsCollapseToOneEntry
to pin the contract: two null-fielded and one empty-string-fielded
snapshot must all hit the same entry. Test would have failed before
the fix (a duplicate insert) but the existing 10 cases still pass.

Resolves sarahchen6's review comment on AggregateEntry.java:113 and
amarziali's related concern on AggregateEntry.java:114.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After CLEAR runs the table is empty but dirty would still carry over
from any prior SpanSnapshot insert. The next report() would see
dirty=true, expunge no-op the empty table, find isEmpty(), and log
"skipped metrics reporting because no points have changed" -- same
observable outcome, but resetting dirty here keeps the invariant
"dirty implies there's data to flush" honest.

Resolves amarziali's review comment on Aggregator.java:121.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously hashOf wrapped the peer-tag contribution in
`if (s.peerTagSchema != null && s.peerTagValues != null)`. That meant
two snapshots with different null arrangements (schema-null vs
values-null) collapsed to the same hash, getting resolved only by the
field-by-field matches() fallback at the bucket walk -- wasteful, and
the asymmetry hurt hash quality generally.

Replace with unconditional contributions:
- PeerTagSchema now overrides hashCode() to be content-based on names
  (lazy + cached, benign-race pattern matching UTF8BytesString /
  utf8Bytes elsewhere). addToHash(h, schema) routes through that.
- For the String[] values, pass Arrays.hashCode(values) through the
  int overload -- Object[].hashCode() is identity-based by default,
  so we have to compute content hash explicitly. Null arrays hash to
  0 via Arrays.hashCode's contract.

Null inputs on either side now hash to 0 distinctly from any real
schema or non-empty values array, so all four null combinations are
distinguishable. Same final hash for content-equal inputs across
schema replacements (the reconcile path), which preserves the entry-
hit invariant after the aggregator rebuilds the schema.

Resolves amarziali's review comment on AggregateEntry.java:309 and
dougqh's suggestion on AggregateEntry.java:310.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Once the ClearSignal routing replaced the direct disable()-to-table
mutation, clearAggregates() lost all its call sites -- no production
code, no test code. Worse, leaving it public invited future callers
to bypass the ClearSignal contract and race against Drainer.accept
on the aggregator thread.

Drop the method outright. Update the inline comment in
ConflatingMetricsAggregator.disable() to not name the deleted method.

Resolves amarziali's review comment on Aggregator.java:82.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…terator

Previously AggregateTable.evictOneStale walked the bucket array from
bucket 0 on every call. Under sustained cap pressure with mostly-hot
entries clustered in low buckets, every eviction re-scanned the same
hot prefix before finding a cold entry. amarziali's review concern.

Add a cursor: after a successful eviction, remember the bucket where
it landed. The next call resumes from there. Worst case for a single
call is still O(N) when nearly every entry is hot, but a sustained
eviction stream amortizes to O(1) per call -- the hot prefix is never
re-scanned more than twice across N evictions.

Implemented as two iterators driving [cursor, length) then [0, cursor),
which required a small Hashtable.Support API addition:

- New `mutatingTableIterator(buckets, startBucket, endBucket)` overload
  for walking a half-open bucket range. The existing zero-arg overload
  is kept; it now delegates to the new ctor with [0, buckets.length).
- New `MutatingTableIterator.currentBucket()` accessor exposing the
  bucket index of the entry last returned by next() (or -1 before any
  next/after a remove). AggregateTable saves this as the new cursor.
- The empty-range case (startBucket == endBucket) yields an
  immediately-exhausted iterator -- this is what makes the wrap-around
  pass [0, cursor) naturally produce nothing when cursor == 0, so the
  two-pass driver in evictOneStale needs no special case.

Tests:
- 4 new HashtableTest cases covering the half-open API, empty ranges,
  out-of-range bounds, and currentBucket() behavior before/after next.
- 2 new AggregateTableTest cases: backToBackEvictionsAllSucceed (drives
  3x capacity worth of cap-overrun inserts; each must succeed, which
  only holds if the cursor advances correctly) and
  clearResetsCursorForSubsequentEvictions (clear() also resets the
  cursor so subsequent eviction passes start from bucket 0).

Resolves amarziali's review comment on AggregateTable.java:75.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dougqh's review comment on AggregateEntry.java:153 asked to keep test
code out of the production class. Move the factory to a new
AggregateEntries helper in src/test/java/datadog/trace/common/metrics.
Same package so it can call the package-private forSnapshot();
delegating to forSnapshot also means no need to widen the
AggregateEntry constructor visibility.

The 37 src/test/groovy call sites get a mechanical rewrite of
AggregateEntry.of(...) -> AggregateEntries.of(...) (36 in
ConflatingMetricAggregatorTest, 1 in SerializingMetricWriterTest).

src/traceAgentTest is a separate source set without compile-time
visibility into src/test, so its 2 MetricsIntegrationTest.groovy call
sites can't use AggregateEntries. Migrated those to construct a
SpanSnapshot inline + call AggregateEntry.forSnapshot(snapshot).
Groovy's permissive package-private access makes this work from the
default package the integration test currently sits in.

Resolves dougqh's review comment on AggregateEntry.java:153.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dougqh and others added 5 commits May 22, 2026 09:33
equals compared the pre-encoded peerTags List<UTF8BytesString> while
hashCode (via hashOf) mixes in the raw peerTagSchema + values arrays.
Two entries built from different schema layouts can collapse to the
same encoded form -- e.g. tag "b" at index 1 in schema {a,b} with
values {null,"x"} produces the same encoded ["b:x"] as schema {b,c}
with values {"x",null}. equals returned true; hashCodes differed.
Hashcode contract violated.

Switch equals to compare the raw peerTagNames + peerTagValues arrays,
mirroring matches(SpanSnapshot) and hashOf(SpanSnapshot). The
production lookup path (AggregateTable.findOrInsert) already uses
those, so this just brings equals in line with the rest of the class.

Adds two regression tests on AggregateEntryTest:
- equalsConsistentWithHashCodeAcrossDifferentSchemaLayouts: the
  failing-case shape above. Pre-fix, the encoded-list equals returned
  true while hashCodes differed; now equals returns false and the
  hashCodes differ in agreement.
- equalEntriesHaveEqualHashCodes: positive case -- two entries from
  identical snapshots must equal and share hashCode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prior CLEAR handler called inbox.clear() as belt-and-suspenders cleanup
of in-flight snapshots. That would also erase any STOP signal queued
behind CLEAR -- a real concern in disable() -> close() sequences,
where the trampled STOP leaves the aggregator thread spinning until
thread.join's timeout. sarahchen6 surfaced this from a Codex pass on
the CLEAR logic; dougqh confirmed it's worth fixing.

The CLEAR handler now clears only the aggregates table. Queued
snapshots will drain naturally into the just-cleared table -- but
since features.supportsMetrics() is already false by the time CLEAR
was offered, producers have stopped publishing; the inbox drains and
empties on its own. Worst case: one extra reporting cycle of wasted
work on stale snapshots that the agent rejects, which triggers
another DOWNGRADED -> disable() -> CLEAR. Self-healing, same as before.

Adds ConflatingMetricsAggregatorDisableTest.clearDoesNotTrampleQueuedStopSignal:
publish a snapshot, fire DOWNGRADED, call close(); the test bounds
close() with its own 2s timeout and asserts the thread exits within
it. Pre-fix this would have hung out THREAD_JOIN_TIMEOUT_MS;
post-fix it returns in milliseconds.

Resolves sarahchen6/Codex's CLEAR-trampling-STOP review comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior commit added a content-based hashCode() but left equals
falling back to Object.equals (reference identity). That violates the
hashCode contract for any caller that compares two distinct schema
instances built from the same tag list -- e.g. before/after a
reconcile rebuilds the cached schema with an unchanged tag set.

equals() now mirrors hashCode(): content-equal on names. The reconcile-
timing field lastTimeDiscovered is intentionally excluded from both --
it's bookkeeping for the aggregator's discovery-version compare, not
part of schema identity.

Tests:
- equalsIsContentBasedOnNames -- same names, two instances, equal +
  matching hashCode.
- equalsIgnoresLastTimeDiscovered -- pins that the bookkeeping field
  doesn't leak into identity.
- equalsDistinguishesByOrder -- names is positional (pairs with
  SpanSnapshot.peerTagValues by index), so reordered schemas are not
  interchangeable.
- equalsHandlesNullAndOtherTypes -- contract corners.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AggregateEntry's constructor canonicalized resource, operationName,
type, and serviceSource (mapping null -> EMPTY via the canonicalize
helper) but called SERVICE_CACHE.computeIfAbsent / SPAN_KIND_CACHE
.computeIfAbsent directly for service and spanKind. Inputs of null
would NPE on the cache call.

Production paths never pass null for these -- DDSpan always supplies
a service, and the producer defaults spanKind to "" via
unsafeGetTag(SPAN_KIND, (CharSequence) "") -- so this is a latent-
defense fix, not a live bug. But the matches/contentEquals logic
already treats null and length-zero as equal on both sides, and every
other label field in the constructor defends via canonicalize. Two
unprotected outliers are an inconsistency that bites the next person
who reaches for a new code path.

Drops the Functions.UTF8_ENCODE import (its sole use was the service
cache line) -- canonicalize internally creates the UTF8BytesString.

Test: AggregateTableTest.nullServiceAndSpanKindDoNotNpeAndCollapseWithEmpty
publishes (null, null), (null, null) again, and ("", ""); asserts a
single entry results and that getService()/getSpanKind() are length-0.
Without the fix, the first publish would have NPE'd at the
.computeIfAbsent call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ound-work' into dougqh/optimize-metric-key

# Conflicts:
#	dd-trace-core/src/main/java/datadog/trace/common/metrics/PeerTagSchema.java
#	dd-trace-core/src/test/java/datadog/trace/common/metrics/PeerTagSchemaTest.java
return aStr.contentEquals(b);
}

private static boolean stringContentEquals(UTF8BytesString a, String b) {
Copy link
Copy Markdown
Contributor

@sarahchen6 sarahchen6 May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can remove this method and just use contentEquals above since String is a CharSequence?

* subsequent snapshot whose field is still null. {@code intHash(null) == 0 == "".hashCode()}, so
* the hash already agrees with this view.
*/
private static boolean contentEquals(UTF8BytesString a, CharSequence b) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this method could be simplified to something like

private static boolean contentEquals(UTF8BytesString a, CharSequence b) {
  if (a == null || a.length() == 0) {
    return b == null || b.length() == 0;
  }
  return b != null && a.toString().contentEquals(b);
}

* reach {@link AggregateEntry#forSnapshot(SpanSnapshot)} and the package-private {@link
* SpanSnapshot} constructor.
*/
public final class AggregateEntries {
Copy link
Copy Markdown
Contributor

@sarahchen6 sarahchen6 May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is now test-specific, it would make sense to me to have a more test-related name, instead of reading nearly identical to AggregateEntry?

// ""), but the prior contentEquals impl treated `non-null vs null` as not-equal -- so a second
// snapshot with the same null fields hashed to the same bucket but failed matches(), causing a
// spurious duplicate insert. The fix unifies null and length-zero on both sides of
// contentEquals/stringContentEquals.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this comment could be removed. It seems to explain a fix implemented during the PR iteration process and how this test case addresses that fix, but the test case can stand on its own, i.e. future code readers only need to know what the final code expected behavior is.

this may apply to the comment below too for nullServiceAndSpanKindDoNotNpeAndCollapseWithEmpty()

// Regression: prior CLEAR handler called inbox.clear(), which would erase any STOP signal
// queued behind it. close() then waited out thread.join's timeout because Drainer never saw
// the STOP and `stopped` was never set. Now the CLEAR handler clears only the aggregates
// table; queued signals (STOP, REPORT) survive and get processed normally.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here... maybe keeping the comment as just:

CLEAR handler clears only the aggregates table; queued signals (STOP, REPORT) survive and get processed normally.

basically removing references to the PR iteration process and keeping just the expected behavior / objectively helpful context in comments

PeerTagSchema b = PeerTagSchema.testSchema(new String[] {"peer.hostname", "peer.service"});

assertEquals(a, b);
assertEquals(b, a);
Copy link
Copy Markdown
Contributor

@sarahchen6 sarahchen6 May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this line seems unnecessary given 100 and 102 🤔

Copy link
Copy Markdown
Contributor

@sarahchen6 sarahchen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few nits, but otherwise looks reasonable!!

Also this was Claude's advice for the failing agent integration test:

Fix options (in order of invasiveness):
1. Add a public test-support module — extract a small AggregateEntryTestSupport factory into a test-support
jar that traceAgentTest can depend on, keeping the production classes package-private.
2. Make testSchema + forSnapshot public on their classes — simplest, but widens the API surface of
intentionally-internal classes.
3. Rewrite the test to not reference these classes directly — drive the test entirely via
ConflatingMetricsAggregator publishing real DDSpans, which avoids any dependency on the internal
AggregateEntry/PeerTagSchema types.

Option 3 is the cleanest because it also makes the integration test more realistic (it exercises the full
pipeline instead of bypassing the producer path).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp: core Tracer core comp: metrics Metrics tag: ai generated Largely based on code generated by an AI or LLM tag: no release notes Changes to exclude from release notes tag: performance Performance related changes type: enhancement Enhancements and improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants