Skip to content

Fix telemetry misattribution across connections on the same thread#1531

Open
sreekanth-db wants to merge 1 commit into
databricks:mainfrom
sreekanth-db:fix/es-1961329-stale-threadlocal-telemetry
Open

Fix telemetry misattribution across connections on the same thread#1531
sreekanth-db wants to merge 1 commit into
databricks:mainfrom
sreekanth-db:fix/es-1961329-stale-threadlocal-telemetry

Conversation

@sreekanth-db

@sreekanth-db sreekanth-db commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

Telemetry export previously read the connection context from a static ThreadLocal (DatabricksThreadContextHolder) at export time. When multiple connections are used on the same thread (e.g. a Thrift connection and a SEA connection), the ThreadLocal holds whichever connection was created last, so per-statement telemetry events were tagged with the wrong connection's context — including the transport mode (driverMode), connection UUID, and other connection parameters.

This caused a measurable data-quality issue: ~22% of distinct SEA statement IDs in JDBC telemetry had no corresponding server-side record, because Thrift statements executed on the same thread were mislabeled as SEA (ES-1961329).

Root cause

TelemetryHelper.exportTelemetryLog(...) resolved the connection via DatabricksThreadContextHolder.getConnectionContext(). That ThreadLocal is set once per DatabricksConnection construction and never refreshed per statement, so the last-created connection on a thread "wins" and all subsequent telemetry on that thread inherits its context.

Fix

Each TelemetryCollector is already per-connection (keyed by connectionUuid in TelemetryCollectorManager). It now:

  • Stores its own IDatabricksConnectionContext at construction time.
  • Passes that context explicitly to TelemetryHelper.exportTelemetryLog(...) instead of relying on the shared ThreadLocal.

A new exportTelemetryLog(connectionContext, details, logLevel) overload takes the context explicitly; the old 2-arg overload is retained (deprecated) for backward compatibility.

Real connections always have a unique connectionUuid, so each gets its own collector holding its own context. Only context-less calls share the DEFAULT_CONNECTION collector; those events carry a null context and are skipped at export time, so no cross-connection misattribution can occur (documented inline).

Files changed

  • TelemetryCollector.java — stores connectionContext; passes it to all three exportTelemetryLog call sites; adds getConnectionContext() accessor.
  • TelemetryCollectorManager.java — forwards the context into the collector constructor; documents the DEFAULT_CONNECTION trade-off.
  • TelemetryHelper.java — new explicit-context exportTelemetryLog overload; old overload deprecated.
  • Tests + NEXT_CHANGELOG.md.

Testing

Automated unit tests (added)

  • TelemetryCollectorTest: collector stores its own context (testCollectorStoresConnectionContext); recordOperationLatency calls the 3-arg exportTelemetryLog.
  • TelemetryCollectorManagerTest: each collector stores the correct per-connection context (testCollectorStoresCorrectConnectionContext); a collector's context is not affected by subsequently creating another connection's collector (testCollectorContextNotAffectedBySubsequentConnectionCreation).
  • TelemetryHelperTest: exportTelemetryLog uses the provided context and not the ThreadLocal (testExportTelemetryLogWithExplicitContext_UsesProvidedContext); null context short-circuits without export (testExportTelemetryLogWithExplicitContext_NullContextSkipsExport).

Test runs

  • Telemetry suites — TelemetryCollectorTest, TelemetryCollectorManagerTest, TelemetryHelperTest: 128 tests, 0 failures.
  • Broader related suites — connection / statement / session / http / retry / sdk / thrift / streaming / pooled connection: 519 tests, 0 failures.

Manual end-to-end reproduction (SEA + Thrift on one thread)

Created a SEA connection and a Thrift connection on the same thread, executed a query on the SEA connection, and dumped the actual DriverConnectionParameters telemetry payload.

Before the fix — payload for the SEA statement was built from the stale ThreadLocal (which held the Thrift connection):

stmt1 actual mode  = SEA
ThreadLocal mode   = THRIFT
=== PAYLOAD for stmt1 (sent to backend) ===
{ "mode" : "THRIFT", ... }   <-- WRONG

After the fix — the collector uses its own stored context:

=== FIX VERIFICATION ===
Collector for conn1: mode=SEA
Collector for conn2: mode=THRIFT
FIX WORKS: true

Fixes ES-1961329.

Telemetry export previously read the connection context from a static
ThreadLocal (DatabricksThreadContextHolder) at export time. When multiple
connections were used on the same thread (e.g. Thrift and SEA), the
ThreadLocal held whichever connection was created last, so per-statement
telemetry events were tagged with the wrong connection's context
(transport mode, connection parameters, etc.).

Each TelemetryCollector is already per-connection (keyed by connectionUuid),
so it now stores its own connection context and passes it explicitly to
TelemetryHelper.exportTelemetryLog instead of relying on the ThreadLocal.
The old 2-arg overload is retained (deprecated) for backward compatibility.

Fixes ES-1961329.

Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant