Skip to content

Report Kafka client connection status in DSM payloads#11442

Draft
piochelepiotr wants to merge 1 commit into
masterfrom
feat/kafka-config-connection-status
Draft

Report Kafka client connection status in DSM payloads#11442
piochelepiotr wants to merge 1 commit into
masterfrom
feat/kafka-config-connection-status

Conversation

@piochelepiotr
Copy link
Copy Markdown
Contributor

Summary

  • Kafka client configs now carry a connectionStatus field (connected / failed) inside the DSM msgpack payload (Configs[*].ConnectionStatus, peer to Type, KafkaClusterId, ConsumerGroup, Config).
  • New advice on org.apache.kafka.clients.Metadata.failedUpdate in both kafka-clients-0.11 and kafka-clients-3.8 so failed clients are reported (previously they were silently dropped — reporting was gated on Metadata.update succeeding).
  • Expanded the value-allowlist with non-secret auth selectors (sasl.mechanism, ssl.protocol, ssl.enabled.protocols, ssl.endpoint.identification.algorithm, ssl.truststore.type, ssl.keystore.type, ssl.cipher.suites, sasl.kerberos.service.name, sasl.login.callback.handler.class). Credentials remain masked.

Why

End goal is a DSM UI flow where a user can diff a failing client's config against working clients on the same cluster — typoed bootstrap.servers, missing sasl.mechanism, SSL truststore drift, etc. Without capturing failed clients at all, the comparison is impossible today.

Downstream

  • Migration: DataDog/k8s-resources#157530 (adds connection_status column + 8th positional proc param with DEFAULT 'connected')
  • Edge + writer + dsm-api: dd-go / dd-source PRs (linked separately)

Test plan

  • Unit tests updated (DataStreamsWritingTest, DefaultDataStreamsMonitoringTest, KafkaConfigHelperTest)
  • End-to-end: local docker compose (kafka + dd-agent) with the built agent jar shows Reporting kafka_producer config with status=connected for a working producer/consumer and status=failed for a producer pointed at a closed port; agent forwards both to /v0.1/pipeline_stats.

Known gap (follow-up)

A client whose bootstrap.servers fails at constructor DNS-resolve time (e.g. bogus.invalid:9092) reports nothing — the producer never builds, so no Metadata ever exists. Adding @Advice.OnMethodExit(onThrowable=...) on the producer constructor would close that gap.

tag: ai generated

🤖 Generated with Claude Code

Previously, the DSM payload only carried Kafka client configs once
`Metadata.update` fired with a valid cluster ID — so clients that never
authenticated or never reached a broker were silently dropped, and we
couldn't compare their configs against working clients.

Now every config is also reported with a `connectionStatus` field
("connected" / "failed") on the per-bucket `Configs` entry, including
on `Metadata.failedUpdate`. Also expands the value-allowlist with
non-secret auth selectors (`sasl.mechanism`, `ssl.protocol`,
`ssl.endpoint.identification.algorithm`, etc.) so the comparison flow
can surface mechanism typos without leaking credentials.

tag: ai generated

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@piochelepiotr piochelepiotr added type: enhancement Enhancements and improvements inst: kafka Kafka instrumentation tag: ai generated Largely based on code generated by an AI or LLM labels May 21, 2026
@pr-commenter
Copy link
Copy Markdown

pr-commenter Bot commented May 21, 2026

Kafka / producer-benchmark

Parameters

Baseline Candidate
baseline_or_candidate baseline candidate
git_branch master feat/kafka-config-connection-status
git_commit_date 1779305023 1779387517
git_commit_sha f7a0a44 1c8096d
See matching parameters
Baseline Candidate
ci_job_date 1779388487 1779388487
ci_job_id 1703999194 1703999194
ci_pipeline_id 114571186 114571186
cpu_model Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
jdkVersion 11.0.25 11.0.25
jmhVersion 1.36 1.36
jvm /usr/lib/jvm/java-11-openjdk-amd64/bin/java /usr/lib/jvm/java-11-openjdk-amd64/bin/java
jvmArgs -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/producer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/go/src/github.com/DataDog/apm-reliability/dd-trace-java/platform/src/producer-benchmark/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
vmName OpenJDK 64-Bit Server VM OpenJDK 64-Bit Server VM
vmVersion 11.0.25+9-post-Ubuntu-1ubuntu122.04 11.0.25+9-post-Ubuntu-1ubuntu122.04

Summary

Found 0 performance improvements and 0 performance regressions! Performance is the same for 3 metrics, 0 unstable metrics.

See unchanged results
scenario Δ mean throughput
scenario:not-instrumented/KafkaProduceBenchmark.benchProduce same
scenario:only-tracing-dsm-disabled-benchmarks/KafkaProduceBenchmark.benchProduce same
scenario:only-tracing-dsm-enabled-benchmarks/KafkaProduceBenchmark.benchProduce same

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

inst: kafka Kafka instrumentation tag: ai generated Largely based on code generated by an AI or LLM type: enhancement Enhancements and improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant