Add Additional OTel JVM Runtime Metrics and Gate "Developmental" Metrics#11411
Add Additional OTel JVM Runtime Metrics and Gate "Developmental" Metrics#11411mhlidd wants to merge 5 commits into
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62d9b50d1d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
90ddfc2 to
de166ab
Compare
|
Hi! 👋 Thanks for your pull request! 🎉 To help us review it, please make sure to:
If you need help, please check our contributing guidelines. |
|
@codex review |
|
Codex Review: Didn't find any major issues. Another round soon, please! ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
|
|
||
| private static void recordGcDuration( | ||
| OtelMetricStorage storage, GarbageCollectionNotificationInfo info, boolean captureGcCause) { | ||
| double durationSeconds = info.getGcInfo().getDuration() / 1000d; |
There was a problem hiding this comment.
We should probably add a null check in case info doesn’t contain GC info
There was a problem hiding this comment.
I'd suggesting adding that check in handleNotification before calling recordGcDuration
| ManagementFactory.getOperatingSystemMXBean(); | ||
| OperatingSystemMXBean osBean = | ||
| rawOsBean instanceof OperatingSystemMXBean ? (OperatingSystemMXBean) rawOsBean : null; | ||
| OperatingSystemMXBean osBean = sunOsBean(); |
| false, | ||
| GarbageCollectorMXBean.class.getClassLoader()); | ||
| return true; | ||
| } catch (ClassNotFoundException e) { |
There was a problem hiding this comment.
I'd widen this to catch Exception or Throwable
| static final int DEFAULT_METRICS_OTEL_TIMEOUT = 7_500; // ms | ||
| static final int DEFAULT_METRICS_OTEL_CARDINALITY_LIMIT = 2_000; | ||
|
|
||
| public static final boolean DEFAULT_METRICS_OTEL_EXPERIMENTAL_ENABLED = true; |
There was a problem hiding this comment.
Default for this in OTel is false - do we want to match that?
| } | ||
|
|
||
| java.lang.management.OperatingSystemMXBean stdOsBean = | ||
| ManagementFactory.getOperatingSystemMXBean(); |
There was a problem hiding this comment.
There's quite a few calls to ManagementFactory.getOperatingSystemMXBean(); here - some use sunOsBean() which returns null if it's not the right type, while other places have their own instanceof checks.
It might actually be more readable and consistent to just get the MBean with ManagementFactory.getOperatingSystemMXBean(); everywhere and check+cast it to the right type. The sunOsBean() helper doesn't really add much IMHO.
mcculls
left a comment
There was a problem hiding this comment.
One question about whether the default should really be true since OTel defaults it to false at the moment: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/runtime-telemetry/README.md
Also a few cleanup / robustness comments to be addressed before merging - otherwise looks good.
What Does This Do
Follow-up to the parent PR for
maximo/otlp-runtime-metricsthat expands the OTLP JVM runtime metrics surface and gates Development-status metrics behind a new opt-out flag.New config
dd.metrics.otel.experimental.enabled(default:true) — mirrors OTel'sotel.instrumentation.runtime-telemetry.emit-experimental-metrics. Whenfalse, only metrics marked Stable in the OTel JVM semantic conventions are emitted; Development-status metrics are suppressed. Settable via either env var:DD_METRICS_OTEL_EXPERIMENTAL_ENABLED(Datadog form)OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_EMIT_EXPERIMENTAL_METRICS(OTel-spec form, mapped throughOtelEnvironmentConfigSource)Both env vars are registered in
metadata/supported-configurations.json.Metrics added or reclassified (all under the
datadog.jvm.runtimescope, OTel-native names)jvm.memory.used_after_last_gcjvm.gc.durationjvm.gc.causeattribute is gated on the experimental flag (the cause attribute is not in OTel's stable attribute set);jvm.gc.nameandjvm.gc.actionare always attached.jvm.memory.initjvm.buffer.memory.used/limit/countjvm.system.cpu.utilizationjvm.system.cpu.load_1mjvm.file_descriptor.count/limitUnixOperatingSystemMXBean)Value-guard alignment with OTel reference implementation
jvm.memory.limitandjvm.memory.initnow skip recording only whengetMax()/getInit()returns the documented-1sentinel (was> 0, which incorrectly also skipped legitimate0values).>= 0, null checks) match the corresponding callbacks inio.opentelemetry.instrumentation.runtimetelemetry.internal.*.Test coverage
JvmOtlpRuntimeMetricsTestwas extended to assert all newly added metric names are registered (with platform-conditional checks for the Unix-only file descriptor metrics) and to coverjvm.gc.durationemission viaSystem.gc().JvmOtlpRuntimeMetricsForkedTestruns in an isolated JVM, callsstart(false), and verifies that Development-status instruments are absent and thatjvm.gc.causeis not attached tojvm.gc.durationdata points when experimental metrics are disabled. Forked becauseJvmOtlpRuntimeMetrics.start(...)is guarded by a process-wideAtomicBooleanand the registry / JMX listeners are JVM-global, so a single JVM cannot exercise both flag values.startIsIdempotenttest that only checked the metric-nameSetsize — it could not detect duplicate JMX listeners or duplicate observable callbacks under the same instrument name, which are the actual failure modes if the guard were removed.Misc
sunOsBean()helper to remove duplicatedinstanceof OperatingSystemMXBeancast logic betweenregisterCpuMetrics()and the newregisterSystemCpuMetrics().Motivation
The parent PR established the OTLP JVM runtime metrics pipeline but only emitted a subset of the OTel JVM semantic conventions. This follow-up brings the surface in line with what
opentelemetry-java-instrumentation'sruntime-telemetrylibrary emits, and adds the standard experimental-metrics opt-out so users who want only the Stable subset (smaller cardinality, fewer dashboard surprises) can disable Development metrics without losing the integration entirely.Aligning the value guards with OTel's reference implementation prevents two real-world divergences:
0vs-1fix, uncapped non-heap pools (wheregetMax() == 0on some JVM/version combos) would silently produce nojvm.memory.limitdata point — they should publish0to indicate "no limit observed."Additional Notes
JvmOtlpRuntimeMetrics.start(...). TheOTLP_JMX_CONFIG-skip path is unchanged.otel.instrumentation.runtime-telemetry.emit-experimental-metricsis captured inOtelEnvironmentConfigSourceso an unmodified OTel-style config picks up the flag automatically.Contributor Checklist
type:and (comp:orinst:) labels in addition to any other useful labelsclose,fix, or any linking keywords when referencing an issueUse
solvesinstead, and assign the PR milestone to the issueJira ticket: [PROJ-IDENT]