Add Additional OTel JVM Runtime Metrics and Gate "Developmental" Metrics #11411

Open
mhlidd wants to merge 5 commits into master from mhlidd/otlp_runtime_metrics_follow_up

@mhlidd mhlidd commented May 18, 2026

What Does This Do

Follow-up to the parent PR for maximo/otlp-runtime-metrics that expands the OTLP JVM runtime metrics surface and gates Development-status metrics behind a new opt-out flag.

New config

  • dd.metrics.otel.experimental.enabled (default: true) — mirrors OTel's otel.instrumentation.runtime-telemetry.emit-experimental-metrics. When false, only metrics marked Stable in the OTel JVM semantic conventions are emitted; Development-status metrics are suppressed. Settable via either env var:

    • DD_METRICS_OTEL_EXPERIMENTAL_ENABLED (Datadog form)
    • OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_EMIT_EXPERIMENTAL_METRICS (OTel-spec form, mapped through OtelEnvironmentConfigSource)

    Both env vars are registered in metadata/supported-configurations.json.
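
As a rough illustration of how the two environment variables could resolve to one boolean, here is a minimal sketch. The class and method names are hypothetical, and the precedence (Datadog form first, then the OTel form) is an assumption; the PR does not state which variable wins when both are set.

```java
import java.util.Map;

/** Hypothetical resolver for illustration; not the tracer's actual Config code. */
final class ExperimentalFlagSketch {
  static final String DD_ENV = "DD_METRICS_OTEL_EXPERIMENTAL_ENABLED";
  static final String OTEL_ENV =
      "OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_EMIT_EXPERIMENTAL_METRICS";
  // Default value as described in this PR.
  static final boolean DEFAULT_ENABLED = true;

  /**
   * Resolves the flag from a snapshot of environment variables.
   * Assumption: the Datadog form takes precedence over the OTel form.
   */
  static boolean experimentalEnabled(Map<String, String> env) {
    String raw = env.get(DD_ENV);
    if (raw == null) {
      raw = env.get(OTEL_ENV);
    }
    // Boolean.parseBoolean treats anything other than "true" (case-insensitive) as false.
    return raw == null ? DEFAULT_ENABLED : Boolean.parseBoolean(raw.trim());
  }
}
```

Calling `ExperimentalFlagSketch.experimentalEnabled(System.getenv())` would apply the same logic to the real process environment.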

Metrics added or reclassified (all under the datadog.jvm.runtime scope, OTel-native names)

| Metric | OTel status | When emitted |
| --- | --- | --- |
| jvm.memory.used_after_last_gc | Stable | Always (moved into the always-on memory group) |
| jvm.gc.duration | Stable | Always. The jvm.gc.cause attribute is gated on the experimental flag (the cause attribute is not in OTel's stable attribute set); jvm.gc.name and jvm.gc.action are always attached. |
| jvm.memory.init | Development | Only when experimental flag is on |
| jvm.buffer.memory.used / limit / count | Development | Only when experimental flag is on |
| jvm.system.cpu.utilization | Development | Only when experimental flag is on |
| jvm.system.cpu.load_1m | Development | Only when experimental flag is on |
| jvm.file_descriptor.count / limit | Development | Only when experimental flag is on, and only on Unix-like JVMs (UnixOperatingSystemMXBean) |
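
The gating rules in the table can be read as a small decision function. The sketch below is illustrative only: the class and method names are hypothetical, and it covers a subset of the listed metrics (the buffer metrics are omitted because the table abbreviates their names).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the gating rules; not the PR's registration code. */
final class MetricGateSketch {

  /** Instrument names that would be registered for a given flag value and platform. */
  static List<String> instrumentNames(boolean experimental, boolean unixLike) {
    List<String> names = new ArrayList<>();
    // Stable metrics are always emitted.
    names.add("jvm.memory.used_after_last_gc");
    names.add("jvm.gc.duration");
    if (experimental) {
      // Development-status metrics require the experimental flag.
      names.add("jvm.memory.init");
      names.add("jvm.system.cpu.utilization");
      names.add("jvm.system.cpu.load_1m");
      if (unixLike) {
        // File descriptor metrics additionally require a Unix-like JVM.
        names.add("jvm.file_descriptor.count");
        names.add("jvm.file_descriptor.limit");
      }
    }
    return names;
  }

  /** Attribute keys attached to jvm.gc.duration data points. */
  static List<String> gcDurationAttributeKeys(boolean experimental) {
    List<String> keys = new ArrayList<>();
    keys.add("jvm.gc.name");
    keys.add("jvm.gc.action");
    if (experimental) {
      keys.add("jvm.gc.cause"); // not part of OTel's stable attribute set
    }
    return keys;
  }
}
```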

Value-guard alignment with OTel reference implementation

  • jvm.memory.limit and jvm.memory.init now skip recording only when getMax() / getInit() returns the documented -1 sentinel. The previous guard recorded only values > 0, which also skipped legitimate 0 values.
  • All other per-metric guards (>= 0, null checks) match the corresponding callbacks in io.opentelemetry.instrumentation.runtimetelemetry.internal.*.
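
To make the sentinel guard concrete, here is a minimal sketch using the standard `java.lang.management.MemoryUsage` type. The helper class and method names are hypothetical; only the "skip on -1, record 0" behaviour is taken from the description above.

```java
import java.lang.management.MemoryUsage;
import java.util.OptionalLong;

/** Hypothetical helper showing the -1 sentinel guard; not the PR's callback code. */
final class MemoryGuardSketch {

  /** Value to record for jvm.memory.limit, or empty when the limit is undefined. */
  static OptionalLong limitToRecord(MemoryUsage usage) {
    if (usage == null) {
      return OptionalLong.empty(); // some pools may not report usage at all
    }
    long max = usage.getMax();
    // -1 is the documented "undefined" value; 0 is a legitimate reading and is kept.
    return max == -1 ? OptionalLong.empty() : OptionalLong.of(max);
  }

  /** Value to record for jvm.memory.init, or empty when the initial size is undefined. */
  static OptionalLong initToRecord(MemoryUsage usage) {
    if (usage == null) {
      return OptionalLong.empty();
    }
    long init = usage.getInit();
    return init == -1 ? OptionalLong.empty() : OptionalLong.of(init);
  }
}
```

With the previous `> 0` style of guard, a `MemoryUsage` whose max is 0 would have produced no data point; with the sentinel check it yields 0.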

Test coverage

  • JvmOtlpRuntimeMetricsTest was extended to assert all newly added metric names are registered (with platform-conditional checks for the Unix-only file descriptor metrics) and to cover jvm.gc.duration emission via System.gc().
  • New JvmOtlpRuntimeMetricsForkedTest runs in an isolated JVM, calls start(false), and verifies that Development-status instruments are absent and that jvm.gc.cause is not attached to jvm.gc.duration data points when experimental metrics are disabled. Forked because JvmOtlpRuntimeMetrics.start(...) is guarded by a process-wide AtomicBoolean and the registry / JMX listeners are JVM-global, so a single JVM cannot exercise both flag values.
  • Removed a weak startIsIdempotent test that only checked the metric-name Set size — it could not detect duplicate JMX listeners or duplicate observable callbacks under the same instrument name, which are the actual failure modes if the guard were removed.
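
The reason a forked JVM is needed can be seen in a simplified model of a process-wide start guard. This is a sketch under assumptions: the class and field names are hypothetical, and the real JvmOtlpRuntimeMetrics.start(...) also registers instruments and JMX listeners.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of a one-shot, process-wide start guard. */
final class StartGuardSketch {
  private static final AtomicBoolean STARTED = new AtomicBoolean(false);
  private static volatile boolean experimentalInEffect;

  /**
   * Returns true only for the first caller in this JVM. Later calls are ignored,
   * so the flag value passed by the first caller stays in effect.
   */
  static boolean start(boolean experimental) {
    if (!STARTED.compareAndSet(false, true)) {
      return false; // already started; the new flag value is not applied
    }
    experimentalInEffect = experimental;
    return true;
  }

  static boolean experimentalInEffect() {
    return experimentalInEffect;
  }
}
```

In this model, once `start(false)` has run, a later `start(true)` in the same JVM has no effect, which is why the two flag values are tested in separate processes.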

Misc

  • Extracted sunOsBean() helper to remove duplicated instanceof OperatingSystemMXBean cast logic between registerCpuMetrics() and the new registerSystemCpuMetrics().
  • Added debug logs when an MXBean isn't available so it's obvious why a metric didn't show up.
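
For readers unfamiliar with the pattern, a helper like sunOsBean() might look roughly like the following. This is a sketch, not the PR's code: only the helper name comes from the description, and the body is an assumption based on the check-and-cast logic it mentions.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical version of the check-and-cast helper described above. */
final class OsBeanSketch {

  /**
   * Returns the platform OS bean as the com.sun.management extension type,
   * or null when the running JVM does not provide that type.
   */
  static com.sun.management.OperatingSystemMXBean sunOsBean() {
    java.lang.management.OperatingSystemMXBean raw =
        ManagementFactory.getOperatingSystemMXBean();
    return raw instanceof com.sun.management.OperatingSystemMXBean
        ? (com.sun.management.OperatingSystemMXBean) raw
        : null;
  }
}
```

Callers would then treat a null result as "metric unavailable" and log at debug level rather than fail.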

Motivation

The parent PR established the OTLP JVM runtime metrics pipeline but only emitted a subset of the OTel JVM semantic conventions. This follow-up brings the surface in line with what opentelemetry-java-instrumentation's runtime-telemetry library emits, and adds the standard experimental-metrics opt-out so users who want only the Stable subset (smaller cardinality, fewer dashboard surprises) can disable Development metrics without losing the integration entirely.

Aligning the value guards with OTel's reference implementation prevents two real-world divergences:

  1. Without the 0 vs -1 fix, uncapped non-heap pools (where getMax() == 0 on some JVM/version combos) would silently produce no jvm.memory.limit data point — they should publish 0 to indicate "no limit observed."
  2. The experimental gate ensures dashboards built against OTel's stable-only output won't differ between OTel SDK collection and DD-agent collection.

Additional Notes

  • No change to JMXFetch behavior beyond passing the new flag through JvmOtlpRuntimeMetrics.start(...). The OTLP_JMX_CONFIG-skip path is unchanged.
  • The OTel-spec property otel.instrumentation.runtime-telemetry.emit-experimental-metrics (env var form OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_EMIT_EXPERIMENTAL_METRICS) is captured in OtelEnvironmentConfigSource, so an unmodified OTel-style config picks up the flag automatically.

Contributor Checklist

Jira ticket: [PROJ-IDENT]

mhlidd commented May 18, 2026

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62d9b50d1d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Base automatically changed from maximo/otlp-runtime-metrics to master May 19, 2026 18:23
datadog-prod-us1-5 Bot commented May 19, 2026

Pipelines

⚠️ Warnings

🚦 1 Pipeline job failed

Check pull requests | Check pull requests (GitHub Actions)

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. Label validation failed: Please add at least one type, and one component or instrumentation label to the pull request.

Commit SHA: 40ef357

@mhlidd mhlidd force-pushed the mhlidd/otlp_runtime_metrics_follow_up branch from 90ddfc2 to de166ab Compare May 20, 2026 18:24
@mhlidd mhlidd changed the title from "init" to "Add Additional OTel JVM Runtime Metrics and Gate "Developmental" Metrics" May 20, 2026
@mhlidd mhlidd marked this pull request as ready for review May 20, 2026 19:53
@mhlidd mhlidd requested review from a team as code owners May 20, 2026 19:53
@mhlidd mhlidd requested review from ValentinZakharov, bric3 and mcculls and removed request for a team May 20, 2026 19:53
dd-octo-sts Bot commented May 20, 2026

Hi! 👋 Thanks for your pull request! 🎉

To help us review it, please make sure to:

  • Add at least one type, and one component or instrumentation label to the pull request

If you need help, please check our contributing guidelines.

@mhlidd mhlidd added labels May 20, 2026: type: enhancement (Enhancements and improvements), inst: opentelemetry (OpenTelemetry instrumentation), tag: ai generated (Largely based on code generated by an AI or LLM)
mhlidd commented May 20, 2026

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Another round soon, please!



@ValentinZakharov ValentinZakharov left a comment


Could you clarify whether the following differences from the JVM semantic conventions are intentional?

  • jvm.thread.count seems to be missing the recommended attributes thread.daemon and thread.state (spec)
  • jvm.memory.init is not split by memory pool and seems to be missing the jvm.memory.pool.name attribute (spec)


private static void recordGcDuration(
OtelMetricStorage storage, GarbageCollectionNotificationInfo info, boolean captureGcCause) {
double durationSeconds = info.getGcInfo().getDuration() / 1000d;

We should probably add a null check in case info doesn’t contain GC info


I'd suggest adding that check in handleNotification before calling recordGcDuration
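
One possible shape for the suggested check is sketched below. The helper class and method names are hypothetical, and the real handleNotification has more responsibilities; the sketch only shows guarding against a missing GcInfo before computing the duration in seconds.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import com.sun.management.GcInfo;
import java.util.OptionalDouble;

/** Hypothetical null-safe extraction of the GC duration; not the PR's code. */
final class GcDurationSketch {

  /** Returns the GC duration in seconds, or empty when no GC info is available. */
  static OptionalDouble durationSeconds(GarbageCollectionNotificationInfo info) {
    if (info == null) {
      return OptionalDouble.empty();
    }
    GcInfo gcInfo = info.getGcInfo();
    if (gcInfo == null) {
      return OptionalDouble.empty(); // nothing to record for this notification
    }
    // GcInfo.getDuration() is in milliseconds; the metric is recorded in seconds.
    return OptionalDouble.of(gcInfo.getDuration() / 1000d);
  }
}
```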

ManagementFactory.getOperatingSystemMXBean();
OperatingSystemMXBean osBean =
rawOsBean instanceof OperatingSystemMXBean ? (OperatingSystemMXBean) rawOsBean : null;
OperatingSystemMXBean osBean = sunOsBean();
@mcculls mcculls May 22, 2026


maybe findOsBean() ?

false,
GarbageCollectorMXBean.class.getClassLoader());
return true;
} catch (ClassNotFoundException e) {

I'd widen this to catch Exception or Throwable
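
A widened probe could look something like this sketch. The class and method names are hypothetical; the point is that class loading can fail with errors other than ClassNotFoundException (for example NoClassDefFoundError or other LinkageError subclasses), which a narrow catch would let escape.

```java
/** Hypothetical class-presence probe with a widened catch; not the PR's code. */
final class ClassProbeSketch {

  /** Returns true when the named class can be loaded without initializing it. */
  static boolean isClassPresent(String className, ClassLoader loader) {
    try {
      // initialize=false: only check that the class is loadable, do not run static init.
      Class.forName(className, false, loader);
      return true;
    } catch (Throwable t) {
      // Any failure is treated as "not available" so metric setup never breaks startup.
      return false;
    }
  }
}
```

Catching Throwable trades precision for robustness: it also swallows serious errors, so catching Exception plus LinkageError would be a narrower alternative.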

static final int DEFAULT_METRICS_OTEL_TIMEOUT = 7_500; // ms
static final int DEFAULT_METRICS_OTEL_CARDINALITY_LIMIT = 2_000;

public static final boolean DEFAULT_METRICS_OTEL_EXPERIMENTAL_ENABLED = true;

}

java.lang.management.OperatingSystemMXBean stdOsBean =
ManagementFactory.getOperatingSystemMXBean();

There are quite a few calls to ManagementFactory.getOperatingSystemMXBean(); here. Some use sunOsBean(), which returns null if it's not the right type, while other places have their own instanceof checks.

It might actually be more readable and consistent to just get the MBean with ManagementFactory.getOperatingSystemMXBean(); everywhere and check+cast it to the right type. The sunOsBean() helper doesn't really add much IMHO.


@mcculls mcculls left a comment


One question about whether the default should really be true since OTel defaults it to false at the moment: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/runtime-telemetry/README.md

Also a few cleanup / robustness comments to be addressed before merging - otherwise looks good.
