
feat: JVM metrics via hsperfdata (no hooking)#384

Draft
debot-macmini1 wants to merge 10 commits into main from spike/jvm-metrics-nodemon

debot-macmini1 commented May 25, 2026

Summary

Adds JVM metrics discovery + export in zxporter-nodemon without attach/JMX/javaagent by reading HotSpot hsperfdata via /proc + cgroups.
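For readers unfamiliar with hsperfdata: the file is a memory-mapped buffer that HotSpot writes under /tmp/hsperfdata_<user>/<pid>, and it begins with a small fixed header. The sketch below decodes that header as a sanity check before any counters are read. The field layout follows my reading of HotSpot's PerfDataPrologue struct and should be verified against the JDK versions you target; the names perfPrologue and parsePrologue are hypothetical and are not taken from this PR.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// perfPrologue holds the header fields this sketch cares about.
// Assumed layout: magic(4) byteOrder(1) major(1) minor(1) accessible(1)
// used(4) overflow(4) modTimeStamp(8) entryOffset(4) numEntries(4) = 32 bytes.
type perfPrologue struct {
	Major, Minor byte
	Accessible   bool
	Used         int32
	EntryOffset  int32
	NumEntries   int32
}

// parsePrologue validates the magic number and decodes the 32-byte header.
func parsePrologue(buf []byte) (perfPrologue, error) {
	if len(buf) < 32 {
		return perfPrologue{}, errors.New("hsperfdata: buffer shorter than prologue")
	}
	// The magic bytes are 0xca 0xfe 0xc0 0xc0 regardless of host endianness.
	if binary.BigEndian.Uint32(buf[0:4]) != 0xcafec0c0 {
		return perfPrologue{}, errors.New("hsperfdata: bad magic")
	}
	// Byte 4 selects the byte order of the remaining fields (assumed: 1 = little endian).
	var order binary.ByteOrder = binary.BigEndian
	if buf[4] == 1 {
		order = binary.LittleEndian
	}
	return perfPrologue{
		Major:       buf[5],
		Minor:       buf[6],
		Accessible:  buf[7] != 0,
		Used:        int32(order.Uint32(buf[8:12])),
		EntryOffset: int32(order.Uint32(buf[24:28])),
		NumEntries:  int32(order.Uint32(buf[28:32])),
	}, nil
}
```

A reader would typically skip the file when Accessible is false, since the JVM may still be initializing the buffer.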

What’s included

  • New nodemon endpoint: GET /container/jvm-metrics
    • heap used/size/max
    • GC collector time + safepoint time
    • JVM flags extraction from:
      • /proc/<pid>/cmdline
      • /proc/<pid>/environ (JAVA_TOOL_OPTIONS, JDK_JAVA_OPTIONS, JAVA_OPTS)
    • flag_sources attribution (cmdline vs which env var)
  • Helm chart gating: jvmMetrics.enabled
    • enables hostPID: true, runAsUser: 0, and CAP_SYS_PTRACE
    • rationale: required to read cross-container /proc/<pid>/root/tmp/hsperfdata_*/* and, in practice, /proc/<pid>/environ
  • Collector integration mirrors the GPU nodemon flow:
    • prefetch JVM metrics once per cycle
    • index by (namespace, pod, container)
    • attach to ContainerMetricsSnapshot
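The flag extraction and flag_sources attribution described above can be sketched as follows. This is a minimal illustration, not the PR's code: the source type, extractFlags, and parseSize are hypothetical names, and the rule that later sources overwrite earlier ones is an assumption about precedence that the PR may implement differently.

```go
package main

import (
	"strconv"
	"strings"
)

// source is one place flags can come from, e.g. "cmdline" or "JAVA_TOOL_OPTIONS".
type source struct {
	Name string
	Args []string
}

// parseSize converts a JVM size such as "160m" into bytes (k, m, g, t suffixes).
func parseSize(s string) (int64, bool) {
	if s == "" {
		return 0, false
	}
	mult := int64(1)
	switch s[len(s)-1] {
	case 'k', 'K':
		mult = 1 << 10
	case 'm', 'M':
		mult = 1 << 20
	case 'g', 'G':
		mult = 1 << 30
	case 't', 'T':
		mult = 1 << 40
	}
	if mult != 1 {
		s = s[:len(s)-1]
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n < 0 {
		return 0, false
	}
	return n * mult, true
}

// extractFlags walks the sources in order and records each recognized flag
// together with the source it was last seen in.
func extractFlags(sources []source) (map[string]any, map[string]string) {
	vals := map[string]any{}
	from := map[string]string{}
	set := func(key string, v any, src string) { vals[key], from[key] = v, src }
	for _, src := range sources {
		for _, a := range src.Args {
			switch {
			case strings.HasPrefix(a, "-Xms"):
				if n, ok := parseSize(a[len("-Xms"):]); ok {
					set("xms_bytes", n, src.Name)
				}
			case strings.HasPrefix(a, "-Xmx"):
				if n, ok := parseSize(a[len("-Xmx"):]); ok {
					set("xmx_bytes", n, src.Name)
				}
			case strings.HasPrefix(a, "-XX:MaxRAMPercentage="):
				v := strings.TrimPrefix(a, "-XX:MaxRAMPercentage=")
				if f, err := strconv.ParseFloat(v, 64); err == nil {
					set("max_ram_percentage", f, src.Name)
				}
			case a == "-XX:+UseContainerSupport":
				set("use_container_support", true, src.Name)
			case a == "-XX:-UseContainerSupport":
				set("use_container_support", false, src.Name)
			}
		}
	}
	return vals, from
}
```

Passing the JAVA_TOOL_OPTIONS tokens -Xms48m -Xmx160m -XX:MaxRAMPercentage=65 as one source yields xmx_bytes=167772160 attributed to JAVA_TOOL_OPTIONS, matching the java11-tool-options case in the smoke test. The sketch does not stop at the main class, so application arguments that resemble JVM flags would be misread.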

Smoke test matrix (Kind) — full req/resp bodies

All examples below were captured from a Kind cluster with the nodemon DaemonSet deployed with jvmMetrics.enabled=true.

Common request (fetch all JVM metrics for namespace)

Request:

curl -sS 'http://127.0.0.1:6061/container/jvm-metrics?namespace=jvm-spike'

Response (full JSON array):

[
  {
    "node_name": "jvm-spike-control-plane",
    "pod": "java-sleeper-5d444487b8-mv48f",
    "namespace": "jvm-spike",
    "container": "app",
    "container_id": "2b31ba56f3a6a14491cd381a224c5c02f05fae85aa4c6dae7685edcd090aa880",
    "pid_host": 2326,
    "pid_ns": 1,
    "java_command": "DummyMain",
    "java_version": "21.0.11",
    "heap_size_bytes": 134287360,
    "heap_used_bytes": 0,
    "heap_max_size_bytes": 268435456,
    "gc_time_seconds_total": {
      "Serial full collection pauses": 0.001146047,
      "Serial young collection pauses": 0.000271418
    },
    "safepoint_time_seconds_total": 0.001670841,
    "safepoint_sync_time_seconds_total": 0.000010959,
    "flags_extracted": {
      "xms_bytes": 67108864,
      "xmx_bytes": 268435456,
      "max_ram_percentage": 70,
      "use_container_support": true
    },
    "flag_sources": {
      "xms_bytes": "cmdline",
      "xmx_bytes": "cmdline",
      "max_ram_percentage": "cmdline",
      "use_container_support": "cmdline"
    },
    "raw_cmdline": "java -Xms64m -Xmx256m -XX:MaxRAMPercentage=70 -XX:+UseContainerSupport -cp /tmp DummyMain  -Dexample.tool.options=true",
    "timestamp": "2026-05-25T18:01:06.828335777Z"
  },
  {
    "node_name": "jvm-spike-control-plane",
    "pod": "java8-xmx-only-5b69c64479-tf2qx",
    "namespace": "jvm-spike",
    "container": "app",
    "container_id": "6945383b4828b08a0fc77b81d90a4752cbcc85300d776b799677c1a460a0325b",
    "pid_host": 30009,
    "pid_ns": 1,
    "java_command": "DummyMain",
    "java_version": "1.8.0_492",
    "heap_size_bytes": 42078208,
    "heap_used_bytes": 0,
    "heap_max_size_bytes": 201326592,
    "gc_time_seconds_total": {
      "Copy": 0.000289087,
      "MSC": 0.000437005
    },
    "safepoint_time_seconds_total": 0.011397506,
    "safepoint_sync_time_seconds_total": 0.000032208,
    "flags_extracted": {
      "xmx_bytes": 201326592
    },
    "flag_sources": {
      "xmx_bytes": "cmdline"
    },
    "raw_cmdline": "java -Xmx192m -cp /tmp DummyMain",
    "timestamp": "2026-05-25T18:01:06.828466529Z"
  },
  {
    "node_name": "jvm-spike-control-plane",
    "pod": "java17-maxram-only-7794b6cf4f-gvhjf",
    "namespace": "jvm-spike",
    "container": "app",
    "container_id": "c9fefdee55607a540f6174fab2058e6775d3de5db3da427ce896c7e1f080b1fd",
    "pid_host": 30077,
    "pid_ns": 1,
    "java_command": "DummyMain",
    "java_version": "17.0.19",
    "heap_size_bytes": 16912384,
    "heap_used_bytes": 0,
    "heap_max_size_bytes": 216006656,
    "gc_time_seconds_total": {
      "Serial full collection pauses": 0.000466714,
      "Serial young collection pauses": 0.000358171
    },
    "safepoint_time_seconds_total": 0.000893928,
    "safepoint_sync_time_seconds_total": 0.000013291,
    "flags_extracted": {
      "max_ram_percentage": 40,
      "use_container_support": true
    },
    "flag_sources": {
      "max_ram_percentage": "cmdline",
      "use_container_support": "cmdline"
    },
    "raw_cmdline": "java -XX:MaxRAMPercentage=40 -XX:+UseContainerSupport -cp /tmp DummyMain",
    "timestamp": "2026-05-25T18:01:06.828589447Z"
  },
  {
    "node_name": "jvm-spike-control-plane",
    "pod": "java11-tool-options-5df9bfcb98-twtc9",
    "namespace": "jvm-spike",
    "container": "app",
    "container_id": "5d7c501a26cf5f3b440d8a726c45418b8dfca199b6314d3f52c556f5189c4a53",
    "pid_host": 30153,
    "pid_ns": 1,
    "java_command": "DummyMain",
    "java_version": "11.0.31",
    "heap_size_bytes": 50331648,
    "heap_used_bytes": 0,
    "heap_max_size_bytes": 167772160,
    "gc_time_seconds_total": {
      "Copy": 0,
      "MSC": 0
    },
    "safepoint_time_seconds_total": 0.000294419,
    "safepoint_sync_time_seconds_total": 0.000093209,
    "flags_extracted": {
      "xms_bytes": 50331648,
      "xmx_bytes": 167772160,
      "max_ram_percentage": 65
    },
    "flag_sources": {
      "xms_bytes": "JAVA_TOOL_OPTIONS",
      "xmx_bytes": "JAVA_TOOL_OPTIONS",
      "max_ram_percentage": "JAVA_TOOL_OPTIONS"
    },
    "raw_cmdline": "java -cp /tmp DummyMain  -Xms48m -Xmx160m -XX:MaxRAMPercentage=65 -Dfrom=tooloptions",
    "timestamp": "2026-05-25T18:01:06.829114328Z"
  }
]

Note: not-java does not appear in the response (it’s a busybox sleep pod).
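A consumer of this response might decode it and index the records by (namespace, pod, container), as the collector integration does once per cycle. The sketch below shows one way to do that. The JSON field names come from the response above; the Go type and function names (jvmMetrics, containerKey, indexJVMMetrics) are hypothetical, and only a subset of fields is mapped.

```go
package main

import "encoding/json"

// jvmMetrics maps a subset of the fields returned by /container/jvm-metrics.
type jvmMetrics struct {
	Namespace          string             `json:"namespace"`
	Pod                string             `json:"pod"`
	Container          string             `json:"container"`
	JavaVersion        string             `json:"java_version"`
	HeapSizeBytes      int64              `json:"heap_size_bytes"`
	HeapUsedBytes      int64              `json:"heap_used_bytes"`
	HeapMaxSizeBytes   int64              `json:"heap_max_size_bytes"`
	GCTimeSecondsTotal map[string]float64 `json:"gc_time_seconds_total"`
	SafepointSeconds   float64            `json:"safepoint_time_seconds_total"`
}

// containerKey identifies one container within the cluster.
type containerKey struct {
	Namespace, Pod, Container string
}

// indexJVMMetrics decodes the JSON array and builds a lookup table so that
// per-container snapshots can find their JVM metrics in constant time.
func indexJVMMetrics(body []byte) (map[containerKey]jvmMetrics, error) {
	var items []jvmMetrics
	if err := json.Unmarshal(body, &items); err != nil {
		return nil, err
	}
	idx := make(map[containerKey]jvmMetrics, len(items))
	for _, m := range items {
		idx[containerKey{m.Namespace, m.Pod, m.Container}] = m
	}
	return idx, nil
}
```

A container with no entry in the map, such as the busybox pod, simply has no JVM metrics attached.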


Case A: java8 cmdline heap sizing (-Xmx192m)

Workload: jvm-spike/deploy/java8-xmx-only

Request:

curl -sS 'http://127.0.0.1:6061/container/jvm-metrics?namespace=jvm-spike' | jq '.[] | select(.pod|startswith("java8-xmx-only"))'

Response (full object):

{
  "container": "app",
  "container_id": "6945383b4828b08a0fc77b81d90a4752cbcc85300d776b799677c1a460a0325b",
  "flag_sources": {
    "xmx_bytes": "cmdline"
  },
  "flags_extracted": {
    "xmx_bytes": 201326592
  },
  "gc_time_seconds_total": {
    "Copy": 0.000289087,
    "MSC": 0.000437005
  },
  "heap_max_size_bytes": 201326592,
  "heap_size_bytes": 42078208,
  "heap_used_bytes": 0,
  "java_command": "DummyMain",
  "java_version": "1.8.0_492",
  "namespace": "jvm-spike",
  "node_name": "jvm-spike-control-plane",
  "pid_host": 30009,
  "pid_ns": 1,
  "pod": "java8-xmx-only-5b69c64479-tf2qx",
  "raw_cmdline": "java -Xmx192m -cp /tmp DummyMain",
  "safepoint_sync_time_seconds_total": 3.2208e-05,
  "safepoint_time_seconds_total": 0.011397506,
  "timestamp": "2026-05-25T18:01:06.828466529Z"
}

Case B: java11 env-injected options (JAVA_TOOL_OPTIONS)

Workload: jvm-spike/deploy/java11-tool-options

Request:

curl -sS 'http://127.0.0.1:6061/container/jvm-metrics?namespace=jvm-spike' | jq '.[] | select(.pod|startswith("java11-tool-options"))'

Response (full object):

{
  "container": "app",
  "container_id": "5d7c501a26cf5f3b440d8a726c45418b8dfca199b6314d3f52c556f5189c4a53",
  "flag_sources": {
    "max_ram_percentage": "JAVA_TOOL_OPTIONS",
    "xms_bytes": "JAVA_TOOL_OPTIONS",
    "xmx_bytes": "JAVA_TOOL_OPTIONS"
  },
  "flags_extracted": {
    "max_ram_percentage": 65,
    "xms_bytes": 50331648,
    "xmx_bytes": 167772160
  },
  "gc_time_seconds_total": {
    "Copy": 0,
    "MSC": 0
  },
  "heap_max_size_bytes": 167772160,
  "heap_size_bytes": 50331648,
  "heap_used_bytes": 0,
  "java_command": "DummyMain",
  "java_version": "11.0.31",
  "namespace": "jvm-spike",
  "node_name": "jvm-spike-control-plane",
  "pid_host": 30153,
  "pid_ns": 1,
  "pod": "java11-tool-options-5df9bfcb98-twtc9",
  "raw_cmdline": "java -cp /tmp DummyMain  -Xms48m -Xmx160m -XX:MaxRAMPercentage=65 -Dfrom=tooloptions",
  "safepoint_sync_time_seconds_total": 9.3209e-05,
  "safepoint_time_seconds_total": 0.000294419,
  "timestamp": "2026-05-25T18:01:06.829114328Z"
}

Key assertions visible in response:

  • flags_extracted.xmx_bytes=167772160 (160MiB)
  • flag_sources.xmx_bytes="JAVA_TOOL_OPTIONS"

Case C: java17 percentage-based heap sizing (-XX:MaxRAMPercentage=40)

Workload: jvm-spike/deploy/java17-maxram-only

Request:

curl -sS 'http://127.0.0.1:6061/container/jvm-metrics?namespace=jvm-spike' | jq '.[] | select(.pod|startswith("java17-maxram-only"))'

Response (full object):

{
  "container": "app",
  "container_id": "c9fefdee55607a540f6174fab2058e6775d3de5db3da427ce896c7e1f080b1fd",
  "flag_sources": {
    "max_ram_percentage": "cmdline",
    "use_container_support": "cmdline"
  },
  "flags_extracted": {
    "max_ram_percentage": 40,
    "use_container_support": true
  },
  "gc_time_seconds_total": {
    "Serial full collection pauses": 0.000466714,
    "Serial young collection pauses": 0.000358171
  },
  "heap_max_size_bytes": 216006656,
  "heap_size_bytes": 16912384,
  "heap_used_bytes": 0,
  "java_command": "DummyMain",
  "java_version": "17.0.19",
  "namespace": "jvm-spike",
  "node_name": "jvm-spike-control-plane",
  "pid_host": 30077,
  "pid_ns": 1,
  "pod": "java17-maxram-only-7794b6cf4f-gvhjf",
  "raw_cmdline": "java -XX:MaxRAMPercentage=40 -XX:+UseContainerSupport -cp /tmp DummyMain",
  "safepoint_sync_time_seconds_total": 1.3291e-05,
  "safepoint_time_seconds_total": 0.000893928,
  "timestamp": "2026-05-25T18:01:06.828589447Z"
}

Key assertions visible in response:

  • flags_extracted.max_ram_percentage=40
  • flag_sources.max_ram_percentage="cmdline"

Case D: baseline java-sleeper (cmdline + percent)

Workload: jvm-spike/deploy/java-sleeper

Request:

curl -sS 'http://127.0.0.1:6061/container/jvm-metrics?namespace=jvm-spike' | jq '.[] | select(.pod|startswith("java-sleeper"))'

Response (full object):

{
  "container": "app",
  "container_id": "2b31ba56f3a6a14491cd381a224c5c02f05fae85aa4c6dae7685edcd090aa880",
  "flag_sources": {
    "max_ram_percentage": "cmdline",
    "use_container_support": "cmdline",
    "xms_bytes": "cmdline",
    "xmx_bytes": "cmdline"
  },
  "flags_extracted": {
    "max_ram_percentage": 70,
    "use_container_support": true,
    "xms_bytes": 67108864,
    "xmx_bytes": 268435456
  },
  "gc_time_seconds_total": {
    "Serial full collection pauses": 0.001146047,
    "Serial young collection pauses": 0.000271418
  },
  "heap_max_size_bytes": 268435456,
  "heap_size_bytes": 134287360,
  "heap_used_bytes": 0,
  "java_command": "DummyMain",
  "java_version": "21.0.11",
  "namespace": "jvm-spike",
  "node_name": "jvm-spike-control-plane",
  "pid_host": 2326,
  "pid_ns": 1,
  "pod": "java-sleeper-5d444487b8-mv48f",
  "raw_cmdline": "java -Xms64m -Xmx256m -XX:MaxRAMPercentage=70 -XX:+UseContainerSupport -cp /tmp DummyMain  -Dexample.tool.options=true",
  "safepoint_sync_time_seconds_total": 1.0959e-05,
  "safepoint_time_seconds_total": 0.001670841,
  "timestamp": "2026-05-25T18:01:06.828335777Z"
}

CI / Integration test coverage

Adds a Kind workflow that deploys the same JVM workload matrix and asserts:

  • java8-xmx-only: xmx_bytes == 201326592
  • java11-tool-options: xmx/xms/max_ram_percentage extracted and flag_sources mentions JAVA_TOOL_OPTIONS
  • java17-maxram-only: max_ram_percentage == 40
  • not-java: absent from JVM metrics output

Workflow: .github/workflows/jvm-metrics-kind-test.yml
Fixture: test/fixtures/jvm-apps.yaml
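The four assertions can be expressed as a single check over the decoded records. The sketch below is illustrative only: the workflow file itself is not shown in this description and may implement the checks differently, and the record type and checkMatrix function are hypothetical. It reads "flag_sources mentions JAVA_TOOL_OPTIONS" as requiring that source for each of the three extracted flags, which is an interpretation.

```go
package main

import (
	"fmt"
	"strings"
)

// record is a simplified view of one /container/jvm-metrics entry.
type record struct {
	Pod     string
	Flags   map[string]float64
	Sources map[string]string
}

// checkMatrix returns one message per failed assertion; empty means success.
func checkMatrix(recs []record) []string {
	var failures []string
	find := func(prefix string) *record {
		for i := range recs {
			if strings.HasPrefix(recs[i].Pod, prefix) {
				return &recs[i]
			}
		}
		return nil
	}
	if r := find("java8-xmx-only"); r == nil {
		failures = append(failures, "java8-xmx-only: missing")
	} else if r.Flags["xmx_bytes"] != 201326592 {
		failures = append(failures, "java8-xmx-only: xmx_bytes != 201326592")
	}
	if r := find("java11-tool-options"); r == nil {
		failures = append(failures, "java11-tool-options: missing")
	} else {
		for _, k := range []string{"xmx_bytes", "xms_bytes", "max_ram_percentage"} {
			if _, ok := r.Flags[k]; !ok || r.Sources[k] != "JAVA_TOOL_OPTIONS" {
				failures = append(failures, fmt.Sprintf("java11-tool-options: %s not extracted from JAVA_TOOL_OPTIONS", k))
			}
		}
	}
	if r := find("java17-maxram-only"); r == nil {
		failures = append(failures, "java17-maxram-only: missing")
	} else if r.Flags["max_ram_percentage"] != 40 {
		failures = append(failures, "java17-maxram-only: max_ram_percentage != 40")
	}
	if find("not-java") != nil {
		failures = append(failures, "not-java: unexpectedly present")
	}
	return failures
}
```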


Security/ops notes

  • JVM metrics mode requires elevated permissions (hostPID + root + CAP_SYS_PTRACE), so it is gated behind jvmMetrics.enabled, which defaults to false.

Debot MacMini1 added 9 commits May 24, 2026 09:41
- Add nodemon /container/jvm-metrics endpoint backed by hsperfdata
- Extract heap used/size/max + GC/safepoint counters
- Extract flags from cmdline + JAVA_TOOL_OPTIONS/JDK_JAVA_OPTIONS/JAVA_OPTS with source attribution
- Guard permissions behind jvmMetrics.enabled (hostPID + runAsUser=0 + SYS_PTRACE)
- Integrate JVM metrics into container resource collector (mirrors GPU nodemon flow)
- Add Kind workflow test covering JVM app matrix
Comment thread internal/nodemon/jvm_collector.go Outdated
Comment thread internal/nodemon/jvm_discovery.go Outdated
Comment thread internal/nodemon/jvm_discovery.go Outdated
debot-macmini1 (Author) commented

Update:

  • ✅ Percentage-based heap sizing is covered:

    • java17-maxram-only uses -XX:MaxRAMPercentage=40 (cmdline)
    • java11-tool-options includes -XX:MaxRAMPercentage=65 via JAVA_TOOL_OPTIONS
    • Kind workflow asserts max_ram_percentage=40 for java17.
  • ✅ Addressed Gitar findings + CI lint:

    • Safepoint counters are ms-based; now converted ms→seconds (no HRT freq division)
    • Added cgroupv2 bare 64-hex container-id fallback match
    • Discovery now uses isJavaProcess(comm, cmdline) (covers comm != java)
    • golangci-lint (only-new-issues) clean locally: golangci-lint run --new-from-rev=origin/main
  • 🧹 Removed temporary build marker plumbing (JVM_METRICS_BUILD_MARKER + chart values + workflow set).

Commit: f076b1e

gitar-bot Bot commented May 25, 2026

Code Review ✅ Approved 3 resolved / 3 findings

Implements JVM metrics scraping via hsperfdata to expose heap, GC, and flag data without requiring JMX or javaagents. Resolved issues include incorrect safepoint time units, cgroup v2 regex path matching, and dead code removal.

✅ 3 resolved
Bug: Safepoint times divided by wrong frequency (ns vs ms)

📄 internal/nodemon/jvm_collector.go:195-198
In HotSpot's hsperfdata, sun.rt.safepointTime and sun.rt.safepointSyncTime are stored in milliseconds, not in high-resolution timer ticks. The GC collector times (sun.gc.collector.N.time) use HRT ticks, but safepoint counters use milliseconds. Dividing by freq (typically 1,000,000,000) produces near-zero values instead of meaningful seconds.
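To make the unit mismatch concrete, the sketch below contrasts the two conversions, taking the units stated in this finding as given. The function names are hypothetical; the frequency argument stands for the timer frequency the finding calls freq (HotSpot publishes it as the sun.os.hrt.frequency counter).

```go
package main

// ticksToSeconds converts a high-resolution timer tick count to seconds.
// freqHz is the timer frequency in ticks per second.
func ticksToSeconds(ticks, freqHz int64) float64 {
	if freqHz <= 0 {
		return 0 // avoid dividing by a missing or invalid frequency
	}
	return float64(ticks) / float64(freqHz)
}

// millisToSeconds converts a millisecond counter to seconds.
func millisToSeconds(ms int64) float64 {
	return float64(ms) / 1000
}
```

With a frequency of 1,000,000,000, a counter value of 12 read as milliseconds is 0.012 s, while the same value pushed through ticksToSeconds is 0.000000012 s, which is the near-zero result the finding describes.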

Edge Case: Cgroup v2 bare container ID paths not matched by regex

📄 internal/nodemon/jvm_discovery.go:17-21 📄 internal/nodemon/jvm_discovery.go:119-131
The container ID regexes require a .scope suffix (e.g., cri-containerd-<id>.scope). On many Kubernetes clusters using cgroupv2 with containerd, the cgroup path is /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/…/<container-id> — a bare 64-char hex string without any prefix or .scope suffix. Java processes in such environments will be silently skipped because parseCgroupContainerID won't match.
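One way to add the fallback is to try the scoped pattern first and then accept a bare 64-character hex path segment at the end of the cgroup path. The sketch below is an assumption-laden illustration: only the function name parseCgroupContainerID comes from this review, while the regexes, the runtime prefixes, and the signature are hypothetical.

```go
package main

import "regexp"

var (
	// Scoped form, e.g. cri-containerd-<id>.scope (prefix list is an assumption).
	scopedID = regexp.MustCompile(`(?:cri-containerd-|crio-|docker-)([0-9a-f]{64})\.scope`)
	// Bare form: the final path segment is exactly 64 lowercase hex characters.
	bareID = regexp.MustCompile(`(?:^|/)([0-9a-f]{64})\s*$`)
)

// parseCgroupContainerID extracts a container ID from one line of
// /proc/<pid>/cgroup, returning false when no ID is present.
func parseCgroupContainerID(line string) (string, bool) {
	if m := scopedID.FindStringSubmatch(line); m != nil {
		return m[1], true
	}
	if m := bareID.FindStringSubmatch(line); m != nil {
		return m[1], true
	}
	return "", false
}
```

Anchoring the bare pattern to the end of the path keeps pod UIDs and other shorter hex segments from matching.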

Quality: Unused isJavaProcess function (dead code)

📄 internal/nodemon/jvm_discovery.go:63-65 📄 internal/nodemon/jvm_discovery.go:107-116
The function isJavaProcess (lines 107-116) is defined and tested but never called. The discovery loop at line 63 directly checks comm != "java" instead of using this function. This means JVMs started via a full path (e.g., /usr/lib/jvm/java-21/bin/java) with a non-java comm will be missed, and the helper exists to handle that case but isn't wired in.
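A helper of this kind might look like the sketch below. Only the name isJavaProcess and its two inputs come from this review; the parameter types and the matching rule (basename of argv[0]) are assumptions, so the PR's actual implementation may differ.

```go
package main

import (
	"path/filepath"
	"strings"
)

// isJavaProcess reports whether a process looks like a JVM launcher.
// comm is the content of /proc/<pid>/comm; cmdline is /proc/<pid>/cmdline
// already split on NUL bytes.
func isJavaProcess(comm string, cmdline []string) bool {
	// Fast path: the kernel-reported command name is "java".
	if strings.TrimSpace(comm) == "java" {
		return true
	}
	// Fallback: argv[0] ends in "java", e.g. /usr/lib/jvm/java-21/bin/java.
	if len(cmdline) == 0 {
		return false
	}
	return filepath.Base(cmdline[0]) == "java"
}
```

In the discovery loop, the direct comm != "java" comparison would be replaced by a call to this helper.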
