Observe a running node as rosgraph_msgs/Node via ros2 nodl describe #83

Open: lsy3 wants to merge 17 commits into ros-tooling:main from lsy3:lsy3/observe

lsy3 commented Jun 7, 2026

Implements "Observe" (#68): observe a running node → its runtime rosgraph_msgs/Node, published latched on /nodl/observed_node and rendered by ros2 nodl describe. The Node message is the contract that "Describe" (#53) consumes. Validated across humble, jazzy, kilted, lyrical, rolling × {fastrtps, cyclonedds, zenoh} with gtest unit + pytest integration layers.

Disclosure: developed with the assistance of Claude (Claude Code). I've reviewed the changes.

Important

Updated per the WG review: nodl_observe is now C++ (ament_cmake), not Python, covering review points #2-#5 (Humble support, MCAP fixtures, C++ reimpl, one-container RMW testing). The Python impl is kept locally as an untracked reference; the verb keeps its CLI and shells out to a C++ observe binary.

CLI

ros2 nodl describe NODE_NAME [--timeout SEC] [--no-params] [--topic NAME] [-o FILE]

  • NODE_NAME: fully qualified node name (hidden nodes work).
  • --timeout SEC: default 5.0; one ceiling shared by discovery and parameter collection.
  • --no-params: skips parameter collection, the only part of observation that contacts the target.
  • --topic NAME: default /nodl/observed_node.
  • -o FILE: format inferred from .yaml/.yml/.json; without it, YAML goes to stdout.

Output is the rendered Node, not yet a NoDL document; that is #53, behind this same verb.
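
A minimal sketch of the -o extension rule, for illustration only (this is not the verb's code). The name `infer_output_format`, the case-insensitive match, and rejecting unknown extensions with `ValueError` are assumptions made here; the PR only states that the format is inferred from .yaml/.yml/.json and that the default is YAML on stdout.

```python
from pathlib import Path

# Extension -> output format, following the CLI description above.
_FORMATS = {".yaml": "yaml", ".yml": "yaml", ".json": "json"}


def infer_output_format(output=None):
    """Return 'yaml' or 'json' for an -o target (hypothetical helper).

    output=None models "no -o given", i.e. YAML on stdout.
    """
    if output is None:
        return "yaml"
    suffix = Path(output).suffix.lower()  # assumption: match case-insensitively
    if suffix not in _FORMATS:
        # Assumption: unknown extensions are an error rather than a YAML fallback.
        raise ValueError(f"cannot infer output format from {output!r}")
    return _FORMATS[suffix]
```

With this sketch, `infer_output_format("node.json")` selects the JSON renderer and `infer_output_format()` falls back to YAML.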

Architecture

  • Library-first: nodl_observe::observe_node(node, fqn, opts) uses a caller-provided node and never spins its own; pure builders underneath, so graph-monitor can link the library directly. An observe executable wraps it (observe + latch-publish).
  • Verb ↔ C++ = shell-out: ros2 nodl describe spawns the binary, receives the latched Node, and renders it with rosidl_runtime_py. The serialized message is the only language boundary (no pybind).

Design decisions

  • Observe records, Describe interprets: all endpoints are reported unfiltered (/rosout, param services, …); filtering is left to "Describe" (#53). Schema exception: <action>/_action/* constituents fold into their Action; orphans stay flat.
  • Actions drop to the rcl_action C API — no rclcpp_action wrapper for the action graph (issue below).
  • Infinite/overflow QoS durations → {INT32_MAX, 0} uniformly — the rmw sentinel overflows Duration.sec (int32) and won't round-trip CDR/MCAP; this is valid, cross-distro-identical, and fixes Humble.
  • Service QoS → *_UNKNOWN — no info-by-service API in rclcpp/rmw; honest-unknown over plausible-wrong. @emersonknapp input welcome.
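
The duration rule above can be sketched as follows. This is an illustration in Python, not the C++ builder from this PR; `clamp_duration` is a hypothetical name, and the sketch only handles the overflow direction described here (a seconds value too large for the int32 Duration.sec field).

```python
INT32_MAX = 2**31 - 1  # largest value the int32 Duration.sec field can hold


def clamp_duration(sec, nanosec):
    """Map an overflowing QoS duration to the canonical (INT32_MAX, 0).

    Durations that already fit are returned unchanged, so finite
    deadlines and lifespans round-trip as-is.
    """
    if sec > INT32_MAX:
        # Infinite/overflow sentinel: drop the nanoseconds too, so every
        # distro and RMW serializes the same pair.
        return (INT32_MAX, 0)
    return (sec, nanosec)
```

Zeroing the nanoseconds along with the seconds is what makes the result identical across distros; clamping only the seconds would leave sentinel-specific nanoseconds in the fixture.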

Humble (pre-Iron): supported + tested

Message-identical to Iron+; the REP-2011 type hash and BEST_AVAILABLE enum are compiled out (ROS2_<DISTRO>), so on Humble the type hash is left unset (like services) — differences live only in unfilled fields. Built and observation-tested with its own fixtures, no longer gated out.

Testing

  • Unit (gtest) — QoS mapping (incl. duration clamp), endpoint building + type hash, action folding incl. orphans, parameter pairing, FQN split, and the parameter-collection degradation path (absent target → empty, never throws).
  • Integration (pytest) — 3 scenario graphs (minimal / full-surface / multi-node isolation) observed via the observe binary, compared field-by-field vs committed MCAP fixtures. The can't-observe limits are test-locked (services created with explicit QoS still read *_UNKNOWN).
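
For readers new to the folding rule that the unit tests cover, here is a small Python sketch. It assumes that an "orphan" is an `_action/*` endpoint whose action prefix is not in the set of actions reported by the graph; that reading, and the name `fold_actions`, are assumptions of this sketch rather than the C++ implementation.

```python
_MARKER = "/_action/"


def fold_actions(endpoint_names, action_names):
    """Split endpoints into per-action constituents and flat endpoints.

    Returns (actions, flat): actions maps each known action name to the
    leaf names of its constituents; flat keeps everything else, including
    orphaned _action/* endpoints with no matching action.
    """
    actions = {name: [] for name in action_names}
    flat = []
    for endpoint in endpoint_names:
        prefix, sep, leaf = endpoint.rpartition(_MARKER)
        if sep and prefix in actions:
            actions[prefix].append(leaf)  # constituent of a known action
        else:
            flat.append(endpoint)  # ordinary endpoint or orphan
    return actions, flat
```

For example, with only /fib known as an action, /fib/_action/feedback folds into /fib while /ghost/_action/status stays in the flat list next to /rosout.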

MCAP fixtures (replacing the YAML goldens): one .mcap per (distro, RMW), resolved most-specific-first (<distro>_<rmw> → <rmw> → base), collapsing to 5 files:

| Fixture | Covers |
| --- | --- |
| base.mcap | full observation: jazzy/kilted/lyrical/rolling, RMWs that propagate fully |
| rmw_cyclonedds_cpp.mcap | cyclonedds reports KEEP_ALL depth as 0 (every distro) |
| jazzy_rmw_fastrtps_cpp.mcap | jazzy's older fastrtps drops history/depth |
| humble_rmw_{fastrtps,cyclonedds}_cpp.mcap | Humble (no type hash) |

test/mcap_fixtures.py print|diff is a human-readable viewer; regen with REGEN_FIXTURES=1. Everything other than history/depth (reliability, durability, deadline, type name, RIHS hash, structure, folding, service *_UNKNOWN) is byte-identical, so kilted/lyrical/rolling resolve to base.mcap with no override. zenoh is Iron+ only (no Humble package); its CI leg starts rmw_zenohd.
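
The most-specific-first lookup can be sketched like this. It is an illustration of the <distro>_<rmw> → <rmw> → base order using the file names from the table; `resolve_fixture` and the flat directory layout are assumptions, not the test harness's actual code.

```python
from pathlib import Path


def resolve_fixture(fixture_dir, distro, rmw):
    """Return the most specific fixture that exists for (distro, rmw).

    Search order: <distro>_<rmw>.mcap, then <rmw>.mcap, then base.mcap.
    """
    for stem in (f"{distro}_{rmw}", rmw, "base"):
        candidate = Path(fixture_dir) / f"{stem}.mcap"
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no fixture for {distro}/{rmw} in {fixture_dir}")
```

Under this scheme, kilted with rmw_fastrtps_cpp has no override and lands on base.mcap, while jazzy with rmw_fastrtps_cpp picks up jazzy_rmw_fastrtps_cpp.mcap.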

Note

In C++ a missing rosgraph_msgs/Node.msg is a build failure, not a silent skip — so requires_node_msg is gone. CI still best-effort pulls the message from ros2-testing and installs mcap so the comparison actually runs.

Upstream issues to file

  1. No service-QoS introspection — no rmw_get_*_info_by_service on any RMW.
  2. rcl_action_*_get_names_and_types[_by_node] not wrapped in rclcpp_action — request parity.

Closes #68

read-the-docs-community Bot commented Jun 7, 2026

Documentation build overview

📚 nodl | 🛠️ Build #33224056 | 📁 Comparing 043df91 against latest (18bc7d5)

  🔍 Preview build  

2 files changed
± index.html
± schema.html

lsy3 added 3 commits June 7, 2026 19:29
Stage one of the Observe -> Describe pipeline (ros-tooling#68): observe_node() fills
a rosgraph_msgs/Node from the live graph - topics with actual QoS and RIHS
type hash, services with honest *_UNKNOWN QoS (not observable externally),
actions folded from their hidden _action/* constituents, and parameter
descriptors/values via the target's parameter services with graceful
degradation under a shared timeout ceiling.

Tested in three layers: pure-builder unit tests (no executor), scenario
graphs diffed against per-distro golden YAML files, and ROS-free rendering
tests fed by those goldens. The goldens double as input fixtures for
Describe (ros-tooling#53). Requires rosgraph_msgs >= 2.0.4 (Node.msg).

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
Observes the target node via nodl_observe, prints the rosgraph_msgs/Node
as YAML to stdout (or -o FILE with the format inferred from the .yaml/.yml/
.json extension), and publishes it latched (reliable, transient_local,
keep_last(1)) on /nodl/observed_node.

Publish-once semantics: delivery to currently-matched subscribers is
confirmed via wait_for_all_acked, bounded by --timeout; the latched history
dies with the publisher, so consumers must subscribe before the verb runs -
locked in by the smoke test. --no-params skips the only part of observation
that contacts the target.

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
nodl_observe's tests are distro-aware: golden files live under
test/expected/<ROS_DISTRO>/ (jazzy committed), and every test module skips
cleanly on distros whose rosgraph_msgs predates 2.0.4 (no Node.msg), so the
existing humble..rolling matrix stays green and flips tests on per distro
as the rosgraph_msgs sync lands.

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
lsy3 added 9 commits June 7, 2026 19:30
Signed-off-by: Luke Sy <sylukewicent@gmail.com>
__init__ now just re-exports the public surface (observe_node,
NodeNotFoundError, latched_qos); the graph polling and endpoint collection
move to _observe, matching the _-prefixed private-module pattern already
used for _endpoints/_parameters/_qos.  No behaviour change.

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
Adds rmw_cyclonedds_cpp alongside rmw_fastrtps_cpp in CI (distro x RMW
matrix) and proves they observe QoS differently: cyclonedds propagates a
remote endpoint's history policy and depth over discovery, fastrtps does
not (history -> UNKNOWN, depth -> 0).  Goldens are now resolved
most-specific-first under expected/<distro>/<rmw>/, falling back to a shared
expected/<distro>/ when RMWs agree, so identical sets are stored once.

Commit a single YAML golden per case (the canonical, human-readable form);
the JSON renderer is proven by an equivalence test instead of a duplicate
JSON golden.  The one RMW-divergent assertion is now keyed by RMW so each
middleware's behaviour is locked in explicitly.

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
Node.msg ships on the jazzy line (>= 2.0.4) but is not yet in kilted
(rosgraph_msgs 2.3.1 = Clock only) or current rolling, so version numbers
are not comparable across distros.  The package builds regardless; the
import guard skips observation tests where Node.msg is absent.

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
The graph messages landed in rolling first and were backported to jazzy
(the only distro shipping them in a pullable image today).  The earlier
'rolling lacks them' note was an artifact of probing the EOL Ubuntu 24.04
rolling image (frozen at 2.4.4); live rolling/lyrical run on Ubuntu 26.04
images that are not published yet.  Kilted's backport release simply has
not been cut (2.3.1 ships Clock only).

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
The RMW axis is now driven entirely by the CI matrix 'rmw:' list: the
install step derives each apt package name (rmw_x_cpp -> ros-<distro>-rmw-x-cpp),
the golden resolver is already (distro, RMW)-keyed, and the per-RMW
history-over-discovery expectation is a documented _HISTORY_OVER_DISCOVERY
map.  Adding an RMW is 'drop in goldens' -- the harness needs no per-RMW
setup (every scenario runs in one process / one session).

Also closes the silent-skip trap: observation tests importorskip when
rosgraph_msgs lacks Node.msg, which on a distro that *should* support observe
reads as green-having-tested-nothing.  A best-effort step pulls Node.msg from
ros2-testing where it leads the main index (e.g. jazzy 2.0.4), and a
matrix-gated assertion (requires_node_msg) fails the leg loudly if Node.msg
is still missing where it is required.

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
Zenoh slots in with no harness change -- scenarios run in one process / one
session, so it discovers without a router daemon.  Behaviourally it matches
cyclonedds (propagates history and depth over discovery; identical endpoint
set and type hashes to the other RMWs).  Adds its (jazzy) goldens and its
_HISTORY_OVER_DISCOVERY entry.

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
Node.msg has now landed across jazzy (2.0.4), kilted (2.3.2), lyrical
(2.4.5) and rolling (2.5.0) -- all via the best-effort ros2-testing install
where it leads the main index -- so all four are flagged requires_node_msg
and run the observation suite (humble still skips; no Node.msg yet).

Empirically, every (distro, RMW) observes the same Node *except* two gaps:
cyclonedds reports a KEEP_ALL queue's depth as 0 (every distro), and jazzy's
older fastrtps drops history/depth entirely (kilted-onward fastrtps does
propagate).  Goldens are deduplicated to match: a single _base/ holds the
full observation, with overrides only where a combination differs --
rmw_cyclonedds_cpp/s2 and jazzy/rmw_fastrtps_cpp/.  The resolver searches
<distro>/<rmw>/ -> <rmw>/ -> _base/.  Nine files now cover 4 distros x 3
RMWs x 4 scenarios (verified green on noble and resolute).

Drops the _HISTORY_OVER_DISCOVERY assertion map: history/depth is
(distro, RMW)-specific (fastrtps differs by distro), and the per-combination
golden already locks it exactly.

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
Node.msg has now reached every distro including humble (rosgraph_msgs 1.2.3
via ros2-testing), so the package gets exercised on pre-Iron rclpy for the
first time -- and it did not import there.  Two defensive fixes make it
import-safe everywhere: map the BEST_AVAILABLE QoS enum only where rclpy
defines it (added in Iron), and read TopicEndpointInfo.topic_type_hash via
getattr (REP-2011, Iron+) so an absent hash is simply left unset like a
service's.

Full observation still needs Iron+ (type hashes, BEST_AVAILABLE, and an
int32-safe infinite QoS deadline that humble's builtin_interfaces overflows
on), so the observation/rendering tests and the describe smoke tests are
capability-gated to Iron+ (BEST_AVAILABLE presence as the proxy).  Humble's
CI leg now builds, runs the pure-argument and serialization tests, and skips
the rest -- green instead of crashing.  Full pre-Iron support is tracked as
a working-group follow-up.  No change on Iron+ (jazzy 154/0).

Signed-off-by: Luke Sy <sylukewicent@gmail.com>

lsy3 commented Jun 9, 2026

Author

@emersonknapp a quick note on scope.

The feature is kept minimal and faithful to #68 — the observer and the ros2 nodl describe verb. The testing is where it's extensive: a 3 RMW × 5 distro matrix, deduplicated goldens (one _base/ plus small per-(distro, RMW) overrides), a guard that fails CI loudly rather than skipping when Node.msg is missing where required, and pre-Iron handling so Humble builds and skips cleanly.

I'd prefer to keep the tests alongside the implementation they validate, but I can split the multi-RMW/distro coverage into a follow-up if you'd rather a leaner first pass — the commits are already separated for it.

luke-alloy requested a review from emersonknapp June 9, 2026 07:05

lsy3 commented Jun 9, 2026

Author

WG review - 9 June 2026

@emersonknapp feel free to add anything I missed; below is a summary of the related points raised during the WG meeting.

  1. Service QoS → *_UNKNOWN. No info_by_service introspection in RCL on any RMW — report unknown, not a blocker.
  2. Humble stays message-identical. Pre-Iron gaps (no type hash; int32 Duration for the infinite-deadline sentinel) get a best-effort fill — no Humble-specific message changes.
  3. Fixtures → MCAP. YAML goldens (~2k lines) are past the readability point. Switch to MCAP (schemas + packed structs, smaller on disk) + a short script using the mcap lib to print/diff them human-readably. Base fixture + edge-case copies.
  4. Reimplement in C++ (rclcpp). Reusable directly in graph-monitor (no Python boundary). Observer takes a node interface in (doesn't own the node); CLI wraps a thin node. Action introspection via direct rcl_action C calls — the rclcpp wrapper gap is confirmed.
  5. Test all 3 RMWs in one container, not a CI matrix. Build once, then re-run the RMW-sensitive tests under each RMW_IMPLEMENTATION (instead of a separate rebuilding container per RMW); drop the requires_node_msg guard.
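
Point 5 (build once, re-run per RMW) can be sketched as a small driver. This is a hypothetical helper, not the CI configuration in this PR: `run_under_each_rmw` is an assumed name, and the only mechanism it relies on is the RMW_IMPLEMENTATION environment variable named above.

```python
import os
import subprocess

RMWS = ["rmw_fastrtps_cpp", "rmw_cyclonedds_cpp", "rmw_zenoh_cpp"]


def run_under_each_rmw(cmd, rmws=RMWS, base_env=None):
    """Run the same command once per RMW and collect the exit codes.

    Nothing is rebuilt between runs; only RMW_IMPLEMENTATION changes.
    """
    results = {}
    for rmw in rmws:
        # Copy the environment so one run cannot leak settings into the next.
        env = dict(os.environ if base_env is None else base_env)
        env["RMW_IMPLEMENTATION"] = rmw
        results[rmw] = subprocess.run(cmd, env=env).returncode
    return results
```

A caller would pass the RMW-sensitive test command (for instance a pytest invocation on the integration test; the exact path is not stated in this PR) and fail the job if any returned code is non-zero.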

lsy3 added 5 commits June 18, 2026 03:55
Replace the ament_python nodl_observe with an ament_cmake C++ package: a
reusable observe_node(...) library plus an `observe` executable that
latch-publishes the observed rosgraph_msgs/Node on /nodl/observed_node.
The Python implementation is kept locally as an untracked reference.
Addresses the WG review on this PR (points #2-#5 of plan_observe_cpp.md).

- Port the pure builders 1:1 (QoS enum mapping, topic/service endpoints
  with REP-2011 type hash, action folding, parameter pairing, FQN split),
  with gtest unit tests mirroring the Python tests.
- Actions use the rcl_action C API directly (no rclcpp_action wrapper).
- Parameters via AsyncParametersClient driven by a short-lived executor,
  with graceful degradation on an unresponsive target (covered by a test).
- Canonicalise infinite/overflow QoS durations to {INT32_MAX, 0} uniformly
  on every distro -- CDR-valid for MCAP and fixes Humble's int32 overflow.
- Humble (pre-Iron) is a supported, tested runtime target: the type hash
  and BEST_AVAILABLE QoS enum are compiled out via ROS2_<DISTRO>; the
  message stays structurally identical, differences live only in unfilled
  fields.
- Replace the ~2k-line YAML goldens with MCAP fixtures (one per
  (distro, RMW), most-specific-first resolver: <distro>_<rmw> -> <rmw> ->
  base) plus a human-readable print/diff helper. Verified field-for-field
  parity with the rclpy output (only the duration sentinel differs).
- Rewire `ros2 nodl describe` to shell out to the `observe` binary and
  render via rosidl_runtime_py.
- CI: one job per distro, build once, re-run the RMW-sensitive integration
  test over fastrtps/cyclonedds/zenoh; drop requires_node_msg.

Validated in Docker on humble, jazzy, kilted, and lyrical (build + gtest
unit tests + integration across each distro's RMWs).

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
- ARCHITECTURE.md: layered data-flow diagram, module table, and the
  observe_node step-by-step for contributors.
- mcap_fixtures.py: add node_to_json + 'print -f yaml|json' so the
  fixture viewer matches the verb's -o output (was YAML only).

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
…_msgs on humble

- rosidl_runtime_py was transitive via the old Python nodl_observe; the
  C++ rewrite dropped it, but the integration test (mcap_fixtures.py) and
  the describe verb still use it -> declare it (+ rclpy, ament_index_python
  for the verb, which the C++ lib no longer provides transitively).
- humble's rosgraph_msgs ships Node.msg only via ros2-testing; the bridge
  used --only-upgrade, a no-op when the package isn't pre-installed (the
  rostooling image), so rosdep then pulled the main version without Node.msg
  and the C++ build failed. Plain install + apt-mark hold instead.

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
- Bridge silently no-op'd on humble: the keyring glob ros2*archive-keyring
  missed the actual ros-archive-keyring.gpg, so '[ -n key ] || exit 0' bailed
  and rosdep then installed main rosgraph_msgs (no Node.msg). Broaden the glob
  and fall back to [trusted=yes] instead of skipping.
- Per-RMW integration steps passed the test but 'colcon test-result --all'
  defaulted --test-result-base to 'build'; the action-ros-ci workspace is
  'ros_ws/build'. Point it there.

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
The CI image (rostooling/setup-ros-docker) installs ros2cli but not the
ros2run package, so 'ros2 run rmw_zenoh_cpp rmw_zenohd' errors with
"invalid choice: 'run'"; the zenoh router never starts, cross-process
discovery fails, and every observation times out (this is why jazzy/kilted/
lyrical/rolling failed their zenoh leg while humble -- which has no zenoh --
passed).  Locate and exec the rmw_zenohd binary directly.

Validated end-to-end in the actual rostooling images: jazzy
(fastrtps/cyclonedds/zenoh all 14 passed) and humble (fastrtps/cyclonedds).

Signed-off-by: Luke Sy <sylukewicent@gmail.com>
luke-alloy marked this pull request as ready for review June 19, 2026 22:56

luke-alloy commented

@emersonknapp this is ready for review.

Recap: nodl_observe is now C++ (ament_cmake) per the WG review, covering points #2-#5 (Humble support, MCAP fixtures replacing the YAML goldens, the C++ reimplementation, and one-container 3-RMW testing). The Python implementation is kept locally as an untracked reference, and ros2 nodl describe keeps its CLI and shells out to the new observe binary. Full details are in the PR description.

CI is green across all five distros — humble, jazzy, kilted, lyrical, rolling — each over its RMWs (fastrtps / cyclonedds, plus zenoh on Iron+).

One open question I'd value your take on (you wrote Service.msg and the rmw stats shim): service QoS is reported *_UNKNOWN because there is no info-by-service API in rclcpp/rmw — is that the right call, or is there a better play? It's one of the two upstream issues noted in the description to file.

Development

Successfully merging this pull request may close these issues.

"Observe": produce the rosgraph_msgs/Node for a running node

2 participants