feat: add spec.traces and spec.logs for OTel forwarding by cooktheryan · Pull Request #15 · redhat-et/claw-operator

cooktheryan · 2026-06-16T19:17:21Z

Summary

Adds spec.traces and spec.logs as top-level CRD fields alongside the existing spec.metrics, enabling OpenTelemetry trace and log forwarding through the OTel Collector sidecar to external OTLP endpoints
Refactors the static OTel Collector config into a dynamic builder that generates pipelines per signal (metrics/traces/logs), supporting split endpoints for traces vs logs
Auto-installs @openclaw/diagnostics-otel and @openclaw/diagnostics-prometheus plugins when the corresponding signals are enabled
Injects OTel resource attributes (OTEL_SERVICE_NAME, service.instance.id, k8s.namespace.name) via Downward API for multi-instance discrimination
Configures parentbased_traceidratio sampler with user-configurable sampling ratio

Example CR

spec:
  metrics:
    enabled: true
  traces:
    enabled: true
    endpoint: "http://otel-collector.observability.svc:4318"
    samplingRatio: "0.1"
  logs:
    enabled: true
    # endpoint defaults to traces endpoint when omitted

Test plan

make generate — deepcopy generated for new TracesSpec and LogsSpec types
make manifests — CRD schema includes traces and logs fields
make lint — 0 issues (extracted injectObservabilityResources to stay under cyclomatic complexity limit)
make test — all tests pass including new diagnostics tests
Deploy to cluster with traces endpoint pointed at an OTel Collector, verify spans arrive
Deploy with both metrics and traces, verify sidecar serves Prometheus on :9464 and forwards OTLP

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Added OpenTelemetry (OTel) trace forwarding with a configurable sampling ratio.
- Added OTel log forwarding with an optional endpoint (logs default to the traces endpoint when not explicitly set).
- Enabled automatic OTel diagnostics sidecar injection when traces/logs/metrics are requested.
- Automatically manages network policy egress rules needed for external telemetry endpoints.
Quality Improvements
- Expanded end-to-end diagnostics coverage to validate traces, logs, metrics, generated configuration, environment variables, and network policy behavior.

Add traces and logs as top-level spec fields alongside metrics, enabling OpenTelemetry trace and log forwarding through the existing OTel Collector sidecar to external OTLP endpoints. Auto-installs diagnostics plugins and injects OTel resource attributes for multi-instance discrimination. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-06-16T19:17:38Z

Warning

Review limit reached

@cooktheryan, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 41 minutes and 32 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 38f08c31-e4b8-4c77-bf76-ce7d9ab15089

📥 Commits

Reviewing files that changed from the base of the PR and between f2e8a72 and c634562.

📒 Files selected for processing (2)

.github/workflows/publish-images.yml
.github/workflows/test-kind.yml

📝 Walkthrough

Walkthrough

Adds OTLP-based traces and logs forwarding to the Claw operator. New TracesSpec and LogsSpec types extend ClawSpec with optional endpoints and sampling configuration. A new claw_diagnostics.go module generates OTel Collector YAML, injects gateway environment variables with Kubernetes downward API, and manages NetworkPolicy egress rules for telemetry endpoints. The reconciler, metrics module, and plugin system are updated to call these helpers. Unit tests and a Kind-based E2E CI workflow validate the full observability pipeline.

Changes

OTel Traces and Logs Observability

Layer / File(s)	Summary
API types and CRD schema `api/v1alpha1/claw_types.go`, `api/v1alpha1/zz_generated.deepcopy.go`, `config/crd/bases/claw.sandbox.redhat.com_claws.yaml`	Adds exported `TracesSpec` (enabled, endpoint, samplingRatio with regex constraint `^(0(\.\d+)?\|1(\.0+)?)$`) and `LogsSpec` (enabled, endpoint) to `ClawSpec`. Updates CRD OpenAPI schema and generates deepcopy methods for both types.
Diagnostics implementation: helpers and resource injection `internal/controller/claw_diagnostics.go`	Implements signal enablement checks (`isTracesEnabled`, `isLogsEnabled`), endpoint resolution with fallback (logs default to traces), sampling ratio default (`"1"`), `otelSidecarNeeded` predicate, and three injection paths: `injectDiagnosticsConfig` to set `diagnostics.otel` in operator.json, `injectOTelEnvVars` to inject downward-API and sampler env vars into the gateway Deployment, `injectTelemetryEgressRules` to append TCP egress rules to the gateway NetworkPolicy, and `buildCollectorConfig` to dynamically generate OTel Collector YAML conditionally enabling metrics/traces/logs pipelines.
Diagnostics unit tests `internal/controller/claw_diagnostics_test.go`	Comprehensive test coverage for helper getters, config injection with endpoint and flag validation, env var injection with downward-API fieldRefs and trace sampler settings, collector config generation for all signal combinations and endpoint variants, NetworkPolicy egress rule injection with deduplication and endpoint classification, and diagnostics plugin selection logic.
Reconciler and metrics system wiring `internal/controller/claw_resource_controller.go`, `internal/controller/claw_metrics.go`, `internal/controller/claw_plugins.go`	`enrichConfigAndNetworkPolicy` calls `injectDiagnosticsConfig` after metrics enrichment; `configureDeployments` updates sidecar gate to `otelSidecarNeeded`, calls `injectObservabilityResources` and `injectOTelEnvVars`; `injectOTelCollectorConfig` delegates to `buildCollectorConfig`; `effectivePlugins` appends `requiredDiagnosticsPlugins` to implicit plugin list; `requiredDiagnosticsPlugins` returns OTEL/Prometheus diagnostics plugins based on enabled signals.
E2E Kind-based diagnostics workflow `.github/workflows/test-diagnostics.yml`	GitHub Actions workflow provisioning a Kind cluster, building and loading operator/proxy images, installing cert-manager and the operator, deploying a mock OTLP Collector backend, applying a `Claw` CR with traces/logs/metrics enabled, validating sidecar injection and collector config structure, confirming env vars and Prometheus port exposure, testing split-exporter configuration patch, testing external endpoint NetworkPolicy patch, with comprehensive failure-only debug collection and unconditional cluster teardown.

Sequence Diagram(s)

sequenceDiagram
  participant reconciler as Reconciler
  participant inject as injectObservabilityResources
  participant builder as buildCollectorConfig
  participant diagCfg as injectDiagnosticsConfig
  participant envVars as injectOTelEnvVars
  participant netpol as injectTelemetryEgressRules

  reconciler->>inject: objects, instance
  inject->>builder: generate OTel Collector YAML
  inject->>diagCfg: set diagnostics.otel flags and endpoint
  inject->>netpol: append TCP egress rules for endpoints
  reconciler->>envVars: inject OTEL_* and trace sampler env vars

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 29.63% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'feat: add spec.traces and spec.logs for OTel forwarding' accurately and concisely summarizes the main change: adding new CRD fields for OpenTelemetry trace and log forwarding configuration.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/diagnostics-otel

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

cooktheryan · 2026-06-16T19:26:09Z

/hold until i can run through code rabbit

cooktheryan · 2026-06-17T12:53:43Z

@coderabbitai review

coderabbitai · 2026-06-17T12:53:50Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Rename the deprecated `otlphttp` exporter alias to `otlp_http` in the OTel Collector config to suppress warnings on collector 0.152.1+. Add a dedicated GitHub Actions workflow that validates the full OTel diagnostics stack (traces, logs, metrics) on a Kind cluster. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/test-diagnostics.yml:
- Around line 31-36: The Kind installation step downloads from the `/dl/latest/`
endpoint in the curl command, which is not reproducible as the latest version
can change over time. Replace the `/dl/latest/` path with a specific pinned
version number (e.g., `/dl/v0.20.0/`) in the curl command that downloads Kind
from kind.sigs.k8s.io to ensure consistent and reproducible test runs.
- Around line 11-22: Add a top-level permissions block to the workflow that
explicitly defines the minimum required GITHUB_TOKEN scopes. Currently, the
workflow lacks a permissions declaration which results in using default
permissions that are broader than necessary, violating the principle of least
privilege. Place the permissions block at the same level as the jobs key to
restrict token access to only the specific permissions needed for this OTel
Diagnostics workflow (typically read or none for workflows that only run tests
without deploying or modifying repository content).
- Line 24: The test-diagnostics.yml workflow uses semantic version tags (like
`@v4`) for GitHub Actions references instead of commit hashes, which is
inconsistent with the pinning approach used in publish-images.yml. Update all
uses statements in the test-diagnostics.yml workflow to pin actions to their
full commit hashes instead of version tags, following the same pattern already
established in publish-images.yml for supply-chain hardening and consistency
across the repository.

In `@api/v1alpha1/claw_types.go`:
- Around line 699-707: The TracesSpec and LogsSpec fields in the claw_types.go
file need CRD-level validation to prevent silent misconfiguration. Add
kubebuilder validation markers (such as +kubebuilder:validation:XValidationRule)
to enforce that when Traces or Logs specs have enabled set to true, the
corresponding endpoint fields must be explicitly configured. This ensures the
validation happens at admission time rather than silently failing during
collector configuration, preventing users from creating invalid resource
combinations where endpoints would not be wired for export.

In `@internal/controller/claw_diagnostics.go`:
- Around line 276-282: The isExternalEndpoint function suppresses errors from
classifyServiceURL by returning false on failure, which hides configuration
issues and silently skips egress rule injection. Modify isExternalEndpoint to
return both a boolean and an error instead of only a boolean, then propagate any
error returned by classifyServiceURL up through the call chain via
collectTelemetryEndpoints and finally through injectTelemetryEgressRules so that
reconciliation and status updates reflect misconfiguration issues immediately.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 2ab0c029-6158-4696-ad50-d544ea85e162

📥 Commits

Reviewing files that changed from the base of the PR and between f8c18ad and ad5b275.

📒 Files selected for processing (9)

.github/workflows/test-diagnostics.yml
api/v1alpha1/claw_types.go
api/v1alpha1/zz_generated.deepcopy.go
config/crd/bases/claw.sandbox.redhat.com_claws.yaml
internal/controller/claw_diagnostics.go
internal/controller/claw_diagnostics_test.go
internal/controller/claw_metrics.go
internal/controller/claw_plugins.go
internal/controller/claw_resource_controller.go

coderabbitai · 2026-06-17T14:31:38Z

+	// Traces configures distributed tracing via OTLP, forwarded through the
+	// OTel Collector sidecar to an external OTLP endpoint.
+	// +optional
+	Traces *TracesSpec `json:"traces,omitempty"`
+
+	// Logs configures log forwarding via OTLP, forwarded through the
+	// OTel Collector sidecar to an external OTLP endpoint.
+	// +optional
+	Logs *LogsSpec `json:"logs,omitempty"`


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add CRD-level validation for enabled traces/logs endpoint combinations.

enabled: true currently permits endpoint-less specs, but the collector config path only wires trace/log export when endpoints resolve. That creates a silent misconfiguration path. Reject invalid combinations at admission time.

Suggested guardrails

+// +kubebuilder:validation:XValidation:rule="!has(self.traces) || !self.traces.enabled || self.traces.endpoint != ''",message="spec.traces.endpoint is required when spec.traces.enabled is true" +// +kubebuilder:validation:XValidation:rule="!has(self.logs) || !self.logs.enabled || self.logs.endpoint != '' || (has(self.traces) && self.traces.enabled && self.traces.endpoint != '')",message="spec.logs.endpoint is required when spec.logs.enabled is true unless spec.traces.endpoint is set" type ClawSpec struct {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/v1alpha1/claw_types.go` around lines 699 - 707, The TracesSpec and LogsSpec fields in the claw_types.go file need CRD-level validation to prevent silent misconfiguration. Add kubebuilder validation markers (such as +kubebuilder:validation:XValidationRule) to enforce that when Traces or Logs specs have enabled set to true, the corresponding endpoint fields must be explicitly configured. This ensures the validation happens at admission time rather than silently failing during collector configuration, preventing users from creating invalid resource combinations where endpoints would not be wired for export.

coderabbitai · 2026-06-17T14:31:38Z

+func isExternalEndpoint(rawURL string, instance *clawv1alpha1.Claw) bool {
+	target, err := classifyServiceURL(rawURL, instance.Namespace)
+	if err != nil {
+		return false
+	}
+	return target.External
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t suppress telemetry endpoint classification errors.

Returning false on classification failure hides invalid endpoint configs and silently skips egress rule injection. Bubble this error up so reconcile/status reflects misconfiguration immediately.

Minimal direction

-func isExternalEndpoint(rawURL string, instance *clawv1alpha1.Claw) bool { +func isExternalEndpoint(rawURL string, instance *clawv1alpha1.Claw) (bool, error) { target, err := classifyServiceURL(rawURL, instance.Namespace) if err != nil { - return false + return false, err } - return target.External + return target.External, nil }

Then propagate the error through collectTelemetryEndpoints → injectTelemetryEgressRules.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/claw_diagnostics.go` around lines 276 - 282, The isExternalEndpoint function suppresses errors from classifyServiceURL by returning false on failure, which hides configuration issues and silently skips egress rule injection. Modify isExternalEndpoint to return both a boolean and an error instead of only a boolean, then propagate any error returned by classifyServiceURL up through the call chain via collectTelemetryEndpoints and finally through injectTelemetryEgressRules so that reconciliation and status updates reflect misconfiguration issues immediately.

- Fix injectTelemetryEgressRules to handle in-cluster endpoints using classifyServiceURL/buildInClusterEgressRule (was silently blocking) - Rename workflow to "kind test" - Use quay.io/redhat-et/ image names for Kind compatibility - Add IMAGE_PULL_POLICY=IfNotPresent for Kind deploys - Enhance debug logging: dump container logs, configmap, kubectl get all Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ntain permissions' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

- Add permissions: contents: read to limit GITHUB_TOKEN scope - Add persist-credentials: false on checkout - Pin Kind to v0.32.0 for reproducible builds Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Send test OTLP data to the sidecar and verify it flows through: - Traces: sidecar receives span, backend logs it via debug exporter - Logs: sidecar receives log record, backend logs it - Metrics: sidecar receives OTLP metric, Prometheus endpoint serves it Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

These tested the OTel Collector generically, not OpenClaw's actual telemetry output. Keep the operator artifact validations (sidecar, config, env vars, NP rules) which are what this test is for. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge publish-images.yml into test-kind.yml so images are only pushed to quay.io after successful Kind testing and only on merge to main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Align test job action references with publish job for supply-chain hardening consistency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-advanced-security AI found potential problems Jun 17, 2026

View reviewed changes

Comment thread .github/workflows/test-diagnostics.yml Fixed

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

cooktheryan and others added 2 commits June 17, 2026 10:35

Merge remote-tracking branch 'origin/main' into feat/diagnostics-otel

295f7b2

github-advanced-security AI found potential problems Jun 17, 2026

View reviewed changes

Comment thread .github/workflows/test-kind.yml Fixed

cooktheryan and others added 7 commits June 17, 2026 11:03

Potential fix for pull request finding 'CodeQL / Workflow does not co…

a82fac2

…ntain permissions' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

chore: address PR feedback on workflow security

d3e6ac6

- Add permissions: contents: read to limit GITHUB_TOKEN scope - Add persist-credentials: false on checkout - Pin Kind to v0.32.0 for reproducible builds Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: combine kind test and image publish into single CI workflow

d79b35a

Merge publish-images.yml into test-kind.yml so images are only pushed to quay.io after successful Kind testing and only on merge to main. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: pin actions to commit hashes in test job

c634562

Align test job action references with publish job for supply-chain hardening consistency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chore: remove workflow changes (moved to separate PR)

bdfe03f

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cooktheryan mentioned this pull request Jun 17, 2026

ci: combine kind test and image publish into single workflow #18

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add spec.traces and spec.logs for OTel forwarding#15

feat: add spec.traces and spec.logs for OTel forwarding#15
cooktheryan wants to merge 11 commits into
mainfrom
feat/diagnostics-otel

cooktheryan commented Jun 16, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

cooktheryan commented Jun 16, 2026

Uh oh!

cooktheryan commented Jun 17, 2026

Uh oh!

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 17, 2026

Uh oh!

coderabbitai Bot Jun 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cooktheryan commented Jun 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Example CR

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

cooktheryan commented Jun 16, 2026

Uh oh!

cooktheryan commented Jun 17, 2026

Uh oh!

coderabbitai Bot commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cooktheryan commented Jun 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 16, 2026 •

edited

Loading

coderabbitai Bot commented Jun 17, 2026 •

edited

Loading