Skip to content

feat: add spec.traces and spec.logs for OTel forwarding#15

Open
cooktheryan wants to merge 11 commits into
mainfrom
feat/diagnostics-otel
Open

feat: add spec.traces and spec.logs for OTel forwarding#15
cooktheryan wants to merge 11 commits into
mainfrom
feat/diagnostics-otel

Conversation

@cooktheryan

@cooktheryan cooktheryan commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Adds spec.traces and spec.logs as top-level CRD fields alongside the existing spec.metrics, enabling OpenTelemetry trace and log forwarding through the OTel Collector sidecar to external OTLP endpoints
  • Refactors the static OTel Collector config into a dynamic builder that generates pipelines per signal (metrics/traces/logs), supporting split endpoints for traces vs logs
  • Auto-installs @openclaw/diagnostics-otel and @openclaw/diagnostics-prometheus plugins when the corresponding signals are enabled
  • Injects OTel resource attributes (OTEL_SERVICE_NAME, service.instance.id, k8s.namespace.name) via Downward API for multi-instance discrimination
  • Configures parentbased_traceidratio sampler with user-configurable sampling ratio

Example CR

spec:
  metrics:
    enabled: true
  traces:
    enabled: true
    endpoint: "http://otel-collector.observability.svc:4318"
    samplingRatio: "0.1"
  logs:
    enabled: true
    # endpoint defaults to traces endpoint when omitted

Test plan

  • make generate — deepcopy generated for new TracesSpec and LogsSpec types
  • make manifests — CRD schema includes traces and logs fields
  • make lint — 0 issues (extracted injectObservabilityResources to stay under cyclomatic complexity limit)
  • make test — all tests pass including new diagnostics tests
  • Deploy to cluster with traces endpoint pointed at an OTel Collector, verify spans arrive
  • Deploy with both metrics and traces, verify sidecar serves Prometheus on :9464 and forwards OTLP

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added OpenTelemetry (OTel) trace forwarding with a configurable sampling ratio.
    • Added OTel log forwarding with an optional endpoint (logs default to the traces endpoint when not explicitly set).
    • Enabled automatic OTel diagnostics sidecar injection when traces/logs/metrics are requested.
    • Automatically manages network policy egress rules needed for external telemetry endpoints.
  • Quality Improvements

    • Expanded end-to-end diagnostics coverage to validate traces, logs, metrics, generated configuration, environment variables, and network policy behavior.

Add traces and logs as top-level spec fields alongside metrics,
enabling OpenTelemetry trace and log forwarding through the existing
OTel Collector sidecar to external OTLP endpoints. Auto-installs
diagnostics plugins and injects OTel resource attributes for
multi-instance discrimination.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@cooktheryan, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 41 minutes and 32 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 38f08c31-e4b8-4c77-bf76-ce7d9ab15089

📥 Commits

Reviewing files that changed from the base of the PR and between f2e8a72 and c634562.

📒 Files selected for processing (2)
  • .github/workflows/publish-images.yml
  • .github/workflows/test-kind.yml
📝 Walkthrough

Walkthrough

Adds OTLP-based traces and logs forwarding to the Claw operator. New TracesSpec and LogsSpec types extend ClawSpec with optional endpoints and sampling configuration. A new claw_diagnostics.go module generates OTel Collector YAML, injects gateway environment variables with Kubernetes downward API, and manages NetworkPolicy egress rules for telemetry endpoints. The reconciler, metrics module, and plugin system are updated to call these helpers. Unit tests and a Kind-based E2E CI workflow validate the full observability pipeline.

Changes

OTel Traces and Logs Observability

Layer / File(s) Summary
API types and CRD schema
api/v1alpha1/claw_types.go, api/v1alpha1/zz_generated.deepcopy.go, config/crd/bases/claw.sandbox.redhat.com_claws.yaml
Adds exported TracesSpec (enabled, endpoint, samplingRatio with regex constraint ^(0(\.\d+)?|1(\.0+)?)$) and LogsSpec (enabled, endpoint) to ClawSpec. Updates CRD OpenAPI schema and generates deepcopy methods for both types.
Diagnostics implementation: helpers and resource injection
internal/controller/claw_diagnostics.go
Implements signal enablement checks (isTracesEnabled, isLogsEnabled), endpoint resolution with fallback (logs default to traces), sampling ratio default ("1"), otelSidecarNeeded predicate, and three injection paths: injectDiagnosticsConfig to set diagnostics.otel in operator.json, injectOTelEnvVars to inject downward-API and sampler env vars into the gateway Deployment, injectTelemetryEgressRules to append TCP egress rules to the gateway NetworkPolicy, and buildCollectorConfig to dynamically generate OTel Collector YAML conditionally enabling metrics/traces/logs pipelines.
Diagnostics unit tests
internal/controller/claw_diagnostics_test.go
Comprehensive test coverage for helper getters, config injection with endpoint and flag validation, env var injection with downward-API fieldRefs and trace sampler settings, collector config generation for all signal combinations and endpoint variants, NetworkPolicy egress rule injection with deduplication and endpoint classification, and diagnostics plugin selection logic.
Reconciler and metrics system wiring
internal/controller/claw_resource_controller.go, internal/controller/claw_metrics.go, internal/controller/claw_plugins.go
enrichConfigAndNetworkPolicy calls injectDiagnosticsConfig after metrics enrichment; configureDeployments updates sidecar gate to otelSidecarNeeded, calls injectObservabilityResources and injectOTelEnvVars; injectOTelCollectorConfig delegates to buildCollectorConfig; effectivePlugins appends requiredDiagnosticsPlugins to implicit plugin list; requiredDiagnosticsPlugins returns OTEL/Prometheus diagnostics plugins based on enabled signals.
E2E Kind-based diagnostics workflow
.github/workflows/test-diagnostics.yml
GitHub Actions workflow provisioning a Kind cluster, building and loading operator/proxy images, installing cert-manager and the operator, deploying a mock OTLP Collector backend, applying a Claw CR with traces/logs/metrics enabled, validating sidecar injection and collector config structure, confirming env vars and Prometheus port exposure, testing split-exporter configuration patch, testing external endpoint NetworkPolicy patch, with comprehensive failure-only debug collection and unconditional cluster teardown.

Sequence Diagram(s)

sequenceDiagram
  participant reconciler as Reconciler
  participant inject as injectObservabilityResources
  participant builder as buildCollectorConfig
  participant diagCfg as injectDiagnosticsConfig
  participant envVars as injectOTelEnvVars
  participant netpol as injectTelemetryEgressRules

  reconciler->>inject: objects, instance
  inject->>builder: generate OTel Collector YAML
  inject->>diagCfg: set diagnostics.otel flags and endpoint
  inject->>netpol: append TCP egress rules for endpoints
  reconciler->>envVars: inject OTEL_* and trace sampler env vars
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 29.63% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat: add spec.traces and spec.logs for OTel forwarding' accurately and concisely summarizes the main change: adding new CRD fields for OpenTelemetry trace and log forwarding configuration.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/diagnostics-otel

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@cooktheryan

Copy link
Copy Markdown
Collaborator Author

/hold until i can run through code rabbit

@cooktheryan

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Rename the deprecated `otlphttp` exporter alias to `otlp_http` in the
OTel Collector config to suppress warnings on collector 0.152.1+. Add a
dedicated GitHub Actions workflow that validates the full OTel
diagnostics stack (traces, logs, metrics) on a Kind cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread .github/workflows/test-diagnostics.yml Fixed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/test-diagnostics.yml:
- Around line 31-36: The Kind installation step downloads from the `/dl/latest/`
endpoint in the curl command, which is not reproducible as the latest version
can change over time. Replace the `/dl/latest/` path with a specific pinned
version number (e.g., `/dl/v0.20.0/`) in the curl command that downloads Kind
from kind.sigs.k8s.io to ensure consistent and reproducible test runs.
- Around line 11-22: Add a top-level permissions block to the workflow that
explicitly defines the minimum required GITHUB_TOKEN scopes. Currently, the
workflow lacks a permissions declaration which results in using default
permissions that are broader than necessary, violating the principle of least
privilege. Place the permissions block at the same level as the jobs key to
restrict token access to only the specific permissions needed for this OTel
Diagnostics workflow (typically read or none for workflows that only run tests
without deploying or modifying repository content).
- Line 24: The test-diagnostics.yml workflow uses semantic version tags (like
`@v4`) for GitHub Actions references instead of commit hashes, which is
inconsistent with the pinning approach used in publish-images.yml. Update all
uses statements in the test-diagnostics.yml workflow to pin actions to their
full commit hashes instead of version tags, following the same pattern already
established in publish-images.yml for supply-chain hardening and consistency
across the repository.

In `@api/v1alpha1/claw_types.go`:
- Around line 699-707: The TracesSpec and LogsSpec fields in the claw_types.go
file need CRD-level validation to prevent silent misconfiguration. Add
kubebuilder validation markers (such as +kubebuilder:validation:XValidationRule)
to enforce that when Traces or Logs specs have enabled set to true, the
corresponding endpoint fields must be explicitly configured. This ensures the
validation happens at admission time rather than silently failing during
collector configuration, preventing users from creating invalid resource
combinations where endpoints would not be wired for export.

In `@internal/controller/claw_diagnostics.go`:
- Around line 276-282: The isExternalEndpoint function suppresses errors from
classifyServiceURL by returning false on failure, which hides configuration
issues and silently skips egress rule injection. Modify isExternalEndpoint to
return both a boolean and an error instead of only a boolean, then propagate any
error returned by classifyServiceURL up through the call chain via
collectTelemetryEndpoints and finally through injectTelemetryEgressRules so that
reconciliation and status updates reflect misconfiguration issues immediately.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 2ab0c029-6158-4696-ad50-d544ea85e162

📥 Commits

Reviewing files that changed from the base of the PR and between f8c18ad and ad5b275.

📒 Files selected for processing (9)
  • .github/workflows/test-diagnostics.yml
  • api/v1alpha1/claw_types.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • config/crd/bases/claw.sandbox.redhat.com_claws.yaml
  • internal/controller/claw_diagnostics.go
  • internal/controller/claw_diagnostics_test.go
  • internal/controller/claw_metrics.go
  • internal/controller/claw_plugins.go
  • internal/controller/claw_resource_controller.go

Comment thread .github/workflows/test-kind.yml Outdated
Comment thread .github/workflows/test-kind.yml Outdated
Comment thread .github/workflows/test-kind.yml Outdated
Comment on lines +699 to +707
// Traces configures distributed tracing via OTLP, forwarded through the
// OTel Collector sidecar to an external OTLP endpoint.
// +optional
Traces *TracesSpec `json:"traces,omitempty"`

// Logs configures log forwarding via OTLP, forwarded through the
// OTel Collector sidecar to an external OTLP endpoint.
// +optional
Logs *LogsSpec `json:"logs,omitempty"`

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add CRD-level validation for enabled traces/logs endpoint combinations.

enabled: true currently permits endpoint-less specs, but the collector config path only wires trace/log export when endpoints resolve. That creates a silent misconfiguration path. Reject invalid combinations at admission time.

Suggested guardrails
+// +kubebuilder:validation:XValidation:rule="!has(self.traces) || !self.traces.enabled || self.traces.endpoint != ''",message="spec.traces.endpoint is required when spec.traces.enabled is true"
+// +kubebuilder:validation:XValidation:rule="!has(self.logs) || !self.logs.enabled || self.logs.endpoint != '' || (has(self.traces) && self.traces.enabled && self.traces.endpoint != '')",message="spec.logs.endpoint is required when spec.logs.enabled is true unless spec.traces.endpoint is set"
 type ClawSpec struct {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/v1alpha1/claw_types.go` around lines 699 - 707, The TracesSpec and
LogsSpec fields in the claw_types.go file need CRD-level validation to prevent
silent misconfiguration. Add kubebuilder validation markers (such as
+kubebuilder:validation:XValidationRule) to enforce that when Traces or Logs
specs have enabled set to true, the corresponding endpoint fields must be
explicitly configured. This ensures the validation happens at admission time
rather than silently failing during collector configuration, preventing users
from creating invalid resource combinations where endpoints would not be wired
for export.

Comment thread internal/controller/claw_diagnostics.go Outdated
Comment on lines +276 to +282
func isExternalEndpoint(rawURL string, instance *clawv1alpha1.Claw) bool {
target, err := classifyServiceURL(rawURL, instance.Namespace)
if err != nil {
return false
}
return target.External
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t suppress telemetry endpoint classification errors.

Returning false on classification failure hides invalid endpoint configs and silently skips egress rule injection. Bubble this error up so reconcile/status reflects misconfiguration immediately.

Minimal direction
-func isExternalEndpoint(rawURL string, instance *clawv1alpha1.Claw) bool {
+func isExternalEndpoint(rawURL string, instance *clawv1alpha1.Claw) (bool, error) {
 	target, err := classifyServiceURL(rawURL, instance.Namespace)
 	if err != nil {
-		return false
+		return false, err
 	}
-	return target.External
+	return target.External, nil
}

Then propagate the error through collectTelemetryEndpointsinjectTelemetryEgressRules.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/claw_diagnostics.go` around lines 276 - 282, The
isExternalEndpoint function suppresses errors from classifyServiceURL by
returning false on failure, which hides configuration issues and silently skips
egress rule injection. Modify isExternalEndpoint to return both a boolean and an
error instead of only a boolean, then propagate any error returned by
classifyServiceURL up through the call chain via collectTelemetryEndpoints and
finally through injectTelemetryEgressRules so that reconciliation and status
updates reflect misconfiguration issues immediately.

cooktheryan and others added 2 commits June 17, 2026 10:35
- Fix injectTelemetryEgressRules to handle in-cluster endpoints using
  classifyServiceURL/buildInClusterEgressRule (was silently blocking)
- Rename workflow to "kind test"
- Use quay.io/redhat-et/ image names for Kind compatibility
- Add IMAGE_PULL_POLICY=IfNotPresent for Kind deploys
- Enhance debug logging: dump container logs, configmap, kubectl get all

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread .github/workflows/test-kind.yml Fixed
cooktheryan and others added 7 commits June 17, 2026 11:03
…ntain permissions'

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
- Add permissions: contents: read to limit GITHUB_TOKEN scope
- Add persist-credentials: false on checkout
- Pin Kind to v0.32.0 for reproducible builds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Send test OTLP data to the sidecar and verify it flows through:
- Traces: sidecar receives span, backend logs it via debug exporter
- Logs: sidecar receives log record, backend logs it
- Metrics: sidecar receives OTLP metric, Prometheus endpoint serves it

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These tested the OTel Collector generically, not OpenClaw's actual
telemetry output. Keep the operator artifact validations (sidecar,
config, env vars, NP rules) which are what this test is for.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Merge publish-images.yml into test-kind.yml so images are only pushed
to quay.io after successful Kind testing and only on merge to main.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Align test job action references with publish job for supply-chain
hardening consistency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants