
feat: surface init container failures, plugin compatibility, and vers…#13

Merged
cooktheryan merged 3 commits into main from feat/surface-deployment-failures
Jun 18, 2026

Conversation

@cooktheryan

@cooktheryan cooktheryan commented Jun 16, 2026

Collaborator

…ion downgrade warnings

When a Claw deployment fails due to init container crashes (e.g., a plugin requiring a newer OpenClaw version than spec.version), the operator previously reported only "Waiting for deployments to become ready" — forcing users to manually dig through pod logs to find the root cause. This was observed in production when spec.version 2026.6.5 was deployed against a PVC configured for 2026.6.8, causing the @openclaw/anthropic-vertex-provider plugin's init-plugins container to crash-loop with no actionable status on the CR.

This commit adds three warning-only status conditions:

  1. Ready condition enrichment: when deployments are not ready, the operator inspects pods for init container failures (non-zero exit or CrashLoopBackOff) and surfaces the actual error message in the Ready condition with reason InitContainerFailure.

  2. PluginCompatibility condition: when spec.version is older than a plugin's minimum required version (declared via PluginMinVersion in knownProviders), a warning condition is set. This does not block deployment — users may have compatible plugins cached on PVC.

  3. VersionDowngrade condition: when spec.version is older than status.lastDeployedVersion, a warning is set about potential PVC data incompatibility. Downgrades remain fully permitted.
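
The first item above (Ready condition enrichment) can be illustrated with a minimal sketch. It assumes simplified stand-in structs rather than the real Kubernetes pod status types, and `firstInitFailure` and the `init-config` container name are hypothetical names introduced here for illustration; the operator's actual `checkPodInitFailures` may be structured differently.

```go
package main

import "fmt"

// Simplified stand-ins for the pod status types. The field names loosely mirror
// the Kubernetes container status API, but these structs are illustrative only.
type terminatedState struct {
	ExitCode int32
	Message  string
}

type waitingState struct {
	Reason  string
	Message string
}

type initContainerStatus struct {
	Name       string
	Terminated *terminatedState // set when the container has exited
	Waiting    *waitingState    // set when the container is waiting to (re)start
}

// firstInitFailure (hypothetical helper) returns a description of the first
// failing init container, or "" when none has failed.
func firstInitFailure(statuses []initContainerStatus) string {
	for _, s := range statuses {
		if s.Terminated != nil && s.Terminated.ExitCode != 0 {
			return fmt.Sprintf("init container %q exited with code %d: %s",
				s.Name, s.Terminated.ExitCode, s.Terminated.Message)
		}
		if s.Waiting != nil && s.Waiting.Reason == "CrashLoopBackOff" {
			return fmt.Sprintf("init container %q is in CrashLoopBackOff: %s",
				s.Name, s.Waiting.Message)
		}
	}
	return ""
}

func main() {
	statuses := []initContainerStatus{
		{Name: "init-config", Terminated: &terminatedState{ExitCode: 0}},
		{Name: "init-plugins", Waiting: &waitingState{
			Reason:  "CrashLoopBackOff",
			Message: "back-off restarting failed container",
		}},
	}
	// Fall back to the generic message unless a concrete failure is found.
	reason, message := "Provisioning", "Waiting for deployments to become ready"
	if detail := firstInitFailure(statuses); detail != "" {
		reason, message = "InitContainerFailure", detail
	}
	fmt.Printf("Ready=False reason=%s message=%s\n", reason, message)
}
```

A zero exit code is skipped deliberately, so a completed init container never masks a later one that is crash-looping.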

Summary by CodeRabbit

  • New Features
    • Added new Claw status conditions for plugin compatibility and version downgrade detection.
    • Extended status reporting with lastDeployedVersion to help identify potentially unsafe version downgrades.
    • Improved non-ready status messaging by surfacing init-container failure details on the Ready condition.
  • Tests
    • Added unit tests covering version comparison, plugin compatibility checks, init failure surfacing, and downgrade behavior.
  • Chores
    • Updated controller permissions to read Pod details needed for init failure detection.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026



ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: fe4c17b5-32bc-4e1c-8fd0-726bba8b72eb

📥 Commits

Reviewing files that changed from the base of the PR and between 5148225 and f086d4d.

📒 Files selected for processing (9)
  • api/v1alpha1/claw_types.go
  • config/crd/bases/claw.sandbox.redhat.com_claws.yaml
  • config/rbac/role.yaml
  • internal/controller/claw_plugins.go
  • internal/controller/claw_plugins_test.go
  • internal/controller/claw_providers.go
  • internal/controller/claw_resource_controller.go
  • internal/controller/claw_status.go
  • internal/controller/claw_status_test.go
📝 Walkthrough

Walkthrough

Adds three safety and observability enhancements to the Claw operator: plugin CalVer compatibility checking that gates init-container installation and sets a PluginCompatibility condition; version downgrade detection using a new LastDeployedVersion status field that sets a VersionDowngrade condition; and init-container failure surfacing that enriches the Ready condition with pod termination details.

Changes

Claw Operator Safety Checks

Layer / File(s) Summary
API types and CRD schema
api/v1alpha1/claw_types.go, config/crd/bases/claw.sandbox.redhat.com_claws.yaml
Adds ConditionTypePluginCompatibility and ConditionTypeVersionDowngrade condition type constants, three new ConditionReason constants (Incompatible, VersionDowngrade, InitContainerFailure), and the LastDeployedVersion field on ClawStatus. The CRD YAML gains the corresponding status.lastDeployedVersion schema entry.
CalVer version comparison utilities
internal/controller/claw_plugins.go, internal/controller/claw_plugins_test.go
Implements compareCalver for segment-by-segment numeric version comparison (returning -1, 0, or 1 together with a validity flag that is false on parse failure). Full test coverage includes equality, ordering, year/segment differences, and malformed/empty inputs.
Plugin compatibility check logic
internal/controller/claw_plugins.go, internal/controller/claw_plugins_test.go
Implements checkPluginCompatibility to validate spec.version against any implicitly required Vertex plugin minimum version (PluginMinVersion), returning a compatibility warning when spec version is too low or "cannot check" when spec is non-empty but not valid CalVer. Includes test subtests for vertex/non-vertex credentials, compatible/incompatible versions, and edge cases.
Provider version requirements
internal/controller/claw_providers.go
Extends providerDefaults struct with PluginMinVersion field and configures the anthropic provider to require OpenClaw 2026.6.8 for Vertex plugin support.
Status management: init failures and version tracking
internal/controller/claw_status.go, internal/controller/claw_status_test.go
Implements checkPodInitFailures to inspect pending deployment pods for non-zero exit codes or CrashLoopBackOff waiting states, and setReadyConditionWithDetail to enrich the Ready condition with failure details using InitContainerFailure reason. Implements version downgrade detection by comparing spec.version against status.LastDeployedVersion, setting VersionDowngrade condition on downgrade and removing it on upgrade; maintains LastDeployedVersion as a high-water mark when Ready. Includes full test coverage for init failure surfacing, downgrade/upgrade detection, and condition persistence.
Controller integration and RBAC
internal/controller/claw_resource_controller.go, config/rbac/role.yaml
Wires checkPluginCompatibility into configureDeployments to set or remove the PluginCompatibility status condition based on version warnings. Updates controller RBAC with get and list verbs for pods resource to enable pod inspection for init-container failures.
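
The segment-by-segment comparison described in the walkthrough can be sketched as follows. This is an illustration under stated assumptions, not the operator's code: the `(int, bool)` return shape comes from the follow-up commit message, while `parseCalver` is a hypothetical helper and the zero-padding of missing trailing segments is a choice made for this sketch that the PR does not confirm.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCalver (hypothetical helper) splits a dotted version into non-negative
// integers. ok is false for empty input or any non-numeric segment.
func parseCalver(v string) ([]int, bool) {
	if v == "" {
		return nil, false
	}
	parts := strings.Split(v, ".")
	out := make([]int, 0, len(parts))
	for _, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return nil, false
		}
		out = append(out, n)
	}
	return out, true
}

// compareCalver returns -1, 0, or 1 when a is older than, equal to, or newer
// than b. The second result is false when either input fails to parse, so a
// malformed version is never mistaken for an equal one.
// Assumption: missing trailing segments compare as 0.
func compareCalver(a, b string) (int, bool) {
	pa, okA := parseCalver(a)
	pb, okB := parseCalver(b)
	if !okA || !okB {
		return 0, false
	}
	n := len(pa)
	if len(pb) > n {
		n = len(pb)
	}
	for i := 0; i < n; i++ {
		var x, y int
		if i < len(pa) {
			x = pa[i]
		}
		if i < len(pb) {
			y = pb[i]
		}
		if x < y {
			return -1, true
		}
		if x > y {
			return 1, true
		}
	}
	return 0, true
}

func main() {
	pairs := [][2]string{
		{"2026.6.5", "2026.6.8"},
		{"2026.6.8", "2026.6.8"},
		{"2026.10.1", "2026.6.8"},
		{"notaversion", "2026.6.8"},
	}
	for _, p := range pairs {
		cmp, ok := compareCalver(p[0], p[1])
		fmt.Printf("%s vs %s: cmp=%d ok=%t\n", p[0], p[1], cmp, ok)
	}
}
```

Comparing numerically matters here: a plain string comparison would rank 2026.10.1 below 2026.6.8.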

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • redhat-et/claw-operator#7: Modifies configureDeployments and pluginInstallationDisabled gating that this PR directly extends with the plugin compatibility check.
  • redhat-et/claw-operator#1: Introduces spec.version and its image-tag override, which is the field this PR reads for both checkPluginCompatibility and VersionDowngrade detection.

Suggested reviewers

  • sallyom
  • pavelanni
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title is truncated mid-word ('vers…'), making it unclear what the final feature is and preventing full understanding of the changeset scope. | Complete the title to fully convey all three features: surface init container failures, plugin compatibility, and version downgrade detection. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/controller/claw_plugins.go`:
- Around line 243-275: The compareCalver function silently returns 0 when
parsing version parts fails via strconv.Atoi, treating malformed version strings
as equal and suppressing important compatibility warnings. Modify the
compareCalver function to return an error as a second return value instead of
returning 0 on parse failures. When strconv.Atoi fails for either aParts or
bParts, return the error so callers can properly handle and surface "cannot
compare" diagnostics for invalid version strings. Update all call sites of
compareCalver to check the error return value and propagate these diagnostics
appropriately to prevent suppressing plugin compatibility and downgrade
warnings.

In `@internal/controller/claw_resource_controller.go`:
- Around line 897-914: The plugin compatibility check is incorrectly nested
inside the conditional block that gates plugin auto-configuration. Move the
checkPluginCompatibility call and its associated condition-setting logic outside
the if/else block that checks userManagedConfig(instance) and
pluginInstallationDisabled(instance), so compatibility warnings are evaluated
regardless of plugin installation mode. Keep the configurePluginsInitContainer
call and effectivePlugins logic inside the original conditional block since
those should only execute when plugins are being auto-configured.

In `@internal/controller/claw_status.go`:
- Around line 205-222: Error handling in the initialization failure inspection
is swallowing all errors by returning empty strings indiscriminately. The
r.Get() call for the deployment, the metav1.LabelSelectorAsSelector() call, and
the r.List() call for pods all return empty string on error, making it
impossible to distinguish between "no init failure found" and "inspection
failed". Modify these error cases to return an error value or use a tuple return
type (string, error) so that real inspection failures are properly propagated
rather than masked. This issue occurs at the primary location and also applies
to similar error handling at another location in the file.
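
The distinction raised in the last finding, "no init failure found" versus "inspection failed", can be sketched with a `(string, error)` return. Everything here is hypothetical scaffolding: `failureLister`, `fakeLister`, `inspectInitFailures`, and the `claw` deployment name stand in for the real client calls, which this sketch does not attempt to reproduce.

```go
package main

import (
	"errors"
	"fmt"
)

// errForbidden simulates a failed cluster lookup (for example missing RBAC).
var errForbidden = errors.New("pods is forbidden")

// failureLister is a hypothetical abstraction over the cluster lookups
// (fetch the deployment, build the selector, list the pods).
type failureLister interface {
	InitFailures(deployment string) ([]string, error)
}

// fakeLister is a test double that returns canned results.
type fakeLister struct {
	failures []string
	err      error
}

func (f fakeLister) InitFailures(string) ([]string, error) { return f.failures, f.err }

// inspectInitFailures (hypothetical) returns ("", nil) when nothing has failed,
// (detail, nil) when a failure was found, and ("", err) when the inspection
// itself could not be completed.
func inspectInitFailures(l failureLister, deployment string) (string, error) {
	failures, err := l.InitFailures(deployment)
	if err != nil {
		return "", fmt.Errorf("inspecting pods for deployment %q: %w", deployment, err)
	}
	if len(failures) == 0 {
		return "", nil
	}
	return failures[0], nil
}

func main() {
	cases := []fakeLister{
		{},
		{failures: []string{`init container "init-plugins" is in CrashLoopBackOff`}},
		{err: errForbidden},
	}
	for _, c := range cases {
		detail, err := inspectInitFailures(c, "claw")
		switch {
		case err != nil:
			fmt.Println("inspection failed, log it and keep the generic message:", err)
		case detail != "":
			fmt.Println("init failure found:", detail)
		default:
			fmt.Println("no init failure found")
		}
	}
}
```

Wrapping with `%w` keeps the original error available to `errors.Is`, so a caller can still react to specific causes after the context is added.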

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: c31162a9-cb88-412c-868b-670ce2e0e10b

📥 Commits

Reviewing files that changed from the base of the PR and between f8c18ad and e3e5f16.

📒 Files selected for processing (9)
  • api/v1alpha1/claw_types.go
  • config/crd/bases/claw.sandbox.redhat.com_claws.yaml
  • config/rbac/role.yaml
  • internal/controller/claw_plugins.go
  • internal/controller/claw_plugins_test.go
  • internal/controller/claw_providers.go
  • internal/controller/claw_resource_controller.go
  • internal/controller/claw_status.go
  • internal/controller/claw_status_test.go

Comment thread internal/controller/claw_plugins.go Outdated
Comment thread internal/controller/claw_resource_controller.go Outdated
Comment thread internal/controller/claw_status.go
@IsaiahStapleton

Collaborator

I built and redeployed the operator with these changes and tested each warning-only feature it adds. Here is what I found:

Feature 1: Init Container Failure Surfacing — WORKS
Bogus plugin caused init-plugins to crash. Ready condition
correctly showed reason: InitContainerFailure with the
CrashLoopBackOff details instead of generic Provisioning.

Feature 2: Plugin Compatibility Check — WORKS
Setting spec.version: "2026.6.5" with an Anthropic Vertex
credential correctly set PluginCompatibility: Incompatible
warning about requiring >= 2026.6.8.

Feature 3: Version Downgrade Detection — BUG
Set spec.version: "2026.6.8", waited for ready, then
downgraded to "2026.6.1". No VersionDowngrade condition
appeared — lastDeployedVersion was overwritten to the
downgraded version despite the if ready guard, so the
comparison never detected a downgrade.

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
internal/controller/claw_status.go (1)

368-375: ⚡ Quick win

Consider validating the new version before updating the high-water mark.

When compareCalver returns !cmpTrackOK, it could mean either the old or new version is malformed. The current code updates LastDeployedVersion in this case, which could overwrite a valid old version with a malformed new version.

Scenario: Deploy "2026.6.8" → Deploy "notaversion" → Deploy "2026.6.5". After step 2, LastDeployedVersion becomes malformed, preventing downgrade detection in step 3.

Since this is a warning-only feature and malformed versions are rare, the current fail-open behavior is acceptable. However, validating that Spec.Version is valid CalVer before updating would make downgrade detection more robust:

// Validate new version is parseable before updating high-water mark
if _, newValid := compareCalver(instance.Spec.Version, instance.Spec.Version); newValid {
    cmpTrack, cmpTrackOK := compareCalver(instance.Spec.Version, instance.Status.LastDeployedVersion)
    if instance.Status.LastDeployedVersion == "" || !cmpTrackOK || cmpTrack >= 0 {
        instance.Status.LastDeployedVersion = instance.Spec.Version
    }
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/claw_status.go` around lines 368 - 375, The code updates
LastDeployedVersion without validating that the new version is valid CalVer,
which could corrupt the high-water mark if a malformed version is deployed. Add
a validation check before the existing logic: call compareCalver with
instance.Spec.Version against itself to verify it parses correctly, and only
proceed with the high-water mark update logic if this validation succeeds (the
second return value is true). This ensures that only valid CalVer versions can
overwrite LastDeployedVersion, making downgrade detection more robust.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 6364ca6f-1cad-4df9-a29d-92dc9f20609b

📥 Commits

Reviewing files that changed from the base of the PR and between e3e5f16 and 5e45a1e.

📒 Files selected for processing (5)
  • internal/controller/claw_plugins.go
  • internal/controller/claw_plugins_test.go
  • internal/controller/claw_resource_controller.go
  • internal/controller/claw_status.go
  • internal/controller/claw_status_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal/controller/claw_plugins_test.go
  • internal/controller/claw_resource_controller.go

@cooktheryan

Collaborator Author

@IsaiahStapleton ready for round 2 if you have the cycles

cooktheryan and others added 3 commits June 18, 2026 12:19
…ion downgrade warnings

When a Claw deployment fails due to init container crashes (e.g., a plugin
requiring a newer OpenClaw version than spec.version), the operator previously
reported only "Waiting for deployments to become ready" — forcing users to
manually dig through pod logs to find the root cause. This was observed in
production when spec.version 2026.6.5 was deployed against a PVC configured
for 2026.6.8, causing the @openclaw/anthropic-vertex-provider plugin's
init-plugins container to crash-loop with no actionable status on the CR.

This commit adds three warning-only status conditions:

1. Ready condition enrichment: when deployments are not ready, the operator
   inspects pods for init container failures (non-zero exit or CrashLoopBackOff)
   and surfaces the actual error message in the Ready condition with reason
   InitContainerFailure.

2. PluginCompatibility condition: when spec.version is older than a plugin's
   minimum required version (declared via PluginMinVersion in knownProviders),
   a warning condition is set. This does not block deployment — users may have
   compatible plugins cached on PVC.

3. VersionDowngrade condition: when spec.version is older than
   status.lastDeployedVersion, a warning is set about potential PVC data
   incompatibility. Downgrades remain fully permitted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- compareCalver returns (int, bool) so callers can distinguish malformed
  versions from equal ones instead of silently treating parse failures as
  "equal"
- Plugin compatibility check moved outside the plugin-installation gate
  so warnings surface regardless of config management mode
- checkPodInitFailures returns (string, error) so inspection failures
  are logged rather than silently swallowed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…warning persists

The VersionDowngrade condition was being cleared on the next reconcile
because LastDeployedVersion was unconditionally overwritten to the
downgraded version when deployments became ready. Now it only updates
upward, preserving the previous version for comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
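
The "only updates upward" rule from the last commit message can be sketched as below. This is an illustration, not the merged code: `clawStatus`, `parse`, `older`, and `reconcileVersion` are hypothetical names, and two behaviours are assumptions of this sketch (versions with different segment counts are treated as not comparable, and a malformed spec.version never replaces the recorded mark).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// clawStatus is a hypothetical, trimmed-down stand-in for the operator's status.
type clawStatus struct {
	LastDeployedVersion string
	VersionDowngrade    bool
}

// parse splits a dotted numeric version; ok is false on malformed or empty input.
func parse(v string) (segs []int, ok bool) {
	for _, p := range strings.Split(v, ".") {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, false
		}
		segs = append(segs, n)
	}
	return segs, true
}

// older reports whether a sorts before b. ok is false when either side is
// malformed or the segment counts differ (an assumption of this sketch).
func older(a, b string) (isOlder, ok bool) {
	as, okA := parse(a)
	bs, okB := parse(b)
	if !okA || !okB || len(as) != len(bs) {
		return false, false
	}
	for i := range as {
		if as[i] != bs[i] {
			return as[i] < bs[i], true
		}
	}
	return false, true
}

// reconcileVersion applies one reconcile pass: it sets or clears the downgrade
// warning, then raises the high-water mark only when the deployment is ready.
func reconcileVersion(st *clawStatus, specVersion string, ready bool) {
	isOlder, ok := older(specVersion, st.LastDeployedVersion)
	st.VersionDowngrade = ok && isOlder
	if !ready {
		return
	}
	if _, valid := parse(specVersion); !valid {
		return // never record a malformed version as the mark
	}
	// Set the mark initially, replace a value that cannot be compared, or move
	// it up. A downgrade leaves it untouched so the warning persists.
	if !ok || !isOlder {
		st.LastDeployedVersion = specVersion
	}
}

func main() {
	st := &clawStatus{}
	for _, v := range []string{"2026.6.8", "2026.6.1", "2026.6.1", "2026.6.9"} {
		reconcileVersion(st, v, true)
		fmt.Printf("spec=%s lastDeployed=%s downgrade=%t\n",
			v, st.LastDeployedVersion, st.VersionDowngrade)
	}
}
```

Running the downgrade twice in `main` mirrors the reported bug: with an unconditional overwrite, the second pass would compare 2026.6.1 against itself and clear the warning.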
@cooktheryan cooktheryan force-pushed the feat/surface-deployment-failures branch from 5148225 to f086d4d Compare June 18, 2026 16:24
@IsaiahStapleton

Collaborator

I tested again, and everything seems to be working now. PluginCompatibility, InitContainerFailure, and VersionDowngrade all show up under status.conditions when their triggering conditions are met.

LGTM

Since you asked about recording some of the things I looked for so you can write automated tests:

  • For PluginCompatibility, I patched the existing Claw CR with spec.version: "2026.6.1" (which is older than the PluginMinVersion of 2026.6.8 for the Anthropic Vertex plugin). The CR already had a GCP-type Anthropic credential. Immediately after reconcile, the condition appeared: PluginCompatibility=False, Reason=Incompatible. After patching the version up to 2026.6.8, the condition cleared.

  • For InitContainerFailure, using that same CR at spec.version: "2026.6.1", the init-plugins container failed because the plugin requires API >= 2026.6.8. Once the pod entered CrashLoopBackOff and the operator re-reconciled, the Ready condition changed from Reason=Provisioning to Reason=InitContainerFailure with a message containing the container name and failure details. After upgrading to 2026.6.8 and waiting for the pod to start, Ready went back to True with Reason=Ready.

  • For VersionDowngrade, I first deployed at spec.version: "2026.6.8" and waited for Ready=True, which caused status.lastDeployedVersion to be recorded as 2026.6.8. Then I patched spec.version down to "2026.6.5". The condition appeared: VersionDowngrade=True, Reason=VersionDowngrade, with a message referencing both versions. lastDeployedVersion stayed at 2026.6.8 (high-water mark, not overwritten by the downgrade).
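
As a starting point for automating the PluginCompatibility scenario recorded above, the check can be sketched as a pure function over two version strings. This is a rough sketch under assumptions: `pluginCompatWarning` and `parse` are hypothetical names rather than the operator's `checkPluginCompatibility`, an empty spec.version is assumed to produce no warning, versions that differ only in segment count are treated as compatible, and `latest` is an invented example of a non-CalVer value.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse splits a dotted numeric version; ok is false on malformed or empty input.
func parse(v string) (segs []int, ok bool) {
	for _, p := range strings.Split(v, ".") {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, false
		}
		segs = append(segs, n)
	}
	return segs, true
}

// pluginCompatWarning (hypothetical) returns a warning message, or "" when no
// warning applies. It never blocks anything; the caller only sets a condition.
func pluginCompatWarning(specVersion, pluginMinVersion string) string {
	if specVersion == "" || pluginMinVersion == "" {
		return "" // assumption: nothing to check without both versions
	}
	spec, ok := parse(specVersion)
	if !ok {
		return fmt.Sprintf("cannot check plugin compatibility: spec.version %q is not valid CalVer", specVersion)
	}
	minV, ok := parse(pluginMinVersion)
	if !ok {
		return "" // a malformed minimum is a provider-table bug, not a user error
	}
	// Compare the shared segments; the first difference decides the outcome.
	for i := 0; i < len(spec) && i < len(minV); i++ {
		if spec[i] < minV[i] {
			return fmt.Sprintf("spec.version %s is older than the plugin minimum %s",
				specVersion, pluginMinVersion)
		}
		if spec[i] > minV[i] {
			return ""
		}
	}
	return ""
}

func main() {
	const pluginMin = "2026.6.8" // minimum quoted in the PR for the Vertex plugin
	for _, spec := range []string{"2026.6.1", "2026.6.8", "latest"} {
		if w := pluginCompatWarning(spec, pluginMin); w != "" {
			fmt.Printf("spec=%s -> PluginCompatibility=False (%s)\n", spec, w)
		} else {
			fmt.Printf("spec=%s -> condition cleared\n", spec)
		}
	}
}
```

Keeping the check free of cluster access makes the patch-down and patch-up steps from the manual test straightforward to express as table-driven cases.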

@cooktheryan cooktheryan merged commit f13825d into main Jun 18, 2026
6 checks passed