Skip to content

OCPBUGS-67134: add grace period before reporting Available=False#1179

Open
sg00dwin wants to merge 1 commit into
openshift:mainfrom
sg00dwin:OCPBUGS-67134-available-false-blip-insufficient-replicas
Open

OCPBUGS-67134: add grace period before reporting Available=False#1179
sg00dwin wants to merge 1 commit into
openshift:mainfrom
sg00dwin:OCPBUGS-67134-available-false-blip-insufficient-replicas

Conversation

@sg00dwin

@sg00dwin sg00dwin commented Jun 30, 2026

Copy link
Copy Markdown
Member

Summary

  • During disruptive CI tests (node reboots, drains), console deployment replicas briefly drop to zero for ~10 seconds before self-recovering
  • The operator immediately reports Available=False with Deployment_InsufficientReplicas, triggering OTA invariant test failures
  • This fix adds a 2-minute grace period that suppresses the false alarm when the deployment was recently available, while still reporting genuine outages

What changed

  • Track when the deployment was last seen available (lastDeploymentAvailableTime on the operator struct)
  • Extract evaluateDeploymentAvailability() method that checks the grace window before reporting Available=False
  • Grace period (2 min) matches the existing Degraded inertia in library-go
  • First startup with no prior healthy state reports immediately — no risk of masking real outages

Related

Test plan

  • 7 new unit tests covering: available deployment, first-sync unavailable, grace period suppression, grace period expiry, recovery after blip, boundary condition, error message format
  • All existing tests pass
  • make test-unit clean (go test, gofmt, govet)
  • After merge: monitor Sippy for ~5-7 days, then remove origin invariant test exception for console + Deployment_InsufficientReplicas

Co-Authored-By: Claude Opus 4.6

Summary by CodeRabbit

  • Bug Fixes
    • Improved console availability handling to avoid brief, user-visible false alerts during short-lived deployment disruptions.
    • The console now waits a short grace period before reporting the deployment as unavailable, reducing noisy “insufficient replicas” messages for transient drops.
    • If the deployment stays down beyond that window, the expected availability error is still shown.

The console operator immediately reports Available=False when
deployment replicas drop to zero, even during brief disruptions
(~10s) that self-recover. Add a 2-minute grace period that
suppresses the condition when the deployment was recently available,
preventing false alarms during disruptive CI tests while still
reporting genuine outages.

Co-Authored-By: Claude Opus 4.6
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 30, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@sg00dwin: This pull request references Jira Issue OCPBUGS-67134, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • During disruptive CI tests (node reboots, drains), console deployment replicas briefly drop to zero for ~10 seconds before self-recovering
  • The operator immediately reports Available=False with Deployment_InsufficientReplicas, triggering OTA invariant test failures
  • This fix adds a 2-minute grace period that suppresses the false alarm when the deployment was recently available, while still reporting genuine outages

What changed

  • Track when the deployment was last seen available (lastDeploymentAvailableTime on the operator struct)
  • Extract evaluateDeploymentAvailability() method that checks the grace window before reporting Available=False
  • Grace period (2 min) matches the existing Degraded inertia in library-go
  • First startup with no prior healthy state reports immediately — no risk of masking real outages

Related

Test plan

  • 7 new unit tests covering: available deployment, first-sync unavailable, grace period suppression, grace period expiry, recovery after blip, boundary condition, error message format
  • All existing tests pass
  • make test-unit clean (go test, gofmt, govet)
  • After merge: monitor Sippy for ~5-7 days, then remove origin invariant test exception for console + Deployment_InsufficientReplicas

Co-Authored-By: Claude Opus 4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from TheRealJon and jhadvig June 30, 2026 15:40
@openshift-ci

openshift-ci Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sg00dwin
Once this PR has been reviewed and has the lgtm label, please assign jhadvig for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4ed07579-f19d-4707-aaad-5f6c3e76f239

📥 Commits

Reviewing files that changed from the base of the PR and between 58e10b0 and ac8fe6a.

📒 Files selected for processing (3)
  • pkg/console/operator/operator.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/operator/sync_v400_test.go
🔗 Linked repositories identified

CodeRabbit considers these linked repositories for cross-repo context during reviews:

  • openshift/console (manual)
📜 Recent review details
🧰 Additional context used
📓 Path-based instructions (7)
**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

**/*.go: Follow Go coding standards and patterns documented in CONVENTIONS.md
Organize imports according to conventions documented in CONVENTIONS.md
Use gofmt to format Go code with standard formatting
Run go vet checks on all Go packages

Follow Go coding standards and patterns as documented in CONVENTIONS.md, including proper import organization

Organize Go code following the repository structure: main entry point in cmd/console/main.go, API constants in pkg/api/, operator command setup in pkg/cmd/operator/, and version command in pkg/cmd/version/

**/*.go: Use gofmt for formatting Go code
Follow standard Go naming conventions
Group imports in order: standard lib, 3rd party, kube/openshift, internal (marked with comments)
Use meaningful error messages with context in Go code
Set status conditions using status.Handle* functions with type prefixes (*Degraded, *Progressing, *Available, *Upgradeable)
Use typed errors and wrap errors to preserve stack context

Flag MD5, SHA1, DES, RC4, 3DES, Blowfish, and ECB mode cryptographic usage. Also flag custom crypto implementations and non-constant-time comparison of secrets or tokens.

Files:

  • pkg/console/operator/operator.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/operator/sync_v400_test.go

⚙️ CodeRabbit configuration file

**/*.go: Review Go code following OpenShift operator patterns.
See CONVENTIONS.md for coding standards and patterns.

Refer to the following skills based on CODE PATTERNS, not just file paths:

Refer to /controller-review when code contains:

  • Controller struct types (e.g., type *Controller struct)
  • func New*Controller( factory functions
  • factory.New().WithFilteredEventsInformers( pattern
  • .ToController( method calls
  • Sync(ctx context.Context, controllerContext factory.SyncContext) methods
  • operatorConfig.Spec.ManagementState checks
  • status.NewStatusHandler or status.Handle* functions

Refer to /sync-handler-review when code contains:

  • Main operator sync functions (e.g., sync_v400.go content)
  • Sequential resource syncing with early returns
  • Incremental reconciliation loops
  • Multiple resourceapply.Apply*() calls in sequence
  • Dependency ordering of ConfigMaps → Secrets → Service Accounts → RBAC → Services → Deployments → Routes
  • Feature gate conditional logic

Refer to /go-quality-review for all Go code to check:

  • Deprecated imports: ioutil.ReadFile, ioutil.WriteFile, ioutil.ReadAll
  • Deprecated patterns: Dial without DialContext
  • Error handling: missing %w in fmt.Errorf
  • Code smells: deep nesting (4+ levels), functions >100 lines
  • Magic values: unexplained numbers/strings
  • Context propagation: context.Background() instead of passed ctx
  • Missing godoc on exported functions

Files:

  • pkg/console/operator/operator.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/operator/sync_v400_test.go
{pkg,cmd}/**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

Use gofmt for code formatting on pkg and cmd directories

{pkg,cmd}/**/*.go: Format code using gofmt -w ./pkg ./cmd
Run go vet checks on all Go packages in ./pkg and ./cmd

Files:

  • pkg/console/operator/operator.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/operator/sync_v400_test.go
**/operator/**/*.go

📄 CodeRabbit inference engine (Custom checks)

When deployment manifests, operator code, or controllers are added/modified, ensure they do not introduce scheduling constraints assuming standard HA topology. Check ControlPlaneTopology for SingleReplica/DualReplica/HighlyAvailableArbiter/External modes before applying constraints. Use required anti-affinity with maxUnavailable >= 1 (not maxUnavailable: 0). Cap replica counts to schedulable nodes. Exclude arbiter nodes on TNA. Avoid master nodeSelectors on HyperShift. Use library-go DeploymentController hooks (WithTopologyAwareReplicasHook, WithTopologyAwareSchedulingHook, WithControlPlaneNodeSelectorHook).

Files:

  • pkg/console/operator/operator.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/operator/sync_v400_test.go
**

⚙️ CodeRabbit configuration file

**: # OpenShift Console Operator - AI Context Hub

This file serves as the central AI documentation hub for the OpenShift Console Operator project. AI assistants (Claude, Cursor, Copilot, CodeRabbit, etc.) use this and the linked documents to understand project context.

Go Version and Dependencies

Go Version and Dependencies

  • Go version: 1.24.0 (toolchain: go1.24.4)
  • Dependency management: Uses go.mod with vendoring
  • Build flags: Use GOFLAGS="-mod=vendor" for builds and tests to ensure vendored dependencies are used
  • Key dependencies: openshift/api, openshift/library-go, k8s.io client libraries
  • Go version: 1.24.0 (toolchain: go1.24.4)
  • Dependency management: Uses go.mod with vendoring
  • Build flags: Use GOFLAGS="-mod=vendor" for builds and tests to ensure vendored dependencies are used
  • Key dependencies: openshift/api, openshift/library-go, k8s.io client libraries

Quick Reference

This Repository

Document Purpose
ARCHITECTURE.md System architecture, components, repository structure
CONVENTIONS.md Go coding standards, patterns, import organization
TESTING.md Testing patterns, commands, debugging
README.md Project README with setup instructions

Console Repository (openshift/console)

For frontend-related guidelines, see the openshift/console repository:

Document Purpose
STYLEGUIDE.md Frontend code style guidelines
INTERNATIONALIZATION.md i18n patterns and translation guidelines
CONTRIBUTING.md Contribution guidelines for the console project

Project Summary

The **console-operator...

Files:

  • pkg/console/operator/operator.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/operator/sync_v400_test.go
**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}

⚙️ CodeRabbit configuration file

**/*.{py,js,ts,go,rs,java,rb,php,kt,swift,cs}: Injection prevention (prodsec-skills):

  • SQL: parameterized queries only; no string concatenation
  • Command: no shell=True, os.system, or backtick exec with user input
  • LDAP/XPath: escape special characters in filters
  • Path traversal: canonicalize paths, reject ../
  • Deserialization: no pickle/yaml.load()/eval on untrusted data
  • Prototype pollution: no recursive merge of untrusted objects
  • Validate at trust boundaries with allow-lists, not deny-lists
  • Normalize Unicode and anchor regexes (^$); watch for ReDoS

Files:

  • pkg/console/operator/operator.go
  • pkg/console/operator/sync_v400.go
  • pkg/console/operator/sync_v400_test.go
**/*sync*.go

📄 CodeRabbit inference engine (CONVENTIONS.md)

Implement sync loops (sync_v400) incrementally: start from zero, create/update missing requirements, and return to continue on next loop

Files:

  • pkg/console/operator/sync_v400.go
  • pkg/console/operator/sync_v400_test.go
**/*_test.go

📄 CodeRabbit inference engine (AGENTS.md)

Follow testing patterns and commands documented in TESTING.md

Follow testing patterns and commands as documented in TESTING.md, including running unit tests with 'make test-unit' and checks with 'make check'

**/*_test.go: Use table-driven tests for comprehensive coverage
Use httptest for HTTP handler testing in Go
Include proper cleanup functions in tests
Test both success and failure paths

Files:

  • pkg/console/operator/sync_v400_test.go

⚙️ CodeRabbit configuration file

**/*_test.go: Review test code for quality and patterns.

Refer to /unit-test-review when test is in pkg//*_test.go:**

  • Table-driven test structure with test cases
  • Use of go-test/deep for struct comparisons
  • Test naming conventions (TestFunctionName)
  • Error handling with wantErr pattern
  • Edge case coverage (nil, empty, boundary values)
  • Proper assertions with helpful error messages
  • Test isolation (no shared mutable state)

Refer to /e2e-test-review when test contains:

  • framework.MustNewClientset(t, nil) or similar e2e framework usage
  • wait.Poll or wait.PollImmediate patterns
  • retry.RetryOnConflict for updates
  • Cleanup via defer functions
  • Console/operator CR manipulations
  • Test assertions on cluster state

Suggest to use /e2e-test-review when:

  • PR adds new feature requiring e2e coverage
  • Test file is empty or skeleton
  • Comments indicate "TODO: add test"

Review for common issues:

  • Missing cleanup (defer statements)
  • Using time.Sleep instead of wait.Poll
  • Missing context timeouts
  • Vague error messages in assertions
  • Tests without table-driven structure when testing multiple cases
  • Ignoring errors with _
  • Tests without assertions

Files:

  • pkg/console/operator/sync_v400_test.go
🔇 Additional comments (3)
pkg/console/operator/operator.go (1)

97-98: LGTM!

pkg/console/operator/sync_v400_test.go (1)

543-681: LGTM!

pkg/console/operator/sync_v400.go (1)

343-360: 🩺 Stability & Availability

Need evidence before rewriting the comment.


Walkthrough

Adds a lastDeploymentAvailableTime field to consoleOperator and a deploymentAvailableGracePeriod constant (2 minutes). Replaces inline availability logic in sync_v400 with a new evaluateDeploymentAvailability helper that suppresses Available=False during the grace window after replicas drop to zero. Tests cover all boundary cases.

Changes

Deployment Availability Grace Period

Layer / File(s) Summary
Grace period constant, struct field, and helper
pkg/console/operator/operator.go, pkg/console/operator/sync_v400.go
Adds lastDeploymentAvailableTime time.Time to consoleOperator, defines deploymentAvailableGracePeriod = 2*time.Minute, replaces inline status.HandleAvailable call with co.evaluateDeploymentAvailability(actualDeployment), and implements the helper that updates the timestamp when available, suppresses InsufficientReplicas within the grace window, and reports it after the window expires.
Tests
pkg/console/operator/sync_v400_test.go
Adds TestEvaluateDeploymentAvailability covering: available replicas set timestamp, zero replicas with no prior availability fails immediately, zero replicas within grace period suppressed, zero replicas beyond grace period fails, recovery updates timestamp, and boundary inside grace period is suppressed.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise, specific, and matches the main change: adding a grace period before reporting Available=False.
Description check ✅ Passed It covers the root cause, solution, related issue, and test plan, though several template sections are omitted or folded into other headings.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed The added subtest titles are static, descriptive strings with no dynamic values, timestamps, pod/node/namespace names, or other unstable identifiers.
Test Structure And Quality ✅ Passed New tests are plain table-driven unit tests, cover one scenario per subtest, use no cluster resources/waits, and follow existing repo test style.
Microshift Test Compatibility ✅ Passed The new test is a plain Go unit test (testing.T), not a Ginkgo e2e test, and it only uses Deployments/core APIs available on MicroShift.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo/e2e test was added; the new coverage is a plain unit test on deployment status and has no multi-node/SNO assumptions.
Topology-Aware Scheduling Compatibility ✅ Passed Touched code only adds deployment availability grace logic and tests; no pod specs, replicas, affinity, node selectors, or topology labels were introduced.
Ote Binary Stdout Contract ✅ Passed Touched code only adds availability logic and tests; no stdout writes or process-level setup changes appear in main/init/TestMain/BeforeSuite paths.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PASS: The added tests are standard unit tests; no Ginkgo e2e blocks, hardcoded IPv4s, or external connectivity were found.
No-Weak-Crypto ✅ Passed No MD5/SHA1/DES/RC4/3DES/Blowfish/ECB, custom crypto, or secret/token comparisons were added in the touched files.
Container-Privileges ✅ Passed PR only changes Go logic/tests; no privileged/hostNetwork/hostPID/hostIPC/SYS_ADMIN/allowPrivilegeEscalation settings found in changed files or repo scan.
No-Sensitive-Data-In-Logs ✅ Passed No new logging of secrets, tokens, PII, hostnames, or customer data was added; the only new log reports replica timing and grace-period duration.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@openshift-ci

openshift-ci Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

@sg00dwin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-operator ac8fe6a link true /test e2e-aws-operator
ci/prow/e2e-aws-console ac8fe6a link true /test e2e-aws-console
ci/prow/e2e-gcp-ovn ac8fe6a link true /test e2e-gcp-ovn

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sg00dwin

Copy link
Copy Markdown
Member Author

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants