[Chaos][OADP] Automate cases: Delete Pods during OADP Restore #3284
qwang1 wants to merge 1 commit into RedHatQE:main
Conversation
📝 Walkthrough
Adds OADP restore orchestration and a restore chaos test; introduces OADP-phase waiting helpers; tightens exception handling and pod-deletion semantics.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ Failed checks (2 warnings) | ✅ Passed checks (1 passed)
Report bugs in Issues
Welcome! 🎉 This pull request will be automatically processed with the following features:
🔄 Automatic Actions
📋 Available Commands: PR Status Management, Review & Approval, Testing & Validation, Container Operations, Cherry-pick Operations, Label Management
✅ Merge Requirements: This PR will be automatically approved when the required conditions are met.
📊 Review Process: Approvers and Reviewers
Available Labels
💡 Tips
For more information, please refer to the project documentation or contact the maintainers.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #3284 +/- ##
==========================================
+ Coverage 98.63% 98.65% +0.02%
==========================================
Files 25 25
Lines 2420 2462 +42
==========================================
+ Hits 2387 2429 +42
Misses 33 33
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
/retest all
Auto-triggered: Files in this PR were modified by merged PR #3755. Overlapping files: utilities/constants.py
@coderabbitai resume
✅ Actions performed: Reviews resumed.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/chaos/utils.py`:
- Around line 602-613: The post-try unreachable check for vm is redundant
because the TimeoutSampler loop either breaks with a truthy sample (setting vm
via try_get_and_wait_running) or the TimeoutExpiredError is caught and re-raised
as TimeoutError; remove the unreachable block that checks "if not vm: raise
TimeoutError(...)" and rely on the existing try/except around TimeoutSampler and
the break condition in the loop (symbols: TimeoutSampler,
try_get_and_wait_running, TimeoutExpiredError, TimeoutError, vm, vm_name,
namespace, timeout).
- Around line 564-573: The loop using TimeoutSampler with func=is_running can
never fall through because TimeoutSampler raises TimeoutExpiredError on timeout,
so remove the unreachable raise or, better, catch TimeoutExpiredError around the
iteration and re-raise a clearer TimeoutError with context; wrap the for loop in
a try/except TimeoutExpiredError block, log via LOGGER.info/LOGGER.error as
needed, then raise TimeoutError(f"VM {vm.name} did not reach Running state in
{timeout}s") from the caught TimeoutExpiredError so callers get the original
traceback.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b4c66557-93df-459c-87d6-80cab5347d71
📒 Files selected for processing (4)
tests/chaos/oadp/conftest.py, tests/chaos/oadp/test_oadp.py, tests/chaos/utils.py, utilities/constants.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/chaos/oadp/conftest.py (1)
150-169: ⚠️ Potential issue | 🟠 Major
HIGH: These `finally` blocks can replace the real OADP failure with a cleanup error.
Line 158 and Line 355 both call `pod_deleting_process_recover()` during teardown. If `wait_for_oadp_phase()` or the phase assertion fails first, any exception raised by recovery will override that original failure. That is especially risky here because `pod_deleting_process_recover()` explicitly raises `ResourceNotFoundError` when no matching workloads exist.
The simplest fix is to only enforce recovery after the operation reached `COMPLETED`, or to catch/log recovery errors here without re-raising when backup/restore has already failed. Right now a secondary recovery problem can hide the real backup/restore result.
Also applies to: 349-359
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` around lines 150 - 169, In the finally block, avoid letting pod_deleting_process_recover override earlier test failures: either only call pod_deleting_process_recover if the OADP operation reached the COMPLETED phase (check via wait_for_oadp_phase/phase result or a success flag on pod_deleting_thread_during_oadp_operations) OR wrap the pod_deleting_process_recover(...) call in a try/except that catches ResourceNotFoundError, ValueError, TypeError and logs the recovery failure without re-raising when an earlier exception is present; keep references to pod_deleting_process_recover and pod_deleting_thread_during_oadp_operations to locate the code and ensure stop_event/thread.join logic remains unchanged.
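To make the suggestion above concrete, here is a minimal sketch of a teardown that cannot mask an in-flight failure. The names `recover_workloads` and `WorkloadRecoveryError` are hypothetical stand-ins for `pod_deleting_process_recover()` and `ResourceNotFoundError`, not the repo's actual code:

```python
import logging
import sys

LOGGER = logging.getLogger(__name__)


class WorkloadRecoveryError(LookupError):
    """Hypothetical stand-in for ResourceNotFoundError."""


def recover_workloads() -> None:
    """Hypothetical stand-in for pod_deleting_process_recover()."""
    raise WorkloadRecoveryError("no matching workloads")


def run_with_safe_recovery(operation) -> None:
    try:
        operation()
    finally:
        # True while an exception from `operation()` is still propagating.
        operation_failed = sys.exc_info()[1] is not None
        try:
            recover_workloads()
        except (WorkloadRecoveryError, ValueError, TypeError) as err:
            LOGGER.error(f"Recovery failed: {err}")
            if not operation_failed:
                raise  # nothing to mask, so surface the recovery problem
            # Otherwise swallow: the real backup/restore failure propagates.
```

With this shape, a recovery error during an already-failed backup/restore is logged instead of replacing the original exception.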
♻️ Duplicate comments (1)
tests/chaos/utils.py (1)
578-620: ⚠️ Potential issue | 🔴 Critical
CRITICAL: Return a `VirtualMachineForTests` wrapper here, not a raw `VirtualMachine`.
`tests/chaos/oadp/test_oadp.py` passes this return value straight into `check_file_in_running_vm()`, and that helper opens `Console(vm=...)`. A raw `ocp_resources.VirtualMachine` can exist and reach `Running`, but it still lacks the QE wrapper's login context, so the restore can succeed and the console validation still fail.
🐛 Minimal fix
```diff
 def wait_for_restored_vm(
     admin_client,
     namespace,
     vm_name,
+    os_flavor,
     timeout=TIMEOUT_5MIN,
     sleep=5,
-) -> VirtualMachine:
+) -> VirtualMachineForTests:
@@
-    LOGGER.info(f"VM {vm.name} found, waiting for Running")
-    wait_until_vm_running(vm=vm, timeout=timeout, sleep=sleep)
-
-    LOGGER.info(f"Restored VM {vm_name} is Running")
-    return vm
+    restored_vm = VirtualMachineForTests(
+        name=vm.name,
+        namespace=namespace_name,
+        client=admin_client,
+        os_flavor=os_flavor,
+        generate_unique_name=False,
+    )
+    LOGGER.info(f"VM {restored_vm.name} found, waiting for Running")
+    wait_until_vm_running(vm=restored_vm, timeout=timeout, sleep=sleep)
+    return restored_vm
```
Then pass `os_flavor=rhel_vm_with_dv_running.os_flavor` from `test_delete_pods_during_restore()`.
Based on learnings: `VirtualMachineForTests` auto-generates unique names by default, so wrappers around existing restored VMs need `generate_unique_name=False` to preserve the real resource name.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/utils.py` around lines 578 - 620, The function wait_for_restored_vm currently returns a raw VirtualMachine which breaks Console(vm=...) usage in check_file_in_running_vm; change wait_for_restored_vm to accept an optional os_flavor parameter and before returning wrap the found vm in the QE wrapper: return a VirtualMachineForTests wrapper around the existing VirtualMachine with generate_unique_name=False and the passed os_flavor (use VirtualMachineForTests(..., generate_unique_name=False, os_flavor=os_flavor) or the project’s wrapper constructor signature), and update the caller test_delete_pods_during_restore to pass os_flavor=rhel_vm_with_dv_running.os_flavor when calling wait_for_restored_vm so the wrapper preserves the real VM name and has the correct flavor for login context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/chaos/oadp/conftest.py`:
- Around line 150-169: In the finally block, avoid letting
pod_deleting_process_recover override earlier test failures: either only call
pod_deleting_process_recover if the OADP operation reached the COMPLETED phase
(check via wait_for_oadp_phase/phase result or a success flag on
pod_deleting_thread_during_oadp_operations) OR wrap the
pod_deleting_process_recover(...) call in a try/except that catches
ResourceNotFoundError, ValueError, TypeError and logs the recovery failure
without re-raising when an earlier exception is present; keep references to
pod_deleting_process_recover and pod_deleting_thread_during_oadp_operations to
locate the code and ensure stop_event/thread.join logic remains unchanged.
---
Duplicate comments:
In `@tests/chaos/utils.py`:
- Around line 578-620: The function wait_for_restored_vm currently returns a raw
VirtualMachine which breaks Console(vm=...) usage in check_file_in_running_vm;
change wait_for_restored_vm to accept an optional os_flavor parameter and before
returning wrap the found vm in the QE wrapper: return a VirtualMachineForTests
wrapper around the existing VirtualMachine with generate_unique_name=False and
the passed os_flavor (use VirtualMachineForTests(...,
generate_unique_name=False, os_flavor=os_flavor) or the project’s wrapper
constructor signature), and update the caller test_delete_pods_during_restore to
pass os_flavor=rhel_vm_with_dv_running.os_flavor when calling
wait_for_restored_vm so the wrapper preserves the real VM name and has the
correct flavor for login context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ad42c555-effc-41f0-8740-0c325ac78486
📒 Files selected for processing (4)
tests/chaos/oadp/conftest.py, tests/chaos/oadp/test_oadp.py, tests/chaos/utils.py, utilities/constants.py
rnetser left a comment:
Did not complete the review; please check how the functionality is implemented in the repo and align the code here.
/retest all
Auto-triggered: Files in this PR were modified by merged PR #4171. Overlapping files: utilities/constants.py
/retest all
Auto-triggered: Files in this PR were modified by merged PR #4169. Overlapping files: utilities/constants.py
/retest all
Auto-triggered: Files in this PR were modified by merged PR #4267. Overlapping files: utilities/constants.py
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/chaos/oadp/conftest.py (1)
156-162: ⚠️ Potential issue | 🔴 Critical
CRITICAL: Parenthesize the multi-exception handler; this is a blocking SyntaxError.
Line 162 uses a bare `except ResourceNotFoundError, ValueError, TypeError:`, which Python versions before 3.14 reject with `SyntaxError: multiple exception types must be parenthesized`. This syntax error blocks parsing of the entire file. Wrap the exceptions in parentheses to fix the syntax:
Fix
```diff
-    except ResourceNotFoundError, ValueError, TypeError:
+    except (ResourceNotFoundError, ValueError, TypeError):
```
Prior to Python 3.14, multiple exception types in except clauses must always be parenthesized. On the repo's current toolchain this is a syntax requirement, not optional style.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` around lines 156 - 162, The except clause after the call to pod_deleting_process_recover uses comma-separated exception names which raises a SyntaxError; change the handler to use a parenthesized tuple of exceptions (e.g., except (ResourceNotFoundError, ValueError, TypeError):) so the multi-exception catch is valid in Python 3.x; update the except that references ResourceNotFoundError, ValueError, TypeError in the conftest.py block around the pod_deleting_process_recover call.
♻️ Duplicate comments (1)
tests/chaos/oadp/conftest.py (1)
283-302: ⚠️ Potential issue | 🟠 Major
HIGH: Do not expose `final_status` before chaos teardown finishes.
The `yield` at Line 290 gives the test control before Lines 293-302 stop the deleter thread and recover the OADP workloads. That makes restored-VM validation race ongoing pod deletion instead of validating the post-restore steady state.
Return the status after cleanup:
```diff
     try:
         final_status = wait_for_oadp_phase(
             resource=oadp_restore_started,
             terminal_statuses=terminal_statuses,
             timeout=TIMEOUT_3MIN,
         )
         assert final_status in allowed_statuses, f"Restore ended in unexpected terminal phase: {final_status}"
-        yield final_status
     finally:
         pod_deleting_thread_during_oadp_operations["stop_event"].set()
         if thread.is_alive():
             thread.join(timeout=TIMEOUT_1MIN)
@@
         pod_deleting_process_recover(
             resources=[Deployment, DaemonSet],
             namespace=namespace,
             pod_prefix=pod_prefix,
         )
+
+    return final_status
```
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` around lines 283 - 302, The test yields control before the chaos teardown runs, causing validation to race with pod deletion; fix by moving the yield so it happens after cleanup. Specifically, declare final_status in the outer scope, assign it via wait_for_oadp_phase (and keep the assert) inside the try block, then perform pod_deleting_thread_during_oadp_operations["stop_event"].set(), thread.join(...), and pod_deleting_process_recover(...) in the finally block, and only after that finally block yield final_status; reference symbols: wait_for_oadp_phase, final_status, allowed_statuses, pod_deleting_thread_during_oadp_operations, thread, pod_deleting_process_recover.
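A minimal sketch of the return-after-cleanup pattern the diff above suggests, using hypothetical stand-ins (`wait_for_phase`, `recover_workloads`) rather than the repo's fixtures. Because the fixture uses a plain `return` instead of `yield`, the test body cannot start until the chaos thread is joined and recovery has run:

```python
import threading

import pytest


def wait_for_phase() -> str:
    """Hypothetical stand-in for wait_for_oadp_phase()."""
    return "Completed"


def recover_workloads() -> None:
    """Hypothetical stand-in for pod_deleting_process_recover()."""


@pytest.fixture
def restore_final_status() -> str:
    stop_event = threading.Event()
    deleter = threading.Thread(target=stop_event.wait)  # stand-in chaos thread
    deleter.start()
    try:
        final_status = wait_for_phase()
    finally:
        # Teardown happens here, before the test body ever sees the status.
        stop_event.set()
        if deleter.is_alive():
            deleter.join(timeout=60)
        recover_workloads()
    return final_status  # plain return: no teardown left to race against
```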
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/chaos/oadp/conftest.py`:
- Around line 193-204: The ResourceEditor context is closed before
wait_for_oadp_phase runs, so the patched field
backup_obj.spec["defaultVolumesToFsBackup"]=False is reverted too early; move
the call to wait_for_oadp_phase inside the with ResourceEditor(backup.instance)
as backup_obj: block so the patch remains active until the backup reaches the
expected phase (use the same parameters currently passed to wait_for_oadp_phase
and keep references to ResourceEditor, backup.instance, backup_obj, and
backup.Backup.Status.COMPLETED).
In `@tests/chaos/oadp/test_oadp.py`:
- Around line 144-224: The test uses a module-scoped fixture
rhel_vm_with_dv_running while each parametrized case deletes and recreates the
VM/namespace, causing cross-test contamination; change the VM/namespace fixture
to function scope (or add a new function-scoped fixture used only by
test_delete_pods_during_restore) so each parametrized run gets a fresh
VM/namespace; update the fixture definition for rhel_vm_with_dv_running (or
create rhel_vm_per_test) to scope="function" and ensure
test_delete_pods_during_restore and the restore_with_pod_deletion_orchestration
flow consume that function-scoped fixture so state is isolated per parameter
set.
In `@tests/chaos/utils.py`:
- Around line 670-673: Remove the unreachable AssertionError at the end of the
TimeoutSampler loop: the iterator either returns a sample in TimeoutSampler or
raises TimeoutExpiredError (caught and re-raised), so delete the "raise
AssertionError(...)" statement after the exception handler to avoid dead code;
locate this in the TimeoutSampler/loop implementation where TimeoutExpiredError
is caught and re-raised.
---
Outside diff comments:
In `@tests/chaos/oadp/conftest.py`:
- Around line 156-162: The except clause after the call to
pod_deleting_process_recover uses comma-separated exception names which raises a
SyntaxError; change the handler to use a parenthesized tuple of exceptions
(e.g., except (ResourceNotFoundError, ValueError, TypeError):) so the
multi-exception catch is valid in Python 3.x; update the except that references
ResourceNotFoundError, ValueError, TypeError in the conftest.py block around the
pod_deleting_process_recover call.
---
Duplicate comments:
In `@tests/chaos/oadp/conftest.py`:
- Around line 283-302: The test yields control before the chaos teardown runs,
causing validation to race with pod deletion; fix by moving the yield so it
happens after cleanup. Specifically, declare final_status in the outer scope,
assign it via wait_for_oadp_phase (and keep the assert) inside the try block,
then perform pod_deleting_thread_during_oadp_operations["stop_event"].set(),
thread.join(...), and pod_deleting_process_recover(...) in the finally block,
and only after that finally block yield final_status; reference symbols:
wait_for_oadp_phase, final_status, allowed_statuses,
pod_deleting_thread_during_oadp_operations, thread,
pod_deleting_process_recover.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 23f3d04f-cead-45af-910a-c29663d3e88f
📒 Files selected for processing (4)
tests/chaos/oadp/conftest.py, tests/chaos/oadp/test_oadp.py, tests/chaos/utils.py, utilities/constants.py
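Several of the utils.py findings above hinge on the same `TimeoutSampler` control flow: iterating the sampler either yields samples or raises `TimeoutExpiredError`, so code placed after the loop is unreachable. A self-contained sketch of that flow (the `timeout_sampler` import path is an assumption; the sampler signature matches the diffs in this PR):

```python
# Import path assumed; the signature matches the TimeoutSampler calls in this PR.
from timeout_sampler import TimeoutExpiredError, TimeoutSampler


def wait_for_truthy(func, timeout: int, sleep: int = 5):
    try:
        for sample in TimeoutSampler(wait_timeout=timeout, sleep=sleep, func=func):
            if sample:
                return sample  # the loop exits only through a truthy sample...
    except TimeoutExpiredError as error:
        # ...or through this exception, re-raised with clearer context.
        raise TimeoutError(f"{func.__name__} not truthy within {timeout}s") from error
    # Nothing belongs here: the two exits above are exhaustive, so a
    # post-loop "if not result: raise ..." check is dead code.
```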
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/chaos/oadp/conftest.py (1)
156-163: ⚠️ Potential issue | 🔴 Critical
Fix the Python 3 syntax error: parenthesize multiple exception types.
Line 162 blocks module collection with `SyntaxError: multiple exception types must be parenthesized`. When catching multiple exception types in a single clause, wrap them in parentheses:
```diff
-    except ResourceNotFoundError, ValueError, TypeError:
+    except (ResourceNotFoundError, ValueError, TypeError):
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` around lines 156 - 163, The except clause incorrectly lists multiple exception types with commas causing a Python 3.14 SyntaxError; update the except in the pod_deleting_process_recover call to catch a tuple of exceptions like (ResourceNotFoundError, ValueError, TypeError) and leave the LOGGER.error(...) block unchanged so the handler runs for any of those exceptions.
♻️ Duplicate comments (1)
tests/chaos/oadp/conftest.py (1)
283-302: ⚠️ Potential issue | 🟠 Major
HIGH: Don't hand control back to the test until pod-deletion chaos is fully stopped.
Line 290 yields to `test_delete_pods_during_restore()` before Lines 293-302 stop the thread and recover the affected workloads. That makes `wait_for_restored_vm()` and `check_file_in_running_vm()` race ongoing deletions instead of validating a recovered post-restore state. The race noted earlier is still present.
🔧 Return a stable post-recovery state
```diff
     try:
         final_status = wait_for_oadp_phase(
             resource=oadp_restore_started,
             terminal_statuses=terminal_statuses,
             timeout=TIMEOUT_3MIN,
         )
         assert final_status in allowed_statuses, f"Restore ended in unexpected terminal phase: {final_status}"
-        yield final_status
     finally:
         pod_deleting_thread_during_oadp_operations["stop_event"].set()
         if thread.is_alive():
             thread.join(timeout=TIMEOUT_1MIN)
         # Recovery only — thread teardown handled by pod_deleting_thread fixture
         pod_deleting_process_recover(
             resources=[Deployment, DaemonSet],
             namespace=namespace,
             pod_prefix=pod_prefix,
         )
+
+    return final_status
```
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` around lines 283 - 302, The test yields control too early: move the yield of final_status (from wait_for_oadp_phase / variable final_status) so that pod-deleting chaos is fully stopped and workloads recovered first; specifically, set pod_deleting_thread_during_oadp_operations["stop_event"], join the thread (use thread.join with TIMEOUT_1MIN if alive), call pod_deleting_process_recover (passing Deployment/DaemonSet, namespace, pod_prefix) and only then yield final_status so downstream helpers like wait_for_restored_vm() and check_file_in_running_vm() run against a stable, post-recovery state.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/chaos/utils.py`:
- Line 614: Replace the runtime assert with an explicit runtime check after the
TimeoutSampler loop: verify that the local variable vm is not None (the one set
inside the TimeoutSampler loop) and if it is None raise a clear exception (e.g.,
raise RuntimeError or AssertionError with the same message) so the check cannot
be stripped by python -O and the type checker is satisfied; update the code
around the TimeoutSampler loop and the variable vm accordingly.
- Around line 558-575: The function wait_until_vm_running currently lacks an
explicit return type; update its signature to include a None return annotation
(i.e., def wait_until_vm_running(vm: VirtualMachine, timeout=TIMEOUT_5MIN,
sleep=5) -> None:) so the contract is explicit for IDEs/type-checkers while
keeping the same behavior involving TimeoutSampler and raising TimeoutError from
TimeoutExpiredError.
- Line 673: The trailing "assert False, 'Unreachable code reached'" in the
TimeoutSampler loop is unsafe and unnecessary because the loop either returns a
sample or raises TimeoutExpiredError; remove this assertion (preferred) or, if a
return-path is required by the type checker, replace it with an explicit raise
AssertionError() so the check is not stripped by python -O; update the code
around the TimeoutSampler generator/loop (the function/class named
TimeoutSampler and its surrounding try/except/return logic) to eliminate the
unreachable assert and ensure control flow remains clear.
---
Outside diff comments:
In `@tests/chaos/oadp/conftest.py`:
- Around line 156-163: The except clause incorrectly lists multiple exception
types with commas causing a Python 3.14 SyntaxError; update the except in the
pod_deleting_process_recover call to catch a tuple of exceptions like
(ResourceNotFoundError, ValueError, TypeError) and leave the LOGGER.error(...)
block unchanged so the handler runs for any of those exceptions.
---
Duplicate comments:
In `@tests/chaos/oadp/conftest.py`:
- Around line 283-302: The test yields control too early: move the yield of
final_status (from wait_for_oadp_phase / variable final_status) so that
pod-deleting chaos is fully stopped and workloads recovered first; specifically,
set pod_deleting_thread_during_oadp_operations["stop_event"], join the thread
(use thread.join with TIMEOUT_1MIN if alive), call pod_deleting_process_recover
(passing Deployment/DaemonSet, namespace, pod_prefix) and only then yield
final_status so downstream helpers like wait_for_restored_vm() and
check_file_in_running_vm() run against a stable, post-recovery state.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7acddf4b-e53e-4169-86dc-d616ce69b5dc
📒 Files selected for processing (4)
tests/chaos/oadp/conftest.py, tests/chaos/oadp/test_oadp.py, tests/chaos/utils.py, utilities/constants.py
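The assert-related findings above come down to one Python behavior: `python -O` strips `assert` statements, so guards that must always fire need an explicit raise. A minimal illustration with hypothetical arguments:

```python
def check_phase(sample: str, expected_phase: str, name: str) -> None:
    # Stripped entirely when Python runs with -O:
    # assert sample == expected_phase, f"{name} ended with phase {sample}"

    # Survives -O, so the guard always fires:
    if sample != expected_phase:
        raise AssertionError(f"{name} ended with phase {sample}, expected {expected_phase}")
```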
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/chaos/utils.py`:
- Around line 614-615: Remove the redundant unreachable assertion after the
TimeoutSampler loop: the post-loop check "if vm is None: raise
AssertionError(...)" is unnecessary because the loop either breaks with a truthy
sample (setting vm via try_get_and_wait_running) or raises TimeoutExpiredError;
delete this check to reduce cognitive overhead and keep logic in the surrounding
function that uses TimeoutSampler and try_get_and_wait_running unchanged.
- Line 674: Remove the unreachable "raise AssertionError(...)" at the end of the
TimeoutSampler consumption block; the TimeoutSampler iterator either returns a
value or raises TimeoutExpiredError (which is already caught and re-raised), so
the assertion is dead code—delete the raise statement in the function using
TimeoutSampler (referenced as TimeoutSampler and TimeoutExpiredError) to clean
up the unreachable path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c46d35ae-61ea-4547-b7b2-6c84969031f4
📒 Files selected for processing (4)
tests/chaos/oadp/conftest.py, tests/chaos/oadp/test_oadp.py, tests/chaos/utils.py, utilities/constants.py
@coderabbitai resume
✅ Actions performed: Reviews resumed.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/chaos/oadp/conftest.py`:
- Line 162: The except clause using multiple exception types is not
parenthesized and causes a SyntaxError; update the except handler that currently
lists ResourceNotFoundError, ValueError, TypeError to use the proper
parenthesized multi-except form (i.e., wrap the exception types in parentheses)
so the try/except in conftest.py compiles correctly.
In `@tests/chaos/oadp/test_oadp.py`:
- Around line 196-211: The test docstring currently uses a nonconforming
"High-level flow:" heading; update the docstring to follow the repo STD format
by replacing that heading with "Steps:", add a top-level "Preconditions:"
section summarizing setup (healthy VM, backup exists), and add a final
"Expected:" section describing pass criteria (restore reaches terminal state, VM
recreated and Running, guest data intact); ensure section names are exactly
"Preconditions:", "Steps:", and "Expected:" and keep the same bulleted content
under the appropriate headings so tooling and conventions (the docstring for
this OADP restore resilience test) remain compatible.
In `@tests/chaos/utils.py`:
- Around line 665-668: Replace the reachable assert in the conditional that
checks expected_phase with an explicit raise of AssertionError so the check
still runs under python -O; locate the block that currently reads "if
expected_phase: assert sample == expected_phase, (f\"{resource.name} ended with
phase {sample}, expected {expected_phase}\")" and change it to raise
AssertionError with the same formatted message (use the variables sample,
expected_phase and resource.name) so failures are never stripped by
optimization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1093bff8-584d-4f20-b1d8-94407321e2e7
📒 Files selected for processing (4)
tests/chaos/oadp/conftest.py, tests/chaos/oadp/test_oadp.py, tests/chaos/utils.py, utilities/constants.py
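A sketch of the STD docstring shape the review above asks for; the wording paraphrases the flow described in this PR rather than quoting the repo's actual test:

```python
def test_delete_pods_during_restore():
    """
    Delete OADP pods while a restore is in progress and verify recovery.

    Preconditions:
        - A healthy RHEL VM with a DV is running and a completed backup exists.

    Steps:
        1. Delete the VM and its namespace.
        2. Start a background process that deletes OADP pods.
        3. Trigger the restore and wait for a terminal phase.

    Expected:
        - The restore reaches a terminal state.
        - The VM is recreated and reaches Running.
        - Guest data written before the backup is intact.
    """
```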
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/chaos/utils.py (1)
350-350: ⚠️ Potential issue | 🔴 Critical
CRITICAL: Parameter name mismatch causes `TypeError` at runtime.
The function signature uses `resources` (plural) at tests/chaos/utils.py:350, but the call site at tests/chaos/conftest.py:404 passes `resource=` (singular). This will raise `TypeError: pod_deleting_process_recover() got an unexpected keyword argument 'resource'` when executed.
Fix by changing `resource=` to `resources=` at tests/chaos/conftest.py:404-405.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/utils.py` at line 350, The test fails because the function pod_deleting_process_recover(resources, namespace, pod_prefix) expects the parameter name resources but the call uses resource=; update the call site to pass resources= (instead of resource=) so the keyword matches the function signature (locate the invocation that passes resource= and rename it to resources=).
♻️ Duplicate comments (2)
tests/chaos/oadp/test_oadp.py (1)
196-211: ⚠️ Potential issue | 🟡 Minor
MEDIUM: Use required STD section headers in the new restore test docstring.
Line 199 uses `High-level flow:`; this should be structured as `Preconditions:`, `Steps:`, and `Expected:` to match repository STD tooling/conventions.
As per coding guidelines: STD (Software Test Description) test docstrings MUST follow the format with required sections: Preconditions:, Steps:, Expected:.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/test_oadp.py` around lines 196 - 211, Update the test docstring to use the required STD section headers instead of "High-level flow:": replace that header with three sections "Preconditions:", "Steps:", and "Expected:", moving the existing numbered flow items so preconditions list setup (e.g., VM creation and backup), steps list actions performed (delete VM/namespace, start background pod-deletion process, trigger restore, wait), and expected list the outcomes (restore succeeds, VM recreated and Running, guest data verified); edit the triple-quoted docstring in tests/chaos/oadp/test_oadp.py that describes the restore test to reflect these STD headers.
tests/chaos/oadp/conftest.py (1)
162-162: ⚠️ Potential issue | 🔴 Critical
CRITICAL: SyntaxError — parenthesize multiple exception types.
This is invalid Python 3 syntax and will fail at import time:
```python
except ResourceNotFoundError, ValueError, TypeError:  # SyntaxError
```
WHY: Python 3 requires parentheses for multi-exception `except` clauses. The comma-separated form was Python 2 syntax for `except ExceptionType, variable:`.
🐛 Fix
```diff
-    except ResourceNotFoundError, ValueError, TypeError:
+    except (ResourceNotFoundError, ValueError, TypeError):
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` at line 162, The except clause currently uses Python2-style comma-separated exceptions ("except ResourceNotFoundError, ValueError, TypeError:") which raises a SyntaxError in Python3; change it to use a parenthesized tuple of exception types like "except (ResourceNotFoundError, ValueError, TypeError):" (preserving the existing exception handling body) so the interpreter accepts multiple exception types correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/chaos/oadp/conftest.py`:
- Around line 271-281: Replace the inline terminal statuses set in the fixture
with the shared constant from utilities.constants: remove the local
terminal_statuses definition and import and use OADP_RESTORE_TERMINAL_STATUSES
(which should mirror VeleroRestore.Status values) instead; keep allowed_statuses
but if it should be a subset derive it from OADP_RESTORE_TERMINAL_STATUSES (or
explicitly import/define OADP_RESTORE_TERMINAL_STATUSES in utilities.constants
as suggested) so the fixture uses the centralized constant rather than
hard-coded VeleroRestore.Status members.
- Around line 297-302: Wrap the call to pod_deleting_process_recover in the
recovery-only fixture in a try/except like the
backup_with_pod_deletion_orchestration fixture: catch Exception as err, log a
contextual error including namespace and pod_prefix (e.g., using the existing
logger or processLogger) and then re-raise the exception so failures are
visible; reference the pod_deleting_process_recover invocation and the
surrounding fixture to locate where to add the try/except and logging.
- Around line 35-36: POLL_INTERVAL is declared but unused; either remove the
constant or replace hard-coded poll values with it—update the TimeoutSampler
usages (the calls to TimeoutSampler that currently pass literal poll intervals
like 5 and 10) to use POLL_INTERVAL (or a second constant if different intervals
are needed) or simply delete the POLL_INTERVAL declaration if you prefer
literals; ensure you modify the TimeoutSampler invocations (the TimeoutSampler
calls in this module) to reference the constant name POLL_INTERVAL when
applicable.
In `@tests/chaos/utils.py`:
- Around line 665-668: The reachable assertion in the conditional block that
checks expected_phase (comparing sample to expected_phase and referencing
resource.name) must be replaced with an explicit raise so it isn't removed under
python -O; locate the conditional that reads "if expected_phase:" and replace
the assert that compares sample == expected_phase with an explicit raise (e.g.,
raise AssertionError or a ValueError) including the same formatted message
f"{resource.name} ended with phase {sample}, expected {expected_phase}" so
failures still raise when running optimized Python.
---
Outside diff comments:
In `@tests/chaos/utils.py`:
- Line 350: The test fails because the function
pod_deleting_process_recover(resources, namespace, pod_prefix) expects the
parameter name resources but the call uses resource=; update the call site to
pass resources= (instead of resource=) so the keyword matches the function
signature (locate the invocation that passes resource= and rename it to
resources=).
---
Duplicate comments:
In `@tests/chaos/oadp/conftest.py`:
- Line 162: The except clause currently uses Python2-style comma-separated
exceptions ("except ResourceNotFoundError, ValueError, TypeError:") which raises
a SyntaxError in Python3; change it to use a parenthesized tuple of exception
types like "except (ResourceNotFoundError, ValueError, TypeError):" (preserving
the existing exception handling body) so the interpreter accepts multiple
exception types correctly.
In `@tests/chaos/oadp/test_oadp.py`:
- Around line 196-211: Update the test docstring to use the required STD section
headers instead of "High-level flow:": replace that header with three sections
"Preconditions:", "Steps:", and "Expected:", moving the existing numbered flow
items so preconditions list setup (e.g., VM creation and backup), steps list
actions performed (delete VM/namespace, start background pod-deletion process,
trigger restore, wait), and expected list the outcomes (restore succeeds, VM
recreated and Running, guest data verified); edit the triple-quoted docstring in
tests/chaos/oadp/test_oadp.py that describes the restore test to reflect these
STD headers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: cf4b306d-a0c6-400c-9d7e-10a9af0ca687
📒 Files selected for processing (4)
tests/chaos/oadp/conftest.py, tests/chaos/oadp/test_oadp.py, tests/chaos/utils.py, utilities/constants.py
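For the `resources=`/`resource=` mismatch flagged in this review, a hedged sketch of keeping the helper backward-compatible while callers migrate; the deprecation warning is an added suggestion, not something the review mandates:

```python
import warnings


def pod_deleting_process_recover(resources=None, namespace=None, pod_prefix="", *, resource=None):
    """Sketch: accept the legacy singular keyword while callers migrate."""
    if resources is None and resource is not None:
        warnings.warn(
            "'resource=' is deprecated; pass 'resources=[...]'",
            DeprecationWarning,
            stacklevel=2,
        )
        resources = [resource]
    if not resources:
        raise TypeError("pod_deleting_process_recover() requires 'resources' (or legacy 'resource')")
    for workload in resources:
        # Recover each workload kind in `namespace` matching `pod_prefix`.
        ...
```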
Actionable comments posted: 2
♻️ Duplicate comments (1)
tests/chaos/oadp/conftest.py (1)
161-161: ⚠️ Potential issue | 🔴 Critical
CRITICAL: SyntaxError — parenthesize the multi-except handler.
WHY: Python 3 requires parentheses around multiple exception types. The syntax `except A, B, C:` was Python 2's way to bind an exception to a variable (like `except A as B`). In Python 3.11–3.13 (within your requires-python range), this raises `SyntaxError: multiple exception types must be parenthesized` and the code won't compile.
Fix
```diff
-    except ResourceNotFoundError, ValueError, TypeError:
+    except (ResourceNotFoundError, ValueError, TypeError):
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` at line 161, The multi-except handler using `except ResourceNotFoundError, ValueError, TypeError:` is invalid in Python 3; change it to a parenthesized tuple of exceptions like `except (ResourceNotFoundError, ValueError, TypeError):` (or `except (ResourceNotFoundError, ValueError, TypeError) as e:` if the exception object is needed) in the relevant try/except block to fix the SyntaxError in conftest.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/chaos/oadp/conftest.py`:
- Line 298: The multi-except handler currently uses Python 2 comma syntax
(listing ResourceNotFoundError, ValueError, TypeError separated by commas);
update the handler to use a parenthesized tuple of exception types instead
(i.e., wrap the exception names in parentheses) so the except statement is valid
in Python 3; make the same change for both occurrences in
tests/chaos/oadp/conftest.py (the handlers that reference ResourceNotFoundError,
ValueError, TypeError).
In `@utilities/constants.py`:
- Around line 1044-1056: OADP_BACKUP_TERMINAL_STATUSES and
OADP_RESTORE_TERMINAL_STATUSES contain identical sets and risk drift;
consolidate them by defining a single canonical constant (e.g.,
OADP_TERMINAL_STATUSES) containing the set of terminal status strings and then
set OADP_BACKUP_TERMINAL_STATUSES and OADP_RESTORE_TERMINAL_STATUSES to
reference that canonical constant (or remove the duplicates and import/reference
the canonical name wherever used) so there is a single source of truth.
---
Duplicate comments:
In `@tests/chaos/oadp/conftest.py`:
- Line 161: The multi-except handler using `except ResourceNotFoundError,
ValueError, TypeError:` is invalid in Python 3; change it to a parenthesized
tuple of exceptions like `except (ResourceNotFoundError, ValueError,
TypeError):` (or `except (ResourceNotFoundError, ValueError, TypeError) as e:`
if the exception object is needed) in the relevant try/except block to fix the
SyntaxError in conftest.py.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 095c8415-cbde-4fa3-bd50-1127e949a492
📒 Files selected for processing (4)
tests/chaos/oadp/conftest.py, tests/chaos/oadp/test_oadp.py, tests/chaos/utils.py, utilities/constants.py
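A sketch of the single-source-of-truth consolidation suggested for `utilities/constants.py`. The status strings shown are the usual Velero terminal phases and are an assumption here; they should be checked against the actual `VeleroRestore`/`Backup` status values:

```python
# utilities/constants.py (sketch)

# Single source of truth; the concrete strings are assumptions to verify
# against the VeleroBackup/VeleroRestore status definitions.
OADP_TERMINAL_STATUSES = frozenset({
    "Completed",
    "Failed",
    "PartiallyFailed",
    "FailedValidation",
})

# Aliases reference the canonical set, so the two names cannot drift apart.
OADP_BACKUP_TERMINAL_STATUSES = OADP_TERMINAL_STATUSES
OADP_RESTORE_TERMINAL_STATUSES = OADP_TERMINAL_STATUSES
```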
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/chaos/utils.py (1)
54-104: ⚠️ Potential issue | 🟠 Major
HIGH: `ratio=0` still keeps the deleter alive for the full duration.
`delete_pods()` now treats `ratio == 0` as a supported no-op, but both continuous deleter paths still keep their loop/process alive until `max_duration`. That means "disabled chaos" still burns a worker for the whole interval and emits repeated no-op logs instead of exiting immediately.
♻️ Proposed fix
```diff
 def _delete_pods_continuously(client, pod_prefix, namespace_name, ratio, interval, max_duration):
+    if ratio == 0:
+        LOGGER.info(f"Skipping pod deleting process for prefix {pod_prefix}: ratio=0")
+        return
+
     try:
         for _ in TimeoutSampler(
             wait_timeout=max_duration,
             sleep=interval,
             func=delete_pods,
@@
 def delete_pods_continuously(
     stop_event,
     client,
     pod_prefix,
     namespace_name,
     ratio,
     interval,
     max_duration,
 ):
+    if ratio == 0:
+        LOGGER.info(f"Skipping continuous pod deletion for prefix {pod_prefix}: ratio=0")
+        return
+
     start_time = time.time()
```
Also applies to: 435-440, 468-512
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/utils.py` around lines 54 - 104, The continuous pod-deleter currently keeps running until max_duration even when ratio == 0; update create_pod_deleting_process by having its inner function _delete_pods_continuously check if ratio == 0 (or falsy) at start and return immediately to avoid spawning a long-lived no-op process; locate the inner function _delete_pods_continuously and add an early "if ratio == 0: return" (or equivalent) before calling delete_pods/TimeoutSampler, and apply the same fix to the other similar deleter routines that use delete_pods (the other continuous deleter loops referenced in the diff).
♻️ Duplicate comments (3)
tests/chaos/oadp/conftest.py (3)
155-161: ⚠️ Potential issue | 🔴 Critical
CRITICAL: Parenthesize both multi-except handlers.
Line 161 and Line 298 use a multi-except form that the repo's current parser/Flake8 still rejects, so `tests/chaos/oadp/conftest.py` will not import in CI. Use `except (ResourceNotFoundError, ValueError, TypeError):` in both blocks.
🛠️ Minimal fix
```diff
-    except ResourceNotFoundError, ValueError, TypeError:
+    except (ResourceNotFoundError, ValueError, TypeError):
@@
-    except ResourceNotFoundError, ValueError, TypeError:
+    except (ResourceNotFoundError, ValueError, TypeError):
```
🔎 Verification
```bash
#!/bin/bash
set -euo pipefail

python --version
for cfg in pyproject.toml tox.ini setup.cfg .python-version; do
  if [ -f "$cfg" ]; then
    echo "== $cfg =="
    rg -n "requires-python|target-version|python_version" "$cfg" || true
  fi
done

python - <<'PY'
from pathlib import Path

source = Path("tests/chaos/oadp/conftest.py").read_text()
compile(source, "tests/chaos/oadp/conftest.py", "exec")
print("compile(): OK")
PY
```
Also applies to: 292-298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` around lines 155 - 161, The multi-except clauses around the pod_deleting_process_recover call use invalid syntax; change the except clauses from "except ResourceNotFoundError, ValueError, TypeError:" to a parenthesized tuple form "except (ResourceNotFoundError, ValueError, TypeError):" in the block that wraps the pod_deleting_process_recover invocation (and the similar multi-except block later in the file), so Python/Flake8 will parse the file correctly.
270-283: ⚠️ Potential issue | 🟠 Major
HIGH: Let the test assert the final restore phase.
Lines 272-283 narrow the result back to `Completed`/`Failed` even though the wait already uses `OADP_RESTORE_TERMINAL_STATUSES`. That turns `PartiallyFailed` and `FailedValidation` into fixture-setup errors instead of a regular `expected_status` mismatch in the test.
✂️ Minimal cleanup
```diff
-    allowed_statuses = {
-        oadp_restore_started.Status.COMPLETED,
-        oadp_restore_started.Status.FAILED,
-    }
-
     try:
         final_status = wait_for_oadp_phase(
             resource=oadp_restore_started,
             terminal_statuses=terminal_statuses,
             timeout=TIMEOUT_3MIN,
         )
-        assert final_status in allowed_statuses, f"Restore ended in unexpected terminal phase: {final_status}"
         yield final_status
```
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` around lines 270 - 283, The test narrows terminal restore outcomes to only COMPLETED/FAILED which hides other terminal phases; remove the allowed_statuses set and the restrictive assertion and instead assert that the returned final_status from wait_for_oadp_phase (called with terminal_statuses=OADP_RESTORE_TERMINAL_STATUSES) is one of the expected terminal statuses by checking membership against OADP_RESTORE_TERMINAL_STATUSES (use oadp_restore_started and final_status in the assertion message to preserve context).
284-299: ⚠️ Potential issue | 🟠 Major
HIGH: Recover the OADP workloads before returning from this fixture.
Line 284 yields `final_status` before the stop/join/recover path runs. The downstream restore validation then waits for the VM while `velero` / `controller-manager` / `node-agent` pods can still be getting deleted, which makes the post-restore checks race ongoing chaos. Return the phase only after the thread is stopped and `pod_deleting_process_recover(...)` completes, or split orchestration and post-recovery validation into separate fixtures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` around lines 284 - 299, The fixture yields final_status too early and must only return after the chaos thread is stopped and pod recovery completes; move the yield of final_status to after the finally block (after pod_deleting_thread_during_oadp_operations["stop_event"].set(), thread.join(...), and pod_deleting_process_recover(...)) or alternatively split orchestration and post-recovery validation into separate fixtures so that pod_deleting_process_recover (the function handling Deployment/DaemonSet recovery) finishes before the fixture returns; also ensure you use the existing pod_deleting_thread fixture for thread teardown and keep references to final_status, pod_deleting_thread_during_oadp_operations, and pod_deleting_process_recover to locate the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/chaos/utils.py`:
- Around line 574-575: Before converting TimeoutExpiredError to TimeoutError,
add an ERROR-level log that records the full context and original exception;
specifically, in the except blocks that catch TimeoutExpiredError (the
occurrences handling VM state waits that reference vm.name and timeout) call the
module logger (e.g., logger.error or logger.exception) with a message like "VM
<vm.name> failed to reach 'Running' within <timeout>s; current state: <vm.status
or similar>" and include the original exception (exc_info=error or use
logger.exception) before raising the new TimeoutError; apply this change to the
three catch sites shown (the block that currently does `except
TimeoutExpiredError as error: raise TimeoutError(f"VM {vm.name}...") from error`
and the similar blocks at the other mentioned locations).
---
Outside diff comments:
In `@tests/chaos/utils.py`:
- Around line 54-104: The continuous pod-deleter currently keeps running until
max_duration even when ratio == 0; update create_pod_deleting_process by having
its inner function _delete_pods_continuously check if ratio == 0 (or falsy) at
start and return immediately to avoid spawning a long-lived no-op process;
locate the inner function _delete_pods_continuously and add an early "if ratio
== 0: return" (or equivalent) before calling delete_pods/TimeoutSampler, and
apply the same fix to the other similar deleter routines that use delete_pods
(the other continuous deleter loops referenced in the diff).
---
Duplicate comments:
In `@tests/chaos/oadp/conftest.py`:
- Around line 155-161: The multi-except clauses around the
pod_deleting_process_recover call use invalid syntax; change the except clauses
from "except ResourceNotFoundError, ValueError, TypeError:" to a parenthesized
tuple form "except (ResourceNotFoundError, ValueError, TypeError):" in the block
that wraps the pod_deleting_process_recover invocation (and the similar
multi-except block later in the file), so Python/Flake8 will parse the file
correctly.
- Around line 270-283: The test narrows terminal restore outcomes to only
COMPLETED/FAILED which hides other terminal phases; remove the allowed_statuses
set and the restrictive assertion and instead assert that the returned
final_status from wait_for_oadp_phase (called with
terminal_statuses=OADP_RESTORE_TERMINAL_STATUSES) is one of the expected
terminal statuses by checking membership against OADP_RESTORE_TERMINAL_STATUSES
(use oadp_restore_started and final_status in the assertion message to preserve
context).
- Around line 284-299: The fixture yields final_status too early and must only
return after the chaos thread is stopped and pod recovery completes; move the
yield of final_status to after the finally block (after
pod_deleting_thread_during_oadp_operations["stop_event"].set(),
thread.join(...), and pod_deleting_process_recover(...)) or alternatively split
orchestration and post-recovery validation into separate fixtures so that
pod_deleting_process_recover (the function handling Deployment/DaemonSet
recovery) finishes before the fixture returns; also ensure you use the existing
pod_deleting_thread fixture for thread teardown and keep references to
final_status, pod_deleting_thread_during_oadp_operations, and
pod_deleting_process_recover to locate the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2e62e786-e7ad-4a5b-8d93-449da7cc738c
📒 Files selected for processing (4)
tests/chaos/oadp/conftest.py, tests/chaos/oadp/test_oadp.py, tests/chaos/utils.py, utilities/constants.py
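A sketch combining two recurring asks from these reviews: log at ERROR level with full context, then translate `TimeoutExpiredError` into `TimeoutError`. The `vm.name`/`vm.status` attributes and the `timeout_sampler` import path are assumptions:

```python
import logging

# Import path assumed; the sampler signature below matches this PR's diffs.
from timeout_sampler import TimeoutExpiredError, TimeoutSampler

LOGGER = logging.getLogger(__name__)


def wait_until_vm_running(vm, timeout: int, sleep: int = 5) -> None:
    def is_running() -> bool:
        return vm.status == "Running"  # `status` attribute is an assumption

    try:
        for sample in TimeoutSampler(wait_timeout=timeout, sleep=sleep, func=is_running):
            if sample:
                return
    except TimeoutExpiredError as error:
        # ERROR-level context first: what failed, expected vs actual state.
        LOGGER.error(
            f"VM {vm.name} failed to reach 'Running' within {timeout}s; last state: {vm.status}"
        )
        raise TimeoutError(f"VM {vm.name} did not reach Running state in {timeout}s") from error
```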
♻️ Duplicate comments (4)
tests/chaos/utils.py (2)
591-618: ⚠️ Potential issue | 🟠 Major
HIGH: `timeout` is being spent twice here.
Lines 604-607 can consume the full budget waiting for the VM to exist, and Line 618 then starts a second full-budget wait for `Running`. A 5-minute timeout can therefore stretch to nearly 10 minutes and hide slow-restore regressions.
⏱️ Keep a single timeout budget
LOGGER.info(f"Waiting for restored VM {vm_name}") namespace_name = namespace.name if hasattr(namespace, "name") else namespace + deadline = time.monotonic() + timeout vm = None @@ if vm is None: raise AssertionError("Unreachable: vm should be set after TimeoutSampler loop") LOGGER.info(f"VM {vm.name} found, waiting for Running") - wait_until_vm_running(vm=vm, timeout=timeout, sleep=sleep) + remaining_timeout = max(1, int(deadline - time.monotonic())) + wait_until_vm_running(vm=vm, timeout=remaining_timeout, sleep=sleep)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/utils.py` around lines 591 - 618, The current code can use the full timeout twice: first in the TimeoutSampler loop (try_get_and_wait_running) and then again in wait_until_vm_running, potentially doubling the wait budget. Fix by tracking elapsed time after the sampler returns (or while looping) and pass the remaining timeout to wait_until_vm_running (e.g., remaining = max(0, timeout - elapsed)), or refactor try_get_and_wait_running/TimeoutSampler to check vm.status == "Running" so only one TimeoutSampler call is needed; update calls to use the remaining timeout and keep sleep unchanged, referencing TimeoutSampler, try_get_and_wait_running, wait_until_vm_running, timeout, sleep, and vm.
574-575: ⚠️ Potential issue | 🟡 Minor
MEDIUM: Log translated timeouts before re-raising them.
These `except TimeoutExpiredError` blocks convert the failure straight into `TimeoutError`, but they never emit an `ERROR` log with the last-known VM or phase context. In long chaos runs, that drops the most useful triage signal from the log stream.
As per coding guidelines: ERROR-level logging is REQUIRED for exceptions, with full context: what failed, expected vs actual values, resource state.
Also applies to: 609-612, 670-671
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/utils.py` around lines 574 - 575, The except TimeoutExpiredError handlers that currently raise a TimeoutError (the blocks matching "except TimeoutExpiredError as error:" that reference vm.name and the similar blocks at the other noted locations) must first emit an ERROR-level log containing full context (VM name, phase/state, expected vs actual/last-seen values and the timeout) and include the captured exception details, then re-raise the TimeoutError using "raise ... from error"; update each such handler in tests/chaos/utils.py (including the occurrences around the vm.name block and the ones noted at the other locations) to call the module's logger.error (or the existing process logger used in this file) with a descriptive message and the exception before re-raising.
tests/chaos/oadp/conftest.py (2)
284-289: ⚠️ Potential issue | 🟠 Major
HIGH: Line 284 yields before the chaos thread is stopped.
In pytest, Lines 287-299 run during fixture teardown, not before the test body. `test_delete_pods_during_restore()` therefore waits for the restored VM and checks guest data while the deleter thread can still be killing OADP pods, so the post-restore validation is racing the disruption instead of a recovered state. Stop/join and recover first, then return the phase.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` around lines 284 - 289, The fixture yields final_status too early — stop the pod deleter and wait for it to finish before yielding so tests don't race with chaos; move the pod_deleting_thread_during_oadp_operations["stop_event"].set() and the thread.join(timeout=TIMEOUT_1MIN) (and any recovery/wait-for-recovery steps if present) to execute before the yield of final_status (i.e., ensure the call to thread.join and cleanup happen prior to returning final_status from the fixture, referencing pod_deleting_thread_during_oadp_operations, thread, TIMEOUT_1MIN, and yield final_status).
161-161: ⚠️ Potential issue | 🔴 Critical
CRITICAL: Parenthesize both multi-except handlers or the current checks will fail to parse this file.
Flake8 is already reporting `E999` on Line 161, and Line 298 has the same form. The comma syntax only parses on Python 3.14+, so this blocks lint/import with the repo's current toolchain.
🔧 Minimal fix
```diff
-    except ResourceNotFoundError, ValueError, TypeError:
+    except (ResourceNotFoundError, ValueError, TypeError):
@@
-    except ResourceNotFoundError, ValueError, TypeError:
+    except (ResourceNotFoundError, ValueError, TypeError):
```
Based on learnings: "with Python 3.14 the syntax `except ValueError, TypeError:` is valid if there is no `as` clause"; the current Flake8 E999 shows this repo's toolchain is not on that parser yet.
Also applies to: 298-298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` at line 161, The problematic multi-except uses comma-separated exceptions (ResourceNotFoundError, ValueError, TypeError) which is invalid for our parser; update each occurrence to either wrap those exception names in a single exception-tuple (parenthesize them) or split them into separate except blocks so each exception is handled properly; locate the offending except statement(s) that refer to ResourceNotFoundError, ValueError, and TypeError (there are two occurrences) and apply the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests/chaos/oadp/conftest.py`:
- Around line 284-289: The fixture yields final_status too early — stop the pod
deleter and wait for it to finish before yielding so tests don't race with
chaos; move the pod_deleting_thread_during_oadp_operations["stop_event"].set()
and the thread.join(timeout=TIMEOUT_1MIN) (and any recovery/wait-for-recovery
steps if present) to execute before the yield of final_status (i.e., ensure the
call to thread.join and cleanup happen prior to returning final_status from the
fixture, referencing pod_deleting_thread_during_oadp_operations, thread,
TIMEOUT_1MIN, and yield final_status).
- Line 161: The problematic multi-except uses comma-separated exceptions
(ResourceNotFoundError, ValueError, TypeError) which is invalid for our parser;
update each occurrence to either wrap those exception names in a single
exception-tuple (parenthesize them) or split them into separate except blocks so
each exception is handled properly; locate the offending except statement(s)
that refer to ResourceNotFoundError, ValueError, and TypeError (there are two
occurrences) and apply the change.
In `@tests/chaos/utils.py`:
- Around line 591-618: The current code can use the full timeout twice: first in
the TimeoutSampler loop (try_get_and_wait_running) and then again in
wait_until_vm_running, potentially doubling the wait budget. Fix by tracking
elapsed time after the sampler returns (or while looping) and pass the remaining
timeout to wait_until_vm_running (e.g., remaining = max(0, timeout - elapsed)),
or refactor try_get_and_wait_running/TimeoutSampler to check vm.status ==
"Running" so only one TimeoutSampler call is needed; update calls to use the
remaining timeout and keep sleep unchanged, referencing TimeoutSampler,
try_get_and_wait_running, wait_until_vm_running, timeout, sleep, and vm.
- Around line 574-575: The except TimeoutExpiredError handlers that currently
raise a TimeoutError (the blocks matching "except TimeoutExpiredError as error:"
that reference vm.name and the similar blocks at the other noted locations) must
first emit an ERROR-level log containing full context (VM name, phase/state,
expected vs actual/last-seen values and the timeout) and include the captured
exception details, then re-raise the TimeoutError using "raise ... from error";
update each such handler in tests/chaos/utils.py (including the occurrences
around the vm.name block and the ones noted at the other locations) to call the
module's logger.error (or the existing process logger used in this file) with a
descriptive message and the exception before re-raising.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f036865a-ad9d-4f7b-b86a-24f3e5b25572
📒 Files selected for processing (4)
tests/chaos/oadp/conftest.py
tests/chaos/oadp/test_oadp.py
tests/chaos/utils.py
utilities/constants.py
@coderabbitai resume
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/chaos/utils.py (1)
350-401: ⚠️ Potential issue | 🔴 Critical

CRITICAL: Existing chaos recovery callers still use `resource=`, so this helper will fail before it reaches the wait logic.

`tests/chaos/conftest.py:404-407` calls `pod_deleting_process_recover(resource=pod_deletion_config["resource"], ...)`, but this API only accepts `resources`. That raises `TypeError` in the standard chaos cleanup path. Update that caller in the same PR, or keep this helper backward-compatible with the singular form.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/utils.py` around lines 350 - 401, The helper pod_deleting_process_recover currently requires a list named resources and raises TypeError if callers pass the singular keyword resource; make it backward-compatible by accepting either resources (list) or resource (single Deployment/DaemonSet) and normalizing to a list early in the function (e.g., if resources is None and a resource arg exists, wrap it in a list, or if resources is not a list but is a single resource, treat it as [resources]); ensure the rest of the logic still validates each item via resource.kind and preserve existing error messages (adjust the initial TypeError message to reflect both accepted forms) so existing callers using resource= continue to work.
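A minimal sketch of the backward-compatible option described above; the signature is trimmed to the relevant keywords and the per-resource loop body is elided, since only the normalization is at issue here:

```python
def pod_deleting_process_recover(resources=None, resource=None, namespace=None, pod_prefix=None):
    # Accept the new plural keyword or the legacy singular one,
    # normalizing to a list before any validation runs.
    if resources is None and resource is not None:
        resources = [resource]
    if resources is not None and not isinstance(resources, list):
        resources = [resources]
    if not resources:
        raise TypeError("pass 'resources' (a list) or the legacy 'resource' (a single item)")
    for item in resources:
        ...  # existing per-resource recovery (validate item.kind, restore replicas, ...)
```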
♻️ Duplicate comments (3)
tests/chaos/utils.py (1)
579-620: ⚠️ Potential issue | 🟠 Major

HIGH: `wait_for_restored_vm()` can consume the timeout budget twice.

The first `TimeoutSampler` waits up to `timeout` for the VM object to appear, then `wait_until_vm_running()` can wait another full `timeout` for `Running`. A restore that produces the VM near the deadline can therefore block for almost 2× the advertised budget and make slow restores look healthy. Use a single sampler for "exists and is Running", or pass only the remaining time into `wait_until_vm_running()`.

⏱️ Minimal fix

```diff
+    start_time = time.monotonic()
     try:
         for sample in TimeoutSampler(wait_timeout=timeout, sleep=sleep, func=try_get_and_wait_running):
             if sample:
                 break
@@
     LOGGER.info(f"VM {vm.name} found, waiting for Running")
-    wait_until_vm_running(vm=vm, timeout=timeout, sleep=sleep)
+    remaining_timeout = max(0, timeout - (time.monotonic() - start_time))
+    wait_until_vm_running(vm=vm, timeout=remaining_timeout, sleep=sleep)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/utils.py` around lines 579 - 620, wait_for_restored_vm currently can spend the full timeout in TimeoutSampler and then another full timeout in wait_until_vm_running; change it to track elapsed time and only pass the remaining budget to wait_until_vm_running (or use a single TimeoutSampler that checks both existence and vm.status == "Running"). Concretely, in wait_for_restored_vm record start = time.time() before the TimeoutSampler loop, after the sampler finds vm compute elapsed = time.time() - start, compute remaining = timeout - elapsed, if remaining <= 0 raise TimeoutError with the same context, otherwise call wait_until_vm_running(vm=vm, timeout=remaining, sleep=sleep); references: wait_for_restored_vm, try_get_and_wait_running, TimeoutSampler, wait_until_vm_running.

tests/chaos/oadp/conftest.py (2)
161-161: ⚠️ Potential issue | 🔴 Critical

CRITICAL: Parenthesize both multi-except handlers.
Line 161 and Line 298 use the unparenthesized multi-except form, but the repo’s current lint/parser setup still rejects it, so this file will not import in CI. Use a tuple at both sites.
🔧 Minimal fix
```diff
- except ResourceNotFoundError, ValueError, TypeError:
+ except (ResourceNotFoundError, ValueError, TypeError):
...
- except ResourceNotFoundError, ValueError, TypeError:
+ except (ResourceNotFoundError, ValueError, TypeError):
```

Quick repro: this should fail until both handlers are parenthesized.

```bash
#!/bin/bash
python -m py_compile tests/chaos/oadp/conftest.py
```

Also applies to: 298-298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` at line 161, Replace the unparenthesized multi-except forms with a parenthesized tuple at both occurrences: locate the lines using the exact except pattern "except ResourceNotFoundError, ValueError, TypeError" and the other similar multi-except at the later occurrence and change them to use a tuple (e.g., except (ResourceNotFoundError, ValueError, TypeError):) so the file imports correctly under the current parser/lint rules.
277-299: ⚠️ Potential issue | 🟠 Major

HIGH: Yield the restore status only after chaos recovery finishes.

`yield final_status` at Line 284 hands control back to the test while the pod-deletion thread is still running. The restored-VM wait and file check then race ongoing pod deletions, so this validates restore under continued chaos instead of the post-chaos result.

♻️ Suggested restructuring

```diff
     try:
         final_status = wait_for_oadp_phase(
             resource=oadp_restore_started,
             terminal_statuses=terminal_statuses,
             timeout=TIMEOUT_3MIN,
         )
         assert final_status in allowed_statuses, (
             f"Restore ended in unexpected terminal phase: {final_status}"
         )
-        yield final_status
     finally:
         pod_deleting_thread_during_oadp_operations["stop_event"].set()
         if thread.is_alive():
             thread.join(timeout=TIMEOUT_1MIN)
@@
         except (ResourceNotFoundError, ValueError, TypeError):
             LOGGER.error(f"Recovery failed for prefix {pod_prefix} in namespace {namespace}")
             raise
+
+    return final_status
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` around lines 277 - 299, The test yields final_status before chaos recovery finishes, causing the post-restore checks to race with ongoing pod deletions; move the yield(final_status) to after the recovery/teardown block so that you first set pod_deleting_thread_during_oadp_operations["stop_event"], join thread (thread.join(timeout=TIMEOUT_1MIN)), run pod_deleting_process_recover(resources=[Deployment, DaemonSet], namespace=namespace, pod_prefix=pod_prefix) and only then yield final_status; also fix the except clause to use a tuple form except (ResourceNotFoundError, ValueError, TypeError): to correctly catch multiple exceptions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/chaos/utils.py`:
- Around line 350-401: The helper pod_deleting_process_recover currently
requires a list named resources and raises TypeError if callers pass the
singular keyword resource; make it backward-compatible by accepting either
resources (list) or resource (single Deployment/DaemonSet) and normalizing to a
list early in the function (e.g., if resources is None and a resource arg
exists, wrap it in a list, or if resources is not a list but is a single
resource, treat it as [resources]); ensure the rest of the logic still validates
each item via resource.kind and preserve existing error messages (adjust the
initial TypeError message to reflect both accepted forms) so existing callers
using resource= continue to work.
---
Duplicate comments:
In `@tests/chaos/oadp/conftest.py`:
- Line 161: Replace the unparenthesized multi-except forms with a parenthesized
tuple at both occurrences: locate the lines using the exact except pattern
"except ResourceNotFoundError, ValueError, TypeError" and the other similar
multi-except at the later occurrence and change them to use a tuple (e.g.,
except (ResourceNotFoundError, ValueError, TypeError):) so the file imports
correctly under the current parser/lint rules.
- Around line 277-299: The test yields final_status before chaos recovery
finishes, causing the post-restore checks to race with ongoing pod deletions;
move the yield(final_status) to after the recovery/teardown block so that you
first set pod_deleting_thread_during_oadp_operations["stop_event"], join thread
(thread.join(timeout=TIMEOUT_1MIN)), run
pod_deleting_process_recover(resources=[Deployment, DaemonSet],
namespace=namespace, pod_prefix=pod_prefix) and only then yield final_status;
also fix the except clause to use a tuple form except (ResourceNotFoundError,
ValueError, TypeError): to correctly catch multiple exceptions.
In `@tests/chaos/utils.py`:
- Around line 579-620: wait_for_restored_vm currently can spend the full timeout
in TimeoutSampler and then another full timeout in wait_until_vm_running; change
it to track elapsed time and only pass the remaining budget to
wait_until_vm_running (or use a single TimeoutSampler that checks both existence
and vm.status == "Running"). Concretely, in wait_for_restored_vm record start =
time.time() before the TimeoutSampler loop, after the sampler finds vm compute
elapsed = time.time() - start, compute remaining = timeout - elapsed, if
remaining <= 0 raise TimeoutError with the same context, otherwise call
wait_until_vm_running(vm=vm, timeout=remaining, sleep=sleep); references:
wait_for_restored_vm, try_get_and_wait_running, TimeoutSampler,
wait_until_vm_running.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: f60bafbe-6956-4a01-9eaa-42c0b9b8e642
📒 Files selected for processing (4)
tests/chaos/oadp/conftest.py
tests/chaos/oadp/test_oadp.py
tests/chaos/utils.py
utilities/constants.py
✅ Actions performed

Reviews resumed.
Polarion cases: CNV-12023, CNV-12025, CNV-12027

Signed-off-by: Qixuan Wang <qwang@redhat.com>
Actionable comments posted: 3
♻️ Duplicate comments (2)
tests/chaos/oadp/conftest.py (2)
285-307: ⚠️ Potential issue | 🟠 Major

HIGH: Finish chaos teardown before returning the restore phase.

Line 292 yields control back to the test, but Line 295 through Line 307 do not run until fixture teardown. `test_delete_pods_during_restore()` then waits on the restored VM while the pod-deletion thread is still active, so the new restore checks race ongoing chaos instead of validating post-recovery state.

🔧 Suggested restructure

```diff
     try:
         final_status = wait_for_oadp_phase(
             resource=oadp_restore_started,
             terminal_statuses=terminal_statuses,
             timeout=TIMEOUT_30MIN,
         )
-        assert final_status in allowed_statuses, f"Restore ended in unexpected terminal phase: {final_status}"
-        yield final_status
-
     finally:
         pod_deleting_thread_during_oadp_operations["stop_event"].set()
         if thread.is_alive():
             thread.join(timeout=TIMEOUT_1MIN)
         # Recovery only — thread teardown handled by pod_deleting_thread fixture
         try:
             pod_deleting_process_recover(
                 resources=[Deployment, DaemonSet],
                 namespace=namespace,
                 pod_prefix=pod_prefix,
             )
-        except ResourceNotFoundError, ValueError, TypeError:
+        except (ResourceNotFoundError, ValueError, TypeError):
             LOGGER.error(f"Recovery failed for prefix {pod_prefix} in namespace {namespace}")
             raise
+
+    assert final_status in allowed_statuses, (
+        f"Restore ended in unexpected terminal phase: {final_status}"
+    )
+    return final_status
```
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` around lines 285 - 307, The fixture yields the restore phase before stopping the chaos thread and performing recovery, causing races; ensure teardown runs before yielding by moving the pod_deleting_thread_during_oadp_operations["stop_event"].set(), waiting for thread.join(timeout=TIMEOUT_1MIN) and calling pod_deleting_process_recover(resources=[Deployment, DaemonSet], namespace=namespace, pod_prefix=pod_prefix) (catching exceptions with a tuple: except (ResourceNotFoundError, ValueError, TypeError):) to execute prior to yield from the fixture (or alternatively, join/stop the thread and run recovery immediately after wait_for_oadp_phase and assert, then yield the final_status).
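For reference, the shape being requested in both review rounds, reduced to a self-contained pytest sketch. The thread target, the one-second join budget, and the status value are stand-ins, not the real OADP helpers:

```python
import threading

import pytest


@pytest.fixture
def restore_status_after_chaos():
    stop_event = threading.Event()
    # Stand-in for the pod-deleting chaos thread.
    chaos_thread = threading.Thread(target=stop_event.wait)
    chaos_thread.start()
    try:
        final_status = "Completed"  # stand-in for wait_for_oadp_phase(...)
    finally:
        # Stop and join the chaos thread before control returns to the test,
        # so post-restore checks never race ongoing pod deletions.
        stop_event.set()
        chaos_thread.join(timeout=1)
    yield final_status
```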
165-165: ⚠️ Potential issue | 🔴 Critical

CRITICAL: Parenthesize both multi-except handlers.
Line 165 and Line 306 still use the unparenthesized multi-except form. The repo’s current parser/lint toolchain rejects it, so this file will not pass validation as written.
🔧 Minimal fix
```diff
- except ResourceNotFoundError, ValueError, TypeError:
+ except (ResourceNotFoundError, ValueError, TypeError):
```

```bash
#!/bin/bash
set -euo pipefail
rg -n 'except ResourceNotFoundError, ValueError, TypeError:' tests/chaos/oadp/conftest.py
python - <<'PY'
from pathlib import Path

source = Path("tests/chaos/oadp/conftest.py").read_text()
try:
    compile(source, "tests/chaos/oadp/conftest.py", "exec")
    print("compile(): OK")
except SyntaxError as exc:
    print(f"compile(): SyntaxError -> {exc}")
PY
```

Also applies to: 306-306
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/chaos/oadp/conftest.py` at line 165, The except clauses use the deprecated comma-separated multi-except form ("except ResourceNotFoundError, ValueError, TypeError:"); update both occurrences to use a parenthesized tuple form ("except (ResourceNotFoundError, ValueError, TypeError):") so the parser accepts the file; search for the exact string "except ResourceNotFoundError, ValueError, TypeError:" in conftest.py and replace both instances with the parenthesized version.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/chaos/oadp/conftest.py`:
- Around line 301-305: The restore fixture call passes a resources argument to
pod_deleting_process_recover but the function still ignores it and uses a
hardcoded [Deployment, DaemonSet]; update the pod_deleting_process_recover
implementation to consume the resources parameter (e.g., iterate over the
provided resources list instead of the hardcoded list) so the restore-case
inputs from tests/chaos/oadp/test_oadp.py (which pass [Deployment] or
[DaemonSet]) take effect, or if you prefer remove the unused resources parameter
from the call/signature to eliminate dead code; ensure the function signature
(pod_deleting_process_recover) and any internal uses reference the passed-in
resources variable rather than a literal.
In `@tests/chaos/utils.py`:
- Around line 462-465: The current except block catching TimeoutExpiredError in
the pod deletion helper (around pod.wait_deleted and TIMEOUT_30SEC) only logs at
INFO and continues; change it to log at ERROR with full context (pod.name,
timeout value, expected deletion vs observed state) using LOGGER.error, and then
re-raise the TimeoutExpiredError so callers can fail or retry explicitly; keep
the original exception when re-raising to preserve traceback.
- Around line 592-629: The current flow in wait_for_restored_vm
(try_get_and_wait_running, TimeoutSampler loop, then wait_until_vm_running) can
double-consume the timeout; fix by making the two phases share a single timeout
budget: record start_time before the TimeoutSampler, compute end_time =
start_time + timeout, after the sampler loop calculate remaining = end_time -
now and if remaining <= 0 raise the same TimeoutError, otherwise call
wait_until_vm_running(vm=vm, timeout=remaining, sleep=sleep). Use the existing
symbols try_get_and_wait_running, TimeoutSampler, wait_until_vm_running and
preserve the same error/LOG lines but pass namespace_name consistently.
---
Duplicate comments:
In `@tests/chaos/oadp/conftest.py`:
- Around line 285-307: The fixture yields the restore phase before stopping the
chaos thread and performing recovery, causing races; ensure teardown runs before
yielding by moving the
pod_deleting_thread_during_oadp_operations["stop_event"].set(), waiting for
thread.join(timeout=TIMEOUT_1MIN) and calling
pod_deleting_process_recover(resources=[Deployment, DaemonSet],
namespace=namespace, pod_prefix=pod_prefix) (catching exceptions with a tuple:
except (ResourceNotFoundError, ValueError, TypeError):) to execute prior to
yield from the fixture (or alternatively, join/stop the thread and run recovery
immediately after wait_for_oadp_phase and assert, then yield the final_status).
- Line 165: The except clauses use the deprecated comma-separated multi-except
form ("except ResourceNotFoundError, ValueError, TypeError:"); update both
occurrences to use a parenthesized tuple form ("except (ResourceNotFoundError,
ValueError, TypeError):") so the parser accepts the file; search for the exact
string "except ResourceNotFoundError, ValueError, TypeError:" in conftest.py and
replace both instances with the parenthesized version.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a2c1d712-7b34-4664-910b-d3d240d201ab
📒 Files selected for processing (4)
tests/chaos/oadp/conftest.py
tests/chaos/oadp/test_oadp.py
tests/chaos/utils.py
utilities/constants.py
```python
            pod_deleting_process_recover(
                resources=[Deployment, DaemonSet],
                namespace=namespace,
                pod_prefix=pod_prefix,
            )
```
MEDIUM: The new resources parameter is still inert.
The restore parametrization in tests/chaos/oadp/test_oadp.py now passes [Deployment] vs [DaemonSet], but this recovery call still hardcodes both resource types. Changing the case data currently has no effect, so either consume the payload here or remove it from the restore-case inputs.
♻️ Minimal fix
```diff
-                resources=[Deployment, DaemonSet],
+                resources=(
+                    pod_deleting_thread_during_oadp_operations["resources"]
+                    if pod_deleting_thread_during_oadp_operations["resources"] is not None
+                    else [Deployment, DaemonSet]
+                ),
```

As per coding guidelines: "No dead code - every function, variable, fixture MUST be used or removed; code marked with # skip-unused-code is excluded from dead code analysis".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/chaos/oadp/conftest.py` around lines 301 - 305, The restore fixture
call passes a resources argument to pod_deleting_process_recover but the
function still ignores it and uses a hardcoded [Deployment, DaemonSet]; update
the pod_deleting_process_recover implementation to consume the resources
parameter (e.g., iterate over the provided resources list instead of the
hardcoded list) so the restore-case inputs from tests/chaos/oadp/test_oadp.py
(which pass [Deployment] or [DaemonSet]) take effect, or if you prefer remove
the unused resources parameter from the call/signature to eliminate dead code;
ensure the function signature (pod_deleting_process_recover) and any internal
uses reference the passed-in resources variable rather than a literal.
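If the first option (consuming the parameter) is chosen, the change inside the helper is small. A sketch, under the assumption that some per-kind recovery step already exists in the real code; `recover_workload` is a hypothetical name for it:

```python
def pod_deleting_process_recover(resources, namespace, pod_prefix):
    # Iterate over the caller-supplied resource classes instead of a
    # hardcoded [Deployment, DaemonSet], so that [Deployment]-only or
    # [DaemonSet]-only restore cases actually change behavior.
    for resource_class in resources:
        recover_workload(resource_class=resource_class, namespace=namespace, pod_prefix=pod_prefix)


def recover_workload(resource_class, namespace, pod_prefix):
    ...  # existing recovery logic, now parameterized by resource_class
```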
```diff
             try:
                 pod.wait_deleted(timeout=TIMEOUT_30SEC)
             except TimeoutExpiredError:
-                LOGGER.warning(f"Pod {pod.name} was not deleted")
+                LOGGER.info(f"Pod {pod.name} was not deleted")
```
HIGH: Timed-out pod deletions are treated as success.
If wait_deleted() times out, this chaos round did not actually remove the selected pod, but the helper only emits an INFO log and keeps going. That can let the restore scenario pass without exercising the intended failure. Log this at ERROR and re-raise so callers can fail or retry explicitly.
🐛 Proposed fix
```diff
     for pod in pods_to_delete:
         LOGGER.info(f"Deleting pod {pod.name}")
         pod.clean_up()
         try:
             pod.wait_deleted(timeout=TIMEOUT_30SEC)
-        except TimeoutExpiredError:
-            LOGGER.info(f"Pod {pod.name} was not deleted")
+        except TimeoutExpiredError as error:
+            LOGGER.error(
+                f"Timed out waiting for pod {pod.name} in namespace {namespace_name} to be deleted"
+            )
+            raise TimeoutError(
+                f"Pod {pod.name} was not deleted within {TIMEOUT_30SEC}s"
+            ) from error
```

As per coding guidelines: ERROR level logging REQUIRED for exceptions with full context: what failed, expected vs actual values, resource state and NEVER silently swallow exceptions - at minimum, log the error before continuing in Python.
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 465-465: Logging statement uses f-string
(G004)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/chaos/utils.py` around lines 462 - 465, The current except block
catching TimeoutExpiredError in the pod deletion helper (around pod.wait_deleted
and TIMEOUT_30SEC) only logs at INFO and continues; change it to log at ERROR
with full context (pod.name, timeout value, expected deletion vs observed state)
using LOGGER.error, and then re-raise the TimeoutExpiredError so callers can
fail or retry explicitly; keep the original exception when re-raising to
preserve traceback.
```python
    namespace_name = namespace.name if hasattr(namespace, "name") else namespace

    vm = None

    def try_get_and_wait_running():
        nonlocal vm

        try:
            vm_iter = VirtualMachine.get(client=admin_client, namespace=namespace_name, name=vm_name)
            vm_obj = next(vm_iter, None)
            if vm_obj:
                test_vm = VirtualMachineForTests(
                    client=admin_client,
                    name=vm_name,
                    namespace=namespace_name,
                )
                test_vm.name = vm_name
                vm = test_vm
                return vm
        except ResourceNotFoundError:
            return None

    try:
        for sample in TimeoutSampler(wait_timeout=timeout, sleep=sleep, func=try_get_and_wait_running):
            if sample:
                break

    except TimeoutExpiredError as error:
        LOGGER.error(f"VM {vm_name} was not found in namespace {namespace_name} within {timeout}s")
        raise TimeoutError(
            f"VM {vm_name} not found or not running in namespace {namespace} within {timeout}s"
        ) from error

    if vm is None:
        raise AssertionError("Unreachable: vm should be set after TimeoutSampler loop")

    LOGGER.info(f"VM {vm.name} found, waiting for Running")
    wait_until_vm_running(vm=vm, timeout=timeout, sleep=sleep)
```
HIGH: timeout can be consumed twice in wait_for_restored_vm.
The first wait only covers “VM object exists”. After that breaks, wait_until_vm_running() gets the full timeout again. In the failure path, a 5-minute budget can stretch to nearly 10 minutes. Carry one timeout budget across both phases, or fold the Running check into the same wait.
♻️ Proposed fix
```diff
 def wait_for_restored_vm(
     admin_client,
     namespace,
     vm_name,
     timeout=TIMEOUT_5MIN,
     sleep=5,
 ) -> VirtualMachine:
@@
     LOGGER.info(f"Waiting for restored VM {vm_name}")
     namespace_name = namespace.name if hasattr(namespace, "name") else namespace
+    timeout_watch = TimeoutWatch(timeout=timeout)
     vm = None
@@
     LOGGER.info(f"VM {vm.name} found, waiting for Running")
-    wait_until_vm_running(vm=vm, timeout=timeout, sleep=sleep)
+    wait_until_vm_running(
+        vm=vm,
+        timeout=max(1, int(timeout_watch.remaining_time())),
+        sleep=sleep,
+    )
```
+ )🧰 Tools
🪛 Ruff (0.15.9)
[warning] 596-596: Missing return type annotation for private function try_get_and_wait_running
(ANN202)
[warning] 620-620: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
[warning] 620-620: Logging statement uses f-string
(G004)
[warning] 621-623: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 626-626: Avoid specifying long messages outside the exception class
(TRY003)
[warning] 628-628: Logging statement uses f-string
(G004)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/chaos/utils.py` around lines 592 - 629, The current flow in
wait_for_restored_vm (try_get_and_wait_running, TimeoutSampler loop, then
wait_until_vm_running) can double-consume the timeout; fix by making the two
phases share a single timeout budget: record start_time before the
TimeoutSampler, compute end_time = start_time + timeout, after the sampler loop
calculate remaining = end_time - now and if remaining <= 0 raise the same
TimeoutError, otherwise call wait_until_vm_running(vm=vm, timeout=remaining,
sleep=sleep). Use the existing symbols try_get_and_wait_running, TimeoutSampler,
wait_until_vm_running and preserve the same error/LOG lines but pass
namespace_name consistently.
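Independent of which helper carries the deadline, the shared-budget pattern itself is easy to isolate. A runnable sketch where both `wait_*` functions are placeholders for the real polls:

```python
import time


def wait_for_object_to_exist():
    time.sleep(0.1)  # placeholder for the existence poll


def wait_for_running(timeout):
    time.sleep(min(timeout, 0.1))  # placeholder for the Running poll


def wait_with_shared_budget(timeout):
    deadline = time.monotonic() + timeout

    wait_for_object_to_exist()  # phase 1 draws from the same deadline

    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError(f"budget of {timeout}s exhausted before the Running check")
    wait_for_running(timeout=remaining)  # phase 2 only receives what is left


wait_with_shared_budget(timeout=5)
```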
/retest all

Auto-triggered: Files in this PR were modified by merged PR #3748.

Overlapping files: utilities/constants.py
/retest all

Auto-triggered: Files in this PR were modified by merged PR #4725.

Overlapping files: utilities/constants.py
Short description:
Polarion cases: CNV-12023, CNV-12025, CNV-12027
Needs #3002 as a dependency.
More details:
What this PR does / why we need it:
Which issue(s) this PR fixes:
Special notes for reviewer:
jira-ticket:
Summary by CodeRabbit
Tests
Chores