test: add flaky test detection with pytest-rerunfailures and historical report analysis #474
larock22 wants to merge 3 commits into
Conversation
…al report analysis

- Add pytest-rerunfailures dependency for retry-on-failure (reruns=2, delay=1s)
- Add flaky marker to label known-unstable tests for quarantine
- Add scripts/check-flaky-tests.py to detect outcome-flaky and duration-variance tests from historical JSON reports
- Wire flaky test detection into pre-push hooks so flaky tests surface before pushing

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
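For context, outcome-flakiness detection over pytest-json-report output can be sketched as below. The `tests`/`nodeid`/`outcome` fields follow pytest-json-report's schema, but the function name and the thresholding are illustrative, not the script's actual code:

```python
from collections import defaultdict


def find_outcome_flaky(reports: list[dict], min_runs: int = 3) -> list[str]:
    """Collect per-test outcomes across historical runs; a test that has both
    passed and failed over enough runs is outcome-flaky."""
    outcomes: dict[str, list[str]] = defaultdict(list)
    for report in reports:
        # pytest-json-report stores one entry per test under "tests",
        # each with a "nodeid" and an overall "outcome".
        for test in report.get("tests", []):
            outcomes[test["nodeid"]].append(test["outcome"])
    return [
        nodeid
        for nodeid, results in outcomes.items()
        if len(results) >= min_runs and {"passed", "failed"} <= set(results)
    ]
```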
…ction

Integration and tmux tests inherently have variable timing due to external API calls and tmux sessions. Excluding them from duration variance analysis prevents false positives from the flaky test check.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
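Similarly, a duration-variance check with such exclusions might look like this sketch. The exclusion prefixes and the coefficient-of-variation threshold are assumptions, not the script's real values:

```python
import statistics
from collections import defaultdict

# Hypothetical exclusion prefixes; the real patterns live in scripts/check-flaky-tests.py.
EXCLUDED_PREFIXES = ("tests/integration/", "tests/tmux/")


def find_duration_variant(reports: list[dict], max_cv: float = 1.0) -> list[str]:
    """Flag tests whose call duration varies widely across historical runs,
    skipping suites whose timing is inherently unstable."""
    durations: dict[str, list[float]] = defaultdict(list)
    for report in reports:
        for test in report.get("tests", []):
            nodeid = test["nodeid"]
            if nodeid.startswith(EXCLUDED_PREFIXES):
                continue  # external APIs and tmux sessions make timing noisy
            call = test.get("call") or {}
            durations[nodeid].append(call.get("duration", 0.0))
    flagged = []
    for nodeid, times in durations.items():
        if len(times) < 2:
            continue  # statistics.stdev needs at least two data points
        mean = statistics.mean(times)
        if mean > 0 and statistics.stdev(times) / mean > max_cv:
            flagged.append(nodeid)  # coefficient of variation above threshold
    return flagged
```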
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration — Configuration used: Path: .coderabbit.yaml · Review profile: ASSERTIVE · Plan: Pro · Run ID:
⛔ Files ignored due to path filters (1)
📒 Files selected for processing (2)
💤 Files with no reviewable changes (1)
📝 Walkthrough

Adds the pytest rerun dependency and pytest config for retries, plus a `flaky` marker and a script that analyzes historical reports for flaky tests.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested labels
🚥 Pre-merge checks | ✅ 7 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (7 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Line 58: Add pytest-rerunfailures to deptry's DEP002 allowlist so deptry stops
flagging it as unused; update the tool.deptry.per_rule_ignores.DEP002 setting to
include "pytest-rerunfailures" (the plugin listed in pyproject.toml) so the
plugin entrypoint consumed by pytest is whitelisted and CI no longer fails.
In `@scripts/check-flaky-tests.py`:
- Around line 8-10: The usage string advertises a --min-runs flag but main()
never parses argv and the script uses the hard-coded MIN_RUNS constant; update
main() to parse a --min-runs (or --min_runs) argument (using argparse or
similar), validate it as an int, and use the parsed value instead of the
hard-coded MIN_RUNS when calling the detection logic (or when assigning
MIN_RUNS), and update the usage/help text accordingly; alternatively, if you
prefer to remove the flag, remove --min-runs from the usage and any mention in
help and keep MIN_RUNS as the single source of truth.
- Around line 31-35: The current loop in scripts/check-flaky-tests.py silently
swallows json.JSONDecodeError and OSError when loading each report (the try
around open(path)/json.load), which hides corrupt or unreadable inputs; change
it so failures are surfaced instead of continued—either remove the broad except
and let the exception propagate, or catch the exception and re-raise a new
exception that includes the path context (e.g., include the path variable) so
the caller fails loudly; ensure this change affects the block that opens the
file and appends to reports (reports.append(json.load(f))).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 57911e98-f445-411b-9b7b-8d55fa002e46
⛔ Files ignored due to path filters (2)
.pre-commit-config.yaml is excluded by !**/*.yaml
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
pyproject.toml
pytest.ini
scripts/check-flaky-tests.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (Custom checks)
**/*.py: Enforce dependency direction: ui → core → tools → utils/types. Flag violations where core/ imports from ui/, tools/ imports from ui/, tools/ imports from core/, or types/ imports from anything except stdlib/typing
In agent/orchestration code, verify exception handlers clean up state: flag `except (UserAbortError, CancelledError)` patterns that exist without corresponding cleanup like `_remove_dangling_tool_calls()` or state rollback. State mutations (messages.append, session modifications) followed by await/function calls that could raise must be cleaned up in except blocks
Check for shallow copy followed by nested dict mutation: flag `.copy()` on dicts that contain nested dicts/lists. Use `deepcopy()` instead or use dictionary spreading syntax to avoid mutating original objects
Enforce error handling principle: 'Fail fast, fail loud. No silent fallbacks.' Flag `except: pass`, `except Exception: pass`, empty except blocks, catching broad exceptions without re-raising or logging, and returning None/[]/{} sentinel values instead of raising exceptions for invalid inputs
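The shallow-copy rule above is easy to trip over; a tiny self-contained illustration (the `config` dict here is hypothetical):

```python
import copy

config = {"retries": {"count": 2}}

shallow = config.copy()           # top-level copy only; the nested dict is shared
shallow["retries"]["count"] = 5   # mutates config["retries"] as well
assert config["retries"]["count"] == 5

config = {"retries": {"count": 2}}
safe = copy.deepcopy(config)      # fully independent nested structures
safe["retries"]["count"] = 5
assert config["retries"]["count"] == 2
```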
Files:
scripts/check-flaky-tests.py
pyproject.toml
📄 CodeRabbit inference engine (AGENTS.md)
Python requirement: 3.11+ (enforce in pyproject.toml)
Files:
pyproject.toml
🪛 GitHub Actions: Lint
pyproject.toml
[error] 1-1: deptry (DEP002): 'pytest-rerunfailures' is defined as a dependency but not used in the codebase. Found 1 dependency issue.
🔇 Additional comments (1)
pytest.ini (1)
28-29: LGTM! The rerun settings and the new `flaky` marker are wired up consistently with the new test-flakiness workflow.

Also applies to: 49-49
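For reference, labeling a known-unstable test with the new marker might look like this; the test name is hypothetical:

```python
import pytest


@pytest.mark.flaky  # marker registered in pytest.ini; CI can quarantine with -m "not flaky"
def test_tmux_pane_resize():
    ...
```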
```
Usage:
    uv run python scripts/check-flaky-tests.py [--min-runs N]
"""
```
Implement --min-runs or remove it from the usage text.
main() never parses argv, so the documented --min-runs flag is silently ignored and the threshold is hard-coded via MIN_RUNS. That makes the CLI contract misleading and prevents callers from tuning the detection sensitivity.
Also applies to: 95-106
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/check-flaky-tests.py` around lines 8 - 10, The usage string
advertises a --min-runs flag but main() never parses argv and the script uses
the hard-coded MIN_RUNS constant; update main() to parse a --min-runs (or
--min_runs) argument (using argparse or similar), validate it as an int, and use
the parsed value instead of the hard-coded MIN_RUNS when calling the detection
logic (or when assigning MIN_RUNS), and update the usage/help text accordingly;
alternatively, if you prefer to remove the flag, remove --min-runs from the
usage and any mention in help and keep MIN_RUNS as the single source of truth.
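A minimal sketch of the argparse route, assuming `main()` currently reads a module-level `MIN_RUNS` constant (the default value here is a stand-in, not the script's actual setting):

```python
import argparse

DEFAULT_MIN_RUNS = 3  # stand-in for the script's MIN_RUNS constant


def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Detect flaky tests from historical JSON reports."
    )
    parser.add_argument(
        "--min-runs",
        type=int,
        default=DEFAULT_MIN_RUNS,
        help="Minimum number of historical runs required before analysis.",
    )
    return parser.parse_args(argv)


def main(argv: list[str] | None = None) -> int:
    args = parse_args(argv)
    # ...pass args.min_runs into the detection logic instead of the module constant
    return 0
```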
```python
    try:
        with open(path) as f:
            reports.append(json.load(f))
    except (json.JSONDecodeError, OSError):
        continue
```
Fail hard on unreadable report files.
Skipping malformed JSON or I/O errors here hides partial/corrupt input and can make the hook report "no flaky tests" on incomplete data. Surface the failure instead of continuing silently. As per coding guidelines, "Fail fast, fail loud. No silent fallbacks."
🛠 Suggested fix

```diff
-    except (json.JSONDecodeError, OSError):
-        continue
+    except (json.JSONDecodeError, OSError) as exc:
+        raise RuntimeError(f"Failed to load flaky-test report: {path}") from exc
```

🤖 Prompt for AI Agents
+ raise RuntimeError(f"Failed to load flaky-test report: {path}") from exc🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/check-flaky-tests.py` around lines 31 - 35, The current loop in
scripts/check-flaky-tests.py silently swallows json.JSONDecodeError and OSError
when loading each report (the try around open(path)/json.load), which hides
corrupt or unreadable inputs; change it so failures are surfaced instead of
continued—either remove the broad except and let the exception propagate, or
catch the exception and re-raise a new exception that includes the path context
(e.g., include the path variable) so the caller fails loudly; ensure this change
affects the block that opens the file and appends to reports
(reports.append(json.load(f))).
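Taken together, a fail-loud version of the loading loop could look like this minimal sketch; `load_reports` is a hypothetical helper name, not the script's actual structure:

```python
import json
from pathlib import Path


def load_reports(paths: list[Path]) -> list[dict]:
    reports = []
    for path in paths:
        try:
            with open(path) as f:
                reports.append(json.load(f))
        except (json.JSONDecodeError, OSError) as exc:
            # Fail fast, fail loud: name the offending file instead of skipping it.
            raise RuntimeError(f"Failed to load flaky-test report: {path}") from exc
    return reports
```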
Summary
Adds comprehensive flaky test detection to the test infrastructure:
Changes
- `pytest-rerunfailures` — Industry-standard plugin for retrying failed tests. Configured with `--reruns 2 --reruns-delay 1` in `pytest.ini`. Failed tests automatically retry before being reported as failures, preventing intermittent failures from blocking CI.
- `flaky` marker — New pytest marker (`@pytest.mark.flaky`) for labeling known-unstable tests. Enables optional quarantine via `-m "not flaky"` in CI.
- `scripts/check-flaky-tests.py` — Analyzes historical JSON test reports (from `pytest-json-report`) to detect outcome-flaky tests and tests with high duration variance.
- Pre-push hook integration — The flaky test check runs on `git push`, surfacing unstable tests before they reach CI.

Why this matters

Intermittent failures stop blocking CI on the first try, and known-unstable tests can be quarantined explicitly via the `flaky` marker.

Exception Handling
scripts/check-flaky-tests.py uses defensive exception handling when loading reports:
Corrupted or unreadable JSON report files are skipped, allowing processing to continue with valid reports. main() also treats a missing report directory or an insufficient report count as non-critical, printing a message and returning exit code 0; it returns 1 only when unstable tests are detected, so failures surface in CI/pre-push hooks.
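A minimal sketch of that exit-code contract, reusing the hypothetical `load_reports` and `find_outcome_flaky` helpers from the sketches above; `REPORT_DIR` and `MIN_RUNS` are assumptions, not the script's actual settings:

```python
import sys
from pathlib import Path

REPORT_DIR = Path(".test-reports")  # hypothetical report location
MIN_RUNS = 3                        # hypothetical threshold


def main() -> int:
    if not REPORT_DIR.is_dir():
        print("No report directory found; skipping flaky-test check.")
        return 0  # non-critical: do not block the push
    reports = load_reports(sorted(REPORT_DIR.glob("*.json")))
    if len(reports) < MIN_RUNS:
        print(f"Only {len(reports)} reports available; need {MIN_RUNS}.")
        return 0  # non-critical: not enough history to analyze
    flaky = find_outcome_flaky(reports, min_runs=MIN_RUNS)
    for nodeid in flaky:
        print(f"FLAKY: {nodeid}")
    return 1 if flaky else 0  # fail only when unstable tests are detected


if __name__ == "__main__":
    sys.exit(main())
```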
A tiny non-functional change in src/tunacode/exceptions.py removed an extra trailing blank line in ToolRetryError.__init__; no behavioral change to exception handling.
Type Safety
State Management, Messages, and Tool Calls
No changes to state management, messages, or tool calls. The only behavioral change is the pytest retry configuration (`--reruns 2 --reruns-delay 1`), which affects CI/test-run behavior by suppressing transient failures until retries are exhausted.

Dead Code
Configuration Changes (summary)
- `pytest.ini` adds `--reruns 2 --reruns-delay 1` and a `flaky` marker for quarantining known-unstable tests (can be excluded with `-m "not flaky"` in CI).
- `pyproject.toml` adds `pytest-rerunfailures>=14` to dev dependencies and updates the deptry ignore list for DEP002 to include pytest-rerunfailures.

Other Notes
Integration and tmux tests have inherently variable timing; they are excluded from duration-variance analysis or marked `flaky` to reduce false positives.