test: add flaky test detection with pytest-rerunfailures and historical report analysis#474

Open
larock22 wants to merge 3 commits into master from feat/flaky-test-detection

Conversation

Collaborator

@larock22 larock22 commented May 1, 2026

Summary

Adds comprehensive flaky test detection to the test infrastructure:

Changes

  1. pytest-rerunfailures — Industry-standard plugin for retrying failed tests. Configured with --reruns 2 --reruns-delay 1 in pytest.ini. Failed tests automatically retry before being reported as failures, preventing intermittent failures from blocking CI.

  2. flaky marker — New pytest marker (@pytest.mark.flaky) for labeling known-unstable tests. Enables optional quarantine via -m "not flaky" in CI.

  3. scripts/check-flaky-tests.py — Analyzes historical JSON test reports (from pytest-json-report) to detect:

    • Outcome-flaky tests: tests that flip between pass/fail across runs
    • Duration-unstable tests: tests with high coefficient of variation in execution time (excludes integration/tmux/flaky-marked tests, since their timing inherently depends on external resources)
  4. Pre-push hook integration — The flaky test check runs on git push, surfacing unstable tests before they reach CI.
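The outcome-flip detection in item 3 can be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes the pytest-json-report layout (a top-level "tests" list whose entries carry "nodeid" and "outcome" keys), and find_outcome_flaky is a hypothetical name.

```python
from collections import defaultdict


def find_outcome_flaky(reports: list[dict]) -> list[str]:
    """Return tests whose outcome flips between pass and fail across runs."""
    outcomes: defaultdict[str, set[str]] = defaultdict(set)
    for report in reports:
        for test in report.get("tests", []):
            # Record every outcome ever observed for this test id.
            outcomes[test["nodeid"]].add(test["outcome"])
    return sorted(
        nodeid
        for nodeid, seen in outcomes.items()
        if "passed" in seen and "failed" in seen
    )


# Two synthetic runs: test_a flips, test_b is stable.
runs = [
    {"tests": [{"nodeid": "test_a", "outcome": "passed"},
               {"nodeid": "test_b", "outcome": "passed"}]},
    {"tests": [{"nodeid": "test_a", "outcome": "failed"},
               {"nodeid": "test_b", "outcome": "passed"}]},
]
print(find_outcome_flaky(runs))  # ['test_a']
```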

Why this matters

  • Prevents flaky tests from silently eroding confidence in the test suite
  • Gives developers immediate feedback when a test becomes unstable
  • Provides a clear quarantine path via the flaky marker
  • Excludes integration/tmux tests from false-positive duration variance alerts

Exception Handling

  • scripts/check-flaky-tests.py uses defensive exception handling when loading reports:

    try:
        with open(path) as f:
            reports.append(json.load(f))
    except (json.JSONDecodeError, OSError):
        continue

    Corrupted or unreadable JSON report files are skipped so processing can continue with the valid reports. The script's main() also guards against a missing report directory and an insufficient report count, printing a message and returning exit code 0 for these non-critical conditions; it returns 1 only when unstable tests are detected, which surfaces failures in CI and pre-push hooks.

  • A tiny non-functional change in src/tunacode/exceptions.py removed an extra trailing blank line in ToolRetryError.__init__; no behavioral change to exception handling.
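As a rough sketch of the exit-code policy described above — MIN_RUNS, the analyze() helper, and the messages below are placeholders for illustration, not the script's real internals:

```python
from pathlib import Path

MIN_RUNS = 3  # placeholder; the real script defines its own threshold


def analyze(reports: list[Path]) -> list[str]:
    """Stand-in for the real outcome/duration analysis."""
    return []


def main(report_dir: Path = Path(".test_reports")) -> int:
    if not report_dir.is_dir():
        print("No report directory found; skipping flaky check.")
        return 0  # non-critical: nothing to analyze yet
    reports = sorted(report_dir.glob("report_*.json"))
    if len(reports) < MIN_RUNS:
        print(f"Only {len(reports)} report(s); need {MIN_RUNS}.")
        return 0  # non-critical: not enough history accumulated
    unstable = analyze(reports)
    if unstable:
        for name in unstable:
            print(f"UNSTABLE: {name}")
        return 1  # only this case fails the hook / CI run
    return 0


print(main(Path("no_reports_here")))  # 0 (directory does not exist)
```

A nonzero exit code is the only signal the pre-push hook and CI need, which is why missing-data conditions return 0 rather than blocking pushes.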

Type Safety

  • The new flaky-check script uses modern built-in generic annotations (e.g., list[dict], dict[str, list[float]], tuple[list[str], list[str]]) and typed containers such as defaultdict and set. These annotations improve readability and static analysis; no type-safety regressions were introduced.
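For illustration, the built-in generics mentioned above (PEP 585, available since Python 3.9) look like the following; the names here are invented for the example, not taken from the script:

```python
from collections import defaultdict

# Per-test call durations accumulated across runs.
durations: defaultdict[str, list[float]] = defaultdict(list)
durations["tests/test_api.py::test_timeout"].append(1.42)


def split_findings(
    flaky: set[str], unstable: set[str]
) -> tuple[list[str], list[str]]:
    """Return sorted (outcome-flaky, duration-unstable) name lists."""
    return sorted(flaky), sorted(unstable)


print(split_findings({"b", "a"}, {"c"}))  # (['a', 'b'], ['c'])
```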

State Management, Messages, and Tool Calls

  • No changes to application SessionState, message passing, or tool-call architectures.
  • Tooling/config changes:
    • pytest-rerunfailures added to dev/test tooling (pyproject.toml), and pytest is configured via pytest.ini to retry failing tests (--reruns 2 --reruns-delay 1), which affects CI/test-run behavior by suppressing transient failures until retries are exhausted.
    • The script prints categorized lists of detected outcome-flaky and duration-unstable tests and returns exit codes to integrate with CI or git hooks.
    • The flaky-check script is integrated into a pre-push Git hook so developers receive local feedback about unstable tests before pushing.

Dead Code

  • No dead code added. The new script is purposeful and exercised via the hook/CI integration. The exceptions.py change removed only an extraneous blank line.

Configuration Changes (summary)

  • pytest.ini: Adds --reruns 2 --reruns-delay 1 and a flaky marker for quarantining known-unstable tests (can be excluded with -m "not flaky" in CI).
  • pyproject.toml: Adds pytest-rerunfailures>=14 to dev dependencies and updates deptry ignore list for DEP002 to include pytest-rerunfailures.
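Based on that summary, the pytest.ini additions would look roughly like this; the surrounding options in the real file are omitted here:

```ini
[pytest]
# Retry failed tests twice, waiting 1s between attempts,
# before reporting them as failures.
addopts = --reruns 2 --reruns-delay 1

markers =
    flaky: known-unstable test; quarantine with -m "not flaky"
```

A known-unstable test is then labeled with @pytest.mark.flaky and deselected in strict CI runs via pytest -m "not flaky".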

Other Notes

  • The duration-unstable detection excludes integration, tmux, and tests already marked flaky to reduce false positives.
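A minimal sketch of that exclusion combined with the coefficient-of-variation check; CV_THRESHOLD and the function name are hypothetical, and the real script's cutoff may differ:

```python
import statistics

SKIP_KEYWORDS = ("integration", "tmux", "flaky")
CV_THRESHOLD = 0.5  # hypothetical cutoff, not the script's real value


def duration_unstable(durations: dict[str, list[float]]) -> list[str]:
    """Flag tests whose run-time coefficient of variation exceeds the cutoff."""
    unstable = []
    for nodeid, times in durations.items():
        if any(word in nodeid for word in SKIP_KEYWORDS):
            continue  # timing depends on external resources; skip
        if len(times) < 2:
            continue  # stdev needs at least two samples
        mean = statistics.mean(times)
        if mean == 0:
            continue
        cv = statistics.stdev(times) / mean  # coefficient of variation
        if cv > CV_THRESHOLD:
            unstable.append(nodeid)
    return sorted(unstable)


print(duration_unstable({
    "tests/test_parse.py::test_fast": [0.10, 0.11, 0.10],
    "tests/test_cache.py::test_warm": [0.10, 1.90, 0.05],
    "tests/integration/test_api.py::test_live": [0.1, 5.0],
}))  # ['tests/test_cache.py::test_warm']
```

Note how the highly variable integration test is never inspected, while the stable unit test falls below the threshold; only the genuinely erratic test is flagged.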

larock22 and others added 2 commits May 1, 2026 18:02
…al report analysis

- Add pytest-rerunfailures dependency for retry-on-failure (reruns=2, delay=1s)
- Add flaky marker to label known-unstable tests for quarantine
- Add scripts/check-flaky-tests.py to detect outcome-flaky and duration-variance tests from historical JSON reports
- Wire flaky test detection into pre-push hooks so flaky tests surface before pushing

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…ction

Integration and tmux tests inherently have variable timing due to
external API calls and tmux sessions. Excluding them from duration
variance analysis prevents false positives from the flaky test check.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

coderabbitai Bot commented May 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 592317f6-ee43-43e8-ab3b-bbc5f4e392cf

📥 Commits

Reviewing files that changed from the base of the PR and between 8a47ef2 and 16f015f.

⛔ Files ignored due to path filters (1)
  • AGENTS.md is excluded by !**/*.md
📒 Files selected for processing (2)
  • pyproject.toml
  • src/tunacode/exceptions.py
💤 Files with no reviewable changes (1)
  • src/tunacode/exceptions.py
📜 Recent review details
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Keep `AGENTS.md` updated with `Last Updated: YYYY-MM-DD` whenever `src/` or `docs/` changes
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Enforce architectural layering: allowed import direction is `ui -> core -> tools`, with shared-layer imports from `utils`, `types`, `configuration`, `constants`, `exceptions`
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Do not add TunaCode-owned message-contract wrappers around tinyagent message types; use tinyagent models directly in memory and keep dict payloads at real boundaries
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Use UV and .venv for Python environment management (Python 3.11+ required)
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Follow Git safety and non-destructive defaults: read `docs/git/practices.md` before Git operations, never delete untracked files/directories unless explicitly requested, and pause to ask if unknown files appear
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: During commit-time check failures, apply only trivial lint-only fixes; do not make architectural/refactor decisions to force a green commit—stop and ask for user instruction
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Prefer small, scoped changes and keep edits minimal and targeted
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Mirror command/test naming and patterns used in nearby code; prefer existing patterns
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Run validation commands (dependency layers, import order, init bloat, gates checks) before handoff when touching architecture, dependencies, or shared packages
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Do not add empty directories or `__init__.py`-only directories
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Preserve existing architecture order rules; do not add new imports across forbidden layers
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: README.md and docs should stay consistent with implementation
🔇 Additional comments (1)
pyproject.toml (1)

58-58: Good alignment for flaky-test tooling dependencies.

Line 58 and Line 352 correctly add pytest-rerunfailures in both dev dependency definitions, and Line 271 correctly whitelists it in deptry (DEP002) for plugin-entrypoint usage. This matches the configured pytest rerun behavior and addresses the prior deptry blocker cleanly.

Also applies to: 271-271, 352-352


📝 Walkthrough

Adds pytest rerun dependency and pytest config for retries and a flaky marker; introduces a new CLI script to analyze JSON test reports for outcome- and duration-based flakiness; removes an extra blank line in an exception file.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Project metadata & tooling: pyproject.toml | Added pytest-rerunfailures>=14 to project.optional-dependencies.dev and dependency-groups.dev; added pytest-rerunfailures to tool.deptry.per_rule_ignores.DEP002. |
| Pytest configuration: pytest.ini | Appended --reruns 2 --reruns-delay 1 to addopts; added a flaky marker in markers. |
| Flaky detection script: scripts/check-flaky-tests.py | New CLI script that parses .test_reports/report_*.json to detect outcome-flaky tests (mixed pass/fail across runs) and duration-flaky tests (high coefficient of variation), skipping tests with keywords integration, tmux, or flaky. |
| Minor cleanup: src/tunacode/exceptions.py | Removed an extra trailing blank line in ToolRetryError.__init__ (no behavioral change). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

codex

🚥 Pre-merge checks | ✅ 7 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | Title exceeds the 72-character limit at 87 characters and does not follow conventional commits format precisely. | Shorten title to ≤72 characters while keeping the conventional commit prefix. Consider: 'test: add flaky test detection and retry configuration'. |
| Nosilentfailures | ⚠️ Warning | load_reports() silently swallows JSONDecodeError and OSError exceptions without logging; main() returns success (0) when preconditions fail, violating CLAUDE.md's fail-loud principle. | Add warning logs in load_reports() for each failed report and raise an exception if no reports load; raise exceptions in main() for unmet preconditions instead of returning 0. |
✅ Passed checks (7 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | PR description provides clear context on changes and rationale but is missing sections from the repository template like Pre-PR Checklist, Type of Change, Testing, and verification checkboxes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00%, which meets the required threshold of 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Noshallowcopymutation | ✅ Passed | PR does not introduce shallow-copy mutations of nested dictionaries; new files contain no .copy() calls and existing code uses copy.deepcopy() correctly. |
| Exceptionpathcleanup | ✅ Passed | PR modifies only test infrastructure and whitespace; no agent/orchestration exception handlers modified. |
| Dependencydirection | ✅ Passed | PR changed files comply with CLAUDE.md Gate 2 dependency rules: the new script has stdlib-only imports, and the modified exceptions.py imports only from the allowed types layer. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Line 58: Add pytest-rerunfailures to deptry's DEP002 allowlist so deptry stops
flagging it as unused; update the tool.deptry.per_rule_ignores.DEP002 setting to
include "pytest-rerunfailures" (the plugin listed in pyproject.toml) so the
plugin entrypoint consumed by pytest is whitelisted and CI no longer fails.

In `@scripts/check-flaky-tests.py`:
- Around line 8-10: The usage string advertises a --min-runs flag but main()
never parses argv and the script uses the hard-coded MIN_RUNS constant; update
main() to parse a --min-runs (or --min_runs) argument (using argparse or
similar), validate it as an int, and use the parsed value instead of the
hard-coded MIN_RUNS when calling the detection logic (or when assigning
MIN_RUNS), and update the usage/help text accordingly; alternatively, if you
prefer to remove the flag, remove --min-runs from the usage and any mention in
help and keep MIN_RUNS as the single source of truth.
- Around line 31-35: The current loop in scripts/check-flaky-tests.py silently
swallows json.JSONDecodeError and OSError when loading each report (the try
around open(path)/json.load), which hides corrupt or unreadable inputs; change
it so failures are surfaced instead of continued—either remove the broad except
and let the exception propagate, or catch the exception and re-raise a new
exception that includes the path context (e.g., include the path variable) so
the caller fails loudly; ensure this change affects the block that opens the
file and appends to reports (reports.append(json.load(f))).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 57911e98-f445-411b-9b7b-8d55fa002e46

📥 Commits

Reviewing files that changed from the base of the PR and between 0e39c2d and 8a47ef2.

⛔ Files ignored due to path filters (2)
  • .pre-commit-config.yaml is excluded by !**/*.yaml
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • pyproject.toml
  • pytest.ini
  • scripts/check-flaky-tests.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (Custom checks)

**/*.py: Enforce dependency direction: ui → core → tools → utils/types. Flag violations where core/ imports from ui/, tools/ imports from ui/, tools/ imports from core/, or types/ imports from anything except stdlib/typing
In agent/orchestration code, verify exception handlers clean up state: flag except (UserAbortError, CancelledError) patterns that exist without corresponding cleanup like _remove_dangling_tool_calls() or state rollback. State mutations (messages.append, session modifications) followed by await/function calls that could raise must be cleaned up in except blocks
Check for shallow copy followed by nested dict mutation: flag .copy() on dicts that contain nested dicts/lists. Use deepcopy() instead or use dictionary spreading syntax to avoid mutating original objects
Enforce error handling principle: 'Fail fast, fail loud. No silent fallbacks.' Flag except: pass, except Exception: pass, empty except blocks, catching broad exceptions without re-raising or logging, and returning None/[]/{} sentinel values instead of raising exceptions for invalid inputs

Files:

  • scripts/check-flaky-tests.py
pyproject.toml

📄 CodeRabbit inference engine (AGENTS.md)

Python requirement: 3.11+ (enforce in pyproject.toml)

Files:

  • pyproject.toml
🪛 GitHub Actions: Lint
pyproject.toml

[error] 1-1: deptry (DEP002): 'pytest-rerunfailures' is defined as a dependency but not used in the codebase. Found 1 dependency issue.

🔇 Additional comments (1)
pytest.ini (1)

28-29: LGTM!

The rerun settings and the new flaky marker are wired up consistently with the new test-flakiness workflow.

Also applies to: 49-49

Comment thread scripts/check-flaky-tests.py
Comment on lines +8 to +10
    Usage:
        uv run python scripts/check-flaky-tests.py [--min-runs N]
    """

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Implement --min-runs or remove it from the usage text.

main() never parses argv, so the documented --min-runs flag is silently ignored and the threshold is hard-coded via MIN_RUNS. That makes the CLI contract misleading and prevents callers from tuning the detection sensitivity.

Also applies to: 95-106


Comment on lines +31 to +35
    try:
        with open(path) as f:
            reports.append(json.load(f))
    except (json.JSONDecodeError, OSError):
        continue

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail hard on unreadable report files.

Skipping malformed JSON or I/O errors here hides partial/corrupt input and can make the hook report "no flaky tests" on incomplete data. Surface the failure instead of continuing silently. As per coding guidelines, "Fail fast, fail loud. No silent fallbacks."

🛠 Suggested fix
    -        except (json.JSONDecodeError, OSError):
    -            continue
    +        except (json.JSONDecodeError, OSError) as exc:
    +            raise RuntimeError(f"Failed to load flaky-test report: {path}") from exc
