test: add flaky test detection with pytest-rerunfailures and historical report analysis#474

Open
larock22 wants to merge 3 commits into master from feat/flaky-test-detection

Conversation

Collaborator

@larock22 larock22 commented May 1, 2026

Summary

Adds comprehensive flaky test detection to the test infrastructure:

Changes

  1. pytest-rerunfailures — Industry-standard plugin for retrying failed tests. Configured with --reruns 2 --reruns-delay 1 in pytest.ini. Failed tests automatically retry before being reported as failures, preventing intermittent failures from blocking CI.

  2. flaky marker — New pytest marker (@pytest.mark.flaky) for labeling known-unstable tests. Enables optional quarantine via -m "not flaky" in CI.

  3. scripts/check-flaky-tests.py — Analyzes historical JSON test reports (from pytest-json-report) to detect:

    • Outcome-flaky tests: tests that flip between pass/fail across runs
    • Duration-unstable tests: tests with high coefficient of variation in execution time (excludes integration/tmux/flaky-marked tests, since their timing inherently depends on external resources)
  4. Pre-push hook integration — The flaky test check runs on git push, surfacing unstable tests before they reach CI.
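The outcome-flip detection in item 3 can be sketched as follows. This is a minimal illustration, not the script's actual code: it assumes the pytest-json-report layout (a top-level "tests" list whose entries carry "nodeid" and "outcome" keys), and find_outcome_flaky is a hypothetical name.

```python
from collections import defaultdict


def find_outcome_flaky(reports: list[dict]) -> list[str]:
    """Return tests whose outcome flips between pass and fail across runs."""
    outcomes: defaultdict[str, set[str]] = defaultdict(set)
    for report in reports:
        for test in report.get("tests", []):
            # Record every outcome ever observed for this test id.
            outcomes[test["nodeid"]].add(test["outcome"])
    return sorted(
        nodeid
        for nodeid, seen in outcomes.items()
        if "passed" in seen and "failed" in seen
    )


# Two synthetic runs: test_a flips, test_b is stable.
runs = [
    {"tests": [{"nodeid": "test_a", "outcome": "passed"},
               {"nodeid": "test_b", "outcome": "passed"}]},
    {"tests": [{"nodeid": "test_a", "outcome": "failed"},
               {"nodeid": "test_b", "outcome": "passed"}]},
]
print(find_outcome_flaky(runs))  # ['test_a']
```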

Why this matters

  • Prevents flaky tests from silently eroding confidence in the test suite
  • Gives developers immediate feedback when a test becomes unstable
  • Provides a clear quarantine path via the flaky marker
  • Excludes integration/tmux tests from false-positive duration variance alerts

Exception Handling

  • scripts/check-flaky-tests.py uses defensive exception handling when loading reports:

    try:
        with open(path) as f:
            reports.append(json.load(f))
    except (json.JSONDecodeError, OSError):
        continue

    Corrupted or unreadable JSON report files are skipped so processing can continue with the valid reports. The script's main() also guards against a missing report directory and an insufficient report count, printing a message and returning exit code 0 for these non-critical conditions; it returns 1 only when unstable tests are detected, which surfaces failures in CI and pre-push hooks.

  • A tiny non-functional change in src/tunacode/exceptions.py removed an extra trailing blank line in ToolRetryError.__init__; no behavioral change to exception handling.
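As a rough sketch of the exit-code policy described above — MIN_RUNS, the analyze() helper, and the messages below are placeholders for illustration, not the script's real internals:

```python
from pathlib import Path

MIN_RUNS = 3  # placeholder; the real script defines its own threshold


def analyze(reports: list[Path]) -> list[str]:
    """Stand-in for the real outcome/duration analysis."""
    return []


def main(report_dir: Path = Path(".test_reports")) -> int:
    if not report_dir.is_dir():
        print("No report directory found; skipping flaky check.")
        return 0  # non-critical: nothing to analyze yet
    reports = sorted(report_dir.glob("report_*.json"))
    if len(reports) < MIN_RUNS:
        print(f"Only {len(reports)} report(s); need {MIN_RUNS}.")
        return 0  # non-critical: not enough history accumulated
    unstable = analyze(reports)
    if unstable:
        for name in unstable:
            print(f"UNSTABLE: {name}")
        return 1  # only this case fails the hook / CI run
    return 0


print(main(Path("no_reports_here")))  # 0 (directory does not exist)
```

A nonzero exit code is the only signal the pre-push hook and CI need, which is why missing-data conditions return 0 rather than blocking pushes.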

Type Safety

  • The new flaky-check script uses modern built-in generic annotations (e.g., list[dict], dict[str, list[float]], tuple[list[str], list[str]]) and typed containers such as defaultdict and set. These annotations improve readability and static analysis; no type-safety regressions were introduced.
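For illustration, the built-in generics mentioned above (PEP 585, available since Python 3.9) look like the following; the names here are invented for the example, not taken from the script:

```python
from collections import defaultdict

# Per-test call durations accumulated across runs.
durations: defaultdict[str, list[float]] = defaultdict(list)
durations["tests/test_api.py::test_timeout"].append(1.42)


def split_findings(
    flaky: set[str], unstable: set[str]
) -> tuple[list[str], list[str]]:
    """Return sorted (outcome-flaky, duration-unstable) name lists."""
    return sorted(flaky), sorted(unstable)


print(split_findings({"b", "a"}, {"c"}))  # (['a', 'b'], ['c'])
```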

State Management, Messages, and Tool Calls

  • No changes to application SessionState, message passing, or tool-call architectures.
  • Tooling/config changes:
    • pytest-rerunfailures added to dev/test tooling (pyproject.toml), and pytest is configured via pytest.ini to retry failing tests (--reruns 2 --reruns-delay 1), which affects CI/test-run behavior by suppressing transient failures until retries are exhausted.
    • The script prints categorized lists of detected outcome-flaky and duration-unstable tests and returns exit codes to integrate with CI or git hooks.
    • The flaky-check script is integrated into a pre-push Git hook so developers receive local feedback about unstable tests before pushing.

Dead Code

  • No dead code added. The new script is purposeful and exercised via the hook/CI integration. The exceptions.py change removed only an extraneous blank line.

Configuration Changes (summary)

  • pytest.ini: Adds --reruns 2 --reruns-delay 1 and a flaky marker for quarantining known-unstable tests (can be excluded with -m "not flaky" in CI).
  • pyproject.toml: Adds pytest-rerunfailures>=14 to dev dependencies and updates deptry ignore list for DEP002 to include pytest-rerunfailures.
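Based on that summary, the pytest.ini additions would look roughly like this; the surrounding options in the real file are omitted here:

```ini
[pytest]
# Retry failed tests twice, waiting 1s between attempts,
# before reporting them as failures.
addopts = --reruns 2 --reruns-delay 1

markers =
    flaky: known-unstable test; quarantine with -m "not flaky"
```

A known-unstable test is then labeled with @pytest.mark.flaky and deselected in strict CI runs via pytest -m "not flaky".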

Other Notes

  • The duration-unstable detection excludes integration, tmux, and tests already marked flaky to reduce false positives.
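A minimal sketch of that exclusion combined with the coefficient-of-variation check; CV_THRESHOLD and the function name are hypothetical, and the real script's cutoff may differ:

```python
import statistics

SKIP_KEYWORDS = ("integration", "tmux", "flaky")
CV_THRESHOLD = 0.5  # hypothetical cutoff, not the script's real value


def duration_unstable(durations: dict[str, list[float]]) -> list[str]:
    """Flag tests whose run-time coefficient of variation exceeds the cutoff."""
    unstable = []
    for nodeid, times in durations.items():
        if any(word in nodeid for word in SKIP_KEYWORDS):
            continue  # timing depends on external resources; skip
        if len(times) < 2:
            continue  # stdev needs at least two samples
        mean = statistics.mean(times)
        if mean == 0:
            continue
        cv = statistics.stdev(times) / mean  # coefficient of variation
        if cv > CV_THRESHOLD:
            unstable.append(nodeid)
    return sorted(unstable)


print(duration_unstable({
    "tests/test_parse.py::test_fast": [0.10, 0.11, 0.10],
    "tests/test_cache.py::test_warm": [0.10, 1.90, 0.05],
    "tests/integration/test_api.py::test_live": [0.1, 5.0],
}))  # ['tests/test_cache.py::test_warm']
```

Note how the highly variable integration test is never inspected, while the stable unit test falls below the threshold; only the genuinely erratic test is flagged.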

larock22 and others added 2 commits May 1, 2026 18:02
…al report analysis

- Add pytest-rerunfailures dependency for retry-on-failure (reruns=2, delay=1s)
- Add flaky marker to label known-unstable tests for quarantine
- Add scripts/check-flaky-tests.py to detect outcome-flaky and duration-variance tests from historical JSON reports
- Wire flaky test detection into pre-push hooks so flaky tests surface before pushing

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…ction

Integration and tmux tests inherently have variable timing due to
external API calls and tmux sessions. Excluding them from duration
variance analysis prevents false positives from the flaky test check.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

coderabbitai Bot commented May 1, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 592317f6-ee43-43e8-ab3b-bbc5f4e392cf

📥 Commits

Reviewing files that changed from the base of the PR and between 8a47ef2 and 16f015f.

⛔ Files ignored due to path filters (1)
  • AGENTS.md is excluded by !**/*.md
📒 Files selected for processing (2)
  • pyproject.toml
  • src/tunacode/exceptions.py
💤 Files with no reviewable changes (1)
  • src/tunacode/exceptions.py
📜 Recent review details
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Keep `AGENTS.md` updated with `Last Updated: YYYY-MM-DD` whenever `src/` or `docs/` changes
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Enforce architectural layering: allowed import direction is `ui -> core -> tools`, with shared-layer imports from `utils`, `types`, `configuration`, `constants`, `exceptions`
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Do not add TunaCode-owned message-contract wrappers around tinyagent message types; use tinyagent models directly in memory and keep dict payloads at real boundaries
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Use UV and .venv for Python environment management (Python 3.11+ required)
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Follow Git safety and non-destructive defaults: read `docs/git/practices.md` before Git operations, never delete untracked files/directories unless explicitly requested, and pause to ask if unknown files appear
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: During commit-time check failures, apply only trivial lint-only fixes; do not make architectural/refactor decisions to force a green commit—stop and ask for user instruction
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Prefer small, scoped changes and keep edits minimal and targeted
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Mirror command/test naming and patterns used in nearby code; prefer existing patterns
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Run validation commands (dependency layers, import order, init bloat, gates checks) before handoff when touching architecture, dependencies, or shared packages
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Do not add empty directories or `__init__.py`-only directories
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: Preserve existing architecture order rules; do not add new imports across forbidden layers
Learnt from: CR
Repo: alchemiststudiosDOTai/tunacode

Timestamp: 2026-05-01T23:11:57.667Z
Learning: README.md and docs should stay consistent with implementation
🔇 Additional comments (1)
pyproject.toml (1)

58-58: Good alignment for flaky-test tooling dependencies.

Line 58 and Line 352 correctly add pytest-rerunfailures in both dev dependency definitions, and Line 271 correctly whitelists it in deptry (DEP002) for plugin-entrypoint usage. This matches the configured pytest rerun behavior and addresses the prior deptry blocker cleanly.

Also applies to: 271-271, 352-352


📝 Walkthrough

Adds pytest rerun dependency and pytest config for retries and a flaky marker; introduces a new CLI script to analyze JSON test reports for outcome- and duration-based flakiness; removes an extra blank line in an exception file.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Project metadata & tooling: pyproject.toml | Added pytest-rerunfailures>=14 to project.optional-dependencies.dev and dependency-groups.dev; added pytest-rerunfailures to tool.deptry.per_rule_ignores.DEP002. |
| Pytest configuration: pytest.ini | Appended --reruns 2 --reruns-delay 1 to addopts; added a flaky marker in markers. |
| Flaky detection script: scripts/check-flaky-tests.py | New CLI script that parses .test_reports/report_*.json to detect outcome-flaky tests (mixed pass/fail across runs) and duration-flaky tests (high coefficient of variation), skipping tests with keywords integration, tmux, or flaky. |
| Minor cleanup: src/tunacode/exceptions.py | Removed an extra trailing blank line in ToolRetryError.__init__ (no behavioral change). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

codex

🚥 Pre-merge checks | ✅ 7 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | Title exceeds the 72-character limit at 87 characters and does not follow conventional commits format precisely. | Shorten title to ≤72 characters while keeping the conventional commit prefix. Consider: 'test: add flaky test detection and retry configuration'. |
| Nosilentfailures | ⚠️ Warning | load_reports() silently swallows JSONDecodeError and OSError exceptions without logging; main() returns success (0) when preconditions fail, violating CLAUDE.md's fail-loud principle. | Add warning logs in load_reports() for each failed report and raise an exception if no reports load; raise exceptions in main() for unmet preconditions instead of returning 0. |
✅ Passed checks (7 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | PR description provides clear context on changes and rationale but is missing sections from the repository template like Pre-PR Checklist, Type of Change, Testing, and verification checkboxes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00%, which meets the required threshold of 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Noshallowcopymutation | ✅ Passed | PR does not introduce shallow-copy mutations of nested dictionaries; new files contain no .copy() calls and existing code uses copy.deepcopy() correctly. |
| Exceptionpathcleanup | ✅ Passed | PR modifies only test infrastructure and whitespace; no agent/orchestration exception handlers modified. |
| Dependencydirection | ✅ Passed | PR changed files comply with CLAUDE.md Gate 2 dependency rules: the new script has stdlib-only imports, and the modified exceptions.py imports only from the allowed types layer. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pyproject.toml`:
- Line 58: Add pytest-rerunfailures to deptry's DEP002 allowlist so deptry stops
flagging it as unused; update the tool.deptry.per_rule_ignores.DEP002 setting to
include "pytest-rerunfailures" (the plugin listed in pyproject.toml) so the
plugin entrypoint consumed by pytest is whitelisted and CI no longer fails.

In `@scripts/check-flaky-tests.py`:
- Around line 8-10: The usage string advertises a --min-runs flag but main()
never parses argv and the script uses the hard-coded MIN_RUNS constant; update
main() to parse a --min-runs (or --min_runs) argument (using argparse or
similar), validate it as an int, and use the parsed value instead of the
hard-coded MIN_RUNS when calling the detection logic (or when assigning
MIN_RUNS), and update the usage/help text accordingly; alternatively, if you
prefer to remove the flag, remove --min-runs from the usage and any mention in
help and keep MIN_RUNS as the single source of truth.
- Around line 31-35: The current loop in scripts/check-flaky-tests.py silently
swallows json.JSONDecodeError and OSError when loading each report (the try
around open(path)/json.load), which hides corrupt or unreadable inputs; change
it so failures are surfaced instead of continued—either remove the broad except
and let the exception propagate, or catch the exception and re-raise a new
exception that includes the path context (e.g., include the path variable) so
the caller fails loudly; ensure this change affects the block that opens the
file and appends to reports (reports.append(json.load(f))).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 57911e98-f445-411b-9b7b-8d55fa002e46

📥 Commits

Reviewing files that changed from the base of the PR and between 0e39c2d and 8a47ef2.

⛔ Files ignored due to path filters (2)
  • .pre-commit-config.yaml is excluded by !**/*.yaml
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • pyproject.toml
  • pytest.ini
  • scripts/check-flaky-tests.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (Custom checks)

**/*.py: Enforce dependency direction: ui → core → tools → utils/types. Flag violations where core/ imports from ui/, tools/ imports from ui/, tools/ imports from core/, or types/ imports from anything except stdlib/typing
In agent/orchestration code, verify exception handlers clean up state: flag except (UserAbortError, CancelledError) patterns that exist without corresponding cleanup like _remove_dangling_tool_calls() or state rollback. State mutations (messages.append, session modifications) followed by await/function calls that could raise must be cleaned up in except blocks
Check for shallow copy followed by nested dict mutation: flag .copy() on dicts that contain nested dicts/lists. Use deepcopy() instead or use dictionary spreading syntax to avoid mutating original objects
Enforce error handling principle: 'Fail fast, fail loud. No silent fallbacks.' Flag except: pass, except Exception: pass, empty except blocks, catching broad exceptions without re-raising or logging, and returning None/[]/{} sentinel values instead of raising exceptions for invalid inputs

Files:

  • scripts/check-flaky-tests.py
pyproject.toml

📄 CodeRabbit inference engine (AGENTS.md)

Python requirement: 3.11+ (enforce in pyproject.toml)

Files:

  • pyproject.toml
🪛 GitHub Actions: Lint
pyproject.toml

[error] 1-1: deptry (DEP002): 'pytest-rerunfailures' is defined as a dependency but not used in the codebase. Found 1 dependency issue.

🔇 Additional comments (1)
pytest.ini (1)

28-29: LGTM!

The rerun settings and the new flaky marker are wired up consistently with the new test-flakiness workflow.

Also applies to: 49-49

Comment thread scripts/check-flaky-tests.py
Comment on lines +8 to +10
    Usage:
        uv run python scripts/check-flaky-tests.py [--min-runs N]
    """

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Implement --min-runs or remove it from the usage text.

main() never parses argv, so the documented --min-runs flag is silently ignored and the threshold is hard-coded via MIN_RUNS. That makes the CLI contract misleading and prevents callers from tuning the detection sensitivity.

Also applies to: 95-106


Comment on lines +31 to +35
    try:
        with open(path) as f:
            reports.append(json.load(f))
    except (json.JSONDecodeError, OSError):
        continue

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail hard on unreadable report files.

Skipping malformed JSON or I/O errors here hides partial/corrupt input and can make the hook report "no flaky tests" on incomplete data. Surface the failure instead of continuing silently. As per coding guidelines, "Fail fast, fail loud. No silent fallbacks."

🛠 Suggested fix
    -        except (json.JSONDecodeError, OSError):
    -            continue
    +        except (json.JSONDecodeError, OSError) as exc:
    +            raise RuntimeError(f"Failed to load flaky-test report: {path}") from exc
