falsegreen

One problem, one tool: the false positive. falsegreen finds Python/pytest tests that pass green without protecting anything — tests that let broken code through because the assertion is empty, always true, never runs, or checks the wrong thing.

A test that tells you a broken program is safe is worse than no test at all. AI coding assistants produce these at scale. The tool catches them before they merge.

The scanner is a zero-dependency AST pass. It validates each test against 34 active false-positive codes — patterns a parser can prove: an assertion that never runs, a check that is empty or always true, a swallowed exception, a mock assertion with a typo, a check stranded in dead code. HIGH findings block the commit; LOW ones warn. A third group (diagnostic and coupling, seven codes) can be enabled per project for informational checks that do not affect the exit code. The semantic layer — intent-based patterns no static tool can see — lives in falsegreen-skill, the LLM companion covering Python and other languages.

The checks are grounded in the rotten-green-test research (Soares 2023; Delplanque et al., ICSE 2019) and cross-walked against the published test-smell catalog. See CREDITS.md.

Install: pip install falsegreen. Pre-commit hook also available (see below). LLM semantic pass: falsegreen-skill.

Why this exists

Coverage tells you which lines ran. It does not tell you whether anything was checked. A suite can report 90% coverage while most of those tests assert nothing real — the green bar is a comfort, not a guarantee.

The danger is not a test that fails. It is a test that passes when it should not.

falsegreen targets that one problem. It is not a style linter and not a coverage tool. It asks one question per test: is there a way for the code to be wrong and this test to stay green? If yes, the test is not protecting what it claims.

This matters more now that a large share of tests come from AI assistants. The common machine-written failure modes are exactly the patterns here: assertion roulette, mocking the unit under test, asserting the value you fed the mock, copying the expected value from the current output. They look thorough. They pass. falsegreen is the second reader that asks whether they actually verify anything.

The methodology

One rule drives everything: a test is only useful if it fails when the code breaks. If you have never seen a test go red, you do not know that it tests anything.

The patterns are organized into five families:

A. The test never checks anything. The assertion is skipped, missing, swallowed, or the test is never collected by the runner.
B. The check is weak or always true. It accepts almost any output, or it is true by construction.
C. The test checks itself, not the program. It mocks the thing under test, asserts the value it configured, or re-implements the production logic as its own answer key.
D. Green depends on outside factors. Time, randomness, test order, a fixed sleep.
E. The test passes but checks the wrong thing. The assertion runs against a real result, yet the expected value contradicts what the code should do, so the test freezes a bug.

Family E is why the project has a semantic layer. It carries the core principle: the expected value must come from an independent oracle, not from the code's current output. The oracle hierarchy, highest first: explicit spec or requirement, documented contract (docstring, types, API), independent human judgment, current code. Code is the lowest-priority oracle. Treating it as the highest is how you rubber-stamp a bug.

The plain-language guide, with a real-world analogy and before/after for each case, is in docs/guide.md. The full detection reference lives in falsegreen-skill.

What it detects

The scanner ships 40 active false-positive codes across the five families, plus CC (commented-out assert). HIGH findings block a commit; LOW ones warn. Cases that require reading production intent (10, 11, 12, 15, 18) need the semantic layer.

#	Case	Why it fools you	Code	Conf
1	`assert` inside an `if`/`for` that may not run	check silently skips	`C1`	LOW
2	Test with no assertion at all	proves only it did not crash	`C2`/`C2b`	HIGH/LOW
3	`assert` inside `try` whose `except` swallows it	failure discarded	`C3`	HIGH
4	Test the runner never collects	vanishes from the count	`C4`/`C4b`	HIGH/LOW
5	Always-true check (`assert True`, non-empty tuple, `or True`)	passes by construction	`C5`	HIGH
6	Weak check (truthiness, `len > 0`, `"x" in str(...)`)	accepts almost anything	`C6`	LOW
7	Compares a value to itself	true by construction	`C7`	HIGH
8	Exact equality on a float	fails on rounding, not bugs	`C8`	LOW
9	`pytest.raises(Exception)` too broad	accepts the wrong error	`C9`	LOW
10	Mocks the unit under test	tests the mock, not the code	semantic	-
11	Asserts the value fed to the mock	an echo, not a result	semantic	-
12	Re-implements the production formula	both agree on the same wrong number	semantic	-
13	Mock assertion misspelled / not called	becomes a no-op	`C13`/`C13b`	HIGH/LOW
14	Golden/snapshot written from the output	records today's bug as correct	`C14`	LOW
15	Passes only if another test ran first	borrowed state	semantic	-
16	Depends on time, randomness, or a sleep	passes or fails by luck	`C16`	LOW
17	`skip` inside a broad `except`	turns red into yellow	`C17`	HIGH
18	Expected value contradicts what code should do	freezes a bug as correct	semantic	-

Six codes cover patterns added after the original eighteen:

Code	Pattern	Why it fools you	Family	Conf
`C18`	Compares `str()`/`repr()`/f-string to literal	checks formatting, not the value	B	LOW
`C19`	`pytest.raises` wraps more than one call	earlier line raises; target never reached	A	LOW
`C20`	`assert` in dead code after `return`/`raise`/`fail()`	never runs	A	HIGH
`C21`	Every `assert` is conditional, none unconditional	false condition passes the whole test	A	LOW
`C22`	`async` test asserts but never awaits the unit	checks an un-awaited coroutine	A	off
`C23`	Opens a real file at a literal hard-coded path	binds test to a layout; often a credential outside the repo (Mystery Guest)	D	LOW

Eleven additional codes covering the most common patterns in real test suites:

Code	Pattern	Conf
`C25`	`@pytest.mark.xfail` without `strict=True` — XPASS silently accepted	LOW
`C27`	`try/except/pass` used instead of `pytest.raises`	HIGH
`C28`	`pytest.raises` binding declared but exception content never inspected	LOW
`C29`	`os.environ` mutated directly in a test — state leaks between tests	LOW
`C30`	`responses.add()` / `httpretty.register_uri()` without activating the interceptor	LOW
`C31`	`capsys.readouterr()` result never asserted	LOW
`C32`	`@pytest.mark.skip` without `reason=`	LOW
`C33`	sklearn / ML metric computed but never asserted	LOW
`C34`	Suboptimal assert form — pytest provides a clearer, more idiomatic alternative	LOW
`C35`	Retry/flaky decorator masks flaky behaviour instead of fixing the root cause	LOW
`C36`	`pytest.fail()` with no reason argument — failure message is empty	LOW
`C37`	Duplicate case in `@pytest.mark.parametrize` — same argument set runs twice	LOW
`CC`	Commented-out assert	LOW

Six more from the consolidated catalog:

Code	Pattern	Conf
`C38`	Two test functions share a name — the later one silently overrides the first	HIGH
`C39`	Test `return`s a comparison instead of asserting it — pytest ignores the value	HIGH
`C42`	`assert` on a generator expression or lambda — the object is always truthy	HIGH
`C43`	`pytest.skip()` after test logic — the checks below it never run	LOW
`C44`	Numeric tautology (`len(x) >= 0`, `abs(x) >= 0`) — always true	HIGH
`C45`	Empty `@pytest.mark.parametrize` list — the test is generated with zero cases	HIGH

C41 (an assertion on an in-place method that returns None, like assert not lst.sort()) is deliberately left to the semantic pass: whether it is trivially green depends on the receiver's type, which the parser cannot see.

How the scanner detects. It parses each test file with Python's ast module and inspects the tree. It never imports or runs the test, so a malicious or broken test cannot execute through it. Detection is structural: an assert whose expression is a constant, both sides of a comparison AST-identical, a pytest.raises argument of Exception, a mock-named receiver with a no-parentheses assert_called_once, a Test* class with __init__, and so on. Precision is the priority for HIGH codes, because they block commits: each one is stress-tested against look-alikes (optional-dependency skips, abstract base test classes, @patch-injected mocks, exact-count len(x) == N) and stays quiet on them.

How the semantic pass detects. Cases 10, 11, 12, 15, and 18 cannot be proven by structure. A parser sees a mock but cannot tell whether it replaced an edge (network, disk, clock) or the thing under test. It sees an arithmetic expression but cannot tell whether the expected value was derived independently or copied from the code. That judgment requires reading the production code against an independent oracle — that is what falsegreen-skill does.

Why two confidence levels. A blocking gate that cries wolf gets disabled. So only near-certain, mechanically unambiguous patterns are HIGH (they block). The rest are LOW (they warn) and are starting points for human or semantic judgment, not verdicts.

How falsegreen is validated

A tool that flags tests for not protecting anything has to show it protects something itself.

The scanner (deterministic). Every rule ships with two tests: one proving it fires on the bad pattern, one proving it stays quiet on a legitimate look-alike. The scanner also runs on its own source on every commit (the self-scan), because the false-positive detector is not allowed to contain one. It is also validated against real-world Python projects — the most recent corpus run covered 40 projects with over 58,000 test functions. That pass surfaced false positives in two rule classes (C7 on deliberate __eq__ tests, C4 on test-named route handlers). Both were fixed, each with regression tests. The HIGH count across all 40 projects after fixes: 0. Each false positive is recorded in the commit history and the CHANGELOG.
The semantic pass (LLM). Validation for the LLM-based semantic layer is tracked in falsegreen-skill, where benchmark corpora for Python and TypeScript are maintained with precision/recall measurements.

Test levels (the pyramid)

falsegreen scans tests at every level of the pyramid. Discovery is level-agnostic - it reads any pytest/unittest file - but a few codes are read in light of the level, so a valid pattern at one level is not flagged at another.

Unit: a function with its boundaries doubled. The oracle is assert (or self.assert*).
Integration (API and database): API tests through requests/httpx or a framework TestClient (FastAPI, Flask, Django), database tests against a real datastore (SQLAlchemy, the Django ORM, testcontainers). These cross the I/O boundary on purpose, so the response or row IS the verification at that level. The weak-check code (C6) relaxes in the web layer, where the presence of a response is a real check.
E2E: Playwright for Python and Selenium. expect(locator).to_be_visible() is the oracle.

A real API or database hit inside a test that claims to be a unit test is itself the smell (mystery guest, resource optimism, state leak), not the level of the test. C23 (real file at a literal path), C29 (os.environ mutated), and C30 (mock interceptor never activated) flag those forms.

Diagnostic and coupling codes (opt-in)

Seven additional codes surface smells that do not create false positives but hurt observability and maintainability. All are off by default. Enable with severity = { CODE = "info" } in config. info findings appear in separate DIAGNOSTIC and COUPLING sections and do not affect the exit code.

Code	Smell	What it flags
`C22`	Async Liar	`async def test_*` that asserts but never `await`s the unit
`D1`	Assertion Roulette	2+ assertions in one test, all without a `msg` argument
`D3`	Duplicate Assert	the same assertion written twice in the same test body
`D4`	Unnamed Parametrize	`@pytest.mark.parametrize` with 3+ cases and no `ids=`
`D5`	Inline Setup Excess	too many setup statements before the first `assert` (threshold configurable)
`D6`	Debug Print	`print()` call in test body
`M2`	Long Test Method	test body exceeds `long_test_threshold` lines (default 50)

The two layers

Layer	What it is	When it runs	Catches
Scanner (this repo)	Zero-dependency AST analysis	CLI, CI, pre-commit	34 active false-positive codes + 7 opt-in diagnostic codes
Semantic pass (falsegreen-skill)	LLM-based analysis, Python and other languages	on demand	bug-freezing patterns no static tool can see (cases 10/11/12/15/18)

The scanner is the fast, deterministic pre-filter. For TypeScript, JavaScript, Java, and other languages, use falsegreen-skill.

Install and use

1. CLI (pip)

pip install falsegreen

No permanent install needed: uvx falsegreen tests/ or pipx run falsegreen tests/ runs the latest release from PyPI without touching your environment.

falsegreen                        # scan the current directory
falsegreen tests/                 # scan a folder or a single file
falsegreen --staged               # only the test files staged in git
falsegreen --format sarif         # text (default) | json | sarif | junit
falsegreen --summary              # one-line "N scanned, M flagged" to stderr
falsegreen --output report.sarif  # write to a file
falsegreen --output .falsegreen/  # write report.<ext> into a directory
falsegreen --config-audit         # audit pytest/coverage config (project-layer PL codes)
falsegreen --disable C6,C2b       # turn specific codes off

--config-audit is a separate mode: instead of scanning test files, it reads the project's pytest and coverage config (pyproject.toml, pytest.ini, tox.ini, setup.cfg) and reports the project-layer ways a suite stays green by configuration: PL2 (filterwarnings does not promote warnings to errors), PL7 (no coverage gate), PL8 (addopts stops the run early with -x/--maxfail). These complement the per-file scan, which cannot see config.

Each finding is reported with its pyramid level (unit / integration / e2e, read from the file's imports) and a one-line fix hint, and the text summary breaks the findings down by level and lists the most common fixes. --output takes a file or a directory: an extension-less or trailing-slash path (e.g. .falsegreen/) receives report.<ext> for the chosen format. Reports are run artifacts; keep the output directory gitignored.

--format sarif emits SARIF 2.1.0 (HIGH → error, LOW → warning) for GitHub code scanning and PR annotations. --format junit emits JUnit XML for CI dashboards.

python -m falsegreen ... is equivalent. Exit codes: 0 clean, 10 low-confidence findings only, 20 at least one high-confidence finding. No third-party runtime dependencies; Python 3.8+.

2. Pre-commit hook

Add to .pre-commit-config.yaml:

  - repo: https://github.com/vinicq/falsegreen
    rev: v0.3.0
    hooks:
      - id: falsegreen

Then pre-commit install. On each commit it scans the staged test files. HIGH findings block the commit. Bypass once with git commit --no-verify, or set FALSEGREEN_BLOCK=0 to make the hook warn-only.

Raw git hook (without the pre-commit framework):

python -m falsegreen.hook_install --repo .      # install
python -m falsegreen.hook_install --uninstall   # remove

3. Semantic pass (multi-language)

For cases that require reading production intent (mocking the unit under test, copying expected from current output, re-implementing the formula), use falsegreen-skill. It covers Python, TypeScript, JavaScript, Java, and other languages.

Configuration

Inline suppression: # falsegreen: ignore silences all codes on a line; # falsegreen: ignore[C8] silences one.
Disable codes globally: --disable C6,C2b.
Environment: FALSEGREEN_BLOCK=0 makes the pre-commit hook warn instead of block.

Project config file

[tool.falsegreen] in pyproject.toml, or a flat .falsegreen.toml at the repo root (.falsegreen.toml wins if both exist):

[tool.falsegreen]
disable = ["C13b"]            # turn these codes off everywhere
exclude = ["tests/legacy/*"]  # skip files matching these globs
long_test_threshold = 30      # line-count limit for M2 (default: 50)
inline_setup_threshold = 3    # stmt limit for D5 (default: 5)

[tool.falsegreen.severity]
C8 = "high"    # promote: now blocks the commit (exit 20)
C6 = "off"     # same as adding C6 to disable
C22 = "low"    # enable: async-never-awaits check
D1 = "info"    # enable Assertion Roulette
D3 = "info"    # enable Duplicate Assert
D4 = "info"    # enable Unnamed Parametrize
D5 = "info"    # enable Inline Setup Excess
D6 = "info"    # enable Debug Print
M2 = "info"    # enable Long Test Method

severity values: high, low, info, or off. info findings appear in DIAGNOSTIC/COUPLING sections and do not affect the exit code. long_test_threshold and inline_setup_threshold are top-level keys in [tool.falsegreen], not inside [severity]. Precedence, highest first: --disable CLI, inline # falsegreen: ignore, config file, built-in default. Point at a specific file with --config PATH.

Baseline (adopt on a legacy repo)

Record the findings you already have, then fail only on new ones:

falsegreen --write-baseline tests/   # writes .falsegreen-baseline.json, exits 0
falsegreen --baseline tests/         # suppresses recorded findings, fails on new

A finding is fingerprinted by relative path, code, detail, and normalized source line — not line number, so prepending code does not re-trigger a baselined finding. Commit .falsegreen-baseline.json and the ratchet only tightens.

How it compares

ruff / flake8-pytest-style — mature, fast lint rules. Overlaps on broad raises (PT011) and assert-in-except (PT017). Run both: falsegreen adds uncollected tests, always-true asserts, self-comparison, mock typos, duplicate parametrize cases, and more.
PyNose / pytest-smell — test-smell catalogs from research. Broader taxonomy, but no commit gate and no oracle-correctness check.
mutmut / cosmic-ray — mutation testing, the most honest measure of whether a green suite fails when the code is wrong. Complementary and heavier. falsegreen is the cheap pre-filter you run on every commit; mutation testing is the deep audit you run on suites that matter.
falsegreen-skill — the LLM companion for the semantic pass and for TypeScript, JavaScript, Java, and other languages.

The defensible gap: a deterministic commit gate that catches the mechanical false-positive patterns with zero runtime dependencies, paired with an LLM semantic layer that catches the oracle-correctness cases no static tool can see.

Project layout

falsegreen/
  src/falsegreen/scanner.py        the deterministic scanner
  src/falsegreen/hook_install.py   raw git-hook installer
  docs/guide.md                    plain-language guide to every case
  tests/test_scanner.py            the scanner's own tests
  .pre-commit-hooks.yaml           pre-commit integration
  pyproject.toml                   packaging

Contributing, security, license

CONTRIBUTING.md — dev setup, how to add a detection rule, the false-positive policy, Conventional Commits.
SECURITY.md — how to report a vulnerability privately.
CODE_OF_CONDUCT.md — Contributor Covenant 2.1.
CREDITS.md — research falsegreen builds on (Soares rotten-green work, PyNose, test-smell catalog, agentic-LLM studies), with author credit.
License: MIT, see LICENSE.

Contributors ✨

Thanks to the people who keep false-green tests out of real suites (emoji key):

_{Vinicius Queiroz}
💻 📖 🤔 🚧 🚇

⚠️

🔬

_{Home Seller}
💻

New contributors are added automatically; the table also recognizes non-code work (docs, ideas, infrastructure, tests, research) via the all-contributors spec.

Name		Name	Last commit message	Last commit date
Latest commit History 101 Commits
.github		.github
docs		docs
src/falsegreen		src/falsegreen
tests		tests
.all-contributorsrc		.all-contributorsrc
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-hooks.yaml		.pre-commit-hooks.yaml
ARCHITECTURE.md		ARCHITECTURE.md
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
CREDITS.md		CREDITS.md
LICENSE		LICENSE
README.md		README.md
RELEASE.md		RELEASE.md
SECURITY.md		SECURITY.md
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

falsegreen

Table of contents

Why this exists

The methodology

What it detects

How falsegreen is validated

Test levels (the pyramid)

Diagnostic and coupling codes (opt-in)

The two layers

Install and use

1. CLI (pip)

2. Pre-commit hook

3. Semantic pass (multi-language)

Configuration

Project config file

Baseline (adopt on a legacy repo)

How it compares

Project layout

Contributing, security, license

Contributors ✨

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

falsegreen

Table of contents

Why this exists

The methodology

What it detects

How falsegreen is validated

Test levels (the pyramid)

Diagnostic and coupling codes (opt-in)

The two layers

Install and use

1. CLI (pip)

2. Pre-commit hook

3. Semantic pass (multi-language)

Configuration

Project config file

Baseline (adopt on a legacy repo)

How it compares

Project layout

Contributing, security, license

Contributors ✨

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages