vinicq/falsegreen

falsegreen

One problem, one tool: the false positive. falsegreen finds Python/pytest tests that pass green without protecting anything - tests that let broken code through because the assertion is empty, always true, never runs, or checks the wrong thing.

A test that tells you a broken program is safe is worse than no test at all. AI coding assistants produce these at scale. The tool catches them before they merge.

The scanner is a zero-dependency AST pass. It validates each test against 47 active false-positive codes, patterns a parser can prove: an assertion that never runs, a check that is empty or always true, a swallowed exception, a mock assertion with a typo, a check stranded in dead code. HIGH findings block the commit; LOW ones warn. A third group (diagnostic and coupling, six codes) can be enabled per project for informational checks that do not affect the exit code. The semantic layer, intent-based patterns no static tool can see, lives in falsegreen-skill, the LLM companion covering Python and other languages.

The checks are grounded in the rotten-green-test research (Soares 2023; Delplanque et al., ICSE 2019) and cross-walked against the published test-smell catalog. See CREDITS.md.

The falsegreen family (install the one for your stack):

| Tool | Stack | Install | Package |
|---|---|---|---|
| falsegreen | Python / pytest | pip install falsegreen | PyPI |
| falsegreen-js | JS / TS | npm i -D falsegreen-js (npx falsegreen-js) | npm |
| robotframework-falsegreen | Robot Framework | pip install robotframework-falsegreen | PyPI |
| falsegreen-skill | semantic LLM pass | npx falsegreen-skill analyze <path> | npm |

Install: pip install falsegreen. Pre-commit hook also available (see below). LLM semantic pass: falsegreen-skill.


Quick guide for first-time users

If you have never run this tool, start here. The five sections below take you from zero to a CI gate. The deeper reference (every code, the methodology, the research) follows after.

What it does

falsegreen reads your pytest tests and finds the ones that pass green without checking anything. A test can call your code, run, and report success while asserting nothing real, so a bug ships and the green bar lies about it. The tool reads the test files only (it never runs them) and points at the spots a parser can prove are empty, always true, unreachable, or self-confirming.

A test it flags, and the fix:

# BAD: runs the code, then asserts a constant. It can never fail.
def test_add():
    result = add(2, 3)
    assert True

# CLEAN: asserts the actual result. Breaks if add() breaks.
def test_add():
    assert add(2, 3) == 5

Install

pip install falsegreen

Needs Python 3.8 or newer. No third-party runtime dependencies. To run it once without installing anything: uvx falsegreen tests/ or pipx run falsegreen tests/.

Quick start

Point it at your test folder:

falsegreen tests/

Run on the test_add example above and you get:

HIGH confidence (almost certainly a false positive)
---------------------------------------------------
  test_demo.py:9  [C5] always-true check (assert True / tuple / or True)
      level: unit   fix: assert the real behaviour, not a constant or tautology

Summary: 1 high, 0 low.
By level: unit:1
Top fixes:
  C5 (1): assert the real behaviour, not a constant or tautology

How to read that finding:

  • test_demo.py:9 - the file and line.
  • [C5] - the code. C5 is "always-true check". The catalog (below) explains every code.
  • level: unit - which level of the test pyramid this file sits at.
  • fix: - the one-line hint. Here: assert the real behaviour, not a constant.

python -m falsegreen tests/ does the same thing if the falsegreen command is not on your PATH.

Common options

falsegreen tests/ --json          # machine-readable JSON instead of text
falsegreen tests/ --format sarif  # text (default) | json | sarif | junit
falsegreen tests/ --disable C6,C8 # turn specific codes off
falsegreen tests/ --summary       # one line "N scanned, M flagged" to stderr

Exit codes wire it into CI: 0 clean, 10 low-confidence findings only, 20 at least one high-confidence finding. Block the build on 20.

GitHub Actions:

name: falsegreen
on: [push, pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.x" }
      - run: pip install falsegreen
      - run: falsegreen tests/   # exit 20 fails the job

What the codes mean

Each finding carries a code (C5, C2b, C7...) and a confidence. HIGH codes are near-certain and block the commit; LOW codes warn and want a human look. The code names the pattern: C2b is "ran but checked nothing", C5 is "always true", C7 is "compares a value to itself". The full list, with a BAD and CLEAN example for each, is in docs/guide.md and the online docs. The What it detects table below is the quick reference.


Usage and configuration reference

The quick guide above gets you running. This section is the complete reference: every install channel, every flag, every output format, every config knob, and the CI recipes. All command output shown here is captured from a real run, not invented.

Install

| Channel | Command | When to use |
|---|---|---|
| pip (global or venv) | pip install falsegreen | the normal install; adds the falsegreen command |
| pipx (isolated) | pipx install falsegreen | keep it off your project's dependency tree |
| pipx run (no install) | pipx run falsegreen tests/ | one-off, latest release from PyPI |
| uvx (no install) | uvx falsegreen tests/ | one-off via uv, no environment touched |
| from source | pip install -e . (in a clone) | hacking on the scanner |

Version floor: Python 3.8 or newer. No third-party runtime dependencies; the scanner is a pure ast pass. Pin a version in CI with pip install falsegreen==0.9.0.

Invocation

falsegreen                          # scan the current directory
falsegreen tests/                   # scan a folder
falsegreen tests/test_login.py      # scan a single file
falsegreen tests/ src/              # scan several paths at once
falsegreen --staged                 # only the test files staged in git (pre-commit)
python -m falsegreen tests/         # module form, identical behaviour, no PATH dependency

There is no stdin mode: pass file or directory paths (or nothing, which scans the cwd). Discovery walks the given paths for test_*.py / *_test.py and unittest-style files; non-test files are ignored.

Output formats

--format text|json|sarif|junit selects the shape (default text). --json is a shorthand for --format json. --output PATH writes to a file instead of stdout; a directory or trailing-slash path (.falsegreen/) receives report.<ext>.

Fixture used for every sample below (test_demo.py):

def add(a, b):
    return a + b


def test_add():
    result = add(2, 3)
    assert True            # C5: always-true, line 7


def test_weak():
    result = add(2, 3)
    assert result          # C6: weak check, line 12

text (default):

HIGH confidence (almost certainly a false positive)
---------------------------------------------------
  test_demo.py:7  [C5] always-true check (assert True / tuple / or True)
      level: unit   fix: assert the real behaviour, not a constant or tautology

LOW confidence (test smell, confirm by hand or with /falsegreen)
----------------------------------------------------------------
  test_demo.py:12  [C6] weak check (only verifies that something came back)  (truthiness of a value, not compared to an expected result)
      level: unit   fix: assert the value, not just that something came back

Summary: 1 high, 1 low.
By level: unit:2
Top fixes:
  C5 (1): assert the real behaviour, not a constant or tautology
  C6 (1): assert the value, not just that something came back

json (--json or --format json): a flat array, one object per finding.

[
  {
    "file": "test_demo.py",
    "line": 7,
    "code": "C5",
    "confidence": "high",
    "title": "always-true check (assert True / tuple / or True)",
    "detail": "",
    "layer": "logic",
    "level": "unit",
    "fix": "assert the real behaviour, not a constant or tautology"
  },
  {
    "file": "test_demo.py",
    "line": 12,
    "code": "C6",
    "confidence": "low",
    "title": "weak check (only verifies that something came back)",
    "detail": "truthiness of a value, not compared to an expected result",
    "layer": "logic",
    "level": "unit",
    "fix": "assert the value, not just that something came back"
  }
]
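
Because the JSON output is a flat array, a CI script can triage it with the standard library alone. The sketch below assumes the report text has the fields shown above; `summarize` is a hypothetical helper written for illustration, not part of falsegreen.

```python
import json
from collections import Counter


def summarize(report_text):
    """Count findings per confidence and list the HIGH locations."""
    findings = json.loads(report_text)
    by_confidence = Counter(item["confidence"] for item in findings)
    high_locations = [
        "{}:{} [{}]".format(item["file"], item["line"], item["code"])
        for item in findings
        if item["confidence"] == "high"
    ]
    return dict(by_confidence), high_locations


# Sample trimmed to the fields the helper reads.
SAMPLE = """[
  {"file": "test_demo.py", "line": 7, "code": "C5", "confidence": "high"},
  {"file": "test_demo.py", "line": 12, "code": "C6", "confidence": "low"}
]"""

counts, high = summarize(SAMPLE)
print(counts, high)
```

In a pipeline the text would come from a file written with --output rather than from an inline string.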

sarif (--format sarif): SARIF 2.1.0 for GitHub code scanning. HIGH maps to error, LOW to warning, info to note; each rule carries a helpUri into the online catalog and tags the judgment family. Abridged:

{
  "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
  "version": "2.1.0",
  "runs": [
    {
      "tool": { "driver": {
        "name": "falsegreen",
        "version": "0.9.0",
        "rules": [
          { "id": "C5", "defaultConfiguration": { "level": "error" },
            "helpUri": "https://vinicq.github.io/falsegreen-docs/catalog/python/#c5",
            "properties": { "tags": ["J2"] } }
        ]
      } },
      "results": [
        { "ruleId": "C5", "level": "error",
          "message": { "text": "always-true check (assert True / tuple / or True)" },
          "properties": { "tags": ["J2", "layer:logic", "level:unit"] },
          "locations": [ { "physicalLocation": {
            "artifactLocation": { "uri": "test_demo.py" },
            "region": { "startLine": 7 } } } ] }
      ]
    }
  ]
}

junit (--format junit): JUnit XML. HIGH becomes a <failure>, lower findings become <skipped>, so a CI test reporter shows them as a failing suite.

<?xml version="1.0" encoding="utf-8"?>
<testsuites name="falsegreen" tests="2" failures="1" skipped="1" errors="0"><testsuite name="falsegreen" tests="2" failures="1" skipped="1" errors="0"><testcase classname="falsegreen.C5" name="C5 test_demo.py:7"><failure message="always-true check (assert True / tuple / or True)">test_demo.py:7</failure></testcase><testcase classname="falsegreen.C6" name="C6 test_demo.py:12"><skipped message="weak check (only verifies that something came back) ..." /></testcase></testsuite></testsuites>
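
If a pipeline needs the blocking findings out of the JUnit report, the standard library's ElementTree is enough. A minimal sketch, assuming the element layout shown above; `blocking_cases` is a hypothetical helper and the messages in the sample are shortened.

```python
import xml.etree.ElementTree as ET


def blocking_cases(junit_bytes):
    """Return the names of test cases that carry a <failure> child."""
    root = ET.fromstring(junit_bytes)
    return [
        case.get("name")
        for case in root.iter("testcase")
        if case.find("failure") is not None
    ]


# Same structure as the captured report, with shortened messages.
SAMPLE = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<testsuites name="falsegreen" tests="2" failures="1" skipped="1" errors="0">'
    b'<testsuite name="falsegreen" tests="2" failures="1" skipped="1" errors="0">'
    b'<testcase classname="falsegreen.C5" name="C5 test_demo.py:7">'
    b'<failure message="always-true check">test_demo.py:7</failure></testcase>'
    b'<testcase classname="falsegreen.C6" name="C6 test_demo.py:12">'
    b'<skipped message="weak check" /></testcase>'
    b'</testsuite></testsuites>'
)

names = blocking_cases(SAMPLE)
print(names)
```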

--summary prints a one-line tally to stderr (independent of --format, so you can keep machine output on stdout):

falsegreen: scanned 1 test file(s), 2 finding(s) [1 high, 1 low]  C5:1 C6:1
  by judgment: J2:1 J4:1

Configuration

Exit codes (the contract CI relies on):

| Code | Meaning |
|---|---|
| 0 | clean, or only info/baselined findings |
| 10 | low-confidence findings only |
| 20 | at least one high-confidence finding |

Block the build on 20. 10 is a warn band you can choose to fail or not.
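
Where the CI system cannot branch on exit codes directly, a thin wrapper can apply the policy. This is a sketch under stated assumptions: `should_fail` and `run_gate` are hypothetical names, and treating any code other than 0, 10, or 20 as a failure is a design choice of this example, not documented scanner behaviour.

```python
import subprocess
import sys

EXIT_CLEAN, EXIT_LOW, EXIT_HIGH = 0, 10, 20


def should_fail(exit_code, fail_on_low=False):
    """Map a falsegreen exit code to a pass/fail decision for the build."""
    if exit_code == EXIT_HIGH:
        return True
    if exit_code == EXIT_LOW:
        return fail_on_low
    # 0 is clean. Anything else is unexpected (crash, bad arguments),
    # so this sketch fails the build rather than passing silently.
    return exit_code != EXIT_CLEAN


def run_gate(paths, fail_on_low=False):
    """Run the scanner in module form and return the build's exit status."""
    result = subprocess.run([sys.executable, "-m", "falsegreen", *paths])
    return 1 if should_fail(result.returncode, fail_on_low) else 0
```

A CI step would then end with something like `sys.exit(run_gate(["tests/"]))`.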

Disable codes: --disable C6,C2b turns codes off for this run. Persist it in config with disable = [...] or severity = { C6 = "off" }.

Inline suppression: a comment on the offending line.

assert x == x  # falsegreen: ignore        # silence every code on this line
assert y == y  # falsegreen: ignore[C7]    # silence only C7

Only the falsegreen: token suppresses; a plain # ignore does not.

Severity and confidence filtering: there is no --severity flag. You tune severity in config, per code. severity values are high, low, info, or off. Promoting a code to high makes it block (exit 20); demoting to info moves it into the DIAGNOSTIC/COUPLING section where it never affects the exit code; off is the same as disabling it.

Config file: [tool.falsegreen] in pyproject.toml, or a flat .falsegreen.toml at the repo root (.falsegreen.toml wins if both exist). Point at an explicit file with --config PATH.

[tool.falsegreen]
disable = ["C13b"]            # turn these codes off everywhere
exclude = ["tests/legacy/*"]  # skip files matching these globs
long_test_threshold = 30      # line limit for M2 (default 50)
inline_setup_threshold = 3    # stmt limit for D5 (default 5)

[tool.falsegreen.severity]
C8 = "high"    # promote: now blocks
C6 = "off"     # disable
C22 = "low"    # enable the async-never-awaits check
D1 = "info"    # enable Assertion Roulette (diagnostic, no exit impact)
M2 = "info"    # enable Long Test Method

Precedence, highest first: --disable CLI, inline # falsegreen: ignore, config file, built-in default. The diagnostic and coupling group (D1, D3-D6, M2, C22) is off by default and is enabled only through config severity = "info". There is no --diagnostics flag on the Python scanner (that flag belongs to the JS and Robot siblings); enable diagnostics here through config.

--config-audit is a separate mode: instead of scanning test files it reads the pytest and coverage config (pyproject.toml, pytest.ini, tox.ini, setup.cfg) and reports the project-layer ways a suite stays green by configuration. Run on a pytest.ini carrying addopts = -x:

LOW confidence (test smell, confirm by hand or with /falsegreen)
----------------------------------------------------------------
  pytest.ini:1  [PL2] filterwarnings does not promote warnings to errors ...
  pytest.ini:1  [PL7] no coverage gate (--cov-fail-under / fail_under) ...
  pytest.ini:1  [PL8] addopts stops the run early (-x / --maxfail / --exitfirst) ...

Summary: 0 high, 3 low.
By level: project:3

The PL codes: PL1 (python -O / PYTHONOPTIMIZE strips every assert), PL2 (warnings not promoted to errors), PL7 (no coverage gate), PL8 (addopts stops the run early). The per-file scan cannot see config, so this mode complements it.
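
PL1 is easy to reproduce without touching interpreter flags, because the built-in compile() takes an optimize argument and optimize=1 corresponds to python -O. The sketch below compiles the same failing assert both ways; `runs_clean` is a hypothetical helper for the demonstration.

```python
SOURCE = "assert 1 + 1 == 3, 'arithmetic is broken'"


def runs_clean(optimize):
    """Execute SOURCE at the given optimization level; True if nothing raised."""
    code = compile(SOURCE, "<demo>", "exec", optimize=optimize)
    try:
        exec(code, {})
    except AssertionError:
        return False
    return True


normal = runs_clean(0)      # asserts are kept, so the check fails
optimized = runs_clean(1)   # asserts are stripped, so the same code "passes"
print(normal, optimized)
```

A suite invoked under -O or PYTHONOPTIMIZE is green for the second reason, which is why the check lives at the project layer.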

--baseline / --write-baseline adopt the scanner on a legacy repo without a wall of red:

falsegreen --write-baseline tests/   # writes .falsegreen-baseline.json, exit 0
falsegreen --baseline tests/         # suppresses recorded findings, fails on new ones

Captured:

falsegreen: wrote 2 fingerprint(s) to .falsegreen-baseline.json

A finding is fingerprinted by relative path, code, detail, and normalized source line, not line number, so adding code above it does not re-trigger a baselined finding. Commit .falsegreen-baseline.json; the ratchet only tightens. Both flags take an optional explicit path.
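
To see why a line-number-free fingerprint survives edits above a finding, consider the sketch below. It is illustrative only: the hash algorithm, the field separator, and the whitespace normalization are assumptions of this example, and the scanner's real fingerprint format is not documented here.

```python
import hashlib


def fingerprint(rel_path, code, detail, source_line):
    """Hash the four fields named in the text; the line number is not an input."""
    # Assumed normalization: collapse all runs of whitespace.
    normalized = " ".join(source_line.split())
    payload = "\x1f".join([rel_path, code, detail, normalized])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


before = fingerprint("tests/test_demo.py", "C5", "", "    assert True")
# The same statement after code was inserted above it and it was re-indented.
after = fingerprint("tests/test_demo.py", "C5", "", "        assert True")
other = fingerprint("tests/test_demo.py", "C7", "", "    assert x == x")
print(before == after, before == other)
```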

CI integration

GitHub Actions (text gate plus SARIF upload to code scanning):

name: falsegreen
on: [push, pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write      # required for the SARIF upload
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.x" }
      - run: pip install falsegreen
      - name: Scan and emit SARIF
        run: falsegreen tests/ --format sarif --output falsegreen.sarif
        continue-on-error: true   # let the upload run even when exit 20
      - uses: github/codeql-action/upload-sarif@v3
        with: { sarif_file: falsegreen.sarif }
      - name: Fail on high-confidence findings
        run: falsegreen tests/    # exit 20 fails the job

Pre-commit hook:

  - repo: https://github.com/vinicq/falsegreen
    rev: v0.9.1          # pin a tag; run `pre-commit autoupdate` to move it
    hooks:
      - id: falsegreen

Then pre-commit install. The hook entry is falsegreen --staged with pass_filenames: false, so it reads the staged test files itself; do not add file arguments. HIGH findings block the commit. Bypass once with git commit --no-verify, or set FALSEGREEN_BLOCK=0 to make the hook warn-only. To run at push time instead, add stages: [pre-push] under the hook.

Raw git hook without the pre-commit framework:

python -m falsegreen.hook_install --repo .      # install
python -m falsegreen.hook_install --uninstall   # remove

Scope: what it does NOT do

It never imports or runs your tests; detection is structural. It does not judge whether an expected value contradicts intended behaviour, whether a mock replaced the unit under test, or whether the test re-implements the production formula. Those are semantic (cases 10/11/12/15/18) and need the falsegreen-skill LLM pass. It is not a style linter or a coverage tool. The full code catalog, with a BAD and CLEAN example per code, is in the online docs and docs/guide.md.



Why this exists

Coverage tells you which lines ran. It does not tell you whether anything was checked. A suite can report 90% coverage while most of those tests assert nothing real - the green bar is a comfort, not a guarantee.

The danger is not a test that fails. It is a test that passes when it should not.

falsegreen targets that one problem. It is not a style linter and not a coverage tool. It asks one question per test: is there a way for the code to be wrong and this test to stay green? If yes, the test is not protecting what it claims.

This matters more now that a large share of tests come from AI assistants. The common machine-written failure modes are exactly the patterns here: assertion roulette, mocking the unit under test, asserting the value you fed the mock, copying the expected value from the current output. They look thorough. They pass. falsegreen is the second reader that asks whether they actually verify anything.


The methodology

One rule drives everything: a test is only useful if it fails when the code breaks. If you have never seen a test go red, you do not know that it tests anything.
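
The rule can be checked by hand: break the code on purpose and see which tests go red. In this sketch `broken_add`, `weak_test`, `real_test`, and `goes_red` are all hypothetical names; it is a hand-rolled version of the idea behind mutation testing, not a feature of the scanner.

```python
def add(a, b):
    return a + b


def broken_add(a, b):
    return a - b        # deliberately wrong


def weak_test(fn):
    result = fn(2, 3)
    assert True         # false green: never looks at result


def real_test(fn):
    assert fn(2, 3) == 5


def goes_red(test, fn):
    """Return True if the test fails against the given implementation."""
    try:
        test(fn)
    except AssertionError:
        return True
    return False


print(goes_red(weak_test, broken_add))   # the weak test stays green
print(goes_red(real_test, broken_add))   # the real test catches the break
```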

The patterns are organized into five families:

  • A. The test never checks anything. The assertion is skipped, missing, swallowed, or the test is never collected by the runner.
  • B. The check is weak or always true. It accepts almost any output, or it is true by construction.
  • C. The test checks itself, not the program. It mocks the thing under test, asserts the value it configured, or re-implements the production logic as its own answer key.
  • D. Green depends on outside factors. Time, randomness, test order, a fixed sleep.
  • E. The test passes but checks the wrong thing. The assertion runs against a real result, yet the expected value contradicts what the code should do, so the test freezes a bug.

Family E is why the project has a semantic layer. It carries the core principle: the expected value must come from an independent oracle, not from the code's current output. The oracle hierarchy, highest first: explicit spec or requirement, documented contract (docstring, types, API), independent human judgment, current code. Code is the lowest-priority oracle. Treating it as the highest is how you rubber-stamp a bug.
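
A small illustration of the oracle principle, using a hypothetical `apply_discount` whose docstring plays the role of the documented contract. Both checks run against a real result; only the one whose expected value comes from the contract can detect the bug.

```python
def apply_discount(price, percent):
    """Return price reduced by percent. Contract: 10% off 200 is 180."""
    return price - percent      # bug: subtracts the raw number


def frozen_bug_check():
    # Expected value copied from the current output (200 - 10 = 190).
    assert apply_discount(200, 10) == 190
    return "green"


def spec_check():
    # Expected value taken from the docstring's contract.
    assert apply_discount(200, 10) == 180
    return "green"


print(frozen_bug_check())   # passes, and freezes the bug as correct
```

`spec_check()` raises AssertionError until the implementation is fixed, which is the behaviour a protecting test should have.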

The plain-language guide, with a real-world analogy and before/after for each case, is in docs/guide.md. The full detection reference lives in falsegreen-skill.


What it detects

The scanner ships 47 active false-positive codes across the five families, plus CC (commented-out assert). HIGH findings block a commit; LOW ones warn. Cases that require reading production intent (10, 11, 12, 15, 18) need the semantic layer.

| # | Case | Why it fools you | Code | Conf |
|---|---|---|---|---|
| 1 | assert inside an if/for that may not run | check silently skips | C1 | LOW |
| 2 | Test with no assertion at all | proves only it did not crash | C2/C2b | HIGH/LOW |
| 3 | assert inside try whose except swallows it | failure discarded | C3 | HIGH |
| 4 | Test the runner never collects | vanishes from the count | C4/C4b | HIGH/LOW |
| 5 | Always-true check (assert True, non-empty tuple, or True) | passes by construction | C5 | HIGH |
| 6 | Weak check (truthiness, len > 0, "x" in str(...)) | accepts almost anything | C6 | LOW |
| 7 | Compares a value to itself | true by construction | C7 | HIGH |
| 8 | Exact equality on a float | fails on rounding, not bugs | C8 | LOW |
| 9 | pytest.raises(Exception) too broad | accepts the wrong error | C9 | LOW |
| 10 | Mocks the unit under test | tests the mock, not the code | semantic | - |
| 11 | Asserts the value fed to the mock | an echo, not a result | semantic | - |
| 12 | Re-implements the production formula | both agree on the same wrong number | semantic | - |
| 13 | Mock assertion misspelled / not called | becomes a no-op | C13/C13b | HIGH/LOW |
| 14 | Golden/snapshot written from the output | records today's bug as correct | C14 | LOW |
| 15 | Passes only if another test ran first | borrowed state | semantic | - |
| 16 | Depends on time, randomness, or a sleep | passes or fails by luck | C16 | LOW |
| 17 | skip inside a broad except | turns red into yellow | C17 | HIGH |
| 18 | Expected value contradicts what code should do | freezes a bug as correct | semantic | - |
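
Two of the rows above are easy to reproduce with the standard library. The sketch shows case 3 (an assert swallowed by a broad except) and the "not called" form of case 13 (a mock assertion written without parentheses); the function names are hypothetical and the checks are plain functions so they can be run directly.

```python
from unittest.mock import Mock


def swallowed_check():
    try:
        assert 1 == 2           # fails, but AssertionError is an Exception
    except Exception:
        pass                    # so the failure is discarded
    return "green"


def noop_mock_check():
    notifier = Mock()
    # Nothing ever calls notifier.send, yet this line cannot fail:
    # it only looks up the method and throws the reference away.
    notifier.send.assert_called_once
    return "green"


def real_mock_check():
    notifier = Mock()
    notifier.send.assert_called_once()   # actually runs the assertion
    return "green"


print(swallowed_check(), noop_mock_check())
```

`real_mock_check()` raises AssertionError, because the mock was never called.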

Six codes cover patterns added after the original eighteen:

| Code | Pattern | Why it fools you | Family | Conf |
|---|---|---|---|---|
| C18 | Compares str()/repr()/f-string to literal | checks formatting, not the value | B | LOW |
| C19 | pytest.raises wraps more than one call | earlier line raises; target never reached | A | LOW |
| C20 | assert in dead code after return/raise/fail() | never runs | A | HIGH |
| C21 | Every assert is conditional, none unconditional | false condition passes the whole test | A | LOW |
| C22 | async test asserts but never awaits the unit | checks an un-awaited coroutine | A | off |
| C23 | Opens a real file at a literal hard-coded path | binds test to a layout; often a credential outside the repo (Mystery Guest) | D | LOW |

Twelve additional codes covering the most common patterns in real test suites, plus CC (commented-out assert):

| Code | Pattern | Conf |
|---|---|---|
| C25 | @pytest.mark.xfail without strict=True - XPASS silently accepted | LOW |
| C27 | try/except/pass used instead of pytest.raises | HIGH |
| C28 | pytest.raises binding declared but exception content never inspected | LOW |
| C29 | os.environ mutated directly in a test - state leaks between tests | LOW |
| C30 | responses.add() / httpretty.register_uri() without activating the interceptor | LOW |
| C31 | capsys.readouterr() result never asserted | LOW |
| C32 | @pytest.mark.skip without reason= | LOW |
| C33 | sklearn / ML metric computed but never asserted | LOW |
| C34 | Suboptimal assert form - pytest provides a clearer, more idiomatic alternative | LOW |
| C35 | Retry/flaky decorator masks flaky behaviour instead of fixing the root cause | LOW |
| C36 | pytest.fail() with no reason argument - failure message is empty | LOW |
| C37 | Duplicate case in @pytest.mark.parametrize - same argument set runs twice | LOW |
| CC | Commented-out assert | LOW |

More from the consolidated catalog:

| Code | Pattern | Conf |
|---|---|---|
| C2c | self.subTest(...) block wraps work but asserts nothing - the subTest analogue of an empty test | LOW |
| C6b | Assertion coupled to positional argument layout - breaks on a benign reorder, not a bug | LOW |
| C6c | Asserts a mock's call_count truthiness - only that it was called, not how many times | LOW |
| C8b | Approximate-equality with no explicit tolerance (assertAlmostEqual, == pytest.approx()) - the default tolerance hides a wrong value | LOW |
| C11a | Self-confirming literal - the expected value is assigned by the test itself | LOW |
| C24 | Module-global mutable state shared across tests - borrowed state, not isolation | LOW |
| C38 | Two test functions share a name - the later one silently overrides the first | HIGH |
| C39 | Test returns a comparison instead of asserting it - pytest ignores the value | HIGH |
| C41 | Assertion on an in-place method that returns None (assert not lst.sort()) - trivially satisfied | LOW |
| C42 | assert on a generator expression or lambda - the object is always truthy | HIGH |
| C43 | pytest.skip() after test logic - the checks below it never run | LOW |
| C44 | Numeric tautology (len(x) >= 0, abs(x) >= 0, a mock's call_count >= 0) - always true | HIGH |
| C45 | Empty @pytest.mark.parametrize list - the test is generated with zero cases | HIGH |
| C48 | Dark patch - the test forces a test-mode flag (os.environ["TESTING"], settings.TESTING) then asserts, exercising the product's test-only branch | LOW |
| C49 | pytest.warns / assertWarns / deprecated_call wraps more than one call - an unrelated earlier line may warn while the target never does (warns sibling of C19) | LOW |
| C50 | caplog / assertLogs output captured but never asserted - the capture has no effect on pass/fail (logging sibling of C31) | LOW |
| C51 | Empty-bodied pytest.raises / warns context - no call inside the block, so the call that should raise is never made | HIGH |
| C52 | Membership self-confirmation - assert x in {x}, the collection is built from the subject so membership is true by construction (membership variant of C7) | LOW |
| C55 | Assertion compares two mock-rooted values - assert m.foo == m.bar, both sides are the test's own doubles not the SUT | LOW |
| C56 | Sync assert of a never-awaited coroutine - the operand calls a local async def with no await, so the check runs on a coroutine object not its value (sync sibling of C22) | LOW |
| C57 | Comparison against an unconfigured Mock attribute - assert x == m.attr on a bare Mock() with no spec=, which auto-creates a fresh truthy Mock (the AST-decidable corner of the deferred C40) | LOW |
| C59 | Bare top-level comparison - result == expected as a statement, the value is computed and discarded so nothing is asserted (loose-statement sibling of C39) | HIGH |
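
Three of these shapes (C42, C41, C44) can be shown in one function. The sketch below feeds them data that a real check would reject; `always_green` and `real_check` are hypothetical names.

```python
def always_green(values):
    assert (v > 0 for v in values)   # C42: a generator object is always truthy
    assert not values.sort()         # C41: sort() returns None, so this is "not None"
    assert len(values) >= 0          # C44: a length is never negative
    return "green"


def real_check(values):
    assert all(v > 0 for v in values)   # consumes the generator and checks each item
    return "green"


print(always_green([-1, -2]))   # green, although no value is positive
```

`real_check([-1, -2])` raises AssertionError on the same input.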

Codes the scanner does not detect

The static layer is close to saturated. A handful of catalog codes are deliberately left out because a per-file AST pass cannot judge them without a high false-positive rate, or because they are not a per-file property at all. They are listed here so the gap is honest, not hidden. The reasoning follows the consolidated catalog.

High false-positive without deeper analysis (left to the semantic pass).

  • C40 (assert on a Mock attribute with no spec, always truthy): without spec or autospec analysis the false-positive rate is high, since the same shape is a valid check on a real object. The concept lives in the skill (Family F7).
  • C46 (real network or database call with no double): legitimate at the integration level, where crossing the boundary is the point. Flagging it per file, without knowing the test's layer, would produce a high false-positive rate. It belongs to the skill and the project layer.
  • C47 (assertion depends on dict or set ordering): most collections are used deterministically, so flagging unordered-vs-sequence comparisons fires far too often. It stays a note in the skill.

Runtime and culture (not a per-file property). The PL series is about how the suite is invoked and configured, not what a single test file contains. PL1, PL2, PL7, and PL8 are covered by --config-audit (python -O/PYTHONOPTIMIZE strips every assert, warnings not promoted to errors, no coverage gate, addopts that stops the run early), read from the pytest config it parses. The rest need execution or pipeline inspection: PL4 (a collection error counted as "0 tests" while CI stays green), and PL3, PL5, PL6 (a coverage pragma in production code, importorskip hiding a broken import, CI running a subset via -k / -m). They are documented, not promised, and sit outside the "test file" target.

Semantic Family E or F7 (mutation testing and the skill). Mocking the unit under test, asserting the value you fed the mock, re-implementing the production formula, borrowing state from another test, an expected value that contradicts the spec: none of these can be proven by structure. C14 (a snapshot generated from the code's own output) is the only codable corner of this family. The honest path for the rest is mutation testing (mutmut, cosmic-ray), which mutates the production code and checks whether any test goes red, plus the LLM semantic pass in falsegreen-skill.

How the scanner detects. It parses each test file with Python's ast module and inspects the tree. It never imports or runs the test, so a malicious or broken test cannot execute through it. Detection is structural: an assert whose expression is a constant, both sides of a comparison AST-identical, a pytest.raises argument of Exception, a mock-named receiver with a no-parentheses assert_called_once, a Test* class with __init__, and so on. Precision is the priority for HIGH codes, because they block commits: each one is stress-tested against look-alikes (optional-dependency skips, abstract base test classes, @patch-injected mocks, exact-count len(x) == N) and stays quiet on them.
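
To make "structural detection" concrete, here is a minimal sketch of two such checks written against Python's ast module. It is not the scanner's code: `find_trivial_asserts` is a hypothetical function, it covers only a truthy constant and a single `==` with identical sides, and it has none of the look-alike handling described above.

```python
import ast


def find_trivial_asserts(source):
    """Return sorted (line, kind) pairs for constant and self-comparing asserts."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assert):
            continue
        test = node.test
        if isinstance(test, ast.Constant) and bool(test.value):
            hits.append((node.lineno, "constant"))
        elif (
            isinstance(test, ast.Compare)
            and len(test.ops) == 1
            and isinstance(test.ops[0], ast.Eq)
            # ast.dump omits line and column info by default, so two
            # structurally identical operands produce the same string.
            and ast.dump(test.left) == ast.dump(test.comparators[0])
        ):
            hits.append((node.lineno, "self-compare"))
    return sorted(hits)


SAMPLE = '''
def test_a():
    assert True

def test_b(x):
    assert x == x

def test_c(x):
    assert x == 1
'''

hits = find_trivial_asserts(SAMPLE)
print(hits)
```

The source is parsed, never imported or executed, which is the property the paragraph above relies on.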

How the semantic pass detects. Cases 10, 11, 12, 15, and 18 cannot be proven by structure. A parser sees a mock but cannot tell whether it replaced an edge (network, disk, clock) or the thing under test. It sees an arithmetic expression but cannot tell whether the expected value was derived independently or copied from the code. That judgment requires reading the production code against an independent oracle - that is what falsegreen-skill does.
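
Case 11 shows why structure is not enough. In the sketch below both checks configure the same mock and both contain a well-formed comparison; only reading the production function reveals that the first one never calls it. `total_with_tax`, its 20% tax rule, and the check names are hypothetical.

```python
from unittest.mock import Mock


def total_with_tax(client, sku):
    price = client.fetch_price(sku)
    return price * 2        # bug: the (assumed) spec says price * 1.2


def echo_check():
    client = Mock()
    client.fetch_price.return_value = 100
    # Asserts the value the test itself configured; total_with_tax never runs.
    assert client.fetch_price("sku-1") == 100
    return "green"


def real_check():
    client = Mock()
    client.fetch_price.return_value = 100
    # The mock replaces an edge (the price service); the unit is exercised.
    assert total_with_tax(client, "sku-1") == 120
    return "green"


print(echo_check())   # green, despite the bug
```

`real_check()` raises AssertionError because the buggy function returns 200.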

Why two confidence levels. A blocking gate that cries wolf gets disabled. So only near-certain, mechanically unambiguous patterns are HIGH (they block). The rest are LOW (they warn) and are starting points for human or semantic judgment, not verdicts.

How falsegreen is validated

A tool that flags tests for not protecting anything has to show it protects something itself.

  • The scanner (deterministic). Every rule ships with two tests: one proving it fires on the bad pattern, one proving it stays quiet on a legitimate look-alike. The scanner also runs on its own source on every commit (the self-scan), because the false-positive detector is not allowed to contain one. It is also validated against real-world Python projects - the most recent corpus run covered 40 projects with over 58,000 test functions. That pass surfaced false positives in two rule classes (C7 on deliberate __eq__ tests, C4 on test-named route handlers). Both were fixed, each with regression tests. The HIGH count across all 40 projects after fixes: 0. Each false positive is recorded in the commit history and the CHANGELOG.
  • The semantic pass (LLM). Validation for the LLM-based semantic layer is tracked in falsegreen-skill, where benchmark corpora for Python and TypeScript are maintained with precision/recall measurements.

Test levels (the pyramid)

falsegreen scans tests at every level of the pyramid. Discovery is level-agnostic - it reads any pytest/unittest file - but a few codes are read in light of the level, so a valid pattern at one level is not flagged at another.

  • Unit: a function with its boundaries doubled. The oracle is assert (or self.assert*).
  • Integration (API and database): API tests through requests/httpx or a framework TestClient (FastAPI, Flask, Django), database tests against a real datastore (SQLAlchemy, the Django ORM, testcontainers). These cross the I/O boundary on purpose, so the response or row IS the verification at that level. The weak-check code (C6) relaxes in the web layer, where the presence of a response is a real check.
  • E2E: Playwright for Python and Selenium. expect(locator).to_be_visible() is the oracle.

A real API or database hit inside a test that claims to be a unit test is itself the smell (mystery guest, resource optimism, state leak), not the level of the test. C23 (real file at a literal path), C29 (os.environ mutated), and C30 (mock interceptor never activated) flag those forms.
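
The C29 form and one way to avoid it, using only the standard library. `APP_MODE` and both function names are hypothetical; in a pytest suite the monkeypatch fixture's setenv method serves the same purpose as patch.dict here.

```python
import os
from unittest.mock import patch


def leaky_setup():
    # Direct mutation: the variable stays set for every test that runs later.
    os.environ["APP_MODE"] = "test"


def isolated_check():
    # patch.dict restores os.environ when the block exits.
    with patch.dict(os.environ, {"APP_MODE": "test"}):
        assert os.environ["APP_MODE"] == "test"
    return "APP_MODE" in os.environ


os.environ.pop("APP_MODE", None)
still_set_after_isolated = isolated_check()
leaky_setup()
still_set_after_leaky = "APP_MODE" in os.environ
os.environ.pop("APP_MODE", None)    # clean up after the demonstration
print(still_set_after_isolated, still_set_after_leaky)
```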

Patterns by test level and scope

The same false-green shape is classified by the level the test runs at: the level is a per-finding axis (J3), read as unit, integration, or E2E. The codes that cluster at each level in Python:

  • Unit: C5/C7 (always-true, self-compare), C2b (calls the code, asserts nothing), C56 (sync assert of a never-awaited coroutine), and the semantic S1/S5 (intent mismatch, tests the framework).
  • Integration: C9b (request oracle off, expected_status=any), C50 (caplog/assertLogs captured but never asserted).
  • E2E: C16 (sleep as synchronization, uncontrolled time).

Full matrix on the docs site: patterns by test level and what we do not flag.

Diagnostic and coupling codes (opt-in)

Seven additional codes surface smells that do not create false positives but hurt observability and maintainability. All are off by default. Enable with severity = { CODE = "info" } in config. info findings appear in separate DIAGNOSTIC and COUPLING sections and do not affect the exit code.

| Code | Smell | What it flags |
|---|---|---|
| C22 | Async Liar | async def test_* that asserts but never awaits the unit |
| D1 | Assertion Roulette | 2+ assertions in one test, all without a msg argument |
| D3 | Duplicate Assert | the same assertion written twice in the same test body |
| D4 | Unnamed Parametrize | @pytest.mark.parametrize with 3+ cases and no ids= |
| D5 | Inline Setup Excess | too many setup statements before the first assert (threshold configurable) |
| D6 | Debug Print | print() call in test body |
| M2 | Long Test Method | test body exceeds long_test_threshold lines (default 50) |

The two layers

| Layer | What it is | When it runs | Catches |
|---|---|---|---|
| Scanner (this repo) | Zero-dependency AST analysis | CLI, CI, pre-commit | 47 active false-positive codes + 6 opt-in diagnostic and coupling codes |
| Semantic pass (falsegreen-skill) | LLM-based analysis, Python and other languages | on demand | bug-freezing patterns no static tool can see (cases 10/11/12/15/18) |

The scanner is the fast, deterministic pre-filter. For TypeScript, JavaScript, Java, and other languages, use falsegreen-skill.


What we don't flag (and why)

Measured against the Open Catalog of Test Smells (517 documented smells), only the false-green slice is in scope. These stay out, on purpose:

  • Brittleness / false-red (a test that breaks without a real bug): sensitive equality, brittle or fragile assertions. The opposite axis; flagging it would punish correct code.
  • Hygiene / maintainability: assertion roulette, magic numbers, long or verbose tests. This is linter territory (ruff), though a few are surfaced here as opt-in diagnostics (D1, M2).
  • Slow, design, naming, duplication, runtime/culture: none are about whether the test protects.

The boundary is deliberate. Where a smell has a statically provable false-green form, that form is a code here: uncontrolled time or randomness is C16, a hard-coded path is C23, shared module state is C24, an assertion that may never run is C21. See CREDITS.md for the full cross-walk against the literature.


Install and use

1. CLI (pip)

pip install falsegreen

No permanent install needed: uvx falsegreen tests/ or pipx run falsegreen tests/ runs the latest release from PyPI without touching your environment.

falsegreen                        # scan the current directory
falsegreen tests/                 # scan a folder or a single file
falsegreen --staged               # only the test files staged in git
falsegreen --format sarif         # text (default) | json | sarif | junit
falsegreen --summary              # one-line "N scanned, M flagged" to stderr
falsegreen --output report.sarif  # write to a file
falsegreen --output .falsegreen/  # write report.<ext> into a directory
falsegreen --config-audit         # audit pytest/coverage config (project-layer PL codes)
falsegreen --disable C6,C2b       # turn specific codes off

--config-audit is a separate mode: instead of scanning test files, it reads the project's pytest and coverage config (pyproject.toml, pytest.ini, tox.ini, setup.cfg) and reports the project-layer ways a suite stays green by configuration: PL1 (python -O / PYTHONOPTIMIZE strips every assert at runtime), PL2 (filterwarnings does not promote warnings to errors), PL7 (no coverage gate), PL8 (addopts stops the run early with -x/--maxfail). These complement the per-file scan, which cannot see config.
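PL1 rests on a property of Python itself: under -O, assert statements are compiled away. The sketch below demonstrates that behavior by running the same failing assert in two fresh interpreters. It is not how falsegreen detects the setting (the audit reads config files); FAILING_ASSERT and exit_code are names made up for this example.

```python
import subprocess
import sys

# A statement that must fail whenever asserts are active.
FAILING_ASSERT = "assert 1 == 2, 'this should fail'"


def exit_code(*interpreter_flags):
    """Run the failing assert in a fresh interpreter and return its exit code."""
    completed = subprocess.run(
        [sys.executable, *interpreter_flags, "-c", FAILING_ASSERT],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


normal = exit_code()         # the assert runs and raises AssertionError
optimized = exit_code("-O")  # the assert is stripped, so nothing fails
```

Assuming PYTHONOPTIMIZE is not already set in the environment, normal is non-zero and optimized is 0. A suite run under -O passes every assert-based check regardless of what the code does.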

Each finding is reported with its pyramid level (unit / integration / e2e, read from the file's imports) and a one-line fix hint. The text summary breaks the findings down by level and lists the most common fixes. --output takes a file or a directory: an extension-less or trailing-slash path (e.g. .falsegreen/) receives report.<ext> for the chosen format. Reports are run artifacts; keep the output directory gitignored.

--format sarif emits SARIF 2.1.0 (HIGH → error, LOW → warning) for GitHub code scanning and PR annotations. --format junit emits JUnit XML for CI dashboards.
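A SARIF report can be post-processed with the standard library alone. The sketch below counts results per level, relying only on the SARIF 2.1.0 layout (runs, then results, then an optional level). The SAMPLE document is hand-written for illustration, its ruleId values are placeholders, and it is not real falsegreen output.

```python
import json
from collections import Counter


def count_levels(sarif):
    """Count results per SARIF level across all runs of a parsed SARIF log."""
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # SARIF treats a result without an explicit level as "warning".
            counts[result.get("level", "warning")] += 1
    return counts


# Hand-written sample, for illustration only.
SAMPLE = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "falsegreen"}},
    "results": [
      {"ruleId": "EXAMPLE-HIGH", "level": "error", "message": {"text": "sample"}},
      {"ruleId": "EXAMPLE-LOW", "level": "warning", "message": {"text": "sample"}}
    ]
  }]
}
""")

levels = count_levels(SAMPLE)
```

With the mapping described above, levels["error"] corresponds to HIGH findings and levels["warning"] to LOW ones. To use it on a real report, load the file with json.load and pass the result to count_levels.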

python -m falsegreen ... is equivalent. Exit codes: 0 clean, 10 low-confidence findings only, 20 at least one high-confidence finding. No third-party runtime dependencies; Python 3.8+.
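The three exit codes make it straightforward to wrap the scanner in a CI step with its own policy. A minimal sketch, assuming the exit codes listed above; describe_exit, should_block, and run_gate are hypothetical helpers and not part of falsegreen.

```python
import subprocess
import sys

# Exit codes as documented above.
EXIT_MEANINGS = {
    0: "clean",
    10: "low-confidence findings only",
    20: "at least one high-confidence finding",
}


def describe_exit(code):
    """Map a falsegreen exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, "unexpected exit code {}".format(code))


def should_block(code, block_on_low=False):
    """Decide whether a CI step should fail for this exit code."""
    if code == 0:
        return False
    if code == 10:
        return block_on_low
    return True  # 20, or anything unexpected, fails the step


def run_gate(path="tests/"):
    """Run the scanner on `path` and return (exit_code, description)."""
    completed = subprocess.run([sys.executable, "-m", "falsegreen", path])
    return completed.returncode, describe_exit(completed.returncode)
```

Treating unknown exit codes as blocking is a design choice of this sketch: a crash or a usage error should not be mistaken for a clean scan.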

2. Pre-commit hook

Add to .pre-commit-config.yaml:

  - repo: https://github.com/vinicq/falsegreen
    rev: v0.9.1
    hooks:
      - id: falsegreen

Then pre-commit install. On each commit it scans the staged test files. HIGH findings block the commit. Bypass once with git commit --no-verify, or set FALSEGREEN_BLOCK=0 to make the hook warn-only.

Keeping the hook current

  • Bump the pinned rev. rev is pinned to a tag (v0.9.1 above), so the hook never changes under you. Run pre-commit autoupdate to rewrite it to the latest release, then commit the updated .pre-commit-config.yaml. Pin a tag, never a branch, so local runs and CI use the same scanner.
  • --staged and pass_filenames. The hook entry is falsegreen --staged and .pre-commit-hooks.yaml sets pass_filenames: false. The hook reads the staged test files itself from git, so it does not need pre-commit to append the matched file list. Leaving pass_filenames at its default (true) would append every matched path on top of --staged, scanning some files twice. Do not re-enable it or add file arguments in your config; let --staged own the file selection.
  • stages override. The hook runs at the pre-commit stage by default. To run it on push instead, set stages: [pre-push] under the hook in your config. A heavier suite can keep the fast HIGH gate at commit time and a fuller pass at push time.

Raw git hook (without the pre-commit framework):

python -m falsegreen.hook_install --repo .      # install
python -m falsegreen.hook_install --uninstall   # remove

3. Semantic pass (multi-language)

For cases that require reading production intent (mocking the unit under test, copying expected from current output, re-implementing the formula), use falsegreen-skill. It covers Python, TypeScript, JavaScript, Java, and other languages.


Configuration

  • Inline suppression: # falsegreen: ignore silences all codes on a line; # falsegreen: ignore[C8] silences one.
  • Disable codes globally: --disable C6,C2b.
  • Environment: FALSEGREEN_BLOCK=0 makes the pre-commit hook warn instead of block.

Project config file

[tool.falsegreen] in pyproject.toml, or a flat .falsegreen.toml at the repo root (.falsegreen.toml wins if both exist):

[tool.falsegreen]
disable = ["C13b"]            # turn these codes off everywhere
exclude = ["tests/legacy/*"]  # skip files matching these globs
long_test_threshold = 30      # line-count limit for M2 (default: 50)
inline_setup_threshold = 3    # stmt limit for D5 (default: 5)

[tool.falsegreen.severity]
C8 = "high"    # promote: now blocks the commit (exit 20)
C6 = "off"     # same as adding C6 to disable
C22 = "low"    # enable: async-never-awaits check
D1 = "info"    # enable Assertion Roulette
D3 = "info"    # enable Duplicate Assert
D4 = "info"    # enable Unnamed Parametrize
D5 = "info"    # enable Inline Setup Excess
D6 = "info"    # enable Debug Print
M2 = "info"    # enable Long Test Method

severity values: high, low, info, or off. info findings appear in DIAGNOSTIC/COUPLING sections and do not affect the exit code. long_test_threshold and inline_setup_threshold are top-level keys in [tool.falsegreen], not inside [severity]. Precedence, highest first: --disable CLI, inline # falsegreen: ignore, config file, built-in default. Point at a specific file with --config PATH.
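The precedence order can be read as a chain of lookups that stops at the first source with an opinion. The sketch below illustrates that order only; it is not the scanner's implementation. effective_severity, ALL_CODES, and the argument shapes are assumptions, as is checking disable before the severity table within the config file.

```python
ALL_CODES = "*"  # stands for a bare "# falsegreen: ignore" comment


def effective_severity(code, cli_disabled, inline_ignored, config, default):
    """Resolve one code's severity on one line, highest-precedence source first."""
    # 1. --disable on the command line.
    if code in cli_disabled:
        return "off"
    # 2. Inline "# falsegreen: ignore" or "# falsegreen: ignore[CODE]".
    if inline_ignored == ALL_CODES or code in (inline_ignored or ()):
        return "off"
    # 3. Config file: the disable list, then the severity table.
    if code in config.get("disable", ()):
        return "off"
    severity = config.get("severity", {}).get(code)
    if severity is not None:
        return severity
    # 4. Built-in default for this code.
    return default


config = {"disable": ["C13b"], "severity": {"C8": "high", "D1": "info"}}
```

For example, effective_severity("C8", set(), None, config, "low") returns "high" because the config promotes it, while passing {"C8"} as cli_disabled returns "off" because the CLI outranks the config.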

Baseline (adopt on a legacy repo)

Record the findings you already have, then fail only on new ones:

falsegreen --write-baseline tests/   # writes .falsegreen-baseline.json, exits 0
falsegreen --baseline tests/         # suppresses recorded findings, fails on new

A finding is fingerprinted by relative path, code, detail, and normalized source line - not line number, so prepending code does not re-trigger a baselined finding. Commit .falsegreen-baseline.json and the ratchet only tightens.
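Why the line number is left out is easiest to see in code. The sketch below hashes the four fields named above. The hashing scheme, the whitespace-collapsing normalize step, and the dictionary keys are assumptions made for illustration; the real baseline format may differ.

```python
import hashlib


def normalize(source_line):
    """Collapse whitespace so indentation changes do not alter the fingerprint."""
    return " ".join(source_line.split())


def fingerprint(finding):
    """Hash path, code, detail, and normalized source; the line number is ignored."""
    fields = [
        finding["path"],
        finding["code"],
        finding["detail"],
        normalize(finding["source"]),
    ]
    payload = "\x1f".join(fields).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical finding; the detail text is a placeholder.
original = {
    "path": "tests/test_cart.py",
    "code": "C5",
    "detail": "illustrative detail text",
    "source": "    assert True",
    "line": 12,
}
shifted = dict(original, line=40)                   # code was added above it
changed = dict(original, source="    assert total")  # a different assertion
```

fingerprint(original) equals fingerprint(shifted), so moving a baselined finding down the file does not resurface it, while fingerprint(changed) differs and is reported as new.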


How it compares

  • ruff / flake8-pytest-style - mature, fast lint rules. Overlaps on broad raises (PT011) and assert-in-except (PT017). Run both: falsegreen adds uncollected tests, always-true asserts, self-comparison, mock typos, duplicate parametrize cases, and more.
  • PyNose / pytest-smell - test-smell catalogs from research. Broader taxonomy, but no commit gate and no oracle-correctness check.
  • mutmut / cosmic-ray - mutation testing, the most honest measure of whether a green suite fails when the code is wrong. Complementary and heavier. falsegreen is the cheap pre-filter you run on every commit; mutation testing is the deep audit you run on suites that matter.
  • falsegreen-skill - the LLM companion for the semantic pass and for TypeScript, JavaScript, Java, and other languages.

The defensible gap: a deterministic commit gate that catches the mechanical false-positive patterns with zero runtime dependencies, paired with an LLM semantic layer that catches the oracle-correctness cases no static tool can see.


Project layout

falsegreen/
  src/falsegreen/scanner.py        the deterministic scanner
  src/falsegreen/hook_install.py   raw git-hook installer
  docs/guide.md                    plain-language guide to every case
  examples/python/                 a BAD + CLEAN sample for every detected code
  tests/test_scanner.py            the scanner's own tests
  .pre-commit-hooks.yaml           pre-commit integration
  pyproject.toml                   packaging

Contributing, security, license

  • CONTRIBUTING.md - dev setup, how to add a detection rule, the false-positive policy, Conventional Commits.
  • SECURITY.md - how to report a vulnerability privately.
  • CODE_OF_CONDUCT.md - Contributor Covenant 2.1.
  • CREDITS.md - research falsegreen builds on (Soares rotten-green work, PyNose, test-smell catalog, agentic-LLM studies), with author credit.
  • License: MIT, see LICENSE.

Contributors ✨

Thanks to the people who keep false-green tests out of real suites (emoji key):

Vinicius Queiroz: 💻 📖 🤔 🚧 🚇 ⚠️ 🔬
Home Seller: 💻 📖 🚇 ⚠️

New contributors are added automatically; the table also recognizes non-code work (docs, ideas, infrastructure, tests, research) via the all-contributors spec.

About

Find false-green tests: tests that pass without verifying anything. Deterministic Python/pytest AST scanner (C1-C59 + PL config-audit), zero-dep, CI-ready.
