One problem, one tool: the false positive. falsegreen finds Python/pytest tests that pass green without protecting anything - tests that let broken code through because the assertion is empty, always true, never runs, or checks the wrong thing.
A test that tells you a broken program is safe is worse than no test at all. AI coding assistants produce these at scale. The tool catches them before they merge.
The scanner is a zero-dependency AST pass. It validates each test against 47 active false-positive codes, patterns a parser can prove: an assertion that never runs, a check that is empty or always true, a swallowed exception, a mock assertion with a typo, a check stranded in dead code. HIGH findings block the commit; LOW ones warn. A third group (diagnostic and coupling, six codes) can be enabled per project for informational checks that do not affect the exit code. The semantic layer, intent-based patterns no static tool can see, lives in falsegreen-skill, the LLM companion covering Python and other languages.
The checks are grounded in the rotten-green-test research (Soares 2023; Delplanque et al., ICSE 2019) and cross-walked against the published test-smell catalog. See CREDITS.md.
The falsegreen family (install the one for your stack):
| Tool | Stack | Install | Package |
|---|---|---|---|
| falsegreen | Python / pytest | pip install falsegreen | PyPI |
| falsegreen-js | JS / TS | npm i -D falsegreen-js (npx falsegreen-js) | npm |
| robotframework-falsegreen | Robot Framework | pip install robotframework-falsegreen | PyPI |
| falsegreen-skill | semantic LLM pass | npx falsegreen-skill analyze <path> | npm |
Install:
pip install falsegreen. Pre-commit hook also available (see below). LLM semantic pass: falsegreen-skill.
If you have never run this tool, start here. The five sections below take you from zero to a CI gate. The deeper reference (every code, the methodology, the research) follows after.
falsegreen reads your pytest tests and finds the ones that pass green without checking anything. A test can call your code, run, and report success while asserting nothing real, so a bug ships and the green bar lies about it. The tool reads the test files only (it never runs them) and points at the spots a parser can prove are empty, always true, unreachable, or self-confirming.
A test it flags, and the fix:
# BAD: runs the code, then asserts a constant. It can never fail.
def test_add():
    result = add(2, 3)
    assert True

# CLEAN: asserts the actual result. Breaks if add() breaks.
def test_add():
    assert add(2, 3) == 5

pip install falsegreen

Needs Python 3.8 or newer. No third-party runtime dependencies. To run it once without installing anything: uvx falsegreen tests/ or pipx run falsegreen tests/.
Point it at your test folder:
falsegreen tests/

Run on the test_add example above and you get:
HIGH confidence (almost certainly a false positive)
---------------------------------------------------
test_demo.py:9 [C5] always-true check (assert True / tuple / or True)
level: unit fix: assert the real behaviour, not a constant or tautology
Summary: 1 high, 0 low.
By level: unit:1
Top fixes:
C5 (1): assert the real behaviour, not a constant or tautology
How to read that finding:
- test_demo.py:9 - the file and line.
- [C5] - the code. C5 is "always-true check". The catalog (below) explains every code.
- level: unit - which level of the test pyramid this file sits at.
- fix: - the one-line hint. Here: assert the real behaviour, not a constant.
python -m falsegreen tests/ does the same thing if the falsegreen command is not on your PATH.
falsegreen tests/ --json # machine-readable JSON instead of text
falsegreen tests/ --format sarif # text (default) | json | sarif | junit
falsegreen tests/ --disable C6,C8 # turn specific codes off
falsegreen tests/ --summary # one line "N scanned, M flagged" to stderr

Exit codes wire it into CI: 0 clean, 10 low-confidence findings only, 20 at least one high-confidence finding. Block the build on 20.
GitHub Actions:
name: falsegreen
on: [push, pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.x" }
      - run: pip install falsegreen
      - run: falsegreen tests/ # exit 20 fails the job (so does exit 10: any non-zero exit fails a run step)

Each finding carries a code (C5, C2b, C7...) and a confidence. HIGH codes are near-certain and block the commit; LOW codes warn and want a human look. The code names the pattern: C2b is "ran but checked nothing", C5 is "always true", C7 is "compares a value to itself". The full list, with a BAD and CLEAN example for each, is in docs/guide.md and the online docs. The What it detects table below is the quick reference.
The quick guide above gets you running. This section is the complete reference: every install channel, every flag, every output format, every config knob, and the CI recipes. All command output shown here is captured from a real run, not invented.
| Channel | Command | When to use |
|---|---|---|
| pip (global or venv) | pip install falsegreen | the normal install; adds the falsegreen command |
| pipx (isolated) | pipx install falsegreen | keep it off your project's dependency tree |
| pipx run (no install) | pipx run falsegreen tests/ | one-off, latest release from PyPI |
| uvx (no install) | uvx falsegreen tests/ | one-off via uv, no environment touched |
| from source | pip install -e . in a clone | hacking on the scanner |
Version floor: Python 3.8 or newer. No third-party runtime dependencies: the scanner is a pure ast pass. Pin a version in CI with pip install falsegreen==0.9.0.
falsegreen # scan the current directory
falsegreen tests/ # scan a folder
falsegreen tests/test_login.py # scan a single file
falsegreen tests/ src/ # scan several paths at once
falsegreen --staged # only the test files staged in git (pre-commit)
python -m falsegreen tests/ # module form, identical behaviour, no PATH dependency

There is no stdin mode: pass file or directory paths (or nothing, which scans the cwd). Discovery walks the given paths for test_*.py / *_test.py and unittest-style files; non-test files are ignored.
--format text|json|sarif|junit selects the shape (default text). --json is a shorthand for --format json. --output PATH writes to a file instead of stdout; a directory or trailing-slash path (.falsegreen/) receives report.<ext>.
Fixture used for every sample below (test_demo.py):
def add(a, b):
    return a + b


def test_add():
    result = add(2, 3)
    assert True  # C5: always-true, line 7


def test_weak():
    result = add(2, 3)
    assert result  # C6: weak check, line 12

text (default):
HIGH confidence (almost certainly a false positive)
---------------------------------------------------
test_demo.py:7 [C5] always-true check (assert True / tuple / or True)
level: unit fix: assert the real behaviour, not a constant or tautology
LOW confidence (test smell, confirm by hand or with /falsegreen)
----------------------------------------------------------------
test_demo.py:12 [C6] weak check (only verifies that something came back) (truthiness of a value, not compared to an expected result)
level: unit fix: assert the value, not just that something came back
Summary: 1 high, 1 low.
By level: unit:2
Top fixes:
C5 (1): assert the real behaviour, not a constant or tautology
C6 (1): assert the value, not just that something came back
json (--json or --format json): a flat array, one object per finding.
[
  {
    "file": "test_demo.py",
    "line": 7,
    "code": "C5",
    "confidence": "high",
    "title": "always-true check (assert True / tuple / or True)",
    "detail": "",
    "layer": "logic",
    "level": "unit",
    "fix": "assert the real behaviour, not a constant or tautology"
  },
  {
    "file": "test_demo.py",
    "line": 12,
    "code": "C6",
    "confidence": "low",
    "title": "weak check (only verifies that something came back)",
    "detail": "truthiness of a value, not compared to an expected result",
    "layer": "logic",
    "level": "unit",
    "fix": "assert the value, not just that something came back"
  }
]

sarif (--format sarif): SARIF 2.1.0 for GitHub code scanning. HIGH maps to error, LOW to warning, info to note; each rule carries a helpUri into the online catalog and tags the judgment family. Abridged:
{
  "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
  "version": "2.1.0",
  "runs": [
    {
      "tool": { "driver": {
        "name": "falsegreen",
        "version": "0.9.0",
        "rules": [
          { "id": "C5", "defaultConfiguration": { "level": "error" },
            "helpUri": "https://vinicq.github.io/falsegreen-docs/catalog/python/#c5",
            "properties": { "tags": ["J2"] } }
        ]
      } },
      "results": [
        { "ruleId": "C5", "level": "error",
          "message": { "text": "always-true check (assert True / tuple / or True)" },
          "properties": { "tags": ["J2", "layer:logic", "level:unit"] },
          "locations": [ { "physicalLocation": {
            "artifactLocation": { "uri": "test_demo.py" },
            "region": { "startLine": 7 } } } ] }
      ]
    }
  ]
}

junit (--format junit): JUnit XML. HIGH becomes a <failure>, lower findings become <skipped>, so a CI test reporter shows them as a failing suite.
<?xml version="1.0" encoding="utf-8"?>
<testsuites name="falsegreen" tests="2" failures="1" skipped="1" errors="0"><testsuite name="falsegreen" tests="2" failures="1" skipped="1" errors="0"><testcase classname="falsegreen.C5" name="C5 test_demo.py:7"><failure message="always-true check (assert True / tuple / or True)">test_demo.py:7</failure></testcase><testcase classname="falsegreen.C6" name="C6 test_demo.py:12"><skipped message="weak check (only verifies that something came back) ..." /></testcase></testsuite></testsuites>

--summary prints a one-line tally to stderr (independent of --format, so you can keep machine output on stdout):
falsegreen: scanned 1 test file(s), 2 finding(s) [1 high, 1 low] C5:1 C6:1
by judgment: J2:1 J4:1
Exit codes (the contract CI relies on):
| Code | Meaning |
|---|---|
| 0 | clean, or only info/baselined findings |
| 10 | low-confidence findings only |
| 20 | at least one high-confidence finding |
Block the build on 20. 10 is a warn band you can choose to fail or not.
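The exit-code contract can be sketched in a few lines of Python, working from the JSON report shape shown earlier. This is only an illustration of the documented mapping, not the scanner's own code: the helper names `expected_exit_code` and `should_fail_build` are hypothetical, and treating any confidence other than "high" or "low" as non-blocking is an assumption based on the table above.

```python
import json

# Exit codes as documented in the table above.
EXIT_CLEAN, EXIT_LOW, EXIT_HIGH = 0, 10, 20


def expected_exit_code(findings):
    """Map a list of finding dicts (JSON report shape) to an exit code."""
    confidences = {f.get("confidence") for f in findings}
    if "high" in confidences:
        return EXIT_HIGH
    if "low" in confidences:
        return EXIT_LOW
    return EXIT_CLEAN  # clean, or only info-level findings (assumed)


def should_fail_build(exit_code, fail_on_low=False):
    """Hypothetical CI policy: always fail on 20, optionally fail on 10."""
    if exit_code == EXIT_HIGH:
        return True
    return fail_on_low and exit_code == EXIT_LOW


# A trimmed copy of the two findings from the sample report.
report = json.loads("""
[
  {"file": "test_demo.py", "line": 7, "code": "C5", "confidence": "high"},
  {"file": "test_demo.py", "line": 12, "code": "C6", "confidence": "low"}
]
""")

code = expected_exit_code(report)
print(code, should_fail_build(code))
```

A wrapper like this is one way to treat 10 as a warning while still failing on 20, since a plain shell step fails on any non-zero exit.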
Disable codes: --disable C6,C2b turns codes off for this run. Persist it in config with disable = [...] or severity = { C6 = "off" }.
Inline suppression: a comment on the offending line.
assert x == x # falsegreen: ignore # silence every code on this line
assert y == y # falsegreen: ignore[C7] # silence only C7

Only the falsegreen: token suppresses; a plain # ignore does not.
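To make the suppression rule concrete, here is a minimal sketch of how such a comment could be recognized. The regular expression and the `suppressed_codes` helper are assumptions for illustration; the scanner's real parsing is not shown in this document, and a production version would likely tokenize the line so that text inside string literals is not matched. Support for a comma-separated list inside the brackets is also an assumption.

```python
import re

# Assumed grammar: "# falsegreen: ignore" or "# falsegreen: ignore[C7]".
_SUPPRESS = re.compile(r"#\s*falsegreen:\s*ignore(?:\[([A-Za-z0-9_,\s]+)\])?")


def suppressed_codes(line):
    """Return None (no suppression), "all", or a set of code names."""
    match = _SUPPRESS.search(line)
    if match is None:
        return None
    if match.group(1) is None:
        return "all"
    return {c.strip() for c in match.group(1).split(",") if c.strip()}


print(suppressed_codes("assert x == x  # falsegreen: ignore"))
print(suppressed_codes("assert y == y  # falsegreen: ignore[C7]"))
print(suppressed_codes("assert z == z  # ignore"))
```

The third call returns None, which mirrors the rule that a plain # ignore comment does not suppress anything.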
Severity and confidence filtering: there is no --severity flag. You tune severity in config, per code. severity values are high, low, info, or off. Promoting a code to high makes it block (exit 20); demoting to info moves it into the DIAGNOSTIC/COUPLING section where it never affects the exit code; off is the same as disabling it.
Config file: [tool.falsegreen] in pyproject.toml, or a flat .falsegreen.toml at the repo root (.falsegreen.toml wins if both exist). Point at an explicit file with --config PATH.
[tool.falsegreen]
disable = ["C13b"] # turn these codes off everywhere
exclude = ["tests/legacy/*"] # skip files matching these globs
long_test_threshold = 30 # line limit for M2 (default 50)
inline_setup_threshold = 3 # stmt limit for D5 (default 5)
[tool.falsegreen.severity]
C8 = "high" # promote: now blocks
C6 = "off" # disable
C22 = "low" # enable the async-never-awaits check
D1 = "info" # enable Assertion Roulette (diagnostic, no exit impact)
M2 = "info" # enable Long Test Method

Precedence, highest first: --disable CLI, inline # falsegreen: ignore, config file, built-in default. The diagnostic and coupling group (D1, D3-D6, M2, C22) is off by default and is enabled only through config severity = "info". There is no --diagnostics flag on the Python scanner (that flag belongs to the JS and Robot siblings); enable diagnostics here through config.
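The precedence order can be read as a simple fall-through. The sketch below models it with a hypothetical `effective_severity` function; it simplifies the config to a single code-to-severity mapping and treats an inline ignore as "off" for that one line, both of which are assumptions rather than the scanner's actual data model.

```python
def effective_severity(code, default, config=None, inline_ignored=False,
                       cli_disabled=()):
    """Resolve a code's severity using the documented precedence order."""
    if code in cli_disabled:        # 1. --disable on the command line
        return "off"
    if inline_ignored:              # 2. inline "# falsegreen: ignore"
        return "off"
    if config and code in config:   # 3. severity set in the config file
        return config[code]
    return default                  # 4. built-in default


config = {"C8": "high", "C6": "off", "D1": "info"}

print(effective_severity("C8", "low", config))                       # config promotes C8
print(effective_severity("C8", "low", config, cli_disabled={"C8"}))  # CLI wins
print(effective_severity("C5", "high", config))                      # default applies
```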
--config-audit is a separate mode: instead of scanning test files it reads the pytest and coverage config (pyproject.toml, pytest.ini, tox.ini, setup.cfg) and reports the project-layer ways a suite stays green by configuration. Run on a pytest.ini carrying addopts = -x:
LOW confidence (test smell, confirm by hand or with /falsegreen)
----------------------------------------------------------------
pytest.ini:1 [PL2] filterwarnings does not promote warnings to errors ...
pytest.ini:1 [PL7] no coverage gate (--cov-fail-under / fail_under) ...
pytest.ini:1 [PL8] addopts stops the run early (-x / --maxfail / --exitfirst) ...
Summary: 0 high, 3 low.
By level: project:3
The PL codes: PL1 (python -O / PYTHONOPTIMIZE strips every assert), PL2 (warnings not promoted to errors), PL7 (no coverage gate), PL8 (addopts stops the run early). The per-file scan cannot see config, so this mode complements it.
--baseline / --write-baseline adopt the scanner on a legacy repo without a wall of red:
falsegreen --write-baseline tests/ # writes .falsegreen-baseline.json, exit 0
falsegreen --baseline tests/ # suppresses recorded findings, fails on new ones

Captured:
falsegreen: wrote 2 fingerprint(s) to .falsegreen-baseline.json
A finding is fingerprinted by relative path, code, detail, and normalized source line, not line number, so adding code above it does not re-trigger a baselined finding. Commit .falsegreen-baseline.json; the ratchet only tightens. Both flags take an optional explicit path.
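A short sketch shows why leaving the line number out of the fingerprint keeps a baseline stable. The four fields come from the description above; the hash function (SHA-256), the field separator, and the whitespace-collapsing normalization are assumptions, since the actual baseline format is not specified here.

```python
import hashlib


def fingerprint(rel_path, code, detail, source_line):
    """Hash the four documented fields; the line number is left out."""
    normalized = " ".join(source_line.split())  # assumed normalization
    payload = "\x1f".join([rel_path, code, detail, normalized])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The same assert, before and after code was added above it and the
# line was re-indented: the line number changed, the fingerprint did not.
before = fingerprint("tests/test_demo.py", "C5", "", "    assert True")
after = fingerprint("tests/test_demo.py", "C5", "", "assert   True")
other = fingerprint("tests/test_demo.py", "C6", "", "    assert True")

print(before == after)   # same finding
print(before == other)   # different code, different fingerprint
```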
GitHub Actions (text gate plus SARIF upload to code scanning):
name: falsegreen
on: [push, pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write # required for the SARIF upload
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.x" }
      - run: pip install falsegreen
      - name: Scan and emit SARIF
        run: falsegreen tests/ --format sarif --output falsegreen.sarif
        continue-on-error: true # let the upload run even when exit 20
      - uses: github/codeql-action/upload-sarif@v3
        with: { sarif_file: falsegreen.sarif }
      - name: Fail on high-confidence findings
        run: falsegreen tests/ # exit 20 fails the job (so does exit 10: any non-zero exit fails a run step)

Pre-commit hook:
- repo: https://github.com/vinicq/falsegreen
  rev: v0.9.1 # pin a tag; run `pre-commit autoupdate` to move it
  hooks:
    - id: falsegreen

Then pre-commit install. The hook entry is falsegreen --staged with pass_filenames: false, so it reads the staged test files itself; do not add file arguments. HIGH findings block the commit. Bypass once with git commit --no-verify, or set FALSEGREEN_BLOCK=0 to make the hook warn-only. To run at push time instead, add stages: [pre-push] under the hook.
Raw git hook without the pre-commit framework:
python -m falsegreen.hook_install --repo . # install
python -m falsegreen.hook_install --uninstall # remove

It never imports or runs your tests; detection is structural. It does not judge whether an expected value contradicts intended behaviour, whether a mock replaced the unit under test, or whether the test re-implements the production formula. Those are semantic (cases 10/11/12/15/18) and need the falsegreen-skill LLM pass. It is not a style linter or a coverage tool. The full code catalog, with a BAD and CLEAN example per code, is in the online docs and docs/guide.md.
- Why this exists
- The methodology
- What it detects
- Codes the scanner does not detect
- Diagnostic and coupling codes
- The two layers
- Install and use
- Configuration
- How it compares
- Project layout
- Contributing, security, license
Coverage tells you which lines ran. It does not tell you whether anything was checked. A suite can report 90% coverage while most of those tests assert nothing real - the green bar is a comfort, not a guarantee.
The danger is not a test that fails. It is a test that passes when it should not.
falsegreen targets that one problem. It is not a style linter and not a coverage tool. It asks one question per test: is there a way for the code to be wrong and this test to stay green? If yes, the test is not protecting what it claims.
This matters more now that a large share of tests come from AI assistants. The common machine-written failure modes are exactly the patterns here: assertion roulette, mocking the unit under test, asserting the value you fed the mock, copying the expected value from the current output. They look thorough. They pass. falsegreen is the second reader that asks whether they actually verify anything.
One rule drives everything: a test is only useful if it fails when the code breaks. If you have never seen a test go red, you do not know that it tests anything.
The patterns are organized into five families:
- A. The test never checks anything. The assertion is skipped, missing, swallowed, or the test is never collected by the runner.
- B. The check is weak or always true. It accepts almost any output, or it is true by construction.
- C. The test checks itself, not the program. It mocks the thing under test, asserts the value it configured, or re-implements the production logic as its own answer key.
- D. Green depends on outside factors. Time, randomness, test order, a fixed sleep.
- E. The test passes but checks the wrong thing. The assertion runs against a real result, yet the expected value contradicts what the code should do, so the test freezes a bug.
Family E is why the project has a semantic layer. It carries the core principle: the expected value must come from an independent oracle, not from the code's current output. The oracle hierarchy, highest first: explicit spec or requirement, documented contract (docstring, types, API), independent human judgment, current code. Code is the lowest-priority oracle. Treating it as the highest is how you rubber-stamp a bug.
The plain-language guide, with a real-world analogy and before/after for each case, is in docs/guide.md. The full detection reference lives in falsegreen-skill.
The scanner ships 47 active false-positive codes across the five families, plus CC (commented-out assert). HIGH findings block a commit; LOW ones warn. Cases that require reading production intent (10, 11, 12, 15, 18) need the semantic layer.
| # | Case | Why it fools you | Code | Conf |
|---|---|---|---|---|
| 1 | assert inside an if/for that may not run | check silently skips | C1 | LOW |
| 2 | Test with no assertion at all | proves only it did not crash | C2/C2b | HIGH/LOW |
| 3 | assert inside try whose except swallows it | failure discarded | C3 | HIGH |
| 4 | Test the runner never collects | vanishes from the count | C4/C4b | HIGH/LOW |
| 5 | Always-true check (assert True, non-empty tuple, or True) | passes by construction | C5 | HIGH |
| 6 | Weak check (truthiness, len > 0, "x" in str(...)) | accepts almost anything | C6 | LOW |
| 7 | Compares a value to itself | true by construction | C7 | HIGH |
| 8 | Exact equality on a float | fails on rounding, not bugs | C8 | LOW |
| 9 | pytest.raises(Exception) too broad | accepts the wrong error | C9 | LOW |
| 10 | Mocks the unit under test | tests the mock, not the code | semantic | - |
| 11 | Asserts the value fed to the mock | an echo, not a result | semantic | - |
| 12 | Re-implements the production formula | both agree on the same wrong number | semantic | - |
| 13 | Mock assertion misspelled / not called | becomes a no-op | C13/C13b | HIGH/LOW |
| 14 | Golden/snapshot written from the output | records today's bug as correct | C14 | LOW |
| 15 | Passes only if another test ran first | borrowed state | semantic | - |
| 16 | Depends on time, randomness, or a sleep | passes or fails by luck | C16 | LOW |
| 17 | skip inside a broad except | turns red into yellow | C17 | HIGH |
| 18 | Expected value contradicts what code should do | freezes a bug as correct | semantic | - |
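Two rows of the table are easy to reproduce in plain Python. The sketch below is a hand-written demonstration, not scanner output: it runs three checks against a deliberately broken `broken_add` and reports which ones stay green. The function names are hypothetical (in a real suite they would be named test_*), and the `passes` helper stands in for a test runner.

```python
from unittest.mock import Mock


def broken_add(a, b):
    return a - b  # deliberately wrong: should be a + b


def check_swallowed():
    # Case 3: AssertionError is an Exception, so the except discards it.
    try:
        assert broken_add(2, 3) == 5
    except Exception:
        pass


def check_mock_no_parens():
    # Case 13: without parentheses this only looks up the method;
    # nothing is verified even though the mock was never called.
    notifier = Mock()
    notifier.assert_called_once


def check_real():
    # A real check: fails because broken_add(2, 3) is -1, not 5.
    assert broken_add(2, 3) == 5


def passes(check):
    """Stand-in for a test runner: True if the check raises no assertion."""
    try:
        check()
    except AssertionError:
        return False
    return True


for check in (check_swallowed, check_mock_no_parens, check_real):
    print(check.__name__, "green" if passes(check) else "red")
```

Only the last check goes red; the first two stay green against broken code, which is the false positive the table describes.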
Six codes cover patterns added after the original eighteen:
| Code | Pattern | Why it fools you | Family | Conf |
|---|---|---|---|---|
| C18 | Compares str()/repr()/f-string to literal | checks formatting, not the value | B | LOW |
| C19 | pytest.raises wraps more than one call | earlier line raises; target never reached | A | LOW |
| C20 | assert in dead code after return/raise/fail() | never runs | A | HIGH |
| C21 | Every assert is conditional, none unconditional | false condition passes the whole test | A | LOW |
| C22 | async test asserts but never awaits the unit | checks an un-awaited coroutine | A | off |
| C23 | Opens a real file at a literal hard-coded path | binds test to a layout; often a credential outside the repo (Mystery Guest) | D | LOW |
Twelve additional codes, plus CC (commented-out assert), cover the most common patterns in real test suites:
| Code | Pattern | Conf |
|---|---|---|
| C25 | @pytest.mark.xfail without strict=True - XPASS silently accepted | LOW |
| C27 | try/except/pass used instead of pytest.raises | HIGH |
| C28 | pytest.raises binding declared but exception content never inspected | LOW |
| C29 | os.environ mutated directly in a test - state leaks between tests | LOW |
| C30 | responses.add() / httpretty.register_uri() without activating the interceptor | LOW |
| C31 | capsys.readouterr() result never asserted | LOW |
| C32 | @pytest.mark.skip without reason= | LOW |
| C33 | sklearn / ML metric computed but never asserted | LOW |
| C34 | Suboptimal assert form - pytest provides a clearer, more idiomatic alternative | LOW |
| C35 | Retry/flaky decorator masks flaky behaviour instead of fixing the root cause | LOW |
| C36 | pytest.fail() with no reason argument - failure message is empty | LOW |
| C37 | Duplicate case in @pytest.mark.parametrize - same argument set runs twice | LOW |
| CC | Commented-out assert | LOW |
More from the consolidated catalog:
| Code | Pattern | Conf |
|---|---|---|
| C2c | self.subTest(...) block wraps work but asserts nothing - the subTest analogue of an empty test | LOW |
| C6b | Assertion coupled to positional argument layout - breaks on a benign reorder, not a bug | LOW |
| C6c | Asserts a mock's call_count truthiness - only that it was called, not how many times | LOW |
| C8b | Approximate-equality with no explicit tolerance (assertAlmostEqual, == pytest.approx()) - the default tolerance hides a wrong value | LOW |
| C11a | Self-confirming literal - the expected value is assigned by the test itself | LOW |
| C24 | Module-global mutable state shared across tests - borrowed state, not isolation | LOW |
| C38 | Two test functions share a name - the later one silently overrides the first | HIGH |
| C39 | Test returns a comparison instead of asserting it - pytest ignores the value | HIGH |
| C41 | Assertion on an in-place method that returns None (assert not lst.sort()) - trivially satisfied | LOW |
| C42 | assert on a generator expression or lambda - the object is always truthy | HIGH |
| C43 | pytest.skip() after test logic - the checks below it never run | LOW |
| C44 | Numeric tautology (len(x) >= 0, abs(x) >= 0, a mock's call_count >= 0) - always true | HIGH |
| C45 | Empty @pytest.mark.parametrize list - the test is generated with zero cases | HIGH |
| C48 | Dark patch - the test forces a test-mode flag (os.environ["TESTING"], settings.TESTING) then asserts, exercising the product's test-only branch | LOW |
| C49 | pytest.warns / assertWarns / deprecated_call wraps more than one call - an unrelated earlier line may warn while the target never does (warns sibling of C19) | LOW |
| C50 | caplog / assertLogs output captured but never asserted - the capture has no effect on pass/fail (logging sibling of C31) | LOW |
| C51 | Empty-bodied pytest.raises / warns context - no call inside the block, so the call that should raise is never made | HIGH |
| C52 | Membership self-confirmation - assert x in {x}, the collection is built from the subject so membership is true by construction (membership variant of C7) | LOW |
| C55 | Assertion compares two mock-rooted values - assert m.foo == m.bar, both sides are the test's own doubles not the SUT | LOW |
| C56 | Sync assert of a never-awaited coroutine - the operand calls a local async def with no await, so the check runs on a coroutine object not its value (sync sibling of C22) | LOW |
| C57 | Comparison against an unconfigured Mock attribute - assert x == m.attr on a bare Mock() with no spec=, which auto-creates a fresh truthy Mock (the AST-decidable corner of the deferred C40) | LOW |
| C59 | Bare top-level comparison - result == expected as a statement, the value is computed and discarded so nothing is asserted (loose-statement sibling of C39) | HIGH |
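Several of these patterns are true by construction, which can be confirmed directly in Python. The snippet below is a small demonstration of the language behaviour behind C41, C42, and C44; the variable names are illustrative only.

```python
values = []

# C42: a generator object is truthy even when it yields nothing.
gen_check = bool(x > 0 for x in values)

# C44: a length can never be negative, so this holds for any list.
len_check = len(values) >= 0

# C41: list.sort() sorts in place and returns None, so "not None" is True.
sort_check = not [3, 1, 2].sort()

# The likely intent behind the C42 shape: all() consumes the generator
# and evaluates each element.
real_check = all(x > 0 for x in [1, -2, 3])

print(gen_check, len_check, sort_check, real_check)
```

The first three values are True regardless of the data; only the all() form can actually come out False.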
The static layer is close to saturated. A handful of catalog codes are deliberately left out because a per-file AST pass cannot judge them without a high false-positive rate, or because they are not a per-file property at all. They are listed here so the gap is honest, not hidden. The reasoning follows the consolidated catalog.
High false-positive without deeper analysis (left to the semantic pass).
- C40 (assert on a Mock attribute with no spec, always truthy): without spec or autospec analysis the false-positive rate is high, since the same shape is a valid check on a real object. The concept lives in the skill (Family F7).
- C46 (real network or database call with no double): legitimate at the integration level, where crossing the boundary is the point. Flagging it per file, without knowing the test's layer, would produce many false positives. It belongs to the skill and the project layer.
- C47 (assertion depends on dict or set ordering): most collections are used deterministically, so flagging unordered-vs-sequence comparisons fires far too often. It stays a note in the skill.
Runtime and culture (not a per-file property). The PL series is about how the
suite is invoked and configured, not what a single test file contains. PL1, PL2,
PL7, and PL8 are covered by --config-audit (python -O/PYTHONOPTIMIZE strips every
assert, warnings not promoted to errors, no coverage gate, addopts that stops the run
early), read from the pytest config it parses. The rest need execution or pipeline
inspection: PL4 (a collection error counted as "0 tests" while CI stays green), and
PL3, PL5, PL6 (a coverage pragma in production code, importorskip hiding a broken
import, CI running a subset via -k / -m). They are documented, not promised, and sit
outside the "test file" target.
Semantic Family E or F7 (mutation testing and the skill). Mocking the unit under
test, asserting the value you fed the mock, re-implementing the production formula,
borrowing state from another test, an expected value that contradicts the spec: none of
these can be proven by structure. C14 (a snapshot generated from the code's own output)
is the only codable corner of this family. The honest path for the rest is mutation
testing (mutmut, cosmic-ray), which mutates the production code and checks whether any
test goes red, plus the LLM semantic pass in
falsegreen-skill.
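The idea behind mutation testing can be shown by hand on the add example. This is a toy illustration, not how mutmut or cosmic-ray are invoked: the mutant is written manually, and the `kills` helper is a hypothetical stand-in for a mutation-testing harness.

```python
def add(a, b):
    return a + b


def mutant_add(a, b):
    return a - b  # hand-written mutant: "+" replaced by "-"


def weak_test(fn):
    fn(2, 3)
    assert True  # always-true check: cannot notice the mutation


def strong_test(fn):
    assert fn(2, 3) == 5  # checks the actual result


def kills(test, mutant):
    """A test kills a mutant if it fails when run against it."""
    try:
        test(mutant)
    except AssertionError:
        return True
    return False


print("weak test kills mutant:", kills(weak_test, mutant_add))
print("strong test kills mutant:", kills(strong_test, mutant_add))
```

A mutant that survives every test points at code whose behaviour no assertion actually pins down.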
How the scanner detects. It parses each test file with Python's ast module and inspects the tree. It never imports or runs the test, so a malicious or broken test cannot execute through it. Detection is structural: an assert whose expression is a constant, both sides of a comparison AST-identical, a pytest.raises argument of Exception, a mock-named receiver with a no-parentheses assert_called_once, a Test* class with __init__, and so on. Precision is the priority for HIGH codes, because they block commits: each one is stress-tested against look-alikes (optional-dependency skips, abstract base test classes, @patch-injected mocks, exact-count len(x) == N) and stays quiet on them.
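As a rough sketch of what structural detection means, the following uses Python's ast module to find two of the shapes named above: an assert on a truthy constant and an assert that compares an expression to itself. It is a simplified illustration under stated assumptions, not the scanner's implementation; `find_trivial_asserts` and its labels are hypothetical, and the real rules handle many more forms and look-alikes.

```python
import ast


def find_trivial_asserts(source):
    """Return sorted (line, label) pairs for constant and self-comparing asserts."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assert):
            continue
        test = node.test
        if isinstance(test, ast.Constant) and bool(test.value):
            # e.g. "assert True" or "assert 1"
            findings.append((node.lineno, "always-true"))
        elif (isinstance(test, ast.Compare) and len(test.ops) == 1
              and isinstance(test.ops[0], ast.Eq)
              and ast.dump(test.left) == ast.dump(test.comparators[0])):
            # both sides of "==" have identical syntax trees
            findings.append((node.lineno, "self-compare"))
    return sorted(findings)


# The sample is only parsed, never executed, so undefined names are fine.
SAMPLE = '''
def test_add():
    result = add(2, 3)
    assert True

def test_same():
    assert result == result

def test_ok():
    assert add(2, 3) == 5
'''

for line, label in find_trivial_asserts(SAMPLE):
    print(line, label)
```

Because the source is parsed rather than imported, nothing in the test file runs, which matches the safety property described above.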
How the semantic pass detects. Cases 10, 11, 12, 15, and 18 cannot be proven by structure. A parser sees a mock but cannot tell whether it replaced an edge (network, disk, clock) or the thing under test. It sees an arithmetic expression but cannot tell whether the expected value was derived independently or copied from the code. That judgment requires reading the production code against an independent oracle - that is what falsegreen-skill does.
Why two confidence levels. A blocking gate that cries wolf gets disabled. So only near-certain, mechanically unambiguous patterns are HIGH (they block). The rest are LOW (they warn) and are starting points for human or semantic judgment, not verdicts.
A tool that flags tests for not protecting anything has to show it protects something itself.
- The scanner (deterministic). Every rule ships with two tests: one proving it fires on the bad pattern, one proving it stays quiet on a legitimate look-alike. The scanner also runs on its own source on every commit (the self-scan), because the false-positive detector is not allowed to contain one. It is also validated against real-world Python projects - the most recent corpus run covered 40 projects with over 58,000 test functions. That pass surfaced false positives in two rule classes (C7 on deliberate __eq__ tests, C4 on test-named route handlers). Both were fixed, each with regression tests. The HIGH count across all 40 projects after fixes: 0. Each false positive is recorded in the commit history and the CHANGELOG.
- The semantic pass (LLM). Validation for the LLM-based semantic layer is tracked in falsegreen-skill, where benchmark corpora for Python and TypeScript are maintained with precision/recall measurements.
falsegreen scans tests at every level of the pyramid. Discovery is level-agnostic - it reads any pytest/unittest file - but a few codes are read in light of the level, so a valid pattern at one level is not flagged at another.
- Unit: a function with its boundaries doubled. The oracle is assert (or self.assert*).
- Integration (API and database): API tests through requests/httpx or a framework TestClient (FastAPI, Flask, Django), database tests against a real datastore (SQLAlchemy, the Django ORM, testcontainers). These cross the I/O boundary on purpose, so the response or row IS the verification at that level. The weak-check code (C6) relaxes in the web layer, where the presence of a response is a real check.
- E2E: Playwright for Python and Selenium. expect(locator).to_be_visible() is the oracle.
A real API or database hit inside a test that claims to be a unit test is itself the smell
(mystery guest, resource optimism, state leak), not the level of the test. C23 (real file at
a literal path), C29 (os.environ mutated), and C30 (mock interceptor never activated) flag
those forms.
The same false-green shape is classified by the level the test runs at: the level is a per-finding axis (J3), read as unit, integration, or E2E. The codes that cluster at each level in Python:
- Unit: C5/C7 (always-true, self-compare), C2b (calls the code, asserts nothing), C56 (sync assert of a never-awaited coroutine), and the semantic S1/S5 (intent mismatch, tests the framework).
- Integration: C9b (request oracle off, expected_status=any), C50 (caplog/assertLogs captured but never asserted).
- E2E: C16 (sleep as synchronization, uncontrolled time).
Full matrix on the docs site: patterns by test level and what we do not flag.
Seven additional codes surface smells that do not create false positives but hurt observability and maintainability. All are off by default. Enable with severity = { CODE = "info" } in config. info findings appear in separate DIAGNOSTIC and COUPLING sections and do not affect the exit code.
| Code | Smell | What it flags |
|---|---|---|
| C22 | Async Liar | async def test_* that asserts but never awaits the unit |
| D1 | Assertion Roulette | 2+ assertions in one test, all without a msg argument |
| D3 | Duplicate Assert | the same assertion written twice in the same test body |
| D4 | Unnamed Parametrize | @pytest.mark.parametrize with 3+ cases and no ids= |
| D5 | Inline Setup Excess | too many setup statements before the first assert (threshold configurable) |
| D6 | Debug Print | print() call in test body |
| M2 | Long Test Method | test body exceeds long_test_threshold lines (default 50) |
| Layer | What it is | When it runs | Catches |
|---|---|---|---|
| Scanner (this repo) | Zero-dependency AST analysis | CLI, CI, pre-commit | 47 active false-positive codes + 6 opt-in diagnostic and coupling codes |
| Semantic pass (falsegreen-skill) | LLM-based analysis, Python and other languages | on demand | bug-freezing patterns no static tool can see (cases 10/11/12/15/18) |
The scanner is the fast, deterministic pre-filter. For TypeScript, JavaScript, Java, and other languages, use falsegreen-skill.
Measured against the Open Catalog of Test Smells (517 documented smells), only the false-green slice is in scope. These stay out, on purpose:
- Brittleness / false-red (a test that breaks without a real bug): sensitive equality, brittle or fragile assertions. The opposite axis; flagging it would punish correct code.
- Hygiene / maintainability: assertion roulette, magic numbers, long or verbose tests. Linter territory (ruff), and a few are surfaced here as opt-in diagnostics (D1, M2).
- Slow, design, naming, duplication, runtime/culture: none are about whether the test protects.
The boundary is deliberate. Where a smell has a statically provable false-green form, that form is a code here: uncontrolled time or randomness is C16, a hard-coded path is C23, shared module state is C24, an assertion that may never run is C21. See CREDITS.md for the full cross-walk against the literature.
pip install falsegreen
No permanent install needed: uvx falsegreen tests/ or pipx run falsegreen tests/ runs the latest release from PyPI without touching your environment.
falsegreen # scan the current directory
falsegreen tests/ # scan a folder or a single file
falsegreen --staged # only the test files staged in git
falsegreen --format sarif # text (default) | json | sarif | junit
falsegreen --summary # one-line "N scanned, M flagged" to stderr
falsegreen --output report.sarif # write to a file
falsegreen --output .falsegreen/ # write report.<ext> into a directory
falsegreen --config-audit # audit pytest/coverage config (project-layer PL codes)
falsegreen --disable C6,C2b # turn specific codes off
--config-audit is a separate mode: instead of scanning test files, it reads the project's pytest and coverage config (pyproject.toml, pytest.ini, tox.ini, setup.cfg) and reports the project-layer ways a suite stays green by configuration: PL1 (python -O / PYTHONOPTIMIZE strips every assert at runtime), PL2 (filterwarnings does not promote warnings to errors), PL7 (no coverage gate), PL8 (addopts stops the run early with -x/--maxfail). These complement the per-file scan, which cannot see config.
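The PL1 case can be reproduced without touching any project config. The sketch below compiles the same failing assert at two optimization levels using the built-in `compile`; it demonstrates the interpreter behavior that PL1 warns about and is not falsegreen's own detection logic. The helper name `runs_clean` is hypothetical.

```python
SOURCE = "assert 1 + 1 == 3, 'arithmetic is broken'"


def runs_clean(optimize):
    # optimize=0 keeps asserts; optimize=1 matches "python -O" and strips them.
    code = compile(SOURCE, "<demo>", "exec", optimize=optimize)
    try:
        exec(code, {})
    except AssertionError:
        return False
    return True


print(runs_clean(0))  # False: the assert runs and fails
print(runs_clean(1))  # True: the assert was compiled away
```

A suite run under `-O` behaves like the second call for every plain `assert`, which is why this is a config-level finding and not a per-file one.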
Each finding is reported with its pyramid level (unit / integration / e2e, read from the file's imports) and a one-line fix hint, and the text summary breaks the findings down by level and lists the most common fixes. --output takes a file or a directory: an extension-less or trailing-slash path (e.g. .falsegreen/) receives report.<ext> for the chosen format. Reports are run artifacts; keep the output directory gitignored.
--format sarif emits SARIF 2.1.0 (HIGH → error, LOW → warning) for GitHub code scanning and PR annotations. --format junit emits JUnit XML for CI dashboards.
python -m falsegreen ... is equivalent. Exit codes: 0 clean, 10 low-confidence findings only, 20 at least one high-confidence finding. No third-party runtime dependencies; Python 3.8+.
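A CI wrapper can branch on the three exit codes instead of treating any non-zero status as a failure. The sketch below is one possible wrapper, assuming falsegreen is installed in the current interpreter; `describe_exit`, `should_fail_build`, `run_scan`, and the `strict` flag are hypothetical names, not part of the tool.

```python
import subprocess
import sys

# Exit codes as documented above.
EXIT_MEANINGS = {
    0: "clean",
    10: "low-confidence findings only",
    20: "at least one high-confidence finding",
}


def describe_exit(code):
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def should_fail_build(code, strict=False):
    # HIGH findings (20) and unknown codes always fail; LOW (10) only in strict mode.
    if code == 0:
        return False
    if code == 10:
        return strict
    return True


def run_scan(path="tests/"):
    # Equivalent to running "python -m falsegreen tests/" from a shell.
    proc = subprocess.run([sys.executable, "-m", "falsegreen", path], check=False)
    return proc.returncode
```

Typical use would be `code = run_scan()` followed by `sys.exit(1 if should_fail_build(code) else 0)`, which keeps LOW findings as warnings while HIGH findings stop the pipeline.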
Add to .pre-commit-config.yaml:
- repo: https://github.com/vinicq/falsegreen
  rev: v0.9.1
  hooks:
    - id: falsegreen
Then run pre-commit install. On each commit it scans the staged test files. HIGH findings block the commit. Bypass once with git commit --no-verify, or set FALSEGREEN_BLOCK=0 to make the hook warn-only.
- Bump the pinned rev. rev is pinned to a tag (v0.9.1 above), so the hook never changes under you. Run pre-commit autoupdate to rewrite it to the latest release, then commit the updated .pre-commit-config.yaml. Pin a tag, never a branch, so local runs and CI use the same scanner.
- --staged and pass_filenames. The hook entry is falsegreen --staged, and .pre-commit-hooks.yaml sets pass_filenames: false. The hook reads the staged test files itself from git, so it does not need pre-commit to append the matched file list. Leaving pass_filenames at its default (true) would append every matched path on top of --staged, scanning some files twice. Do not re-enable it or add file arguments in your config; let --staged own the file selection.
- stages override. The hook runs at the pre-commit stage by default. To run it on push instead, set stages: [pre-push] under the hook in your config. A heavier suite can keep the fast HIGH gate at commit time and a fuller pass at push time.
Raw git hook (without the pre-commit framework):
python -m falsegreen.hook_install --repo . # install
python -m falsegreen.hook_install --uninstall # remove
For cases that require reading production intent (mocking the unit under test, copying expected from current output, re-implementing the formula), use falsegreen-skill. It covers Python, TypeScript, JavaScript, Java, and other languages.
- Inline suppression: # falsegreen: ignore silences all codes on a line; # falsegreen: ignore[C8] silences one.
- Disable codes globally: --disable C6,C2b.
- Environment: FALSEGREEN_BLOCK=0 makes the pre-commit hook warn instead of block.
Configuration lives under [tool.falsegreen] in pyproject.toml, or in a flat .falsegreen.toml at the repo root (.falsegreen.toml wins if both exist):
[tool.falsegreen]
disable = ["C13b"] # turn these codes off everywhere
exclude = ["tests/legacy/*"] # skip files matching these globs
long_test_threshold = 30 # line-count limit for M2 (default: 50)
inline_setup_threshold = 3 # stmt limit for D5 (default: 5)
[tool.falsegreen.severity]
C8 = "high" # promote: now blocks the commit (exit 20)
C6 = "off" # same as adding C6 to disable
C22 = "low" # enable: async-never-awaits check
D1 = "info" # enable Assertion Roulette
D3 = "info" # enable Duplicate Assert
D4 = "info" # enable Unnamed Parametrize
D5 = "info" # enable Inline Setup Excess
D6 = "info" # enable Debug Print
M2 = "info" # enable Long Test Method
severity values: high, low, info, or off. info findings appear in DIAGNOSTIC/COUPLING sections and do not affect the exit code. long_test_threshold and inline_setup_threshold are top-level keys in [tool.falsegreen], not inside [severity]. Precedence, highest first: --disable CLI, inline # falsegreen: ignore, config file, built-in default. Point at a specific file with --config PATH.
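The precedence order can be read as a chain of early returns. The sketch below models it with a hypothetical `effective_severity` function; the parameter names, the example defaults, and the choice to check the config `disable` list before the config `severity` table are assumptions for illustration, not the scanner's actual code.

```python
def effective_severity(code, default, cli_disabled=(), inline_ignored=False,
                       config_disable=(), config_severity=None):
    # 1. --disable on the command line wins over everything else.
    if code in cli_disabled:
        return "off"
    # 2. An inline "# falsegreen: ignore" on the flagged line comes next.
    if inline_ignored:
        return "off"
    # 3. The config file: the disable list, then the severity table
    #    (the order between these two is an assumption).
    if code in config_disable:
        return "off"
    if config_severity and code in config_severity:
        return config_severity[code]
    # 4. Otherwise the built-in default applies.
    return default


config = {"C8": "high", "D1": "info"}
print(effective_severity("C8", "low", config_severity=config))   # high
print(effective_severity("D1", "off", config_severity=config))   # info
print(effective_severity("C8", "low", cli_disabled={"C8"},
                         config_severity=config))                # off
```

The last call shows why a promoted code can still be silenced for one run: the CLI flag is consulted before the config file.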
Record the findings you already have, then fail only on new ones:
falsegreen --write-baseline tests/ # writes .falsegreen-baseline.json, exits 0
falsegreen --baseline tests/ # suppresses recorded findings, fails on new
A finding is fingerprinted by relative path, code, detail, and normalized source line - not line number, so prepending code does not re-trigger a baselined finding. Commit .falsegreen-baseline.json and the ratchet only tightens.
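The reason a line-number-free fingerprint survives edits can be shown in a few lines. The sketch below hashes the four documented fields with SHA-256; the hash choice, the field separator, the whitespace normalization, and the `detail` string are all assumptions, since the real format of .falsegreen-baseline.json is not described here.

```python
import hashlib


def normalize(line):
    # Collapse runs of whitespace so indentation changes do not matter.
    return " ".join(line.split())


def fingerprint(rel_path, code, detail, source_line):
    # The line number is deliberately absent from the hashed payload.
    payload = "\x1f".join([rel_path, code, detail, normalize(source_line)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


before = ["def test_total():", "    assert total == total"]
after = ["import os", "", "def test_total():", "    assert total == total"]

# The flagged line moved from index 1 to index 3; its fingerprint is unchanged.
fp_before = fingerprint("tests/test_cart.py", "C7", "self-compare", before[1])
fp_after = fingerprint("tests/test_cart.py", "C7", "self-compare", after[3])
print(fp_before == fp_after)  # True
```

Editing the flagged line itself, or moving it to another file, produces a different fingerprint, so the finding would count as new against the baseline.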
- ruff / flake8-pytest-style - mature, fast lint rules. Overlaps on broad raises (PT011) and assert-in-except (PT017). Run both: falsegreen adds uncollected tests, always-true asserts, self-comparison, mock typos, duplicate parametrize cases, and more.
- PyNose / pytest-smell - test-smell catalogs from research. Broader taxonomy, but no commit gate and no oracle-correctness check.
- mutmut / cosmic-ray - mutation testing, the most honest measure of whether a green suite fails when the code is wrong. Complementary and heavier. falsegreen is the cheap pre-filter you run on every commit; mutation testing is the deep audit you run on suites that matter.
- falsegreen-skill - the LLM companion for the semantic pass and for TypeScript, JavaScript, Java, and other languages.
The defensible gap: a deterministic commit gate that catches the mechanical false-positive patterns with zero runtime dependencies, paired with an LLM semantic layer that catches the oracle-correctness cases no static tool can see.
falsegreen/
src/falsegreen/scanner.py the deterministic scanner
src/falsegreen/hook_install.py raw git-hook installer
docs/guide.md plain-language guide to every case
examples/python/ a BAD + CLEAN sample for every detected code
tests/test_scanner.py the scanner's own tests
.pre-commit-hooks.yaml pre-commit integration
pyproject.toml packaging
- CONTRIBUTING.md - dev setup, how to add a detection rule, the false-positive policy, Conventional Commits.
- SECURITY.md - how to report a vulnerability privately.
- CODE_OF_CONDUCT.md - Contributor Covenant 2.1.
- CREDITS.md - research falsegreen builds on (Soares rotten-green work, PyNose, test-smell catalog, agentic-LLM studies), with author credit.
- License: MIT, see LICENSE.
Thanks to the people who keep false-green tests out of real suites (emoji key):
Vinicius Queiroz 💻 📖 🤔 🚧 🚇 |
Home Seller 💻 📖 🚇 |
New contributors are added automatically; the table also recognizes non-code work (docs, ideas, infrastructure, tests, research) via the all-contributors spec.