Systemic type-annotation sloppiness in AI-generated code #427

@mickume

Description

The Pattern

An audit of all 61 af:hunt issues reveals a systemic pattern: AI-generated code consistently has correct runtime logic but sloppy static type annotations. This is not a handful of one-off mistakes — it is a repeating signature of how AI coding agents write Python.

48 of the 61 issues (79%) touch test files. The remaining 13 target production code. In almost every case the underlying logic is sound and the code works at runtime (all 4,688 tests pass), but mypy and ruff flag type-safety violations that a human reviewer would catch on first read.

Recurring Error Categories

| Pattern | Example | Frequency |
| --- | --- | --- |
| `object` used where a concrete type is needed | `caplog: object` instead of `LogCaptureFixture`; `config: object` instead of `AgentFoxConfig` | ~15 issues |
| Missing null-narrowing before indexing | `fetchone()[0]` without a `None` check | ~5 issues |
| Wrong generator return type on fixtures | `-> T` instead of `-> Generator[T, None, None]` | 3 issues |
| Unsorted / unused imports | ruff `I001`, `F401` | 4 issues (with duplicates) |
| `callable` (builtin) as type annotation | should be `typing.Callable` | 2 issues |
| Missing `py.typed` markers | internal packages imported without stubs | 2 issues |
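The most frequent category is easy to reproduce in miniature. In the sketch below, `AppConfig` is a hypothetical stand-in for a concrete type like AgentFoxConfig; the `object`-annotated version runs fine but forces mypy to reject every attribute access, which is exactly the "correct runtime logic, sloppy static types" signature:

```python
from dataclasses import dataclass

# Hypothetical stand-in for a concrete config type such as AgentFoxConfig.
@dataclass
class AppConfig:
    db_path: str

def load_sloppy(config: object) -> str:
    # Works at runtime, but mypy rejects it:
    #   error: "object" has no attribute "db_path"
    return config.db_path  # type: ignore[attr-defined]

def load_typed(config: AppConfig) -> str:
    # Identical runtime behavior; attribute access is now type-checked.
    return config.db_path
```

Both functions return the same value at runtime, which is why the test suite stays green while the type checker fails.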

These are not distinct bugs — they are six variations of the same underlying problem: the AI agent treats the type system as optional commentary rather than a contract.

Why It Matters

  1. Hunt scanner noise. Night Shift's hunt stream files an individual GitHub issue for each finding. A single object-typing habit across 10 files produces 10 issues that all have the same root cause. This buries genuinely dangerous findings (like Test asserts wrong archetype name 'coder' instead of 'fix_coder' #322, where a test asserted the wrong archetype name) under a mountain of type-annotation lint.

  2. False confidence from green tests. All 4,688 tests pass. But passing tests + failing type checks means the safety net has holes: the tests validate behavior but not the contracts between modules. A future refactor that changes a return type will not be caught by mypy if half the test suite already suppresses or ignores type errors.

  3. Duplicate test definitions go unnoticed. In test_db_plan_state.py, two test functions share the same name (Duplicate test function definitions in test_db_plan_state.py shadow earlier tests #417). Python silently overwrites the first with the second — a test is lost. This is the kind of defect that only static analysis catches, and it was drowned out by dozens of cosmetic type-annotation issues.
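A minimal sketch of the shadowing defect (the function name here is illustrative, not the actual name from #417). Python simply rebinds the name, so the first body becomes unreachable without any runtime error:

```python
# Two test functions with the same name: only the second survives.
def test_plan_state_roundtrip():
    return "first definition"    # this body is silently lost...

def test_plan_state_roundtrip():  # noqa: F811 <- ruff flags the redefinition
    return "second definition"   # ...because the name is rebound here
```

Calling `test_plan_state_roundtrip()` returns only the second body, and pytest collects only one test where the author wrote two.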

Root Cause

The specs (test_spec.md files) define test behavior in language-agnostic pseudocode. The AI agent translates this to Python and gets the logic right, but:

  • It defaults to object when it doesn't know the exact type of a fixture or mock return value.
  • It doesn't run mypy as part of its feedback loop, so type errors never trigger a correction cycle.
  • It copies fixture patterns (e.g., generator fixtures) from memory without verifying the return-type annotation convention.
  • It doesn't deduplicate function names when appending regression tests to an existing file.
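The fixture point can be illustrated with a plain generator (under pytest the same function would carry a `@pytest.fixture` decorator; sqlite3 stands in for DuckDB here so the sketch is self-contained):

```python
from collections.abc import Generator
import sqlite3  # stand-in for DuckDBPyConnection; the pattern is identical

def connection_fixture() -> Generator[sqlite3.Connection, None, None]:
    # The body yields, so the function is a generator; annotating it as
    # "-> sqlite3.Connection" (the yielded type) is what mypy flags.
    connection = sqlite3.connect(":memory:")
    yield connection      # the value the test receives
    connection.close()    # teardown after the test finishes
```

Writing `-> sqlite3.Connection` looks plausible from memory, which is presumably why the agent keeps reproducing it.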

Suggested Remediation

Short-term: bulk fix the existing issues

Most open af:hunt issues are mechanically fixable:

  • Replace object annotations with concrete types (AgentFoxConfig, DuckDBPyConnection, LogCaptureFixture, etc.)
  • Add assert row is not None before indexing fetchone() results
  • Change fixture return types to Generator[T, None, None]
  • Run ruff check --fix for import sorting
  • Deduplicate test function names in test_db_plan_state.py
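The null-narrowing fix in the list above is a one-line change. A minimal sketch, using sqlite3 as a stand-in for DuckDB (both drivers type `fetchone()` as returning an optional row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table plans (id integer, name text)")
conn.execute("insert into plans values (1, 'alpha')")

row = conn.execute("select name from plans where id = 1").fetchone()
# fetchone() returns Optional[...]; without this assert, mypy flags row[0].
assert row is not None
name = row[0]
```

The assert both satisfies mypy and turns a would-be `TypeError: 'NoneType' object is not subscriptable` into an explicit failure at the point where the assumption breaks.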

This could be a single PR touching ~30 files.

Long-term: add mypy to the agent feedback loop

The agent coding sessions run make check (ruff + pytest) before committing. Adding mypy to that gate — even in non-strict mode on test files — would catch these errors during generation rather than in a post-hoc hunt scan. This would:

  • Eliminate the most common category of af:hunt findings at the source
  • Reduce issue noise so real logic bugs stand out
  • Force the agent to learn the correct annotation patterns through its retry loop

Consider: hunt scanner deduplication

The hunt scanner should detect when multiple findings share a root cause (e.g., the same object-typing pattern across files) and consolidate them into a single issue with an affected-files list, rather than filing N separate issues.
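One possible shape for that consolidation step, sketched with assumed field names (`"pattern"`, `"file"`) rather than the scanner's real schema:

```python
from collections import defaultdict

# Illustrative findings; field names are assumptions, not the real schema.
findings = [
    {"pattern": "object-annotation", "file": "tests/test_a.py"},
    {"pattern": "object-annotation", "file": "tests/test_b.py"},
    {"pattern": "missing-null-check", "file": "src/db.py"},
]

def consolidate(findings: list[dict[str, str]]) -> list[dict[str, object]]:
    by_pattern: dict[str, list[str]] = defaultdict(list)
    for finding in findings:
        by_pattern[finding["pattern"]].append(finding["file"])
    # One issue per root cause, carrying the list of affected files.
    return [
        {"pattern": pattern, "affected_files": files}
        for pattern, files in by_pattern.items()
    ]

issues = consolidate(findings)
```

With this grouping, the ten `object`-typing findings described above would collapse into one issue with a ten-entry affected-files list.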

Related Issues

Open (representative sample):

Closed (same pattern, already fixed):
