e2e: chaos/resilience testing under server flakiness #197

@bosd

Goal

Validate fluvo's transport and server resilience against real flaky-server conditions (the kind that bite real-world imports), not just mocked failures. fluvo's pitch is "imports that don't lose data", but its resilience code is currently exercised only by the in-memory FakeOdoo:

  • _execute_batch_with_retry (export: split + retry on network errors)
  • the binary-search load fallback (import)
  • connection-pool-exhaustion + "could not serialize access" handling (_handle_create_error)

The existing tests/e2e/test_integrity_failures.py covers only data-level faults (malformed rows → fail file), not transport- or server-level faults.
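
For readers unfamiliar with the binary-search load fallback listed above, here is a minimal sketch of the general technique. It is an illustration only, not fluvo's actual implementation: `load_with_bisect`, `Record`, and the `load` callable are hypothetical names, and the broad `except` stands in for whatever error types the real code handles.

```python
from typing import Callable, Sequence

Record = dict
FailEntry = tuple[Record, str]  # (record, error message)


def load_with_bisect(
    records: Sequence[Record],
    load: Callable[[Sequence[Record]], None],
) -> tuple[list[Record], list[FailEntry]]:
    """Try to load the whole batch; on failure, split it in half and recurse.

    Returns (imported, failed). A single record that still fails is
    reported in `failed` together with its error message.
    """
    if not records:
        return [], []
    try:
        load(records)
        return list(records), []
    except Exception as exc:  # illustration only: real code would narrow this
        if len(records) == 1:
            return [], [(records[0], str(exc))]
    # The batch failed as a whole: bisect to isolate the offending records.
    mid = len(records) // 2
    ok_left, bad_left = load_with_bisect(records[:mid], load)
    ok_right, bad_right = load_with_bisect(records[mid:], load)
    return ok_left + ok_right, bad_left + bad_right
```

Every input record ends up in exactly one of the two returned lists, which is the property the chaos tests would need to confirm still holds when the failure comes from the transport rather than from the data.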

Principles

  • Deterministic, not flaky. Use Toxiproxy (a TCP fault-injection proxy between fluvo and Odoo) for repeatable network faults — NOT timing roulette.
  • Assert reconciliation, not success. Under fault the contract is no silent data loss: every record is imported OR in the fail file with an error. Use assert_reconciled. A test asserting "import succeeds" would be flaky.
  • Opt-in / local (a -m chaos marker, nox -s e2e), never part of the CI matrix.
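
The reconciliation contract above can be sketched as a plain set comparison. This is a hypothetical stand-in, not the existing assert_reconciled helper (whose signature is not shown here): `reconciliation_problems` and its arguments are assumed names, and record ids are assumed to be sortable.

```python
def reconciliation_problems(source_ids, imported_ids, failed):
    """Return a list of contract violations (empty means reconciled).

    `failed` maps record id -> error message, as read from the fail file.
    """
    source = set(source_ids)
    imported = set(imported_ids)
    failed_ids = set(failed)
    problems = []
    # A source record that is neither imported nor reported: silent data loss.
    for rid in sorted(source - imported - failed_ids):
        problems.append(f"{rid}: silently lost")
    # A record in the fail file must carry an error message.
    for rid in sorted(source & failed_ids):
        if not failed[rid].strip():
            problems.append(f"{rid}: in fail file without an error")
    # Anything imported or failed that was never in the source is suspicious.
    for rid in sorted((imported | failed_ids) - source):
        problems.append(f"{rid}: not in the source")
    return problems
```

A chaos test would then assert `reconciliation_problems(...) == []` rather than asserting that the import succeeded, so the test passes whether the injected fault led to a retry or to a fail-file entry.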

Scenarios

  • Toxiproxy harness — add a toxiproxy service to tests/e2e/docker-compose.yml + a fixture that routes the connection through it and injects toxics. (in progress — first PR)
  • Connection reset mid-import (toxiproxy reset_peer) → assert reconciled + recovery. (first scenario)
  • Latency > RPC timeout (toxiproxy latency/timeout) → assert the timeout is handled (fail-file/retry, no crash).
  • Bandwidth throttle / slow_close → large-batch behaviour under a degraded link.
  • Server restart / pause mid-import (podman pause/unpause/restart; add _runtime helpers) → assert no corruption + reconciliation; also exercises checkpoint/resume. (timing-dependent — the flakier one; make deterministic via a checkpoint hook.)
  • DB connection-pool exhaustion / serialization conflict (concurrent load + low pool) → assert the specific _handle_create_error paths.
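
As a rough sketch of what the harness fixture might wrap, the following talks to Toxiproxy's HTTP API using only the standard library. The endpoint paths and toxic fields follow the Toxiproxy README; the API address, proxy names, and all function names are assumptions for illustration and would need to match the compose service.

```python
import json
import urllib.request
from contextlib import contextmanager

TOXIPROXY_API = "http://localhost:8474"  # assumed: Toxiproxy's default API port


def reset_peer_toxic(name="reset", timeout_ms=0, toxicity=1.0):
    """Payload for a reset_peer toxic (TCP RST after `timeout_ms`)."""
    return {"name": name, "type": "reset_peer", "stream": "downstream",
            "toxicity": toxicity, "attributes": {"timeout": timeout_ms}}


def latency_toxic(name="slow", latency_ms=1000, jitter_ms=0, toxicity=1.0):
    """Payload for a latency toxic (delay in milliseconds)."""
    return {"name": name, "type": "latency", "stream": "downstream",
            "toxicity": toxicity,
            "attributes": {"latency": latency_ms, "jitter": jitter_ms}}


def _call(method, path, payload=None, api=TOXIPROXY_API):
    """Send one JSON request to the Toxiproxy API and decode the reply."""
    data = None if payload is None else json.dumps(payload).encode()
    req = urllib.request.Request(api + path, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        body = resp.read()
    return json.loads(body) if body else None


def create_proxy(name, listen, upstream):
    """Create a proxy, e.g. listen='0.0.0.0:18069', upstream='odoo:8069'."""
    return _call("POST", "/proxies",
                 {"name": name, "listen": listen, "upstream": upstream})


@contextmanager
def toxic(proxy, payload):
    """Inject a toxic for the duration of the `with` block, then remove it."""
    _call("POST", f"/proxies/{proxy}/toxics", payload)
    try:
        yield
    finally:
        _call("DELETE", f"/proxies/{proxy}/toxics/{payload['name']}")
```

A scenario could then run the import inside `with toxic("odoo", reset_peer_toxic()):` and check reconciliation afterwards. Because the toxic is added and removed explicitly, the fault window does not depend on timing.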

Notes

  • Partly overlaps the FakeOdoo unit tests (load-failure → fallback → fail-file); the incremental value is validating the real transport (does the pooled httpx client recover from a real reset? does odoolib reconnect? does the timeout actually fire?).
  • podman pause/unpause is available locally; toxiproxy ships as a small Go binary / container image.
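
The runtime helpers mentioned for the pause/restart scenario could look roughly like the sketch below. The function names and the injectable `run` parameter are hypothetical; only the `podman pause`, `podman unpause`, and `podman restart` subcommands are taken as given.

```python
import subprocess
from contextlib import contextmanager

_ACTIONS = {"pause", "unpause", "restart"}


def podman_cmd(action, container):
    """Build the argv for a podman lifecycle command."""
    if action not in _ACTIONS:
        raise ValueError(f"unsupported podman action: {action}")
    return ["podman", action, container]


@contextmanager
def paused(container, run=subprocess.run):
    """Pause the container inside the `with` block and always unpause it.

    `run` is injectable so the helper can be unit-tested without podman.
    """
    run(podman_cmd("pause", container), check=True)
    try:
        yield
    finally:
        run(podman_cmd("unpause", container), check=True)
```

The `finally` clause means a failing assertion inside the block does not leave the Odoo container paused for the next test.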
