Commit c0598eb

test: add property-based safety gates
1 parent 36fc331 commit c0598eb

11 files changed

Lines changed: 366 additions & 49 deletions

AGENTS.md

Lines changed: 1 addition & 2 deletions
@@ -157,8 +157,7 @@ The current trust-hardening priorities are:

 1. periodic external receipt anchoring for journal heads;
 2. MCP registry publication after PyPI or remote-server enablement;
-3. property-based tests for safety gates and malformed venue responses;
-4. postmortem publication when live safety, journal integrity, privacy, or
+3. postmortem publication when live safety, journal integrity, privacy, or
 release integrity is affected.

 Each cycle should update tests, docs, and scorecards with the behavior it

docs/autonomous-os-plan.md

Lines changed: 2 additions & 2 deletions
@@ -79,8 +79,8 @@ speed, scale, history, and reliability.
 | Product narrative | 99 | keep narrative aligned as hosted Network and Intelligence launch |
 | CLI readiness | 100 | five-mode terminal with full-screen live cockpit exists; raw exchange records remain operator-owned proof |
 | Engine runtime | 100 | public production-parity OODA report plus redacted live trading evidence exist; raw exchange records remain operator-owned external proof |
-| Self-evolution loop | 98 | public memory, research command chain, genesis proposal classification, production-parity OODA reports, local apply, rollback, paper-first evolve gates, and agent architecture bounds exist; property-based failure-mode coverage remains |
-| Safety and risk | 96 | autonomous-loop failure taxonomy exists; real exchange chaos drills, property-based safety-gate tests, and external review remain |
+| Self-evolution loop | 98 | public memory, research command chain, genesis proposal classification, production-parity OODA reports, local apply, rollback, paper-first evolve gates, agent architecture bounds, and property-based safety coverage exist; protected live-code evolution remains human-reviewed |
+| Safety and risk | 98 | autonomous-loop failure taxonomy and bounded property-based safety-gate tests exist; real exchange chaos drills and external review remain |
 | API contracts | 100 | public runtime contracts are complete; hosted compatibility is commercial launch work |
 | Deployment | 96 | live Railway proof, external production log-drain evidence |
 | Observability and audit | 97 | checksum-chained runtime bus, hash-chained signable decision journal, local timestamp binding, external anchor packet, verifier, and signed evidence bundles exist; periodic external receipt operation, metrics backend, and log drains remain |

docs/failure-modes-autonomous-loop.md

Lines changed: 5 additions & 5 deletions
@@ -41,13 +41,13 @@ Unknown answers mean fail closed.

 | ID | Failure mode | Detection | Blast radius | Rollback | Journal entry | Alerting | Test or evidence |
 |---|---|---|---|---|---|---|---|
-| FM-AUTO-001 | Agent hallucinates a strategy and burns through paper budget. | Strategy registry rejects unknown runners; paper budget breaker sees order count, notional, or drawdown drift. | Paper budget for the local session. Live capital should be zero because paper-first is enforced. | Pause evolve, disable the strategy, revert the candidate config, reset paper budget after review. | `zero.evolve.run.v1`, `zero.immune.v1`, rejected `zero.paper.decision.v1`, rollback receipt. | CLI/TUI safety banner, `/immune`, metrics counter, optional operator notification. | `engine/tests/test_evolve.py`, `engine/tests/test_safety.py`; property budget fuzzing remains required before 100/100. |
-| FM-AUTO-002 | `evolve` produces a config that passes tests but fails at runtime. | Runtime health marks candidate as failed; production-parity OODA emits live-shadow mismatch or runtime exception. | Candidate branch and paper canary only. Protected paths must not auto-apply to live code. | `zero.evolve.rollback_receipt.v1`, restore original hash, mark proposal quarantined. | Apply receipt, rollback receipt, runtime cycle failure event. | `/runtime-parity`, `/evolve`, CI failure, operator terminal warning. | `engine/tests/test_evolve.py`, `engine/tests/test_runtime.py`; needs property tests for malformed candidate shapes. |
+| FM-AUTO-001 | Agent hallucinates a strategy and burns through paper budget. | Strategy registry rejects unknown runners; paper budget breaker sees order count, notional, or drawdown drift. | Paper budget for the local session. Live capital should be zero because paper-first is enforced. | Pause evolve, disable the strategy, revert the candidate config, reset paper budget after review. | `zero.evolve.run.v1`, `zero.immune.v1`, rejected `zero.paper.decision.v1`, rollback receipt. | CLI/TUI safety banner, `/immune`, metrics counter, optional operator notification. | `engine/tests/test_evolve.py`, `engine/tests/test_safety.py`, `engine/tests/test_property_safety.py`. |
+| FM-AUTO-002 | `evolve` produces a config that passes tests but fails at runtime. | Runtime health marks candidate as failed; production-parity OODA emits live-shadow mismatch or runtime exception. | Candidate branch and paper canary only. Protected paths must not auto-apply to live code. | `zero.evolve.rollback_receipt.v1`, restore original hash, mark proposal quarantined. | Apply receipt, rollback receipt, runtime cycle failure event. | `/runtime-parity`, `/evolve`, CI failure, operator terminal warning. | `engine/tests/test_evolve.py`, `engine/tests/test_runtime.py`, `engine/tests/test_property_safety.py`. |
 | FM-AUTO-003 | Agent and human edit the journal concurrently. | Journal append detects non-monotonic offset, checksum break, lock failure, or replay mismatch. | Local audit trail for the affected runtime; execution must pause if journal integrity is unknown. | Stop writers, preserve both copies, replay from last good head, restore durable volume snapshot if needed. | `zero.journal.integrity_failure.v1` or incident audit export. | P1 journal anomaly alert, CLI refusal on live preflight, runbook escalation. | `engine/tests/test_bus.py`, `docs/runtime-bus.md`; decision-journal hash chain is the next implementation cycle. |
-| FM-AUTO-004 | Hyperliquid returns malformed response and the agent retries N times. | Adapter schema validation fails; retry budget reaches zero; rate-limit breaker opens. | Read-only market/account freshness or one blocked live submission. Order submissions must not retry blindly. | Mark venue degraded, fail risk-increasing actions, keep reduce-only controls available. | `exchange_error`, reconciliation packet, immune breaker event. | `/hl/reconcile`, `/immune`, `/live-cockpit`, metrics exchange-error counter. | `engine/tests/test_hyperliquid.py`, `engine/tests/test_live.py`, `engine/tests/test_reconciliation.py`; malformed-response fuzzing remains required before 100/100. |
-| FM-AUTO-005 | Stale memory promotes an outdated pattern. | Memory stats report stale source window; genesis confidence drops; proposal age exceeds policy. | Proposal quality and paper canary time, not live execution. | Retire stale memory, regenerate proposal from fresh outcomes, require a new paper canary. | `zero.memory.entry.v1`, `zero.genesis.proposal.v1`, research report. | `/memory`, `/genesis`, docs gap or safety-review issue. | `engine/tests/test_memory.py`, `engine/tests/test_genesis.py`; needs property tests for staleness thresholds. |
+| FM-AUTO-004 | Hyperliquid returns malformed response and the agent retries N times. | Adapter schema validation fails; retry budget reaches zero; rate-limit breaker opens. | Read-only market/account freshness or one blocked live submission. Order submissions must not retry blindly. | Mark venue degraded, fail risk-increasing actions, keep reduce-only controls available. | `exchange_error`, reconciliation packet, immune breaker event. | `/hl/reconcile`, `/immune`, `/live-cockpit`, metrics exchange-error counter. | `engine/tests/test_hyperliquid.py`, `engine/tests/test_live.py`, `engine/tests/test_reconciliation.py`, `engine/tests/test_property_safety.py`. |
+| FM-AUTO-005 | Stale memory promotes an outdated pattern. | Memory stats report stale source window; genesis confidence drops; proposal age exceeds policy. | Proposal quality and paper canary time, not live execution. | Retire stale memory, regenerate proposal from fresh outcomes, require a new paper canary. | `zero.memory.entry.v1`, `zero.genesis.proposal.v1`, research report. | `/memory`, `/genesis`, docs gap or safety-review issue. | `engine/tests/test_memory.py`, `engine/tests/test_genesis.py`, `engine/tests/test_property_safety.py`. |
 | FM-AUTO-006 | Research command ingests prompt-injected or unsupported external claims. | Source classifier marks untrusted or unsupported evidence; research report carries evidence quality flags. | Paper-only research report and proposal queue. | Discard report, quarantine source, regenerate with trusted sources only. | `zero.research.report.v1` with rejected source metadata. | `/research`, safety-review issue when live policy would be affected. | `engine/tests/test_research.py`; adversarial source fixtures should be expanded. |
-| FM-AUTO-007 | Model gateway produces unsafe, expensive, or unavailable output. | Gateway budget, timeout, health, and audit checks fail closed. | Evaluation quality degradation; order path must not depend on unverified model output alone. | Fall back to local/mock provider, lower confidence, or reject decision. | Model gateway audit packet and decision rejection reason. | `/model-gateway/health`, metrics, operator warning. | `engine/tests/test_model_gateway.py`; cost-limit property tests remain required. |
+| FM-AUTO-007 | Model gateway produces unsafe, expensive, or unavailable output. | Gateway budget, timeout, health, and audit checks fail closed. | Evaluation quality degradation; order path must not depend on unverified model output alone. | Fall back to local/mock provider, lower confidence, or reject decision. | Model gateway audit packet and decision rejection reason. | `/model-gateway/health`, metrics, operator warning. | `engine/tests/test_model_gateway.py`, `engine/tests/test_property_safety.py`. |
 | FM-AUTO-008 | Paper/live shadow diverges during production-parity OODA. | `zero.runtime.production_parity.v1` reports mismatch or live-shadow fail-closed evidence. | Live promotion blocked; paper session continues. | Disable promotion, capture audit export, create regression fixture. | Runtime parity report, decision-stack packet, live-shadow refusal. | `/runtime-parity`, CLI red status, safety-review issue. | `engine/tests/test_runtime.py`, `engine/tests/test_live.py`. |
 | FM-AUTO-009 | MCP client asks ZERO to place an order or mutate state. | MCP safety catalog has no risk-increasing tools; unknown methods are rejected. | None if server remains read-only. | Keep server read-only, revoke unsafe registry submission, patch transcript. | MCP transcript refusal and safety catalog resource. | MCP smoke failure, CI failure. | `engine/tests/test_mcp.py`, `scripts/mcp_transcript.py --check`. |
 | FM-AUTO-010 | Public Network or Intelligence packet leaks private identifiers. | Privacy regression fixtures detect wallet-like, raw order ID, trace token, or private journal fields. | Public artifact exposure until publication is stopped. | Stop publishing, rotate unsafe packet, patch serializer, mark proof stale. | Public packet hash, privacy regression incident export. | P1 privacy regression alert, CI failure. | `engine/tests/test_proof_privacy.py`, `scripts/proof_privacy_regression.py`. |
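
The FM-AUTO-004 row now points at `engine/tests/test_property_safety.py`, which is not part of the rendered diff. As a rough illustration of the kind of property that row describes (malformed venue payloads fail closed, reads stop at the retry budget, order submissions are never retried), here is a minimal stdlib-only sketch. The schema fields and the names `validate_fill`, `call_venue`, and `random_malformed` are hypothetical and are not taken from the repository; the real tests may use a different library and different shapes.

```python
import math
import random


def validate_fill(payload):
    """Hypothetical schema check: accept only a dict with a non-empty coin
    string and finite, positive float px and sz."""
    if not isinstance(payload, dict):
        return False
    coin = payload.get("coin")
    if not isinstance(coin, str) or not coin:
        return False
    for key in ("px", "sz"):
        value = payload.get(key)
        if not isinstance(value, float) or not math.isfinite(value) or value <= 0:
            return False
    return True


def call_venue(fetch, retry_budget, is_submission):
    """Hypothetical adapter wrapper: reads may retry up to the budget,
    order submissions get exactly one attempt."""
    attempts_allowed = 1 if is_submission else 1 + retry_budget
    attempts = 0
    for _ in range(attempts_allowed):
        attempts += 1
        payload = fetch()
        if validate_fill(payload):
            return {"status": "ok", "attempts": attempts, "payload": payload}
    # Fail closed: nothing unvalidated is passed downstream.
    return {"status": "degraded", "attempts": attempts, "payload": None}


def random_malformed(rng):
    """Draw one payload that violates the hypothetical schema."""
    candidates = [
        None, [], "", 42, {},
        {"coin": "BTC"},
        {"coin": "BTC", "px": "1.0", "sz": 1.0},
        {"coin": 7, "px": 1.0, "sz": 1.0},
        {"coin": "BTC", "px": float("nan"), "sz": 1.0},
        {"coin": "BTC", "px": 1.0, "sz": -rng.random() - 0.1},
    ]
    return rng.choice(candidates)


def check_retry_properties(trials=500, seed=0):
    """Bounded, seeded property check over random budgets and payloads."""
    rng = random.Random(seed)
    for _ in range(trials):
        budget = rng.randint(0, 5)
        is_submission = rng.random() < 0.5
        result = call_venue(lambda: random_malformed(rng), budget, is_submission)
        assert result["status"] == "degraded"
        assert result["payload"] is None
        expected = 1 if is_submission else 1 + budget
        assert result["attempts"] == expected
    return trials


if __name__ == "__main__":
    print(check_retry_properties(), "trials passed")
```

A fixed seed keeps the run deterministic, which is one way to read "bounded" property coverage; a generator-based library would additionally shrink failing inputs.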

docs/launch-scorecard.md

Lines changed: 8 additions & 5 deletions
@@ -10,14 +10,14 @@ reserved for ZERO Intelligence.

 ## Current Autonomous Trust Score

-**97/100**
+**98/100**

 The launch repository is ready for serious contributors. The stricter
 autonomous-systems trust bar is not complete until journal-head anchor packets
-are attached to trusted external receipts on a periodic operator cadence; the
-MCP server has a live registry listing after package or remote publication; and
-every safety gate has deterministic or property-based coverage for the
-documented failure modes.
+are attached to trusted external receipts on a periodic operator cadence and
+the MCP server has a live registry listing after package or remote publication.
+Bounded property-based coverage now exists for the core safety gates and
+documented malformed-input paths.

 ## Ready

@@ -62,6 +62,9 @@ documented failure modes.
 incident-postmortem publication policy
 - Hash-chained decision journals, operator-owned signing hooks, local timestamp
 bindings, external anchor packets, verifier CLI, and tamper tests
+- Property-based safety-gate tests for risk budgets, malformed Hyperliquid
+responses, memory staleness, dry-run order validation, and model-gateway
+retry/privacy behavior
 - Dependency and supply-chain policy with vulnerability response rules
 - One-line CLI install path with checksum and attestation verification
 - Registry-readiness gate for PyPI/Cargo metadata and package-channel guardrails
