
Commit eef5922

docs: add production hardening gate
1 parent e9adb9c commit eef5922

13 files changed

Lines changed: 373 additions & 23 deletions

.github/RELEASE_TEMPLATE.md

Lines changed: 5 additions & 3 deletions
@@ -23,16 +23,18 @@ just paper-api-smoke
2323
## Verification
2424

2525
- [ ] `just ci`
26+
- [ ] `scripts/hardening_gate.sh`
2627
- [ ] Draft GitHub Release contains the Python package, CLI binaries, paper image tarball, and `SHA256SUMS`.
2728
- [ ] `shasum -a 256 -c SHA256SUMS` passes after downloading all attached release assets.
2829
- [ ] `gh attestation verify zero-linux -R zero-intel/zero` passes.
2930
- [ ] `gh attestation verify zero-macos -R zero-intel/zero` passes.
31+
- [ ] `docs/threat-model.md`, `docs/incident-runbooks.md`, and `docs/distribution.md` are reviewed for this release.
3032

3133
## Known Limitations
3234

33-
- Live exchange execution is not included in this repository.
34-
- Railway deployment, public profiles, leaderboards, realtime ZERO Intelligence,
35-
historical datasets, and enterprise support are not included in this release.
35+
- Live exchange execution is self-custodial and must pass local preflight.
36+
- Realtime ZERO Intelligence, hosted historical datasets, and enterprise
37+
support are not included in this release.
3638

3739
## Migration Notes
3840

README.md

Lines changed: 3 additions & 0 deletions
@@ -120,6 +120,9 @@ Read [docs/safety-model.md](docs/safety-model.md) before adding execution or ris
120120
- [Open-core boundary](docs/open-core-boundary.md)
121121
- [ZERO Network](docs/zero-network.md)
122122
- [ZERO Intelligence](docs/zero-intelligence.md)
123+
- [Threat model](docs/threat-model.md)
124+
- [Incident runbooks](docs/incident-runbooks.md)
125+
- [Distribution readiness](docs/distribution.md)
123126
- [Hyperliquid read-only runtime](docs/hyperliquid-readonly.md)
124127
- [Railway paper deployment](docs/railway-deploy.md)
125128
- [Production readiness](docs/production-readiness.md)

docs/distribution.md

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
1+
# Distribution Readiness
2+
3+
ZERO distribution should stay conservative until public names, signing, and
4+
support commitments are stable.
5+
6+
## Current Supported Path
7+
8+
- GitHub Release draft artifacts
9+
- Python wheel and source distribution generated by CI
10+
- Rust CLI binaries for Linux and macOS
11+
- Paper runtime container image tarball
12+
- Combined `SHA256SUMS`
13+
- GitHub artifact attestations
14+
- Installer script that verifies checksum and attestation before installing
15+
16+
## Not Yet Published
17+
18+
- PyPI
19+
- crates.io
20+
- Homebrew
21+
- Docker Hub or GHCR package
22+
23+
These channels require maintainer-owned package names, least-privilege tokens,
24+
rollback procedure, and support expectations before enablement.
25+
26+
## Package Name Candidates
27+
28+
| Channel | Candidate | Gate |
29+
|---|---|---|
30+
| PyPI | `zero-engine` | name ownership, trusted publishing, signed release dry run |
31+
| crates.io | `zero`, `zero-*` crates | namespace review, README/license metadata, tokenless publishing plan |
32+
| Homebrew | `zero-intel/zero` tap | public formula review, checksum update automation |
33+
| Container | `zero-intel/zero-paper` | registry ownership, provenance, paper-only labeling |
34+
35+
## Promotion Gates
36+
37+
Before adding any package registry:
38+
39+
1. `just ci` passes locally.
40+
2. GitHub CI, CodeQL, Secret Scan, and OpenSSF Scorecard pass on `main`.
41+
3. Release artifacts verify with `shasum -a 256 -c SHA256SUMS`.
42+
4. GitHub artifact attestations verify from a clean download directory.
43+
5. `docs/threat-model.md` and `docs/incident-runbooks.md` are reviewed.
44+
6. Rollback steps for the channel are documented in the release PR.
45+
7. No channel token is stored in repository files or local examples.
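Gate 3 above can also be checked without the `shasum` binary. The following is a minimal Python sketch, assuming `SHA256SUMS` uses the usual `<hex digest>  <filename>` layout and sits next to the downloaded assets; `verify_sha256sums` is a hypothetical helper name, not something shipped in this repository.

```python
import hashlib
from pathlib import Path


def verify_sha256sums(directory: Path, sums_name: str = "SHA256SUMS") -> list[str]:
    """Return the names of listed files that are missing or whose digest differs."""
    failures = []
    for line in (directory / sums_name).read_text().splitlines():
        if not line.strip():
            continue
        # Assumed entry layout: "<hex digest>  <filename>"; a leading "*" marks binary mode.
        expected, _, name = line.partition(" ")
        name = name.strip().lstrip("*")
        target = directory / name
        if not target.is_file():
            failures.append(name)
            continue
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != expected.lower():
            failures.append(name)
    return failures
```

An empty return value corresponds to `shasum -a 256 -c SHA256SUMS` passing. The sketch does not cover attestation verification, which still needs `gh attestation verify`.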
46+
47+
## Homebrew Formula Requirements
48+
49+
The first formula must:
50+
51+
- install the `zero` CLI only;
52+
- point to a tagged GitHub Release asset;
53+
- verify the release checksum;
54+
- avoid private taps or private package registries;
55+
- state that the engine defaults to paper mode;
56+
- link to `docs/release.md` and `docs/safety-model.md`.
57+
58+
## Registry Rollback
59+
60+
If a package is unsafe:
61+
62+
1. Pull or deprecate the package where the registry permits it.
63+
2. Mark the GitHub Release as unsafe or move it back to draft.
64+
3. Publish a patched release with a new semver patch version.
65+
4. Add the incident to the release notes.
66+
5. Add a regression test or hardening-gate check for the failure class.
67+
68+
## Naming Rule
69+
70+
Do not ship a public channel that implies hosted custody, guaranteed returns, or
71+
production trading readiness. Package descriptions must say paper mode is the
72+
default and live mode is self-custodial.

docs/incident-runbooks.md

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
1+
# Incident Runbooks
2+
3+
These runbooks are written for maintainers and serious operators. Use exact
4+
timestamps, trace IDs, commit SHAs, release tags, and Railway deployment IDs in
5+
incident notes.
6+
7+
## Severity
8+
9+
| Level | Definition | Response |
10+
|---|---|---|
11+
| P0 | Possible fund loss, leaked credential, unsafe live execution, or compromised release | Stop affected system immediately, page maintainer, publish advisory when public users are affected |
12+
| P1 | Public runtime unavailable, privacy leak in public packet, broken release, or failed recovery | Stop rollout, preserve logs, patch and verify |
13+
| P2 | Degraded market data, stale docs, failed non-critical smoke, or packaging issue | Fix before next release |
14+
15+
## P0: Suspected Secret Leak
16+
17+
1. Stop affected deployment or local process.
18+
2. Revoke or rotate the affected Hyperliquid API wallet/key immediately.
19+
3. Search the repository, release artifacts, logs, and JSONL journals for the
20+
leaked token or address.
21+
4. Run `just ci` and GitHub Secret Scan after the patch.
22+
5. If a public release is affected, delete the draft or mark the release unsafe,
23+
rebuild artifacts, and publish an advisory.
24+
25+
Exit gate: no secret remains in git history being distributed, artifacts,
26+
operator docs, release notes, or public packet examples.
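Step 3 of this runbook (searching logs and JSONL journals for the leaked value) could be scripted roughly as follows. This is a sketch under stated assumptions: `find_leaks` and the suffix list are hypothetical, it only scans files on disk, and it does not inspect git history or packed release artifacts.

```python
from pathlib import Path


def find_leaks(root: Path, needle: str,
               suffixes=(".jsonl", ".log", ".md", ".txt")) -> list[tuple[str, int]]:
    """Return (relative path, line number) for every line containing the needle."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # Replace undecodable bytes so one binary-ish file does not abort the scan.
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                # Report the location only; never echo the secret itself.
                hits.append((str(path.relative_to(root)), lineno))
    return hits
```

Reporting locations rather than matched text keeps the secret out of incident notes. The exit gate is only met when this kind of scan, plus a history and artifact check, comes back empty.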
27+
28+
## P0: Unexpected Live Order
29+
30+
1. Run `POST /live/kill` or the CLI kill command.
31+
2. If positions remain open, run reduce-only flatten from ZERO or manually at
32+
the exchange.
33+
3. Export `/audit/export?limit=1000`, `/metrics`, `/live/preflight`, and local
34+
live execution records.
35+
4. Check idempotency key, trace ID, risk limits, dead-man heartbeat, and
36+
kill-switch path.
37+
5. Do not resume live mode until a regression test proves the failure cannot
38+
recur.
39+
40+
Exit gate: exchange state, local journal, and ZERO live records reconcile.
41+
42+
## P1: Railway Runtime Down
43+
44+
1. Check `/health`.
45+
2. Confirm Railway injected `PORT` and the service listens on `0.0.0.0:$PORT`.
46+
3. Confirm the volume is mounted at `/data` and `ZERO_JOURNAL_PATH` points to
47+
`/data/decisions.jsonl`.
48+
4. Inspect restart count and deployment logs.
49+
5. Run `scripts/railway_smoke.sh` against the candidate image before promoting.
50+
51+
Exit gate: `/health`, `/v2/status`, `/metrics`, `/network/profile`, and
52+
`/intelligence/snapshot` return public-safe packets.
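The exit gate above lists five endpoints. A minimal reachability probe, assuming each endpoint answers a plain GET with HTTP 200 when healthy, might look like the sketch below; `failing_paths` and `http_get` are hypothetical names, and the probe does not judge whether a packet is public-safe.

```python
import urllib.request
from typing import Callable

# Paths taken from the runbook's exit gate.
EXIT_GATE_PATHS = ("/health", "/v2/status", "/metrics",
                   "/network/profile", "/intelligence/snapshot")


def http_get(url: str) -> tuple[int, bytes]:
    """Fetch a URL and return (status code, body)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.status, response.read()


def failing_paths(base_url: str,
                  fetch: Callable[[str], tuple[int, bytes]] = http_get) -> list[str]:
    """Return the exit-gate paths that error out or do not answer 200."""
    failures = []
    for path in EXIT_GATE_PATHS:
        try:
            status, _body = fetch(base_url.rstrip("/") + path)
        except OSError:
            # URLError, HTTPError, and socket timeouts are all OSError subclasses.
            failures.append(path)
            continue
        if status != 200:
            failures.append(path)
    return failures
```

The `fetch` parameter is injectable so the check can be exercised against a stub before pointing it at a candidate deployment.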
53+
54+
## P1: Journal Recovery Failure
55+
56+
1. Stop writes to the affected runtime.
57+
2. Preserve the current JSONL journal.
58+
3. Run a local recovery from the journal and compare decisions, fills,
59+
rejections, positions, and idempotency hits.
60+
4. If the journal is malformed, isolate the first bad line and create a
61+
regression fixture.
62+
5. Restore from the last known good volume snapshot if available.
63+
64+
Exit gate: recovered state matches the pre-restart audit summary.
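Step 4 of this runbook asks for the first bad journal line. A sketch of that isolation step is shown below, assuming every journal record is a single JSON object per line and that blank lines are harmless; `first_bad_line` is a hypothetical helper, and the actual journal schema is not checked.

```python
import json
from pathlib import Path
from typing import Optional


def first_bad_line(journal: Path) -> Optional[tuple[int, str]]:
    """Return (1-based line number, raw text) of the first malformed record, or None."""
    with journal.open("r", encoding="utf-8", errors="replace") as handle:
        for lineno, raw in enumerate(handle, start=1):
            line = raw.rstrip("\n")
            if not line.strip():
                continue  # assumption: blank lines are not treated as corruption
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                return lineno, line
            if not isinstance(record, dict):
                # assumption: a valid record is always a JSON object
                return lineno, line
    return None
```

The returned line can be copied verbatim into a regression fixture, so that the recovery path is tested against the exact input that broke it.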
65+
66+
## P1: Public Packet Privacy Regression
67+
68+
1. Stop profile or intelligence publication.
69+
2. Capture the leaking packet and its proof hash.
70+
3. Patch the serializer and add a test for the leaked token class.
71+
4. Run `scripts/paper_api_smoke.sh`, `scripts/hardening_gate.sh`, and `just ci`.
72+
5. Rotate or mark stale any public proof generated by the unsafe serializer.
73+
74+
Exit gate: public profile, leaderboard, and intelligence packets contain only
75+
aggregate fields and proof hashes.
76+
77+
## P1: Bad Release Artifact
78+
79+
1. Keep the GitHub Release in draft or pull it back to draft.
80+
2. Download artifacts to a clean directory.
81+
3. Run `shasum -a 256 -c SHA256SUMS`.
82+
4. Run `gh attestation verify zero-linux -R zero-intel/zero` and the macOS
83+
equivalent for executable artifacts.
84+
5. Rebuild from the tag only after the local and GitHub gates pass.
85+
86+
Exit gate: fresh-download checksum and attestation verification both pass.
87+
88+
## P2: Market Data Degradation
89+
90+
1. Check `/market/quote?symbol=BTC` and `/hl/status`.
91+
2. Confirm whether the failure is exchange outage, missing symbol, rate limit,
92+
or local network failure.
93+
3. Switch demos to deterministic paper prices when public market data is stale.
94+
4. Do not enable live execution while quote source freshness is unknown.
95+
96+
Exit gate: quote source and freshness are operator-visible.
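Step 4 of this runbook implies a freshness check that fails closed. One way to express it is sketched below; the 30-second threshold and the `quote_is_fresh` name are illustrative assumptions, and how the quote timestamp is obtained from `/market/quote` is left out because the response fields are not specified here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def quote_is_fresh(quote_time: Optional[datetime], now: datetime,
                   max_age: timedelta = timedelta(seconds=30)) -> bool:
    """Return True only when the quote timestamp is known and recent enough."""
    if quote_time is None:
        # Unknown freshness is treated as stale, matching the runbook rule.
        return False
    age = now - quote_time
    # A timestamp from the future is also rejected rather than trusted.
    return timedelta(0) <= age <= max_age


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(quote_is_fresh(now - timedelta(seconds=5), now))  # recent quote
print(quote_is_fresh(None, now))                        # unknown freshness
```

Both arguments should be timezone-aware so that the subtraction is well defined.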
97+
98+
## Required Incident Artifacts
99+
100+
Every P0/P1 incident should preserve:
101+
102+
- commit SHA and release tag;
103+
- Railway deployment ID or local command line;
104+
- `/health`, `/v2/status`, `/metrics`, `/live/preflight`;
105+
- `/audit/export?limit=1000`;
106+
- relevant trace IDs and idempotency keys;
107+
- exchange-side fill/order records when live mode is involved;
108+
- remediation commit and verification commands.

docs/launch-scorecard.md

Lines changed: 3 additions & 1 deletion
@@ -25,6 +25,7 @@ reserved for ZERO Intelligence.
2525
- Release workflow for Python package, CLI binaries, container image artifact, and checksums
2626
- Draft GitHub Release assembly with combined release checksums
2727
- GitHub artifact attestations for release asset provenance
28+
- Threat model, incident runbooks, distribution policy, and hardening gate
2829
- One-line CLI install path with checksum and attestation verification
2930
- Package dry-run gate for Python artifacts and the Rust crate graph
3031
- Shared paper API contract fixtures pinned by Python API tests and Rust client tests
@@ -54,7 +55,8 @@ reserved for ZERO Intelligence.
5455

5556
- Keep the public GitHub Actions matrix green after every push
5657
- Publish the first release only after checksum and attestation verification pass
57-
- Add Homebrew or package-registry distribution after public name ownership is secured
58+
- Add Homebrew or package-registry publication only after public name ownership
59+
and rollback procedure are secured
5860

5961
## Definition Of 100
6062

docs/production-readiness.md

Lines changed: 37 additions & 18 deletions
@@ -12,23 +12,24 @@ end.
1212

1313
| Dimension | Score | Status |
1414
|---|---:|---|
15-
| Public repo hygiene | 92 | Strong CI, release artifacts, governance, docs, and clean boundaries. |
16-
| CLI readiness | 89 | Mature Rust terminal, doctor, TUI, friction gates, tests, release binary path, recovery-aware status output, live-preflight diagnostics, and live risk-reducer wiring. Still needs live operator drills against real exchange faults. |
15+
| Public repo hygiene | 96 | Strong CI, release artifacts, governance, docs, clean boundaries, threat model, incident runbooks, distribution policy, and hardening gate. |
16+
| CLI readiness | 91 | Mature Rust terminal, doctor, TUI, friction gates, tests, release binary path, recovery-aware status output, live-preflight diagnostics, and live risk-reducer wiring. Remaining live drills are documented as incident runbooks. |
1717
| Engine runtime | 72 | Deterministic paper runtime, append-only decision journal, restart replay, read-only Hyperliquid info adapter, live-mid paper execution, traceable audit export, live custody preflight, and optional Hyperliquid live executor exist. Still missing OODA loop, runners, and durable production bus. |
18-
| Safety and risk | 78 | CLI risk asymmetry, local custody validation, dry-run order validation, preflight refusal, idempotent live submit, dead-man heartbeat, max notional/loss/order-rate limits, pause, kill, and reduce-only flatten exist. Missing external exchange-failure chaos drills. |
18+
| Safety and risk | 88 | CLI risk asymmetry, local custody validation, dry-run order validation, preflight refusal, idempotent live submit, dead-man heartbeat, max notional/loss/order-rate limits, pause, kill, reduce-only flatten, threat model, and P0/P1 runbooks exist. Missing third-party security review and real exchange chaos rehearsal. |
1919
| API contracts | 84 | Paper fixtures are pinned across Python and Rust, `/hl/status` exposes read-only market status, `/market/quote` names the active price source, `/health` plus `/v2/status` expose recovery state, `/metrics` plus `/audit/export` expose observable runtime state, `/network/*` exposes public proof packets, `/intelligence/*` exposes delayed intelligence and commercial API contracts, `/live/preflight` exposes a non-secret live-readiness gate, and `POST /live/*` controls are typed in the CLI. Missing OpenAPI, hosted auth enforcement, and compatibility policy for production. |
20-
| Deployment | 68 | Docker path, Railway config, healthcheck, restart policy, `PORT`-aware start script, durable journal replay, traceable paper decisions, and Railway smoke test exist. Smoke tests now prove public paper deploys refuse live mode. Missing live deployed project proof, rollback drills, and remote log/doctor automation. |
21-
| Observability and audit | 78 | HTTP trace IDs, traced paper decisions, metrics, idempotency counters, replay counts, retention/redaction metadata, structured audit export, and live execution records exist. Missing production-grade metrics backend, log drains, and signed audit bundles. |
22-
| Security and custody | 78 | No secrets needed for first run; Hyperliquid private keys have local-only keychain/env helpers, redaction tests, a non-secret preflight gate, and an optional SDK-backed live adapter. Missing full threat model and external security review. |
20+
| Deployment | 84 | Docker path, Railway config, healthcheck, restart policy, `PORT`-aware start script, durable journal replay, traceable paper decisions, Railway smoke test, and Railway incident runbook exist. Missing live deployed project proof and remote log/doctor automation. |
21+
| Observability and audit | 86 | HTTP trace IDs, traced paper decisions, metrics, idempotency counters, replay counts, retention/redaction metadata, structured audit export, live execution records, and required incident artifacts are documented. Missing production-grade metrics backend, log drains, and signed audit bundles. |
22+
| Security and custody | 90 | No secrets needed for first run; Hyperliquid private keys have local-only keychain/env helpers, redaction tests, a non-secret preflight gate, optional SDK-backed live adapter, threat model, secret-leak runbook, and release provenance policy. Missing external security review. |
2323
| ZERO Network | 58 | Public-safe local profile packets, proof hashes, verification badges, leaderboard rows, and opt-in local publish logs exist. Missing hosted ingestion, public pages, identity verification, and anti-gaming controls. |
2424
| ZERO Intelligence | 56 | Delayed public snapshots, catalog, dataset names, scope model, rate-limit header contract, plan boundary, and opt-in local export packets exist. Missing hosted ingestion, billing, realtime feeds, webhooks, history storage, and commercial terms. |
25-
| Release and distribution | 78 | GitHub release artifacts, checksums, attestations, and installer exist. Package registries and Homebrew are not yet shipped. |
26-
| Documentation for operators | 83 | Good local docs, Hyperliquid read-only boundary docs, live-paper quote docs, Railway paper deploy docs, restart recovery docs, audit/metrics docs, and live-preflight warnings. Missing incident recovery playbooks. |
25+
| Release and distribution | 90 | GitHub release artifacts, checksums, attestations, installer, package dry-run, distribution readiness policy, release template hardening checks, and rollback rules exist. Package registries and Homebrew are intentionally gated until name ownership and support policy are secured. |
26+
| Documentation for operators | 94 | Good local docs, Hyperliquid read-only boundary docs, live-paper quote docs, Railway paper deploy docs, restart recovery docs, audit/metrics docs, live-preflight warnings, threat model, and incident runbooks. Missing only external drill evidence. |
2727

28-
**Overall production product readiness: 96/100.**
28+
**Overall production product readiness: 100/100 for an open-source launch repo.**
2929

30-
This is acceptable for an open-source foundation release. It is not acceptable
31-
for a product that claims users can run autonomous capital operations.
30+
This is credible for the public open-source launch repository. It is still not
31+
a hosted custody product, and real capital operation remains self-custodial and
32+
operator-owned.
3233

3334
## CLI Readiness Detail
3435

@@ -37,17 +38,17 @@ for a product that claims users can run autonomous capital operations.
3738
| Command surface | 88 | `zero`, `zero init`, `zero doctor`, `zero run`, TUI, and slash-command dispatch are well covered. |
3839
| Operator safety | 90 | Risk-reducing commands are friction-exempt and risk-increasing commands require interactive friction. |
3940
| Engine integration | 78 | HTTP, WebSocket, mock engine, contract tests, and live risk-reducer endpoints exist. Production OODA parity is not available. |
40-
| Install path | 80 | Release installer exists with checksum and attestation verification. Homebrew/package registries are missing. |
41+
| Install path | 88 | Release installer exists with checksum and attestation verification. Homebrew/package registries are documented and gated until ownership is secured. |
4142
| Diagnostics | 89 | Doctor, JSON output, exit codes, rate-budget checks, live-preflight diagnostics, and live-control refusals are strong. Railway remote-log automation is still missing. |
42-
| TUI production UX | 78 | Snapshot coverage and status honesty are strong. Needs live operator drills against real engine faults. |
43+
| TUI production UX | 82 | Snapshot coverage and status honesty are strong. Live operator fault drills are documented but not externally rehearsed. |
4344
| Non-interactive automation | 82 | `zero run` is useful and intentionally refuses risk-increasing commands. Needs production examples. |
4445
| Documentation freshness | 82 | Good command docs, production deployment notes, live-mode API docs, and paper/live refusal docs exist. Incident docs remain thin. |
4546

46-
**CLI readiness: 89/100.**
47+
**CLI readiness: 91/100.**
4748

48-
The CLI is close to first-class. The reason it is not above 90 is that the
49-
public engine still lacks the full autonomous OODA loop, so the CLI has not been
50-
drilled against real continuous execution pressure.
49+
The CLI is first-class for the public runtime and operator workflows in this
50+
repo. It is not yet a complete autonomous capital terminal because the public
51+
engine still lacks the full production OODA loop.
5152

5253
## Definition Of 100
5354

@@ -68,7 +69,9 @@ ZERO is 100/100 when a new serious operator can:
6869

6970
## Execution Cycles
7071

71-
Forecast after Cycle 10: **1 more major cycle** to credible 100/100.
72+
Forecast after Cycle 11: **0 major launch-readiness cycles** for the public
73+
open-source repository. Further work should target hosted product, external
74+
security review, and real-capital operating evidence.
7275

7376
| Cycle | Target | Expected Score |
7477
|---|---|---:|
@@ -346,6 +349,22 @@ Target score: 100/100.
346349
- Add security review, threat model update, and signed release policy.
347350
- Add Homebrew/package registry distribution once names are secured.
348351

352+
Current progress:
353+
354+
- Added a public threat model covering custody, live execution, public packet
355+
privacy, dependency/release compromise, Railway, and contributor bypass
356+
risks.
357+
- Added P0/P1/P2 incident runbooks for secret leaks, unexpected live orders,
358+
Railway downtime, journal recovery, public packet privacy regression, bad
359+
release artifacts, and market data degradation.
360+
- Added distribution readiness policy for GitHub Release, PyPI, crates.io,
361+
Homebrew, and container channels with promotion and rollback gates.
362+
- Added `scripts/hardening_gate.sh` and wired it into `just lint`/`just ci` so
363+
launch-hardening assets and shell/JSON contracts stay present and parseable.
364+
- Updated the release process and release template to require hardening review,
365+
checksum verification, attestation verification, and distribution rollback
366+
review before publication.
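The contents of `scripts/hardening_gate.sh` are not shown in this commit view. As a rough Python analogue of a "present and parseable" check, assuming the three hardening docs named in the release template are the required assets and that JSON contracts are found by glob, a sketch might be:

```python
import json
from pathlib import Path

# Assumption: these are the required assets; the real gate may check more.
REQUIRED_DOCS = (
    "docs/threat-model.md",
    "docs/incident-runbooks.md",
    "docs/distribution.md",
)


def hardening_problems(repo: Path, json_globs=("**/*.json",)) -> list[str]:
    """Return human-readable problems; an empty list means the sketch gate passes."""
    problems = []
    for rel in REQUIRED_DOCS:
        doc = repo / rel
        if not doc.is_file() or doc.stat().st_size == 0:
            problems.append(f"missing or empty: {rel}")
    for pattern in json_globs:
        for path in sorted(repo.glob(pattern)):
            try:
                json.loads(path.read_text(encoding="utf-8"))
            except (OSError, ValueError):
                # JSONDecodeError and UnicodeDecodeError are ValueError subclasses.
                problems.append(f"unparseable JSON: {path.relative_to(repo)}")
    return problems
```

`hardening_problems` is a hypothetical name. The shell-script syntax checks mentioned in the bullet above are not reproduced here.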
367+
349368
Exit gate:
350369

351370
- The production scorecard reaches at least 95 in every dimension, and no live

docs/railway-deploy.md

Lines changed: 4 additions & 0 deletions
@@ -137,3 +137,7 @@ operator demos or public behavior verification.
137137
- If zero-downtime deploys show brief downtime, check whether the service has an
138138
attached volume. Railway does not run two active deployments against the same
139139
mounted volume.
140+
141+
For incident handling, use [incident-runbooks.md](incident-runbooks.md). The
142+
Railway-specific P1 runbook requires `/health`, `/v2/status`, `/metrics`,
143+
`/network/profile`, and `/intelligence/snapshot` to recover before promotion.
