refactor(coordinator): modularize monoliths + de-duplicate helpers#261
refactor(coordinator): modularize monoliths + de-duplicate helpers#261Gajesh2007 wants to merge 14 commits into
Conversation
Behavior-preserving foundation for the coordinator modularization. Creates
canonical homes that later phases collapse onto, and removes verified
copy-paste duplication.
- api/httpresp.go: move writeJSON + errorResponse/withParam/withCode out of
consumer.go; add decodeJSONBody (collapses ~22 clean decode-or-400 sites).
- api/cache.go: add writeCachedJSONResult; collapse 9 cached-GET marshal/cache/
write tails (per-site 500 message preserved).
- api/auth_helpers.go: home for resolveAccountID/isAdmin/requirePrivyUser/
isAdminAuthorized/adminKeyMatches. Delete the byte-identical requireAdminKey
twin; route 9 inline Privy gates + inline admin-key compares through the
shared helpers; requirePublishingAPIKey uses extractBearerToken.
- payments/usd.go: canonical USDToMicro (rounds), MicroToUSD, FormatUSD;
CalculateCost delegates to calculateCost; consumer/stripe money helpers
routed through payments.
Two intentional, test-pinned deltas (all else byte-identical):
1. ~4 decode sites that returned bare "invalid JSON" now return
"invalid JSON: <err>" (uniform 400).
2. 5 billing sites now round USD->micro instead of truncating
(invite create, stripe withdraw, checkout/admin-credit/admin-reward).
Build, vet, gofmt, and go test ./coordinator/api/... ./coordinator/payments/...
all green. Reviewed by codex-rescue + independent agent.
…-domain fragments
Declarations-only. The 1049-line interface.go is split into per-domain files,
each holding its record/DTO types plus an interface fragment; interface.go now
composes them by embedding:
type Store interface { APIKeyStore; UsageStore; LedgerStore; ... }
New files: apikeys, usage, ledger, referrals, billing_sessions, pricing,
model_registry, releases, users, stripe_withdrawals, device_codes, invites,
provider_earnings, provider_tokens, providers, reputation, log_reports,
telemetry. ProviderLocation now has a single home (providers.go);
ErrInsufficientBalance stays a package var (ledger.go).
Zero method-set change: proven by `var _ Store` in both impls plus a
whole-module compile of every caller. All 313 JSON struct tags verified
byte-identical. Build, vet, gofmt, go test ./coordinator/store/... green.
Collapse three byte-identical SHA-256 hashing copies (memory.sha256Hex, postgres.hashKey, and the inline body of LegacyAccountID) into one exported store.HashKey in apikey.go. Route every API-key and provider-token hashing site through it so a key hashed by one backend always resolves under the other. Build + go test ./coordinator/store/... green.
…into per-domain files + dedup Both monolith impls are split along the same per-domain boundaries as the Phase 1 interface fragments, mirroring each other file-for-file: postgres.go 3561 -> 663 lines (+ 17 postgres_*.go files) memory.go 2641 -> 211 lines (+ 17 memory_*.go files) Each split is a strict PURE MOVE (verified deletion-only on the source files; no method body, SQL string, or comment changed). manifestFromRecord moved to a shared model_registry_common.go (used by both backends). Dedup collapses (logic-touching, verified): - providerColumns const + scanProviderRow(rowScanner): the 26-column provider SELECT/Scan, previously copy-pasted across 4 read methods, now lives once. - NormalizeLeaderboardLimit: the <=0||>200 -> 50 clamp shared by both backends. Verified: gofmt clean, go build/vet ./coordinator/..., and go test ./coordinator/store/... -race against a throwaway Postgres 16 (DATABASE_URL set) — all green. (Pre-existing flaky integration test TestPostgresRecordUsage, which asserts insertion order against an ORDER BY created_at DESC query, is unrelated and excluded; it fails identically on master.)
…uler.go into per-domain files
Pure-move split (verified deletion-only on the sources; no body/comment/signature
changed; mutex discipline preserved by construction):
registry.go 2809 -> 93 lines (+ pending_request, provider, concurrency,
privacy_policy, trust, catalog, clamp, persistence, lifecycle, trust_state,
model_load, counters, snapshots, scoring)
scheduler.go 947 -> 3 lines (+ routing_constants, routing_types, reserve,
routing_log, snapshot, admission, cost, tps, quick_capacity, drain)
scoring.go isolates the legacy FindProvider/FindProviderWithTrust/ScoreProvider
trio (zero production callers) ahead of their removal in 3.4.
Build, vet, gofmt, go test ./coordinator/registry/... -race all green.
…rio, migrate tests Remove FindProvider, FindProviderWithTrust, and ScoreProvider (the legacy multiplicative-scoring selection path) — they had zero production callers; the live request path uses the cost-based ReserveProviderEx. scoring.go is deleted; DefaultMaxConcurrent / MaxConcurrentRequests (still used) move to concurrency.go. New read-only registry.SelectProvider(model) backed by the live selectBestCandidateLockedFull replaces FindProvider as the test/diagnostic "select a routable provider" oracle — so tests now exercise the production selection logic instead of the deleted formula. Test migration (~100 sites across registry + api): - FindProvider(model) -> SelectProvider(model); the 2 FindProviderWithTrust sites -> SelectProvider (registry MinTrustLevel=Hardware already enforced it). - Tests asserting the old serving-status side-effect now use ReserveProvider (the production select-and-serve path). - TestCrashedOnlyProviderNotRouted now asserts the live behavior (crashed-only slot has infinite cost -> excluded), replacing the legacy "still routes" claim. - Deleted tests of the dead scoring formula (warm-bonus, decode-TPS preference, thermal/memory/GPU/crashed score factors, BenchmarkScoreProvider): the live cost path expresses these via slot-state penalties, which slotless test providers cannot exercise (and near-tie costs are randomized). go build/vet green; go test ./coordinator/registry/... -race -count=5 and the full ./coordinator/api/... suite (throwaway Postgres) pass.
…ligibility gate The structural routing gate (status / trust >= MinTrustLevel / runtime-verified / private-text / challenge freshness [/ serves-catalog]) was copy-pasted across snapshotProviderLocked, providerCanAdmitLocked, QuickCapacityCheck, and ModelCapacitySnapshot. Extract two locked predicates in admission.go: providerEligibleLocked(p, now) — the provider-level gates providerRoutableLocked(p, model, now) — eligible + serves model from catalog and route all four sites through them. QuickCapacityCheck keeps the separate concurrency-headroom check so it still distinguishes structural rejections from capacity rejections; ModelCapacitySnapshot uses the provider-level predicate (it checks per-model catalog membership while enumerating models). Behavior-preserving: identical gate set, single source of truth. go build/vet, go test ./coordinator/registry/... -race -count=3, and the full ./coordinator/api/... suite (throwaway Postgres) all green.
Address Phase 3 adversarial review: - BLOCKER: e2e/integration_test.go used the deleted registry.FindProvider (the e2e package is in the same module but outside the coordinator/ grep scope of the migration) — repoint to SelectProvider. Restores `go vet ./e2e/`. - TestMultiProvider_TrustLevelFiltering: the second case had collapsed to a no-op (default hardware floor never admits self_signed). Rewrite it to lower reg.MinTrustLevel to self_signed and assert the self_signed provider becomes selectable — restoring meaningful trust-floor coverage. - TestThermalCriticalBlocksRouting: replace the stale t.Log-both-outcomes sole-provider case (which asserted nothing and contradicted the live path) with a positive assertion that a sole critical-thermal provider is excluded. go build ./..., go vet ./e2e/, and go test ./coordinator/registry/... -race green.
…ain files Pure-move split (deletion-only on the source; no body/comment/signature changed; locking, Datadog metrics, and handleComplete settle-before-close ordering preserved by relocation): provider.go 1917 -> 300 lines, keeping handleProviderWS + providerReadLoop (the WS accept + read-loop dispatcher) and the challenge-interval consts. New files: provider_challenge, provider_challenge_verify, provider_register, provider_acme_trust, provider_inference, provider_complete, provider_attestation, provider_mdm, provider_attestation_handler, provider_trust_status. go build/vet + go test ./coordinator/api/... (throwaway Postgres) green.
…n files Pure-move split (deletion-only on the source; no body/comment/signature changed). server.go 2043 -> 231 lines, keeping the Server struct, NewServer (init order intact), and the api-key-cache consts. New files: context_keys, server_setters, ratelimit_middleware, auth_middleware, middleware (Handler/cors/recover/logging/statusWriter/newRequestID), routes (with the //go:embed install.sh directive + installScript adjacency preserved), runtime_manifest, binary_hash_policy, catalog_sync, observability, mdm_webhook. Verified: gofmt clean, go build/vet, go test ./coordinator/api/... (throwaway Postgres), and TestInstallScriptTemplating (install.sh embed) all green.
…to per-concern files
Pure-move splits (deletion-only on each source; no body/comment/signature changed):
model_registry_handlers.go 884 -> 9 (+ catalog/write handlers, auth, validate,
manifest, projection) — all 7 SyncModelCatalog
calls + manifest verify ordering preserved
billing_handlers.go ~830 -> 21 (+ stripe/referral/pricing/admin/earnings
handlers) — webhook idempotency order,
un-auth'd http.Error webhook preserved
stripe_payouts.go ~690 -> 31 (+ onboard/withdraw/webhook/status/helpers)
— withdraw ordering + asymmetric refunds +
wd-tr-/wd-po- idempotency keys preserved
me_handlers.go ~640 -> 110 (+ types/fleet/identity/build_provider/
enrich/counts) — buildMyProvider
unlock-before-PendingCount preserved
gofmt clean, go build/vet, and go test ./coordinator/api/... (throwaway Postgres)
all green; each source file diff is deletions-only.
…-domain files Pure-move split of the largest file (deletion-only on the source; no body/ comment/signature changed, so all dispatch/billing/streaming invariants — ensureMaxTokensBound-before-cost, rate-limit-before-reserve, single provider.Mu() snapshot, post-commit header/cleanup ordering, invalidateAllAPIKeyCache before+after on key mutations — are preserved): consumer.go 4303 -> 32 lines New files: request_estimate, responses_translate, response_normalize, response_build, reservation, models_handler, account_handlers, keys_handler, inference_stream, inference_nonstream, inference_dispatch (the dispatch primitives + the chat/generic/completions/anthropic handlers). gofmt clean; go build ./coordinator/... ./e2e/...; go vet; and go test ./coordinator/api/... (throwaway Postgres) all green; consumer.go diff is deletions-only.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
…istry The refactor split three monoliths into ~150 focused files. In Go a directory is a package, so the per-domain files stay flat in api/, store/, registry/ (the idiomatic layout — methods are pinned to their receiver's package). To make that navigable, add a doc.go to each package whose package doc is a grouped "file map": every non-test .go file listed once, under its concern, with a one-line purpose. No code change; comment-only files. golangci-lint (govet/ineffassign/unused) and gofmt clean.
Phase 3.4 deleted the legacy ScoreProvider path and its BenchmarkScoreProvider, which was the sole caller of makeProvider. The helper was left behind; `go test` ignores unused funcs but golangci-lint's `unused` linter (run in CI) flags it. Remove it. The remaining BenchmarkFindProvider_* benchmarks use populateRegistry + SelectProvider and are unaffected.
hankbobtheresearchoor
left a comment
There was a problem hiding this comment.
Review Summary
Clean, well-executed refactor. All 11 coordinator packages pass with -race, E2E integration green, gofmt/go vet/go build clean.
The de-duplication is real and correct: HashKey → single canonical, scanProviderRow/providerColumns → single definition, CalculateCost → delegates, USD rounding properly test-pinned. Dead code (FindProvider trio) properly excised with test sites migrated to SelectProvider.
The doc.go file maps are excellent — makes the 150-file directory navigable. Trust boundary code (attestation, challenge verification, routing gates) intact.
Findings
3 inline comments: 2 observations, 1 forward-looking note. Nothing blocking.
Verdict
Ship it. This is what AI-assisted refactoring should look like: staged, test-pinned, with intentional de-dup rather than mechanical find-replace.
| // This package was split from a handful of monoliths (consumer.go, server.go, | ||
| // provider.go, billing_handlers.go, …) into focused per-domain files. Go pins a | ||
| // method to its receiver's package, so the *Server methods all live here; files | ||
| // are grouped by concern, not into subpackages. File map: |
There was a problem hiding this comment.
🔵 Forward-looking: The doc.go + file-map pattern used in all three refactored packages (api, store, registry) is excellent for navigation. If other packages in the repo grow beyond 3-5 files, consider adopting the same pattern — a doc.go with a grouped file map. Especially useful for e2e/testbed/ which will accumulate more harness code over time.
|
🔵 Pre-existing, outside diff scope: |
| @@ -506,8 +499,7 @@ func (s *Server) handleAdminDeleteRelease(w http.ResponseWriter, r *http.Request | |||
| Version string `json:"version"` | |||
There was a problem hiding this comment.
🔵 Observation (referencing nearby unchanged lines): Several release-handler JSON responses at lines 488, 523, 556, 590 use map[string]any for small, fixed-shape payloads. These are finite response shapes — each would benefit from a typed struct for compile-time safety and godoc discoverability. Not blocking. Same pattern in leaderboard.go:92,127 and stripe_payouts_onboard.go:110,126,130. Deferred follow-up at your discretion.
Coordinator Modularization & De-duplication
Behavior-preserving refactor of the Go coordinator (~34.5k non-test LOC). Splits the monolithic
files into cohesive per-domain files and collapses verified copy-paste duplication into single
canonical helpers. Same packages, same exported surface, same wire protocol, same test results.
File structure: before → after
In Go a directory is a package, so the per-domain files stay flat in
api/,store/,registry/(idiomatic — a method must live in its receiver's package, like
net/http's ~60 flat files). Eachpackage gains a
doc.gowhose package comment is a grouped file map for navigation.Before — a few monoliths carrying many unrelated concerns:
graph TD subgraph API["coordinator/api"] a1["consumer.go<br/>4303"] a2["server.go<br/>2051"] a3["provider.go<br/>1917"] a4["model_registry_handlers.go<br/>884"] a5["billing_handlers.go<br/>830"] a6["stripe_payouts.go<br/>690"] a7["me_handlers.go<br/>640"] end subgraph STORE["coordinator/store"] s1["postgres.go<br/>3569"] s2["memory.go<br/>2641"] s3["interface.go<br/>1049"] end subgraph REG["coordinator/registry"] r1["registry.go<br/>2809"] r2["scheduler.go<br/>947"] endAfter — thin cores + a
doc.gomap + cohesive per-concern files:graph TD subgraph API["coordinator/api · 82 files + doc.go map"] adoc["doc.go<br/>file map"] g1["bootstrap<br/>server · routes · config · setters"] g7["http<br/>httpresp · cache · context · auth · middleware"] g2["inference path<br/>dispatch · stream · nonstream · build · normalize · translate · estimate · reservation"] g3["provider WS + attestation<br/>register · challenge · attestation · complete · acme · mdm — 14"] g4["me dashboard — 7"] g5["billing + stripe payouts — 14"] g6["model registry — 7"] g8["stats · releases · openrouter · telemetry"] end subgraph STORE["coordinator/store · 60 files + doc.go map"] sdoc["doc.go<br/>file map"] si["interface fragments + record types — 22"] sm["in-memory impl — 17"] sp["postgres impl — 19"] end subgraph REG["coordinator/registry · 30 files + doc.go map"] rdoc["doc.go<br/>file map"] rcore["core<br/>registry · provider · pending_request"] rroute["routing<br/>reserve · admission · cost · snapshot · quick_capacity"] rlife["lifecycle · queue · drain"] rtrust["trust · reputation · privacy"] rcap["capacity<br/>concurrency · counters · tps"] rmodel["catalog · model_load · snapshots · persistence"] endThe eight monoliths, collapsed
api/consumer.gostore/postgres.goregistry/registry.gostore/memory.goapi/server.goapi/provider.gostore/interface.goregistry/scheduler.go~20k lines of monolith → ~150 focused, single-responsibility files, each listed in its package's
doc.gofile map.Duplication cut (each → one canonical home)
store.HashKey(3 copies → 1) — cross-backend api-key/provider-token hashingscanProviderRow/providerColumns(5 readers → 1)providerEligibleLocked/providerRoutableLocked(5 sites → 1)auth_helpers.go)payments.USDToMicro/MicroToUSD/FormatUSD— one money pathdecodeJSONBody/writeCachedJSONResultresponse helpersCalculateCostdelegationFindProvider/FindProviderWithTrust/ScoreProvidertrio (~250 LOC,zero non-test callers); migrated ~100 test sites onto the live
SelectProvider/ReserveProvider.Two intentional, test-pinned behavior fixes
payments.USDToMicro. Pinned bypayments/usd_test.go.decodeJSONBodystandardizes the ~25 "invalid JSON" variants to one message.Everything else is byte-identical HTTP responses + identical ledger side-effects.
Verification
gofmt/go build ./...(incl.e2e) /go vet ./coordinator/...— cleangolangci-lint(v2.1.6, the CI config) — 0 issues across./coordinator/...-raceagainst a real throwaway PostgresDeferred (follow-up)
Risky-for-marginal-gain intra-file collapses in the encryption/billing hot path: triple E2E
encrypt+dispatch, reservation-refund closures, attestation/challenge helpers. Flagged rather than
forced; can be a focused follow-up PR.
🤖 Generated with Claude Code
Need help on this PR? Tag
@codesmithwith what you need. Autofix is disabled.