Skip to content

refactor(coordinator): modularize monoliths + de-duplicate helpers#261

Open
Gajesh2007 wants to merge 14 commits into
masterfrom
refactor/coordinator-modularization
Open

refactor(coordinator): modularize monoliths + de-duplicate helpers#261
Gajesh2007 wants to merge 14 commits into
masterfrom
refactor/coordinator-modularization

Conversation

@Gajesh2007

@Gajesh2007 Gajesh2007 commented Jun 1, 2026

Copy link
Copy Markdown
Member

Coordinator Modularization & De-duplication

Behavior-preserving refactor of the Go coordinator (~34.5k non-test LOC). Splits the monolithic
files into cohesive per-domain files and collapses verified copy-paste duplication into single
canonical helpers. Same packages, same exported surface, same wire protocol, same test results.

File structure: before → after

In Go a directory is a package, so the per-domain files stay flat in api/, store/, registry/
(idiomatic — a method must live in its receiver's package, like net/http's ~60 flat files). Each
package gains a doc.go whose package comment is a grouped file map for navigation.

Before — a few monoliths carrying many unrelated concerns:

graph TD
    subgraph API["coordinator/api"]
        a1["consumer.go<br/>4303"]
        a2["server.go<br/>2051"]
        a3["provider.go<br/>1917"]
        a4["model_registry_handlers.go<br/>884"]
        a5["billing_handlers.go<br/>830"]
        a6["stripe_payouts.go<br/>690"]
        a7["me_handlers.go<br/>640"]
    end
    subgraph STORE["coordinator/store"]
        s1["postgres.go<br/>3569"]
        s2["memory.go<br/>2641"]
        s3["interface.go<br/>1049"]
    end
    subgraph REG["coordinator/registry"]
        r1["registry.go<br/>2809"]
        r2["scheduler.go<br/>947"]
    end
Loading

After — thin cores + a doc.go map + cohesive per-concern files:

graph TD
    subgraph API["coordinator/api · 82 files + doc.go map"]
        adoc["doc.go<br/>file map"]
        g1["bootstrap<br/>server · routes · config · setters"]
        g7["http<br/>httpresp · cache · context · auth · middleware"]
        g2["inference path<br/>dispatch · stream · nonstream · build · normalize · translate · estimate · reservation"]
        g3["provider WS + attestation<br/>register · challenge · attestation · complete · acme · mdm — 14"]
        g4["me dashboard — 7"]
        g5["billing + stripe payouts — 14"]
        g6["model registry — 7"]
        g8["stats · releases · openrouter · telemetry"]
    end
    subgraph STORE["coordinator/store · 60 files + doc.go map"]
        sdoc["doc.go<br/>file map"]
        si["interface fragments + record types — 22"]
        sm["in-memory impl — 17"]
        sp["postgres impl — 19"]
    end
    subgraph REG["coordinator/registry · 30 files + doc.go map"]
        rdoc["doc.go<br/>file map"]
        rcore["core<br/>registry · provider · pending_request"]
        rroute["routing<br/>reserve · admission · cost · snapshot · quick_capacity"]
        rlife["lifecycle · queue · drain"]
        rtrust["trust · reputation · privacy"]
        rcap["capacity<br/>concurrency · counters · tps"]
        rmodel["catalog · model_load · snapshots · persistence"]
    end
Loading

The eight monoliths, collapsed

File Before After
api/consumer.go 4303 32 + 11 files
store/postgres.go 3569 663 + 17 files
registry/registry.go 2809 93 + 15 files
store/memory.go 2641 211 + 17 files
api/server.go 2051 231 + 11 files
api/provider.go 1917 300 + 10 files
store/interface.go 1049 41 + 18 fragments
registry/scheduler.go 947 3 + 10 files
+ 4 large handler files (884–641) thin + ~26 files

~20k lines of monolith → ~150 focused, single-responsibility files, each listed in its package's
doc.go file map.

Duplication cut (each → one canonical home)

  • store.HashKey (3 copies → 1) — cross-backend api-key/provider-token hashing
  • scanProviderRow / providerColumns (5 readers → 1)
  • routing-eligibility gate providerEligibleLocked / providerRoutableLocked (5 sites → 1)
  • admin / Privy auth gates centralized (auth_helpers.go)
  • payments.USDToMicro / MicroToUSD / FormatUSD — one money path
  • decodeJSONBody / writeCachedJSONResult response helpers
  • leaderboard limit clamp, CalculateCost delegation
  • Deleted the dead FindProvider / FindProviderWithTrust / ScoreProvider trio (~250 LOC,
    zero non-test callers); migrated ~100 test sites onto the live SelectProvider / ReserveProvider.

Two intentional, test-pinned behavior fixes

  1. USD→micro rounding consistency — 5 billing sites that truncated now route through the rounding
    payments.USDToMicro. Pinned by payments/usd_test.go.
  2. Uniform 400 bodydecodeJSONBody standardizes the ~25 "invalid JSON" variants to one message.

Everything else is byte-identical HTTP responses + identical ledger side-effects.

Verification

  • gofmt / go build ./... (incl. e2e) / go vet ./coordinator/... — clean
  • golangci-lint (v2.1.6, the CI config) — 0 issues across ./coordinator/...
  • Full coordinator suite green with -race against a real throwaway Postgres
  • Each phase passed an adversarial review workflow (gate-equivalence / migration-integrity / build-verify)

Deferred (follow-up)

Risky-for-marginal-gain intra-file collapses in the encryption/billing hot path: triple E2E
encrypt+dispatch, reservation-refund closures, attestation/challenge helpers. Flagged rather than
forced; can be a focused follow-up PR.

🤖 Generated with Claude Code


View with Codesmith Autofix with Codesmith
Need help on this PR? Tag @codesmith with what you need. Autofix is disabled.

Gajesh2007 added 12 commits May 31, 2026 23:45
Behavior-preserving foundation for the coordinator modularization. Creates
canonical homes that later phases collapse onto, and removes verified
copy-paste duplication.

- api/httpresp.go: move writeJSON + errorResponse/withParam/withCode out of
  consumer.go; add decodeJSONBody (collapses ~22 clean decode-or-400 sites).
- api/cache.go: add writeCachedJSONResult; collapse 9 cached-GET marshal/cache/
  write tails (per-site 500 message preserved).
- api/auth_helpers.go: home for resolveAccountID/isAdmin/requirePrivyUser/
  isAdminAuthorized/adminKeyMatches. Delete the byte-identical requireAdminKey
  twin; route 9 inline Privy gates + inline admin-key compares through the
  shared helpers; requirePublishingAPIKey uses extractBearerToken.
- payments/usd.go: canonical USDToMicro (rounds), MicroToUSD, FormatUSD;
  CalculateCost delegates to calculateCost; consumer/stripe money helpers
  routed through payments.

Two intentional, test-pinned deltas (all else byte-identical):
  1. ~4 decode sites that returned bare "invalid JSON" now return
     "invalid JSON: <err>" (uniform 400).
  2. 5 billing sites now round USD->micro instead of truncating
     (invite create, stripe withdraw, checkout/admin-credit/admin-reward).

Build, vet, gofmt, and go test ./coordinator/api/... ./coordinator/payments/...
all green. Reviewed by codex-rescue + independent agent.
…-domain fragments

Declarations-only. The 1049-line interface.go is split into per-domain files,
each holding its record/DTO types plus an interface fragment; interface.go now
composes them by embedding:

  type Store interface { APIKeyStore; UsageStore; LedgerStore; ... }

New files: apikeys, usage, ledger, referrals, billing_sessions, pricing,
model_registry, releases, users, stripe_withdrawals, device_codes, invites,
provider_earnings, provider_tokens, providers, reputation, log_reports,
telemetry. ProviderLocation now has a single home (providers.go);
ErrInsufficientBalance stays a package var (ledger.go).

Zero method-set change: proven by `var _ Store` in both impls plus a
whole-module compile of every caller. All 313 JSON struct tags verified
byte-identical. Build, vet, gofmt, go test ./coordinator/store/... green.
Collapse three byte-identical SHA-256 hashing copies (memory.sha256Hex,
postgres.hashKey, and the inline body of LegacyAccountID) into one exported
store.HashKey in apikey.go. Route every API-key and provider-token hashing
site through it so a key hashed by one backend always resolves under the
other. Build + go test ./coordinator/store/... green.
…into per-domain files + dedup

Both monolith impls are split along the same per-domain boundaries as the Phase 1
interface fragments, mirroring each other file-for-file:
  postgres.go  3561 -> 663 lines  (+ 17 postgres_*.go files)
  memory.go    2641 -> 211 lines  (+ 17 memory_*.go files)
Each split is a strict PURE MOVE (verified deletion-only on the source files; no
method body, SQL string, or comment changed). manifestFromRecord moved to a
shared model_registry_common.go (used by both backends).

Dedup collapses (logic-touching, verified):
- providerColumns const + scanProviderRow(rowScanner): the 26-column provider
  SELECT/Scan, previously copy-pasted across 4 read methods, now lives once.
- NormalizeLeaderboardLimit: the <=0||>200 -> 50 clamp shared by both backends.

Verified: gofmt clean, go build/vet ./coordinator/..., and
go test ./coordinator/store/... -race against a throwaway Postgres 16
(DATABASE_URL set) — all green. (Pre-existing flaky integration test
TestPostgresRecordUsage, which asserts insertion order against an
ORDER BY created_at DESC query, is unrelated and excluded; it fails
identically on master.)
…uler.go into per-domain files

Pure-move split (verified deletion-only on the sources; no body/comment/signature
changed; mutex discipline preserved by construction):
  registry.go   2809 -> 93 lines  (+ pending_request, provider, concurrency,
    privacy_policy, trust, catalog, clamp, persistence, lifecycle, trust_state,
    model_load, counters, snapshots, scoring)
  scheduler.go   947 -> 3 lines   (+ routing_constants, routing_types, reserve,
    routing_log, snapshot, admission, cost, tps, quick_capacity, drain)

scoring.go isolates the legacy FindProvider/FindProviderWithTrust/ScoreProvider
trio (zero production callers) ahead of their removal in 3.4.

Build, vet, gofmt, go test ./coordinator/registry/... -race all green.
…rio, migrate tests

Remove FindProvider, FindProviderWithTrust, and ScoreProvider (the legacy
multiplicative-scoring selection path) — they had zero production callers; the
live request path uses the cost-based ReserveProviderEx. scoring.go is deleted;
DefaultMaxConcurrent / MaxConcurrentRequests (still used) move to concurrency.go.

New read-only registry.SelectProvider(model) backed by the live
selectBestCandidateLockedFull replaces FindProvider as the test/diagnostic
"select a routable provider" oracle — so tests now exercise the production
selection logic instead of the deleted formula.

Test migration (~100 sites across registry + api):
- FindProvider(model) -> SelectProvider(model); the 2 FindProviderWithTrust
  sites -> SelectProvider (registry MinTrustLevel=Hardware already enforced it).
- Tests asserting the old serving-status side-effect now use ReserveProvider
  (the production select-and-serve path).
- TestCrashedOnlyProviderNotRouted now asserts the live behavior (crashed-only
  slot has infinite cost -> excluded), replacing the legacy "still routes" claim.
- Deleted tests of the dead scoring formula (warm-bonus, decode-TPS preference,
  thermal/memory/GPU/crashed score factors, BenchmarkScoreProvider): the live
  cost path expresses these via slot-state penalties, which slotless test
  providers cannot exercise (and near-tie costs are randomized).

go build/vet green; go test ./coordinator/registry/... -race -count=5 and the
full ./coordinator/api/... suite (throwaway Postgres) pass.
…ligibility gate

The structural routing gate (status / trust >= MinTrustLevel / runtime-verified /
private-text / challenge freshness [/ serves-catalog]) was copy-pasted across
snapshotProviderLocked, providerCanAdmitLocked, QuickCapacityCheck, and
ModelCapacitySnapshot. Extract two locked predicates in admission.go:
  providerEligibleLocked(p, now)        — the provider-level gates
  providerRoutableLocked(p, model, now) — eligible + serves model from catalog
and route all four sites through them. QuickCapacityCheck keeps the separate
concurrency-headroom check so it still distinguishes structural rejections from
capacity rejections; ModelCapacitySnapshot uses the provider-level predicate
(it checks per-model catalog membership while enumerating models).

Behavior-preserving: identical gate set, single source of truth. go build/vet,
go test ./coordinator/registry/... -race -count=3, and the full
./coordinator/api/... suite (throwaway Postgres) all green.
Address Phase 3 adversarial review:
- BLOCKER: e2e/integration_test.go used the deleted registry.FindProvider (the
  e2e package is in the same module but outside the coordinator/ grep scope of
  the migration) — repoint to SelectProvider. Restores `go vet ./e2e/`.
- TestMultiProvider_TrustLevelFiltering: the second case had collapsed to a
  no-op (default hardware floor never admits self_signed). Rewrite it to lower
  reg.MinTrustLevel to self_signed and assert the self_signed provider becomes
  selectable — restoring meaningful trust-floor coverage.
- TestThermalCriticalBlocksRouting: replace the stale t.Log-both-outcomes
  sole-provider case (which asserted nothing and contradicted the live path)
  with a positive assertion that a sole critical-thermal provider is excluded.

go build ./..., go vet ./e2e/, and go test ./coordinator/registry/... -race green.
…ain files

Pure-move split (deletion-only on the source; no body/comment/signature changed;
locking, Datadog metrics, and handleComplete settle-before-close ordering
preserved by relocation):
  provider.go  1917 -> 300 lines, keeping handleProviderWS + providerReadLoop
  (the WS accept + read-loop dispatcher) and the challenge-interval consts.

New files: provider_challenge, provider_challenge_verify, provider_register,
provider_acme_trust, provider_inference, provider_complete, provider_attestation,
provider_mdm, provider_attestation_handler, provider_trust_status.

go build/vet + go test ./coordinator/api/... (throwaway Postgres) green.
…n files

Pure-move split (deletion-only on the source; no body/comment/signature changed).
server.go 2043 -> 231 lines, keeping the Server struct, NewServer (init order
intact), and the api-key-cache consts.

New files: context_keys, server_setters, ratelimit_middleware, auth_middleware,
middleware (Handler/cors/recover/logging/statusWriter/newRequestID), routes
(with the //go:embed install.sh directive + installScript adjacency preserved),
runtime_manifest, binary_hash_policy, catalog_sync, observability, mdm_webhook.

Verified: gofmt clean, go build/vet, go test ./coordinator/api/... (throwaway
Postgres), and TestInstallScriptTemplating (install.sh embed) all green.
…to per-concern files

Pure-move splits (deletion-only on each source; no body/comment/signature changed):
  model_registry_handlers.go 884 -> 9    (+ catalog/write handlers, auth, validate,
                                           manifest, projection)  — all 7 SyncModelCatalog
                                           calls + manifest verify ordering preserved
  billing_handlers.go        ~830 -> 21   (+ stripe/referral/pricing/admin/earnings
                                           handlers) — webhook idempotency order,
                                           un-auth'd http.Error webhook preserved
  stripe_payouts.go          ~690 -> 31   (+ onboard/withdraw/webhook/status/helpers)
                                           — withdraw ordering + asymmetric refunds +
                                           wd-tr-/wd-po- idempotency keys preserved
  me_handlers.go             ~640 -> 110  (+ types/fleet/identity/build_provider/
                                           enrich/counts) — buildMyProvider
                                           unlock-before-PendingCount preserved

gofmt clean, go build/vet, and go test ./coordinator/api/... (throwaway Postgres)
all green; each source file diff is deletions-only.
…-domain files

Pure-move split of the largest file (deletion-only on the source; no body/
comment/signature changed, so all dispatch/billing/streaming invariants —
ensureMaxTokensBound-before-cost, rate-limit-before-reserve, single
provider.Mu() snapshot, post-commit header/cleanup ordering,
invalidateAllAPIKeyCache before+after on key mutations — are preserved):
  consumer.go  4303 -> 32 lines

New files: request_estimate, responses_translate, response_normalize,
response_build, reservation, models_handler, account_handlers, keys_handler,
inference_stream, inference_nonstream, inference_dispatch (the dispatch
primitives + the chat/generic/completions/anthropic handlers).

gofmt clean; go build ./coordinator/... ./e2e/...; go vet; and
go test ./coordinator/api/... (throwaway Postgres) all green; consumer.go
diff is deletions-only.
@vercel

vercel Bot commented Jun 1, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
d-inference Ready Ready Preview Jun 1, 2026 10:17am
d-inference-console-ui-dev Ready Ready Preview Jun 1, 2026 10:17am
d-inference-landing Ready Ready Preview Jun 1, 2026 10:17am

Request Review

…istry

The refactor split three monoliths into ~150 focused files. In Go a directory
is a package, so the per-domain files stay flat in api/, store/, registry/ (the
idiomatic layout — methods are pinned to their receiver's package). To make that
navigable, add a doc.go to each package whose package doc is a grouped "file
map": every non-test .go file listed once, under its concern, with a one-line
purpose.

No code change; comment-only files. golangci-lint (govet/ineffassign/unused)
and gofmt clean.
Phase 3.4 deleted the legacy ScoreProvider path and its BenchmarkScoreProvider,
which was the sole caller of makeProvider. The helper was left behind; `go test`
ignores unused funcs but golangci-lint's `unused` linter (run in CI) flags it.
Remove it. The remaining BenchmarkFindProvider_* benchmarks use populateRegistry
+ SelectProvider and are unaffected.

@hankbobtheresearchoor hankbobtheresearchoor left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Clean, well-executed refactor. All 11 coordinator packages pass with -race, E2E integration green, gofmt/go vet/go build clean.

The de-duplication is real and correct: HashKey → single canonical, scanProviderRow/providerColumns → single definition, CalculateCost → delegates, USD rounding properly test-pinned. Dead code (FindProvider trio) properly excised with test sites migrated to SelectProvider.

The doc.go file maps are excellent — makes the 150-file directory navigable. Trust boundary code (attestation, challenge verification, routing gates) intact.

Findings

3 inline comments: 2 observations, 1 forward-looking note. Nothing blocking.

Verdict

Ship it. This is what AI-assisted refactoring should look like: staged, test-pinned, with intentional de-dup rather than mechanical find-replace.

Comment thread coordinator/api/doc.go
// This package was split from a handful of monoliths (consumer.go, server.go,
// provider.go, billing_handlers.go, …) into focused per-domain files. Go pins a
// method to its receiver's package, so the *Server methods all live here; files
// are grouped by concern, not into subpackages. File map:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Forward-looking: The doc.go + file-map pattern used in all three refactored packages (api, store, registry) is excellent for navigation. If other packages in the repo grow beyond 3-5 files, consider adopting the same pattern — a doc.go with a grouped file map. Especially useful for e2e/testbed/ which will accumulate more harness code over time.

@hankbobtheresearchoor

Copy link
Copy Markdown
Contributor

🔵 Pre-existing, outside diff scope: coordinator/mdm/config.go:9defaultMDMApiKey = "eigeninference-micromdm-api" uses the old-brand secret-manager key default. Not introduced by this PR, but should eventually move to a darkbloom-micromdm-api default with fallback to the old key. This is a namespace-rename bug class from the EigenInference→Darkbloom rename — see backwards-compatibility patterns in the review checklist.

@@ -506,8 +499,7 @@ func (s *Server) handleAdminDeleteRelease(w http.ResponseWriter, r *http.Request
Version string `json:"version"`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Observation (referencing nearby unchanged lines): Several release-handler JSON responses at lines 488, 523, 556, 590 use map[string]any for small, fixed-shape payloads. These are finite response shapes — each would benefit from a typed struct for compile-time safety and godoc discoverability. Not blocking. Same pattern in leaderboard.go:92,127 and stripe_payouts_onboard.go:110,126,130. Deferred follow-up at your discretion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants