Skip to content

BLA-4251 classify backend uploads with create-if-absent semantics#7

Merged
shreyas-blacksmith merged 4 commits into
mainfrom
shreyaskalyan/bla-4251-bazel-cache-classification
Jun 29, 2026
Merged

BLA-4251 classify backend uploads with create-if-absent semantics#7
shreyas-blacksmith merged 4 commits into
mainfrom
shreyaskalyan/bla-4251-bazel-cache-classification

Conversation

@shreyas-blacksmith

@shreyas-blacksmith shreyas-blacksmith commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Part of BLA-4251 (Slice 2, bazel-remote half). Producer changes that let the storage-footprint accumulator count only net-new objects.

Flows (end-to-end)

High-level view of the storage-footprint pipeline across all three repos (repo that owns each flow in brackets). The same overview is on the bazel-remote, fa, and web PRs.

flowchart LR
  br["bazel-remote: conditional PUT (If-None-Match:*), classify + SizeOnDisk"] --> fa["FA: per-minute flush"]
  fa --> ch[("ClickHouse + ingested_at")]
  ch -->|"read (last_ingested_at, now-5m]"| acc["Laravel accumulator (every 15m)"]
  acc -->|"+= delta, advance cursor"| pg[("Postgres: storage_footprint_bytes (customer footprint)")]
  pg -->|"read reported footprint (no write)"| scan["MinIO scan (drift only)"]
  scan -->|"drift metrics + >10% alert"| prom["Grafana"]
  fa -->|"accounting counters"| prom
Loading

Flow 1 — Job accounting [bazel-remote + fa]
No change to the adoption endpoint. On job start, bazel-remote uploads AC/CAS entries to MinIO with If-None-Match: * (create-if-absent) and waits for the result, classifying each upload as created (net-new), already_exists (412/304), error, or dropped (queue full). Net-new bytes are tallied as item.SizeOnDisk. The host accumulates net-new bytes every minute and pushes them to ClickHouse. The upload queue is host-global and outlives any VM, so the periodic flush ships late drains after teardown on its own — no teardown-specific drain or queued metric; any genuine loss (process death before drain, where the bytes also never reached MinIO) is caught by Flow 4 drift.
Metrics: bazelre_cache_upload_bytes_created, bazelre_cache_accounting_reports_completed, bazelre_cache_accounting_reports_failed_to_send.

Flow 2 — ClickHouse [web]
remote_build_cache_operations gains an ingested_at column, server-assigned (DEFAULT now()) on insert from the host — the watermark the accumulator advances over (created_at is host-assigned and minute-truncated, so unsuitable).

Flow 3 — Laravel accounting [web]
bazelre_cache_entries gains last_ingested_at: the last ClickHouse ingested_at consumed into storage_footprint_bytes. Every 15 minutes a sharded job runs over each active row, sums net-new created bytes for the namespace from last_ingested_at to now() - 5 minutes, adds the delta to storage_footprint_bytes, and advances the cursor. This accumulator is the customer-facing footprint shown in the dashboard.

Flow 4 — Storage footprint & drift [web]
The existing MinIO scan-and-sum becomes a metrics-only drift calculator: it no longer writes storage_footprint_bytes, and instead compares scanned (actual) vs accumulator (reported) and emits Grafana metrics only. v0 alerts on drift > 10%.
Metrics: bazelre_cache_actual_storage_bytes, bazelre_cache_reported_storage_bytes, bazelre_cache_storage_drift_bytes, bazelre_cache_storage_drift_percent.

Summary

  • Backend AC/CAS uploads now issue a conditional PUT (If-None-Match: *) via PutObjectOptions.SetMatchETagExcept("*").
  • Each upload emits exactly one terminal outcome: created (no error), already_exists (412, or 304 from older MinIO), or error (PUT failed).
  • Bytes are reported as SizeOnDisk (stored/compressed) everywhere, including the queue-full dropped path, so accounting shares units with the MinIO drift scan.
  • Bumps minio-go v7.0.69 -> v7.0.72: v7.0.69 quoted the * wildcard and silently broke create-if-absent. Core.PutObject is single-shot (no multipart), so every upload regardless of size carries the header.

Verification gate (before un-drafting)

  • Confirm the deployed MinIO server returns 412 PreconditionFailed (not 200/304) for If-None-Match: * on an existing object; pin a known-good RELEASE in the ansible playbooks.

Test plan

  • go test ./cache/s3proxy/ - classifyUploadOutcome (created/already_exists 412+304/failed) and SizeOnDisk reporting on the dropped + observe paths.
  • Integration smoke against a real MinIO: re-upload an existing blob -> outcome already_exists, footprint unchanged.

Review / validation notes

  • Re-ran a cursory cross-repo contract review after the overlap/status-contract fixes.
  • Confirmed the producer/consumer status contract is aligned: bazel-remote reports backend upload failures as error, and web counts error, rejected, and dropped as failure statuses.
  • Confirmed the accumulator now has both scheduler overlap protection and a row-locked cursor compare before incrementing storage_footprint_bytes, so overlapping runs should not double count the same ClickHouse window.
  • Confirmed the MinIO drift scan no longer writes storage_footprint_bytes; it emits histogram-based drift signals instead of region-only gauges to avoid last-writer-wins metrics.
  • Local targeted tests passed:
    • bazel-remote: go test ./cache/s3proxy ./config
    • FA: go test ./cache/bazelre ./util/chbatch ./util/metric
  • Web PHPUnit was not re-run locally because this checkout is missing vendor/bin/phpunit; CI should cover BuildCacheFootprintAccumulatorTest and BuildCacheStatusContractTest.

Made with Cursor


View with Codesmith Autofix with Codesmith
Need help on this PR? Tag /codesmith with what you need. Autofix is disabled.


View with Codesmith
Need help on this PR? Tag /codesmith with what you need. Autofix is enabled. (Staging)

Backend AC/CAS uploads now use a conditional PUT (If-None-Match: *) so
only net-new objects count toward the storage footprint. Each upload
emits exactly one terminal outcome: created (no error), already_exists
(412, or 304 from older MinIO), or failed. Bytes are reported as
SizeOnDisk (stored/compressed) everywhere, including the queue-full
dropped path, so accounting shares units with the MinIO drift scan.

Bumps minio-go to v7.0.72; v7.0.69 quoted the "*" wildcard and silently
broke create-if-absent. Core.PutObject is single-shot, so every upload
(regardless of size) carries the header - no multipart bypass.
@linear-code

linear-code Bot commented Jun 26, 2026

Copy link
Copy Markdown

BLA-4251

shreyas-blacksmith and others added 2 commits June 26, 2026 13:16
go get updated go.mod/go.sum but not the gazelle-managed Bazel deps, so
the Bazel build still resolved the broken minio-go v7.0.69. Ran
//:gazelle-update-repos (gazelle update-repos -from_file=go.mod
-to_macro=deps.bzl%go_dependencies -prune): bumps minio-go to v7.0.72 and
prunes sha256-simd/gofuzz, which v7.0.72 no longer pulls in. Keeps the go
and bazel builds on the same client.

Stays on v7.0.72 (go 1.21) rather than latest v7.0.x: v7.0.9x requires
go >= 1.23, which would force WORKSPACE go_register_toolchains (currently
1.22.0) and likely rules_go/gazelle bumps.

Co-authored-by: Cursor <cursoragent@cursor.com>
classifyUploadOutcome returned "failed", but FA emits "error"
(buildCacheStatusError) and the web reader counts status IN
('error','rejected','dropped'). The divergent "failed" was silently
ignored by both consumers, so genuine upload failures never showed up in
failedOperationsCount. Emit "error" to match the shared taxonomy.

Co-authored-by: Cursor <cursoragent@cursor.com>
@shreyas-blacksmith shreyas-blacksmith marked this pull request as ready for review June 26, 2026 18:23
Comment thread cache/s3proxy/s3proxy.go Outdated
Comment on lines +229 to +233
// Create-if-absent: a backend upload only counts toward the storage footprint
// when it stores a net-new object. If the object already exists, MinIO rejects
// the conditional PUT with a precondition failure and we classify it as
// already_exists instead of created. SetMatchETagExcept("*") emits an unquoted
// "If-None-Match: *" (the quoting bug is fixed in minio-go >= v7.0.72).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'm not sure we need to check in a comment about the need to update minio-go here

@shreyas-blacksmith shreyas-blacksmith merged commit f887e9c into main Jun 29, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants