perf(storage): rewrite multipart upload — zero-copy, HTTP/2, no MD5 #2381
ValentaTomas wants to merge 1 commit into main
Conversation
PR Summary (Medium Risk). Reviewed by Cursor Bugbot for commit b3e8ad5.
Force-pushed from 1fe311a to 14d6edb
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 16ead0cde5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
data := rootfsDiff.Data()
if data == nil {
	return nil
Avoid uploading directly from evictable mmap slices
TemplateBuild.Upload now pulls diff bytes via Diff.Data() and passes that slice to StoreData, but for local/storage diffs Data() returns the backing mmap (block.Cache.Data). Those mmaps can be closed by DiffStore eviction (scheduleDelete -> cache.Delete -> Diff.Close) while async upload is still running, and unlike the previous StoreFile(path) flow there is no open file descriptor to keep data valid after deletion. In low-disk-pressure scenarios with long uploads (especially >60s eviction delay), this can turn into invalid memory access or corrupted/failed uploads.
	data, err := os.ReadFile(path)
	if err != nil {
-		return fmt.Errorf("failed to get file size: %w", err)
+		return fmt.Errorf("failed to read file: %w", err)
	}
Keep StoreFile streaming for large objects
StoreFile now unconditionally calls os.ReadFile before deciding upload strategy, which regresses memory behavior for large files: the whole object is loaded into RAM at once instead of being read chunk-by-chunk as before. Any remaining StoreFile callers (e.g. peer fallback path) can now hit large transient allocations or OOM under big snapshots, even though the previous implementation bounded memory by chunk/concurrency.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Autofix Details
Bugbot Autofix prepared fixes for both issues found in the latest run.
- ✅ Fixed: Cache.Data() returns mmap reference without holding lock
- Updated Data() to return a release function that holds the RLock until the caller is done with the mmap data, preventing SIGSEGV from concurrent Close() calls during multipart uploads.
- ✅ Fixed: StorageDiff.Data() silently swallows chunker errors
- Changed Data() to return error as third return value, ensuring chunker failures are properly propagated instead of silently skipping uploads.
Or push these changes by commenting:
@cursor push 450d6b9081
Preview (450d6b9081)
diff --git a/packages/orchestrator/pkg/sandbox/block/cache.go b/packages/orchestrator/pkg/sandbox/block/cache.go
--- a/packages/orchestrator/pkg/sandbox/block/cache.go
+++ b/packages/orchestrator/pkg/sandbox/block/cache.go
@@ -410,15 +410,20 @@
return c.filePath
}
-func (c *Cache) Data() []byte {
+func (c *Cache) Data() ([]byte, func()) {
c.mu.RLock()
- defer c.mu.RUnlock()
if c.mmap == nil {
- return nil
+ c.mu.RUnlock()
+
+ return nil, func() {}
}
- return []byte(*c.mmap)
+ releaseCacheCloseLock := func() {
+ c.mu.RUnlock()
+ }
+
+ return []byte(*c.mmap), releaseCacheCloseLock
}
func NewCacheFromProcessMemory(
diff --git a/packages/orchestrator/pkg/sandbox/block/chunk.go b/packages/orchestrator/pkg/sandbox/block/chunk.go
--- a/packages/orchestrator/pkg/sandbox/block/chunk.go
+++ b/packages/orchestrator/pkg/sandbox/block/chunk.go
@@ -85,7 +85,7 @@
ReadAt(ctx context.Context, b []byte, off int64) (int, error)
WriteTo(ctx context.Context, w io.Writer) (int64, error)
Close() error
- Data() []byte
+ Data() ([]byte, func())
FileSize() (int64, error)
}
@@ -297,7 +297,7 @@
return c.cache.Close()
}
-func (c *FullFetchChunker) Data() []byte {
+func (c *FullFetchChunker) Data() ([]byte, func()) {
return c.cache.Data()
}
diff --git a/packages/orchestrator/pkg/sandbox/block/streaming_chunk.go b/packages/orchestrator/pkg/sandbox/block/streaming_chunk.go
--- a/packages/orchestrator/pkg/sandbox/block/streaming_chunk.go
+++ b/packages/orchestrator/pkg/sandbox/block/streaming_chunk.go
@@ -460,7 +460,7 @@
return c.cache.Close()
}
-func (c *StreamingChunker) Data() []byte {
+func (c *StreamingChunker) Data() ([]byte, func()) {
return c.cache.Data()
}
diff --git a/packages/orchestrator/pkg/sandbox/build/diff.go b/packages/orchestrator/pkg/sandbox/build/diff.go
--- a/packages/orchestrator/pkg/sandbox/build/diff.go
+++ b/packages/orchestrator/pkg/sandbox/build/diff.go
@@ -30,7 +30,7 @@
block.Slicer
CacheKey() DiffStoreKey
CachePath() (string, error)
- Data() []byte
+ Data() ([]byte, func(), error)
FileSize() (int64, error)
Init(ctx context.Context) error
}
@@ -43,8 +43,8 @@
return "", NoDiffError{}
}
-func (n *NoDiff) Data() []byte {
- return nil
+func (n *NoDiff) Data() ([]byte, func(), error) {
+ return nil, func() {}, nil
}
func (n *NoDiff) Slice(_ context.Context, _, _ int64) ([]byte, error) {
diff --git a/packages/orchestrator/pkg/sandbox/build/local_diff.go b/packages/orchestrator/pkg/sandbox/build/local_diff.go
--- a/packages/orchestrator/pkg/sandbox/build/local_diff.go
+++ b/packages/orchestrator/pkg/sandbox/build/local_diff.go
@@ -110,8 +110,10 @@
return b.cache.Path(), nil
}
-func (b *localDiff) Data() []byte {
- return b.cache.Data()
+func (b *localDiff) Data() ([]byte, func(), error) {
+ data, release := b.cache.Data()
+
+ return data, release, nil
}
func (b *localDiff) Close() error {
diff --git a/packages/orchestrator/pkg/sandbox/build/storage_diff.go b/packages/orchestrator/pkg/sandbox/build/storage_diff.go
--- a/packages/orchestrator/pkg/sandbox/build/storage_diff.go
+++ b/packages/orchestrator/pkg/sandbox/build/storage_diff.go
@@ -150,13 +150,15 @@
return b.cachePath, nil
}
-func (b *StorageDiff) Data() []byte {
+func (b *StorageDiff) Data() ([]byte, func(), error) {
c, err := b.chunker.Wait()
if err != nil {
- return nil
+ return nil, func() {}, err
}
- return c.Data()
+ data, release := c.Data()
+
+ return data, release, nil
}
func (b *StorageDiff) FileSize() (int64, error) {
diff --git a/packages/orchestrator/pkg/sandbox/template_build.go b/packages/orchestrator/pkg/sandbox/template_build.go
--- a/packages/orchestrator/pkg/sandbox/template_build.go
+++ b/packages/orchestrator/pkg/sandbox/template_build.go
@@ -174,7 +174,12 @@
})
eg.Go(func() error {
- data := rootfsDiff.Data()
+ data, release, err := rootfsDiff.Data()
+ if err != nil {
+ return fmt.Errorf("failed to get rootfs data: %w", err)
+ }
+ defer release()
+
if data == nil {
return nil
}
@@ -183,7 +188,12 @@
})
eg.Go(func() error {
- data := memfileDiff.Data()
+ data, release, err := memfileDiff.Data()
+ if err != nil {
+ return fmt.Errorf("failed to get memfile data: %w", err)
+ }
+ defer release()
+
if data == nil {
return nil
	}
	}

	return c.Data()
}
StorageDiff.Data() silently swallows chunker errors
Medium Severity
StorageDiff.Data() returns nil when b.chunker.Wait() fails, which causes the caller in template_build.go to silently skip the upload (if data == nil { return nil }). Every other method on StorageDiff (e.g., Close, ReadAt, Slice, FileSize) properly propagates the error from chunker.Wait(). Silently skipping an upload on error could lead to an incomplete template build stored without any error being reported.
Reviewed by Cursor Bugbot for commit 33cbbd4.
		return nil
	}

	return []byte(*c.mmap)
Use-after-free risk: Data() acquires a read lock, checks c.mmap != nil, then releases the lock before returning the slice. The returned []byte is backed by mmap memory whose lifetime is controlled by Cache.Close() -> munmap. Any goroutine holding this slice (e.g. an upload goroutine calling storeData for a large memfile) can be left with a dangling pointer if Cache.Close() is called concurrently. The lock only protects the nil check, not the validity of the backing memory during use. In template_build.go this slice is passed into uploader.Upload which fans it out to concurrent HTTP goroutines with retries -- if the cache is closed while a retry is in flight the result is a SIGBUS or silent stale data.
	objectName := o.path

-	fileInfo, err := os.Stat(path)
+	data, err := os.ReadFile(path)
StoreFile now unconditionally calls os.ReadFile(path) for all file sizes before passing to storeData. Previously, files larger than 50 MB were streamed from disk in parallel without a full in-memory copy. Any caller that still reaches StoreFile with a large file will now allocate the entire file on the heap instead of streaming. The large-file streaming path has been silently dropped from this method.
		return nil
	}

	return c.Data()
StorageDiff.Data() can silently return incomplete data. When backed by a StreamingChunker, the mmap cache only contains bytes for ranges fetched via ReadAt; all other regions are zero-filled. The return value is a non-nil full-length slice, but callers in template_build.go only check if data == nil to skip the upload. If a StorageDiff-backed diff ever appears in the upload path (e.g. a snapshot whose base was only partially streamed), the upload proceeds with zero-filled holes and silently corrupts the stored object. Consider returning nil from Data() when completeness cannot be guaranteed.
}

func (c *cachedSeekable) StoreData(ctx context.Context, data []byte) (e error) {
	ctx, span := c.tracer.Start(ctx, "write object from data")
	defer func() {
		recordError(span, e)
		span.End()
	}()

	return c.inner.StoreData(ctx, data)
🔴 cachedSeekable.StoreData() omits the NFS write-through cache population that StoreFile() performs when EnableWriteThroughCacheFlag is set, causing the first post-eviction reads for newly-uploaded templates to fall back to GCS instead of being served from the local NFS cache. The fix is to add a createCacheBlocksFromData helper (mirroring createCacheBlocksFromFile) and call it—along with writeLocalSize—inside StoreData() when the flag is enabled.
Extended reasoning...
The asymmetry
StoreFile() in cachedSeekable (storage_cache_seekable.go:281) conditionally calls createCacheBlocksFromFile() and writeLocalSize() when featureflags.EnableWriteThroughCacheFlag is set, eagerly populating the NFS block cache so subsequent reads are served locally. The new StoreData() method added in this PR (lines 305–314) only delegates to c.inner.StoreData() and skips both steps entirely.
Reachability
cachedSeekable is returned by cache.OpenSeekable() (storage_cache.go:108–126), which wraps the inner provider's Seekable in a cachedSeekable. When persistence is wrapped with WrapInNFSCache—as it is in the template-build path—template_build.uploadMemfile/uploadRootfs calls OpenSeekable and then StoreData on the resulting cachedSeekable. peerSeekable.StoreData() also delegates directly to the fallback (raw GCS), so the cachedSeekable layer is the only place the write-through cache would be populated.
Impact
For deployments with EnableWriteThroughCacheFlag=true, every template uploaded via StoreData() will have its NFS cache unpopulated. The first ReadAt() or OpenRangeReader() call for each 4 MB chunk after the in-memory template cache entry is evicted will miss NFS and fall back to a GCS round-trip, defeating the write-through optimisation for the new zero-copy path.
Addressing the refutations
One refuter argues this is self-healing: cachedSeekable.ReadAt() writes back to NFS on GCS fallback, so the miss is only one round-trip per chunk on first access after eviction. This is technically correct—the cache does heal lazily—but the write-through flag exists precisely to avoid that initial latency spike for freshly-uploaded templates. Replacing eager population with lazy healing is a behavioural regression for users who enabled the flag to guarantee warm-cache reads.
A second refuter argues cachedSeekable.StoreData() is dead code because the upload path sets t.persistence to the raw gcpStorage, not the NFS-wrapped provider. This is plausible if WrapInNFSCache is only applied on the read path. If that is true the bug cannot manifest today, but: (a) the public interface says cachedSeekable satisfies SeekableWriter including StoreData, (b) the omission creates a latent correctness trap for any future caller that does use the NFS-wrapped provider for writes, and (c) three of the four confirmatory verifiers independently concluded the method is reachable in the build path.
Step-by-step proof (flag-enabled scenario)
- persistence = WrapInNFSCache(gcpStorage) → OpenSeekable() returns a cachedSeekable.
- templateBuild.uploadMemfile() calls object.StoreData(ctx, data) on that cachedSeekable.
- cachedSeekable.StoreData() calls c.inner.StoreData() (GCS write) and returns—no NFS blocks written, no size file written.
- Later, the in-memory template entry is evicted.
- A sandbox reads memfile chunk at offset 0: cachedSeekable.ReadAt() opens the NFS chunk file → ENOENT (cache miss) → falls back to GCS ReadAt → writes chunk to NFS asynchronously.
- If StoreData had called createCacheBlocksFromData (equivalent of createCacheBlocksFromFile operating on []byte), step 5 would have been an NFS hit with no GCS latency.
Fix
Add createCacheBlocksFromData(ctx, data []byte) that iterates over the in-memory slice in chunkSize windows and writes each chunk to NFS (the data is already in memory, so this is cheap). Then in StoreData(), mirror the StoreFile() conditional:
if c.flags.BoolFlag(ctx, featureflags.EnableWriteThroughCacheFlag) {
c.goCtx(ctx, func(ctx context.Context) {
size, err := c.createCacheBlocksFromData(ctx, data)
// ... handle err, call writeLocalSize
})
}
return c.inner.StoreData(ctx, data)

}

func (c *Cache) Data() []byte {
	c.mu.RLock()
	defer c.mu.RUnlock()

	if c.mmap == nil {
		return nil
	}

	return []byte(*c.mmap)
🔴 Cache.Data() returns a mmap-backed []byte after releasing its RLock, so Cache.Close() can call mmap.Unmap() while upload goroutines are still reading the slice, causing a SIGSEGV. Fix by following the existing addressBytes() pattern: return both the slice and a releaseCacheCloseLock closure that keeps the RLock held until the upload is complete.
Extended reasoning...
The bug in detail
Cache.Data() (cache.go:411-421) acquires c.mu.RLock(), constructs []byte(*c.mmap) — a slice pointing directly into the memory-mapped region — and then immediately releases the lock via defer. Once Data() returns, the caller holds a raw pointer into mmap memory with no reference-counting, no lock, and no lifetime guarantee. Concurrently, Cache.Close() acquires c.mu.Lock() (write lock) and calls c.mmap.Unmap(), which instructs the kernel to tear down the mapping. Any subsequent read of the returned slice accesses unmapped memory and produces a SIGSEGV.
The concrete crash path
In template_build.go the Upload() function spawns two errgroup goroutines: one calls memfileDiff.Data() to get the mmap slice, then passes it to uploadMemfile → object.StoreData → gcpmultipart.Upload, where it is sliced as data[start:end] per part and written to HTTP PUT requests. For GB-scale files this upload can take minutes. The RLock is released the moment Data() returns — before uploadMemfile is even called. If any concurrent code path calls diff.Close() during this window, the mmap is unmapped while the upload is still streaming parts.
Why Close() can be called concurrently
Multiple triggerable paths exist: (1) TTL-based eviction in templateCache — the OnEviction callback calls template.Close() → diff.Close() → cache.Close(); (2) disk-space-based eviction in DiffStore, which has a scheduleDelete path with a 60-second delay added explicitly to prevent race conditions with exposed slices, but 60 seconds is insufficient for multi-GB uploads; (3) context cancellation during sandbox cleanup or build failure triggering deferred Close() calls. Any of these racing with the upload goroutine produces the use-after-free.
The smoking gun: addressBytes() already solves this
The same file (cache.go) contains addressBytes(), which explicitly keeps the RLock held and returns a releaseCacheCloseLock func() to the caller (used correctly in FullFetchChunker.fetchToCache and StreamingChunker.runFetch). Data() was introduced as part of this PR without following that pattern, breaking the safety invariant the rest of the codebase upholds.
Step-by-step proof
- The errgroup in template_build.go calls memfileDiff.Data() — RLock acquired, []byte(*c.mmap) returned, RLock released (lock count = 0).
- Upload goroutine enters gcpmultipart.Upload and issues, say, 20 concurrent part PUTs, each reading data[i*ChunkSize:(i+1)*ChunkSize].
- After 60 seconds, disk-space eviction fires: DiffStore.deleteOldestFromCache → scheduleDelete timer fires → cache.Delete → OnEviction → diff.Close() → cache.Close().
- cache.Close() calls c.mu.Lock() — succeeds immediately because lock count is 0 — then c.mmap.Unmap().
- Part PUT goroutines are still executing bytes.NewReader(data[start:end]) and writing to the HTTP body; the underlying memory is now unmapped → SIGSEGV.
How to fix
Data() should follow the addressBytes() pattern: acquire RLock, return both the slice and a release closure, and require the caller to hold that closure for the lifetime of the upload. The template_build.go goroutines would call release() in a defer after uploadMemfile/uploadRootfs returns. Alternatively, Data() could copy the mmap into a heap buffer (safe but defeats the zero-copy goal of this PR).
Force-pushed from d787a42 to cbf0853
	}

	return []byte(*c.mmap)
}
Cache.Data() missing isClosed check before returning mmap
Low Severity
Cache.Data() does not check c.isClosed() before returning the mmap contents. After Close() calls c.mmap.Unmap(), the mmap pointer is not set to nil, so Data() passes the nil check and returns a byte slice referencing unmapped memory. Every other method on Cache (Slice, addressBytes, Size) checks isClosed() as a safety guard. If Data() is ever called after Close(), it would return a dangling reference.
Reviewed by Cursor Bugbot for commit 62a5e03.
Force-pushed from 466867d to 04ea467
Extract multipart upload into packages/shared/pkg/storage/multipart/. The new implementation takes a []byte directly (typically mmap'd) and slices into it per part — zero allocation, zero copy in the upload loop.

Key changes:
- Remove MD5 per-part computation (was 20% CPU in prod profile)
- Remove mutex (parts array uses distinct indices per goroutine)
- Remove sync.Pool and buffer copies
- Use mmap for large files in StoreFile instead of os.ReadFile
- Simplify XML types to private, minimal structs
- Deduplicate HTTP boilerplate via doRequest helper
- Trim Uploader struct to 3 fields (token, client, baseURL)
Force-pushed from 04ea467 to b3e8ad5
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 4 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF.
Reviewed by Cursor Bugbot for commit b3e8ad5.
	}

-	return t.uploadMemfile(ctx, *memfilePath)
+	return t.uploadMemfile(ctx, data)
Errors from Data() silently skip uploads causing data loss
High Severity
StorageDiff.Data() swallows the error from chunker.Wait() and returns nil, which Upload() treats identically to NoDiff (no data to upload). If the chunker fails for any reason, the memfile or rootfs upload is silently skipped and the overall Upload returns success. The old code propagated errors — CachePath() errors caused the snapshot upload to fail, and even without that, StoreFile would fail explicitly on a missing/incomplete file. The new path hides failures, risking silent data loss during template builds.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit b3e8ad5.
	}()

	return c.inner.StoreData(ctx, data)
}
StoreData skips write-through cache unlike StoreFile
Medium Severity
cachedSeekable.StoreData directly delegates to c.inner.StoreData without populating the local write-through cache, unlike StoreFile which conditionally creates cache blocks and writes local size when EnableWriteThroughCacheFlag is enabled. When this flag is on, uploads via StoreData won't populate the local disk cache, causing subsequent reads of the same object to miss the local cache and fall through to remote storage.
Reviewed by Cursor Bugbot for commit b3e8ad5.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3e8ad523e
func (o *awsObject) StoreData(ctx context.Context, data []byte) error {
	return o.Put(ctx, data)
}
Use multipart upload in AWS StoreData
TemplateBuild.Upload now sends memfile/rootfs through StoreData, so this new method is on the hot path for snapshot uploads. In awsObject, StoreData calls Put (single PutObject) instead of the multipart uploader used by StoreFile; on AWS this regresses large-object handling and causes uploads to fail for sufficiently large snapshots (notably >5 GiB) that previously worked through multipart upload.
}

func (o *gcpObject) StoreFile(ctx context.Context, path string) (e error) {
-	ctx, span := tracer.Start(ctx, "write to gcp from file system")
+	ctx, span := tracer.Start(ctx, "store file to gcp")
	defer func() {
		recordError(span, e)
		span.End()
	}()

	bucketName := o.storage.bucket.BucketName()
	objectName := o.path

-	fileInfo, err := os.Stat(path)
+	data, err := os.ReadFile(path)
	if err != nil {
-		return fmt.Errorf("failed to get file size: %w", err)
+		return fmt.Errorf("failed to read file: %w", err)
	}

-	// If the file is too small, the overhead of writing in parallel isn't worth the effort.
-	// Write it in one shot instead.
-	if fileInfo.Size() < gcpMultipartUploadChunkSize {
+	return o.storeData(ctx, data)
🔴 gcpObject.StoreFile now calls os.ReadFile(path) unconditionally, loading the entire file into heap memory before deciding whether to use multipart upload; previously, files ≥50 MB were streamed chunk-by-chunk via UploadFileInParallel using ReadAt, bounding peak memory to ~50 MB. While the primary upload path (uploadMemfile/uploadRootfs) was migrated to StoreData() in this PR, the StoreFile method remains on the SeekableWriter interface and is still delegated through cachedSeekable.StoreFile → c.inner.StoreFile and peerSeekable.StoreFile → fallback.StoreFile; any caller that reaches those paths with a large file (e.g. a 500 MB rootfs) will now allocate the full file on the heap instead of streaming, risking OOM under memory pressure.
Extended reasoning...
What the bug is and how it manifests
The new gcpObject.storeData() helper is called from both StoreData() (the new path) and StoreFile() (the old path). StoreFile() now reads the entire file with os.ReadFile(path) before delegating to storeData(), whereas the old implementation only did a full in-memory read for files smaller than 50 MB (gcpMultipartUploadChunkSize). For files ≥ 50 MB the old code called UploadFileInParallel, which opened the file, read one 50 MB chunk at a time with file.ReadAt, and released each buffer after the part was uploaded — peak heap usage was bounded to approximately one chunk (~50 MB).
The specific code path
packages/shared/pkg/storage/storage_google.go:386-400 now reads:
data, err := os.ReadFile(path) // entire file into heap
return o.storeData(ctx, data)
The entire file is resident in memory from before the first HTTP PUT until the upload completes. For a 500 MB memfile or rootfs this is a 500 MB heap allocation per concurrent build.
Why existing code doesn't prevent it
The PR correctly migrated the hot path: template_build.go uploadMemfile/uploadRootfs now calls object.StoreData(ctx, data) with an already-mmap'd slice, bypassing StoreFile entirely. However, StoreFile is a required method on the SeekableWriter interface (storage.go:102-104), and two delegating implementations still forward to it:
• cachedSeekable.StoreFile (storage_cache_seekable.go:281) → c.inner.StoreFile(ctx, path)
• peerSeekable.StoreFile (peerclient/seekable.go:130) → fallback.StoreFile(ctx, path)
Any call to StoreFile on either wrapper (e.g. via NFS-cache write-through on a template stored by file path, or a peer fallback write) will trigger the full allocation.
Addressing the refutation
One verifier argued that a codebase-wide grep shows zero business-logic callers of .StoreFile() after this PR, so the scenario is unreachable today. That observation is accurate for the current snapshot of business-logic code. However: (a) StoreFile is a required interface method — future callers or integrations may rely on it; (b) the peerSeekable and cachedSeekable delegating wrappers preserve the reachability of the regression should any caller chain through them; (c) the regression silently removes a safety property that was explicitly designed into the old implementation (streaming large files to avoid OOM). Leaving the method in a broken state for large files is a footgun.
Impact
On nodes handling large sandbox images through the peerSeekable or cachedSeekable code paths, or if StoreFile is called by any downstream code, a 500 MB rootfs upload would allocate ~500 MB on the heap instead of ~50 MB. Under concurrent builds this can cause OOM kills of the orchestrator process.
Step-by-step proof
- A caller obtains a cachedSeekable object (e.g. storage_cache.go OpenSeekable wrapping gcpStorage).
- The caller invokes object.StoreFile(ctx, '/path/to/500mb-rootfs').
- cachedSeekable.StoreFile calls c.inner.StoreFile(ctx, path) (storage_cache_seekable.go:304).
- c.inner is a gcpObject; its StoreFile calls data, _ = os.ReadFile('/path/to/500mb-rootfs') — 500 MB heap allocation.
- storeData() is called with the full slice; multipart upload proceeds with 500 MB locked in heap until all parts complete.
- Compare: old code at this step opened the file, called UploadFileInParallel which iterated chunk-by-chunk with ReadAt — peak heap was one 50 MB chunk.
How to fix
Restore the streaming behaviour inside StoreFile: open the file, stat its size, and if size >= gcpmultipart.ChunkSize call the uploader with a file-based reader (or mmap the file and pass the slice, then munmap after upload). Alternatively, update storeData to accept an io.ReaderAt + size so that both StoreFile and StoreData can use the same parallel-upload code without requiring the full slice in RAM for file-backed uploads.
🔬 also observed by chatgpt-codex-connector
	return b.cachePath, nil
}

func (b *StorageDiff) Data() []byte {
	c, err := b.chunker.Wait()
	if err != nil {
		return nil
	}

	return c.Data()
}
🔴 StorageDiff.Data() returns nil when b.chunker.Wait() fails instead of propagating the error, causing the caller in template_build.go to silently skip the upload (via 'if data == nil { return nil }') as though there were no diff. Every other method on StorageDiff (Close, ReadAt, Slice, WriteTo, FileSize) properly returns the error; Data() is the only method that swallows it. Fix by changing the Diff.Data() interface to return ([]byte, error) so errors can propagate.
Extended reasoning...
What the bug is
StorageDiff.Data() at storage_diff.go:152-160 calls b.chunker.Wait() and, if it returns an error (e.g. network failure during chunker initialization, context cancellation, or any error set via SetError), the method returns nil with no error signal:
func (b *StorageDiff) Data() []byte {
	c, err := b.chunker.Wait()
	if err != nil {
		return nil // error silently dropped
	}
	return c.Data()
}
The specific code path that triggers it
In template_build.go (Upload function), both the memfile and rootfs upload goroutines call Diff.Data() and then check the result:
data := memfileDiff.Data()
if data == nil {
	return nil // treated as NoDiff, upload silently skipped
}
return t.uploadMemfile(ctx, data)
Since NoDiff.Data() also returns nil (by design), the caller cannot distinguish between no diff to upload and a chunker failure. Both result in silently skipping the upload with nil returned from the goroutine.
Why existing code does not prevent it
Every other method on StorageDiff propagates errors from chunker.Wait(): Close, ReadAt, Slice, WriteTo, and FileSize all return the error to callers. The Data() method introduced by this PR is the sole exception. The root cause is structural: the Diff.Data() interface is declared as returning only []byte with no error return, making proper error propagation impossible without a signature change.
What the impact would be
If b.chunker.Wait() returns an error (e.g., upstream storage cannot be fetched during Init, or context is cancelled between Init and Upload), the memfile or rootfs upload is silently skipped. TemplateBuild.Upload() returns nil (success), Snapshot.Upload() returns nil, and the caller has no indication that the template build is incomplete. This creates a stored template entry missing its memfile or rootfs data: silent data loss during template builds.
How to fix it
Change the Diff interface Data() signature to return ([]byte, error). StorageDiff.Data() can then return (nil, err) on chunker failure and callers in template_build.go can properly check and propagate the error. NoDiff.Data() would return (nil, nil). Alternatively, store the initialization error in StorageDiff and expose it through a separate method.
Step-by-step proof
- StorageDiff is initialized with chunker = utils.NewSetOnceblock.Chunker. During Init(), if OpenSeekable or Size fails, chunker.SetError(errMsg) is called.
- Later, TemplateBuild.Upload() runs. The memfile goroutine calls memfileDiff.Data().
- Inside Data(), b.chunker.Wait() returns the stored error immediately.
- Data() returns nil to the caller with no error.
- The caller checks: if data == nil { return nil }. Condition is true, returns nil just like NoDiff.
- The errgroup collects no errors. Upload() returns nil to the caller.
- The template is recorded as successfully built but the memfile was never written to GCS.
```go
	client *retryablehttp.Client
}

func NewUploader(ctx context.Context, bucketName, objectName string) (*Uploader, error) {
	creds, err := google.FindDefaultCredentials(ctx, "https://www.googleapis.com/auth/cloud-platform")
	if err != nil {
		return nil, fmt.Errorf("get credentials: %w", err)
	}

	token, err := creds.TokenSource.Token()
	if err != nil {
		return nil, fmt.Errorf("get token: %w", err)
	}

	return &Uploader{
		token:   token.AccessToken,
		baseURL: "https://" + bucketName + ".storage.googleapis.com/" + objectName,
		client:  httpClient,
	}, nil
}

func (u *Uploader) Upload(ctx context.Context, data []byte, maxConcurrency int) (int64, error) {
	uploadID, err := u.initiate(ctx)
```
🟣 The new gcpmultipart.NewUploader fetches an OAuth2 access token once via creds.TokenSource.Token() and stores it as a plain string. Access tokens expire after ~1 hour, so if a multipart upload runs longer than that, all subsequent PUT requests receive HTTP 401 with no retry path, since DefaultRetryPolicy does not retry 401. This is a pre-existing issue: the deleted gcp_multipart.go used the exact same pattern (token.AccessToken stored as a string), and this PR carries it forward unchanged. The fix is to store creds.TokenSource and call .Token() per request so the client library handles auto-refresh.
Extended reasoning...
What the bug is
NewUploader (uploader.go:77-80) calls creds.TokenSource.Token() once and stores the raw access token string in Uploader.token. Every subsequent doRequest() call sets "Authorization: Bearer "+u.token using this static string. Google OAuth2 access tokens have a fixed ~1-hour lifetime. Once the token expires, all part PUTs return HTTP 401 Unauthorized.
The specific code path
The retryablehttp client uses retryablehttp.DefaultRetryPolicy as its CheckRetry function. DefaultRetryPolicy only retries on connection errors, 429 (Too Many Requests), and 500-range server errors. HTTP 401 is not in that list. So once the token expires mid-upload, every remaining part PUT — including retries — will receive 401 and fail permanently. The upload returns an error with no recovery.
Why existing code does not prevent it
The token is captured as a plain string at construction time. There is no per-request token refresh. doRequest() has no mechanism to detect a 401 caused by token expiry versus a permissions error, and even if it did, DefaultRetryPolicy prevents the retry from happening. The OAuth2 client library TokenSource would handle caching and auto-refresh automatically if called per request, but only Token() is called once at construction.
Pre-existing nature
The deleted gcp_multipart.go had the identical pattern: NewMultipartUploaderWithRetryConfig called creds.TokenSource.Token() at construction and stored token.AccessToken as a plain string in MultipartUploader.token, used identically in all requests. The new gcpmultipart package is a rewrite but replicates the same token-lifetime semantics. Critically, NewUploader is called fresh per upload from storeData() (storage_google.go), so each upload begins with a fresh token. The window for expiry is only during a single upload lifetime.
Practical impact
At typical GCP network speeds (>100 MB/s) with 16 concurrent 50 MB parts, even a 10 GB file completes in minutes, well within the 1-hour window. The risk materializes only for extremely large files (hundreds of GB) or severely degraded network/retry conditions where total elapsed time exceeds ~1 hour.
Step-by-step proof
- NewUploader is called; creds.TokenSource.Token() returns a token with Expiry = now + 59 minutes.
- token.AccessToken is stored as u.token; the TokenSource is discarded.
- Upload begins; 16 goroutines issue concurrent PUT requests.
- Due to large file size and intermittent 500 errors requiring retries, elapsed time exceeds 60 minutes.
- All subsequent PUT requests still send the stale "Authorization: Bearer" header built from the expired token.
- GCS returns HTTP 401.
- DefaultRetryPolicy sees 401, returns (false, nil) — do not retry.
- Each part goroutine returns an error; errgroup cancels remaining parts.
- Upload fails permanently with no token-refresh recovery.
Fix
Store creds.TokenSource (not token.AccessToken) in the Uploader struct. In doRequest(), call tokenSource.Token() to get a fresh (or cached, if still valid) token before setting the Authorization header. The oauth2 package TokenSource implementations cache tokens and only fetch a new one when the current one is within the expiry window.
jakubno left a comment
Can we separate this into multiple PRs? 🙏🏻
Superseded by smaller split PRs: removing MD5/mutex first, then zero-copy upload path as follow-up.



Summary

- New multipart uploader package (`gcpmultipart`) with a single `Upload(data []byte)` entry point
- Zero-copy: parts are slices directly into the caller's mmap'd data
- New `StoreData(ctx, []byte)` to skip `os.ReadFile` for mmap'd cache files (`block.Cache` mmap → HTTP body)
- Supersedes #2378