Skip to content

fix(s3): guard _put_many_files and _read_many_files against empty input#3194

Merged
npow merged 1 commit into
masterfrom
fix/s3_no_file
May 14, 2026
Merged

fix(s3): guard _put_many_files and _read_many_files against empty input#3194
npow merged 1 commit into
masterfrom
fix/s3_no_file

Conversation

@romain-intel
Copy link
Copy Markdown
Contributor

Both methods called s3op with an empty input file when given an empty iterator. s3op exits 0 in this case but writes an informational message to stderr ("Uploading 0 files.."). The if stderr: check treated this as an error and the error handler then crashed with IndexError trying to access url_info[0] or prefixes_and_ranges[0] on an empty list.

Trigger: the content-addressed store skips uploading blobs that already exist in S3 (deduplication). When all blobs from a step are already present, put_many is called with an empty iterator, hitting this bug on every re-run of the same flow with the same inputs.

Fix: return early from both methods when the input list is empty. There is nothing to do, so no subprocess needs to be spawned.

Add nine unit tests covering:

  • Correct empty return value for both methods
  • s3op not called on empty input
  • Regression: no IndexError when s3op would have written info to stderr
  • All s3op modes (get/list/info/put) safe with empty input
  • Non-empty input still takes the normal code path

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR fixes a crash in _put_many_files and _read_many_files that occurred when called with an empty iterator — a real-world trigger being the content-addressed store skipping all uploads on a cache hit. Both methods now return early before spawning any subprocess.

  • _read_many_files: bare return correctly stops the generator (yields nothing), fully compatible with all callers that iterate over the result.
  • _put_many_files: returns [] explicitly, matching the non-empty code path's list return type.
  • Nine unit tests are added covering empty returns, no-subprocess guarantees, regression documentation, all op modes, and the non-empty happy path.

Confidence Score: 5/5

Safe to merge — the fix is minimal, targeted, and backward-compatible with all existing callers.

Both guards are correct: _read_many_files is a generator function so a bare return simply stops iteration, and _put_many_files returns [] consistent with its non-empty list return type. All existing callers iterate over or directly return these values and handle the empty case naturally. The tests confirm no subprocess is spawned, the return values are correct, and the non-empty code path is unaffected.

No files require special attention.

Important Files Changed

Filename Overview
metaflow/plugins/datatools/s3/s3.py Adds two early-return guards (one in _read_many_files, one in _put_many_files) that exit immediately when the input list is empty, preventing IndexError in the error-handler branch downstream.
test/unit/test_s3_empty_input.py New test file with nine unit tests for the empty-input guards; covers empty-return values, no-subprocess assertions, regression documentation, all op modes, and non-empty happy paths.

Reviews (2): Last reviewed commit: "fix(s3): guard _put_many_files and _read..." | Re-trigger Greptile

Comment thread test/unit/test_s3_empty_input.py Outdated
Both methods called s3op with an empty input file when given an empty
iterator. s3op exits 0 in this case but writes an informational message
to stderr ("Uploading 0 files.."). The `if stderr:` check treated this
as an error and the error handler then crashed with IndexError trying to
access url_info[0] or prefixes_and_ranges[0] on an empty list.

Trigger: the content-addressed store skips uploading blobs that already
exist in S3 (deduplication). When all blobs from a step are already
present, put_many is called with an empty iterator, hitting this bug on
every re-run of the same flow with the same inputs.

Fix: return early from both methods when the input list is empty.
There is nothing to do, so no subprocess needs to be spawned.

Add nine unit tests covering:
- Correct empty return value for both methods
- s3op not called on empty input
- Regression: no IndexError when s3op would have written info to stderr
- All s3op modes (get/list/info/put) safe with empty input
- Non-empty input still takes the normal code path
@npow npow merged commit 23a45b0 into master May 14, 2026
45 checks passed
@npow npow deleted the fix/s3_no_file branch May 14, 2026 18:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants