fix(security): 2 improvements across 2 files#488
Conversation
- Security: Arbitrary Code Execution via Pickle Deserialization in launch_pickled_pipeline.py - Security: Shell Command Injection in get_num_slurm_jobs via f-string formatting Signed-off-by: tomaioo <203048277+tomaioo@users.noreply.github.com>
- Security: Arbitrary Code Execution via Pickle Deserialization in launch_pickled_pipeline.py - Security: Shell Command Injection in get_num_slurm_jobs via f-string formatting Signed-off-by: tomaioo <203048277+tomaioo@users.noreply.github.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 917057b. Configure here.
| command = ["squeue", "-u", os.environ["USER"], "-t", "pending,running", "--array", "--format=%.10t"] | ||
| if partition: | ||
| command.extend(["-p", partition]) | ||
| output = subprocess.check_output(command) |
There was a problem hiding this comment.
Removed wc -l breaks integer parsing of squeue output
High Severity
The old command piped squeue output through | wc -l to get a line count, and int(output.strip()) correctly parsed that number. The new command removed the wc -l pipe but still calls int(output.strip()) on the raw squeue output, which contains job state strings (e.g. "RUNNING", "PENDING") — not a number. This will raise a ValueError every time get_num_slurm_jobs is called. The output needs to be split into lines and counted (e.g., len(output.strip().splitlines())) instead of directly converted to int.
Reviewed by Cursor Bugbot for commit 917057b. Configure here.


Summary
fix(security): 2 improvements across 2 files
Problem
Severity:
Critical| File:src/datatrove/tools/launch_pickled_pipeline.py:L15The
launch_pickled_pipeline.pytool usesdill.load()to deserialize a pickled pipeline executor from an arbitrary file path provided via command-line argument. Sincedill(likepickle) can execute arbitrary code during deserialization, an attacker who can provide a malicious pickle file could achieve remote code execution. This is a classic deserialization vulnerability with no input validation or sandboxing.Solution
Avoid using dill/pickle for loading untrusted data. If this tool must exist, add warnings about only loading trusted files, implement digital signature verification before deserialization, or use a safer serialization format like JSON. Consider restricting file permissions and documenting the security risk prominently.
Changes
src/datatrove/tools/launch_pickled_pipeline.py(modified)src/datatrove/utils/jobs.py(modified)Note
Medium Risk
Pickle RCE remains possible despite the warning; Slurm job counting may no longer match the prior
wc -lbehavior when multiple jobs exist.Overview
Adds a
RuntimeWarningbeforedill.loadinlaunch_pickled_pipeline.pyso operators are explicitly told that unpickling can run arbitrary code and should only use trusted files. Deserialization behavior is unchanged; this is documentation/warning only, not sandboxing or signature checks.Refactors
get_num_slurm_jobsinjobs.pyto invokesqueuevia asubprocessargument list (andos.environ["USER"]) instead of ashell=Truestring that interpolated partition and piped throughwc, reducing shell-injection risk when partition or environment values are influenced by callers.Reviewed by Cursor Bugbot for commit 917057b. Bugbot is set up for automated code reviews on this repo. Configure here.