Skip to content

fix(security): 2 improvements across 2 files#488

Open
tomaioo wants to merge 2 commits into
huggingface:mainfrom
tomaioo:fix/security/arbitrary-code-execution-via-pickle-dese
Open

fix(security): 2 improvements across 2 files#488
tomaioo wants to merge 2 commits into
huggingface:mainfrom
tomaioo:fix/security/arbitrary-code-execution-via-pickle-dese

Conversation

@tomaioo

@tomaioo tomaioo commented May 26, 2026

Copy link
Copy Markdown

Summary

fix(security): 2 improvements across 2 files

Problem

Severity: Critical | File: src/datatrove/tools/launch_pickled_pipeline.py:L15

The launch_pickled_pipeline.py tool uses dill.load() to deserialize a pickled pipeline executor from an arbitrary file path provided via command-line argument. Since dill (like pickle) can execute arbitrary code during deserialization, an attacker who can provide a malicious pickle file could achieve remote code execution. This is a classic deserialization vulnerability with no input validation or sandboxing.

Solution

Avoid using dill/pickle for loading untrusted data. If this tool must exist, add warnings about only loading trusted files, implement digital signature verification before deserialization, or use a safer serialization format like JSON. Consider restricting file permissions and documenting the security risk prominently.

Changes

  • src/datatrove/tools/launch_pickled_pipeline.py (modified)
  • src/datatrove/utils/jobs.py (modified)

Note

Medium Risk
Pickle RCE remains possible despite the warning; Slurm job counting may no longer match the prior wc -l behavior when multiple jobs exist.

Overview
Adds a RuntimeWarning before dill.load in launch_pickled_pipeline.py so operators are explicitly told that unpickling can run arbitrary code and should only use trusted files. Deserialization behavior is unchanged; this is documentation/warning only, not sandboxing or signature checks.

Refactors get_num_slurm_jobs in jobs.py to invoke squeue via a subprocess argument list (and os.environ["USER"]) instead of a shell=True string that interpolated partition and piped through wc, reducing shell-injection risk when partition or environment values are influenced by callers.

Reviewed by Cursor Bugbot for commit 917057b. Bugbot is set up for automated code reviews on this repo. Configure here.

tomaioo added 2 commits May 26, 2026 05:33
- Security: Arbitrary Code Execution via Pickle Deserialization in launch_pickled_pipeline.py
- Security: Shell Command Injection in get_num_slurm_jobs via f-string formatting

Signed-off-by: tomaioo <203048277+tomaioo@users.noreply.github.com>
- Security: Arbitrary Code Execution via Pickle Deserialization in launch_pickled_pipeline.py
- Security: Shell Command Injection in get_num_slurm_jobs via f-string formatting

Signed-off-by: tomaioo <203048277+tomaioo@users.noreply.github.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 917057b. Configure here.

command = ["squeue", "-u", os.environ["USER"], "-t", "pending,running", "--array", "--format=%.10t"]
if partition:
command.extend(["-p", partition])
output = subprocess.check_output(command)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed wc -l breaks integer parsing of squeue output

High Severity

The old command piped squeue output through | wc -l to get a line count, and int(output.strip()) correctly parsed that number. The new command removed the wc -l pipe but still calls int(output.strip()) on the raw squeue output, which contains job state strings (e.g. "RUNNING", "PENDING") — not a number. This will raise a ValueError every time get_num_slurm_jobs is called. The output needs to be split into lines and counted (e.g., len(output.strip().splitlines())) instead of directly converted to int.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 917057b. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant