fix(security): 2 improvements across 2 files by tomaioo · Pull Request #488 · huggingface/datatrove

tomaioo · 2026-05-26T12:33:13Z

Summary

fix(security): 2 improvements across 2 files

Problem

Severity: Critical | File: src/datatrove/tools/launch_pickled_pipeline.py:L15

The launch_pickled_pipeline.py tool uses dill.load() to deserialize a pickled pipeline executor from an arbitrary file path provided via command-line argument. Since dill (like pickle) can execute arbitrary code during deserialization, an attacker who can provide a malicious pickle file could achieve remote code execution. This is a classic deserialization vulnerability with no input validation or sandboxing.

Solution

Avoid using dill/pickle for loading untrusted data. If this tool must exist, add warnings about only loading trusted files, implement digital signature verification before deserialization, or use a safer serialization format like JSON. Consider restricting file permissions and documenting the security risk prominently.

Changes

src/datatrove/tools/launch_pickled_pipeline.py (modified)
src/datatrove/utils/jobs.py (modified)

Note

Medium Risk
Pickle RCE remains possible despite the warning; Slurm job counting may no longer match the prior wc -l behavior when multiple jobs exist.

Overview
Adds a RuntimeWarning before dill.load in launch_pickled_pipeline.py so operators are explicitly told that unpickling can run arbitrary code and should only use trusted files. Deserialization behavior is unchanged; this is documentation/warning only, not sandboxing or signature checks.

Refactors get_num_slurm_jobs in jobs.py to invoke squeue via a subprocess argument list (and os.environ["USER"]) instead of a shell=True string that interpolated partition and piped through wc, reducing shell-injection risk when partition or environment values are influenced by callers.

^{Reviewed by Cursor Bugbot for commit 917057b. Bugbot is set up for automated code reviews on this repo. Configure here.}

- Security: Arbitrary Code Execution via Pickle Deserialization in launch_pickled_pipeline.py - Security: Shell Command Injection in get_num_slurm_jobs via f-string formatting Signed-off-by: tomaioo <203048277+tomaioo@users.noreply.github.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 917057b. Configure here.}

cursor · 2026-05-26T12:34:33Z

+    command = ["squeue", "-u", os.environ["USER"], "-t", "pending,running", "--array", "--format=%.10t"]
+    if partition:
+        command.extend(["-p", partition])
+    output = subprocess.check_output(command)


Removed wc -l breaks integer parsing of squeue output

High Severity

The old command piped squeue output through | wc -l to get a line count, and int(output.strip()) correctly parsed that number. The new command removed the wc -l pipe but still calls int(output.strip()) on the raw squeue output, which contains job state strings (e.g. "RUNNING", "PENDING") — not a number. This will raise a ValueError every time get_num_slurm_jobs is called. The output needs to be split into lines and counted (e.g., len(output.strip().splitlines())) instead of directly converted to int.

^{Reviewed by Cursor Bugbot for commit 917057b. Configure here.}

tomaioo added 2 commits May 26, 2026 05:33

cursor Bot reviewed May 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(security): 2 improvements across 2 files#488

fix(security): 2 improvements across 2 files#488
tomaioo wants to merge 2 commits into
huggingface:mainfrom
tomaioo:fix/security/arbitrary-code-execution-via-pickle-dese

tomaioo commented May 26, 2026 •

edited by cursor Bot

Loading

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tomaioo commented May 26, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Changes

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot May 26, 2026

Choose a reason for hiding this comment

Removed wc -l breaks integer parsing of squeue output

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tomaioo commented May 26, 2026 •

edited by cursor Bot

Loading

Removed `wc -l` breaks integer parsing of squeue output