[Bug]: Tar path traversal (Zip Slip) in decompress_to_cache — arbitrary file write outside cache directory #626

@CrepuscularIRIS

Summary

The decompress_to_cache method uses tarfile.extractall() without sanitizing member paths, making it vulnerable to Zip Slip / Tar Slip (CVE-2007-4559 class). A malicious tar archive can write files to arbitrary locations outside cache_dir.

Note: This is related to but distinct from #327, which tracks the DeprecationWarning/platform compatibility aspect. This issue specifically addresses the security vulnerability — path traversal allowing arbitrary file writes.

Affected Code

File: fastembed/common/model_management.py, lines 304–311

@classmethod
def decompress_to_cache(cls, targz_path: str, cache_dir: str) -> str:
    # ...
    with tarfile.open(targz_path, "r:gz") as tar:
        tar.extractall(
            path=cache_dir,   # No filter, no member sanitization
        )

Reproduction

import io
import os
import tarfile
import tempfile

from fastembed.common.model_management import ModelManagement

# Create a malicious tar that writes outside the intended directory
with tempfile.NamedTemporaryFile(suffix='.tar.gz', delete=False) as f:
    evil_tar = f.name

with tarfile.open(evil_tar, 'w:gz') as tar:
    payload = b"PWNED"
    info = tarfile.TarInfo(name="../../tmp/fastembed_pwned.txt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Call decompress_to_cache with this tar.
# Assumes mkdtemp() returns a directory directly under /tmp (typical on Linux),
# so "../../tmp/..." resolves to /tmp/fastembed_pwned.txt.
cache_dir = tempfile.mkdtemp()
ModelManagement.decompress_to_cache(evil_tar, cache_dir)

# File written outside cache_dir (on Python versions where extractall()
# applies no filter by default):
print(os.path.exists("/tmp/fastembed_pwned.txt"))  # True
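
For anyone who would rather confirm the behavior without installing fastembed or writing to the real /tmp, the same traversal can be sketched inside a throwaway sandbox using only the standard library. The `demo_traversal` function and the `escaped.txt` name are illustrative and not part of fastembed. The explicit `filter="fully_trusted"` is an assumption made only to reproduce unfiltered extraction on interpreters whose default is stricter.

```python
import io
import os
import tarfile
import tempfile


def demo_traversal():
    """Extract a one-member archive and report where the file ended up."""
    sandbox = tempfile.mkdtemp()
    cache_dir = os.path.join(sandbox, "cache")  # stands in for fastembed's cache_dir
    os.mkdir(cache_dir)
    archive = os.path.join(sandbox, "evil.tar.gz")

    # The member name climbs one level above the extraction directory.
    payload = b"PWNED"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="../escaped.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "fully_trusted_filter"):
            # Force the legacy, unfiltered behavior where filters exist.
            tar.extractall(path=cache_dir, filter="fully_trusted")
        else:
            # Interpreters without extraction filters are always unfiltered.
            tar.extractall(path=cache_dir)

    escaped = os.path.exists(os.path.join(sandbox, "escaped.txt"))
    return escaped, os.listdir(cache_dir)


if __name__ == "__main__":
    escaped, cache_listing = demo_traversal()
    print("written outside cache_dir:", escaped)
    print("contents of cache_dir:", cache_listing)
```

The file lands next to `cache`, not inside it, which is the same effect the fastembed reproduction shows at a larger scale.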

Attack Surface

  • Custom model URLs via add_custom_model() pointing to attacker-controlled servers
  • Compromised HuggingFace repos or GCS buckets (supply chain attack)
  • MITM on HTTP redirects

Impact

  • Arbitrary file write to any path writable by the process (SSH keys, cron jobs, Python packages, shell configs)
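  • The outcome depends on the interpreter version: Python 3.14 changes the default extraction filter to 'data', so the same unfiltered call typically permits traversal on earlier versions and may start rejecting existing archives (for example, ones containing absolute symlinks) on 3.14 and later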

Suggested Fix

Add path filtering to block traversal:

@classmethod
def decompress_to_cache(cls, targz_path: str, cache_dir: str) -> str:
    # ...
    with tarfile.open(targz_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Python 3.12+ (and security backports to older releases):
            # the 'data' filter rejects traversal, absolute paths, and unsafe links
            tar.extractall(path=cache_dir, filter="data")
        else:
            # Interpreters without extraction filters: validate every member first
            root = os.path.realpath(cache_dir)
            for member in tar.getmembers():
                # Conservative choice: the path check alone cannot vet link targets
                if member.issym() or member.islnk():
                    raise ValueError(f"Unsafe tar member (link): {member.name}")
                member_path = os.path.realpath(os.path.join(root, member.name))
                if os.path.commonpath([root, member_path]) != root:
                    raise ValueError(f"Unsafe tar member path: {member.name}")
            tar.extractall(path=cache_dir)
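
A fix along these lines could be covered by a small regression check. The following is a standalone sketch of that idea, not fastembed's code: `safe_extract`, `write_archive`, and `is_blocked` are hypothetical names, and the fallback branch assumes that model archives only need regular files and directories.

```python
import io
import os
import tarfile
import tempfile


def safe_extract(archive_path, dest_dir):
    """Extract a .tar.gz into dest_dir, refusing members that would land outside it."""
    root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Filters available: let the standard library enforce the policy.
            tar.extractall(path=root, filter="data")
            return root
        # No filter support: check every member before writing anything.
        members = tar.getmembers()
        for member in members:
            if not (member.isfile() or member.isdir()):
                raise ValueError(f"Unsupported tar member type: {member.name}")
            target = os.path.realpath(os.path.join(root, member.name))
            if os.path.commonpath([root, target]) != root:
                raise ValueError(f"Unsafe tar member path: {member.name}")
        tar.extractall(path=root, members=members)
    return root


def write_archive(archive_path, member_name, payload=b"data"):
    """Create a one-member .tar.gz with the given member name."""
    with tarfile.open(archive_path, "w:gz") as tar:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))


def is_blocked(member_name):
    """Return True if safe_extract refuses an archive containing member_name."""
    sandbox = tempfile.mkdtemp()
    cache_dir = os.path.join(sandbox, "cache")
    os.mkdir(cache_dir)
    archive = os.path.join(sandbox, "sample.tar.gz")
    write_archive(archive, member_name)
    try:
        safe_extract(archive, cache_dir)
    except (tarfile.TarError, ValueError):
        return True
    return False


if __name__ == "__main__":
    print("traversal member blocked:", is_blocked("../escaped.txt"))
    print("benign member blocked:", is_blocked("model/weights.bin"))
```

Checking for `tarfile.data_filter` instead of catching `TypeError` keeps unrelated type errors from silently routing into the fallback path.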

Found via automated codebase analysis. Confirmed independently by three reviewers (Claude, Codex, Gemini).
