spacy download --url is silently ignored or rejected — the custom URL flag never works #13963

@shaun0927

How to reproduce the behaviour

The --url flag added in #13848 to python -m spacy download can never download from the supplied URL. Either the user-supplied URL is silently replaced with the default GitHub URL, or the download is rejected by the guard that runs after the download URL is built.

import sys
import importlib
from unittest.mock import patch

import spacy
from spacy import about

# `from spacy.cli import download` resolves to the *function*, not the module,
# because of an alias in cli/__init__.py. Force-import the module:
importlib.import_module("spacy.cli.download")
dl = sys.modules["spacy.cli.download"]

captured = {}
def fake_run(cmd):
    captured["url"] = next(arg for arg in cmd if arg.startswith("http"))

# Case 1: --url WITHOUT trailing slash
captured.clear()
with patch.object(dl, "run_command", fake_run), \
     patch.object(dl, "_get_pip_install_cmd", lambda: ["pip", "install"]):
    dl.download_model("foo-1.0.tar.gz", custom_url="https://my-mirror.example.com/models")
print(captured["url"])
# → https://github.com/explosion/spacy-models/releases/download/foo-1.0.tar.gz
#   (the user's mirror was silently discarded)

# Case 2: --url WITH trailing slash
captured.clear()
try:
    with patch.object(dl, "run_command", fake_run), \
         patch.object(dl, "_get_pip_install_cmd", lambda: ["pip", "install"]):
        dl.download_model("foo-1.0.tar.gz", custom_url="https://my-mirror.example.com/models/")
except ValueError as err:
    # Catch the error so the script finishes and shows the message.
    print(f"{type(err).__name__}: {err}")
# → ValueError: Download from foo-1.0.tar.gz rejected. Was it a relative path?

Root cause

spacy/cli/download.py:180-186:

base_url = custom_url if custom_url else about.__download_url__
# urljoin requires that the path ends with /, or the last path part will be dropped
if not base_url.endswith("/"):
    base_url = about.__download_url__ + "/"      # ← clobbers custom_url
download_url = urljoin(base_url, filename)
if not download_url.startswith(about.__download_url__):
    raise ValueError(f"Download from {filename} rejected. Was it a relative path?")
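
The urljoin behaviour that the comment in this snippet refers to can be checked in isolation. This is a minimal sketch using only the standard library; the mirror URL is the hypothetical one from the reproduction, not a real endpoint.

```python
from urllib.parse import urljoin

# Hypothetical mirror URL, same as in the reproduction.
mirror = "https://my-mirror.example.com/models"
filename = "foo-1.0.tar.gz"

# Without a trailing slash, urljoin treats "models" as the last path
# segment and replaces it with the filename.
without_slash = urljoin(mirror, filename)

# With a trailing slash, the filename is resolved underneath "models/".
with_slash = urljoin(mirror + "/", filename)

print(without_slash)  # https://my-mirror.example.com/foo-1.0.tar.gz
print(with_slash)     # https://my-mirror.example.com/models/foo-1.0.tar.gz
```

So a trailing slash does need to be added, but to the base URL that was actually selected, which is not what the snippet does.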

Two interlocking defects:

  1. Line 183 replaces base_url with the default URL whenever the input lacks a trailing slash, instead of appending a slash to it, so the user's choice is discarded.
  2. The startswith(about.__download_url__) guard on line 185 rejects any custom URL that survives line 183, because a custom mirror by definition does not start with the GitHub URL.

Result: the --url flag cannot reach a non-default URL under any input.
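
One possible direction for a fix, sketched outside spaCy so it runs on its own. This is not necessarily what the incoming PR does. The helper name build_download_url and the constant DEFAULT_DOWNLOAD_URL are hypothetical; the constant stands in for about.__download_url__, with its value inferred from the URL printed in Case 1.

```python
from typing import Optional
from urllib.parse import urljoin

# Assumption: stand-in for spacy.about.__download_url__, inferred from Case 1.
DEFAULT_DOWNLOAD_URL = "https://github.com/explosion/spacy-models/releases/download"


def build_download_url(filename: str, custom_url: Optional[str] = None) -> str:
    """Hypothetical helper: resolve filename against the selected base URL."""
    base_url = custom_url if custom_url else DEFAULT_DOWNLOAD_URL
    # Append the slash to whichever base was selected instead of
    # swapping in the default URL.
    if not base_url.endswith("/"):
        base_url += "/"
    download_url = urljoin(base_url, filename)
    # Check against the selected base, so absolute or "../" filenames
    # are still rejected for custom mirrors as well as for the default.
    if not download_url.startswith(base_url):
        raise ValueError(f"Download from {filename} rejected. Was it a relative path?")
    return download_url


# Both spellings of the mirror URL now resolve to the same location.
print(build_download_url("foo-1.0.tar.gz", "https://my-mirror.example.com/models"))
print(build_download_url("foo-1.0.tar.gz", "https://my-mirror.example.com/models/"))
# → https://my-mirror.example.com/models/foo-1.0.tar.gz (both calls)

# The default is unchanged when no custom URL is given.
print(build_download_url("foo-1.0.tar.gz"))
# → https://github.com/explosion/spacy-models/releases/download/foo-1.0.tar.gz
```

Comparing against the selected base rather than the GitHub URL keeps the original intent of the guard (a filename such as "../other.tar.gz" still raises ValueError) while letting a custom mirror through.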

Impact

Users in air-gapped or mirrored environments (the exact users the feature was added for, per #13848) who pass a URL without a trailing slash believe their downloads come from their mirror, while the requests are silently dispatched to github.com. Network-layer egress blocking, where operators have it in place, is then the only remaining safeguard.

Your Environment

  • spaCy version: 3.8.13 (also reproduces on master HEAD, prep for 3.8.14)
  • Python version: 3.12
  • Platform: macOS / Linux

PR: shaun0927/spaCy@fix/download-custom-url (incoming).
