perf: skip URL round-trip when path is already safe#14045
Conversation
_ensure_quoted_url() calls quote(unquote()) wrapped by an urlunsplit(urlsplit()) round-trip for remote URLs. This is quite slow. The good news is that we can skip all of this safely for http(s):// URLs by quickly scanning the URL for any characters that need to be quoted and %-escapes. If there aren't any, then return the URL unmodified immediately. In an (admittedly) unscientific test, this reduces the total wall time spent in collector.parse_links() by ~half, from 120ms to 50ms for pip install black setuptools mypy --dry-run as measured under Python's sampling profiler. The remaining time is spent on JSON parsing and Link instance initialization. A new test_ensure_quoted_url_idempotent_for_clean_urls() asserts that the fast path is a true identity on the URL shapes Warehouse and common simple API responses serve. Credit goes to Bernát Gábor who came up with this optimization. Co-authored-by: Bernat Gabor <gaborjbernat@gmail.com>
sepehr-rs
left a comment
There was a problem hiding this comment.
Hi @ichard26, thank you for working on this!
I benchmarked this against main using real URLs captured from a live pip install black mypy --dry-run run (so the fast path coverage reflects actual PyPI traffic, not synthetic data).
Microbenchmark:
main:
------------------------------------------------- benchmark: 1 tests -------------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
----------------------------------------------------------------------------------------------------------------------
test_benchmark_urls 459.0823 486.0665 472.4067 11.9938 468.1179 21.1751 2;0 2.1168 5 1
----------------------------------------------------------------------------------------------------------------------PR:
----------------------------------------------- benchmark: 1 tests ----------------------------------------------
Name (time in ms) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------
test_benchmark_urls 41.4362 43.5073 42.3067 0.5526 42.4556 0.5768 7;1 23.6369 23 1
-----------------------------------------------------------------------------------------------------------------~46x faster compared to main.
End-to-end: pip install black mypy --no-deps --dry-run with a local index serving real PyPI simple pages, 30 runs:
main:
Time (mean ± σ): 610.0 ms ± 4.8 ms [User: 547.1 ms, System: 63.9 ms]
Range (min … max): 603.2 ms … 624.4 ms 30 runsPR:
Time (mean ± σ): 574.5 ms ± 8.4 ms [User: 510.9 ms, System: 64.2 ms]
Range (min … max): 565.1 ms … 599.5 ms 30 runs−35.5 ms (6%). Ranges don't overlap so the improvement is statistically solid.
LGTM from me.
|
@sepehr-rs as a general note, it's good practice to include your benchmark commands and/or scripts when citing benchmark results. I don't know what
This seems obviously wrong? It looks closer to 10x faster. Also, where is 460ms coming from? Your end-to-end run on main takes only 610ms and I can promise you that URL parsing did not take 2/3 of the total runtime. |
|
@ichard26, You're right, my apologies for the sloppy math. It's closer to 10x. I should have included the benchmark script from the start. Here's a link to it: I used ~5,800 URLs captured from a live dry run to amplify the signal, which explains the 460ms baseline, that's not representative of a single pip install invocation. |
|
URLs are tricky and sensitive, I asked an AI to fuzz this and this is what it came back with (the code details could be tweaked but the tests look good): The fast path can fire on empty-authority URLs. Fix: require a non-empty authority and an explicit path ( if url.startswith(("https://", "http://")):
after_scheme = url.removeprefix("https:").removeprefix("http:") # "//..."
if after_scheme[2:].find("/") > 0 and _UNSAFE_URL_PATH_CHARS_RE.search(after_scheme) is None:
return urlDiff-tested this against the slow path across 3.9-3.15 against ~29k fuzzed URLs per version, this has zero divergences and fast-paths 100% of PyPI file URLs. Tests to add. Should take the fast path (returned unchanged): Must not take the fast path (current |
_ensure_quoted_url()callsquote(unquote())wrapped by anurlunsplit(urlsplit())round-trip for remote URLs. This is quite slow.The good news is that we can skip all of this safely for http(s):// URLs by quickly scanning the URL for any characters that need to be quoted and %-escapes. If there aren't any, then return the URL unmodified immediately.
In an (admittedly) unscientific test, this reduces the total wall time spent in
collector.parse_links()by ~half, from 120ms to 50ms foras measured under Python's sampling profiler. The remaining time is spent on JSON parsing and Link instance initialization.
Flamegraphs
A new
test_ensure_quoted_url_idempotent_for_clean_urls()asserts that the fast path is a true identity on the URL shapes Warehouse and common simple API responses serve.Credit goes to Bernát Gábor who came up with this optimization.
This supersedes PR #13986. (closes #13986)