Small micro optimisations in w3lib.http #246
Conversation
I used the following script for benchmarking:

```python
import timeit

from w3lib.http import headers_dict_to_raw, headers_raw_to_dict

# Sample headers for testing
headers_raw = (
    b"Content-Type: text/html\r\n"
    b"Accept: gzip\r\n"
    b"Cache-Control: no-cache\r\n"
    b"X-Test: value1\r\n"
    b"X-Test: value2\r\n"
)

headers_dict = {
    b"Content-Type": [b"text/html"],
    b"Accept": [b"gzip"],
    b"Cache-Control": [b"no-cache"],
    b"X-Test": [b"value1", b"value2"],
}

# Benchmark wrappers
def benchmark_headers_raw_to_dict():
    headers_raw_to_dict(headers_raw)

def benchmark_headers_dict_to_raw():
    headers_dict_to_raw(headers_dict)

# Run benchmarks
if __name__ == "__main__":
    raw_to_dict_time = timeit.timeit(benchmark_headers_raw_to_dict, number=1_000_000)
    dict_to_raw_time = timeit.timeit(benchmark_headers_dict_to_raw, number=1_000_000)
    print("Benchmark results (1,000,000 runs):")
    print(f"headers_raw_to_dict: {raw_to_dict_time:.4f} seconds")
    print(f"headers_dict_to_raw: {dict_to_raw_time:.4f} seconds")
```

On my machine, here are the best results (out of 10 runs):

Before changes:

After changes:
In my latest commit, headers_raw_to_dict now completes in approximately 2.004 seconds in the benchmark script I provided. I believe the speedup comes from reducing Python-level function calls by using BytesIO directly, which delegates line iteration to the underlying C implementation.
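For illustration, here is a minimal sketch of that approach (a simplified stand-in, not the exact PR code; `parse_headers` is a hypothetical name):

```python
from io import BytesIO

def parse_headers(headers_raw: bytes) -> dict[bytes, list[bytes]]:
    # Iterating over a BytesIO object yields one line at a time, with the
    # line splitting done by the C implementation rather than Python code.
    result: dict[bytes, list[bytes]] = {}
    for line in BytesIO(headers_raw):
        if b":" not in line:
            continue
        name, _, value = line.partition(b":")
        result.setdefault(name.strip(), []).append(value.strip())
    return result
```

With the sample headers_raw from the benchmark script, this returns a dict mapping each header name to a list of values, e.g. `{b"X-Test": [b"value1", b"value2"], ...}`.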
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master     #246      +/-   ##
==========================================
+ Coverage   97.92%   97.96%   +0.03%
==========================================
  Files           9        9
  Lines         483      491       +8
  Branches       78       83       +5
==========================================
+ Hits          473      481       +8
  Misses          6        6
  Partials        4        4
```
I don't have experience with pytest-benchmark, but if you have any specific ideas you are welcome to implement them; it's fine to do that as a separate PR.
Okay, will do, but I think this might require a major refactor of the tests or lead to quite a bit of code duplication, unfortunately. As a reference, I'll use the aiohttp benchmarking suite; they also use CodSpeed for generating nice benchmarking reports.
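For reference, a minimal pytest-benchmark test could look like the sketch below (hypothetical test name and sample data; it assumes pytest-benchmark is installed, which provides the `benchmark` fixture):

```python
from w3lib.http import headers_raw_to_dict

HEADERS_RAW = (
    b"Content-Type: text/html\r\n"
    b"Accept: gzip\r\n"
)

def test_headers_raw_to_dict_benchmark(benchmark):
    # pytest-benchmark injects the `benchmark` fixture, which calls the
    # target function repeatedly and reports timing statistics.
    benchmark(headers_raw_to_dict, HEADERS_RAW)
```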
…uce time of `headers_dict_to_raw` to 1.7883 seconds in benchmark
w3lib/http.py (outdated diff)

```python
if not headers_dict:
    return b""

parts = b""
```
Surprisingly, concatenating plain bytes is noticeably faster than using a bytearray here. I believe this is thanks to Python's "new" adaptive interpreter.
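One quick way to check that claim locally (a hypothetical micro-benchmark, not part of the PR):

```python
import timeit

CHUNKS = (b"X-Test: value\r\n",) * 100

def concat_bytes() -> bytes:
    # Repeated += on a plain bytes object, as in the PR.
    parts = b""
    for chunk in CHUNKS:
        parts += chunk
    return parts

def concat_bytearray() -> bytes:
    # The bytearray alternative; needs a final copy back to bytes.
    parts = bytearray()
    for chunk in CHUNKS:
        parts += chunk
    return bytes(parts)

print("bytes:    ", timeit.timeit(concat_bytes, number=50_000))
print("bytearray:", timeit.timeit(concat_bytearray, number=50_000))
```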
So, summing up, compared with the previous benchmark times, headers_raw_to_dict is approximately 20% faster, and headers_dict_to_raw is about 40% faster.
Do I need to fix the Codecov report with the missing base, or will the problem be automatically resolved after the merge?
Codecov has had some problems recently; I don't think we can do anything about that.
wRAR left a comment:
Thanks! Please let us know when you think it's ready for merging.
I believe it's ready to merge. I've done my best on these functions and don't plan any further changes :)
This PR includes some small micro-optimizations in w3lib.http. They're not huge changes, but they help improve performance a bit here and there.
I'm really thankful for all the amazing work done by the Scrapy team and contributors — I’ve learned a lot from the ecosystem. This PR is my small way of saying thanks.
I'm currently very into micro-optimizations, but I hope to keep contributing more (and maybe more useful things!) over time.
Also, please let me know: should I add a pytest-benchmark suite, or is a simple synthetic benchmark per PR enough?
Thanks for your time!