fix(url): canonicalize_url('http://您好.中国:80/') failed by hidva · Pull Request #98 · scrapy/w3lib

hidva · 2017-09-14T08:45:06Z

expect: 'http://xn--5usr0o.xn--fiqs8s:80/'
actual: 'http://xn--5usr0o.xn--:80-u68dy61b/'
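
The mismatch can be reproduced outside w3lib with a minimal sketch. It assumes the old code path amounts to IDNA-encoding the whole netloc (host plus port) in one call, as the `parts.netloc.encode('idna')` line in the diff suggests. The names `naive` and `host_only` are illustrative and not part of w3lib.

```python
# -*- coding: utf-8 -*-
netloc = u'您好.中国:80'

# Encoding the whole netloc treats "中国:80" as a single label,
# so the port digits end up inside the Punycode output.
naive = netloc.encode('idna').decode('ascii')

# Encoding only the host and re-attaching the port keeps ":80" intact.
host, _, port = netloc.rpartition(u':')
host_only = host.encode('idna').decode('ascii') + u':' + port

print(naive)
print(host_only)
```

Splitting on the last colon is only one way to separate host and port; it does not account for userinfo or bracketed IPv6 literals.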

w3lib/url.py

codecov · 2017-09-14T09:11:35Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (master@34435d0). Learn more about missing BASE report.
Report is 279 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##             master      #98   +/-   ##
=========================================
  Coverage          ?   95.61%           
=========================================
  Files             ?        7           
  Lines             ?      593           
  Branches          ?      132           
=========================================
  Hits              ?      567           
  Misses            ?       17           
  Partials          ?        9

Files with missing lines	Coverage Δ
w3lib/url.py	`97.21% <100.00%> (ø)`

hidva · 2017-09-14T09:19:42Z

ERROR: pypy: InterpreterNotFound: pypy

redapple · 2017-10-05T13:06:40Z

Good catch, @pp-qq!
This also affects safe_url_string(), which would be fixed too by your change to _safe_ParseResult().
This change needs an update to the tests. Could you add one or two for this case?

redapple · 2017-10-05T13:08:47Z

w3lib/url.py

    # or missing labels (e.g. http://.example.com)
    try:
-        netloc = parts.netloc.encode('idna')
+        idx = parts.netloc.rfind(u':')


What do you think of using .hostname and .port attributes from the SplitResult instead of looking for a port number in netloc?
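
For reference, a small sketch of what those attributes yield for the URL from this PR, using the Python 3 standard library (`urllib.parse.urlsplit`). One point to keep in mind is that `.hostname` and `.port` carry no userinfo, so a netloc rebuilt from them alone would drop any `user:password@` prefix. The URL with credentials below is made up for illustration.

```python
from urllib.parse import urlsplit

parts = urlsplit('http://您好.中国:80/')
print(parts.netloc)    # host and port together, as one string
print(parts.hostname)  # host only
print(parts.port)      # port parsed to an int

# Hypothetical URL with credentials: hostname/port do not include them.
with_auth = urlsplit('http://user:secret@example.com:8080/')
print(with_auth.netloc, with_auth.hostname, with_auth.port, with_auth.username)
```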

redapple · 2017-10-05T13:19:50Z

@pp-qq, the issue with pypy on Travis is unrelated. I'm trying to fix it in #99.

hidva · 2017-10-19T03:47:19Z

ERROR: pypy: InterpreterNotFound: pypy
ERROR: The command "pip install -U tox twine wheel codecov" failed and exited with 2 during .

redapple · 2017-10-26T08:47:31Z

w3lib/url.py

+    :rtype: unicode
+    """
+    try:
+        idx = onetloc.rfind(u':')


Thanks for the tests.
My earlier suggestion in #98 (comment) still holds, though.
Have you considered using SplitResult .hostname and .port attributes?

hidva · 2017-10-28

Current behavior:

    # Calling this function on an already "safe" URL will return the URL unmodified.
    >>> w3lib.url.safe_url_string('https://www.baIdu.com')
    'https://www.baIdu.com'
    # I
    >>> w3lib.url.canonicalize_url('http://www.baidu.com:80000000')
    'http://www.baidu.com:80000000/'

When _encode_netloc() uses the SplitResult .hostname and .port attributes:

    >>> w3lib.url.safe_url_string('https://www.baIdu.com')
    'https://www.baidu.com'
    >>> w3lib.url.canonicalize_url('http://www.baidu.com:80000000')
    'http://www.baidu.com/'

because the Python 3.5 documentation says:

hostname: Host name (lower case)...

Is this what we want?

redapple · 2017-10-28

I'm fine with lowercasing the hostname. But it is indeed a change in behavior.

redapple · 2017-10-28

    >>> w3lib.url.canonicalize_url('http://www.baidu.com:80000000')
    'http://www.baidu.com/'

What's the reason for the "lost" port number?

hidva · 2017-10-28

Python 3.5 says: port is None when the value is not valid:

    Python 2.7.12 (default, Nov 19 2016, 06:48:10)
    [GCC 5.4.0 20160609] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from six.moves.urllib.parse import (urljoin, urlsplit, urlunsplit,
    ...                                     urldefrag, urlencode, urlparse,
    ...                                     quote, parse_qs, parse_qsl,
    ...                                     ParseResult, unquote, urlunparse)
    >>>
    >>> t = urlparse('http://www.BaidU.com:80')  # valid port
    >>> t.hostname, t.port
    ('www.baidu.com', 80)
    >>> t = urlparse('http://www.BaidU.com:65536')  # invalid port
    >>> t.hostname, t.port
    ('www.baidu.com', None)

And the new _encode_netloc():

    def _encode_netloc(parts):
        """
        :type parts: ParseResult
        :rtype: unicode
        """
        try:
            hostname = to_unicode(parts.hostname.encode('idna'))
            netloc = hostname if parts.port is None else '%s:%s' % (hostname, parts.port)
            # lost the port number if parts.port is None
        except UnicodeError:
            netloc = parts.netloc
        return netloc

redapple · 2017-11-23

It's a shame that urlsplit and urlparse in Python 2.7 and 3.5 simply swallow an invalid port number.
Python 3.6 does report an error:

    $ python
    Python 3.6.2 (default, Aug 24 2017, 10:48:24)
    [GCC 6.3.0 20170406] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from urllib.parse import urlsplit
    >>> t = urlsplit('http://www.BaidU.com:654444')
    >>> t.port
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.6/urllib/parse.py", line 169, in port
        raise ValueError("Port out of range 0-65535")
    ValueError: Port out of range 0-65535
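
One way the two behaviors discussed above could be reconciled is sketched below. This is not the code from the PR: `encode_netloc_sketch` is a hypothetical helper, and it assumes Python 3.6 or later, where reading `.port` raises `ValueError` for an invalid port. In that case it returns the netloc unchanged instead of silently dropping the port.

```python
from urllib.parse import urlsplit


def encode_netloc_sketch(url):
    """Hypothetical helper, not w3lib's code: IDNA-encode the host of `url`."""
    parts = urlsplit(url)
    try:
        # On Python 3.6+, reading .port raises ValueError for a bad port.
        port = parts.port
        hostname = parts.hostname
        if hostname is None:
            return parts.netloc
        # UnicodeError (a ValueError subclass) covers labels IDNA rejects.
        encoded = hostname.encode('idna').decode('ascii')
    except ValueError:
        # Leave the netloc untouched rather than silently dropping the port.
        return parts.netloc
    return encoded if port is None else '%s:%d' % (encoded, port)


print(encode_netloc_sketch('http://您好.中国:80/'))
print(encode_netloc_sketch('http://www.baidu.com:80000000'))
```

The sketch still lowercases the hostname, and it drops userinfo and the brackets of IPv6 literals, so it illustrates the port handling only. On Python 2.7 and 3.5, where `.port` returns None for an invalid value, the invalid port would still be lost.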

nandgator · 2022-07-20T06:44:56Z

Bumping to close outdated PR.

hidva force-pushed the master branch from b67cda7 to 749734c Compare September 14, 2017 09:11

fix(url): canonicalize_url('http://您好.中国:80/') failed

3312f87

expect: 'http://xn--5usr0o.xn--fiqs8s:80/' actual: 'http://xn--5usr0o.xn--:80-u68dy61b/'

hidva force-pushed the master branch from 749734c to 3312f87 Compare September 14, 2017 09:12

redapple reviewed Oct 5, 2017

pp-qq added 3 commits October 19, 2017 11:12

fix(url): canonicalize_url('http://您好.中国:80/') failed

d6934da

expect: 'http://xn--5usr0o.xn--fiqs8s:80/' actual: 'http://xn--5usr0o.xn--:80-u68dy61b/'

fix(url): fix safe_url_string() with the change to `_safe_ParseResult()`

0606f77

test(url): add test cases

90037c7

redapple suggested changes Oct 26, 2017

Gallaecio closed this Aug 8, 2019

Gallaecio reopened this Aug 8, 2019

Gallaecio approved these changes Aug 12, 2019
