fix(url): canonicalize_url('http://您好.中国:80/') failed#98
fix(url): canonicalize_url('http://您好.中国:80/') failed#98hidva wants to merge 4 commits intoscrapy:masterfrom
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #98 +/- ##
=========================================
Coverage ? 95.61%
=========================================
Files ? 7
Lines ? 593
Branches ? 132
=========================================
Hits ? 567
Misses ? 17
Partials ? 9
🚀 New features to boost your workflow:
|
|
ERROR: pypy: InterpreterNotFound: pypy |
|
A good catch @pp-qq ! |
w3lib/url.py
Outdated
| # or missing labels (e.g. http://.example.com) | ||
| try: | ||
| netloc = parts.netloc.encode('idna') | ||
| idx = parts.netloc.rfind(u':') |
There was a problem hiding this comment.
What do you think of using .hostname and .port attributes from the SplitResult instead of looking for a port number in netloc?
|
ERROR: pypy: InterpreterNotFound: pypy |
| :rtype: unicode | ||
| """ | ||
| try: | ||
| idx = onetloc.rfind(u':') |
There was a problem hiding this comment.
Thanks for the tests.
My #98 (comment) still holds though.
Have you considered using SplitResult .hostname and .port attributes?
There was a problem hiding this comment.
Current behavior:
# Calling this function on an already "safe" URL will return the URL unmodified.
>>> w3lib.url.safe_url_string('https://www.baIdu.com')
'https://www.baIdu.com' # I
>>> w3lib.url.canonicalize_url('http://www.baidu.com:80000000')
'http://www.baidu.com:80000000/'when _encode_netloc() using SplitResult .hostname and .port attributes:
>>> w3lib.url.safe_url_string('https://www.baIdu.com')
'https://www.baidu.com'
>>> w3lib.url.canonicalize_url('http://www.baidu.com:80000000')
'http://www.baidu.com/'because python 3.5 says:
hostnameis Host name (lower case)...
Is this what we want?
There was a problem hiding this comment.
I'm fine with lowercasing the hostname. But it is indeed a change in behavior.
There was a problem hiding this comment.
>>> w3lib.url.canonicalize_url('http://www.baidu.com:80000000')
'http://www.baidu.com/'
what's the reason for the "lost" port number?
There was a problem hiding this comment.
python 3.5 says: port is None when value if not valid:
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from six.moves.urllib.parse import (urljoin, urlsplit, urlunsplit,
... urldefrag, urlencode, urlparse,
... quote, parse_qs, parse_qsl,
... ParseResult, unquote, urlunparse)
>>>
>>> t = urlparse('http://www.BaidU.com:80') # valid port
>>> t.hostname, t.port
('www.baidu.com', 80)
>>> t = urlparse('http://www.BaidU.com:65536') # invalid port
>>> t.hostname, t.port
('www.baidu.com', None)And the new _encode_netloc():
def _encode_netloc(parts):
"""
:type parts: ParseResult
:rtype: unicode
"""
try:
hostname = to_unicode(parts.hostname.encode('idna'))
netloc = hostname if parts.port is None else '%s:%s' % (hostname, parts.port)
# lost the port number if parts.port is None
except UnicodeError:
netloc = parts.netloc
return netlocThere was a problem hiding this comment.
It's a shame that urlsplit and urlparse in Python 2.7 and 3.5 simply swallow an invalid port number.
Python 3.6 does report an error:
$ python
Python 3.6.2 (default, Aug 24 2017, 10:48:24)
[GCC 6.3.0 20170406] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urlsplit
>>> t = urlsplit('http://www.BaidU.com:654444')
>>> t.port
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/urllib/parse.py", line 169, in port
raise ValueError("Port out of range 0-65535")
ValueError: Port out of range 0-65535
|
Bumping to close outdated PR. |
expect:
'http://xn--5usr0o.xn--fiqs8s:80/'actual:
'http://xn--5usr0o.xn--:80-u68dy61b/'