Skip to content

fix userinfo in case of some special characters#144

Open
kastakhov wants to merge 4 commits into
libwww-perl:masterfrom
kastakhov:fix-uri-parsing
Open

fix userinfo in case of some special characters#144
kastakhov wants to merge 4 commits into
libwww-perl:masterfrom
kastakhov:fix-uri-parsing

Conversation

@kastakhov

Copy link
Copy Markdown

Fix for #143

Comment thread lib/URI/_server.pm Outdated
}
if ($str =~ m,^((?:$URI::scheme_re:)?)//(.*:.*@)?([^/?\#]*)(.*)$,os) {
my $scheme = $1;
my $ui = $2 || '';

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have a more descriptive name for this variable?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this variable was not your creation. :)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, I have renamed to $userinfo.
I decided to rename in whole file as it might be a little confusing when we're using $ui or $userinfo.

Comment thread lib/URI.pm Outdated
}

if ($_[0] =~ m{^((?:$URI::scheme_re:)?)//([^/?\#]+)(.*)$}os) {
if ($_[0] =~ m{^((?:$URI::scheme_re:)?)//((.*:.*@)?[^/?\#]+)(.*)$}os) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a discussion with @genio on IRC:

It would be nice to avoid (.*@) or (.*:.*@) as it's greedy and could be prone to bugs.

We could take inspiration from https://metacpan.org/dist/Mojolicious/source/lib/Mojo/URL.pm#L52-64

After gathering the host and port part, ^([^@]+@) should be fine.

@kastakhov kastakhov Aug 20, 2024

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, but what if user wanna use '@' in password, using [^@]+@ will cause that the everything before first '@' symbol will be matched and password parsed incorrectly.

like the similar approach is already used in LWP/Protocol/http.pm.
And I already mentioned what issue it might cause in my PR.

For my standpoint (.*:.*@) is pretty lazy, but okay, I agree that it could be better.

@oalders, what you think about ([^:]+:.*@) for matching user info?

This will match everything except : as username cannot contain colon, then match : as separator, match everything after first colon as password or nothing as according to rfc3986#section-3.2.1 it might be empty, and finally match @ as second separator.

@kastakhov kastakhov Aug 20, 2024

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, the regex which Mojo/URL.pm#L52-64 is used also cannot parse username/password if it contains any of /#? symbols :)
I know, it's really rarely usecase, but unfortunately I faced with it already two times :d

For instance, this regex with username and password which contain all special characters in unescaped form.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, but what if user wanna use '@' in password, using [^@]+@ will cause that the everything before first '@' symbol will be matched and password parsed incorrectly.

Then it should be URL-encoded as %40. I'm not sure if there's any way around that. Something has to serve as a stopping point. If the RE is too greedy, it might end up stopping with a @ inside of the path, like http://host/path/directory@foobar/file.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the RE is too greedy, it might end up stopping with a @ inside of the path, like http://host/path/directory@foobar/file.

Hm.
Now I see, it'll match this path incorrectly.

@kastakhov kastakhov Aug 26, 2024

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, actually I can revert this change as the most important part is in lib/URI/_server.pm#14.

There are another issue, if password contain multiple '@' symbols before any of '/#?' symbol URI parsing fails again :d

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are another issue, if password contain multiple '@' symbols before any of '/#?' symbol URI parsing fails again :d

If the password was inserted directly into the URL that way without percent-encoding, that's not really the URI module's fault. The password has to be encoded properly first. That's the point of escaping characters.

@oalders

oalders commented Aug 20, 2024

Copy link
Copy Markdown
Member

Thanks very much for this @kastakhov. If we could tweak the regex in both places as suggested, that would be helpful.

@kastakhov

Copy link
Copy Markdown
Author

Hi, @oalders, @genio, any another suggestions for these changes or now it looks good for you?

@SineSwiper

SineSwiper commented Aug 26, 2024

Copy link
Copy Markdown
Contributor

This is very adjacent to this issue, and feel free to tell me to open a different issue/PR, but...

The user/password methods allow bad data to pollute the URL (lines 45 and the similar line 22), because it is only doing a very light set of manual URL-escaping. It should, at a minimum escape potential @ characters in the password ($new =~ s/@/%40/g), and at a maximum, just call uri_escape for both user/password sets.

@oalders

oalders commented Aug 26, 2024

Copy link
Copy Markdown
Member

Is there a good reason not to just call uri_escape?

@genio

genio commented Aug 26, 2024

Copy link
Copy Markdown
Member

I still don't think this is a good idea, as you're supposed to URI-escape things. Here's an example with both Mojo and URI:

#!/usr/bin/env perl

use v5.36;
use strict;
use warnings;

use Mojo::URL ();
use Mojo::Util ();
use URI ();
use URI::Escape ();

my $unescaped_u = 'u1!"#$%&\'()*+,-./;<=>?@[\]^_`{|}~';
my $unescaped_p = 'p1!"#$%&\'()*+,-./;<=>?@[\]^_`{|}~';

{
    # mojolicious way:
    my $username = Mojo::Util::url_escape($unescaped_u);
    my $password = Mojo::Util::url_escape($unescaped_p);
    my $foo = "http://${username}:${password}\@example.com/pub/a/2001/08/27/bjornstad.html";
    # say "Mojo: ", $foo;
    my $url = Mojo::URL->new($foo);
    # say "Mojo: ", $url->to_string();
    say "Mojo: ", $url->userinfo();
}

{
    # URI way
    my $username = URI::Escape::uri_escape($unescaped_u);
    my $password = URI::Escape::uri_escape($unescaped_p);
    my $foo = "http://${username}:${password}\@example.com/pub/a/2001/08/27/bjornstad.html";
    # say "URI:  ", $foo;
    my $url = URI->new($foo);
    # say "URI:  ", $url->as_string();
    say "URI:  ", URI::Escape::uri_unescape($url->userinfo());
}

Yielding:

% perl url.pl
Mojo: u1!"#$%&'()*+,-./;<=>?@[\]^_`{|}~:p1!"#$%&'()*+,-./;<=>?@[\]^_`{|}~
URI:  u1!"#$%&'()*+,-./;<=>?@[\]^_`{|}~:p1!"#$%&'()*+,-./;<=>?@[\]^_`{|}~

@SineSwiper

Copy link
Copy Markdown
Contributor

While I agree everybody should URI-escape the UN/PWs, it should still try to safely parse as much as it can. IMO, [^:]+:[^@]+@ should work to capture userinfo without escaping out to parts that it shouldn't.

Also, I'll just create a different issue for the user/password set problem.

@SineSwiper

Copy link
Copy Markdown
Contributor

Well, disregard my user/password set problem. I realized that userinfo is running through URI::Escape::escape_char($1) and everything works correctly, as long as it didn't start off confused (ie: URI->new with reserved characters in the password).

@kastakhov

kastakhov commented Aug 26, 2024

Copy link
Copy Markdown
Author

While I agree everybody should URI-escape the UN/PWs, it should still try to safely parse as much as it can. IMO, [^:]+:[^@]+@ should work to capture userinfo without escaping out to parts that it shouldn't.

Also, I'll just create a different issue for the user/password set problem.

Just in case, originally I faced with this issue here..

make _uric_escape regex more specific for userinfo part
update UT
@kastakhov

Copy link
Copy Markdown
Author

Now the only one issue still here.... when password contain '@' unescaped symbol before '/#?' symbols, user info part matches incorrectly. 🙃

@genio

genio commented Aug 26, 2024

Copy link
Copy Markdown
Member

The seemingly relevant RFC pieces are linked here:
https://datatracker.ietf.org/doc/html/rfc3986#section-3.2.1
https://datatracker.ietf.org/doc/html/rfc3986#section-7.3
https://datatracker.ietf.org/doc/html/rfc3986#section-7.5

I think my cursory reading of these and understanding thus far leads me to believe we should leave things as-is and maybe provide documentation for how to provide complex authentication using URI::Escape::uri_escape. I believe making changes to the behavior here could result in unexpected double percent encoding for some folks who are already doing it the way in the sample script above.

@SineSwiper

SineSwiper commented Aug 27, 2024

Copy link
Copy Markdown
Contributor

I found Section 2.2 earlier, but didn't see where those declarations were used. Good find.

So, this means we should stay away from unnecessarily parsing gen-delims in the userinfo section. At the very least, .* is way too broad.

Now the only one issue still here.... when password contain '@' unescaped symbol before '/#?' symbols, user info part matches incorrectly.

Again, a delimiter has to exist unescaped. There's no such thing as a parser algorithm that can take in every ASCII or Unicode unescaped character as a sub-field of the whole string. At some point, we have to force the user to escape their data before it gets dropped into a URL.

Comment thread lib/URI/_server.pm
if (_host_escape($host)) {
$str = "$scheme//$ui$host$port$rest";
}
if ($str =~ m,^((?:$URI::scheme_re:)?)//([^:]+:[^@]*@)?([^/?\#]*)(.*)$,os) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not match correctly valid string where the optional password is omitted (such as ftp://user@host) and will also match what I think is invalid string ftp://user:@host

Wouldn't it be better written as something like: m,^((?:$URI::scheme_re:)?)[/][/]([^:]+(?:[:][^@]+)?[@])?([^/?\#@]*)(.*)$,os

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants