Skip to content

mlxml: preserve percent-encoding in URI paths#21

Open
ssahani wants to merge 1 commit intolibguestfs:masterfrom
ssahani:preserve-uri-percent-encoding
Open

mlxml: preserve percent-encoding in URI paths#21
ssahani wants to merge 1 commit intolibguestfs:masterfrom
ssahani:preserve-uri-percent-encoding

Conversation

@ssahani
Copy link

@ssahani ssahani commented Feb 21, 2026

xmlParseURI automatically decodes percent-encoded characters in the path component, which causes issues when paths contain %2f (encoded forward slash). For example, a VMX file path containing '%2f' would be decoded to '/', creating an invalid double-slash in the path.

Extract the raw path from the original URI string before xmlParseURI processes it, preserving percent-encoding like %2f. This allows proper handling of paths with special characters that need to remain encoded.

Fixes: https://issues.redhat.com/browse/RHEL-136481

@@ -0,0 +1,124 @@
From ea4c85204a87e0156f1d22d845844b4a4ad747cc Mon Sep 17 00:00:00 2001
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this file for?

@rwmjones
Copy link
Member

rwmjones commented Feb 21, 2026

I don't understand why this change is needed at all. Can you explain what problem you're trying to solve, ideally with examples.

In general, trying to hand-parse URIs is a big code smell. Libraries (like libxml2) exist to take care of the many details, standards and complications around URIs. Therefore the barrier to entry for this code is much larger than normal since it is doing something that is probably inadvisable. And since I don't understand the purpose of this code, I cannot right now think about accepting it.

@ssahani ssahani force-pushed the preserve-uri-percent-encoding branch 3 times, most recently from f28d178 to ae524c7 Compare February 21, 2026 11:24
@ssahani
Copy link
Author

ssahani commented Feb 24, 2026

This file name
esx8.0-rhel9.6-x86_64-efi-disk-name-contains-specical-chars-%2f/esx8.0-rhel9.6-x86_64-efi-disk-name-contains-specical-chars-%2f.vmx

is getting changed into

esx8.0-rhel9.6-x86_64-efi-disk-name-contains-specical-chars-//esx8.0-rhel9.6-x86_64-efi-disk-name-contains-specical-chars-/.vmx

This path does not exist . xmlParseURI automatically decodes percent-encoded characters (e.g. %2f becomes /), which corrupts paths where %2f is a literal part of the filename on VMFS/ESXi filesystems. xmlParseURIRaw with raw=1 preserves the percent-encoding so the path is passed to SFTP exactly as the user specified it.

@crobinso
Copy link
Collaborator

This changes the semantics of this API. You'd need to audit every user that's calling parse_uri throughout libguestfs code to see if they are passing in escaped URIs or not. If they are expecting the URI to be unescaped, this patch breaks those callers.

If the path is literally esx8.0-rhel9.6-x86_64-efi-disk-name-contains-specical-chars-%2f/esx8.0-rhel9.6-x86_64-efi-disk-name-contains-specical-chars-%2f.vmx, the valid RI form of that would be splitting on the / and passing each part through URL escaping, so '%2f' becomes '%252f'.

As other patches have addressed, vmware has obvious bugs here. So IMO probably better to finding whereever this path is passed to parse_uri in our vmware specific code, and URI escaping it first.

@rwmjones
Copy link
Member

The proposed commit now changes Xml.parse_uri for all callers throughout virt-v2v and guestfs-tools (I counted at least 13 places where this function is called). It's doubtful this change is correct for all of them, and I also doubt that you have studied those call sites.

mlxml is essentially a mini-binding around the xml* functions from libxml2, as it says in the documentation. If a binding around xmlParseURIRaw is needed then it'll need to be a separate binding in this library (eg. Xml.parse_uri_raw : string -> bool -> uri).

@ssahani
Copy link
Author

ssahani commented Feb 26, 2026

Thanks for the reviews. Updated .

@rwmjones
Copy link
Member

It's OK, but the whole of the code building the tuple is now duplicated. It'd be nice not to have that.

It's hard to understand from the libxml2 docs what xmlParseURIRaw even does. I had to look at the source code. It causes some parts of the URI (at least, authority, hostname, path, query string) to be left %-encoded rather than being decoded.

The libxml2 function isn't used anywhere else as far as I can see. But glib's g_uri_parse does have a similar feature (G_URI_FLAGS_ENCODED) so I guess that's fine. URI parsing is hard, news at 11...

Add a new parse_uri_raw binding that wraps xmlParseURIRaw(str, raw).
When raw is true, percent-encoding in URI paths is preserved.
On VMFS/ESXi, %2f is literal in filenames and must not be decoded
to '/'.  The existing parse_uri binding remains unchanged.

Fixes: https://issues.redhat.com/browse/RHEL-136481
Signed-off-by: Susant Sahani <ssahani@redhat.com>
@ssahani ssahani force-pushed the preserve-uri-percent-encoding branch from b9f7d81 to 63bcae2 Compare March 1, 2026 14:01
@ssahani
Copy link
Author

ssahani commented Mar 1, 2026

Indeed. Updated. Thanks for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants