
fix(transformer): handle URLs with spaces in HtmlTransformer #147

Merged
marevol merged 1 commit into master from fix/html-transformer-url-with-spaces
Mar 6, 2026

marevol commented Mar 6, 2026

Summary

  • Add fallback URL resolution in HtmlTransformer for cases where java.net.URI throws URISyntaxException, such as URLs containing spaces or other characters that RFC 2396 does not allow
  • Prevents child URLs from being silently dropped during HTML crawling
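
For context, the failure mode can be reproduced with the JDK alone. The snippet below is a minimal sketch and not code from this PR; the class name UriSpaceDemo and the helper isRejected are hypothetical names used only for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class, not part of the PR.
class UriSpaceDemo {
    /** Returns true if java.net.URI refuses to parse the given string. */
    static boolean isRejected(String s) {
        try {
            new URI(s);
            return false;
        } catch (URISyntaxException e) {
            // A raw space is an illegal character for java.net.URI.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRejected("docs/my file.html"));   // true
        System.out.println(isRejected("docs/my%20file.html")); // false
    }
}
```

A link that hits this exception and is then ignored never reaches the crawl queue, which is the silent drop described above.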

Changes Made

  • HtmlTransformer.java: Added fallback logic in the URISyntaxException catch block that manually constructs absolute URLs by handling protocol-relative (//), absolute-path (/), query/fragment (?/#), scheme-based, and relative path URLs
  • HtmlTransformerTest.java: Added 6 test cases covering relative, absolute-path, protocol-relative, parent-traversal, and edge-case URL resolution with spaces
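
The branch order listed above can be sketched roughly as follows. This is an illustrative approximation, not the code merged in HtmlTransformer.java: the class FallbackResolver and the method resolve are hypothetical names, the base URL is assumed to be valid, only spaces are encoded, and dot segments such as ../ are left as they are.

```java
import java.net.URI;

// Hypothetical sketch of a manual fallback; not the PR's implementation.
class FallbackResolver {
    /** Builds an absolute URL from a valid base and a link that may contain spaces. */
    static String resolve(String base, String link) {
        URI b = URI.create(base); // assumption: the base URL itself parses
        String ref = link.trim().replace(" ", "%20"); // only spaces are encoded here
        String origin = b.getScheme() + "://" + b.getRawAuthority();
        String path = (b.getRawPath() == null || b.getRawPath().isEmpty()) ? "/" : b.getRawPath();

        if (ref.startsWith("//")) {
            return b.getScheme() + ":" + ref;       // protocol-relative
        }
        if (ref.startsWith("/")) {
            return origin + ref;                    // absolute path
        }
        if (ref.startsWith("?")) {
            return origin + path + ref;             // replaces the base query
        }
        if (ref.startsWith("#")) {
            String q = b.getRawQuery();             // fragment keeps the base query
            return origin + path + (q == null ? "" : "?" + q) + ref;
        }
        if (ref.matches("[A-Za-z][A-Za-z0-9+.-]*:.*")) {
            return ref;                             // already has a scheme
        }
        // Relative path: append to the directory part of the base path.
        String dir = path.substring(0, path.lastIndexOf('/') + 1);
        return origin + dir + ref;
    }

    public static void main(String[] args) {
        String base = "http://example.com/docs/guide/index.html";
        System.out.println(resolve(base, "my page.html"));
        // http://example.com/docs/guide/my%20page.html
        System.out.println(resolve(base, "//cdn.example.com/a b.js"));
        // http://cdn.example.com/a%20b.js
    }
}
```

Checking the scheme only after the //, /, ? and # prefixes keeps a colon inside a query string from being mistaken for a scheme separator.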

Testing

  • All new tests verify correct URL encoding of spaces (%20) in various URL patterns
  • Tests cover parent traversal above root directory to ensure safe path normalization
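
One way to keep parent traversal from escaping the root is to treat the path as a stack of segments and ignore any ".." that arrives when the stack is empty. The sketch below illustrates that idea under stated assumptions; DotSegments and normalize are hypothetical names, and the normalization actually used in the PR may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of root-clamped path normalization; not the PR's code.
class DotSegments {
    /** Collapses "." and ".." segments in an absolute path, never rising above "/". */
    static String normalize(String path) {
        List<String> out = new ArrayList<>();
        for (String seg : path.split("/")) {
            if (seg.isEmpty() || seg.equals(".")) {
                continue;                         // skip empty and current-dir segments
            }
            if (seg.equals("..")) {
                if (!out.isEmpty()) {
                    out.remove(out.size() - 1);   // go up one level
                }
                continue;                         // at the root, ".." is ignored
            }
            out.add(seg);
        }
        String result = "/" + String.join("/", out);
        // Keep a trailing slash when the input named a directory.
        boolean dir = path.endsWith("/") || path.endsWith("/.") || path.endsWith("/..");
        return (dir && !out.isEmpty()) ? result + "/" : result;
    }

    public static void main(String[] args) {
        System.out.println(normalize("/a/b/../../../c%20d.html")); // /c%20d.html
        System.out.println(normalize("/a/./b/"));                  // /a/b/
    }
}
```

Note that this sketch also collapses repeated slashes, which a production implementation might want to preserve.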

Breaking Changes

  • None. The fallback only activates when java.net.URI already fails with URISyntaxException, so existing behavior is unchanged for valid URLs.

🤖 Generated with Claude Code

Add a fallback URL resolution mechanism when java.net.URI throws
URISyntaxException (e.g., for URLs containing spaces or other
characters not allowed by RFC 2396). The fallback manually constructs
absolute URLs by handling protocol-relative, absolute-path, relative,
and scheme-based URLs without relying on URI.resolve().

Also add tests covering relative, absolute-path, protocol-relative,
and parent-traversal URLs that contain spaces.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@marevol marevol merged commit 30dbce4 into master Mar 6, 2026
1 check passed