Bug Description
When running kurt content map url with a domain that doesn't include the protocol (e.g., juhache.substack.com instead of https://juhache.substack.com), the command incorrectly treats it as a single page instead of discovering all subpages via sitemap or crawling.
Steps to Reproduce
- Run:
  uv run kurt content map url "juhache.substack.com"
- Observe: Only 1 page discovered with method single_page
- Expected: Should discover 90+ pages from the sitemap
Root Cause
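The is_single_page_url() function in src/kurt/utils/url_utils.py:19 uses urlparse(). When the input has no protocol, urlparse() leaves the scheme and netloc empty and places the entire string in path, so the domain is read as a path rather than a host.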
Example:
from urllib.parse import urlparse
# Without protocol - WRONG
urlparse("juhache.substack.com").path # Returns 'juhache.substack.com'
# With protocol - CORRECT
urlparse("https://juhache.substack.com").path # Returns ''
This causes the function to incorrectly identify juhache.substack.com as a single page instead of a multi-page site.
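To make the difference concrete, the sketch below prints all three components that matter here, not just path. The helper name describe() is hypothetical and exists only for illustration; it is not part of kurt.

```python
from urllib.parse import urlparse


def describe(url: str) -> dict:
    """Return the urlparse components relevant to this bug (illustrative helper)."""
    parts = urlparse(url)
    return {"scheme": parts.scheme, "netloc": parts.netloc, "path": parts.path}


# Without a protocol, the host name ends up in path and netloc stays empty.
bare = describe("juhache.substack.com")

# With a protocol, the host name lands in netloc and path is empty.
full = describe("https://juhache.substack.com")

print(bare)
print(full)
```

Any check that decides "single page" from a non-empty path would therefore fire on the bare domain, which is consistent with the single_page result reported above.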
Workaround
Always include the protocol:
uv run kurt content map url "https://juhache.substack.com"
Proposed Fix
Add URL normalization to ensure protocol is present. This should be done in one of these locations:
- Option 1: CLI command level (src/kurt/commands/content/map.py:99)
  - Normalize URL before passing to map_url_content()
- Option 2: Utility function (src/kurt/utils/url_utils.py)
  - Add ensure_protocol(url: str) -> str function
  - Call it in is_single_page_url() before parsing
- Option 3: Content mapping layer (src/kurt/content/map/__init__.py:41)
  - Normalize in map_url_content() function
Suggested implementation:
def ensure_protocol(url: str) -> str:
    """Add https:// protocol if missing."""
    if not url.startswith(('http://', 'https://')):
        return f'https://{url}'
    return url
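A slightly more defensive variant might also be worth considering. The sketch below is an assumption about what edge cases could matter, not a description of existing kurt code. It additionally trims surrounding whitespace, accepts an upper-case scheme, and handles scheme-relative input such as //host. The default_scheme parameter is hypothetical.

```python
from urllib.parse import urlparse


def ensure_protocol(url: str, default_scheme: str = "https") -> str:
    """Return url with a scheme, prepending default_scheme when none is present."""
    url = url.strip()
    # Scheme-relative input ("//host/path") only needs the scheme and a colon.
    if url.startswith("//"):
        return f"{default_scheme}:{url}"
    # Compare case-insensitively so "HTTP://..." is left untouched.
    if url.lower().startswith(("http://", "https://")):
        return url
    return f"{default_scheme}://{url}"


normalized = ensure_protocol("juhache.substack.com")
parts = urlparse(normalized)
print(normalized, repr(parts.netloc), repr(parts.path))
```

After normalization the host is parsed into netloc and path is empty, so the bare domain and the full URL take the same code path.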
Impact
- Users must remember to always include https:// when using map url command
- Inconsistent with user expectations (most tools auto-add protocol)
- Affects all URL-based commands that use is_single_page_url()
Related Files
src/kurt/utils/url_utils.py:19 - is_single_page_url() function
src/kurt/commands/content/map.py:99 - CLI command entry point
src/kurt/content/map/__init__.py:41 - map_url_content() function