feat: RFC-3986 compliance hardening and reference resolution #10

Merged
ob-aion merged 21 commits into main from feat/optim on May 20, 2026
ob-aion (Collaborator) commented May 19, 2026

Summary

  • Fix RFC-3986 compliance gaps: percent-encoding hex is case-insensitive (§2.1/§6.2.2.1); ports must be ASCII digits (§3.2.3); userinfo and a non-IPv6 host/port split on the last @/: (§3.2); a present-but-empty query or fragment is preserved and round-trips (§5.3); the Sitemap XML entity table escapes all five entities and a Sitemap URL must be < 2048 chars (sitemaps.org); an IPv6 zone identifier in a URI must use the %25 delimiter with a non-empty ZoneID of unreserved/pct-encoded characters (RFC 6874 §2).
  • Add resolveURI(base, reference) and removeDotSegments(path) — RFC-3986 §5.2 reference resolution implemented verbatim.
  • Centralize coded-error construction in one helper, hoist per-call regexps to module scope, and tighten TypeScript (exactOptionalPropertyTypes, erasableSyntaxOnly, isolatedDeclarations).
  • Add property-based tests (fast-check) and a 100% coverage gate; add a mitata benchmark with a recorded baseline.
  • Cite every RFC-3986 operation's section in the Compliance reference; document that resolveURI ignores a base fragment (§5.1); add explicit empty-port (§3.2.3) and deep dot-segment (§5.2.4) test coverage.
  • Restructure the README's API section: each public function in a <details> collapsible, grouped under topical sub-headers (Punycode, Parsing, Reference resolution, Validators, Checkers, Encoders, Decoders), with markdown parameter tables. Document the exported types (ParsedURI, URIComponents, CheckedURI) as their own entries. Move Compliance between Usage and API; add a Limitations section that gathers behavior caveats and non-goals as one flat bullet list.
  • Surface the full RFC stack in package.json.description, the README tagline, the GitHub repo About, and the topics list: IDN (RFC-3987), IPv6 zone identifiers (RFC 6874), domain rules (RFC 1034 / 1123), Sitemap protocol.
  • Link to bench/baseline.md from the Compliance section so the parse / validate / encode numbers are reachable from the README body, not only from Contributing.
  • Wire CI through the shared reusable workflow. The ci.yml secrets: block forwards NPM_PACKAGE_REGISTRY_TOKEN and NPM_EXTRA_CONFIG for the first publish; both lines drop in a follow-up PR once the npm Trusted Publisher is configured.

Test plan

  • pnpm lint && pnpm typecheck && pnpm test — 675 tests pass, lint and typecheck clean
  • pnpm test:coverage — 100% lines, branches, functions, statements
  • pnpm build — ESM + CJS + types emit; the dual-types matrix resolves (types-first conditions, separate .d.mts/.d.cts)
  • pnpm bench — within the documented regression budget
  • gh repo view coroboros/uri --json description,homepageUrl,repositoryTopics: description matches the README tagline word-for-word; topics include rfc-3987, rfc-1034, rfc-1123, idn (20 total)
  • <details> blocks in the rendered README fold and unfold cleanly on GitHub; the anchored cross-references inside (e.g. [checkURI](#checkers)) resolve

Breaking changes

  • ParsedURI.query / ParsedURI.fragment are now '' (were null) for a present-but-empty ?/#; recomposeURI, href, and the encoders/decoders emit the trailing delimiter accordingly.
  • Non-numeric ports such as 0x1F are rejected with URI_INVALID_PORT (previously coerced).
  • Authorities with multiple @ or stray : split on the last delimiter, yielding a different (correct) host.
  • encodeSitemapURL / decodeSitemapURL now escape and round-trip <, >, " (previously only &, ').
  • A URL of exactly 2048 characters is rejected (the Sitemap limit is < 2048).
  • An IPv6 zone in a URI must use %25 followed by a non-empty ZoneID of unreserved/pct-encoded characters (RFC 6874 §2); a bare %, an empty zone ([fe80::1%25]), or out-of-set bytes are rejected with URI_INVALID_HOST. The standalone isIPv6 validator is unchanged.

Upgrade notes

  • Distinguish an absent component (null) from a present-empty one ('') when reading parseURI output.
  • Ensure ports are digit-only and that IPv6 zone identifiers inside a URI use %25 with a non-empty RFC 6874 ZoneID.
  • 1.0.0 is not published yet, so these changes refine the contract before the first release; once published, the 1.0.0 tag is the stable contract.
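
A minimal consumer-side sketch of the first upgrade note. The ParsedLike shape and both helper names are hypothetical; only the null (absent) versus '' (present but empty) convention comes from this PR.

```typescript
// Hypothetical shape; only the null / '' convention comes from the PR text.
interface ParsedLike {
  query: string | null;
  fragment: string | null;
}

function hasQuery(parsed: ParsedLike): boolean {
  // Compare against null: a truthiness test would treat '' as absent.
  return parsed.query !== null;
}

function queryOrDefault(parsed: ParsedLike, fallback: string): string {
  // Only an absent query falls back; a present-but-empty one is kept.
  return parsed.query === null ? fallback : parsed.query;
}
```

Code that previously wrote `if (parsed.query)` would conflate the two cases and is worth reviewing.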

ob-aion added 21 commits May 19, 2026 20:24
Per RFC-3986 §2.1/§6.2.2.1, HEXDIG in a percent-encoding is
case-insensitive (%3a is equivalent to %3A). The validator rejected
lowercase a-f, refusing valid input; it now accepts both cases.
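
For illustration, the case-insensitive check could look roughly like this. hasValidPercentEncoding and uppercasePercentHex are hypothetical names, not the library's validator.

```typescript
// Hypothetical helpers, not the library's validator.
// RFC 3986 §2.1: pct-encoded = "%" HEXDIG HEXDIG, and §6.2.2.1 treats
// the hex digits as case-insensitive, so a-f must be accepted too.
const PCT_TRIPLET = /%[0-9A-Fa-f]{2}/g;

function hasValidPercentEncoding(input: string): boolean {
  // Remove every well-formed triplet; any '%' left over is malformed.
  return input.replace(PCT_TRIPLET, "").indexOf("%") === -1;
}

// §6.2.2.1 recommends uppercase hex digits when normalizing.
function uppercasePercentHex(input: string): string {
  return input.replace(PCT_TRIPLET, (triplet) => triplet.toUpperCase());
}
```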

Per RFC-3986 §3.2.3, port = *DIGIT. Ports such as 0x1F or 1e3 were
coerced by Number() and accepted; they are now kept raw on parse and
rejected as URI_INVALID_PORT by checkURI, encodeURIString and
decodeURIString. Adds an internal isPort guard and RFC-cited tests.
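
A sketch of the digit-only rule, assuming a guard of roughly this shape. isDigitOnlyPort is a hypothetical name; the PR's internal isPort may differ.

```typescript
// Hypothetical guard; the PR's internal isPort helper may differ.
// RFC 3986 §3.2.3: port = *DIGIT (zero or more ASCII digits).
const DIGITS_ONLY = /^[0-9]*$/;

function isDigitOnlyPort(raw: string): boolean {
  return DIGITS_ONLY.test(raw);
}

// Number() is too permissive for this grammar: it understands hex
// and exponent notation, which is how such ports were coerced.
const coercedHex = Number("0x1F"); // 31
const coercedExponent = Number("1e3"); // 1000
```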
Per RFC-3986 §3.2.1 the userinfo is delimited by the last '@' before
the host; per §3.2.2/§3.2.3 the port follows the last ':' of a
non-IPv6 authority. Splitting on the first occurrence silently
truncated the host (a host-confusion hazard) for inputs such as
"user:pa@ss@example.com" or "a:b:8042". Parsing now uses the last
delimiter, with RFC-cited tests.
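
The last-delimiter rule can be sketched as below for a non-IPv6 authority. AuthorityParts and splitAuthority are hypothetical names; bracketed IPv6 hosts would need separate handling that is not shown.

```typescript
// Hypothetical helper illustrating the last-delimiter rule for a
// non-IPv6 authority; bracketed IPv6 hosts are out of scope here.
interface AuthorityParts {
  userinfo: string | null; // null when there is no '@'
  host: string;
  port: string | null; // null when there is no ':', '' for a bare ':'
}

function splitAuthority(authority: string): AuthorityParts {
  // Everything before the LAST '@' is userinfo.
  const at = authority.lastIndexOf("@");
  const userinfo = at === -1 ? null : authority.slice(0, at);
  const hostPort = at === -1 ? authority : authority.slice(at + 1);

  // The port follows the LAST ':' of what remains.
  const colon = hostPort.lastIndexOf(":");
  if (colon === -1) {
    return { userinfo, host: hostPort, port: null };
  }
  return {
    userinfo,
    host: hostPort.slice(0, colon),
    port: hostPort.slice(colon + 1),
  };
}
```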
Per RFC-3986 §5.3 a present-but-empty query or fragment (the '' from a
bare '?' or '#') is distinct from an absent one and must round-trip.
parseURI now keeps '' (present) separate from null (absent);
recomposeURI emits the delimiter whenever the component is defined,
including ''. encodeURIString and decodeURIString carry the
distinction through, and a non-empty component that fails to decode is
still ignored per the documented decode contract. parse → recompose is
now idempotent for http://h/? and http://h/#. RFC-cited tests added.
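
A minimal sketch of why the round trip holds, using the RFC-3986 Appendix B regular expression and the §5.3 recomposition. SketchComponents, sketchParse and sketchRecompose are hypothetical names, not the library's parseURI / recomposeURI.

```typescript
// Sketch only: hypothetical names, not the library's implementation.
interface SketchComponents {
  scheme: string | null;
  authority: string | null;
  path: string;
  query: string | null; // null = absent, '' = present but empty
  fragment: string | null;
}

// RFC 3986 Appendix B regular expression.
const APPENDIX_B =
  /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

function sketchParse(uri: string): SketchComponents {
  const match = APPENDIX_B.exec(uri);
  // An unmatched optional group is undefined; a matched empty one is ''.
  const group = (index: number): string | null => {
    const value = match === null ? undefined : match[index];
    return value === undefined ? null : value;
  };
  const path = group(5);
  return {
    scheme: group(2),
    authority: group(4),
    path: path === null ? "" : path,
    query: group(7),
    fragment: group(9),
  };
}

// RFC 3986 §5.3: emit a delimiter whenever the component is defined.
function sketchRecompose(c: SketchComponents): string {
  let result = "";
  if (c.scheme !== null) result += c.scheme + ":";
  if (c.authority !== null) result += "//" + c.authority;
  result += c.path;
  if (c.query !== null) result += "?" + c.query; // emitted even for ''
  if (c.fragment !== null) result += "#" + c.fragment;
  return result;
}
```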
The Sitemaps XML protocol requires all five XML entities to be
escaped; only & and ' were. Adds " > < (&quot; &gt; &lt;), so
encodeSitemapURL produces XML-safe URLs and decodeSitemapURL inverts
them. The protocol also caps a URL at strictly less than 2,048
characters, so the bound is now exclusive (a 2,048-character URL is
rejected). RFC/spec-cited tests added.
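
A sketch of the five-entity escaping and the exclusive length bound. All names are hypothetical and this is not the library's encodeSitemapURL / decodeSitemapURL; whether the bound applies to the escaped or unescaped string is left to the caller here.

```typescript
// Sketch with hypothetical names. Entity table per the sitemaps.org protocol.
const ENTITY_BY_CHAR: Record<string, string> = {
  "&": "&amp;",
  "'": "&apos;",
  '"': "&quot;",
  ">": "&gt;",
  "<": "&lt;",
};
const CHAR_BY_ENTITY: Record<string, string> = {
  "&amp;": "&",
  "&apos;": "'",
  "&quot;": '"',
  "&gt;": ">",
  "&lt;": "<",
};
const SITEMAP_MAX_LENGTH = 2048; // exclusive upper bound

function escapeSitemapEntities(url: string): string {
  // One pass over the raw characters, so '&' is never double-escaped.
  return url.replace(/[&'"<>]/g, (ch) => {
    const entity = ENTITY_BY_CHAR[ch];
    return entity === undefined ? ch : entity;
  });
}

function unescapeSitemapEntities(text: string): string {
  // One pass over the entities, so "&amp;lt;" decodes to "&lt;", not "<".
  return text.replace(/&(?:amp|apos|quot|gt|lt);/g, (entity) => {
    const ch = CHAR_BY_ENTITY[entity];
    return ch === undefined ? entity : ch;
  });
}

function fitsSitemapLimit(url: string): boolean {
  return url.length < SITEMAP_MAX_LENGTH;
}
```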
Per RFC 6874 an IPv6 zone identifier inside a URI must use the
percent-encoded delimiter "%25"; a bare "%" is invalid. checkURISyntax
(so checkURI and the encoders/decoders) now rejects a bare-"%" zone
with URI_INVALID_HOST. The standalone isIPv6 literal validator stays
lenient on the delimiter by design. RFC-cited tests added.

Reference resolution was missing. removeDotSegments implements the
RFC-3986 §5.2.4 ordered loop verbatim; resolveURI implements the
§5.2.2 strict transform with §5.2.3 merge and recomposes per §5.3,
requiring an absolute base (§5.2.1). Both are exported from the public
entry point. Tests cover every RFC-3986 §5.4 normal and abnormal
example and the §5.2.4 worked traces.
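
For reference, the §5.2.4 loop (steps A to E) can be sketched as follows. sketchRemoveDotSegments is a hypothetical stand-in written from the RFC text, not a copy of the exported removeDotSegments.

```typescript
// Hypothetical stand-in written from RFC 3986 §5.2.4; not the library code.
function beginsWith(text: string, prefix: string): boolean {
  return text.slice(0, prefix.length) === prefix;
}

function sketchRemoveDotSegments(path: string): string {
  let input = path;
  let output = "";
  while (input.length > 0) {
    if (beginsWith(input, "../")) {
      input = input.slice(3); // A: drop leading "../"
    } else if (beginsWith(input, "./")) {
      input = input.slice(2); // A: drop leading "./"
    } else if (beginsWith(input, "/./")) {
      input = input.slice(2); // B: "/./x" becomes "/x"
    } else if (input === "/.") {
      input = "/"; // B: trailing "/." becomes "/"
    } else if (beginsWith(input, "/../") || input === "/..") {
      // C: replace the prefix with "/" and pop the last output segment.
      input = input === "/.." ? "/" : input.slice(3);
      const lastSlash = output.lastIndexOf("/");
      output = lastSlash === -1 ? "" : output.slice(0, lastSlash);
    } else if (input === "." || input === "..") {
      input = ""; // D: a lone dot segment is discarded
    } else {
      // E: move the first segment (with its leading "/", if any) to output.
      const nextSlash = input.indexOf("/", 1);
      const end = nextSlash === -1 ? input.length : nextSlash;
      output += input.slice(0, end);
      input = input.slice(end);
    }
  }
  return output;
}
```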
The construct-as-cast / set .code / throw triplet was repeated 32
times across the checkers, encoders and decoders. A single internal
fail(code, message, cause?) helper replaces it; the thrown value is
still instanceof URIError with the same stable .code strings, so
behavior is unchanged (the full suite asserts every code). Adds
optional Error.cause support for future wrapping.
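
A sketch of what such a helper might look like. failWith and assertDigitPort are hypothetical names, and the optional cause parameter is omitted; only the URIError type and the URI_INVALID_PORT code come from this PR.

```typescript
// Sketch; the error code string is taken from the PR text, the rest is assumed.
type CodedURIError = URIError & { code: string };

// Builds the error, attaches the stable code, and throws in one place.
function failWith(code: string, message: string): never {
  const error = new URIError(message) as CodedURIError;
  error.code = code;
  throw error;
}

// Example call site: one line instead of the construct / set .code / throw triplet.
function assertDigitPort(port: string): void {
  if (!/^[0-9]*$/.test(port)) {
    failWith("URI_INVALID_PORT", `invalid port: ${port}`);
  }
}
```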
isIP, isIPv4 and isIPv6 rebuilt their RegExp on every call, and the
sitemap decoder rebuilt its alternation regexp on every sitemap
decode. All four are now compiled once at module load. The IP
patterns are stateless and the decoder regexp is used only through
String.prototype.replace (which resets lastIndex), so reuse is safe
and behavior is unchanged.
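
The safety argument can be illustrated with a small sketch. The patterns and names below are simplified placeholders, not the library's actual expressions.

```typescript
// Illustrative patterns only; the library's actual expressions differ.
// Both are compiled once at module load instead of on every call.
const IPV4_SHAPE = /^(?:\d{1,3}\.){3}\d{1,3}$/; // no 'g' flag: test() is stateless
const ENTITY_RUN = /&(?:amp|lt);/g; // 'g' flag: carries lastIndex

function looksLikeIPv4(value: string): boolean {
  return IPV4_SHAPE.test(value);
}

function decodeTwoEntities(text: string): string {
  // String.prototype.replace resets lastIndex for a global regexp,
  // so sharing ENTITY_RUN across calls is safe here.
  return text.replace(ENTITY_RUN, (entity) => (entity === "&amp;" ? "&" : "<"));
}

// Counter-example: test() on a shared global regexp is stateful.
const SHARED_GLOBAL = /a/g;
const firstTest = SHARED_GLOBAL.test("a"); // true, lastIndex is now 1
const secondTest = SHARED_GLOBAL.test("a"); // false, the search started at index 1
```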
Completes the resolveURI / removeDotSegments feature: the previous
commit wired the public export and docs but the implementation file
and its test were not staged. src/resolver/index.ts implements the
RFC-3986 §5.2 transform verbatim; tests/resolver.test.ts covers every
§5.4 example.

Enable exactOptionalPropertyTypes, erasableSyntaxOnly and
isolatedDeclarations on top of strict. Option-bag optional properties
now read `?: T | undefined` so callers can forward possibly-undefined
values (non-breaking, exactOptional-friendly), and the computed
sitemap constants carry explicit annotations for isolatedDeclarations.
tsdown pins `platform: 'node'` and explicit tree-shaking. The dual
ESM/CJS types matrix is correct by construction (separate .d.mts and
.d.cts, types-first conditions). No runtime behavior change.
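
A short sketch of the `?: T | undefined` pattern. SketchEncodeOptions, normalizeScheme and forwardOption are hypothetical; only the `lowercase` option name appears in this PR.

```typescript
// Hypothetical option bag; only the `lowercase` option name comes from the PR.
interface SketchEncodeOptions {
  // Under exactOptionalPropertyTypes, `lowercase?: boolean` would reject an
  // explicit `undefined`; adding `| undefined` lets callers forward one.
  lowercase?: boolean | undefined;
}

function normalizeScheme(scheme: string, options: SketchEncodeOptions = {}): string {
  return options.lowercase === true ? scheme.toLowerCase() : scheme;
}

function forwardOption(scheme: string, lowercase?: boolean): string {
  // `lowercase` may be undefined here; passing it through type-checks under
  // the flag only because the property is declared `?: boolean | undefined`.
  return normalizeScheme(scheme, { lowercase });
}
```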
Adds tests/uri.property.test.ts (fast-check): parseURI totality,
parse → recompose idempotence, removeDotSegments idempotence and
no-dot-segment invariant, resolveURI empty-reference and totality,
component encode/decode round-trip — 1000 runs each. The vitest
coverage threshold is now 100% on every metric.

To make the gate honest, fail() drops its unused cause parameter, and
the handful of guards that are unreachable by construction (indices
bounded by their array length, the Appendix-B regexp always capturing
a string, a resolved target always having a scheme) are marked with
explained v8 ignore comments rather than fabricated tests. biome
excludes the generated coverage directory.
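
To show the shape of a round-trip property without pulling in a dependency, the sketch below swaps fast-check for a plain loop over a seeded generator, and uses the built-in encodeURIComponent / decodeURIComponent in place of the library's component encoders. All names are hypothetical; the actual suite in tests/uri.property.test.ts uses fast-check.

```typescript
// A hand-rolled stand-in for a fast-check property (no dependencies).
// Property: encoding a component and decoding it again yields the input.
function makeSeededRandom(seed: number): () => number {
  // Lehmer / MINSTD generator; returns integers in [1, 2^31 - 2].
  let state = seed % 2147483647;
  if (state <= 0) state += 2147483646;
  return () => {
    state = (state * 48271) % 2147483647;
    return state;
  };
}

function randomAsciiString(next: () => number, maxLength: number): string {
  const length = next() % (maxLength + 1);
  let text = "";
  for (let i = 0; i < length; i++) {
    // Printable ASCII 0x20..0x7E, which includes reserved characters.
    text += String.fromCharCode(0x20 + (next() % 95));
  }
  return text;
}

// Returns the first counterexample found, or null if the property held.
function checkRoundTrip(runs: number, seed: number): string | null {
  const next = makeSeededRandom(seed);
  for (let run = 0; run < runs; run++) {
    const sample = randomAsciiString(next, 24);
    if (decodeURIComponent(encodeURIComponent(sample)) !== sample) {
      return sample;
    }
  }
  return null;
}
```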
.github/workflows/ci.yml calls the shared
coroboros/ci/.github/workflows/javascript-npm-packages.yml@v0
workflow (preflight on branch/PR, publish on tag, security always)
via OIDC — no npm token, no extra config. package.json gains a
`bench` script and the mitata dev dependency, and CLAUDE.md documents
the release/publish flow and the benchmark regression budget.

Swaps the placeholder branch badge for the CI status badge. Adds API
reference for resolveURI and removeDotSegments, and a Compliance
section stating the RFCs implemented, the behavior worth knowing
(empty query/fragment, strict ports, last-delimiter authority split,
case-insensitive percent hex, IPv6 %25 zones, Sitemap escaping), and
the non-goals (no WHATWG leniency, no RFC 5952 canonicalization). The
`lowercase` option notes are corrected: only scheme and host are
lowercased for RFC normalization; lowercasing path/query/fragment is a
Sitemap convenience, not RFC behavior.

bench/uri.bench.mjs measures parse, validate, encode/decode, IP and
reference-resolution throughput across representative URI shapes,
shown next to native URL for scale (a different, WHATWG model).
bench/baseline.md records the 1.0.0 numbers, the bundle size, and the
going-forward budget: no regression > 10 % on any bucket at a fixed
feature set.
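
The budget arithmetic could be checked along these lines. This is a hypothetical helper: the PR does not show how the budget is enforced, and the sketch assumes throughput is recorded in operations per second (higher is better).

```typescript
// Hypothetical check; the enforcement mechanism is not shown in the PR.
// Assumes throughput in operations per second (higher is better).
const REGRESSION_BUDGET = 0.1; // "no regression > 10 %"

// Fraction of baseline throughput lost; negative when the run got faster.
function regressionRatio(baselineOps: number, currentOps: number): number {
  return (baselineOps - currentOps) / baselineOps;
}

function withinBudget(baselineOps: number, currentOps: number): boolean {
  return regressionRatio(baselineOps, currentOps) <= REGRESSION_BUDGET;
}
```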
checkURI accepted an empty or malformed IPv6 zone identifier: the
"%25" delimiter was verified but the ZoneID after it was not. RFC
6874 §2 defines ZoneID = 1*( unreserved / pct-encoded ), so an empty
zone ([fe80::1%25]) or out-of-set bytes are invalid in a URI.

Reject both; valid zones such as [fe80::1%25eth0] are unaffected.
Stricter host validation — enumerated as a pre-1.0.0 change in PR #10.
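
The ZoneID rule can be sketched as follows. hasValidZone is a hypothetical name, and the IPv6 address part itself is deliberately not validated here.

```typescript
// Sketch with a hypothetical name; the address part is not validated.
// RFC 6874 §2: IPv6addrz = IPv6address "%25" ZoneID
//              ZoneID    = 1*( unreserved / pct-encoded )
const ZONE_ID = /^(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2})+$/;

function hasValidZone(bracketedHost: string): boolean {
  // Strip the brackets of an IP-literal such as "[fe80::1%25eth0]".
  const inner = bracketedHost.slice(1, -1);
  const percent = inner.indexOf("%");
  if (percent === -1) return true; // no zone identifier at all
  // The delimiter must be the literal "%25", never a bare "%".
  if (inner.slice(percent, percent + 3) !== "%25") return false;
  // The ZoneID must be non-empty and stay within the allowed set.
  return ZONE_ID.test(inner.slice(percent + 3));
}
```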
Add an explicit RFC-3986 §3.2.3 empty-port case (port = *DIGIT):
'http://example.com:/path' keeps the port present-but-empty (''),
distinct from an absent port (null), and is not an error.

Strengthen the removeDotSegments property generator with up to
eight leading '../' so the §5.2.4 climb-above-root path is
exercised; idempotence and the no-dot-segment invariant still hold.

Map every RFC-3986 operation to its section in the Compliance
section — parse (Appendix B), recompose (§5.3), reference
resolution (§5.2), percent-encoding (§2.1, §6.2.2.1), character
validation (§3.1–§3.5).

Tighten the RFC 6874 entry to the §2 ZoneID grammar now enforced,
and document that resolveURI ignores a fragment on the base
(RFC-3986 §5.1).

npm exposes no pre-publish Trusted Publisher form for a not-yet-existing
scoped package, so the first 1.0.0 tag must publish via the org
NPM_PACKAGE_REGISTRY_TOKEN. Forward NPM_EXTRA_CONFIG and
NPM_PACKAGE_REGISTRY_TOKEN through to the reusable workflow; it
auto-detects the token and routes the publish via npm token.

Once 1.0.0 is live on npm and @coroboros/uri is configured as a
Trusted Publisher of coroboros/uri (workflow ci.yml, environment
empty), both secret lines will be dropped in a follow-up so 1.0.1+
publishes via OIDC + --provenance.

API section: each public function lives in its own <details>
collapsible under a topical sub-header (Punycode, Parsing,
Reference resolution, Validators, Checkers, Encoders, Decoders).
Parameter blocks use markdown tables. The exported types
(ParsedURI, URIComponents, CheckedURI) get their own entries.

Compliance now sits between Usage and API, with the URI grammar
diagrams shown once. A new Limitations section gathers the
behavior caveats and non-goals as a single flat bullet list,
each bullet RFC-cited.

Tagline, package.json description, and keywords surface the
full RFC stack: IDN (RFC-3987), IPv6 zone identifiers (RFC 6874),
domain rules (RFC 1034 / 1123), and the Sitemap protocol.

Surface bench/baseline.md from the README so readers landing on the
Compliance section can find the parse / validate / encode numbers
against native URL and URL.canParse without scrolling to Contributing.
The note states the tradeoff plainly: the toolkit is slower by design
because it does full per-character validation, IDN handling, RFC 6874
zone identifiers, and explicit coded errors.

ob-aion merged commit f99faf3 into main on May 20, 2026
5 checks passed
ob-aion deleted the feat/optim branch on May 20, 2026 05:48