feat: RFC-3986 compliance hardening and reference resolution #10
Merged
Per RFC-3986 §2.1/§6.2.2.1, HEXDIG in a percent-encoding is case-insensitive (%3a is equivalent to %3A). The validator rejected lowercase a-f, refusing valid input; it now accepts both cases. Per RFC-3986 §3.2.3, port = *DIGIT. Ports such as 0x1F or 1e3 were coerced by Number() and accepted; they are now kept raw on parse and rejected as URI_INVALID_PORT by checkURI, encodeURIString and decodeURIString. Adds an internal isPort guard and RFC-cited tests.
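As a rough illustration of the two rules in this commit, the guards below inspect the raw text instead of a coerced number. The names `isPctEncodedSketch` and `isPortSketch` are hypothetical and do not reproduce the library's internal `isPort`; the regular expressions are one plausible reading of the RFC-3986 grammar.

```typescript
// Hypothetical guards for illustration; not the library's internals.

// pct-encoded = "%" HEXDIG HEXDIG, with HEXDIG accepted in either case (RFC 3986 §2.1).
const PCT_ENCODED = /^%[0-9A-Fa-f]{2}$/;

// port = *DIGIT (RFC 3986 §3.2.3): zero or more ASCII digits and nothing else.
const PORT = /^[0-9]*$/;

function isPctEncodedSketch(triplet: string): boolean {
  return PCT_ENCODED.test(triplet);
}

function isPortSketch(raw: string): boolean {
  return PORT.test(raw);
}

// Why the raw string has to be checked: Number() accepts non-decimal notations.
const coercedHex = Number("0x1F"); // 31
const coercedExp = Number("1e3"); // 1000
```

Because `*DIGIT` allows zero digits, the sketch treats an empty port as syntactically valid and leaves any range check to the caller.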
Per RFC-3986 §3.2.1 the userinfo is delimited by the last '@' before the host; per §3.2.2/§3.2.3 the port follows the last ':' of a non-IPv6 authority. Splitting on the first occurrence silently truncated the host (a host-confusion hazard) for inputs such as "user:pa@ss@example.com" or "a:b:8042". Parsing now uses the last delimiter, with RFC-cited tests.
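The last-delimiter rule can be sketched as follows. `splitAuthoritySketch` and the `AuthorityParts` shape are assumptions made for this example, and the bracket handling for IPv6 literals is deliberately simplified.

```typescript
// Hypothetical shape and function; the library's parser may differ in detail.
interface AuthorityParts {
  userinfo: string | null;
  host: string;
  port: string | null;
}

function splitAuthoritySketch(authority: string): AuthorityParts {
  // userinfo ends at the LAST '@', so an '@' inside userinfo cannot cut the host short.
  const at = authority.lastIndexOf("@");
  const userinfo = at === -1 ? null : authority.slice(0, at);
  const hostport = at === -1 ? authority : authority.slice(at + 1);

  // An IPv6 literal contains ':' itself, so the port is looked for after ']'.
  if (hostport.charAt(0) === "[") {
    const close = hostport.indexOf("]");
    if (close === -1) return { userinfo, host: hostport, port: null };
    const rest = hostport.slice(close + 1);
    return {
      userinfo,
      host: hostport.slice(0, close + 1),
      port: rest.charAt(0) === ":" ? rest.slice(1) : null,
    };
  }

  // Otherwise the port follows the LAST ':'.
  const colon = hostport.lastIndexOf(":");
  return colon === -1
    ? { userinfo, host: hostport, port: null }
    : { userinfo, host: hostport.slice(0, colon), port: hostport.slice(colon + 1) };
}
```

With this sketch, `"user:pa@ss@example.com"` keeps `example.com` as the host, and `"a:b:8042"` yields the port `8042`. Whether the remaining host is valid is a separate check.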
Per RFC-3986 §5.3 a present-but-empty query or fragment (the '' from a bare '?' or '#') is distinct from an absent one and must round-trip. parseURI now keeps '' (present) separate from null (absent); recomposeURI emits the delimiter whenever the component is defined, including ''. encodeURIString and decodeURIString carry the distinction through, and a non-empty component that fails to decode is still ignored per the documented decode contract. parse → recompose is now idempotent for http://h/? and http://h/#. RFC-cited tests added.
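A minimal sketch of the absent-versus-empty distinction, assuming a parser built on the RFC-3986 Appendix B expression. `parseSketch`, `recomposeSketch` and `ComponentsSketch` are illustrative names, not the library's `parseURI` or `recomposeURI`.

```typescript
// Illustrative shape only; the library's parsed output has more fields.
interface ComponentsSketch {
  scheme: string | null;
  authority: string | null;
  path: string;
  query: string | null;
  fragment: string | null;
}

// RFC 3986 Appendix B expression, with non-capturing groups around the delimiters.
const APPENDIX_B = /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?/;

function parseSketch(uri: string): ComponentsSketch {
  const match = APPENDIX_B.exec(uri);
  // A group that did not participate is undefined: map it to null (absent).
  // A group that matched zero characters is '': keep it (present but empty).
  const group = (index: number): string | null => {
    const value = match === null ? undefined : match[index];
    return value === undefined ? null : value;
  };
  const path = group(3);
  return {
    scheme: group(1),
    authority: group(2),
    path: path === null ? "" : path,
    query: group(4),
    fragment: group(5),
  };
}

// RFC 3986 §5.3: a delimiter is emitted whenever its component is defined, even if empty.
function recomposeSketch(c: ComponentsSketch): string {
  let result = "";
  if (c.scheme !== null) result += c.scheme + ":";
  if (c.authority !== null) result += "//" + c.authority;
  result += c.path;
  if (c.query !== null) result += "?" + c.query;
  if (c.fragment !== null) result += "#" + c.fragment;
  return result;
}
```

The key design point is testing against `null` rather than truthiness, since `''` is falsy in JavaScript and would otherwise lose the bare `?` or `#`.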
The Sitemaps XML protocol requires all five XML entities to be escaped; only & and ' were. Adds ", > and < (escaped as &quot;, &gt; and &lt;), so encodeSitemapURL produces XML-safe URLs and decodeSitemapURL inverts them. The protocol also caps a URL at strictly less than 2,048 characters, so the bound is now exclusive (a 2,048-character URL is rejected). RFC/spec-cited tests added.
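The entity step alone might look like the sketch below. It covers only XML escaping and the length bound, not the rest of what `encodeSitemapURL` does; the function names and tables are hypothetical, and whether the library measures the length before or after escaping is not stated here.

```typescript
// Illustrative tables for the five predefined XML entities; not the library's own table.
const TO_ENTITY: Record<string, string> = {
  "&": "&amp;",
  "'": "&apos;",
  '"': "&quot;",
  ">": "&gt;",
  "<": "&lt;",
};

const FROM_ENTITY: Record<string, string> = {
  "&amp;": "&",
  "&apos;": "'",
  "&quot;": '"',
  "&gt;": ">",
  "&lt;": "<",
};

function escapeXmlSketch(value: string): string {
  return value.replace(/[&'"<>]/g, (ch) => TO_ENTITY[ch] ?? ch);
}

// One alternation, one pass: "&amp;lt;" decodes to "&lt;", never to "<".
function unescapeXmlSketch(value: string): string {
  return value.replace(/&(?:amp|apos|quot|gt|lt);/g, (entity) => FROM_ENTITY[entity] ?? entity);
}

// Exclusive bound: a URL must be strictly shorter than 2,048 characters.
const SITEMAP_MAX_LENGTH = 2048;

function isWithinSitemapLimitSketch(url: string): boolean {
  return url.length < SITEMAP_MAX_LENGTH;
}
```

Decoding in a single pass matters: decoding `&amp;` first and the other entities afterwards would turn `&amp;lt;` into `<` and break the round trip.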
Per RFC 6874 an IPv6 zone identifier inside a URI must use the percent-encoded delimiter "%25"; a bare "%" is invalid. checkURISyntax (so checkURI and the encoders/decoders) now rejects a bare-"%" zone with URI_INVALID_HOST. The standalone isIPv6 literal validator stays lenient on the delimiter by design. RFC-cited tests added.
Reference resolution was missing. removeDotSegments implements the RFC-3986 §5.2.4 ordered loop verbatim; resolveURI implements the §5.2.2 strict transform with §5.2.3 merge and recomposes per §5.3, requiring an absolute base (§5.2.1). Both are exported from the public entry point. Tests cover every RFC-3986 §5.4 normal and abnormal example and the §5.2.4 worked traces.
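For readers unfamiliar with §5.2, the sketch below walks through the same steps: the §5.2.4 dot-segment loop, the §5.2.3 merge, and the §5.2.2 strict transform followed by §5.3 recomposition. All names (`parseRef`, `removeDotSegmentsSketch`, `mergePaths`, `resolveSketch`) are assumptions for this example; it is not the library's implementation and skips validation and coded errors.

```typescript
// Hypothetical sketch; names and shapes are illustrative, not the library's API.
interface Ref {
  scheme: string | null;
  authority: string | null;
  path: string;
  query: string | null;
  fragment: string | null;
}

const URI_PARTS = /^(?:([^:\/?#]+):)?(?:\/\/([^\/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?/;

function parseRef(uri: string): Ref {
  const m = URI_PARTS.exec(uri);
  const g = (i: number): string | null => {
    const v = m === null ? undefined : m[i];
    return v === undefined ? null : v;
  };
  const path = g(3);
  return { scheme: g(1), authority: g(2), path: path === null ? "" : path, query: g(4), fragment: g(5) };
}

// Drop the last output segment and its preceding "/", if any (step 2C).
function dropLastSegment(output: string): string {
  const slash = output.lastIndexOf("/");
  return slash === -1 ? "" : output.slice(0, slash);
}

// §5.2.4 ordered loop: steps 2A to 2E are tried in order on each iteration.
function removeDotSegmentsSketch(path: string): string {
  let input = path;
  let output = "";
  while (input.length > 0) {
    if (input.startsWith("../")) input = input.slice(3); // 2A
    else if (input.startsWith("./")) input = input.slice(2); // 2A
    else if (input.startsWith("/./")) input = input.slice(2); // 2B
    else if (input === "/.") input = "/"; // 2B
    else if (input.startsWith("/../")) { // 2C
      input = input.slice(3);
      output = dropLastSegment(output);
    } else if (input === "/..") { // 2C
      input = "/";
      output = dropLastSegment(output);
    } else if (input === "." || input === "..") input = ""; // 2D
    else { // 2E: move the first segment, with its leading "/", to the output
      const next = input.indexOf("/", 1);
      const segment = next === -1 ? input : input.slice(0, next);
      output += segment;
      input = input.slice(segment.length);
    }
  }
  return output;
}

// §5.2.3: merge a relative-path reference with the base path.
function mergePaths(base: Ref, referencePath: string): string {
  if (base.authority !== null && base.path === "") return "/" + referencePath;
  return base.path.slice(0, base.path.lastIndexOf("/") + 1) + referencePath;
}

// §5.2.2 strict transform, then §5.3 recomposition.
function resolveSketch(baseUri: string, reference: string): string {
  const base = parseRef(baseUri);
  const r = parseRef(reference);
  if (base.scheme === null) throw new Error("base URI must be absolute (§5.2.1)");
  let authority = r.authority;
  let path = removeDotSegmentsSketch(r.path);
  let query = r.query;
  if (r.scheme === null && r.authority === null) {
    authority = base.authority;
    if (r.path === "") {
      path = base.path;
      query = r.query !== null ? r.query : base.query;
    } else if (!r.path.startsWith("/")) {
      path = removeDotSegmentsSketch(mergePaths(base, r.path));
    }
  }
  const scheme = r.scheme !== null ? r.scheme : base.scheme;
  return (
    scheme + ":" +
    (authority !== null ? "//" + authority : "") +
    path +
    (query !== null ? "?" + query : "") +
    (r.fragment !== null ? "#" + r.fragment : "")
  );
}
```

Against the §5.4 base `http://a/b/c/d;p?q`, this sketch resolves `../g` to `http://a/b/g` and `../../../g` to `http://a/g`. The second case shows that climbing above the root is clamped rather than rejected.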
The construct-as-cast / set .code / throw triplet was repeated 32 times across the checkers, encoders and decoders. A single internal fail(code, message, cause?) helper replaces it; the thrown value is still instanceof URIError with the same stable .code strings, so behavior is unchanged (the full suite asserts every code). Adds optional Error.cause support for future wrapping.
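A helper of this kind could look roughly like the following. The `CodedURIError` type and the exact signature are assumptions, and the optional `cause` parameter mentioned above is left out of the sketch.

```typescript
// Hypothetical type: a URIError that also carries a stable string code.
interface CodedURIError extends URIError {
  code: string;
}

// Construct, tag and throw in one place. The `never` return type lets callers
// use fail() at the end of a branch without a dummy return.
function failSketch(code: string, message: string): never {
  const error = new URIError(message) as CodedURIError;
  error.code = code;
  throw error;
}

// Example caller: URI_INVALID_PORT is a code named in this PR; the check itself is illustrative.
function assertPortSketch(raw: string): string {
  if (!/^[0-9]*$/.test(raw)) {
    failSketch("URI_INVALID_PORT", "port must contain digits only: " + raw);
  }
  return raw;
}
```

Callers can keep matching on `instanceof URIError` and on `.code`, which is what keeps a refactor like this invisible from the outside.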
isIP, isIPv4 and isIPv6 rebuilt their RegExp on every call, and the sitemap decoder rebuilt its alternation regexp on every sitemap decode. All four are now compiled once at module load. The IP patterns are stateless and the decoder regexp is used only through String.prototype.replace (which resets lastIndex), so reuse is safe and behavior is unchanged.
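The safety argument rests on how `lastIndex` behaves, which the short sketch below demonstrates. The patterns are simplified stand-ins, not the library's expressions.

```typescript
// Compiled once at module load instead of inside each call.
// Simplified stand-in pattern, not the library's IPv4 expression.
const IPV4_SKETCH =
  /^(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$/;

function isIPv4Sketch(value: string): boolean {
  // No g or y flag, so test() never reads or writes lastIndex.
  return IPV4_SKETCH.test(value);
}

// A global regexp is stateful under test(): the second call resumes at lastIndex.
const GLOBAL_DIGIT = /\d/g;
const firstTest = GLOBAL_DIGIT.test("7"); // true, lastIndex becomes 1
const secondTest = GLOBAL_DIGIT.test("7"); // false, the search started at index 1

// replace() resets lastIndex to 0 before scanning, so sharing a global regexp is safe there.
const ENTITY_SKETCH = /&amp;/g;
const firstDecode = "a&amp;b".replace(ENTITY_SKETCH, "&");
const secondDecode = "a&amp;b".replace(ENTITY_SKETCH, "&");
```

If a hoisted global regexp were ever used with `test()` or `exec()`, the reuse would stop being safe, so the flag choice and the call site both matter.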
Completes the resolveURI / removeDotSegments feature: the previous commit wired the public export and docs but the implementation file and its test were not staged. src/resolver/index.ts implements the RFC-3986 §5.2 transform verbatim; tests/resolver.test.ts covers every §5.4 example.
Enable exactOptionalPropertyTypes, erasableSyntaxOnly and isolatedDeclarations on top of strict. Option-bag optional properties now read `?: T | undefined` so callers can forward possibly-undefined values (non-breaking, exactOptional-friendly), and the computed sitemap constants carry explicit annotations for isolatedDeclarations. tsdown pins `platform: 'node'` and explicit tree-shaking. The dual ESM/CJS types matrix is correct by construction (separate .d.mts and .d.cts, types-first conditions). No runtime behavior change.
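To show why the option bags read `?: T | undefined`, here is a small sketch. `NormalizeOptionsSketch`, `normalizeHostSketch` and `forwardSketch` are hypothetical; only the `lowercase` option name comes from this PR.

```typescript
// Hypothetical option bag for illustration.
interface NormalizeOptionsSketch {
  // Written as `?: T | undefined` so an explicit undefined is accepted
  // even when exactOptionalPropertyTypes is enabled.
  lowercase?: boolean | undefined;
}

function normalizeHostSketch(host: string, options: NormalizeOptionsSketch = {}): string {
  return options.lowercase === true ? host.toLowerCase() : host;
}

// A wrapper can forward its own possibly-undefined argument directly,
// without a conditional spread to omit the key.
function forwardSketch(host: string, lowercase?: boolean): string {
  return normalizeHostSketch(host, { lowercase });
}
```

With `exactOptionalPropertyTypes` on and the property declared as plain `lowercase?: boolean`, the `{ lowercase }` literal in `forwardSketch` would not type-check, which is the friction the wider annotation avoids.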
Adds tests/uri.property.test.ts (fast-check): parseURI totality, parse → recompose idempotence, removeDotSegments idempotence and no-dot-segment invariant, resolveURI empty-reference and totality, component encode/decode round-trip — 1000 runs each. The vitest coverage threshold is now 100% on every metric. To make the gate honest, fail() drops its unused cause parameter, and the handful of guards that are unreachable by construction (indices bounded by their array length, the Appendix-B regexp always capturing a string, a resolved target always having a scheme) are marked with explained v8 ignore comments rather than fabricated tests. biome excludes the generated coverage directory.
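The shape of a round-trip property can be shown without any dependency. This sketch swaps fast-check for a hand-rolled deterministic generator and swaps the library's component encoders for the built-in `encodeURIComponent` / `decodeURIComponent`, so it only illustrates the technique, not the actual test file.

```typescript
// Deterministic linear congruential generator, so a failure is reproducible from the seed.
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state >>> 16; // keep the upper bits, which are better distributed
  };
}

// Printable ASCII strings of length 0 to 15.
function randomComponent(next: () => number): string {
  const length = next() % 16;
  let out = "";
  for (let i = 0; i < length; i++) {
    out += String.fromCharCode(0x20 + (next() % 95));
  }
  return out;
}

// Property: decode(encode(s)) === s. Returns the first counterexample, or null if none is found.
function findRoundTripCounterexample(runs: number, seed: number): string | null {
  const next = makeRandom(seed);
  for (let run = 0; run < runs; run++) {
    const sample = randomComponent(next);
    if (decodeURIComponent(encodeURIComponent(sample)) !== sample) return sample;
  }
  return null;
}

const counterexample = findRoundTripCounterexample(1000, 42);
```

A real property-testing library adds shrinking and richer generators on top of this loop, which is the main reason to prefer it over a hand-rolled version.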
.github/workflows/ci.yml calls the shared coroboros/ci/.github/workflows/javascript-npm-packages.yml@v0 workflow (preflight on branch/PR, publish on tag, security always) via OIDC — no npm token, no extra config. package.json gains a `bench` script and the mitata dev dependency, and CLAUDE.md documents the release/publish flow and the benchmark regression budget.
Swaps the placeholder branch badge for the CI status badge. Adds API reference for resolveURI and removeDotSegments, and a Compliance section stating the RFCs implemented, the behavior worth knowing (empty query/fragment, strict ports, last-delimiter authority split, case-insensitive percent hex, IPv6 %25 zones, Sitemap escaping), and the non-goals (no WHATWG leniency, no RFC 5952 canonicalization). The `lowercase` option notes are corrected: only scheme and host are lowercased for RFC normalization; lowercasing path/query/fragment is a Sitemap convenience, not RFC behavior.
bench/uri.bench.mjs measures parse, validate, encode/decode, IP and reference-resolution throughput across representative URI shapes, shown next to native URL for scale (a different, WHATWG model). bench/baseline.md records the 1.0.0 numbers, the bundle size, and the going-forward budget: no regression > 10 % on any bucket at a fixed feature set.
checkURI accepted an empty or malformed IPv6 zone identifier: the "%25" delimiter was verified but the ZoneID after it was not. RFC 6874 §2 defines ZoneID = 1*( unreserved / pct-encoded ), so an empty zone ([fe80::1%25]) or out-of-set bytes are invalid in a URI. Reject both; valid zones such as [fe80::1%25eth0] are unaffected. Stricter host validation — enumerated as a pre-1.0.0 change in PR #10.
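The zone rule can be sketched as a check on the bracketed host. `hasValidZoneSketch` is a hypothetical name; it validates only the delimiter and the ZoneID, not the IPv6 address that precedes them.

```typescript
// ZoneID = 1*( unreserved / pct-encoded ), per RFC 6874 §2.
const ZONE_ID = /^(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2})+$/;

// Takes the bracketed host as it appears in a URI, e.g. "[fe80::1%25eth0]".
function hasValidZoneSketch(ipLiteral: string): boolean {
  if (ipLiteral.charAt(0) !== "[" || ipLiteral.charAt(ipLiteral.length - 1) !== "]") return false;
  const inner = ipLiteral.slice(1, -1);
  const percent = inner.indexOf("%");
  if (percent === -1) return true; // no zone at all
  // Inside a URI the delimiter must be the percent-encoded "%25"; a bare "%" is invalid.
  if (inner.slice(percent, percent + 3) !== "%25") return false;
  // The "+" quantifier enforces a non-empty ZoneID.
  return ZONE_ID.test(inner.slice(percent + 3));
}
```

Under this sketch `[fe80::1%25eth0]` passes, while `[fe80::1%eth0]`, `[fe80::1%25]` and a zone containing a space all fail.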
Add an explicit RFC-3986 §3.2.3 empty-port case (port = *DIGIT): 'http://example.com:/path' keeps the port present-but-empty (''), distinct from an absent port (null), and is not an error. Strengthen the removeDotSegments property generator with up to eight leading '../' so the §5.2.4 climb-above-root path is exercised; idempotence and the no-dot-segment invariant still hold.
Map every RFC-3986 operation to its section in the Compliance section — parse (Appendix B), recompose (§5.3), reference resolution (§5.2), percent-encoding (§2.1, §6.2.2.1), character validation (§3.1–§3.5). Tighten the RFC 6874 entry to the §2 ZoneID grammar now enforced, and document that resolveURI ignores a fragment on the base (RFC-3986 §5.1).
npm exposes no pre-publish Trusted Publisher form for a not-yet-existing scoped package, so the first 1.0.0 tag must publish via the org NPM_PACKAGE_REGISTRY_TOKEN. Forward NPM_EXTRA_CONFIG and NPM_PACKAGE_REGISTRY_TOKEN through to the reusable workflow; it auto-detects the token and routes the publish via npm token. Once 1.0.0 is live on npm and @coroboros/uri is configured as a Trusted Publisher of coroboros/uri (workflow ci.yml, environment empty), both secret lines will be dropped in a follow-up so 1.0.1+ publishes via OIDC + --provenance.
API section: each public function lives in its own <details> collapsible under a topical sub-header (Punycode, Parsing, Reference resolution, Validators, Checkers, Encoders, Decoders). Parameter blocks use markdown tables. The exported types (ParsedURI, URIComponents, CheckedURI) get their own entries. Compliance now sits between Usage and API, with the URI grammar diagrams shown once. A new Limitations section gathers the behavior caveats and non-goals as a single flat bullet list, each bullet RFC-cited. Tagline, package.json description, and keywords surface the full RFC stack: IDN (RFC-3987), IPv6 zone identifiers (RFC 6874), domain rules (RFC 1034 / 1123), and the Sitemap protocol.
Surface bench/baseline.md from the README so readers landing on the Compliance section can find the parse / validate / encode numbers against native URL and URL.canParse without scrolling to Contributing. The note states the tradeoff plainly: the toolkit is slower by design because it does full per-character validation, IDN handling, RFC 6874 zone identifiers, and explicit coded errors.
Summary
- `userinfo` and a non-IPv6 host/port split on the last `@` / `:` (§3.2); a present-but-empty query or fragment is preserved and round-trips (§5.3); the Sitemap XML entity table escapes all five entities and a Sitemap URL must be `< 2048` chars (sitemaps.org); an IPv6 zone identifier in a URI must use the `%25` delimiter with a non-empty `ZoneID` of `unreserved` / `pct-encoded` characters (RFC 6874 §2).
- `resolveURI(base, reference)` and `removeDotSegments(path)`: RFC-3986 §5.2 reference resolution implemented verbatim.
- Stricter TypeScript options on top of `strict` (`exactOptionalPropertyTypes`, `erasableSyntaxOnly`, `isolatedDeclarations`).
- `resolveURI` ignores a base fragment (§5.1); add explicit empty-port (§3.2.3) and deep dot-segment (§5.2.4) test coverage.
- Each public function lives in its own `<details>` collapsible, grouped under topical sub-headers (Punycode, Parsing, Reference resolution, Validators, Checkers, Encoders, Decoders), with markdown parameter tables. Document the exported types (`ParsedURI`, `URIComponents`, `CheckedURI`) as their own entries. Move Compliance between Usage and API; add a Limitations section that gathers behavior caveats and non-goals as one flat bullet list.
- Surface the full RFC stack in `package.json.description`, the README tagline, the GitHub repo About, and the topics list: IDN (RFC-3987), IPv6 zone identifiers (RFC 6874), domain rules (RFC 1034 / 1123), Sitemap protocol.
- Surface `bench/baseline.md` from the Compliance section so the parse / validate / encode numbers are reachable from the README body, not only from Contributing.
- The `ci.yml` `secrets:` block forwards `NPM_PACKAGE_REGISTRY_TOKEN` and `NPM_EXTRA_CONFIG` for the first publish; both lines drop in a follow-up PR once the npm Trusted Publisher is configured.

Test plan
- `pnpm lint && pnpm typecheck && pnpm test`: 675 tests pass, lint and typecheck clean
- `pnpm test:coverage`: 100% lines, branches, functions, statements
- `pnpm build`: ESM + CJS + types emit; the dual-types matrix resolves (types-first conditions, separate `.d.mts` / `.d.cts`)
- `pnpm bench`: within the documented regression budget
- `gh repo view coroboros/uri --json description,homepageUrl,repositoryTopics`: `description` matches the README tagline word-for-word; `topics` include `rfc-3987`, `rfc-1034`, `rfc-1123`, `idn` (20 total)
- `<details>` blocks in the rendered README fold and unfold cleanly on GitHub; the anchored cross-references inside (e.g. `[checkURI](#checkers)`) resolve

Breaking changes
- `ParsedURI.query` / `ParsedURI.fragment` are now `''` (were `null`) for a present-but-empty `?` / `#`; `recomposeURI`, `href`, and the encoders/decoders emit the trailing delimiter accordingly.
- Non-digit ports such as `0x1F` are rejected with `URI_INVALID_PORT` (previously coerced).
- Authorities with an extra `@` or stray `:` split on the last delimiter, yielding a different (correct) host.
- `encodeSitemapURL` / `decodeSitemapURL` now escape and round-trip `<`, `>`, `"` (previously only `&`, `'`).
- The Sitemap URL length bound is now exclusive (`< 2048`).
- An IPv6 zone identifier in a URI must use `%25` followed by a non-empty `ZoneID` of `unreserved` / `pct-encoded` characters (RFC 6874 §2); a bare `%`, an empty zone (`[fe80::1%25]`), or out-of-set bytes are rejected with `URI_INVALID_HOST`. The standalone `isIPv6` validator is unchanged.

Upgrade notes
- Distinguish an absent query or fragment (`null`) from a present-empty one (`''`) when reading `parseURI` output.
- Write the IPv6 zone delimiter in a URI as `%25` with a non-empty RFC 6874 ZoneID.
- `1.0.0` is not published yet, so these refine the contract before its first release; the published `1.0.0` tag is the stable contract.
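For the first upgrade note, calling code that used a truthiness check will need a small change. The sketch below assumes a hypothetical `ParsedQuerySketch` subset of the parsed output; only the `null` versus `''` contract is taken from this PR.

```typescript
// Hypothetical subset of the parsed output described in this PR.
interface ParsedQuerySketch {
  query: string | null;
}

type QueryState = "absent" | "empty" | "non-empty";

function queryStateSketch(parsed: ParsedQuerySketch): QueryState {
  if (parsed.query === null) return "absent"; // no '?' in the input
  return parsed.query === "" ? "empty" : "non-empty"; // bare '?' versus '?x'
}

// A truthiness check cannot tell "absent" from "empty": both are falsy.
function hasQueryByTruthiness(parsed: ParsedQuerySketch): boolean {
  return Boolean(parsed.query);
}
```

Comparing against `null` explicitly is the safer habit here, since it keeps working whether or not the component is empty.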