fix: end-to-end audit sweep across all SDKs, server, viewer, integrations, docs#1

Merged: ilanoh merged 8 commits into main from fix/audit-sweep on May 28, 2026
@ilanoh ilanoh commented May 28, 2026

End-to-end audit of every component. All fixes verified by running the real code paths, not just the test suites.

Verification (all green)

  • JS workspace (turbo): 9/9 tasks pass
  • Go: gofmt -l clean, go vet clean, go test ./... pass
  • Python: 66 passed / 1 skipped, mypy --strict clean, ruff clean
  • Docs build clean (marked now a ~36KB lazy chunk vs 516KB eager; pdf.js split out)
  • 7 malicious test vectors still rejected; JS + Python + non-ASCII fixtures all round-trip through the Go reader

Critical fixes

  1. Go Pack() silent data loss → returns a clear "writer not implemented in v0.1" error (Go writer is roadmapped for v0.2) instead of emitting a 0-payload file or panicking.
  2. Validator inline-action bypass (JS + Go) → recursive catalog/trailer graph walk catches inline /OpenAction JavaScript etc., not only indirect objects.
  3. Server served HTML to browsers → */* and browser text/html now get the visual PDF as documented.
  4. Chunker offsets (JS + Python) → now UTF-8 byte offsets per spec §5.1, so vectors map back correctly across SDKs on non-ASCII resumes.

Also fixed

  • viewer-web: portable pdf.js worker (was Vite-only ?url), real lazy loading, crawler text in light DOM, language selection, hardened fetch, demo sample + file picker.
  • JS SDK: /Params /CheckSum, portable filename validation, decode-path validation, predictor rejection, newer-version warning.
  • Python SDK: byte-based integration slicing, CheckSum, valid PDF date zone, robust encryption detection (no uncaught crash), ValueError wrapping, strict typing; plus a latent bug where pack() never emitted the cv:embeddings XMP summary.
  • Server: honors q=0, sanitized Content-Disposition + no error leak, defaultFormat as fallback, Hono header parity, ETag/Last-Modified/304, Fastify pathname guard.
  • Integrations: all three packages now load embeddings + per-chunk vectors (single SDK source of truth), pins widened, fixtures regenerated with embeddings + a non-ASCII variant.
  • docs/tools/spec: accessible pickers, lazy/sanitized marked, corrected CLI/brew instructions, cv-detector RDF attribute-form fallback, pinned veraPDF image, corrected spec §6.3 wording.

Deliberate non-change

cv:version kept at 0.1 (every package is a coherent 0.1.0 pre-release; readers already accept 1.0 gracefully). Flip the three constants together at the real 1.0 cut.

Commits are grouped per component for review.

ilanoh added 7 commits May 28, 2026 23:35
… harden HF/search

Reader path bugs and one critical writer bug found in the end-to-end audit:

* Pack() previously returned no error while pdfcpu's WriteContext silently
  dropped every newly added payload object, emitting a file that passed
  IsCvFile but had zero payloads and failed its own validator (and panicked
  on minimal PDFs). The Go writer is roadmapped for v0.2, so Pack now returns
  a clear "writer not implemented in v0.1" error before any mutation and the
  broken write path is removed. No code path can emit a corrupt file or panic.
* scanForbiddenConstructs only walked indirect objects, so an inline
  /OpenAction JavaScript (or /AA, annotation /A, AcroForm action) bypassed
  validation. Rewritten as a recursive catalog/trailer graph walk with a
  visited set, mirroring the Python implementation. All 7 malicious vectors
  still rejected; a new inline OpenAction test confirms the gap is closed.
* parseHFMatrix/meanPool used unchecked type assertions and panicked on a
  ragged HuggingFace response; now return descriptive errors.
* SearchSemantic panicked on a chunk vector whose length mismatched the space
  dimension; mismatched chunks are now skipped.
* validate now warns (newer-format-version) when cv:version MAJOR exceeds the
  known major (spec 8.3); still accepts 0.1 and 1.0.
* Dropped the fragile last-4KiB /Encrypt byte scan in favor of the
  authoritative parsed trailer Encrypt check.
* cv version no longer prints the spec version twice; reports a distinct CLI
  version. extract help documents that --format defaults to pdf.
* gofmt across the module.
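
The inline-action fix above says the Go walk now mirrors the Python implementation, so here is a minimal Python sketch of the idea: walk everything reachable from the trailer, following inline values as well as indirect references, with a visited set to stop on cycles. The `Ref` class, the plain dict/list object model, and the JavaScript-only check are assumptions made for illustration; the real validators work on parsed PDF objects and cover more constructs (/AA, annotation /A, AcroForm actions).

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference (object number only)."""
    num: int


def find_forbidden(root: Any, objects: dict[int, Any]) -> list[str]:
    """Return the paths of JavaScript actions reachable from root."""
    issues: list[str] = []
    visited: set[int] = set()

    def walk(node: Any, path: str) -> None:
        if isinstance(node, Ref):
            # Resolve each indirect object at most once so cycles terminate.
            if node.num in visited:
                return
            visited.add(node.num)
            walk(objects.get(node.num), path)
        elif isinstance(node, dict):
            # Inline dictionaries are checked too, not only indirect objects.
            if node.get("/S") == "/JavaScript" or "/JS" in node:
                issues.append(path)
            for key, value in node.items():
                walk(value, f"{path}{key}")
        elif isinstance(node, list):
            for index, value in enumerate(node):
                walk(value, f"{path}[{index}]")

    walk(root, "trailer")
    return issues
```

An enumeration of `objects` alone would also see object 1 in a case like `{1: {"/OpenAction": {...}}}`, but only a walk that descends into nested values notices that the action dictionary sits inline inside it.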
…e filenames and decode path

* scanForbiddenConstructs now recursively walks the catalog/trailer object
  graph (resolving refs, visited set) instead of only enumerating indirect
  objects, so inline /OpenAction JavaScript and similar are caught. New
  fixes.test.ts injects an inline action and asserts rejection.
* Embedded-file /Params now carries the spec-mandated /CheckSum (md5 of the
  unwrapped payload bytes) via a new md5 in digest.ts; Info and Params dates
  are written with a UTC offset to match the XMP dateTime (PDF/A-3).
* pack() rejects non-portable payload names (outside [A-Za-z0-9._/-], or with
  . / .. segments); validate() flags filename-not-portable on read.
* CBOR decode (fromCborSpace) now applies the same scalar validation as the
  encode path so readers get the same guarantees against malformed input.
* decodeStream rejects unsupported /DecodeParms predictors instead of
  inflating to garbage.
* validate emits newer-format-version warning for MAJOR >= 2 (accepts 0.1/1.0).
* Removed the last-4KiB /Encrypt byte prefilter; rely on the parsed
  trailerInfo.Encrypt check (parse with ignoreEncryption).
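
The portable-name rule above is compact enough to sketch directly. This is an illustrative Python version, not the SDK's code: the function name is hypothetical, and the PR does not say whether empty segments or a leading slash are also rejected, so this sketch leaves those cases alone.

```python
import re

# Character class taken from the commit message: [A-Za-z0-9._/-]
_PORTABLE_CHARS = re.compile(r"[A-Za-z0-9._/-]+")


def is_portable_name(name: str) -> bool:
    """Hypothetical helper: True if a payload name passes both portability checks."""
    if not _PORTABLE_CHARS.fullmatch(name):
        return False  # empty, or contains a character outside the allowed set
    # Reject "." and ".." path segments so a name cannot traverse directories.
    return all(segment not in (".", "..") for segment in name.split("/"))
```

Under these assumptions `payloads/resume.en.md` passes, while `../secret`, `a/./b`, and `résumé.md` are rejected.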
…js, light-DOM crawler text

embed-js:
* chunk.ts now computes text-offset/text-length as UTF-8 byte offsets (encode
  once, track a byte cursor, decode byte slices) per spec 5.1, instead of
  UTF-16 code-unit offsets that disagreed with the Go/Python SDKs on any
  non-ASCII resume. New multibyte test covers accents, CJK, emoji.
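
To make the "encode once, track a byte cursor, decode byte slices" approach concrete, here is a small Python sketch. The fixed-size byte window and the function name are assumptions chosen for brevity; the actual chunkers presumably split on more meaningful boundaries. What matters is that offsets and lengths count UTF-8 bytes and never split a character.

```python
def chunk_by_bytes(text: str, max_bytes: int) -> list[tuple[int, int, str]]:
    """Split text into (byte_offset, byte_length, chunk_text) triples."""
    if max_bytes < 4:
        # A UTF-8 character is at most 4 bytes, so smaller windows could stall.
        raise ValueError("max_bytes must be at least 4")
    data = text.encode("utf-8")  # encode once
    chunks: list[tuple[int, int, str]] = []
    cursor = 0  # byte cursor into data
    while cursor < len(data):
        end = min(cursor + max_bytes, len(data))
        # Back up while end points at a continuation byte (0b10xxxxxx),
        # so the slice always ends on a character boundary.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append((cursor, end - cursor, data[cursor:end].decode("utf-8")))
        cursor = end
    return chunks
```

For the four-character string `"é日😀a"` the byte offsets are 0, 2, 5, 9, whereas UTF-16 code-unit offsets would be 0, 1, 2, 4. That kind of disagreement between SDKs is what the switch to byte offsets removes.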

viewer-web:
* Replaced the Vite-only `?url` pdf.js worker import (which broke every
  non-Vite consumer) with a portable new URL(..., import.meta.url) plus a
  configurable worker-src; verified the compiled dist no longer contains ?url.
* Enabled tsup code splitting and removed the static render-pdf import so
  pdf.js is now a lazily loaded chunk (entry dropped from ~17KB pulling pdf.js
  to a small shell).
* The extracted clean text is now projected into the light DOM (a
  visually-hidden child + cleanText getter) so crawlers can index it, instead
  of being buried in the shadow root.
* Language-aware payload selection (pickPayloadByLanguage), mirroring the SDK.
* Hardened src fetch: credentials omit, redirect follow, timeout, size cap.
* Demo: sample.cv moved under public/ so it ships (was a 404), fetch guarded
  on res.ok, native file input replaced with a hidden-input + styled-button.
* render-markdown: dropped the misleading partial DOMPurify blocklist (relies
  on the safe html profile) and adds rel="noopener noreferrer" to links.
* Added vitest + happy-dom test setup for viewer-web (new payload-selection
  and light-DOM tests); lockfile updated for the test devDeps.
…y adapters

* Negotiation rewritten: a browser request (*/* or text/html alongside a
  wildcard) now gets the visual PDF as the README promises, instead of the
  HTML extract. Markdown is served only as an explicit top non-wildcard
  preference; text/html alone (no wildcard) is a deliberate HTML fetch; PDF is
  the fallback. Verified live across Express, Fastify, Hono.
* parseAccept drops q<=0 entries (RFC 9110) and clamps q to [0,1]; a malformed
  q skips the type.
* Content-Disposition filename is sanitized (strips CR/LF, quotes, control
  chars) with RFC 5987 filename* for non-ASCII; the 500 handler no longer
  echoes internal error messages and guards headersSent.
* defaultFormat is now the final negotiation fallback rather than a forced
  format that overrode an explicit Accept.
* New shared response.ts builds the negotiated body + identical headers for
  every adapter, so Hono now sets Content-Length and Content-Disposition too.
* Added weak ETag (keyed per negotiated format) + Last-Modified +
  If-None-Match/If-Modified-Since 304 handling.
* Fastify guard checks the parsed pathname ends in .cv (no longer matches
  /foo?x=.cv).
* Dropped the non-standard cv-language media-type parameter; language travels
  in the standard Content-Language header.
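
The negotiation rules above can be sketched roughly as follows. The server itself is JavaScript; this Python version is only an illustration of the described behavior. The function names, the `text/markdown` and `application/pdf` media types, and the exact order of the checks are assumptions, not the shipped logic.

```python
import math


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Return (media_type, q) pairs, highest q first; unusable entries are skipped."""
    entries: list[tuple[str, float]] = []
    for part in header.split(","):
        pieces = [piece.strip() for piece in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q, malformed = 1.0, False
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() != "q":
                continue
            try:
                q = float(value.strip())
            except ValueError:
                malformed = True
            else:
                malformed = malformed or math.isnan(q)
        if malformed:
            continue  # a malformed q skips the type
        q = min(q, 1.0)  # clamp the upper bound
        if q <= 0:
            continue  # q=0 means "not acceptable"
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def negotiate(header: str, default_format: str = "pdf") -> str:
    """Pick a format name; default_format is only the final fallback."""
    types = [media_type for media_type, _ in parse_accept(header)]
    has_wildcard = any("*" in media_type for media_type in types)
    explicit = [media_type for media_type in types if "*" not in media_type]
    if explicit and explicit[0] == "text/markdown":
        return "markdown"  # explicit top non-wildcard preference
    if "application/pdf" in explicit:
        return "pdf"
    if "text/html" in explicit and not has_wildcard:
        return "html"  # text/html alone is a deliberate HTML fetch
    if has_wildcard:
        return "pdf"  # browser-style request gets the visual PDF
    return default_format
```

With these assumptions, a typical browser header such as `text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8` resolves to `pdf`, while a bare `text/html` resolves to `html`.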
…n, emit embeddings summary

* embed/_chunk.py computes UTF-8 byte offsets (operates on encoded bytes) per
  spec 5.1, replacing code-point offsets; the bundled langchain/llamaindex
  integrations slice chunk text on encoded bytes. New multibyte tests.
* pack() previously never wrote the cv:embeddings XMP summary (inspect showed
  none); it now accepts an EmbeddingsPayload and derives per-space summaries,
  matching the JS SDK. New embed/_resolve.py centralizes chunk resolution.
* Embedded-file /Params gains the spec-mandated /CheckSum (md5 of payload).
* PDF dates emit the valid +00'00' UTC offset instead of a trailing Z.
* validate() detects encryption via the parsed reader (is_encrypted/trailer)
  and broadens the except so a malformed encrypted file yields a
  pdf-parse-failed issue instead of an uncaught KeyError; all 7 malicious
  vectors still rejected. Adds the newer-format-version warning.
* extract()/inspect() wrap pypdf-internal exceptions as ValueError.
* mypy --strict and ruff now pass clean (proper Literal typing on the
  validation level, narrowed pypdf object types, optional-stub overrides).
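
Two of the smaller items above, the /CheckSum value and the date offset, can be illustrated with the standard library alone. This is a sketch under assumptions: the helper names are hypothetical, and how the SDK attaches these values to the /Params and Info dictionaries is not shown in the PR.

```python
import hashlib
from datetime import datetime, timezone


def payload_checksum(payload: bytes) -> bytes:
    """MD5 digest of the payload bytes, as a 16-byte string for /CheckSum."""
    return hashlib.md5(payload).digest()


def pdf_date(moment: datetime) -> str:
    """Format an aware datetime as a PDF date string with an explicit UTC offset."""
    if moment.tzinfo is None:
        raise ValueError("pdf_date needs a timezone-aware datetime")
    utc = moment.astimezone(timezone.utc)
    # Offset form +00'00' rather than a trailing Z, as the commit describes.
    return utc.strftime("D:%Y%m%d%H%M%S") + "+00'00'"
```

For example, `pdf_date(datetime(2026, 5, 28, 20, 47, tzinfo=timezone.utc))` gives `D:20260528204700+00'00'`.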
…ctor attribute form, spec wording

docs:
* Replaced the bare native file inputs on /view and /create with the
  accessible styled-label + visually-hidden-input pattern (drag and drop
  preserved).
* /create lazy-imports marked (was a 516KB eager bundle, now a ~36KB on-demand
  chunk) and escapes raw HTML in the markdown to HTML conversion before
  embedding it as resume.html (spec 7.3).
* Corrected the create-page instructions: the cv CLI is reader-only (removed
  the nonexistent `cv pack --embed-with`; point to the Python embed path), and
  the brew commands are now consistent with the install page.

tools:
* cv-detector (go, python, typescript) now also recognizes the RDF
  attribute-form cv:version, not only the element form, so the three agree.
  New attribute-form test in each.
* verapdf-runner pins a tagged veraPDF image and defaults to the host arch
  (overridable) instead of latest + forced linux/amd64.
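
As an illustration of the two RDF serializations the detector now has to recognize, here is a regex-based Python sketch. It assumes the `cv` prefix is used literally and does no namespace resolution; the real cv-detector implementations may well use an XML parser instead, and the function name is hypothetical.

```python
import re
from typing import Optional

# Element form:   <cv:version>0.1</cv:version>
_ELEMENT_FORM = re.compile(r"<cv:version\s*>\s*([^<\s]+)\s*</cv:version\s*>")
# Attribute form: <rdf:Description ... cv:version="0.1" ...>
_ATTRIBUTE_FORM = re.compile(r"""\bcv:version\s*=\s*(["'])(.*?)\1""")


def detect_cv_version(xmp: str) -> Optional[str]:
    """Return the cv:version value from an XMP packet, or None if absent."""
    match = _ELEMENT_FORM.search(xmp)
    if match:
        return match.group(1)
    match = _ATTRIBUTE_FORM.search(xmp)
    if match:
        return match.group(2)
    return None
```

Both forms carry the same property, which is why a detector that only checks the element form can disagree with one that checks both.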

spec:
* 6.3 corrected: cv:alternates/integrity/embeddings are XMP Text holding a
  JSON-encoded array (as all SDKs implement and declare in the PDF/A extension
  schema), not rdf:Bag of struct.
…regenerate fixtures

* The standalone langchain, llama-index, and haystack packages only emitted
  text payloads and silently dropped embeddings.cbor, so RAG users got no
  precomputed vectors (the format's main selling point). They now expose a
  chunks mode that loads per-chunk vectors, delegating to the SDK's
  resolve_embedding_chunks (one source of truth) with UTF-8 byte-offset
  slicing.
* Widened the cvfile dependency pin from <1 to <2 so a future 1.0 SDK installs.
* Dropped the per-alternate primary-language fallback that mislabeled
  alternates lacking an explicit language.
* Regenerated the shared python-produced.cv fixture so it contains an
  embeddings.cbor payload (the old one had none, so no test exercised the
  vector path) and added a non-ASCII unicode.cv fixture; interop.test.ts
  asserts the embeddings summary surfaces.

The server handler imported assert_never from typing, which only exists in
Python 3.11+, breaking test collection on 3.10 (the package declares
requires-python >=3.10). Guard the import behind sys.version_info with a
NoReturn fallback for 3.10. pytest/mypy/ruff all clean.
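
A guard of the kind described could look like the sketch below. The `Format` alias and `content_type` function are hypothetical and only show the fallback in use; the fallback's error message imitates the standard library's wording and is not taken from the PR.

```python
import sys
from typing import Literal, NoReturn

if sys.version_info >= (3, 11):
    from typing import assert_never
else:
    def assert_never(value: NoReturn) -> NoReturn:
        """Fallback for Python 3.10, where typing.assert_never does not exist."""
        raise AssertionError(f"Expected code to be unreachable, but got: {value!r}")

Format = Literal["pdf", "html", "markdown"]  # hypothetical alias for illustration


def content_type(fmt: Format) -> str:
    """Map a negotiated format to a media type, with an exhaustiveness check."""
    if fmt == "pdf":
        return "application/pdf"
    if fmt == "html":
        return "text/html"
    if fmt == "markdown":
        return "text/markdown"
    assert_never(fmt)  # type checkers flag this line if a Format case is missed
```

Branching on `sys.version_info` rather than using try/except around the import lets mypy follow the version check and pick the right definition.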
@ilanoh ilanoh merged commit 109ae04 into main May 28, 2026
10 checks passed
@ilanoh ilanoh deleted the fix/audit-sweep branch May 28, 2026 20:47