fix: end-to-end audit sweep across all SDKs, server, viewer, integrations, docs #1
Merged
… harden HF/search Reader path bugs and one critical writer bug found in the end-to-end audit:
* Pack() previously returned no error while pdfcpu's WriteContext silently dropped every newly added payload object, emitting a file that passed IsCvFile but had zero payloads and failed its own validator (and panicked on minimal PDFs). The Go writer is roadmapped for v0.2, so Pack now returns a clear "writer not implemented in v0.1" error before any mutation and the broken write path is removed. No code path can emit a corrupt file or panic.
* scanForbiddenConstructs only walked indirect objects, so an inline /OpenAction JavaScript (or /AA, annotation /A, AcroForm action) bypassed validation. Rewritten as a recursive catalog/trailer graph walk with a visited set, mirroring the Python implementation. All 7 malicious vectors are still rejected; a new inline OpenAction test confirms the gap is closed.
* parseHFMatrix/meanPool used unchecked type assertions and panicked on a ragged HuggingFace response; they now return descriptive errors.
* SearchSemantic panicked on a chunk vector whose length mismatched the space dimension; mismatched chunks are now skipped.
* validate now warns (newer-format-version) when cv:version MAJOR exceeds the known major (spec 8.3); it still accepts 0.1 and 1.0.
* Dropped the fragile last-4KiB /Encrypt byte scan in favor of the authoritative parsed trailer Encrypt check.
* cv version no longer prints the spec version twice; it reports a distinct CLI version. extract help documents that --format defaults to pdf.
* gofmt across the module.
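A minimal sketch of the graph walk described above, written in Python for illustration only. The `Ref` class, the dict-based object table, and the `FORBIDDEN_KEYS` set are hypothetical stand-ins, not the SDK's types, and the walk uses an explicit stack rather than recursion. The point it shows is that starting from the trailer and descending into inline dictionaries catches an action that an indirect-objects-only scan would miss, while the visited set keeps cyclic references such as /Parent from looping.

```python
from dataclasses import dataclass

# Illustrative subset only; the real validator's list is not shown in this PR.
FORBIDDEN_KEYS = {"/OpenAction", "/AA", "/JS"}


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for a PDF indirect reference."""
    num: int


def find_forbidden(root, objects):
    """Walk the object graph from `root`, resolving refs via `objects`.

    Returns the paths of every forbidden key reached, including keys that
    sit in inline dictionaries rather than in indirect objects.
    """
    hits = []
    visited = set()  # object numbers already resolved (cycle guard)
    stack = [(root, "root")]
    while stack:
        node, path = stack.pop()
        if isinstance(node, Ref):
            if node.num in visited:
                continue
            visited.add(node.num)
            node = objects.get(node.num)  # a dangling ref resolves to None
        if isinstance(node, dict):
            for key, value in node.items():
                child = f"{path}{key}"
                if key in FORBIDDEN_KEYS:
                    hits.append(child)
                stack.append((value, child))
        elif isinstance(node, list):
            for i, value in enumerate(node):
                stack.append((value, f"{path}[{i}]"))
    return hits
```

Calling `find_forbidden(trailer, objects)` on a catalog whose /OpenAction is an inline dictionary reports it even though no indirect object holds the action.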
…e filenames and decode path
* scanForbiddenConstructs now recursively walks the catalog/trailer object graph (resolving refs, visited set) instead of only enumerating indirect objects, so inline /OpenAction JavaScript and similar are caught. New fixes.test.ts injects an inline action and asserts rejection.
* Embedded-file /Params now carries the spec-mandated /CheckSum (md5 of the unwrapped payload bytes) via a new md5 in digest.ts; Info and Params dates are written with a UTC offset to match the XMP dateTime (PDF/A-3).
* pack() rejects non-portable payload names (outside [A-Za-z0-9._/-], or with . / .. segments); validate() flags filename-not-portable on read.
* CBOR decode (fromCborSpace) now applies the same scalar validation as the encode path so readers get the same guarantees against malformed input.
* decodeStream rejects unsupported /DecodeParms predictors instead of inflating to garbage.
* validate emits newer-format-version warning for MAJOR >= 2 (accepts 0.1/1.0).
* Removed the last-4KiB /Encrypt byte prefilter; rely on the parsed trailerInfo.Encrypt check (parse with ignoreEncryption).
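The portable-name rule above can be sketched as follows. This is an illustration in Python, not the SDK's code: `is_portable_name` is a hypothetical helper, and it checks only the two conditions stated in the commit message (the allowed character set, and no `.` or `..` path segments). Whether the real check also rejects empty segments or leading slashes is not stated here, so the sketch leaves those alone.

```python
import re

# Character set taken from the commit message: [A-Za-z0-9._/-]
_PORTABLE = re.compile(r"[A-Za-z0-9._/-]+")


def is_portable_name(name: str) -> bool:
    """Return True if `name` uses only portable characters and has no
    '.' or '..' path segments. Hypothetical helper for illustration."""
    if not _PORTABLE.fullmatch(name):
        return False  # empty name or a character outside the allowed set
    return all(segment not in (".", "..") for segment in name.split("/"))
```

For example, `is_portable_name("data/embeddings.cbor")` holds, while `"../secret"`, `"a/./b"`, and `"résumé.md"` are all rejected.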
…js, light-DOM crawler text
embed-js:
* chunk.ts now computes text-offset/text-length as UTF-8 byte offsets (encode once, track a byte cursor, decode byte slices) per spec 5.1, instead of UTF-16 code-unit offsets that disagreed with the Go/Python SDKs on any non-ASCII resume. New multibyte test covers accents, CJK, emoji.
viewer-web:
* Replaced the Vite-only `?url` pdf.js worker import (which broke every non-Vite consumer) with a portable new URL(..., import.meta.url) plus a configurable worker-src; verified the compiled dist no longer contains ?url.
* Enabled tsup code splitting and removed the static render-pdf import so pdf.js is now a lazily loaded chunk (entry dropped from ~17KB pulling pdf.js to a small shell).
* The extracted clean text is now projected into the light DOM (a visually-hidden child + cleanText getter) so crawlers can index it, instead of being buried in the shadow root.
* Language-aware payload selection (pickPayloadByLanguage), mirroring the SDK.
* Hardened src fetch: credentials omit, redirect follow, timeout, size cap.
* Demo: sample.cv moved under public/ so it ships (was a 404), fetch guarded on res.ok, native file input replaced with a hidden-input + styled-button.
* render-markdown: dropped the misleading partial DOMPurify blocklist (relies on the safe html profile) and adds rel="noopener noreferrer" to links.
* Added vitest + happy-dom test setup for viewer-web (new payload-selection and light-DOM tests); lockfile updated for the test devDeps.
…y adapters
* Negotiation rewritten: a browser request (*/* or text/html alongside a wildcard) now gets the visual PDF as the README promises, instead of the HTML extract. Markdown is served only as an explicit top non-wildcard preference; text/html alone (no wildcard) is a deliberate HTML fetch; PDF is the fallback. Verified live across Express, Fastify, Hono.
* parseAccept drops q<=0 entries (RFC 9110) and clamps q to [0,1]; a malformed q skips the type.
* Content-Disposition filename is sanitized (strips CR/LF, quotes, control chars) with RFC 5987 filename* for non-ASCII; the 500 handler no longer echoes internal error messages and guards headersSent.
* defaultFormat is now the final negotiation fallback rather than a forced format that overrode an explicit Accept.
* New shared response.ts builds the negotiated body + identical headers for every adapter, so Hono now sets Content-Length and Content-Disposition too.
* Added weak ETag (keyed per negotiated format) + Last-Modified + If-None-Match/If-Modified-Since 304 handling.
* Fastify guard checks the parsed pathname ends in .cv (no longer matches /foo?x=.cv).
* Dropped the non-standard cv-language media-type parameter; language travels in the standard Content-Language header.
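The q-value handling described for parseAccept can be sketched like this. It is a Python illustration of the stated rules only (drop q<=0, clamp to 1, skip a type whose q is malformed); the function name `parse_accept` and its return shape are assumptions, and the real TypeScript implementation may differ in details such as tie-breaking.

```python
import math


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Parse an Accept header into (media_type, q) pairs, best first.

    Hypothetical sketch: entries with q <= 0 are dropped, q is clamped
    to 1.0, and a type with a malformed q is skipped entirely.
    """
    entries = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0  # default weight when no q parameter is present
        malformed = False
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() != "q":
                continue
            try:
                q = float(value.strip())
            except ValueError:
                malformed = True
            break
        if malformed or math.isnan(q) or q <= 0:
            continue
        entries.append((media_type, min(q, 1.0)))
    # sort() is stable, so equal weights keep their header order
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries
```

With the header `text/html;q=0.8, text/markdown, */*;q=0, application/pdf;q=abc`, only `text/markdown` and `text/html` survive, in that order.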
…n, emit embeddings summary
* embed/_chunk.py computes UTF-8 byte offsets (operates on encoded bytes) per spec 5.1, replacing code-point offsets; the bundled langchain/llamaindex integrations slice chunk text on encoded bytes. New multibyte tests.
* pack() previously never wrote the cv:embeddings XMP summary (inspect showed none); it now accepts an EmbeddingsPayload and derives per-space summaries, matching the JS SDK. New embed/_resolve.py centralizes chunk resolution.
* Embedded-file /Params gains the spec-mandated /CheckSum (md5 of payload).
* PDF dates emit the valid +00'00' UTC offset instead of a trailing Z.
* validate() detects encryption via the parsed reader (is_encrypted/trailer) and broadens the except so a malformed encrypted file yields a pdf-parse-failed issue instead of an uncaught KeyError; all 7 malicious vectors still rejected. Adds the newer-format-version warning.
* extract()/inspect() wrap pypdf-internal exceptions as ValueError.
* mypy --strict and ruff now pass clean (proper Literal typing on the validation level, narrowed pypdf object types, optional-stub overrides).
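To make the byte-offset change concrete, here is a minimal sketch of chunking on encoded bytes. The fixed byte budget and the `chunk_by_bytes` helper are assumptions for illustration, not the logic in embed/_chunk.py; only the idea of recording text-offset and text-length as UTF-8 byte positions, and slicing the encoded bytes to recover chunk text, comes from the commit message.

```python
def chunk_by_bytes(text: str, max_bytes: int) -> list[dict]:
    """Split `text` into chunks of at most `max_bytes` UTF-8 bytes,
    recording offsets and lengths in bytes (not code points).

    Hypothetical helper: real chunkers usually split on sentence or
    token boundaries; a fixed byte budget keeps the sketch short.
    """
    if max_bytes < 1:
        raise ValueError("max_bytes must be at least 1")
    data = text.encode("utf-8")  # encode once, then track a byte cursor
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Never cut inside a character: continuation bytes are 0b10xxxxxx.
        while start < end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            # Budget smaller than one character: take the whole character.
            end = start + 1
            while end < len(data) and (data[end] & 0xC0) == 0x80:
                end += 1
        chunks.append({
            "text-offset": start,
            "text-length": end - start,
            "text": data[start:end].decode("utf-8"),
        })
        start = end
    return chunks
```

A reader in any language can then recover a chunk with `data[offset:offset + length]` on the encoded text, which is what keeps the SDKs in agreement on non-ASCII input.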
…ctor attribute form, spec wording
docs:
* Replaced the bare native file inputs on /view and /create with the accessible styled-label + visually-hidden-input pattern (drag and drop preserved).
* /create lazy-imports marked (was a 516KB eager bundle, now a ~36KB on-demand chunk) and escapes raw HTML in the markdown to HTML conversion before embedding it as resume.html (spec 7.3).
* Corrected the create-page instructions: the cv CLI is reader-only (removed the nonexistent `cv pack --embed-with`; point to the Python embed path), and the brew commands are now consistent with the install page.
tools:
* cv-detector (go, python, typescript) now also recognizes the RDF attribute-form cv:version, not only the element form, so the three agree. New attribute-form test in each.
* verapdf-runner pins a tagged veraPDF image and defaults to the host arch (overridable) instead of latest + forced linux/amd64.
spec:
* 6.3 corrected: cv:alternates/integrity/embeddings are XMP Text holding a JSON-encoded array (as all SDKs implement and declare in the PDF/A extension schema), not rdf:Bag of struct.
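For readers unfamiliar with the two RDF serializations mentioned under tools, the sketch below shows what recognizing both forms could look like. It is a hypothetical regex-based illustration: `detect_cv_version` is not the cv-detector's actual function, the MAJOR.MINOR version pattern is an assumption, and a regex that matches the literal `cv:` prefix ignores namespace rebinding, which a full XML parser would handle.

```python
import re
from typing import Optional

# Element form:   <cv:version>0.1</cv:version>
_ELEMENT_FORM = re.compile(r"<cv:version>\s*([0-9]+\.[0-9]+)\s*</cv:version>")
# Attribute form: <rdf:Description ... cv:version="0.1" ...>
_ATTRIBUTE_FORM = re.compile(r"""\bcv:version\s*=\s*["']([0-9]+\.[0-9]+)["']""")


def detect_cv_version(xmp: str) -> Optional[str]:
    """Return the cv:version value from an XMP packet in either RDF form,
    or None if neither is present. Hypothetical sketch only."""
    for pattern in (_ELEMENT_FORM, _ATTRIBUTE_FORM):
        match = pattern.search(xmp)
        if match:
            return match.group(1)
    return None
```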
…regenerate fixtures
* The standalone langchain, llama-index, and haystack packages only emitted text payloads and silently dropped embeddings.cbor, so RAG users got no precomputed vectors (the format's main selling point). They now expose a chunks mode that loads per-chunk vectors, delegating to the SDK's resolve_embedding_chunks (one source of truth) with UTF-8 byte-offset slicing.
* Widened the cvfile dependency pin from <1 to <2 so a future 1.0 SDK installs.
* Dropped the per-alternate primary-language fallback that mislabeled alternates lacking an explicit language.
* Regenerated the shared python-produced.cv fixture so it contains an embeddings.cbor payload (the old one had none, so no test exercised the vector path) and added a non-ASCII unicode.cv fixture; interop.test.ts asserts the embeddings summary surfaces.
The server handler imported assert_never from typing, which only exists in Python 3.11+, breaking test collection on 3.10 (the package declares requires-python >=3.10). Guard the import behind sys.version_info with a NoReturn fallback for 3.10. pytest/mypy/ruff all clean.
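The guard described above typically looks like the sketch below. The fallback body and the `exit_code` usage function are illustrative assumptions; only the version check and the NoReturn-typed fallback come from the commit message.

```python
import sys
from typing import Literal

if sys.version_info >= (3, 11):
    from typing import assert_never
else:
    from typing import NoReturn

    def assert_never(value: NoReturn) -> NoReturn:
        """Fallback for Python 3.10, where typing.assert_never is missing."""
        raise AssertionError(f"Expected code to be unreachable, but got: {value!r}")


# Hypothetical usage: exhaustiveness check over a Literal type.
Level = Literal["error", "warning"]


def exit_code(level: Level) -> int:
    if level == "error":
        return 1
    if level == "warning":
        return 0
    assert_never(level)  # type checkers flag this if a Level case is unhandled
```

Type checkers understand the `sys.version_info` comparison, so mypy picks the right branch for the targeted Python version.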
End-to-end audit of every component. All fixes verified by running the real code paths, not just the test suites.
Verification (all green)
* `gofmt -l` clean, `go vet` clean, `go test ./...` pass
* `mypy --strict` clean, `ruff` clean

Critical fixes
* `Pack()` silent data loss → returns a clear "writer not implemented in v0.1" error (Go writer is roadmapped for v0.2) instead of emitting a 0-payload file or panicking.
* … `/OpenAction` JavaScript etc., not only indirect objects.
* … `*/*` and browser `text/html` now get the visual PDF as documented.

Also fixed
* … `?url`), real lazy loading, crawler text in light DOM, language selection, hardened fetch, demo sample + file picker.
* … `/Params /CheckSum`, portable filename validation, decode-path validation, predictor rejection, newer-version warning.
* … `ValueError` wrapping, strict typing; plus a latent bug where `pack()` never emitted the `cv:embeddings` XMP summary.
* … `q=0`, sanitized `Content-Disposition` + no error leak, `defaultFormat` as fallback, Hono header parity, ETag/Last-Modified/304, Fastify pathname guard.

Deliberate non-change
* `cv:version` kept at `0.1` (every package is a coherent `0.1.0` pre-release; readers already accept `1.0` gracefully). Flip the three constants together at the real 1.0 cut.

Commits are grouped per component for review.