Conversation
Uses an on-device VLM (FastVLM-0.5B via WebGPU) to read book spines from the camera and match them against the want-to-read list. Works offline after the initial model download.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
onnxruntime-node (143 MB) and @img/sharp (16 MB) are transitive deps of @huggingface/transformers that are only needed for Node.js inference, not browser WebGPU. A post-build cleanup removes them from the Nitro function output to stay under Vercel's 250 MB limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
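A cleanup step of this kind can be sketched as a small Node script. The script name, output directory, and function are illustrative assumptions, not the repo's actual code:

```typescript
// prune-server-output.ts — hypothetical post-build step.
// Removes Node-only transitive deps (onnxruntime-node, @img/sharp) from the
// serverless bundle; browser WebGPU inference never imports them.
import { existsSync, rmSync } from "node:fs";
import { join } from "node:path";

// Assumed Nitro output location; adjust to your build config.
const serverNodeModules = join(".output", "server", "node_modules");

export function pruneServerOutput(baseDir: string = serverNodeModules): string[] {
  const doomed = ["onnxruntime-node", join("@img", "sharp")];
  const removed: string[] = [];
  for (const dep of doomed) {
    const target = join(baseDir, dep);
    if (existsSync(target)) {
      // Recursive delete of the whole package directory.
      rmSync(target, { recursive: true, force: true });
      removed.push(target);
    }
  }
  return removed;
}
```

Wired into package.json as a `postbuild` script, it runs after every `npm run build` and is a no-op when the directories are absent.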
- Use ref callback to attach media stream when video element mounts, fixing AbortError when camera starts after model download
- Add per-file download progress (filename + percentage) to model loading UI via HuggingFace progress_callback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FastVLM-0.5B crashed browser tabs (OOM at ~605 MB). Switch to tesseract.js, a lightweight WASM-based OCR engine (~6 MB total) that runs in any browser. The serverless function drops from 250 MB+ to 13.5 MB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Init Tesseract worker on page mount instead of during scan
- Split scanBookshelf into captureFrame + recognizeFrame so the frame is captured before React re-render swaps the video element
- Share single <video> element across camera-active and scanning states to prevent unmount/remount losing video dimensions
- Add no-force-push constraint to CLAUDE.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract struggles with rotated/angled book spine text. Switch to PP-OCRv4 via @gutenye/ocr-browser, which uses a DB text detection model that handles text at any angle. Returns structured results with confidence scores instead of raw text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite bookshelf scanner from single-shot to continuous scanning that accumulates de-duped results. Add character bigram similarity for better OCR typo tolerance. Fix iOS Chrome hang by disabling WASM multi-threading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
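Character-bigram similarity of the kind described can be sketched as a Dice coefficient over bigram counts (function names are illustrative, not the PR's actual code). A single misread character only breaks the two bigrams touching it, so one OCR typo barely moves the score:

```typescript
// Count character bigrams of a normalized string, e.g. "hail" -> ha, ai, il.
function bigrams(s: string): Map<string, number> {
  const counts = new Map<string, number>();
  const t = s.toLowerCase().replace(/\s+/g, " ").trim();
  for (let i = 0; i < t.length - 1; i++) {
    const b = t.slice(i, i + 2);
    counts.set(b, (counts.get(b) ?? 0) + 1);
  }
  return counts;
}

// Dice coefficient: 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|).
export function bigramSimilarity(a: string, b: string): number {
  const ba = bigrams(a);
  const bb = bigrams(b);
  let overlap = 0;
  let total = 0;
  for (const [bg, n] of ba) {
    total += n;
    overlap += Math.min(n, bb.get(bg) ?? 0);
  }
  for (const n of bb.values()) total += n;
  if (total === 0) return a === b ? 1 : 0; // both shorter than 2 chars
  return (2 * overlap) / total;
}
```

Compared with token-level Jaccard, this tolerates in-word OCR substitutions ("Hali" vs "Hail"), which a whole-word match would score as zero overlap.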
The previous numThreads fix was in onnxOptions (session-level), but onnxruntime-web decides whether to spawn a web worker at import time. On iOS, the blob-URL worker can't fetch same-origin WASM files (CORS). Setting env.wasm.numThreads=1 globally prevents the worker entirely. Also cache .wasm files in the service worker.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
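In onnxruntime-web terms, the global setting looks roughly like this (the exact placement in this app is an assumption; the key point is that it must run before any session exists):

```typescript
import * as ort from "onnxruntime-web";

// Must run before any InferenceSession is created: the WASM backend
// decides whether to spawn its blob-URL web worker based on this flag,
// so a per-session option is already too late.
ort.env.wasm.numThreads = 1;
```

With a single thread there is no worker, so the iOS blob-URL/CORS fetch path is never exercised.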
Wrap globalThis.fetch during Ocr.create() to intercept model downloads and pipe them through a progress-tracking ReadableStream. Data flows through once, with no extra memory copies. UI shows per-model download progress (e.g. "Downloading recognition model... 5.2/10.0 MB (52%)"). Also cache .wasm and .onnx files in the service worker for faster subsequent loads on production.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
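The pass-through progress stream can be sketched like this. The PR wraps globalThis.fetch; here it is shown as a standalone helper, and the callback shape is an assumption:

```typescript
// Observe a Response body as it streams by, reporting cumulative bytes.
// Chunks are re-enqueued as-is: observed, never copied.
type OnProgress = (loaded: number, total: number) => void;

export function withProgress(res: Response, onProgress: OnProgress): Response {
  if (!res.body) return res;
  const total = Number(res.headers.get("Content-Length") ?? 0);
  let loaded = 0;
  const tracked = res.body.pipeThrough(
    new TransformStream<Uint8Array, Uint8Array>({
      transform(chunk, controller) {
        loaded += chunk.byteLength;
        onProgress(loaded, total);
        controller.enqueue(chunk); // same buffer, no extra copy
      },
    }),
  );
  return new Response(tracked, { status: res.status, headers: res.headers });
}
```

Wrapping fetch is then a matter of saving the original, calling it, and returning `withProgress(response, callback)` for model URLs, with the original restored once Ocr.create() resolves.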
- Camera: 1280x720 instead of 1920x1080 (less video memory)
- Frame capture: 640px max instead of 1280px (plenty for OCR text)
- Blob URL instead of base64 data URL (eliminates 33% encoding overhead)
- JPEG at 0.8 quality instead of PNG (much smaller frame blobs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nail

Disable WASM memory arena and allocation pattern caching to lower peak memory during OCR inference. Drop capture resolution to 480px and camera to 640x480. Replace full-width video with compact thumbnail so results are visible while scanning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
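In onnxruntime-web session options, those two flags look roughly like this (the model path is a placeholder, and whether the library applies them here exactly as sketched is an assumption):

```typescript
import * as ort from "onnxruntime-web";

// Trade some throughput for lower peak memory: no arena growth,
// no cached allocation plans reused between inference runs.
const session = await ort.InferenceSession.create("/models/recognition.onnx", {
  enableCpuMemArena: false,
  enableMemPattern: false,
});
```

Both are standard ONNX Runtime session options; disabling them matters most on memory-constrained mobile browsers where the arena's high-water mark otherwise persists between scans.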
Summary

Adds a /books/scan tab that uses the device camera + an on-device Vision Language Model (FastVLM-0.5B via WebGPU) to read book spines and match them against the want-to-read list.

New files

- lib/vlm-scanner.ts — VLM model loading/inference with dynamic imports and WebGPU support check
- lib/book-matcher.ts — Jaccard-similarity fuzzy matching of extracted titles against want-to-read list
- src/app/books/scan.tsx — Scan route with camera lifecycle, hydration guard, and match display

Modified files

- src/app/books.tsx — 5th "Scan" tab
- public/sw.js — Preserve HuggingFace transformers caches on SW update
- package.json — @huggingface/transformers, @webgpu/types
- tsconfig.json, eslint.config.js — WebGPU types and browser globals

Test plan

- /books/scan — idle state with "Enable Camera" and "Pre-load Model" buttons
- npm run build succeeds, npm run lint passes

🤖 Generated with Claude Code