Add bookshelf scanner tab #27

Open
bechols wants to merge 11 commits into main from feat/bookshelf-scanner

Conversation


@bechols bechols commented Feb 16, 2026

Summary

  • Adds /books/scan tab that uses the device camera + an on-device Vision Language Model (FastVLM-0.5B via WebGPU) to read book spines and match them against the want-to-read list
  • Works fully offline after initial ~300MB model download — model cached in browser Cache Storage, want-to-read list cached via React Query
  • State machine UI: idle → loading-model → camera-active → scanning → results, with streaming text output as the model reads spines
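
The state machine described above can be sketched as a TypeScript union plus a transition table. This is illustrative only: the state names come from the PR summary, but the transition logic and function names are assumptions, not the actual `scan.tsx` implementation.

```typescript
// States named in the PR summary; transitions are an assumed sketch.
type ScannerState =
  | "idle"
  | "loading-model"
  | "camera-active"
  | "scanning"
  | "results";

// Allowed transitions, including "Scan Again" (results -> camera-active).
const transitions: Record<ScannerState, ScannerState[]> = {
  idle: ["loading-model", "camera-active"],
  "loading-model": ["idle", "camera-active"],
  "camera-active": ["scanning"],
  scanning: ["results"],
  results: ["camera-active"], // "Scan Again" button
};

function canTransition(from: ScannerState, to: ScannerState): boolean {
  return transitions[from].includes(to);
}
```

Encoding the flow this way keeps each render branch tied to exactly one state, which matters later in the PR when the `<video>` element has to survive the camera-active → scanning transition.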

New files

  • lib/vlm-scanner.ts — VLM model loading/inference with dynamic imports and WebGPU support check
  • lib/book-matcher.ts — Jaccard-similarity fuzzy matching of extracted titles against want-to-read list
  • src/app/books/scan.tsx — Scan route with camera lifecycle, hydration guard, and match display
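
The diff for `lib/book-matcher.ts` isn't shown here, but token-level Jaccard matching of the kind the description names can be sketched as below. The helper names (`tokenize`, `jaccard`, `bestMatch`) and the 0.5 threshold are assumptions for illustration.

```typescript
// Hypothetical sketch of Jaccard-similarity title matching.
function tokenize(title: string): Set<string> {
  return new Set(
    title
      .toLowerCase()
      .replace(/[^a-z0-9\s]/g, "") // drop punctuation before splitting
      .split(/\s+/)
      .filter(Boolean)
  );
}

// Jaccard similarity: |intersection| / |union| of the word sets.
function jaccard(a: string, b: string): number {
  const sa = tokenize(a);
  const sb = tokenize(b);
  if (sa.size === 0 && sb.size === 0) return 1;
  let inter = 0;
  for (const t of sa) if (sb.has(t)) inter++;
  return inter / (sa.size + sb.size - inter);
}

// Pick the best-scoring want-to-read title above a cutoff (assumed 0.5).
function bestMatch(extracted: string, wantToRead: string[], threshold = 0.5) {
  let best: { title: string; score: number } | null = null;
  for (const title of wantToRead) {
    const score = jaccard(extracted, title);
    if (score >= threshold && (!best || score > best.score)) {
      best = { title, score };
    }
  }
  return best;
}
```

Set-based matching tolerates reordered or dropped words, which suits spine text where OCR may miss a word entirely.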

Modified files

  • src/app/books.tsx — 5th "Scan" tab
  • public/sw.js — Preserve HuggingFace transformers caches on SW update
  • package.json — adds @huggingface/transformers, @webgpu/types
  • tsconfig.json, eslint.config.js — WebGPU types and browser globals
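
The `sw.js` change above (preserving model caches across service-worker updates) reduces to a name-filtering decision at `activate` time. A minimal sketch, where the app cache name and the `transformers` cache-name prefix are both assumptions about this project's setup:

```typescript
// Hypothetical current app-shell cache name.
const APP_CACHE = "app-v2";

// Decide which Cache Storage entries to delete on SW activation,
// keeping the current app cache and any model caches alive.
function cachesToDelete(allCacheNames: string[]): string[] {
  return allCacheNames.filter(
    (name) => name !== APP_CACHE && !name.startsWith("transformers")
  );
}

// In the real sw.js this would run inside the "activate" handler:
// event.waitUntil(caches.keys().then((keys) =>
//   Promise.all(cachesToDelete(keys).map((k) => caches.delete(k)))));
```

Without this carve-out, the usual "delete every cache I didn't create" activation pattern would evict the multi-hundred-megabyte model on every deploy.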

Test plan

  • Navigate to /books/scan — idle state with "Enable Camera" and "Pre-load Model" buttons
  • Tap "Pre-load Model" — model downloads with progress feedback
  • Tap "Enable Camera" — rear camera preview appears
  • Point at bookshelf, tap scan button — frame freezes, text streams in
  • Results show matched want-to-read books (green) vs unmatched (gray)
  • "Scan Again" returns to camera preview
  • npm run build succeeds, npm run lint passes

🤖 Generated with Claude Code

Uses on-device VLM (FastVLM-0.5B via WebGPU) to read book spines from
camera and match them against the want-to-read list. Works offline after
initial model download.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

vercel bot commented Feb 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

  • Project: bechols-dotcom
  • Deployment: Ready
  • Actions: Preview, Comment
  • Updated (UTC): Feb 17, 2026 4:29am

onnxruntime-node (143MB) and @img/sharp (16MB) are transitive deps of
@huggingface/transformers that are only needed for Node.js inference,
not browser WebGPU. Post-build cleanup removes them from the Nitro
function output to stay under Vercel's 250MB limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
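
The post-build cleanup described above could look roughly like the following. The PR states only that onnxruntime-node and @img/sharp are pruned from the function output; the directory layout and function names here are assumptions.

```typescript
import { existsSync, rmSync } from "node:fs";
import { join } from "node:path";

// Node-only transitive deps of @huggingface/transformers that the
// browser-WebGPU path never loads (per the commit message).
const SERVER_ONLY_DEPS = ["onnxruntime-node", "@img/sharp"];

// Compute the paths to remove inside the built function directory.
function depPathsToPrune(functionDir: string): string[] {
  return SERVER_ONLY_DEPS.map((dep) => join(functionDir, "node_modules", dep));
}

// Delete them after the build so the bundle stays under Vercel's limit.
function pruneServerDeps(functionDir: string): void {
  for (const target of depPathsToPrune(functionDir)) {
    if (existsSync(target)) rmSync(target, { recursive: true, force: true });
  }
}
```

Run as a `postbuild` step, this drops ~159MB of packages that would otherwise count against the 250MB serverless-function limit.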
- Use ref callback to attach media stream when video element mounts,
  fixing AbortError when camera starts after model download
- Add per-file download progress (filename + percentage) to model
  loading UI via HuggingFace progress_callback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FastVLM-0.5B crashed browser tabs (OOM at ~605MB). Switch to tesseract.js — a lightweight WASM-based OCR engine (~6MB total) that runs in any browser. The serverless function drops from 250MB+ to 13.5MB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Init Tesseract worker on page mount instead of during scan
- Split scanBookshelf into captureFrame + recognizeFrame so frame is
  captured before React re-render swaps the video element
- Share single <video> element across camera-active and scanning states
  to prevent unmount/remount losing video dimensions
- Add no-force-push constraint to CLAUDE.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tesseract struggles with rotated/angled book spine text. Switch to
PP-OCRv4 via @gutenye/ocr-browser which uses a DB text detection model
that handles text at any angle. Returns structured results with
confidence scores instead of raw text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite bookshelf scanner from single-shot to continuous scanning that
accumulates de-duped results. Add character bigram similarity for better
OCR typo tolerance. Fix iOS Chrome hang by disabling WASM multi-threading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
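
The "character bigram similarity" mentioned above is presumably a Dice-style coefficient over two-character substrings; a single OCR-garbled character then only breaks the bigrams touching it instead of a whole word. A sketch under that assumption (names hypothetical):

```typescript
// Count character bigrams in a normalized string.
function bigrams(s: string): Map<string, number> {
  const out = new Map<string, number>();
  const t = s.toLowerCase().replace(/\s+/g, " ").trim();
  for (let i = 0; i < t.length - 1; i++) {
    const bg = t.slice(i, i + 2);
    out.set(bg, (out.get(bg) ?? 0) + 1);
  }
  return out;
}

// Dice coefficient: 2 * |shared bigrams| / (|bigrams A| + |bigrams B|).
function bigramSimilarity(a: string, b: string): number {
  const ba = bigrams(a);
  const bb = bigrams(b);
  let sizeA = 0;
  let sizeB = 0;
  for (const n of ba.values()) sizeA += n;
  for (const n of bb.values()) sizeB += n;
  if (sizeA + sizeB === 0) return a === b ? 1 : 0;
  let overlap = 0;
  for (const [bg, n] of ba) overlap += Math.min(n, bb.get(bg) ?? 0);
  return (2 * overlap) / (sizeA + sizeB);
}
```

For example, `"hobbit"` vs the OCR misread `"h0bbit"` still scores 0.6, where word-level Jaccard would score 0.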
The previous numThreads fix was in onnxOptions (session-level), but
onnxruntime-web decides whether to spawn a web worker at import time.
On iOS, the blob-URL worker can't fetch same-origin WASM files (CORS).
Setting env.wasm.numThreads=1 globally prevents the worker entirely.

Also cache .wasm files in the service worker.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
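
The global fix described above amounts to a one-line configuration applied before any inference session exists. `env.wasm.numThreads` is a real onnxruntime-web setting; treating this as a config fragment, the surrounding module layout is an assumption:

```typescript
// Must run before the first session is created — onnxruntime-web
// decides at import/init time whether to spawn its blob-URL worker,
// so session-level onnxOptions are too late (per the commit message).
import { env } from "onnxruntime-web";

env.wasm.numThreads = 1; // single-threaded WASM: no worker, no iOS hang
```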
Wrap globalThis.fetch during Ocr.create() to intercept model downloads
and pipe them through a progress-tracking ReadableStream. Data flows
through once — no extra memory copies. UI shows per-model download
progress (e.g. "Downloading recognition model... 5.2/10.0 MB (52%)").

Also cache .wasm and .onnx files in the service worker for faster
subsequent loads on production.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
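
The fetch-wrapping trick above can be sketched with a pass-through `TransformStream` that counts bytes without buffering them. The helper name `trackProgress` is hypothetical; in the real change, `globalThis.fetch` would be temporarily swapped for a version that applies this wrapper during `Ocr.create()`.

```typescript
type ProgressFn = (loadedBytes: number, totalBytes: number) => void;

// Wrap a Response so its body reports progress as it streams through.
// Data flows through once; chunks are re-enqueued, never copied or held.
function trackProgress(response: Response, onProgress: ProgressFn): Response {
  const total = Number(response.headers.get("Content-Length") ?? 0);
  let loaded = 0;
  const counter = new TransformStream<Uint8Array, Uint8Array>({
    transform(chunk, controller) {
      loaded += chunk.byteLength;
      onProgress(loaded, total);
      controller.enqueue(chunk); // pass the chunk through unchanged
    },
  });
  const body = response.body ? response.body.pipeThrough(counter) : null;
  return new Response(body, {
    status: response.status,
    headers: response.headers,
  });
}
```

The consumer (the OCR library) reads the wrapped body exactly as before; only the UI's progress callback observes the byte counts.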
- Camera: 1280x720 instead of 1920x1080 (less video memory)
- Frame capture: 640px max instead of 1280px (plenty for OCR text)
- Blob URL instead of base64 data URL (eliminates 33% encoding overhead)
- JPEG at 0.8 quality instead of PNG (much smaller frame blobs)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Disable WASM memory arena and allocation pattern caching to lower peak
memory during OCR inference. Drop capture resolution to 480px and camera
to 640x480. Replace full-width video with compact thumbnail so results
are visible while scanning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>