feat(ingest): paste handler + Whisper env-config + paste detection#7

Merged
gorajing merged 1 commit into main from feat/ingest-paste-audio on May 28, 2026

@gorajing
Owner

Summary

Three long-pending ingest improvements (each useful, all sitting uncommitted) shipped together as one coherent unit:

  • Paste handler: `lib/ingest/paste.ts` (+ test) lets `npm run ingest <file.txt>` work on local text/markdown. Mirrors the `pdf.ts` pattern (copy raw, derive title from first H1 or filename, write a paste source).
  • Paste detection: `lib/ingest/detect.ts` (+ test) routes local text/markdown to the paste handler instead of erroring.
  • Whisper env-config: `lib/ingest/audio.ts` becomes configurable (`ZUHN_WHISPER_MODEL`, `ZUHN_WHISPER_TASK`, `ZUHN_WHISPER_LANGUAGE`) and auto-detects language by default. Fixes a real bug: the previous hardcoded `--model base --language en` confabulated on non-English audio (forcing English on the weakest model produced fluent hallucination on Korean audio).
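
For reviewers who want the shape of the Whisper change without opening the diff, here is a minimal sketch of how the three environment variables could map onto CLI arguments. It assumes `audio.ts` shells out to the openai-whisper command line (suggested by the old `--model base --language en` flags). `buildWhisperArgs` and the `Env` type are hypothetical names for illustration, not the code in this PR.

```typescript
// Hypothetical helper: illustrates the env-to-flag mapping only.
// The real lib/ingest/audio.ts may be structured differently.
type Env = Record<string, string | undefined>;

function buildWhisperArgs(audioPath: string, env: Env = process.env): string[] {
  const args: string[] = [audioPath];

  const model = env.ZUHN_WHISPER_MODEL;
  const task = env.ZUHN_WHISPER_TASK; // "transcribe" or "translate"
  const language = env.ZUHN_WHISPER_LANGUAGE;

  // Each flag is added only when its variable is set (assumption: unset
  // variables fall through to whatever default the caller or CLI provides).
  if (model) args.push("--model", model);
  if (task) args.push("--task", task);

  // Pin a language only on request. Omitting --language lets Whisper
  // auto-detect, which is the behaviour this PR makes the default.
  if (language) args.push("--language", language);

  return args;
}
```

With no variables set the helper returns just the audio path, so no language is forced. Setting `ZUHN_WHISPER_TASK=translate` and a larger `ZUHN_WHISPER_MODEL` would correspond to the Korean-audio verification run listed in the test plan.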

Note on overlap with PR #6

These 4 files also carry the KB_ROOT-import swap. PR #6 (the 39-file mechanical refactor) deliberately excluded them because they have substantive non-refactor changes that belong in their own PR. Independent diffs, no conflict.

Test plan

  • paste.test.ts: 4 tests
  • detect.test.ts: updated for paste detection
  • Audio fix verified by Whisper translate run on real Korean audio (Jongmin transcript, original session)
  • CI
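
As a rough illustration of what the updated detection tests exercise, the sketch below routes an input to a source type by extension. The `SourceType` union, the extension set, and `detectType` are assumptions made for this example. The actual `detect.ts` may recognise more types and use different rules.

```typescript
import * as path from "node:path";

// Hypothetical type names; only "paste" is confirmed by this PR.
type SourceType = "url" | "pdf" | "paste" | "unknown";

// Assumed set of extensions treated as local text/markdown.
const PASTE_EXTENSIONS = new Set([".txt", ".md", ".markdown"]);

function detectType(input: string): SourceType {
  // Remote inputs are checked first so "https://…/notes.md" is not a paste.
  if (/^https?:\/\//i.test(input)) return "url";

  const ext = path.extname(input).toLowerCase();
  if (ext === ".pdf") return "pdf";
  if (PASTE_EXTENSIONS.has(ext)) return "paste";

  // Before this PR, local text files would have ended up here and errored.
  return "unknown";
}
```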

🤖 Generated with Claude Code

Three coherent, long-pending ingest improvements that grew from the original
session work, plus the matching tests.

- lib/ingest/paste.ts (+ paste.test.ts) — ingest a local plain-text/markdown
  file as a "paste" source. Mirrors the pdf.ts pattern: copy the raw, derive
  the title from the first H1 or filename, write a "paste" source file via
  generateSourceId/slugify. No `url` field (paste sources have no scrapeable
  origin; gray-matter can't serialize undefined).
- lib/ingest/detect.ts (+ detect.test.ts) — detect local text/markdown files
  as type "paste" so `npm run ingest <file.txt>` routes through the new
  paste handler instead of erroring out.
- ingest.ts — wire the paste handler into the dispatch.
- lib/ingest/audio.ts — Whisper invocation is now ENV-CONFIGURABLE
  (ZUHN_WHISPER_MODEL, ZUHN_WHISPER_TASK, ZUHN_WHISPER_LANGUAGE) and
  auto-detects language by default. Fixes a real bug: the previous hardcoded
  `--model base --language en` confabulated on non-English audio (forcing
  English on the weakest model produced fluent hallucination + repetition
  loops on Korean audio).
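
The title rule and the "no `url` field" note from the paste bullet above can be sketched as follows. `deriveTitle`, `buildPasteFrontmatter`, and the frontmatter field names are hypothetical and chosen for illustration. The real handler goes through generateSourceId/slugify and gray-matter, which are left out here.

```typescript
import * as path from "node:path";

// Hypothetical helper: first ATX-style H1 ("# Title"), else the filename.
function deriveTitle(content: string, filePath: string): string {
  // "^#" followed by whitespace matches "# Title" but not "## Subheading".
  const match = content.match(/^#[ \t]+(.+?)[ \t]*$/m);
  if (match) return match[1];
  // Fall back to the filename without its extension.
  return path.basename(filePath, path.extname(filePath));
}

// Hypothetical frontmatter shape; field names are assumptions.
function buildPasteFrontmatter(content: string, filePath: string): Record<string, string> {
  // The `url` key is omitted entirely rather than set to undefined,
  // because an undefined value cannot be serialized to YAML.
  return {
    title: deriveTitle(content, filePath),
    type: "paste",
  };
}
```

Leaving the key out, instead of writing `url: undefined`, keeps the object safe to hand to a YAML serializer.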

Carries the KB_ROOT-import refactor for these 4 files; the same refactor
across the OTHER 39 scripts is in PR #6 — splitting because these 4 also
have substantive non-refactor changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gorajing gorajing merged commit 2f334be into main May 28, 2026
1 check passed
@gorajing gorajing deleted the feat/ingest-paste-audio branch May 28, 2026 06:38