Add batch transcription mode with sidecar resume and concurrency #73

Merged: alexkroman merged 1 commit into main from claude/gracious-newton-rydd05 on Jun 11, 2026

@alexkroman (Collaborator)

Summary

Adds batch-mode transcription to assembly transcribe, enabling concurrent processing of multiple sources (directories, globs, or stdin lists) with automatic resume via per-source .aai.json sidecars. A re-run skips sources that already have a completed sidecar, so retrying after a partial failure only re-processes the sources that did not finish.

Key Changes

  • New transcribe_batch module (aai_cli/transcribe_batch.py): Core batch orchestration

    • Source expansion: directories (recursive, audio-only), glob patterns, stdin lists with deduplication
    • Sidecar resume: SHA-256 hash validation for local files; URL-based slug naming for remote sources
    • Concurrent execution via ThreadPoolExecutor with configurable worker count (default 4)
    • Per-source error handling: failures don't abort the batch; NotAuthenticated aborts with cancel_futures=True
    • Live progress table (human mode) and NDJSON output (JSON mode)
  • Updated transcribe command (aai_cli/commands/transcribe.py):

    • New flags: --from-stdin, --concurrency, --force
    • Batch mode triggers on directory/glob source or --from-stdin; single-source flags (--out, -o, --llm, --show-code) are rejected in batch mode
    • Updated help text and examples to document batch workflows
  • New batch option helpers (aai_cli/options.py):

    • batch_from_stdin_option(), batch_concurrency_option(), batch_force_option()
    • Grouped under new OPT_BATCH help panel
  • Comprehensive test coverage (tests/test_transcribe_batch.py, tests/test_transcribe_batch_sources.py):

    • Sidecar resume logic: hash matching, file changes, corruption, non-completed states
    • URL sidecars: slug generation, hash-based naming, no source_sha256 field
    • Concurrency: default 4, configurable, verified with threading barriers
    • Failure modes: partial failures exit 1 with retry guidance, auth failures abort with cancel_futures=True
    • Output modes: human table + summary, JSON NDJSON, quiet suppression
    • Source expansion: glob/directory/stdin edge cases, deduplication, sidecar skipping
  • Documentation updates:

    • README: batch mode mentioned in transcribe command description
    • Help text: batch mode explanation, new examples (folder, glob)
    • Snapshot tests updated for new help output
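
The source expansion described under Key Changes could look roughly like the sketch below. It is not the code from aai_cli/transcribe_batch.py: the function name expand_sources, the AUDIO_EXTS set, and the rule for detecting a glob are all assumptions made for illustration. It only shows the general shape of expanding directories, globs, and stdin lines into one deduplicated list that never includes sidecars.

```python
import glob
from pathlib import Path

# Assumed extension list; the PR only says directory expansion is "audio-only".
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}
SIDECAR_SUFFIX = ".aai.json"


def expand_sources(args, stdin_lines=()):
    """Expand directories, globs, URLs, and stdin lines into an ordered, deduplicated list."""
    seen = set()
    out = []

    def add(item):
        # Sidecars are resume markers, never transcription sources.
        if item.endswith(SIDECAR_SUFFIX) or item in seen:
            return
        seen.add(item)
        out.append(item)

    # Command-line arguments come first, then non-empty stdin lines.
    items = list(args) + [line.strip() for line in stdin_lines if line.strip()]
    for arg in items:
        if arg.startswith(("http://", "https://")):
            add(arg)  # remote sources pass through untouched
            continue
        path = Path(arg)
        if path.is_dir():
            # Recursive walk, keeping only files with an audio extension.
            for child in sorted(path.rglob("*")):
                if child.is_file() and child.suffix.lower() in AUDIO_EXTS:
                    add(str(child))
        elif any(ch in arg for ch in "*?["):
            # Hypothetical glob test: treat any wildcard character as a pattern.
            for match in sorted(glob.glob(arg, recursive=True)):
                if Path(match).is_file():
                    add(str(Path(match)))
        else:
            add(str(path))
    return out
```

Keeping insertion order while deduplicating means a file reached through both a directory and a glob is transcribed once, in the position where it was first seen.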

Implementation Details

  • Sidecar format: Two-space indented JSON with trailing newline; includes source, id, status, transcript (full payload), and source_sha256 (local files only)
  • Resume logic: Skips sources with completed sidecars matching the current file hash; --force re-transcribes everything
  • URL sidecars: Named <slug>-<8-char-hash>.aai.json in working directory (no local file to hash); slug truncated to 64 chars
  • Error handling: Per-source failures recorded with error message; batch continues; exit code 1 on any failure; exit code 4 on auth rejection
  • Concurrency: ThreadPoolExecutor with as_completed for live progress; per-source failures are caught and recorded, while an exception that aborts the batch (NotAuthenticated) triggers pool.shutdown(cancel_futures=True) to drop queued work
  • Output: Live table updates at 4 Hz (human mode); NDJSON per source on completion (JSON mode); final summary suppressed in JSON or --quiet modes
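
As an illustration of the sidecar format and resume logic listed above, here is a minimal sketch for local files. The helper names (sha256_of, sidecar_for, should_skip, write_sidecar) are hypothetical and not taken from the PR, and treating the transcript as a plain dict is an assumption. The record fields and the two-space indented JSON with trailing newline follow the description.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB chunks so large audio files are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def sidecar_for(source):
    # Assumed naming for local files: <name>.aai.json next to the source.
    return source.with_name(source.stem + ".aai.json")


def should_skip(source, force=False):
    """Return True only when a completed sidecar matches the file's current hash."""
    if force:
        return False  # --force re-transcribes everything
    try:
        record = json.loads(sidecar_for(source).read_text())
    except (OSError, ValueError):
        return False  # missing or corrupt sidecar: transcribe again
    if not isinstance(record, dict):
        return False
    return (
        record.get("status") == "completed"
        and record.get("source_sha256") == sha256_of(source)
    )


def write_sidecar(source, transcript):
    """Write the sidecar as two-space indented JSON with a trailing newline."""
    record = {
        "source": str(source),
        "id": transcript.get("id"),
        "status": transcript.get("status"),
        "transcript": transcript,  # full payload
        "source_sha256": sha256_of(source),  # local files only
    }
    sidecar_for(source).write_text(json.dumps(record, indent=2, default=str) + "\n")
```

Because the hash is recomputed on every run, editing or replacing an audio file invalidates its sidecar without any extra bookkeeping.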

https://claude.ai/code/session_01P6wNaLLhB3uDLj8ojAHo38

@alexkroman alexkroman enabled auto-merge (squash) June 11, 2026 22:12
    }
    if digest is not None:
        record["source_sha256"] = digest
    sidecar.write_text(json.dumps(record, indent=2, default=str) + "\n")

Potential file inclusion attack via reading file - medium severity
If an attacker can control the input leading into the open function, they might be able to read sensitive files and launch further attacks with that information.

Remediation: Ignore this issue only after you've verified or sanitized the input going into this function.

Reply @AikidoSec ignore: [REASON] to ignore this issue.

Collaborator Author

@AikidoSec ignore: false positive — this is a local CLI reading audio files and .aai.json sidecars at paths the invoking user supplied themselves (command-line argument, glob, or their own stdin list). The process runs with the user's own privileges, so there is no privilege boundary to cross; reading the file the user named is the command's purpose, the same trust model as the existing assembly transcribe <file> / --config-file paths. URL-derived sidecar names are slug-sanitized, so no path separators or traversal segments can reach the filesystem from remote input.


Generated by Claude Code
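
To make the slug-sanitization claim in the reply concrete, here is a hedged sketch of how a URL could be turned into a <slug>-<8-char-hash>.aai.json name. The function name url_sidecar_name, the exact character whitelist, and hashing the URL string with SHA-256 are assumptions; the PR only states the name pattern and the 64-character slug limit.

```python
import hashlib
import re


def url_sidecar_name(url, max_slug=64):
    """Build a filesystem-safe sidecar name from a URL (hypothetical helper)."""
    # Drop the scheme, then collapse every run of non-alphanumerics into one
    # hyphen, so "/", "\\" and ".." can never survive into the file name.
    without_scheme = url.split("://", 1)[-1]
    slug = re.sub(r"[^A-Za-z0-9]+", "-", without_scheme).strip("-").lower()
    slug = slug[:max_slug].rstrip("-") or "source"
    # Assumption: the 8-character hash is a prefix of the URL's SHA-256 digest,
    # which keeps names distinct when two long URLs share a truncated slug.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return f"{slug}-{digest}.aai.json"
```

With this shape, a URL such as https://example.com/../../etc/passwd yields a name beginning with example-com-etc-passwd-, containing no path separators or traversal segments.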

✅ Based on your feedback, we ignored this issue as a false positive, for the reason given in the reply above.

assembly transcribe now accepts a directory, a glob pattern, or --from-stdin
(one path/URL per line) and transcribes the sources concurrently
(--concurrency, default 4) behind a live progress table. Each source gets a
<name>.aai.json sidecar with the full transcript; the sidecar doubles as the
resume marker (hash-checked for local files), so a re-run skips finished work
and --force re-transcribes. Under --json, batch mode emits one NDJSON record
per source; exit code 1 when any source failed.

https://claude.ai/code/session_01P6wNaLLhB3uDLj8ojAHo38
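
The concurrency, NDJSON, and exit-code behavior summarized in the commit message can be sketched as follows. This is an assumption-laden outline rather than the merged implementation: run_batch, the transcribe_one callback, the record fields, and the local NotAuthenticated stand-in are hypothetical. The default of 4 workers, exit code 1 on any failure, exit code 4 on auth rejection, and the use of cancel_futures=True come from the PR description.

```python
import json
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


class NotAuthenticated(Exception):
    """Stand-in for the CLI's auth error; the real class lives in the project."""


def run_batch(sources, transcribe_one, concurrency=4, out=sys.stdout):
    """Transcribe sources concurrently, emit one NDJSON record each, return an exit code."""
    failed = 0
    pool = ThreadPoolExecutor(max_workers=concurrency)
    try:
        futures = {pool.submit(transcribe_one, s): s for s in sources}
        for fut in as_completed(futures):
            source = futures[fut]
            try:
                result = fut.result()
                record = {"source": source, "status": "completed", "id": result.get("id")}
            except NotAuthenticated:
                # Auth failure affects every source: drop queued work and abort.
                pool.shutdown(wait=False, cancel_futures=True)
                return 4
            except Exception as exc:
                # Any other failure is recorded and the batch keeps going.
                failed += 1
                record = {"source": source, "status": "error", "error": str(exc)}
            out.write(json.dumps(record) + "\n")  # one NDJSON line per source
    finally:
        pool.shutdown(wait=True)  # let in-flight workers finish before returning
    return 1 if failed else 0
```

Records are written in completion order, not input order, which is what lets a consumer of the NDJSON stream react to each source as soon as it finishes.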
@alexkroman alexkroman force-pushed the claude/gracious-newton-rydd05 branch from bbd7f19 to 9d98576 June 11, 2026 22:14
@alexkroman alexkroman merged commit 2243814 into main Jun 11, 2026
9 checks passed
@alexkroman alexkroman deleted the claude/gracious-newton-rydd05 branch June 11, 2026 22:17