Add batch transcription mode with sidecar resume and concurrency #73
    }
    if digest is not None:
        record["source_sha256"] = digest
    sidecar.write_text(json.dumps(record, indent=2, default=str) + "\n")
Potential file inclusion attack via reading file - medium severity
If an attacker can control the input leading into the open function, they might be able to read sensitive files and launch further attacks with that information.
Remediation: Ignore this issue only after you've verified or sanitized the input going into this function.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
@AikidoSec ignore: false positive — this is a local CLI reading audio files and .aai.json sidecars at paths the invoking user supplied themselves (command-line argument, glob, or their own stdin list). The process runs with the user's own privileges, so there is no privilege boundary to cross; reading the file the user named is the command's purpose, the same trust model as the existing assembly transcribe <file> / --config-file paths. URL-derived sidecar names are slug-sanitized, so no path separators or traversal segments can reach the filesystem from remote input.
Generated by Claude Code
✅ Based on your feedback, we ignored this issue because of the following reason:
false positive: this is a local CLI reading audio files and .aai.json sidecars at paths the invoking user supplied themselves (command-line argument, glob, or their own stdin list). The process runs with the user's own privileges, so there is no privilege boundary to cross; reading the file the user named is the command's purpose, the same trust model as the existing assembly transcribe <file> / --config-file paths. URL-derived sidecar names are slug-sanitized, so no path separators or traversal segments can reach the filesystem from remote input.
assembly transcribe now accepts a directory, a glob pattern, or --from-stdin (one path/URL per line) and transcribes the sources concurrently (--concurrency, default 4) behind a live progress table. Each source gets a <name>.aai.json sidecar with the full transcript; the sidecar doubles as the resume marker (hash-checked for local files), so a re-run skips finished work and --force re-transcribes. Under --json, batch mode emits one NDJSON record per source; exit code 1 when any source failed. https://claude.ai/code/session_01P6wNaLLhB3uDLj8ojAHo38
Summary
Adds batch-mode transcription to assembly transcribe, enabling concurrent processing of multiple sources (directories, globs, or stdin lists) with automatic resume via per-source .aai.json sidecars. A re-run skips sources already transcribed, making it efficient to retry partial failures.

Key Changes
- New transcribe_batch module (aai_cli/transcribe_batch.py): core batch orchestration
  - ThreadPoolExecutor with configurable worker count (default 4)
  - NotAuthenticated aborts with cancel_futures=True
- Updated transcribe command (aai_cli/commands/transcribe.py):
  - --from-stdin, --concurrency, --force
  - --from-stdin; single-source flags (--out, -o, --llm, --show-code) are rejected in batch mode
- New batch option helpers (aai_cli/options.py):
  - batch_from_stdin_option(), batch_concurrency_option(), batch_force_option()
  - OPT_BATCH help panel
- Comprehensive test coverage (tests/test_transcribe_batch.py, tests/test_transcribe_batch_sources.py):
  - cancel_futures=True
- Documentation updates
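The pool behaviour listed for the transcribe_batch module could look roughly like the sketch below. It is an illustration, not the code in this PR: run_batch, transcribe_one, and the local NotAuthenticated class are hypothetical stand-ins, and recording ordinary per-source errors instead of aborting on them is an assumption. It relies on the cancel_futures argument of ThreadPoolExecutor.shutdown, which requires Python 3.9 or later.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class NotAuthenticated(Exception):
    """Hypothetical stand-in for the CLI's authentication error."""


def run_batch(sources, transcribe_one, concurrency=4):
    """Run transcribe_one over sources; abort the whole batch on NotAuthenticated."""
    results = {}
    pool = ThreadPoolExecutor(max_workers=concurrency)
    try:
        # Map each future back to its source so results can be labelled.
        futures = {pool.submit(transcribe_one, s): s for s in sources}
        for fut in as_completed(futures):
            source = futures[fut]
            try:
                results[source] = ("ok", fut.result())
            except NotAuthenticated:
                # Systemic failure: every other source would fail the same way,
                # so drop the queued work and propagate the error.
                pool.shutdown(wait=False, cancel_futures=True)
                raise
            except Exception as exc:
                # Assumption: a failure of one source is recorded, not fatal.
                results[source] = ("error", str(exc))
    finally:
        # Wait for tasks that were already running; a repeated shutdown is harmless.
        pool.shutdown(wait=True)
    return results
```

Tasks that are already running when the abort happens still finish, because cancel_futures only cancels work that has not started.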
Implementation Details
- source, id, status, transcript (full payload), and source_sha256 (local files only)
- --force re-transcribes everything
- <slug>-<8-char-hash>.aai.json in working directory (no local file to hash); slug truncated to 64 chars
- as_completed for live progress; first exception triggers pool.shutdown(cancel_futures=True) to drop queued work
- --quiet modes

https://claude.ai/code/session_01P6wNaLLhB3uDLj8ojAHo38
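The sidecar naming and resume check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the function names, the slug character set, the use of SHA-256 for the 8-character URL tag, and the "completed" status value are all hypothetical. Only the source_sha256 and status field names, the 64-character slug limit, and the <slug>-<8-char-hash>.aai.json shape come from the description.

```python
import hashlib
import json
import re
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large audio files are not loaded whole."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def url_sidecar(url: str, workdir: Path) -> Path:
    """Build <slug>-<8-char-hash>.aai.json in workdir for a remote source."""
    # Assumed slug rule: anything outside [A-Za-z0-9_-] becomes "-", so no
    # path separators or dots from the URL survive into the file name.
    slug = re.sub(r"[^A-Za-z0-9_-]+", "-", url).strip("-")[:64]
    tag = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return workdir / f"{slug}-{tag}.aai.json"


def is_done(audio: Path, sidecar: Path, force: bool = False) -> bool:
    """Return True when the sidecar shows this exact file was already transcribed."""
    if force or not sidecar.exists():
        return False
    try:
        record = json.loads(sidecar.read_text())
    except (OSError, json.JSONDecodeError):
        return False  # unreadable or corrupt sidecar: transcribe again
    if not isinstance(record, dict):
        return False
    # "completed" is an assumed status value; the hash check detects edits
    # made to the audio file after the sidecar was written.
    return (
        record.get("status") == "completed"
        and record.get("source_sha256") == sha256_of(audio)
    )
```

Comparing the stored hash against the current file means a changed recording is re-transcribed even though its sidecar exists, while force bypasses the check entirely.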