Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
129 changes: 89 additions & 40 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,12 +4,33 @@
[![License](https://img.shields.io/badge/license-MIT-D6402E)](https://github.com/AssemblyAI/cli/blob/main/LICENSE)
[![Docs](https://img.shields.io/badge/docs-assemblyai-D6402E)](https://www.assemblyai.com/docs)

The AssemblyAI CLI (`assembly`) brings speech AI to your terminal: transcribe files, URLs, and YouTube/podcast pages, stream live audio, talk to a two-way voice agent, prompt the LLM Gateway, benchmark speech models, and scaffold ready-to-deploy starter apps.
The AssemblyAI CLI (`assembly`) brings speech AI directly into your terminal: transcribe files, URLs, and YouTube/podcast pages, stream live audio, talk to a two-way voice agent, prompt the LLM Gateway, benchmark speech models, and scaffold ready-to-deploy starter apps.

<p align="center">
<img src="assets/welcome.png" alt="The assembly CLI welcome screen, listing command groups for transcription, streaming, voice agents, app scaffolding, and account management" width="820">
</p>

Learn more about the platform in the [AssemblyAI docs](https://www.assemblyai.com/docs).

## ⚡ Quickstart

Install on macOS or Linux with Homebrew:

```sh
brew tap assemblyai/cli https://github.com/AssemblyAI/cli
brew trust assemblyai/cli # only needed when HOMEBREW_REQUIRE_TAP_TRUST is set; harmless otherwise
brew install assembly
```

Sign in (stores your API key in the OS keyring) and run your first transcription:

```sh
assembly login
assembly transcribe --sample
```

That's it. Run `assembly onboard` for a guided tour, or see [Installation](#-installation) for pipx/uv and other options.

## 🚀 Why the AssemblyAI CLI?

- **🎯 One command for everything**: transcription, real-time streaming, voice agents, LLM prompts, and WER benchmarking — no SDK boilerplate.
Expand All @@ -19,9 +40,35 @@ The AssemblyAI CLI (`assembly`) brings speech AI to your terminal: transcribe fi
- **🤖 Agent-ready**: `assembly setup install` wires your coding agent up with the AssemblyAI docs MCP server and skills.
- **📖 Open source**: MIT licensed.

## 📋 Features at a glance

| Command | What it does |
| :--- | :--- |
| `assembly transcribe` | Transcribe files, URLs, YouTube/podcast pages, directories, globs, or bucket storage (`s3://`, `gs://`, `az://`) — with speaker labels, PII redaction, summarization, SRT/VTT captions, and resumable batch runs |
| `assembly stream` | Real-time transcription from your microphone, a file, or a URL — on macOS it can capture system audio too |
| `assembly dictate` | Push-to-talk dictation: press Enter to record, Enter again for instant text (Sync STT API, up to 120 s per utterance) |
| `assembly agent` | Full-duplex spoken conversation with a voice agent, right in your terminal |
| `assembly speak` | Synthesize text to speech over the streaming-TTS WebSocket (sandbox-only) |
| `assembly llm` | Prompt the LLM Gateway over a transcript, stdin, or a live stream |
| `assembly clip` | Cut audio/video with ffmpeg by diarized speaker, text match, LLM pick, or time range (`--video` keeps the picture for URL sources) — clip boundaries snap into nearby silence |
| `assembly dub` | Re-voice an audio/video file or URL in another language: transcription, LLM translation, per-speaker TTS, ffmpeg track-swap (sandbox-only) |
| `assembly caption` | Burn always-visible captions into a video: transcribe (or reuse a transcript), fetch SRT, ffmpeg burns it in — audio untouched |
| `assembly eval` | Benchmark WER against Hugging Face datasets (built-in aliases: `librispeech`, `tedlium`, …) or local manifests |
| `assembly webhooks listen` | Open a public dev URL that prints webhook deliveries and can forward them to your local app |
| `assembly init` / `dev` / `share` / `deploy` | Scaffold a FastAPI + HTML starter app, run it locally, expose it on a public URL, ship it to Vercel / Railway / Fly.io |
| `assembly setup` | Wire a coding agent up with the AssemblyAI docs MCP server and skills |
| `assembly doctor` | Check your environment: API key, network, ffmpeg, microphone |
| `assembly transcripts` / `sessions` | Browse and fetch past transcripts and streaming sessions |
| `assembly keys` / `balance` / `usage` / `limits` / `audit` | Account self-service via browser login |

Add `--show-code` to `transcribe` / `stream` / `agent` to print the equivalent Python SDK script instead of running — the built-in path from CLI experiment to SDK code.

## ✨ Things you can do with it

A few one-liners that show what `assembly` can do. These are the fun ones; the everyday basics live under **Quick examples** below.
A few one-liners that show what `assembly` can do. The everyday basics live under [Getting started](#-getting-started) below.

> [!NOTE]
> `speak` and `dub` are sandbox-only today — that's why the examples below pass `--sandbox`.

**Recreate a scene with synthetic voices** — transcribe and diarize a YouTube clip, then pipe it straight into TTS with a different voice per speaker:

Expand All @@ -30,15 +77,15 @@ assembly transcribe "https://www.youtube.com/watch?v=awmCtXzFsJo" --speaker-labe
| assembly --sandbox speak --voice A=jane --voice B=mary --out scene.wav
```

`speak` auto-detects `Speaker A:` labels, merges each speaker's turns, and rotates voices. (`speak` is sandbox-only today, hence `--sandbox`.)
`speak` auto-detects `Speaker A:` labels, merges each speaker's turns, and rotates voices.

**Dub a video into another language** — the whole platform in one command: transcription with utterance timestamps, per-utterance LLM translation, TTS for each line (one voice per speaker), and ffmpeg laying the new track over the original video. A great demo is the first YouTube video ever, "Me at the zoo" — it's 19 seconds long, a single clear English speaker, and instantly recognizable, so the dub finishes fast and the before/after is obvious:

```sh
assembly --sandbox dub "https://www.youtube.com/watch?v=jNQXAC9IVRw" -l de --video
```

The video stream is copied untouched; each dubbed line lands at its original start time. (Sandbox-only, like `speak`.)
The video stream is copied untouched; each dubbed line lands at its original start time.

**Turn a podcast into audio** — Apple and Spotify podcast pages work too (yt-dlp ingestion):

Expand Down Expand Up @@ -114,7 +161,8 @@ assembly eval librispeech --speech-model universal-3-pro --limit 50

Requires Python 3.12+ (Homebrew brings its own; for pipx/uv see the `--python` hint below).

> ⚠️ The `assemblyai-cli` package on PyPI is **not** this project — install with one of the
> [!WARNING]
> The `assemblyai-cli` package on PyPI is **not** this project — install with one of the
> commands below, not `pip install assemblyai-cli`.

### Homebrew (recommended — macOS / Linux)
Expand All @@ -138,39 +186,55 @@ uv tool install "git+https://github.com/AssemblyAI/cli.git"
If your default interpreter is older than Python 3.12, add `--python python3.12` (pipx) or
`--python 3.12` (uv) to the install command.

<details>
<summary>System dependencies for the live-audio commands (pipx/uv installs only)</summary>

Only the live-audio commands need anything extra: `stream`, `dictate`, and `agent` use PortAudio for
microphone capture (Debian/Ubuntu: `sudo apt-get install libportaudio2`; Fedora:
`sudo dnf install portaudio`) and [`ffmpeg`](https://ffmpeg.org) on `PATH` to stream
non-WAV audio. Plain `transcribe` uploads your file directly and needs neither.
microphone capture and [`ffmpeg`](https://ffmpeg.org) on `PATH` to stream non-WAV audio.
Plain `transcribe` uploads your file directly and needs neither.

- Debian/Ubuntu: `sudo apt-get install libportaudio2 ffmpeg`
- Fedora: `sudo dnf install portaudio ffmpeg`
- macOS (Homebrew): `brew install portaudio ffmpeg`

</details>

## 🔐 Authentication

New to AssemblyAI? Create a free account at
[assemblyai.com/dashboard](https://www.assemblyai.com/dashboard) to get an API key.

The easiest path is browser login, which stores your API key in the OS keyring
(Keychain / Credential Manager / Secret Service):
### Option 1: Browser login (recommended)

**✨ Best for:** day-to-day use on your own machine.

Browser login stores your API key in the OS keyring (Keychain / Credential Manager / Secret Service) — nothing lands in a dotfile, and it unlocks the account commands (`keys`, `balance`, `usage`, `limits`, `sessions`, `audit`):

```sh
assembly login
```

In CI — or anywhere a browser isn't an option — set the key as an environment variable
instead. It's checked before the keyring, and nothing is written to disk:
### Option 2: API key environment variable

**✨ Best for:** CI, containers, and anywhere a browser isn't an option.

The environment variable is checked before the keyring, and nothing is written to disk:

```sh
export ASSEMBLYAI_API_KEY="YOUR_API_KEY"
```

## 🚀 Getting started

For a guided tour — sign in, run a first transcription, start building:
### Guided tour

Sign in, run a first transcription, start building:

```sh
assembly onboard
```

Or jump straight in:
### Basic usage

```sh
assembly transcribe --sample # transcribe the hosted sample file
Expand All @@ -181,23 +245,6 @@ assembly agent # talk to a voice agent (use headphones)
assembly init # scaffold a starter app
```

## 📋 Key features

- **Transcription**: `assembly transcribe` handles files, URLs, and YouTube/podcast pages, with flags for speaker labels, PII redaction, summarization, sentiment, chapters, and more.
- **Batch transcription**: point `assembly transcribe` at a directory or glob — local, or in bucket storage (`"s3://bucket/calls/*.mp3"`, `gs://`, `az://`, …, with the matching fsspec backend such as `s3fs` installed) — or pipe paths with `--from-stdin` to transcribe everything concurrently, with sidecar files that make re-runs resumable. Add `--llm "prompt"` to run an LLM prompt over each finished transcript, saved into the sidecars.
- **Real-time streaming**: `assembly stream` transcribes the microphone, a file, or a URL live — on macOS it can capture system audio too.
- **Dictation**: `assembly dictate` is push-to-talk for your terminal — press Enter to record, Enter again to get the utterance back instantly from the Sync API (up to 120 s per utterance).
- **Voice agent**: `assembly agent` runs a full-duplex spoken conversation in your terminal.
- **LLM Gateway**: `assembly llm` prompts an LLM over a transcript, stdin, or a live stream (`assembly stream --llm "summarize as I talk"`).
- **Transcript-driven clipping**: `assembly clip` cuts an audio/video file (or a YouTube/podcast URL — add `--video` to download the full video so the clips keep the picture) with ffmpeg by diarized speaker (`--speaker A`), text match (`--search "pricing"`), LLM pick (`--llm "the three best moments"`), or explicit time range (`--range 1:30-2:45`) — transcribing on the fly, reusing a finished transcript with `-t ID`, or reading one from a pipe (`assembly transcribe x.mp4 --speaker-labels --json | assembly clip x.mp4 -t - --llm "…"`). Clip boundaries snap into nearby silence (ffmpeg `silencedetect`) so cuts don't land mid-word; `--no-snap` cuts at the exact selected times.
- **Dubbing**: `assembly dub` re-voices an audio/video file or URL in another language (`assembly --sandbox dub talk.mp4 -l de`): diarized transcription, per-utterance LLM translation, streaming TTS per speaker, and an ffmpeg track-swap that leaves the video untouched. Sandbox-only today, like `speak`.
- **Captioning**: `assembly caption` burns always-visible captions into a video (`assembly caption talk.mp4`) — it transcribes the file (or reuses a transcript with `-t ID`), fetches the SRT export, and renders it into the picture with ffmpeg, leaving the audio untouched; `--chars-per-caption` and `--font-size` shape the captions.
- **Model evaluation**: `assembly eval` transcribes a Hugging Face dataset (with built-in aliases for common benchmarks: `assembly eval tedlium`) or a local `.csv`/`.jsonl` manifest and scores WER against its references — handy for picking a speech model.
- **Starter apps**: `assembly init` scaffolds a self-contained FastAPI + HTML app (`audio-transcription`, `live-captions`, `voice-agent`); `assembly dev` runs it, `assembly share` exposes it on a public URL, and `assembly deploy` ships it to Vercel, Railway, or Fly.io.
- **Webhook testing**: `assembly webhooks listen` opens a public dev URL (cloudflared quick tunnel) that prints webhook deliveries as they arrive and can forward them to your local app with `--forward-to`.
- **Code generation**: add `--show-code` to `transcribe`/`stream`/`agent` to print the equivalent Python SDK script instead of running.
- **Account self-service**: `assembly keys` / `balance` / `usage` / `limits` / `sessions` / `audit` via browser login.

### Quick examples

Pull exactly the output you need:
Expand All @@ -223,18 +270,18 @@ ffmpeg -i talk.mp4 -f wav - | assembly transcribe -
git log --oneline -30 | assembly llm "write release notes grouped by feature/fix"
```

Graduate to the SDK — print the equivalent Python script instead of running:

```sh
assembly transcribe --sample --speaker-labels --show-code
```

## 📚 Documentation

### In the terminal

- Run `assembly --help` or `assembly <command> --help` for flags and examples.
- Run `assembly doctor` to check your environment (API key, network, ffmpeg, microphone).
- [AssemblyAI docs](https://www.assemblyai.com/docs)
- [API reference](https://www.assemblyai.com/docs/api-reference)

### Resources

- [**AssemblyAI docs**](https://www.assemblyai.com/docs) — guides for every model and feature.
- [**API reference**](https://www.assemblyai.com/docs/api-reference) — the REST and streaming APIs the CLI drives.
- [**Dashboard**](https://www.assemblyai.com/dashboard) — manage your account and API keys.

## 🤝 Contributing

Expand All @@ -250,4 +297,6 @@ See [AGENTS.md](AGENTS.md) for development conventions and architecture notes.

## 📄 Legal

Released under the [MIT license](LICENSE).
- **License**: released under the [MIT license](LICENSE).
- **Privacy**: [AssemblyAI privacy policy](https://www.assemblyai.com/legal/privacy-policy) — the CLI's anonymous usage telemetry is opt-out (`assembly telemetry disable`, `AAI_TELEMETRY_DISABLED=1`, or `DO_NOT_TRACK=1`).
- **Terms**: [AssemblyAI terms of service](https://www.assemblyai.com/legal/terms-of-service).
Loading