Audio Language Interface is a contract-first natural-language audio editing platform for LLMs.
The project is built around one core idea: a language model should be able to inspect audio, reason about requested changes, build an explicit edit plan, execute deterministic transforms, and verify whether the result moved toward the request.
This repo is not:
- a DAW clone
- a beatmaker
- a music-generation system
- a pile of hidden heuristics behind one opaque prompt
It is an audio-editing runtime and planning stack for model-driven workflows.
License: MIT. See LICENSE.
Most LLM audio workflows collapse too many responsibilities into one step:
- vague language interpretation
- capability guessing
- audio execution
- result evaluation
That makes them brittle and hard to debug.
This repository splits those concerns cleanly so a model can work against explicit artifacts instead of hidden behavior.
The key architectural rule is:
- the audio runtime owns deterministic execution
- the intent layer owns semantics and planning
- the capability manifest is the contract between them
- optional provider-backed request interpretation stays above the deterministic core
- tools and orchestration are adapters, not the core system
Today, the repository supports a real single-file editing loop:
- import common local audio files into workspace storage, normalizing the internal version to WAV
- analyze the current file
- derive conservative semantic descriptors
- translate a user request into an explicit `EditPlan`
- apply deterministic FFmpeg-backed transforms
- render previews, loudness-matched A/B previews, or exports
- compare before and after
- record provenance in a `SessionGraph`
That loop is exposed both through modules and through a thin tool surface.
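The loudness-matched A/B previews in the loop above come down to simple gain arithmetic. The following is a minimal sketch, assuming integrated loudness values (in LUFS) have already been measured for both versions. The helper names `matchGainDb` and `dbToLinear` are hypothetical and are not part of the repository's render API, and which loudness target the render module actually uses is not stated here; this example matches the edited version to the original.

```typescript
// Hypothetical helpers for illustration; not the repository's render API.
// Assumes integrated loudness (LUFS) was measured beforehand for both versions.

/** Gain in dB that moves material measured at `measuredLufs` to `targetLufs`. */
function matchGainDb(measuredLufs: number, targetLufs: number): number {
  return targetLufs - measuredLufs;
}

/** Convert a gain in dB into a linear amplitude multiplier. */
function dbToLinear(gainDb: number): number {
  return Math.pow(10, gainDb / 20);
}

// Example values (made up): the edit came out 3 LU louder than the original.
const originalLufs = -16;
const editedLufs = -13;

// Pull the edited preview down to the original's loudness for a fair comparison.
const previewGainDb = matchGainDb(editedLufs, originalLufs);
const previewGainLinear = dbToLinear(previewGainDb);
```

Matching loudness before listening keeps "louder sounds better" bias out of the before/after judgment.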
There is now also a narrow alpha CLI surface over that same validated path:
pnpm ali -- edit ./path/to/input.wav "Make this darker and less harsh."
pnpm ali -- edit ./path/to/input.wav "Make this less distorted." --best-effort
pnpm ali -- follow-up ./ali-session-2026-04-27T18-00-00 "Undo."
- each run writes an explicit session directory with:
  - a reusable `workspace/`
  - numbered `runs/run-0001/`, `runs/run-0002/`, ...
  - `session.json`
  - rendered output copies plus `EditPlan`, comparison, interpretation, and session-graph artifacts
- the CLI keeps state explicit and local; it does not add hidden persistence or extra planner breadth by default
- `--best-effort` is an explicit opt-in planner policy for CLI calls that lets subjective texture wording fall back to a conservative tonal-softening proxy instead of refusing when direct artifact evidence is missing
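The numbered run directories described above follow a zero-padded pattern. As a rough sketch of how such names could be produced and advanced, assuming four-digit padding as in `run-0001`, the helpers below are hypothetical and do not reflect the CLI's actual implementation.

```typescript
// Hypothetical helpers; the CLI's real session-directory code may differ.

/** Build a relative run directory such as "runs/run-0001/". */
function runDirName(runIndex: number): string {
  if (!Number.isInteger(runIndex) || runIndex < 1) {
    throw new RangeError("runIndex must be a positive integer");
  }
  return `runs/run-${String(runIndex).padStart(4, "0")}/`;
}

/** Given existing entry names inside runs/, return the next free run index. */
function nextRunIndex(existingNames: readonly string[]): number {
  let highest = 0;
  for (const name of existingNames) {
    // Only names shaped like "run-0002" count; anything else is ignored.
    const match = /^run-(\d+)$/.exec(name);
    if (match) {
      highest = Math.max(highest, Number(match[1]));
    }
  }
  return highest + 1;
}

const nextDir = runDirName(nextRunIndex(["run-0001", "run-0002", "notes"]));
```

Deriving the next index from what is on disk keeps the state explicit and local, in line with the CLI's stated design.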
There is now also an optional interpretation layer for open-ended language:
- `modules/interpretation` can call OpenAI, Google, or the local Codex CLI to normalize a raw request into a bounded `IntentInterpretation`
- callers can choose `conservative` or `best_effort` ambiguity handling when they opt into the interpretation layer
- the richer interpretation artifact now includes explicit `interpretation_policy`, `next_action`, evidence-linked descriptor hypotheses, structured constraints, optional region-intent proposals, alternate candidates, and follow-up interpretation metadata
- deterministic planning remains authoritative and may still reject unsupported or weakly grounded interpretations
- explicit numeric region wording such as `the first 0.5 seconds` or `from 0.2s to 0.7s` can now ground into real `time_range` planner targets for a narrow first cohort of region-safe operations; vague named regions such as `intro` still stay clarification or refusal territory
- in `conservative` mode, orchestration can now return a first-class clarification result and carry the pending clarification state forward explicitly in `SessionGraph.metadata.pending_clarification`
- callers can use the standalone `interpret_request` tool or enable LLM-assisted interpretation inside `run_request_cycle`
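To make the numeric-region rule above concrete, here is a small sketch of grounding explicit wording into a time range while leaving vague named regions ungrounded. It is an illustration only: the `TimeRange` shape, its field names, and `parseExplicitTimeRange` are assumptions, and the planner's real `time_range` target and matching rules may differ.

```typescript
// Hypothetical shape and parser; not the planner's actual time_range contract.
interface TimeRange {
  startSeconds: number;
  endSeconds: number;
}

function parseExplicitTimeRange(text: string): TimeRange | null {
  const lower = text.toLowerCase();

  // "the first 0.5 seconds" -> 0 .. 0.5
  const first = /\bfirst\s+(\d+(?:\.\d+)?)\s*(?:s|sec|secs|seconds?)\b/.exec(lower);
  if (first) {
    const end = Number(first[1]);
    return end > 0 ? { startSeconds: 0, endSeconds: end } : null;
  }

  // "from 0.2s to 0.7s" -> 0.2 .. 0.7
  const span = /\bfrom\s+(\d+(?:\.\d+)?)\s*s?\s+to\s+(\d+(?:\.\d+)?)\s*s?\b/.exec(lower);
  if (span) {
    const start = Number(span[1]);
    const end = Number(span[2]);
    // An empty or reversed range is treated as ungrounded.
    return end > start ? { startSeconds: start, endSeconds: end } : null;
  }

  // Vague named regions ("the intro") are deliberately not guessed at.
  return null;
}

const firstHalfSecond = parseExplicitTimeRange("soften the first 0.5 seconds");
const explicitSpan = parseExplicitTimeRange("clean up from 0.2s to 0.7s");
const vagueRegion = parseExplicitTimeRange("make the intro darker");
```

Returning `null` for vague wording mirrors the stated behavior: such requests go to clarification or refusal rather than to a guessed region.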
On top of the one-shot loop, orchestration now supports early iterative follow-up behavior for:
- `more`
- `make it more`
- `less`
- `make it less`
- `undo`
- `revert to previous version`
- `try another version`
- `retry`
Those follow-ups stay explicit: orchestration resolves them against recorded session history and version provenance instead of inventing hidden state.
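One way to picture this explicit resolution is a lookup that classifies the follow-up phrase, then resolves an undo purely from recorded version provenance. This is a sketch under assumptions: the grouping of phrases into four kinds, the `VersionRecord` shape, and both function names are hypothetical and do not describe the orchestration module's real types.

```typescript
// Hypothetical types and grouping; orchestration's real model may differ.
type FollowUpKind = "more" | "less" | "undo" | "retry";

// Assumption: paired phrasings resolve to the same kind.
const FOLLOW_UP_PHRASES = new Map<string, FollowUpKind>([
  ["more", "more"],
  ["make it more", "more"],
  ["less", "less"],
  ["make it less", "less"],
  ["undo", "undo"],
  ["revert to previous version", "undo"],
  ["try another version", "retry"],
  ["retry", "retry"],
]);

function classifyFollowUp(request: string): FollowUpKind | null {
  // Normalize case, surrounding whitespace, and trailing punctuation.
  const normalized = request.trim().toLowerCase().replace(/[.!]+$/, "");
  return FOLLOW_UP_PHRASES.get(normalized) ?? null;
}

interface VersionRecord {
  versionId: string;
  parentVersionId: string | null;
}

/** Resolve the undo target from recorded provenance only; no hidden state. */
function resolveUndoTarget(
  history: readonly VersionRecord[],
  currentVersionId: string,
): string | null {
  const current = history.find((v) => v.versionId === currentVersionId);
  return current?.parentVersionId ?? null;
}

const exampleHistory: VersionRecord[] = [
  { versionId: "v1", parentVersionId: null },
  { versionId: "v2", parentVersionId: "v1" },
];
const undoTarget = resolveUndoTarget(exampleHistory, "v2");
```

Because the undo target is computed from the history passed in, two callers with the same session history always get the same answer.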
The published tool surface now also exposes a first-class orchestration entrypoint for explicit request-cycle execution:
- `run_request_cycle` supports both initial import-driven runs and session-aware follow-up requests
- follow-up calls stay explicit at the adapter boundary by requiring the caller to provide the current `SessionGraph`
- revert-style and alternate-version flows also require any needed historical `AudioVersion` artifacts to be materialized explicitly instead of being resolved from hidden tool-layer state
- clarification answers use that same explicit session graph path: the next request can resume from `pending_clarification` without any hidden adapter-managed conversation state
The application-facing SDK exposes the same engine path without requiring product code to reach into module internals:
import { createAudioLanguageSession } from "@audio-language-interface/sdk";

const session = await createAudioLanguageSession({
  workspaceDir: "./ali-workspace",
});

// Upload/import a local source into explicit workspace state.
const imported = await session.importAudio({ input: "./loop.wav" });

// Run one deterministic edit and read canonical artifacts.
const edit = await session.edit({
  input: "./loop.wav",
  request: "make it warmer and less harsh",
});
if (edit.resultKind === "applied") {
  console.log(edit.outputVersion.version_id);
  console.log(edit.renderArtifact.output.path);
  console.log(edit.comparisonReport.summary.plain_text);
}

// Generate deterministic A/B/C alternatives when the request has meaningful strength range.
const variants = await session.edit({
  input: "./loop.wav",
  request: "make this darker but keep the punch",
  variants: 3,
});
if (variants.resultKind === "variants_generated") {
  for (const variant of variants.variants) {
    console.log(variant.label, variant.strategy, variant.rank, variant.rationale);
    console.log(variant.comparisonPreview.originalPreview.artifact.output.path);
    console.log(variant.comparisonPreview.editedPreview.artifact.output.path);
    console.log(variant.comparisonPreview.loudnessMatchedOriginalPreview.artifact.output.path);
    console.log(variant.comparisonPreview.loudnessMatchedEditedPreview.artifact.output.path);
  }
  // Some requests collapse to fewer meaningful variants instead of returning fake choices.
  console.log(variants.skippedVariants, variants.warnings);
}

// Continue from explicit session history, render a selected/current version, and compare versions.
const followUp = await session.followUp({ request: "make it a little less harsh" });
const rendered = await session.render({ kind: "final" });
const comparison = await session.compare();
console.log(imported.version.version_id, followUp.resultKind, rendered.output.path, comparison.summary.plain_text);

The current cleanup slice is now analysis-backed instead of purely prompt-driven:
- `analysis` emits explicit `hum`, `click`, and clipping annotations plus file-level artifact fields such as `hum_detected`, `hum_fundamental_hz`, `click_detected`, `click_count`, `clipped_frame_count`, and `clipping_severity`
- `semantics` can assign `hum_present` and `clicks_present` when that evidence is strong enough
- `semantics` now also carries a small deterministic texture vocabulary for `relaxed`, `aggressive`, `distorted`, and `crunchy`, with the actual descriptor truth still grounded in measured dynamics, spectral, and artifact evidence
- `compare` reports `evaluation_basis` so downstream callers can see whether structured verification, heuristic goal alignment, or raw deltas are driving quality interpretation
- `benchmarks` includes a tiny committed fixture-backed cleanup corpus under fixtures/audio/phase-1
The current benchmarked planner surface also includes conservative compound-edit handling:
- explicit 2-step and 3-step tonal compounds such as `make this warmer and airier`, `make this warmer but clean up the low mids`, and `make this darker, less harsh, and less muddy`
- a narrow cross-family compound slice such as `speed up by 10% and tame the sibilance`, `tame the sibilance and make it darker`, `center this more and make it wider`, and the current tradeoff-style `make this a little tighter and more controlled, and darker`
- explicit operation-phase ordering instead of prompt-order guesswork
- structured multi-goal verification rollups that keep requested-target success and regression-guard outcomes separate, including honest partial-success reporting when only part of a compound request lands
- explicit contradiction or refusal failures for prompt pairs such as `make it brighter and darker`, `make it faster and slower`, or `make it wider and narrower`, plus one-pass safety refusals for mixes such as brightening-plus-de-essing when the baseline planner cannot justify the sequence conservatively
- a first explicit numeric region-targeting slice, currently benchmarked around planner-emitted `time_range` targets for localized tonal cleanup and explicit refusal of vague region wording
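The contradiction refusals listed above can be illustrated with a simple opposing-pair check. This is a deliberately small sketch: the pair table covers only the three example prompts from this section, and `findContradiction` is a hypothetical name, not the planner's actual function or vocabulary.

```typescript
// Hypothetical, minimal contradiction check; the real planner is richer.
const OPPOSING_PAIRS: ReadonlyArray<readonly [string, string]> = [
  ["brighter", "darker"],
  ["faster", "slower"],
  ["wider", "narrower"],
];

/** Return the first opposing pair found in the request, or null if none. */
function findContradiction(request: string): readonly [string, string] | null {
  // Split into lowercase words so "Brighter," and "brighter" compare equal.
  const words = new Set<string>(request.toLowerCase().match(/[a-z]+/g) ?? []);
  for (const pair of OPPOSING_PAIRS) {
    if (words.has(pair[0]) && words.has(pair[1])) {
      return pair;
    }
  }
  return null;
}

const conflicting = findContradiction("make it brighter and darker");
const compatible = findContradiction("make this darker and less harsh");
```

Failing explicitly on a detected pair is preferable to silently applying whichever direction appears first in the prompt.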
The benchmark layer now also has an opt-in live interpretation evaluation path:
- `modules/benchmarks` can call OpenAI, Google, or Codex CLI through the real `interpretRequest(...)` surface and score the returned `IntentInterpretation` against the curated interpretation corpus
- that live eval path is intentionally separate from `pnpm run ci` because it depends on real provider keys, network behavior, latency, and API cost
The repo is organized into five groups.
- `contracts`
- `modules/core`
- `modules/history`
- `modules/capabilities`
This layer owns canonical artifacts, schema contracts, IDs, provenance, and published runtime capability metadata.
- `modules/io`
- `modules/analysis`
- `modules/transforms`
- `modules/render`
- `modules/compare`
This layer owns deterministic import, inspection, execution, rendering, loudness-matched A/B preview generation, and before/after evaluation.
- `modules/semantics`
- `modules/planning`
This layer owns interpretation of measurable audio evidence and conversion of user requests into explicit edit plans.
- `modules/cli`
- `modules/interpretation`
- `modules/server`
- `modules/tools`
- `modules/orchestration`
This layer exposes stable integration surfaces over the runtime and intent modules without redefining their responsibilities. modules/interpretation is the optional provider-backed request-normalization adapter; it does not replace deterministic planning.
For web backends, modules/server provides a framework-light adapter over the SDK for one-upload sessions, async edit jobs, variant artifact manifests, follow-ups, render/compare calls, and cleanup hooks. It intentionally does not bundle an HTTP framework, auth, accounts, billing, or product UI state.
- `modules/benchmarks`
This layer owns prompt suites, scoring harnesses, and repeatable evaluation workflows.
For the full dependency and boundary rules, see docs/architecture.md.
The repository converges on a small set of canonical artifacts:
- `AudioAsset`
- `AudioVersion`
- `AnalysisReport`
- `SemanticProfile`
- `IntentInterpretation`
- `EditPlan`
- `TransformRecord`
- `RenderArtifact`
- `ComparisonReport`
- `SessionGraph`
- `ToolRequest`
- `ToolResponse`
- `RuntimeCapabilityManifest`
These are published under contracts/schemas with matching examples under contracts/examples.
The current runtime can execute:
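- `gain`
- `normalize`
- `trim`
- `trim_silence`
- `fade`
- `pitch_shift`
- `parametric_eq`
- `high_pass_filter`
- `low_pass_filter`
- `high_shelf`
- `low_shelf`
- `notch_filter`
- `tilt_eq`
- `compressor`
- `limiter`
- `transient_shaper`
- `clipper`
- `gate`
- `time_stretch`
- `reverse`
- `mono_sum`
- `pan`
- `channel_swap`
- `channel_remap`
- `stereo_balance_correction`
- `mid_side_eq`
- `stereo_width`
- `denoise`
- `de_esser`
- `declick`
- `declip`
- `dehum`
- `reverb`
- `delay`
- `echo`
- `bitcrush`
- `distortion`
- `saturation`
- `flanger`
- `phaser`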
The baseline planner is intentionally narrower. It currently plans only against operations marked planner_supported in the published capability manifest.
At the moment, that includes:
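- `gain`
- `trim`
- `trim_silence`
- `fade`
- `normalize`
- `pitch_shift`
- `parametric_eq`
- `high_pass_filter`
- `low_pass_filter`
- `high_shelf`
- `low_shelf`
- `notch_filter`
- `tilt_eq`
- `compressor`
- `limiter`
- `time_stretch`
- `stereo_balance_correction`
- `stereo_width`
- `denoise`
- `de_esser`
- `declick`
- `declip`
- `dehum`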
The baseline planner now includes a conservative timing-edit slice for explicit boundary-silence trimming, pitch-preserving time stretching, and semitone pitch shifting on pitched material, plus a narrow stereo/spatial slice for widening, narrowing, and centering already-stereo material when the measured image is safe to adjust conservatively. pan, channel-utility and broader stereo-routing operations, the broader transient/control operations, and the newer creative effect operations remain runtime-available without being baseline-planner-selected. The transient-shaper surface is currently a compand-based, transient-biased runtime primitive rather than a full transient-designer model.
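The split between runtime-available and planner-selectable operations can be sketched as a filter over the capability manifest. The manifest shape below is a simplified assumption that keeps only a name and the `planner_supported` flag; the published schema under contracts/schemas is authoritative, and the function names are hypothetical.

```typescript
// Simplified, hypothetical manifest shape; see contracts/schemas for the real one.
interface OperationCapability {
  name: string;
  planner_supported: boolean;
}

interface CapabilityManifest {
  operations: OperationCapability[];
}

/** Names of operations the baseline planner is allowed to select. */
function plannerSelectable(manifest: CapabilityManifest): Set<string> {
  return new Set(
    manifest.operations.filter((op) => op.planner_supported).map((op) => op.name),
  );
}

/** Return requested operations that the planner must not emit. */
function findUnplannable(
  manifest: CapabilityManifest,
  requested: readonly string[],
): string[] {
  const allowed = plannerSelectable(manifest);
  return requested.filter((name) => !allowed.has(name));
}

// A four-entry excerpt consistent with the lists above.
const exampleManifest: CapabilityManifest = {
  operations: [
    { name: "gain", planner_supported: true },
    { name: "de_esser", planner_supported: true },
    { name: "pan", planner_supported: false },
    { name: "reverb", planner_supported: false },
  ],
};

const rejectedOperations = findUnplannable(exampleManifest, ["gain", "reverb"]);
```

Reading the flag from the manifest, rather than from transform internals, is what lets the planner stay narrower than the runtime without the two drifting apart.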
Published tool entrypoints:
- `describe_runtime_capabilities`
- `load_audio`
- `analyze_audio`
- `interpret_request`
- `plan_edits`
- `apply_edit_plan`
- `render_preview`
- `compare_versions`
- `run_request_cycle`
The tool layer is intentionally small. It exists to expose stable contracts and capability discovery to external callers, not to replace the underlying module boundaries.
The current baseline is strongest on conservative cleanup and corrective-edit prompts when they are backed by explicit evidence:
- `analysis` can now publish steady mains-hum evidence and sparse click evidence directly in `AnalysisReport`
- `planning` keeps hum/click cleanup conservative and still requires explicit restoration intent rather than widening generic `clean it up` phrasing automatically
- `compare` prefers structured verification targets when they exist and exposes `evaluation_basis` in `ComparisonReport`
- `benchmarks` now include curated compare cases, an interpretation-only corpus for the richer `IntentInterpretation` artifact, a small fixture-backed request-cycle corpus that executes the real orchestration path across tonal cleanup, restoration, timing edits, stereo/spatial edits, peak-control, benchmarked louder-and-controlled prompts, and explicit filter/trim/fade/denoise prompts, plus a planner-supported operation verification matrix that records request-cycle coverage, planner-only coverage, and explicit gaps
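The preference order that `compare` is described as following (structured verification first, heuristic goal alignment as a fallback, raw deltas otherwise) can be sketched as a small selection function. The basis labels and input shape here are assumptions for illustration; the real `evaluation_basis` values are defined by the `ComparisonReport` schema.

```typescript
// Hypothetical labels and inputs; the ComparisonReport schema is authoritative.
type EvaluationBasis =
  | "structured_verification"
  | "heuristic_goal_alignment"
  | "raw_deltas";

interface CompareInputs {
  /** Number of structured verification targets attached to the plan. */
  verificationTargetCount: number;
  /** Number of looser goal hints available for heuristic alignment. */
  goalHintCount: number;
}

/** Pick the strongest available basis, in the stated preference order. */
function selectEvaluationBasis(inputs: CompareInputs): EvaluationBasis {
  if (inputs.verificationTargetCount > 0) {
    return "structured_verification";
  }
  if (inputs.goalHintCount > 0) {
    return "heuristic_goal_alignment";
  }
  return "raw_deltas";
}

const basisWithTargets = selectEvaluationBasis({ verificationTargetCount: 2, goalHintCount: 1 });
const basisWithoutEvidence = selectEvaluationBasis({ verificationTargetCount: 0, goalHintCount: 0 });
```

Reporting which basis was used lets a downstream caller weigh a "success" backed by structured targets differently from one inferred from raw deltas.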
The current system is strongest on conservative editing requests such as:
- darker
- less harsh
- more relaxed
- slightly cleaner
- explicit loudness normalization
- airier, warmer, less muddy, or warmer-plus-low-mid cleanup through conservative surgical EQ
- texture wording such as `more relaxed` or `less aggressive` when it can be grounded honestly as a conservative tonal-softening move
- CLI-only `--best-effort` texture fallbacks for subjective phrases such as `less distorted`, `less aggressive`, `less sharp`, `less gritty`, `less fuzzy`, or `less intense`; these stay labeled as proxy tonal-softening edits rather than claimed artifact repair
- tame sibilance, remove explicitly specified `50 Hz` or `60 Hz` hum, and clean up clicks
- explicit `less distorted`, `repair clipping`, or `declip` wording when the source has direct clipping evidence; this is narrow hard-clipping repair, not general distortion removal
- more controlled
- control peaks
- widen or narrow slightly when stereo evidence supports it
- center the image more or fix the stereo imbalance when measured balance supports it
- reduce steady broadband noise conservatively
This repo is usable today for technical experimentation and module-level integration work. It is not yet a polished end-user application.
- import is local-file based, with support for common music-user containers such as WAV, FLAC, MP3, AIFF, M4A/MP4, and raw AAC when the local FFmpeg stack supports it
- analysis currently requires WAV files on disk, so non-WAV sources must be imported through the IO normalization path before analysis
- semantic coverage is intentionally conservative
- compare now prefers structured verification targets, with heuristic goal alignment kept only as a backward-compatible fallback
- compare can verify explicit trim duration, fade boundary envelopes, and the first numeric `time_range` level/spectral checks from workspace-local WAV evidence
- hum and click comparison now prefers direct `AnalysisReport.artifacts` evidence when it exists, with low-band or clipped-sample proxies kept only as conservative fallbacks
- there is now a narrow alpha CLI entrypoint for local single-file editing and explicit follow-ups, but there is still no broader GUI or service surface
- the public SDK can now request deterministic edit variants for fresh imports, returning `subtle`, `balanced`, and/or `stronger` candidates with canonical artifacts, loudness-matched A/B preview sets, skipped-variant warnings for duplicate plans, and conservative ranking
- the render engine and SDK/orchestration variant flow can now produce fair before/after preview sets with original, edited, loudness-matched original, and loudness-matched edited artifacts
- the baseline planner does not yet auto-select `pan`, `mid_side_eq`, channel remapping, or the broader Layer 1 runtime-effect surface
- pure `more controlled` requests may now refuse on already tightly controlled material instead of silently degrading it, while companion tonal edits can proceed with an explicit note that redundant compression was skipped
- benchmark coverage now includes a tiny committed cleanup, grounded texture, timing, stereo/spatial, filter, restoration, and control corpus, plus an operation-by-operation verification matrix; it is still light compared with the long-term goal, but every current planner-supported operation now has fixture-backed request-cycle outcome coverage
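The skipped-variant behavior for duplicate plans mentioned above can be sketched as fingerprint-based deduplication. Everything here is an assumption made for illustration: the `PlanStep` and `VariantCandidate` shapes, the `gain_db` parameter name, and the fingerprinting approach are hypothetical and do not describe the SDK's actual implementation.

```typescript
// Hypothetical shapes; the real EditPlan contract lives under contracts/schemas.
interface PlanStep {
  operation: string;
  parameters: Record<string, number>;
}

interface VariantCandidate {
  label: string;
  steps: PlanStep[];
}

interface DedupeResult {
  kept: VariantCandidate[];
  skipped: string[];
}

/** Stable string for a plan; parameter keys are sorted so key order is ignored. */
function planFingerprint(steps: readonly PlanStep[]): string {
  return JSON.stringify(
    steps.map((step) => [
      step.operation,
      Object.keys(step.parameters)
        .sort()
        .map((key) => [key, step.parameters[key]]),
    ]),
  );
}

/** Keep the first variant for each distinct plan; report later duplicates by label. */
function dedupeVariants(candidates: readonly VariantCandidate[]): DedupeResult {
  const seen = new Set<string>();
  const result: DedupeResult = { kept: [], skipped: [] };
  for (const candidate of candidates) {
    const fingerprint = planFingerprint(candidate.steps);
    if (seen.has(fingerprint)) {
      result.skipped.push(candidate.label);
    } else {
      seen.add(fingerprint);
      result.kept.push(candidate);
    }
  }
  return result;
}

// Made-up example: "stronger" collapses onto "balanced" after a safety clamp.
const shelf = (gainDb: number): PlanStep[] => [
  { operation: "high_shelf", parameters: { gain_db: gainDb } },
];
const deduped = dedupeVariants([
  { label: "subtle", steps: shelf(-1.5) },
  { label: "balanced", steps: shelf(-3) },
  { label: "stronger", steps: shelf(-3) },
]);
```

Surfacing the skipped labels, rather than returning three outputs that sound identical, matches the stated goal of not presenting fake choices.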
.
|-- AGENTS.md
|-- README.md
|-- contracts/
|   |-- examples/
|   `-- schemas/
|-- docs/
|-- fixtures/
|   `-- audio/
|-- modules/
|   |-- analysis/
|   |-- benchmarks/
|   |-- capabilities/
|   |-- compare/
|   |-- core/
|   |-- history/
|   |-- interpretation/
|   |-- io/
|   |-- orchestration/
|   |-- planning/
|   |-- render/
|   |-- sdk/
|   |-- semantics/
|   |-- server/
|   |-- tools/
|   `-- transforms/
`-- tests/
    `-- integration/
- Install the prerequisites in docs/system-dependencies.md.
- Run `pnpm install`.
- Run the validation loop:

pnpm validate:schemas
pnpm lint
pnpm typecheck
pnpm test

Or run the full CI-equivalent command:

pnpm run ci

For contributors and agents, use this order:
- AGENTS.md
- docs/architecture.md
- docs/implementation-plan.md
- docs/current-capabilities.md
- docs/contributor-guide.md
- docs/repository-map.md
- the target module's `agents.md`
- the target module's `docs/overview.md`
- the relevant contracts under `contracts/schemas/`
- modules/capabilities/src/index.ts for runtime capability metadata
- modules/tools/src/index.ts for stable tool execution
- modules/orchestration/src/index.ts for thin composed workflows
- modules/server/src/index.ts for backend upload/session/job/artifact workflows over the SDK
The architecture split is complete enough to build forward on:
- runtime and intent are separated
- planning is grounded by published capability metadata instead of transform internals
- contracts and docs reflect the split
- the repository validation loop is green
The next work is not more restructuring. It is deeper behavior:
- better capability metadata
- stronger planning
- better compare and verification
- a thin CLI or app adapter on top of the current modules