Merged
2 changes: 1 addition & 1 deletion .github/workflows/ci-build.yml
@@ -64,7 +64,7 @@ jobs:
- name: Run clang-format (check mode)
run: |
find . \
\( -path './.git' -o -path './ggml' -o -path './build' \) -prune -o \
\( -path './.git' -o -path './ggml' -o -path './build' -o -path './vendor' -o -path './mp3' \) -prune -o \
-type f \( -name '*.c' -o -name '*.h' -o -name '*.cc' -o -name '*.cpp' -o -name '*.hpp' \) \
-print0 | xargs -0 clang-format --dry-run --Werror

41 changes: 40 additions & 1 deletion README.md
@@ -258,6 +258,35 @@ EOF
--vae models/vae-BF16.gguf
```

**Lego** (`"lego"` in JSON + `--src-audio`):
generates a new instrument track layered over an existing backing track.
Only the **base model** (`acestep-v15-base`) supports lego mode.
See `examples/lego.json` and `examples/lego.sh`.

```bash
cat > /tmp/lego.json << 'EOF'
{
"caption": "electric guitar riff, funk guitar, house music, instrumental",
"lyrics": "[Instrumental]",
"lego": "guitar",
"inference_steps": 50,
"guidance_scale": 7.0,
"shift": 1.0
}
EOF

./build/dit-vae \
--src-audio backing-track.wav \
--request /tmp/lego.json \
--text-encoder models/Qwen3-Embedding-0.6B-Q8_0.gguf \
--dit models/acestep-v15-base-Q8_0.gguf \
--vae models/vae-BF16.gguf \
--wav
```

Available track names: `vocals`, `backing_vocals`, `drums`, `bass`, `guitar`,
`keyboard`, `percussion`, `strings`, `synth`, `fx`, `brass`, `woodwinds`.

## Request JSON reference

Only `caption` is required. All other fields default to "unset" which means
@@ -285,7 +314,8 @@ the LLM fills them, or a sensible runtime default is applied.
"shift": 3.0,
"audio_cover_strength": 0.5,
"repainting_start": -1,
- "repainting_end": -1
+ "repainting_end": -1,
+ "lego": ""
}
```

@@ -353,6 +383,15 @@ the DiT regenerates the `[start, end)` time region while preserving everything
else. `-1` on start means 0s (beginning), `-1` on end means source duration
(end). Error if end <= start after resolve. `audio_cover_strength` is ignored.
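
The range resolution described above can be sketched as follows. This is only an illustration, not the project's code: `resolve_repaint_range` and its parameters are hypothetical names, and the sketch assumes that any negative value means "unset" and that no clamping to the source duration is applied.

```cpp
// Hypothetical helper (not from the project): resolves the repaint window.
//   start < 0 -> 0 seconds (beginning of the source)
//   end   < 0 -> source duration (end of the source)
// Returns false when the resolved window is empty or inverted (end <= start).
bool resolve_repaint_range(float start, float end, float src_duration,
                           float * out_start, float * out_end) {
    const float s = (start < 0.0f) ? 0.0f : start;
    const float e = (end < 0.0f) ? src_duration : end;
    if (e <= s) {
        return false;  // corresponds to the "end <= start" error case
    }
    *out_start = s;
    *out_end   = e;
    return true;
}
```

With a 30-second source, `(-1, -1)` would resolve to the whole file, while `(10, 5)` would be rejected.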

**`lego`** (string, default `""` = inactive)
Track name for lego mode. Requires `--src-audio` and the **base model**.
Valid names: `vocals`, `backing_vocals`, `drums`, `bass`, `guitar`,
`keyboard`, `percussion`, `strings`, `synth`, `fx`, `brass`, `woodwinds`.
When set, the source audio is passed to the DiT as context and the instruction
becomes `"Generate the {TRACK} track based on the audio context:"`, where
`{TRACK}` is the track name in uppercase.
`audio_cover_strength` is forced to 1.0 (all steps see the source audio).
`caption` may be empty when `lego` is set.
Use `inference_steps=50`, `guidance_scale=7.0`, `shift=1.0` with the base model.
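
A minimal sketch of the track-name check and instruction string, assuming the track name is uppercased before substitution (as the source comments in `tools/dit-vae.cpp` in this change indicate). The names `kLegoTracks`, `is_valid_lego_track`, and `build_lego_instruction` are hypothetical and chosen for illustration only.

```cpp
#include <cctype>
#include <string>

// Track names as listed in the README text above.
static const char * const kLegoTracks[] = {
    "vocals", "backing_vocals", "drums",  "bass", "guitar", "keyboard",
    "percussion", "strings",    "synth",  "fx",   "brass",  "woodwinds",
};

// Hypothetical helper: exact, case-sensitive match against the list.
bool is_valid_lego_track(const std::string & name) {
    for (const char * t : kLegoTracks) {
        if (name == t) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: uppercases the track name and fills the template.
std::string build_lego_instruction(const std::string & track) {
    std::string upper = track;
    for (char & c : upper) {
        // Cast to unsigned char first: toupper is undefined for negative values.
        c = (char) std::toupper((unsigned char) c);
    }
    return "Generate the " + upper + " track based on the audio context:";
}
```

Because the match is exact, `"Guitar"` would be rejected in this sketch; only the lowercase names are accepted.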

### LM sampling (ace-qwen3)

**`lm_temperature`** (float, default `0.85`)
8 changes: 8 additions & 0 deletions examples/lego.json
@@ -0,0 +1,8 @@
{
"caption": "",
"lyrics": "[Instrumental]",
"lego": "guitar",
"inference_steps": 50,
"guidance_scale": 7.0,
"shift": 1.0
}
30 changes: 30 additions & 0 deletions examples/lego.sh
@@ -0,0 +1,30 @@
#!/bin/bash
# Lego test: two-step self-contained pipeline.
#
# Prerequisite: the base DiT model (acestep-v15-base) must already be present
# in ../models. Lego requires it (turbo/sft do not support it), and this
# script does not download it.
# Step 1: generate a track from the simple prompt
# Step 2: apply lego guitar to that generated track

set -eu

# Step 1: generate a source track with the simple prompt
../build/ace-qwen3 \
--request simple.json \
--model ../models/acestep-5Hz-lm-4B-Q8_0.gguf

../build/dit-vae \
--request simple0.json \
--text-encoder ../models/Qwen3-Embedding-0.6B-Q8_0.gguf \
--dit ../models/acestep-v15-turbo-Q8_0.gguf \
--vae ../models/vae-BF16.gguf \
--wav

# Step 2: lego guitar on the generated track (base model required)
../build/dit-vae \
--src-audio simple00.wav \
--request lego.json \
--text-encoder ../models/Qwen3-Embedding-0.6B-Q8_0.gguf \
--dit ../models/acestep-v15-base-Q8_0.gguf \
--vae ../models/vae-BF16.gguf \
--wav
9 changes: 9 additions & 0 deletions src/request.cpp
@@ -34,6 +34,7 @@ void request_init(AceRequest * r) {
r->audio_cover_strength = 0.5f;
r->repainting_start = -1.0f;
r->repainting_end = -1.0f;
r->lego = "";
}

// JSON string escape / unescape
@@ -321,6 +322,8 @@ bool request_parse(AceRequest * r, const char * path) {
r->repainting_start = (float) atof(v.c_str());
} else if (k == "repainting_end") {
r->repainting_end = (float) atof(v.c_str());
} else if (k == "lego") {
r->lego = v;
}
}

@@ -356,6 +359,9 @@ bool request_write(const AceRequest * r, const char * path) {
fprintf(f, " \"audio_cover_strength\": %.2f,\n", r->audio_cover_strength);
fprintf(f, " \"repainting_start\": %.1f,\n", r->repainting_start);
fprintf(f, " \"repainting_end\": %.1f,\n", r->repainting_end);
if (!r->lego.empty()) {
fprintf(f, " \"lego\": \"%s\",\n", json_escape(r->lego).c_str());
}
// audio_codes last (no trailing comma)
fprintf(f, " \"audio_codes\": \"%s\"\n", json_escape(r->audio_codes).c_str());
fprintf(f, "}\n");
@@ -380,5 +386,8 @@ void request_dump(const AceRequest * r, FILE * f) {
if (r->repainting_start >= 0.0f || r->repainting_end >= 0.0f) {
fprintf(f, " repaint: start=%.1f end=%.1f\n", r->repainting_start, r->repainting_end);
}
if (!r->lego.empty()) {
fprintf(f, " lego: %s\n", r->lego.c_str());
}
fprintf(f, " audio_codes: %s\n", r->audio_codes.empty() ? "(none)" : "(present)");
}
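
The `request_write` hunk above emits `lego` only when it is non-empty and keeps `audio_codes` last so that the object never ends with a trailing comma. A small sketch of that pattern, building a string instead of writing to a `FILE *`. Both `escape_json_min` (a reduced stand-in for the project's `json_escape`, whose exact behavior is not shown here) and `write_request_tail` are hypothetical.

```cpp
#include <string>

// Reduced stand-in for json_escape (assumption: the real function may
// handle more characters than these three).
std::string escape_json_min(const std::string & s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            default:   out += c;      break;
        }
    }
    return out;
}

// Emits the last fields of the request object. The optional field carries
// its own trailing comma; the mandatory last field carries none.
std::string write_request_tail(const std::string & lego, const std::string & audio_codes) {
    std::string out;
    if (!lego.empty()) {
        out += "  \"lego\": \"" + escape_json_min(lego) + "\",\n";
    }
    out += "  \"audio_codes\": \"" + escape_json_min(audio_codes) + "\"\n}\n";
    return out;
}
```

Keeping an always-present field in the final position avoids having to track whether a comma is still owed after each optional field.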
6 changes: 6 additions & 0 deletions src/request.h
@@ -49,6 +49,12 @@ struct AceRequest {
// -1 on start means 0s, -1 on end means source duration.
float repainting_start; // -1
float repainting_end; // -1

// lego mode (requires --src-audio, base model only)
// Track name from TRACK_NAMES: vocals, backing_vocals, drums, bass, guitar,
// keyboard, percussion, strings, synth, fx, brass, woodwinds.
// Empty = lego inactive. When set, dit-vae builds the lego instruction and
// forces audio_cover_strength to 1.0 (cover path, mask = 1.0 everywhere).
std::string lego; // ""
};

// Initialize all fields to defaults (matches Python GenerationParams defaults)
68 changes: 59 additions & 9 deletions tools/dit-vae.cpp
@@ -16,6 +16,7 @@
#include "vae-enc.h"
#include "vae.h"

#include <cctype>
#include <cstdio>
#include <cstdlib>
#include <cstring>
@@ -283,11 +284,43 @@ int main(int argc, char ** argv) {
fprintf(stderr, "[Request] ERROR: failed to parse %s, skipping\n", rpath);
continue;
}
- if (req.caption.empty()) {
+ if (req.caption.empty() && req.lego.empty()) {
fprintf(stderr, "[Request] ERROR: caption is empty in %s, skipping\n", rpath);
continue;
}

// Lego mode validation (base model only, requires --src-audio)
bool is_lego = !req.lego.empty();
if (is_lego) {
if (!src_audio_path) {
fprintf(stderr, "[Lego] ERROR: lego requires --src-audio\n");
return 1;
}
if (is_turbo) {
fprintf(stderr, "[Lego] ERROR: lego requires the base DiT model (turbo detected)\n");
return 1;
}
// Reference project: TRACK_NAMES (constants.py)
static const char * allowed[] = {
"vocals", "backing_vocals", "drums", "bass", "guitar", "keyboard",
"percussion", "strings", "synth", "fx", "brass", "woodwinds",
};
bool valid = false;
for (const char * name : allowed) {
if (req.lego == name) {
valid = true;
break;
}
}
if (!valid) {
fprintf(stderr, "[Lego] ERROR: '%s' is not a valid track name\n", req.lego.c_str());
fprintf(stderr,
" Valid: vocals, backing_vocals, drums, bass, guitar, keyboard,\n"
" percussion, strings, synth, fx, brass, woodwinds\n");
return 1;
}
}

// Extract params
const char * caption = req.caption.c_str();
const char * lyrics = req.lyrics.c_str();
@@ -406,19 +439,36 @@ int main(int argc, char ** argv) {
}

// 2. Build formatted prompts
- // Reference project uses opposite-sounding instructions (constants.py):
+ // Reference project instruction templates (constants.py TASK_INSTRUCTIONS):
// text2music = "Fill the audio semantic mask..."
// cover = "Generate audio semantic tokens..."
// repaint = "Repaint the mask area..."
// lego = "Generate the {TRACK_NAME} track based on the audio context:"
// Auto-switches to cover when audio_codes are present
- bool is_cover = have_cover || !codes_vec.empty();
- const char * instruction = is_repaint ? "Repaint the mask area based on the given conditions:" :
- is_cover ? "Generate audio semantic tokens based on the given conditions:" :
- "Fill the audio semantic mask based on the given conditions:";
- char metas[512];
+ bool is_cover = have_cover || !codes_vec.empty();
+ std::string instruction_str;
+ if (is_lego) {
+ // Lego mode: force audio_cover_strength=1.0 so all DiT steps see the source audio
+ req.audio_cover_strength = 1.0f;
+ fprintf(stderr, "[Lego] track=%s, cover path, strength=1.0\n", req.lego.c_str());
+ // Reference project (task_utils.py:86): track name is UPPERCASE
+ std::string track_upper = req.lego;
+ for (char & c : track_upper) {
+ c = (char) toupper((unsigned char) c);
+ }
+ instruction_str = "Generate the " + track_upper + " track based on the audio context:";
+ } else if (is_repaint) {
+ instruction_str = "Repaint the mask area based on the given conditions:";
+ } else if (is_cover) {
+ instruction_str = "Generate audio semantic tokens based on the given conditions:";
+ } else {
+ instruction_str = "Fill the audio semantic mask based on the given conditions:";
+ }
+
+ char metas[512];
snprintf(metas, sizeof(metas), "- bpm: %s\n- timesignature: %s\n- keyscale: %s\n- duration: %d seconds\n", bpm,
timesig, keyscale, (int) duration);
- std::string text_str = std::string("# Instruction\n") + instruction + "\n\n" + "# Caption\n" + caption +
+ std::string text_str = std::string("# Instruction\n") + instruction_str + "\n\n" + "# Caption\n" + caption +
"\n\n" + "# Metas\n" + metas + "<|endoftext|>\n";

std::string lyric_str = std::string("# Languages\n") + language + "\n\n# Lyric\n" + lyrics + "<|endoftext|>";
@@ -536,7 +586,7 @@ int main(int argc, char ** argv) {
}

// Build context: [T, ctx_ch] = src_latents[64] + chunk_mask[64]
- // Cover: src = cover_latents, mask = 1.0 everywhere
+ // Cover/Lego: src = cover_latents, mask = 1.0 everywhere
// Repaint: src = silence in region / cover outside, mask = 1.0 in region / 0.0 outside
// Passthrough: detokenized FSQ codes + silence padding, mask = 1.0
// Text2music: silence only, mask = 1.0
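
The mask rules in the comment above can be illustrated with a simplified sketch. It assumes one mask value per frame (the real context uses 64 mask channels per frame) and takes the repaint region as frame indices; `MaskMode` and `build_chunk_mask` are hypothetical names, not part of the project.

```cpp
#include <cstddef>
#include <vector>

// Full:    cover, lego, passthrough, text2music (mask = 1.0 everywhere)
// Repaint: mask = 1.0 inside [region_begin, region_end), 0.0 outside
enum class MaskMode { Full, Repaint };

// Hypothetical helper: returns a per-frame mask of length T.
std::vector<float> build_chunk_mask(MaskMode mode, std::size_t T,
                                    std::size_t region_begin, std::size_t region_end) {
    if (mode == MaskMode::Full) {
        return std::vector<float>(T, 1.0f);
    }
    std::vector<float> mask(T, 0.0f);
    if (region_end > T) {
        region_end = T;  // keep the region inside the buffer
    }
    for (std::size_t t = region_begin; t < region_end; t++) {
        mask[t] = 1.0f;
    }
    return mask;
}
```

In this simplified view, lego shares the `Full` case with cover: the whole source is visible as context and only the instruction text differs.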