perf: optimize Phase 2 batch generation with dynamic compaction by 3-12% (#20)

Merged — ServeurpersoCom merged 3 commits into ServeurpersoCom:master from jdluzen:perf/batch1 on Mar 10, 2026

Conversation

@jdluzen (Contributor) commented Mar 10, 2026

Tested with --batch 4

Summary by CodeRabbit

  • Performance
    • Optimized batch processing logic to reduce memory footprint and decrease GPU computational overhead.

coderabbitai bot (Contributor) commented Mar 10, 2026

📝 Walkthrough

The pull request refactors the Batched Phase 2 logic in ace-qwen3.cpp, replacing a per-step two-pass forward strategy with dynamic, compacted batching. It introduces active-to-original mapping, compact logits extraction, and CPU-side sampling to reduce GPU compute and memory footprint.

Changes

Batched Phase 2 Logic Refactor (tools/ace-qwen3.cpp):
Reworked the batching strategy with dynamic compaction: introduced max_N2, batch_tokens, batch_sets, and batch_logits variables; added an active_to_orig mapping and audio_code_offset; moved CFG application to a targeted logits subset; implemented CPU-side compact sampling that reduces the sampling vocabulary from ~65k tokens to the audio codes plus EOS; eliminated full-vocabulary masking and large logits-buffer operations.

Sequence Diagram(s)

sequenceDiagram
    participant Batch as Dynamic Batch<br/>(Phase 2)
    participant GPU as GPU Forward<br/>Pass
    participant Logits as Logits<br/>Processing
    participant CPU as CPU Sampling
    participant Map as ID Mapper

    rect rgba(100, 150, 200, 0.5)
        Note over Batch,Map: New Optimized Flow
        Batch->>Batch: Collect active sequences<br/>via active_to_orig mapping
        Batch->>GPU: Forward pass on compact<br/>batch (actual_batch_size)
        GPU-->>Logits: Return logits tensor
        Logits->>Logits: Extract compact subset<br/>(EOS + audio codes)
        alt CFG Enabled
            Logits->>Logits: Apply CFG scale to<br/>compact logits only
        end
        Logits->>CPU: Pass compact_logits<br/>to CPU
        CPU->>CPU: Sample from compact<br/>vocabulary
        CPU-->>Map: Local sampled IDs
        Map->>Map: Map local IDs to<br/>global audio codes
        Map-->>Batch: Global token IDs
        Batch->>Batch: Update sequences &<br/>compute next n_active
    end
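The local-to-global ID mapping in the last steps of the diagram might look like the following sketch (the token-id constants and the layout — slot 0 for EOS, slots 1..N for audio codes — are assumptions for illustration, not the values used in ace-qwen3.cpp):

```cpp
#include <cstdint>

constexpr int32_t TOKEN_IM_END    = 151645; // hypothetical EOS token id
constexpr int32_t AUDIO_CODE_BASE = 151700; // hypothetical first audio-code id

// Map an index sampled from the compact vocabulary (EOS + audio codes)
// back to a global token id. Slot 0 is reserved for EOS; slots 1..N
// correspond to audio codes 0..N-1.
int32_t local_to_global(int local_id) {
    return local_id == 0 ? TOKEN_IM_END : AUDIO_CODE_BASE + (local_id - 1);
}
```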

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 The batches now compress with care,
No lengthy logits fill the air—
Just EOS and codes we need,
The GPU works with faster speed!
Where once two passes wandered by,
Now compact paths illuminate the sky. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The PR title accurately describes the main change: optimizing Phase 2 batch generation with dynamic compaction and expected performance improvements. The title is specific, clear, and directly reflects the primary objective.



@coderabbitai coderabbitai bot left a comment
🧹 Nitpick comments (1)
tools/ace-qwen3.cpp (1)

619-623: Keep n_active and total_codes incremental in the hot path.

This loop already knows when a sequence flips to done, so the extra full for (i = 0; i < N; ++i) pass to rebuild next_active_count is avoidable. total_codes is also recomputed on every step even though it is only emitted every 50 steps.

Also applies to: 626-635
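The incremental bookkeeping the reviewer suggests could look roughly like this (an illustrative sketch; `Seq`, the token-id constants, and the function name are stand-ins for the actual identifiers in ace-qwen3.cpp):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int32_t TOKEN_IM_END    = 151645; // hypothetical EOS token id
constexpr int32_t AUDIO_CODE_BASE = 151700; // hypothetical first audio-code id

struct Seq {
    bool done = false;
    std::vector<int32_t> audio_codes;
};

// Consume one sampled token per active sequence, updating n_active and
// total_codes in place instead of rebuilding them with a second full pass.
void accept_tokens(std::vector<Seq> & seqs,
                   const std::vector<int> & active_to_orig,
                   const std::vector<int32_t> & sampled,
                   int & n_active, size_t & total_codes) {
    for (size_t a = 0; a < active_to_orig.size(); ++a) {
        Seq & s = seqs[active_to_orig[a]];
        const int32_t tok = sampled[a];
        if (tok == TOKEN_IM_END) {
            s.done = true;
            --n_active;    // decrement immediately when a sequence finishes
        } else {
            s.audio_codes.push_back(tok - AUDIO_CODE_BASE);
            ++total_codes; // keep the running total current for the 50-step log
        }
    }
}
```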

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/ace-qwen3.cpp` around lines 619 - 623, The loop that checks tok against
TOKEN_IM_END should update the running counters in-place instead of recomputing
them later: when seqs[orig_i].done transitions to true, decrement n_active
immediately (and adjust any next_active_count tracking used later); when pushing
an audio code (seqs[orig_i].audio_codes.push_back(tok - AUDIO_CODE_BASE))
increment total_codes immediately so you only recompute totals when needed for
the 50-step emission; remove the subsequent full for-loop used to rebuild
next_active_count/total_codes and ensure any logic that relied on that pass now
reads the updated n_active and total_codes. Apply the same incremental updates
in the analogous block around the code at lines 626-635.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 28f44bdd-e2cc-4a6a-a0f3-2d1c6e9db697

📥 Commits

Reviewing files that changed from the base of the PR and between 1d57065 and 876fef1.

📒 Files selected for processing (1)
  • tools/ace-qwen3.cpp

@jdluzen changed the title from "perf: improve batch generation in step 1 by 3-12%" to "perf: optimize Phase 2 batch generation with dynamic compaction by 3-12%" on Mar 10, 2026
@ServeurpersoCom (Owner) commented:

That's interesting! I'll try it and merge it.

@ServeurpersoCom ServeurpersoCom merged commit a56c9c6 into ServeurpersoCom:master Mar 10, 2026
3 of 4 checks passed
@ServeurpersoCom (Owner) commented:
RTX PRO 6000 Blackwell | 4B Q8_0 | batch=4 | seed=42 | CFG=2.0

                       BASELINE    PR #20     DELTA
Phase2 Decode          5070 ms     4583 ms    -9.6%
tok/s (all 4 active)   350         360        +2.8%   <- targeted CFG + compact sampling
tok/s (after compact)  348         380        +9.0%   <- compaction kicks in
Total                  9834 ms     9354 ms    -4.9%

@ServeurpersoCom (Owner) commented:

Bonus: compact sampling computes the softmax over only the 2049 valid tokens (EOS + audio codes) instead of ~65k, eliminating probability-mass leakage to impossible text tokens and producing a sharper, more faithful distribution at every decode step.
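That compact softmax can be sketched as follows (a minimal, numerically stable version over an arbitrary compact logits slice; the function name is hypothetical and not taken from ace-qwen3.cpp):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Softmax over a compact logits slice (e.g. EOS + audio codes only).
// Normalizing over the ~2k valid entries instead of the full ~65k vocab
// means no probability mass can leak to impossible text tokens.
std::vector<float> compact_softmax(const std::vector<float> & logits) {
    const float max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<float> p(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(logits[i] - max_l); // subtract max for stability
        sum += p[i];
    }
    for (float & x : p) {
        x /= sum; // probabilities sum to 1 over the compact vocabulary
    }
    return p;
}
```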

Copilot AI referenced this pull request in audiohacking/acestep.cpp Mar 10, 2026
…12% (#20)

* perf: improve batch generation in step 1 by 3-12%

* remove comments

* remove comments
@jdluzen jdluzen deleted the perf/batch1 branch March 10, 2026 16:57