perf: optimize Phase 2 batch generation with dynamic compaction by 3-12% (#20)

Merged — ServeurpersoCom merged 3 commits into ServeurpersoCom:master from jdluzen:perf/batch1 on Mar 10, 2026

Conversation

@jdluzen (Contributor) commented Mar 10, 2026

Tested with --batch 4

Summary by CodeRabbit

  • Performance
    • Optimized batch processing logic to reduce memory footprint and decrease GPU computational overhead.

coderabbitai bot (Contributor) commented Mar 10, 2026

📝 Walkthrough

The pull request refactors the Batched Phase 2 logic in ace-qwen3.cpp, replacing a per-step two-pass forward strategy with dynamic, compacted batching. It introduces active-to-original mapping, compact logits extraction, and CPU-side sampling to reduce GPU compute and memory footprint.

Changes

Batched Phase 2 Logic Refactor (tools/ace-qwen3.cpp):
Reworked the batching strategy with dynamic compaction: introduced max_N2, batch_tokens, batch_sets, and batch_logits variables; added an active_to_orig mapping and audio_code_offset; moved CFG application to a targeted logits subset; implemented CPU-side compact sampling that reduces the sampling vocabulary from ~65k tokens to the audio codes plus EOS; eliminated full-vocabulary masking and large logits-buffer operations.

Sequence Diagram(s)

sequenceDiagram
    participant Batch as Dynamic Batch<br/>(Phase 2)
    participant GPU as GPU Forward<br/>Pass
    participant Logits as Logits<br/>Processing
    participant CPU as CPU Sampling
    participant Map as ID Mapper

    rect rgba(100, 150, 200, 0.5)
        Note over Batch,Map: New Optimized Flow
        Batch->>Batch: Collect active sequences<br/>via active_to_orig mapping
        Batch->>GPU: Forward pass on compact<br/>batch (actual_batch_size)
        GPU-->>Logits: Return logits tensor
        Logits->>Logits: Extract compact subset<br/>(EOS + audio codes)
        alt CFG Enabled
            Logits->>Logits: Apply CFG scale to<br/>compact logits only
        end
        Logits->>CPU: Pass compact_logits<br/>to CPU
        CPU->>CPU: Sample from compact<br/>vocabulary
        CPU-->>Map: Local sampled IDs
        Map->>Map: Map local IDs to<br/>global audio codes
        Map-->>Batch: Global token IDs
        Batch->>Batch: Update sequences &<br/>compute next n_active
    end
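The local-to-global ID mapping in the last steps of the diagram might look like the following sketch (the token-id constants and the layout — slot 0 for EOS, slots 1..N for audio codes — are assumptions for illustration, not the values used in ace-qwen3.cpp):

```cpp
#include <cstdint>

constexpr int32_t TOKEN_IM_END    = 151645; // hypothetical EOS token id
constexpr int32_t AUDIO_CODE_BASE = 151700; // hypothetical first audio-code id

// Map an index sampled from the compact vocabulary (EOS + audio codes)
// back to a global token id. Slot 0 is reserved for EOS; slots 1..N
// correspond to audio codes 0..N-1.
int32_t local_to_global(int local_id) {
    return local_id == 0 ? TOKEN_IM_END : AUDIO_CODE_BASE + (local_id - 1);
}
```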

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 The batches now compress with care,
No lengthy logits fill the air—
Just EOS and codes we need,
The GPU works with faster speed!
Where once two passes wandered by,
Now compact paths illuminate the sky. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The PR title accurately describes the main change: optimizing Phase 2 batch generation with dynamic compaction and expected performance improvements. The title is specific, clear, and directly reflects the primary objective.



@coderabbitai coderabbitai bot left a comment
🧹 Nitpick comments (1)
tools/ace-qwen3.cpp (1)

619-623: Keep n_active and total_codes incremental in the hot path.

This loop already knows when a sequence flips to done, so the extra full for (i = 0; i < N; ++i) pass to rebuild next_active_count is avoidable. total_codes is also recomputed on every step even though it is only emitted every 50 steps.

Also applies to: 626-635
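The incremental bookkeeping the reviewer suggests could look roughly like this (an illustrative sketch; `Seq`, the token-id constants, and the function name are stand-ins for the actual identifiers in ace-qwen3.cpp):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int32_t TOKEN_IM_END    = 151645; // hypothetical EOS token id
constexpr int32_t AUDIO_CODE_BASE = 151700; // hypothetical first audio-code id

struct Seq {
    bool done = false;
    std::vector<int32_t> audio_codes;
};

// Consume one sampled token per active sequence, updating n_active and
// total_codes in place instead of rebuilding them with a second full pass.
void accept_tokens(std::vector<Seq> & seqs,
                   const std::vector<int> & active_to_orig,
                   const std::vector<int32_t> & sampled,
                   int & n_active, size_t & total_codes) {
    for (size_t a = 0; a < active_to_orig.size(); ++a) {
        Seq & s = seqs[active_to_orig[a]];
        const int32_t tok = sampled[a];
        if (tok == TOKEN_IM_END) {
            s.done = true;
            --n_active;    // decrement immediately when a sequence finishes
        } else {
            s.audio_codes.push_back(tok - AUDIO_CODE_BASE);
            ++total_codes; // keep the running total current for the 50-step log
        }
    }
}
```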

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/ace-qwen3.cpp` around lines 619 - 623, The loop that checks tok against
TOKEN_IM_END should update the running counters in-place instead of recomputing
them later: when seqs[orig_i].done transitions to true, decrement n_active
immediately (and adjust any next_active_count tracking used later); when pushing
an audio code (seqs[orig_i].audio_codes.push_back(tok - AUDIO_CODE_BASE))
increment total_codes immediately so you only recompute totals when needed for
the 50-step emission; remove the subsequent full for-loop used to rebuild
next_active_count/total_codes and ensure any logic that relied on that pass now
reads the updated n_active and total_codes. Apply the same incremental updates
in the analogous block around the code at lines 626-635.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 28f44bdd-e2cc-4a6a-a0f3-2d1c6e9db697

📥 Commits

Reviewing files that changed from the base of the PR and between 1d57065 and 876fef1.

📒 Files selected for processing (1)
  • tools/ace-qwen3.cpp

@jdluzen changed the title from "perf: improve batch generation in step 1 by 3-12%" to "perf: optimize Phase 2 batch generation with dynamic compaction by 3-12%" on Mar 10, 2026
@ServeurpersoCom (Owner) commented:

That's interesting! I'll try it and merge it.

@ServeurpersoCom ServeurpersoCom merged commit a56c9c6 into ServeurpersoCom:master Mar 10, 2026
3 of 4 checks passed
@ServeurpersoCom (Owner) commented:
RTX PRO 6000 Blackwell | 4B Q8_0 | batch=4 | seed=42 | CFG=2.0

                       BASELINE    PR #20     DELTA
Phase2 Decode          5070 ms     4583 ms    -9.6%
tok/s (all 4 active)   350         360        +2.8%   <- targeted CFG + compact sampling
tok/s (after compact)  348         380        +9.0%   <- compaction kicks in
Total                  9834 ms     9354 ms    -4.9%

@ServeurpersoCom (Owner) commented:

Bonus: compact sampling computes the softmax over only the 2049 valid tokens (EOS + audio codes) instead of ~65k, eliminating probability-mass leakage to impossible text tokens and producing a sharper, more faithful distribution at every decode step.
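That compact softmax can be sketched as follows (a minimal, numerically stable version over an arbitrary compact logits slice; the function name is hypothetical and not taken from ace-qwen3.cpp):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Softmax over a compact logits slice (e.g. EOS + audio codes only).
// Normalizing over the ~2k valid entries instead of the full ~65k vocab
// means no probability mass can leak to impossible text tokens.
std::vector<float> compact_softmax(const std::vector<float> & logits) {
    const float max_l = *std::max_element(logits.begin(), logits.end());
    std::vector<float> p(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(logits[i] - max_l); // subtract max for stability
        sum += p[i];
    }
    for (float & x : p) {
        x /= sum; // probabilities sum to 1 over the compact vocabulary
    }
    return p;
}
```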

Copilot AI referenced this pull request in audiohacking/acestep.cpp Mar 10, 2026
…12% (#20)

* perf: improve batch generation in step 1 by 3-12%

* remove comments

* remove comments
@jdluzen jdluzen deleted the perf/batch1 branch March 10, 2026 16:57