Hi @hanjq17 (Jiaqi Han) and the Spectrum team,
First of all, thank you for the outstanding work on Adaptive Spectral Feature Forecasting (Spectrum). The mathematical elegance of moving from local Taylor approximations to global Chebyshev polynomials, combined with the insight of only forecasting the final block, is a massive leap forward for DiT acceleration. It works flawlessly on standard dense models.
I am writing to request/discuss potential support for the newly released Wan 2.2 architecture (specifically the T2V-A14B and I2V-A14B models).
The Challenge: MoE Split by Timesteps
Unlike Wan 2.1 or FLUX, Wan 2.2 introduces a unique Mixture-of-Experts (MoE) architecture that separates the denoising process across timesteps.
According to their technical report:
- A High-noise Expert is used for the early stages (layout and structure).
- A Low-noise Expert is used for the later stages (details and textures).
- The model dynamically switches completely from Expert A to Expert B mid-generation based on an SNR threshold ($t_{moe}$).
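For concreteness, the routing can be sketched as a pure timestep threshold. The names and the threshold value below are illustrative stand-ins, not Wan 2.2's actual API:

```python
# Sketch of Wan 2.2-style time-split expert routing (hypothetical names;
# the real model derives the boundary t_moe from an SNR threshold).
def select_expert(t: float, t_moe: float = 0.875) -> str:
    """Route a normalized timestep t in [0, 1] (1 = pure noise) to an expert."""
    # Early, high-noise steps shape layout/structure; late steps refine detail.
    return "high_noise_expert" if t >= t_moe else "low_noise_expert"
```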
Why the Current Spectrum Algorithm Breaks
Since Spectrum assumes a continuous trajectory across the entire denoising process:
- The `warm_up_steps` (e.g., the first 5 steps) fit the Chebyshev polynomials exclusively to the feature space of the High-noise Expert.
- When the solver crosses the $t_{moe}$ threshold, Wan 2.2 abruptly swaps in 14 billion different parameters (the Low-noise Expert).
- This creates a mathematical discontinuity. The polynomials trained on Expert A attempt to predict features for Expert B, causing the approximation error to explode and completely destroying the output.
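A toy scalar example makes the failure mode concrete. The trajectories and the constant 2.0 offset below are invented stand-ins for the real high-dimensional feature shift caused by the parameter swap:

```python
# Toy illustration of the discontinuity: fit a Chebyshev polynomial to a
# scalar "feature" trajectory produced by expert A during warm-up, then
# compare its extrapolation against expert B's trajectory after the switch.
import numpy as np
from numpy.polynomial import chebyshev as C

def expert_a(t):  # smooth feature trajectory during the high-noise phase
    return np.sin(3 * t)

def expert_b(t):  # same inputs, but the parameter swap shifts the features
    return np.sin(3 * t) + 2.0

t_warm = np.linspace(0.0, 0.5, 6)           # warm-up steps (high-noise expert)
coeffs = C.chebfit(t_warm, expert_a(t_warm), deg=3)

t_late = 0.7                                # a step past the t_moe switch
pred = C.chebval(t_late, coeffs)            # polynomial trained on expert A
err = abs(expert_b(t_late) - pred)          # jump on the order of the offset
```

Inside the warm-up window the fit is nearly exact; one expert swap later, the error is dominated by the feature-space offset the polynomial has never seen.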
Potential Solutions for Discussion
I was exploring how to implement this in ComfyUI and sketched out a few theoretical ways to work around it, but I would love to hear your authoritative mathematical perspective:
- Dual Warm-up (Re-fitting): Resetting the cache exactly at $t_{moe}$ and forcing a secondary `warm_up_steps` phase for the Low-noise Expert (though this sacrifices some of the speedup).
- High-Noise Only Acceleration: Running Spectrum exclusively during the High-noise phase to establish the geometry rapidly, then turning forecasting off entirely for the Low-noise phase to let the model refine textures without approximation error.
- 1-Step Bias Shift (Zero-shot Projection): Forcing exactly one compute step when the expert switches, calculating the delta/bias between the new expert's output and the polynomial's prediction ($\Delta = h_{new} - \hat{h}_{old}$), and simply shifting the existing Chebyshev curve into the new coordinate space without a full 5-step re-fit.
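For the third option, here is a minimal sketch of the bias shift, assuming features are cached as Chebyshev coefficients over a normalized time axis. The cache layout and function names are hypothetical, not Spectrum's real internals:

```python
# Hedged sketch of the 1-step bias shift: at the expert switch, spend one
# real forward pass, measure the offset between the new expert's feature
# and the stale polynomial's prediction, and translate the fitted curve.
import numpy as np
from numpy.polynomial import chebyshev as C

def bias_shift(coeffs: np.ndarray, t_switch: float, h_new: float) -> np.ndarray:
    """Shift an existing Chebyshev fit so it passes through (t_switch, h_new)."""
    h_old = C.chebval(t_switch, coeffs)   # \hat{h}_old: stale fit's prediction
    delta = h_new - h_old                 # Δ = h_new - \hat{h}_old
    shifted = coeffs.copy()
    shifted[0] += delta                   # T_0(t) = 1, so bumping c_0 shifts
    return shifted                        # the whole curve by Δ
```

Since $T_0(t) = 1$, adding $\Delta$ to the constant coefficient translates the entire fitted curve without any re-fitting, at the cost of exactly one extra compute step.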
Are there any official plans to support time-split MoE architectures like Wan 2.2?
How would you mathematically approach forecasting across an expert transition?