Hi @hanjq17 (Jiaqi Han) and the Spectrum team,
First of all, thank you for the outstanding work on Adaptive Spectral Feature Forecasting (Spectrum). The mathematical elegance of moving from local Taylor approximations to global Chebyshev polynomials, combined with the insight of only forecasting the final block, is a massive leap forward for DiT acceleration. It works flawlessly on standard dense models.
I am writing to request/discuss potential support for the newly released Wan 2.2 architecture (specifically the T2V-A14B and I2V-A14B models).
The Challenge: MoE Split by Timesteps
Unlike Wan 2.1 or FLUX, Wan 2.2 introduces a unique Mixture-of-Experts (MoE) architecture that separates the denoising process across timesteps.
According to their technical report:
- A High-noise Expert is used for the early stages (layout and structure).
- A Low-noise Expert is used for the later stages (details and textures).
- The model dynamically switches completely from Expert A to Expert B mid-generation based on an SNR threshold ($t_{moe}$).
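For concreteness, the routing can be sketched as a pure timestep threshold. The names and the threshold value below are illustrative stand-ins, not Wan 2.2's actual API:

```python
# Sketch of Wan 2.2-style time-split expert routing (hypothetical names;
# the real model derives the boundary t_moe from an SNR threshold).
def select_expert(t: float, t_moe: float = 0.875) -> str:
    """Route a normalized timestep t in [0, 1] (1 = pure noise) to an expert."""
    # Early, high-noise steps shape layout/structure; late steps refine detail.
    return "high_noise_expert" if t >= t_moe else "low_noise_expert"
```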
Why the Current Spectrum Algorithm Breaks
Since Spectrum assumes a continuous trajectory across the entire denoising process:
- The `warm_up_steps` (e.g., the first 5 steps) fit the Chebyshev polynomials exclusively to the feature space of the High-noise Expert.
- When the solver crosses the $t_{moe}$ threshold, Wan 2.2 abruptly swaps in 14 billion different parameters (the Low-noise Expert).
- This creates a mathematical discontinuity. The polynomials trained on Expert A attempt to predict features for Expert B, causing the approximation error to explode and completely destroying the output.
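A toy scalar example makes the failure mode concrete. The trajectories and the constant 2.0 offset below are invented stand-ins for the real high-dimensional feature shift caused by the parameter swap:

```python
# Toy illustration of the discontinuity: fit a Chebyshev polynomial to a
# scalar "feature" trajectory produced by expert A during warm-up, then
# compare its extrapolation against expert B's trajectory after the switch.
import numpy as np
from numpy.polynomial import chebyshev as C

def expert_a(t):  # smooth feature trajectory during the high-noise phase
    return np.sin(3 * t)

def expert_b(t):  # same inputs, but the parameter swap shifts the features
    return np.sin(3 * t) + 2.0

t_warm = np.linspace(0.0, 0.5, 6)           # warm-up steps (high-noise expert)
coeffs = C.chebfit(t_warm, expert_a(t_warm), deg=3)

t_late = 0.7                                # a step past the t_moe switch
pred = C.chebval(t_late, coeffs)            # polynomial trained on expert A
err = abs(expert_b(t_late) - pred)          # jump on the order of the offset
```

Inside the warm-up window the fit is nearly exact; one expert swap later, the error is dominated by the feature-space offset the polynomial has never seen.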
Potential Solutions for Discussion
I was exploring how to implement this in ComfyUI and sketched out a few theoretical ways to work around it, but I would love to hear your authoritative mathematical perspective:
- Dual Warm-up (Re-fitting): Resetting the cache exactly at $t_{moe}$ and forcing a secondary `warm_up_steps` phase for the Low-noise Expert (though this sacrifices some of the speedup).
- High-Noise Only Acceleration: Running Spectrum exclusively during the High-noise phase to establish the geometry rapidly, then turning forecasting off entirely for the Low-noise phase to let the model refine textures without approximation error.
- 1-Step Bias Shift (Zero-shot Projection): Forcing exactly one compute step when the expert switches, calculating the delta/bias between the new expert's output and the polynomial's prediction ($\Delta = h_{new} - \hat{h}_{old}$), and simply shifting the existing Chebyshev curve into the new coordinate space without a full 5-step re-fit.
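For the third option, here is a minimal sketch of the bias shift, assuming features are cached as Chebyshev coefficients over a normalized time axis. The cache layout and function names are hypothetical, not Spectrum's real internals:

```python
# Hedged sketch of the 1-step bias shift: at the expert switch, spend one
# real forward pass, measure the offset between the new expert's feature
# and the stale polynomial's prediction, and translate the fitted curve.
import numpy as np
from numpy.polynomial import chebyshev as C

def bias_shift(coeffs: np.ndarray, t_switch: float, h_new: float) -> np.ndarray:
    """Shift an existing Chebyshev fit so it passes through (t_switch, h_new)."""
    h_old = C.chebval(t_switch, coeffs)   # \hat{h}_old: stale fit's prediction
    delta = h_new - h_old                 # Δ = h_new - \hat{h}_old
    shifted = coeffs.copy()
    shifted[0] += delta                   # T_0(t) = 1, so bumping c_0 shifts
    return shifted                        # the whole curve by Δ
```

Since $T_0(t) = 1$, adding $\Delta$ to the constant coefficient translates the entire fitted curve without any re-fitting, at the cost of exactly one extra compute step.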
Are there any official plans to support time-split MoE architectures like Wan 2.2?
How would you mathematically approach forecasting across an expert transition?