
[Feature Request] Adapting Spectrum Acceleration for Wan 2.2 MoE Architecture (Discontinuity Issue) #2

@tvijas


Hi @hanjq17 (Jiaqi Han) and the Spectrum team,

First of all, thank you for the outstanding work on Adaptive Spectral Feature Forecasting (Spectrum). The mathematical elegance of moving from local Taylor approximations to global Chebyshev polynomials, combined with the insight of only forecasting the final block, is a massive leap forward for DiT acceleration. It works flawlessly on standard dense models.

I am writing to request/discuss potential support for the newly released Wan 2.2 architecture (specifically the T2V-A14B and I2V-A14B models).

The Challenge: MoE Split by Timesteps

Unlike Wan 2.1 or FLUX, Wan 2.2 introduces a unique Mixture-of-Experts (MoE) architecture that separates the denoising process across timesteps.
According to their technical report:

  • A High-noise Expert is used for the early stages (layout and structure).
  • A Low-noise Expert is used for the later stages (details and textures).
  • The model dynamically switches completely from Expert A to Expert B mid-generation based on an SNR threshold ($t_{moe}$).
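To make the routing concrete, here is a minimal sketch of the timestep-split expert selection described above. The names `t_moe`, `high_noise_expert`, and `low_noise_expert` are illustrative, not the actual Wan 2.2 API:

```python
def select_expert(t, t_moe, high_noise_expert, low_noise_expert):
    """Route a denoising step to exactly one expert based on the timestep.

    Wan 2.2 runs the full parameter set of a single expert per step:
    the high-noise expert early (layout/structure, high t), the
    low-noise expert late (details/textures, low t). Hypothetical names.
    """
    return high_noise_expert if t >= t_moe else low_noise_expert
```

Unlike token-routed MoE, the switch here is a hard, one-time swap of all active parameters, which is exactly what breaks a forecaster fitted on only one side of it.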

Why the Current Spectrum Algorithm Breaks

Since Spectrum assumes a continuous trajectory across the entire denoising process:

  1. The warm_up_steps phase (e.g., the first 5 steps) fits the Chebyshev polynomials exclusively to the feature space of the High-noise Expert.
  2. When the solver crosses the $t_{moe}$ threshold, Wan 2.2 abruptly swaps in 14 billion different parameters (the Low-noise Expert).
  3. This creates a mathematical discontinuity. The polynomials trained on Expert A attempt to predict features for Expert B, causing the approximation error to explode and completely destroying the output.
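To make the failure mode concrete, here is a toy numerical illustration (not Spectrum's actual feature pipeline): a Chebyshev polynomial fitted on a smooth "Expert A" trajectory extrapolates tolerably up to the switch point, but its error jumps by roughly the size of the feature offset the moment the discontinuous "Expert B" trajectory takes over. The `feature` function and all constants are invented for the demo:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def feature(t, t_moe=0.5, offset=3.0):
    """Toy scalar feature trajectory with an expert switch at t_moe.

    Expert A: smooth curve. Expert B: same shape plus a jump, standing
    in for 14B different parameters producing a shifted feature space.
    """
    base = np.sin(2 * np.pi * t)
    return base if t < t_moe else base + offset  # discontinuous at t_moe

# Warm-up samples come only from Expert A's side of the switch.
warmup_t = np.linspace(0.0, 0.4, 6)
coeffs = C.chebfit(warmup_t, [feature(t) for t in warmup_t], deg=3)

err_before = abs(C.chebval(0.45, coeffs) - feature(0.45))  # still Expert A
err_after = abs(C.chebval(0.55, coeffs) - feature(0.55))   # Expert B now
```

In this toy setup `err_after` is dominated by the jump itself, no matter how well the warm-up fit captured Expert A's dynamics.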

Potential Solutions for Discussion

I was exploring how to implement this in ComfyUI and identified a few theoretical ways to work around this, but I would love to hear your authoritative mathematical perspective:

  1. Dual Warm-up (Re-fitting): Resetting the cache exactly at $t_{moe}$ and forcing a secondary warm_up_steps phase for the Low-noise Expert (though this sacrifices part of the speed-up).
  2. High-Noise Only Acceleration: Running Spectrum exclusively during the High-noise phase to establish geometry rapidly, then turning forecasting off entirely for the Low-noise phase to let the model refine textures without approximation errors.
  3. 1-Step Bias Shift (Zero-shot projection): Forcing exactly one compute step when the expert switches, calculating the delta/bias between the new expert's output and the polynomial's prediction ($\Delta = h_{new} - \hat{h}_{old}$), and simply shifting the existing Chebyshev curve into the new coordinate space without a full 5-step retraining.
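Option 3 can be sketched in a few lines. This is a hedged illustration, not Spectrum's real cache API: `BiasShiftedForecaster`, `predict`, and `recalibrate` are hypothetical names, and the bias is modeled as a scalar shift per forecasted feature:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

class BiasShiftedForecaster:
    """Sketch of the zero-shot projection idea: instead of re-running a
    full warm-up for the Low-noise Expert, translate the existing
    Chebyshev curve by one measured delta at the expert switch."""

    def __init__(self, cheb_coeffs):
        self.coeffs = np.asarray(cheb_coeffs)  # fitted during high-noise warm-up
        self.bias = 0.0                        # delta applied after the switch

    def predict(self, t):
        return C.chebval(t, self.coeffs) + self.bias

    def recalibrate(self, t_switch, h_new):
        # One forced compute step at t_moe: measure Δ = h_new − ĥ_old
        # between the new expert's true output and the old curve's
        # prediction, then shift the whole curve by Δ.
        self.bias = h_new - C.chebval(t_switch, self.coeffs)
```

The implicit assumption is that the two experts' feature trajectories differ mainly by a translation near $t_{moe}$; if their curvature also diverges, a single bias shift would only delay the error blow-up rather than prevent it, which is part of what I would love your perspective on.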

Are there any official plans to support time-split MoE architectures like Wan 2.2?

How would you mathematically approach forecasting across an expert transition?
