context_mask granularity when a video_vae compresses the temporal dimension

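Thank you for your work; it has given me a lot of inspiration.
I would like to apply your method to other scenarios. I noticed that context_mask has shape (batch_size, n_tokens). This means that if a video_vae compresses the T dimension of a (C, T, H, W) video, then every 4 frames (assuming a temporal compression ratio of 4) share the same noise_levels. Does this mean that at inference time the History-Guide must be given in groups of 4 consecutive frames?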
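
To make the question concrete, here is a minimal sketch of how I understand the mapping from frames to latent tokens. It is not taken from the repository: `frames_to_latent_mask`, the `temporal_compression` parameter, and the rule "a latent token counts as context only if all of its frames are context" are my own assumptions for illustration.

```python
def frames_to_latent_mask(frame_is_context, temporal_compression=4):
    """Collapse per-frame context flags into per-latent-token flags.

    Hypothetical helper: assumes each latent token covers exactly
    `temporal_compression` consecutive frames, and that a token can only
    be treated as history if every frame it covers is a history frame.
    """
    if len(frame_is_context) % temporal_compression != 0:
        raise ValueError("frame count must be a multiple of the compression factor")

    latent_mask = []
    for start in range(0, len(frame_is_context), temporal_compression):
        group = frame_is_context[start:start + temporal_compression]
        # One flag (and hence one noise level) per group of frames.
        latent_mask.append(all(group))
    return latent_mask


# 12 frames: the first 6 are history, the last 6 are to be generated.
frame_flags = [True] * 6 + [False] * 6
print(frames_to_latent_mask(frame_flags))  # [True, False, False]
```

Under these assumptions, frames 5 and 6 cannot be used as history by themselves, because they share a latent token with two frames that still have to be generated. That is the situation I am asking about.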