Bozhou Li, Yushuo Guan, Haolin Li, Bohan Zeng, Yiyan Ji, Yue Ding, Pengfei Wan, Kun Gai, Yuanxing Zhang, Wentao Zhang
Peking University | Kuaishou Technology | Fudan University | Nanjing University | UCAS
Preprint
If this repository helps you, please give it a ⭐ for updates!
SemanticRouting enhances text-to-image DiT models by introducing a dynamic semantic routing mechanism for multi-layer LLM features. Traditional methods rely on static or single-layer text conditioning, failing to account for the semantic hierarchy in LLMs and the non-stationary denoising dynamics of diffusion models (over time and network depth).
This repository implements a unified normalized convex fusion framework equipped with lightweight gates. We systematically explore:
-
Time-wise Fusion: Adapting fusion weights to the diffusion timestep
$t$ . -
Depth-wise Fusion: Adapting weights to the DiT block index
$d$ . - Joint Fusion: Combining both time and depth adaptivity.
Our findings establish Depth-wise Semantic Routing as the superior strategy, significantly improving text-image alignment and compositional generation (e.g., +9.97 on GenAI-Bench Counting), while purely time-wise fusion can suffer from trajectory mismatch issues during inference.
- 🚀 Enhanced Alignment: Delivers strong GenEval and GenAI-Bench results by leveraging hierarchical LLM semantics.
- 🧩 Unified Framework: Easily switch between fusion strategies (
uniform,static,time-wise,depth-wise,joint) via simple config files. - 🧠 Semantic Routing: Introduces learnable gating mechanisms to route information from appropriate LLM layers to specific DiT blocks.
We evaluated our fusion strategies against strong baselines including Penultimate layer (B1), Uniform averaging (B2), Static fusion (B3), and FuseDiT.
| Method | GenEval ↑ | GenAI-Bench ↑ | UnifiedReward ↑ |
|---|---|---|---|
| Baselines | |||
| B1: Penultimate | 64.54 | 74.96 | 3.02 |
| B2: Uniform | 66.51 | 76.82 | 3.06 |
| B3: Static | 64.77 | 76.31 | 3.05 |
| Deep-fusion Baseline | |||
| FuseDiT | 60.95 | 75.02 | 3.05 |
| Our Strategies | |||
| S1: Time | 63.41 | 76.20 | 2.97 |
| S2: Depth | 67.07 | 79.07 | 3.06 |
| S3: Joint | 66.05 | 77.44 | 3.06 |
Table 1: Comparison on GenEval, GenAI-Bench, and UnifiedReward. S2 (Depth-wise) achieves the best overall performance.
We introduce a unified formulation for multi-layer fusion. The final fused representation
Where the weights
We parameterize
-
Time-wise:
$z_{t,d} = g_{\psi}(\phi(t))$ (Time-Conditioned Fusion Gate). -
Depth-wise:
$z_{t,d} = \beta_{d}$ (Block-specific learnable weights). -
Joint:
$z_{t,d} = g_{\psi_d}(\phi(t))$ (Depth-specific TCFG).
1. Clone & create environment (Python 3.12)
git clone https://github.com/zooblastlbz/SemanticRouting.git
cd SemanticRouting
conda create -n semanticrouting python=3.12 -y
conda activate semanticroutingInstall deps:
pip install -r requirements.txtThe repository expects data in JSON or JSONL format. Default keys are image and text, which can be overridden in config via data.image_key and data.text_key.
image: File path (resolved relative todata.image_rootif provided).text: String or list of strings.
Example:
{
"image": "images/sample_0001.jpg",
"text": "A scenic mountain lake at sunrise"
}Choose a configuration file from configs/ to define your fusion strategy:
configs/uniform.yaml: Simple averaging of layers.configs/static.yaml: Learnable global weights.configs/time-wise.yaml: Time-dependent gating.configs/depth-wise.yaml: Depth-dependent gating (Recommended).configs/joint.yaml: Combined time and depth gating.
Run via the launcher script (set env vars as needed):
ACCELERATE_CONFIG=./accelerate_config.yaml \
CONFIG_FILE=./configs/depth-wise.yaml \
PYTHON_BIN=python \
MASTER_ADDR=127.0.0.1 \
MASTER_PORT=29500 \
bash scripts/train.shFirst, export the trained model and necessary components:
python utils/save_pipeline.py \
--checkpoint /path/to/checkpoint-dir \
--type adafusedit \
--vae /path/to/vae \
--scheduler /path/to/schedulerGenerate images from text prompts:
python inference.py \
--checkpoint /path/to/exported/pipeline \
--prompt "A city skyline at dusk" \
--resolution 512 \
--num_inference_steps 25 \
--guidance_scale 6.0Generate evaluation samples with Accelerate (for multi-GPU/multi-node) using the scripts in evaluation/:
- GenEval:
accelerate launch evaluation/sample_geneval.py evaluation/geneval.yaml
- GenAIBench:
accelerate launch evaluation/sample_genaibench.py evaluation/genaibench.yaml
- DrawBench:
accelerate launch evaluation/sample_drawbench.py evaluation/drawbench.yaml
If you use this code or the SemanticRouting paper, please cite:
@misc{li2026semanticroutingexploringmultilayer,
title={Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers},
author={Bozhou Li and Yushuo Guan and Haolin Li and Bohan Zeng and Yiyan Ji and Yue Ding and Pengfei Wan and Kun Gai and Yuanxing Zhang and Wentao Zhang},
year={2026},
eprint={2602.03510},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.03510},
}This codebase is adapted from and extends tang-bd/fuse-dit. We sincerely thank the original authors for their foundational work.
This project is licensed under the Apache-2.0 License - see the LICENSE file for details.