DynControlNet is a dynamic multi-condition ControlNet variant built on Stable Diffusion 1.5 using Hugging Face Diffusers.
DynControlNet extends ControlNet to support multiple condition inputs simultaneously (depth, normal, canny, segment) with uncertainty-aware encoding and time-dependent dynamic fusion.
- Multi-condition fusion: Simultaneously leverages depth, normal, canny, and segment conditions through a unified architecture, unlike standard ControlNet which handles only a single condition.
- Uncertainty-aware encoding: Each condition encoder produces reliability maps that allow the fusion module to down-weight noisy or uninformative conditions automatically.
- Time-dependent dynamic gating: Fusion weights adapt across diffusion timesteps, the model learns to rely on canny edges at fine scales and depth at coarse scales.
- Scale-aligned injection: 13 SoftZeroConv residuals inject control signals at matching resolutions into the frozen SD UNet, preserving generation quality.
![]() |
![]() |
Figure 1. Overall Architecture The SD 1.5 Unet is kept frozen. Dyncontrolnet processes condition images through encoding, dynamic fusion, and scale projection, then injects residuals into the Unet via SoftZeroConv.
Figure 2. UncertaintyAwareEncoder Each condition image is independently encoded by a CNN backbone and DoG frequency gate. The encoder outputs a feature map, uncertainty logits, and a reliability map used in fusion gating.
![]() |
![]() |
Figure 3. MultiScaleTimeDependentFusion Four condition features are fused at each of 4 levels using time-dependent gates modulated by reliability maps and temperature τ(t).
Figure 4. Scale Projection and Control Injection Fused features are projected to 4 scales (64×64, 32×32, 16×16, 8×8) and injected into the Control Branch. 13 SoftZeroConv residuals are passed to the frozen SD Unet.
All evaluations were conducted on 1,000 curated photo images with detailed Florence-2 captions, using 'control_scale=1.2', 'guidance_scale=7.5', 'steps=30', and 'seed=42'.
| Method | FID↓ | CLIP Score↑ | SSIM↑ | LPIPS↓ | Canny Corr↑ | Depth Corr↑ | MiDaS Corr↑ |
|---|---|---|---|---|---|---|---|
| DynControlNet (Ours) | 57.2 | 0.3356 | 0.3184 | 0.4810 | 0.1155 | 0.0928 | 0.9111 |
| SD 1.5 (no control) | 73.7 | 0.3284 | 0.2094 | 0.7905 | 0.0529 | 0.0126 | 0.5321 |
| ControlNet-Depth | 68.7 | 0.3326 | 0.2675 | 0.6479 | 0.0774 | 0.1572 | 0.9190 |
| ControlNet-Canny | 59.0 | 0.3336 | 0.3159 | 0.4922 | 0.2264 | 0.1803 | 0.8897 |
| ControlNet-Normal | 68.6 | 0.3309 | 0.2835 | 0.6383 | 0.0727 | 0.0888 | 0.8728 |
| ControlNet-Segment | 79.2 | 0.3205 | 0.2050 | 0.7304 | 0.0658 | 0.0526 | 0.6786 |
DynControlNet achieves the best FID (57.2) and best overall balance across all metrics compared to any single-condition baseline, demonstrating that dynamic multi-condition fusion produces higher-fidelity images with stronger structural coherence.
DynControlNet vs. SD 1.5 Vanilla:
- CLIP: 0.3284 → 0.3356 (+2.2% ↑)
- SSIM: 0.2094 → 0.3184 (+52.0% ↑)
- LPIPS: 0.7905 → 0.4810 (-39.1% ↑)
- Canny: 0.0529 → 0.1155 (+118.4% ↑)
| Method | Random FID↓ |
|---|---|
| DynControlNet (Ours) | 56.8 |
| SD 1.5 (no control) | 73.7 |
| ControlNet-Depth | 68.8 |
| ControlNet-Canny | 59.1 |
| ControlNet-Normal | 68.8 |
| ControlNet-Segment | 74.6 |
| Config | CLIP↑ | SSIM↑ | LPIPS↓ | Canny↑ | Depth↑ |
|---|---|---|---|---|---|
| DynControlNet (Ours) | 0.3384 | 0.3050 | 0.4782 | 0.1172 | 0.0921 |
| Depth only | 0.3389 | 0.2613 | 0.6452 | 0.0626 | 0.0647 |
| Normal only | 0.3334 | 0.2354 | 0.6385 | 0.0775 | 0.0382 |
| Canny only | 0.3369 | 0.2609 | 0.5987 | 0.0905 | 0.0371 |
| Segment only | 0.3357 | 0.2448 | 0.6851 | 0.0562 | 0.0352 |
All conditions use the same DynControlNet checkpoint, evaluated with 200 samples.
Multi-condition advantage:
- vs depth_only: ssim +16.7%
- vs depth_only: lpips -25.9%
- vs normal_only: ssim +29.6%
- vs normal_only: lpips -25.1%
- vs canny_only: ssim +16.9%
- vs canny_only: lpips -20.1%
- vs segment_only: ssim +24.6%
- vs segment_only: lpips -30.2%
| Seed | CLIP↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| 42 | 0.3340 | 0.3186 | 0.4763 |
| 123 | 0.3336 | 0.3248 | 0.4749 |
| 456 | 0.3364 | 0.3162 | 0.4827 |
| 789 | 0.3298 | 0.3257 | 0.4780 |
| 2026 | 0.3338 | 0.3259 | 0.4775 |
The fusion module learns distinct gating patterns across scales and time steps:
| Scale | Depth | Normal | Canny | Segment |
|---|---|---|---|---|
| 64x64 | 0.2085 | 0.2814 | 0.7775 | 0.3088 |
| 32x32 | 0.2954 | 0.3666 | 0.5083 | 0.2485 |
| 16x16 | 0.6409 | 0.3423 | 0.1806 | 0.2519 |
| 8x8 | 0.5219 | 0.2210 | 0.5435 | 0.2389 |
At the finest scale (64×64), the model strongly favors canny edges for precise boundary generation. At coarser scales (16×16, 8×8), depth becomes dominant for global spatial layout. All conditions show meaningful temporal variation across denoising steps.
git clone https://github.com/gihyuness/DynControlNet.git
cd DynControlNet
pip install -r requirements.txtimport torch
from diffusers import AutoencoderKL, UniPCMultistepScheduler
from safetensors.torch import load_file
from transformers import AutoTokenizer, CLIPTextModel
from PIL import Image
from models.dyncontrolnet import DynControlNetModel
from models.unet import UNet2DConditionModel
from pipeline.dyncontrolnet_pipeline import StableDiffusionDynControlNetPipeline
pretrained = "runwayml/stable-diffusion-v1-5"
tokenizer = AutoTokenizer.from_pretrained(pretrained, subfolder="tokenizer", use_fast=False)
text_encoder = CLIPTextModel.from_pretrained(pretrained, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
unet = UNet2DConditionModel.from_pretrained(pretrained, subfolder="unet")
dyncontrolnet = DynControlNetModel.from_unet(
unet, condition_types=["depth", "normal", "canny", "segment"],
cond_feat_channels=320, copy_unet_weights=False,
)
state_dict = load_file(
r"C:\Users\Gihyun\PycharmProjects\CDM\dyncontrolnet_model_github\checkpoint-33372\model.safetensors",
device="cpu",
)
dyncontrolnet.load_state_dict(state_dict, strict=False)
pipe = StableDiffusionDynControlNetPipeline.from_pretrained(
pretrained, vae=vae, text_encoder=text_encoder, tokenizer=tokenizer,
unet=unet, dyncontrolnet=dyncontrolnet,
safety_checker=None, torch_dtype=torch.float16,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
condition_images = {
"depth": Image.open("data/depth.png"),
"normal": Image.open("data/normal.png"),
"canny": Image.open("data/canny.png"),
"segment": Image.open("data/segment.png"),
}
seed = 12345
generator = torch.Generator(device="cuda").manual_seed(seed)
with torch.autocast("cuda"):
output = pipe(
prompt="A realistic photo of a playful tabby and white "
"cat standing upright on its hind legs indoors, "
"reaching up with its front paws to grab a bright white cotton string, "
"above. Full body in frame with headroom, vertical composition, "
"subject centered, wooden floor and a cozy carpet in the background, "
"warm indoor lighting, sharp focus, detailed fur texture, "
"natural colors.",
negative_prompt="low quality, blurry",
condition_images=condition_images,
num_inference_steps=50,
guidance_scale=7.5,
control_scale=1.0,
generator=generator,
)
output.images[0].save("sample.png")DynControlNet was trained on a paired dataset of ~30,000 image-condition pairs over 3 days on a RTX 4060 Ti 16GB.
Due to hardware constraints (personal computer and GPU), large-scale training was not feasible, but the model demonstrates meaningful multi-condition control even under limited training.
| Setting | Value |
|---|---|
| Dataset size | ~30,000 paired samples |
| Training duration | ~3 days |
| Hardware | RTX 4060 Ti 16GB |
| Base model | Stable Diffusion 1.5 (frozen) |
Training data is stored in HDF5 format. Each entry contains:
entry_0/
├── image # Original RGB image (H, W, 3)
├── depth # Depth map (H, W, 3)
├── normal # Normal map (H, W, 3)
├── canny # Canny edge map (H, W, 3)
├── segment # Segmentation map (H, W, 3)
└── attrs:
└── caption # Text description
| Metric | What it Measures | Ideal |
|---|---|---|
| FID | Distribution-level image quality vs. real images | Lower is better |
| CLIP Score | Text-image semantic alignment | Higher is better |
| SSIM | Structural similarity to ground truth | Higher is better |
| LPIPS | Perceptual difference from ground truth | Lower is better |
| Canny Alignment | Edge map IoU between generated and condition | Higher is better |
| Depth Correlation | Structural alignment with depth condition (gradient / MiDaS) | Higher is better |
DynControlNet/
├── models/
│ ├── dyncontrolnet.py # DynControlNetModel — main model class
│ ├── encoders.py # UncertaintyAwareEncoder
│ ├── fusion.py # MultiScaleTimeDependentFusion
│ ├── gates.py # DoGFrequencyGate
│ ├── projections.py # SoftZeroConv, SoftProjection
│ └── unet/ # SD 1.5 UNet (from diffusers)
├── pipeline/
│ ├── dyncontrolnet_pipeline.py # Full inference pipeline
│ ├── condition_preprocess.py # Condition image preprocessing
│ ├── denoise_loop.py # Denoising loop with control injection
│ ├── latents.py # Latent preparation utilities
│ ├── prompt_encoder.py # CLIP text encoding
│ ├── safety_and_decode.py # VAE decode & safety checker
│ └── types.py # Type aliases
├── train/
│ ├── trainer.py # Main training loop
│ ├── cli.py # Argument parser
│ ├── data.py # HDF5Dataset & augmentations
│ ├── dropout.py # Structured condition dropout
│ ├── checkpoints.py # Checkpoint save/load
│ ├── validation.py # Validation image generation
│ ├── metrics.py # Training metrics logging
│ └── models.py # Model initialization helpers
├── scripts/
│ └── train_dyncontrolnet.py # Training entry point
├── data/ # Example condition images
├── requirements.txt
├── LICENSE
└── README.md
python -m scripts.train_dyncontrolnet \
--pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
--dataset_path="datasets/train_dataset_style_final.h5" \
--train_batch_size=2 --num_train_epochs=40 \
--output_dir="dyncontrolnet_model" \
--logging_dir="logs" \
--resolution=512 \
--learning_rate=5e-6 \
--condition_types depth normal canny segment \
--validation_conditions "data/depth_image.png" "data/normal_image.png"
"data/canny_image.png" "data/segment_image.png" \
--validation_prompt "A realistic photo of a playful tabby and white
cat standing upright on its hind legs indoors,
reaching up with its front paws to grab a bright white cotton string
above. Full body in frame with headroom, vertical composition,
subject centered, wooden floor and a cozy carpet in the background,
warm indoor lighting, sharp focus, detailed fur texture,
natural colors." \
--report_to="tensorboard" \
--checkpoints_total_limit=3 \
--enable_xformers_memory_efficient_attention \
--gradient_accumulation_steps=8 \
--lr_scheduler constant \
--mixed_precision fp16 \
--lr_warmup_steps=200 \
--set_grads_to_none \
--max_grad_norm=1.0 \
--noise_offset 0.05 \
--log_tb_images \
--gradient_checkpointing \| Feature | Description |
|---|---|
| Random Control Scale | control_scale is uniformly sampled from [scale_min, scale_max] each step, improving inference robustness. |
| Structured Condition Dropout | Each condition is independently dropped with probability p=0.1, and all conditions are dropped simultaneously with p=0.05, teaching the model graceful degradation. |
| Differential Learning Rates | Custom modules (encoders, fuser, projections) use 5× the base learning rate for faster convergence. |
| Synchronized Augmentation | Random flip and crop are applied identically to the original image and all condition maps, maintaining spatial correspondence. |
| Gradient Checkpointing | Reduces VRAM usage at the cost of ~20% slower training. |
This project is licensed under the MIT License.
- Stable Diffusion by CompVis / Stability AI
- ControlNet by Lvmin Zhang et al.
- Hugging Face Diffusers
- PyTorch
- Hugging Face Transformers
- Safetensors
- CLIP by OpenAI (used for evaluation)
- LPIPS
- Fréchet Inception Distance (FID)
- Depth Anything V2 (depth estimation)
- DSINE / DSINE-hub (normal map estimation)
- Mask2Former (semantic segmentation)
- controlnet_aux (edge extraction)
- MS COCO Dataset
- Florence-2 by Microsoft (used for dataset captioning)



















