DynControlNet

DynControlNet is a dynamic multi-condition ControlNet variant built on Stable Diffusion 1.5 using Hugging Face Diffusers.

DynControlNet extends ControlNet to support multiple condition inputs simultaneously (depth, normal, canny, segment) with uncertainty-aware encoding and time-dependent dynamic fusion.

Key Contributions

Multi-condition fusion: Simultaneously leverages depth, normal, canny, and segment conditions through a unified architecture, unlike standard ControlNet which handles only a single condition.
Uncertainty-aware encoding: Each condition encoder produces reliability maps that allow the fusion module to down-weight noisy or uninformative conditions automatically.
Time-dependent dynamic gating: Fusion weights adapt across diffusion timesteps, the model learns to rely on canny edges at fine scales and depth at coarse scales.
Scale-aligned injection: 13 SoftZeroConv residuals inject control signals at matching resolutions into the frozen SD UNet, preserving generation quality.

Architecture

Figure 1. Overall Architecture The SD 1.5 Unet is kept frozen. Dyncontrolnet processes condition images through encoding, dynamic fusion, and scale projection, then injects residuals into the Unet via SoftZeroConv.

Figure 2. UncertaintyAwareEncoder Each condition image is independently encoded by a CNN backbone and DoG frequency gate. The encoder outputs a feature map, uncertainty logits, and a reliability map used in fusion gating.

Figure 3. MultiScaleTimeDependentFusion Four condition features are fused at each of 4 levels using time-dependent gates modulated by reliability maps and temperature τ(t).

Figure 4. Scale Projection and Control Injection Fused features are projected to 4 scales (64×64, 32×32, 16×16, 8×8) and injected into the Control Branch. 13 SoftZeroConv residuals are passed to the frozen SD Unet.

Qualitative Results

Full Multi-Condition Generation

Condition-wise Generation Breakdown

Quantitative Results

All evaluations were conducted on 1,000 curated photo images with detailed Florence-2 captions, using 'control_scale=1.2', 'guidance_scale=7.5', 'steps=30', and 'seed=42'.

Method Comparison

Method	FID↓	CLIP Score↑	SSIM↑	LPIPS↓	Canny Corr↑	Depth Corr↑	MiDaS Corr↑
DynControlNet (Ours)	57.2	0.3356	0.3184	0.4810	0.1155	0.0928	0.9111
SD 1.5 (no control)	73.7	0.3284	0.2094	0.7905	0.0529	0.0126	0.5321
ControlNet-Depth	68.7	0.3326	0.2675	0.6479	0.0774	0.1572	0.9190
ControlNet-Canny	59.0	0.3336	0.3159	0.4922	0.2264	0.1803	0.8897
ControlNet-Normal	68.6	0.3309	0.2835	0.6383	0.0727	0.0888	0.8728
ControlNet-Segment	79.2	0.3205	0.2050	0.7304	0.0658	0.0526	0.6786

DynControlNet achieves the best FID (57.2) and best overall balance across all metrics compared to any single-condition baseline, demonstrating that dynamic multi-condition fusion produces higher-fidelity images with stronger structural coherence.

DynControlNet vs. SD 1.5 Vanilla:

CLIP: 0.3284 → 0.3356 (+2.2% ↑)
SSIM: 0.2094 → 0.3184 (+52.0% ↑)
LPIPS: 0.7905 → 0.4810 (-39.1% ↑)
Canny: 0.0529 → 0.1155 (+118.4% ↑)

Random FID (Unconditional Distribution Quality)

Method	Random FID↓
DynControlNet (Ours)	56.8
SD 1.5 (no control)	73.7
ControlNet-Depth	68.8
ControlNet-Canny	59.1
ControlNet-Normal	68.8
ControlNet-Segment	74.6

Multi-Condition Ablation Study

Config	CLIP↑	SSIM↑	LPIPS↓	Canny↑	Depth↑
DynControlNet (Ours)	0.3384	0.3050	0.4782	0.1172	0.0921
Depth only	0.3389	0.2613	0.6452	0.0626	0.0647
Normal only	0.3334	0.2354	0.6385	0.0775	0.0382
Canny only	0.3369	0.2609	0.5987	0.0905	0.0371
Segment only	0.3357	0.2448	0.6851	0.0562	0.0352

All conditions use the same DynControlNet checkpoint, evaluated with 200 samples.

Multi-condition advantage:

vs depth_only: ssim +16.7%
vs depth_only: lpips -25.9%
vs normal_only: ssim +29.6%
vs normal_only: lpips -25.1%
vs canny_only: ssim +16.9%
vs canny_only: lpips -20.1%
vs segment_only: ssim +24.6%
vs segment_only: lpips -30.2%

Seed Stability

Seed	CLIP↑	SSIM↑	LPIPS↓
42	0.3340	0.3186	0.4763
123	0.3336	0.3248	0.4749
456	0.3364	0.3162	0.4827
789	0.3298	0.3257	0.4780
2026	0.3338	0.3259	0.4775

Dynamic Weight Analysis

The fusion module learns distinct gating patterns across scales and time steps:

Scale	Depth	Normal	Canny	Segment
64x64	0.2085	0.2814	0.7775	0.3088
32x32	0.2954	0.3666	0.5083	0.2485
16x16	0.6409	0.3423	0.1806	0.2519
8x8	0.5219	0.2210	0.5435	0.2389

At the finest scale (64×64), the model strongly favors canny edges for precise boundary generation. At coarser scales (16×16, 8×8), depth becomes dominant for global spatial layout. All conditions show meaningful temporal variation across denoising steps.

Quick Start

Installation

git clone https://github.com/gihyuness/DynControlNet.git
cd DynControlNet
pip install -r requirements.txt

Inference

import torch
from diffusers import AutoencoderKL, UniPCMultistepScheduler
from safetensors.torch import load_file
from transformers import AutoTokenizer, CLIPTextModel
from PIL import Image
from models.dyncontrolnet import DynControlNetModel
from models.unet import UNet2DConditionModel
from pipeline.dyncontrolnet_pipeline import StableDiffusionDynControlNetPipeline

pretrained = "runwayml/stable-diffusion-v1-5"
tokenizer = AutoTokenizer.from_pretrained(pretrained, subfolder="tokenizer", use_fast=False)
text_encoder = CLIPTextModel.from_pretrained(pretrained, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
unet = UNet2DConditionModel.from_pretrained(pretrained, subfolder="unet")
dyncontrolnet = DynControlNetModel.from_unet(
    unet, condition_types=["depth", "normal", "canny", "segment"],
    cond_feat_channels=320, copy_unet_weights=False,
)
state_dict = load_file(
    r"C:\Users\Gihyun\PycharmProjects\CDM\dyncontrolnet_model_github\checkpoint-33372\model.safetensors",
    device="cpu",
)
dyncontrolnet.load_state_dict(state_dict, strict=False)
pipe = StableDiffusionDynControlNetPipeline.from_pretrained(
    pretrained, vae=vae, text_encoder=text_encoder, tokenizer=tokenizer,
    unet=unet, dyncontrolnet=dyncontrolnet,
    safety_checker=None, torch_dtype=torch.float16,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

condition_images = {
    "depth":   Image.open("data/depth.png"),
    "normal":  Image.open("data/normal.png"),
    "canny":   Image.open("data/canny.png"),
    "segment": Image.open("data/segment.png"),
}

seed = 12345
generator = torch.Generator(device="cuda").manual_seed(seed)

with torch.autocast("cuda"):
    output = pipe(
        prompt="A realistic photo of a playful tabby and white "
               "cat standing upright on its hind legs indoors, "
               "reaching up with its front paws to grab a bright white cotton string, "
               "above. Full body in frame with headroom, vertical composition, "
               "subject centered, wooden floor and a cozy carpet in the background, "
               "warm indoor lighting, sharp focus, detailed fur texture, "
               "natural colors.",
        negative_prompt="low quality, blurry",
        condition_images=condition_images,
        num_inference_steps=50,
        guidance_scale=7.5,
        control_scale=1.0,
        generator=generator,
    )
output.images[0].save("sample.png")

Training

DynControlNet was trained on a paired dataset of ~30,000 image-condition pairs over 3 days on a RTX 4060 Ti 16GB.

Due to hardware constraints (personal computer and GPU), large-scale training was not feasible, but the model demonstrates meaningful multi-condition control even under limited training.

Setting	Value
Dataset size	~30,000 paired samples
Training duration	~3 days
Hardware	RTX 4060 Ti 16GB
Base model	Stable Diffusion 1.5 (frozen)

Dataset Format

Training data is stored in HDF5 format. Each entry contains:

entry_0/
  ├── image        # Original RGB image (H, W, 3)
  ├── depth        # Depth map (H, W, 3)
  ├── normal       # Normal map (H, W, 3)
  ├── canny        # Canny edge map (H, W, 3)
  ├── segment      # Segmentation map (H, W, 3)
  └── attrs:
      └── caption  # Text description

Evaluation Metrics

Metric	What it Measures	Ideal
FID	Distribution-level image quality vs. real images	Lower is better
CLIP Score	Text-image semantic alignment	Higher is better
SSIM	Structural similarity to ground truth	Higher is better
LPIPS	Perceptual difference from ground truth	Lower is better
Canny Alignment	Edge map IoU between generated and condition	Higher is better
Depth Correlation	Structural alignment with depth condition (gradient / MiDaS)	Higher is better

Project Structure

DynControlNet/
├── models/
│   ├── dyncontrolnet.py        # DynControlNetModel — main model class
│   ├── encoders.py             # UncertaintyAwareEncoder
│   ├── fusion.py               # MultiScaleTimeDependentFusion
│   ├── gates.py                # DoGFrequencyGate
│   ├── projections.py          # SoftZeroConv, SoftProjection
│   └── unet/                   # SD 1.5 UNet (from diffusers)
├── pipeline/
│   ├── dyncontrolnet_pipeline.py   # Full inference pipeline
│   ├── condition_preprocess.py     # Condition image preprocessing
│   ├── denoise_loop.py            # Denoising loop with control injection
│   ├── latents.py                 # Latent preparation utilities
│   ├── prompt_encoder.py          # CLIP text encoding
│   ├── safety_and_decode.py       # VAE decode & safety checker
│   └── types.py                   # Type aliases
├── train/
│   ├── trainer.py              # Main training loop
│   ├── cli.py                  # Argument parser
│   ├── data.py                 # HDF5Dataset & augmentations
│   ├── dropout.py              # Structured condition dropout
│   ├── checkpoints.py          # Checkpoint save/load
│   ├── validation.py           # Validation image generation
│   ├── metrics.py              # Training metrics logging
│   └── models.py               # Model initialization helpers
├── scripts/
│   └── train_dyncontrolnet.py  # Training entry point
├── data/                       # Example condition images
├── requirements.txt
├── LICENSE
└── README.md

Training Command

python -m scripts.train_dyncontrolnet \
   --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
   --dataset_path="datasets/train_dataset_style_final.h5" \
   --train_batch_size=2 --num_train_epochs=40 \
   --output_dir="dyncontrolnet_model" \
   --logging_dir="logs" \
   --resolution=512 \
   --learning_rate=5e-6 \
   --condition_types depth normal canny segment \
   --validation_conditions "data/depth_image.png" "data/normal_image.png" 
   "data/canny_image.png" "data/segment_image.png" \
   --validation_prompt "A realistic photo of a playful tabby and white 
   cat standing upright on its hind legs indoors, 
   reaching up with its front paws to grab a bright white cotton string 
   above. Full body in frame with headroom, vertical composition, 
   subject centered, wooden floor and a cozy carpet in the background, 
   warm indoor lighting, sharp focus, detailed fur texture, 
   natural colors." \
   --report_to="tensorboard" \
   --checkpoints_total_limit=3 \
   --enable_xformers_memory_efficient_attention \
   --gradient_accumulation_steps=8 \
   --lr_scheduler constant \
   --mixed_precision fp16 \
   --lr_warmup_steps=200 \
   --set_grads_to_none \
   --max_grad_norm=1.0 \
   --noise_offset 0.05 \
   --log_tb_images \
   --gradient_checkpointing \

Training Features

Feature	Description
Random Control Scale	control_scale is uniformly sampled from [scale_min, scale_max] each step, improving inference robustness.
Structured Condition Dropout	Each condition is independently dropped with probability p=0.1, and all conditions are dropped simultaneously with p=0.05, teaching the model graceful degradation.
Differential Learning Rates	Custom modules (encoders, fuser, projections) use 5× the base learning rate for faster convergence.
Synchronized Augmentation	Random flip and crop are applied identically to the original image and all condition maps, maintaining spatial correspondence.
Gradient Checkpointing	Reduces VRAM usage at the cost of ~20% slower training.

License

This project is licensed under the MIT License.

Acknowledgments

Core Models & Frameworks

Stable Diffusion by CompVis / Stability AI
ControlNet by Lvmin Zhang et al.
Hugging Face Diffusers
PyTorch
Hugging Face Transformers
Safetensors

Evaluation

CLIP by OpenAI (used for evaluation)
LPIPS
Fréchet Inception Distance (FID)

Conditioning & Preprocessing

Depth Anything V2 (depth estimation)
DSINE / DSINE-hub (normal map estimation)
Mask2Former (semantic segmentation)
controlnet_aux (edge extraction)

Data

MS COCO Dataset
Florence-2 by Microsoft (used for dataset captioning)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DynControlNet

Key Contributions

Architecture

Qualitative Results

Full Multi-Condition Generation

Condition-wise Generation Breakdown

Quantitative Results

Method Comparison

Random FID (Unconditional Distribution Quality)

Multi-Condition Ablation Study

Seed Stability

Dynamic Weight Analysis

Quick Start

Installation

Inference

Training

Dataset Format

Evaluation Metrics

Project Structure

Training Command

Training Features

License

Acknowledgments

Core Models & Frameworks

Evaluation

Conditioning & Preprocessing

Data

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
assets/images		assets/images
data		data
models		models
pipeline		pipeline
scripts		scripts
train		train
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
generate_image.py		generate_image.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

DynControlNet

Key Contributions

Architecture

Qualitative Results

Full Multi-Condition Generation

Condition-wise Generation Breakdown

Quantitative Results

Method Comparison

Random FID (Unconditional Distribution Quality)

Multi-Condition Ablation Study

Seed Stability

Dynamic Weight Analysis

Quick Start

Installation

Inference

Training

Dataset Format

Evaluation Metrics

Project Structure

Training Command

Training Features

License

Acknowledgments

Core Models & Frameworks

Evaluation

Conditioning & Preprocessing

Data

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages