DynControlNet

DynControlNet is a dynamic multi-condition ControlNet variant built on Stable Diffusion 1.5 using Hugging Face Diffusers.

DynControlNet extends ControlNet to support multiple condition inputs simultaneously (depth, normal, canny, segment) with uncertainty-aware encoding and time-dependent dynamic fusion.

Key Contributions

  • Multi-condition fusion: Simultaneously leverages depth, normal, canny, and segment conditions through a unified architecture, unlike standard ControlNet which handles only a single condition.
  • Uncertainty-aware encoding: Each condition encoder produces reliability maps that allow the fusion module to down-weight noisy or uninformative conditions automatically.
  • Time-dependent dynamic gating: Fusion weights adapt across diffusion timesteps and scales; for example, the model learns to rely on canny edges at fine scales and on depth at coarse scales.
  • Scale-aligned injection: 13 SoftZeroConv residuals inject control signals at matching resolutions into the frozen SD UNet, preserving generation quality.
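The SoftZeroConv idea can be sketched as a zero-initialized 1×1 convolution with a learnable soft gate, so control residuals start at exactly zero and grow gradually during training. This is a minimal illustration under those assumptions; the repo's `models/projections.py` is authoritative:

```python
import torch
import torch.nn as nn

class SoftZeroConv(nn.Module):
    """Sketch: zero-initialized 1x1 conv scaled by a learnable soft gate,
    so the injected residual is zero at init and the frozen UNet's output
    is initially unchanged. (Illustrative, not the repo's exact module.)"""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)
        # tanh(gate) keeps the residual scale in (-1, 1); starts at 0
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.gate) * self.conv(x)

layer = SoftZeroConv(channels=8)
residual = layer(torch.randn(1, 8, 16, 16))  # all zeros at initialization
```

At initialization the residual is exactly zero, which is what preserves the base model's generation quality before the control branch has learned anything.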

Architecture



Figure 1. Overall Architecture. The SD 1.5 UNet is kept frozen. DynControlNet processes condition images through encoding, dynamic fusion, and scale projection, then injects residuals into the UNet via SoftZeroConv.

Figure 2. UncertaintyAwareEncoder. Each condition image is independently encoded by a CNN backbone and a DoG frequency gate. The encoder outputs a feature map, uncertainty logits, and a reliability map used in fusion gating.

Figure 3. MultiScaleTimeDependentFusion. Four condition features are fused at each of 4 levels using time-dependent gates modulated by reliability maps and a temperature τ(t).

Figure 4. Scale Projection and Control Injection. Fused features are projected to 4 scales (64×64, 32×32, 16×16, 8×8) and injected into the Control Branch; 13 SoftZeroConv residuals are passed to the frozen SD UNet.
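The reliability- and time-modulated gating of Figures 2–3 can be sketched roughly as follows. All module and parameter names here are illustrative assumptions, not the repo's actual `MultiScaleTimeDependentFusion` API: gate logits come from a timestep embedding, a temperature τ(t) sharpens or smooths the softmax, and per-condition reliability maps re-weight the result before fusion:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeDependentGate(nn.Module):
    """Sketch of one fusion level: softmax gates over K conditions,
    temperature-scaled by tau(t) and re-weighted by reliability maps.
    (Names and shapes are assumptions for illustration.)"""

    def __init__(self, num_conditions: int, time_dim: int = 32):
        super().__init__()
        self.to_logits = nn.Linear(time_dim, num_conditions)
        self.to_tau = nn.Linear(time_dim, 1)

    def forward(self, feats, reliability, t_emb):
        # feats: (B, K, C, H, W); reliability: (B, K, 1, H, W); t_emb: (B, time_dim)
        tau = F.softplus(self.to_tau(t_emb)) + 0.5           # tau(t) > 0.5
        logits = self.to_logits(t_emb) / tau                 # (B, K)
        gates = F.softmax(logits, dim=-1)[:, :, None, None, None]
        weights = gates * reliability                        # down-weight unreliable pixels
        weights = weights / weights.sum(dim=1, keepdim=True).clamp_min(1e-6)
        return (weights * feats).sum(dim=1)                  # (B, C, H, W)

B, K, C, H, W = 2, 4, 8, 16, 16
gate = TimeDependentGate(num_conditions=K)
fused = gate(torch.randn(B, K, C, H, W),
             torch.rand(B, K, 1, H, W),
             torch.randn(B, 32))
```

Because the weights are renormalized per pixel, a condition with a low reliability map contributes little at that location even when its global gate is large.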

Qualitative Results

Full Multi-Condition Generation

Condition-wise Generation Breakdown

Quantitative Results

All evaluations were conducted on 1,000 curated photo images with detailed Florence-2 captions, using `control_scale=1.2`, `guidance_scale=7.5`, `steps=30`, and `seed=42`.

Method Comparison

| Method | FID↓ | CLIP Score↑ | SSIM↑ | LPIPS↓ | Canny Corr↑ | Depth Corr↑ | MiDaS Corr↑ |
|---|---|---|---|---|---|---|---|
| DynControlNet (Ours) | 57.2 | 0.3356 | 0.3184 | 0.4810 | 0.1155 | 0.0928 | 0.9111 |
| SD 1.5 (no control) | 73.7 | 0.3284 | 0.2094 | 0.7905 | 0.0529 | 0.0126 | 0.5321 |
| ControlNet-Depth | 68.7 | 0.3326 | 0.2675 | 0.6479 | 0.0774 | 0.1572 | 0.9190 |
| ControlNet-Canny | 59.0 | 0.3336 | 0.3159 | 0.4922 | 0.2264 | 0.1803 | 0.8897 |
| ControlNet-Normal | 68.6 | 0.3309 | 0.2835 | 0.6383 | 0.0727 | 0.0888 | 0.8728 |
| ControlNet-Segment | 79.2 | 0.3205 | 0.2050 | 0.7304 | 0.0658 | 0.0526 | 0.6786 |

DynControlNet achieves the best FID (57.2) and the best overall balance across metrics among the compared methods; single-condition baselines win only on the metric matching their own condition (e.g., ControlNet-Canny on Canny Corr). This suggests that dynamic multi-condition fusion produces higher-fidelity images with stronger structural coherence.

DynControlNet vs. SD 1.5 Vanilla:

  • CLIP: 0.3284 → 0.3356 (+2.2%)
  • SSIM: 0.2094 → 0.3184 (+52.0%)
  • LPIPS: 0.7905 → 0.4810 (−39.1%, lower is better)
  • Canny: 0.0529 → 0.1155 (+118.4%)

Random FID (Unconditional Distribution Quality)

| Method | Random FID↓ |
|---|---|
| DynControlNet (Ours) | 56.8 |
| SD 1.5 (no control) | 73.7 |
| ControlNet-Depth | 68.8 |
| ControlNet-Canny | 59.1 |
| ControlNet-Normal | 68.8 |
| ControlNet-Segment | 74.6 |

Multi-Condition Ablation Study

| Config | CLIP↑ | SSIM↑ | LPIPS↓ | Canny↑ | Depth↑ |
|---|---|---|---|---|---|
| DynControlNet (Ours) | 0.3384 | 0.3050 | 0.4782 | 0.1172 | 0.0921 |
| Depth only | 0.3389 | 0.2613 | 0.6452 | 0.0626 | 0.0647 |
| Normal only | 0.3334 | 0.2354 | 0.6385 | 0.0775 | 0.0382 |
| Canny only | 0.3369 | 0.2609 | 0.5987 | 0.0905 | 0.0371 |
| Segment only | 0.3357 | 0.2448 | 0.6851 | 0.0562 | 0.0352 |

All conditions use the same DynControlNet checkpoint, evaluated with 200 samples.

Multi-condition advantage:

  • vs depth_only: SSIM +16.7%, LPIPS −25.9%
  • vs normal_only: SSIM +29.6%, LPIPS −25.1%
  • vs canny_only: SSIM +16.9%, LPIPS −20.1%
  • vs segment_only: SSIM +24.6%, LPIPS −30.2%

Seed Stability

| Seed | CLIP↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| 42 | 0.3340 | 0.3186 | 0.4763 |
| 123 | 0.3336 | 0.3248 | 0.4749 |
| 456 | 0.3364 | 0.3162 | 0.4827 |
| 789 | 0.3298 | 0.3257 | 0.4780 |
| 2026 | 0.3338 | 0.3259 | 0.4775 |

Dynamic Weight Analysis

The fusion module learns distinct gating patterns across scales and time steps:

| Scale | Depth | Normal | Canny | Segment |
|---|---|---|---|---|
| 64x64 | 0.2085 | 0.2814 | 0.7775 | 0.3088 |
| 32x32 | 0.2954 | 0.3666 | 0.5083 | 0.2485 |
| 16x16 | 0.6409 | 0.3423 | 0.1806 | 0.2519 |
| 8x8 | 0.5219 | 0.2210 | 0.5435 | 0.2389 |

At the finest scale (64×64), the model strongly favors canny edges for precise boundary generation. At coarser scales (16×16, 8×8), depth becomes dominant for global spatial layout. All conditions show meaningful temporal variation across denoising steps.

Quick Start

Installation

```bash
git clone https://github.com/gihyuness/DynControlNet.git
cd DynControlNet
pip install -r requirements.txt
```

Inference

```python
import torch
from diffusers import AutoencoderKL, UniPCMultistepScheduler
from safetensors.torch import load_file
from transformers import AutoTokenizer, CLIPTextModel
from PIL import Image
from models.dyncontrolnet import DynControlNetModel
from models.unet import UNet2DConditionModel
from pipeline.dyncontrolnet_pipeline import StableDiffusionDynControlNetPipeline

pretrained = "runwayml/stable-diffusion-v1-5"
tokenizer = AutoTokenizer.from_pretrained(pretrained, subfolder="tokenizer", use_fast=False)
text_encoder = CLIPTextModel.from_pretrained(pretrained, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16)
unet = UNet2DConditionModel.from_pretrained(pretrained, subfolder="unet")
dyncontrolnet = DynControlNetModel.from_unet(
    unet, condition_types=["depth", "normal", "canny", "segment"],
    cond_feat_channels=320, copy_unet_weights=False,
)
# Path to your local DynControlNet checkpoint
state_dict = load_file("checkpoint-33372/model.safetensors", device="cpu")
dyncontrolnet.load_state_dict(state_dict, strict=False)
pipe = StableDiffusionDynControlNetPipeline.from_pretrained(
    pretrained, vae=vae, text_encoder=text_encoder, tokenizer=tokenizer,
    unet=unet, dyncontrolnet=dyncontrolnet,
    safety_checker=None, torch_dtype=torch.float16,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

condition_images = {
    "depth":   Image.open("data/depth.png"),
    "normal":  Image.open("data/normal.png"),
    "canny":   Image.open("data/canny.png"),
    "segment": Image.open("data/segment.png"),
}

seed = 12345
generator = torch.Generator(device="cuda").manual_seed(seed)

with torch.autocast("cuda"):
    output = pipe(
        prompt="A realistic photo of a playful tabby and white "
               "cat standing upright on its hind legs indoors, "
               "reaching up with its front paws to grab a bright white cotton string "
               "above. Full body in frame with headroom, vertical composition, "
               "subject centered, wooden floor and a cozy carpet in the background, "
               "warm indoor lighting, sharp focus, detailed fur texture, "
               "natural colors.",
        negative_prompt="low quality, blurry",
        condition_images=condition_images,
        num_inference_steps=50,
        guidance_scale=7.5,
        control_scale=1.0,
        generator=generator,
    )
output.images[0].save("sample.png")
```

Training

DynControlNet was trained on a paired dataset of ~30,000 image-condition pairs over 3 days on an RTX 4060 Ti 16GB.

Due to hardware constraints (a single consumer GPU on a personal machine), large-scale training was not feasible, but the model demonstrates meaningful multi-condition control even under this limited training budget.

| Setting | Value |
|---|---|
| Dataset size | ~30,000 paired samples |
| Training duration | ~3 days |
| Hardware | RTX 4060 Ti 16GB |
| Base model | Stable Diffusion 1.5 (frozen) |

Dataset Format

Training data is stored in HDF5 format. Each entry contains:

```
entry_0/
  ├── image        # Original RGB image (H, W, 3)
  ├── depth        # Depth map (H, W, 3)
  ├── normal       # Normal map (H, W, 3)
  ├── canny        # Canny edge map (H, W, 3)
  ├── segment      # Segmentation map (H, W, 3)
  └── attrs:
      └── caption  # Text description
```
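A minimal sketch of writing and reading one entry in this layout with `h5py`, using the field names shown above (the filename here is illustrative):

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), "toy_dataset.h5")

# Build a tiny dataset in the README's layout: one group per entry, one
# dataset per condition, and the caption stored as a group attribute.
with h5py.File(path, "w") as f:
    grp = f.create_group("entry_0")
    for key in ("image", "depth", "normal", "canny", "segment"):
        grp.create_dataset(key, data=np.zeros((64, 64, 3), dtype=np.uint8))
    grp.attrs["caption"] = "a toy sample"

# Read it back the way a training loader would.
with h5py.File(path, "r") as f:
    entry = f["entry_0"]
    conditions = {k: entry[k][...] for k in ("depth", "normal", "canny", "segment")}
    caption = entry.attrs["caption"]
```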

Evaluation Metrics

| Metric | What it Measures | Ideal |
|---|---|---|
| FID | Distribution-level image quality vs. real images | Lower is better |
| CLIP Score | Text-image semantic alignment | Higher is better |
| SSIM | Structural similarity to ground truth | Higher is better |
| LPIPS | Perceptual difference from ground truth | Lower is better |
| Canny Alignment | Edge-map IoU between generated image and condition | Higher is better |
| Depth Correlation | Structural alignment with the depth condition (gradient / MiDaS) | Higher is better |
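The Canny Alignment row describes an IoU between binary edge maps. A minimal sketch of such a metric is below; the exact thresholds and any dilation/tolerance step used by the repo's evaluation code are assumptions here:

```python
import numpy as np

def edge_iou(pred_edges: np.ndarray, cond_edges: np.ndarray) -> float:
    """IoU between two binary edge maps: |pred AND cond| / |pred OR cond|.
    (Illustrative sketch; repo metrics may binarize or dilate differently.)"""
    pred = pred_edges > 0
    cond = cond_edges > 0
    union = np.logical_or(pred, cond).sum()
    if union == 0:
        return 1.0  # both maps empty: perfect agreement by convention
    inter = np.logical_and(pred, cond).sum()
    return float(inter / union)

a = np.zeros((8, 8), dtype=np.uint8); a[2, :] = 255   # 8 edge pixels
b = np.zeros((8, 8), dtype=np.uint8); b[2, :4] = 255  # 4 overlapping pixels
score = edge_iou(a, b)  # intersection 4, union 8 -> 0.5
```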

Project Structure

```
DynControlNet/
├── models/
│   ├── dyncontrolnet.py            # DynControlNetModel — main model class
│   ├── encoders.py                 # UncertaintyAwareEncoder
│   ├── fusion.py                   # MultiScaleTimeDependentFusion
│   ├── gates.py                    # DoGFrequencyGate
│   ├── projections.py              # SoftZeroConv, SoftProjection
│   └── unet/                       # SD 1.5 UNet (from diffusers)
├── pipeline/
│   ├── dyncontrolnet_pipeline.py   # Full inference pipeline
│   ├── condition_preprocess.py     # Condition image preprocessing
│   ├── denoise_loop.py             # Denoising loop with control injection
│   ├── latents.py                  # Latent preparation utilities
│   ├── prompt_encoder.py           # CLIP text encoding
│   ├── safety_and_decode.py        # VAE decode & safety checker
│   └── types.py                    # Type aliases
├── train/
│   ├── trainer.py                  # Main training loop
│   ├── cli.py                      # Argument parser
│   ├── data.py                     # HDF5Dataset & augmentations
│   ├── dropout.py                  # Structured condition dropout
│   ├── checkpoints.py              # Checkpoint save/load
│   ├── validation.py               # Validation image generation
│   ├── metrics.py                  # Training metrics logging
│   └── models.py                   # Model initialization helpers
├── scripts/
│   └── train_dyncontrolnet.py      # Training entry point
├── data/                           # Example condition images
├── requirements.txt
├── LICENSE
└── README.md
```

Training Command

```bash
python -m scripts.train_dyncontrolnet \
   --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
   --dataset_path="datasets/train_dataset_style_final.h5" \
   --train_batch_size=2 --num_train_epochs=40 \
   --output_dir="dyncontrolnet_model" \
   --logging_dir="logs" \
   --resolution=512 \
   --learning_rate=5e-6 \
   --condition_types depth normal canny segment \
   --validation_conditions "data/depth_image.png" "data/normal_image.png" \
      "data/canny_image.png" "data/segment_image.png" \
   --validation_prompt "A realistic photo of a playful tabby and white \
cat standing upright on its hind legs indoors, \
reaching up with its front paws to grab a bright white cotton string \
above. Full body in frame with headroom, vertical composition, \
subject centered, wooden floor and a cozy carpet in the background, \
warm indoor lighting, sharp focus, detailed fur texture, \
natural colors." \
   --report_to="tensorboard" \
   --checkpoints_total_limit=3 \
   --enable_xformers_memory_efficient_attention \
   --gradient_accumulation_steps=8 \
   --lr_scheduler constant \
   --mixed_precision fp16 \
   --lr_warmup_steps=200 \
   --set_grads_to_none \
   --max_grad_norm=1.0 \
   --noise_offset 0.05 \
   --log_tb_images \
   --gradient_checkpointing
```
Training Features

| Feature | Description |
|---|---|
| Random Control Scale | `control_scale` is uniformly sampled from `[scale_min, scale_max]` each step, improving inference robustness. |
| Structured Condition Dropout | Each condition is independently dropped with probability p=0.1, and all conditions are dropped simultaneously with p=0.05, teaching the model graceful degradation. |
| Differential Learning Rates | Custom modules (encoders, fuser, projections) use 5× the base learning rate for faster convergence. |
| Synchronized Augmentation | Random flip and crop are applied identically to the original image and all condition maps, maintaining spatial correspondence. |
| Gradient Checkpointing | Reduces VRAM usage at the cost of ~20% slower training. |
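The structured condition dropout described above can be sketched as follows. The function name and exact sampling details are illustrative assumptions; the repo's `train/dropout.py` is authoritative:

```python
import torch

def condition_dropout(cond_images: dict, p_each: float = 0.1,
                      p_all: float = 0.05) -> dict:
    """Sketch: with probability p_all zero out every condition at once;
    otherwise zero out each condition independently with probability p_each.
    (Illustrative, not the repo's exact implementation.)"""
    if torch.rand(()) < p_all:
        return {k: torch.zeros_like(v) for k, v in cond_images.items()}
    return {k: (torch.zeros_like(v) if torch.rand(()) < p_each else v)
            for k, v in cond_images.items()}

conds = {k: torch.ones(1, 3, 64, 64)
         for k in ("depth", "normal", "canny", "segment")}
torch.manual_seed(0)
out = condition_dropout(conds)  # each value is unchanged or zeroed
```

Training with dropped conditions is what lets inference work with any subset of the four condition images rather than requiring all of them.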

License

This project is licensed under the MIT License.

Acknowledgments

Core Models & Frameworks

Evaluation

Conditioning & Preprocessing

Data
