VisGen: Fast & Simple Vision Generative Models

VisGen is an accessible and educational project designed for users who want to quickly learn and experiment with classic vision generative models. With a minimalist yet production-quality PyTorch implementation, this project allows you to understand the fundamentals behind popular generative approaches while running experiments with ease.

Samples from DDPM trained from scratch

Overview

VisGen provides straightforward implementations of key generative models, including Autoencoder, VAE, VQ-VAE, GAN, and diffusion- and flow-based models like DDPM and FlowMatch. Each model is crafted to emphasize the core algorithmic concepts without unnecessary complexity, making it ideal for researchers, developers, and students eager to learn through hands-on coding experiments.

🛠️ Setup

Start by cloning the repository and setting up the environment. The project uses Python 3.11 and relies on PyTorch and associated libraries.

  1. Clone the repository (requires git):

    git clone https://github.com/Pearisli/VisGen.git
    cd VisGen
  2. Install dependencies (requires conda):

    conda create -n visgen python=3.11.11 -y
    conda activate visgen
    pip install -r requirements.txt
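
After installing, you can quickly verify that PyTorch works and that your GPU is visible (a minimal sanity check, not part of the repository):

import torch

# Confirm the installation and check for a CUDA device.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())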

🎮 Basic Usage

1. Prepare Dataset

Download and extract your dataset, for example the Anime Face dataset (63K images).
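
The training scripts read images from a local folder. As a rough illustration (the path and the use of torchvision's ImageFolder are assumptions, not the repo's exact loader), loading a dataset at the 64×64 resolution used in the experiments could look like:

import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

transform = T.Compose([
    T.Resize(64),
    T.CenterCrop(64),
    T.ToTensor(),
    T.Normalize([0.5] * 3, [0.5] * 3),  # map pixel values to [-1, 1]
])

# ImageFolder expects one level of class subdirectories; a flat
# image folder would need a small custom Dataset instead.
dataset = ImageFolder("data/anime-faces", transform=transform)
loader = DataLoader(dataset, batch_size=128, shuffle=True, num_workers=4)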

2. Train a Model

Set the project path and launch training. Example for VAE:

export PYTHONPATH=/path/to/your/project:$PYTHONPATH
python examples/train_vae.py

Other model scripts include train_autoencoder.py, train_gan.py, train_ddpm.py, etc. Each script corresponds to a specific model variant.
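
For intuition, the heart of a VAE training step is the reparameterization trick plus the ELBO objective (a reconstruction term plus a KL penalty toward a standard normal prior). A compact sketch, with encoder and decoder as placeholders rather than the repo's actual classes:

import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction error plus KL divergence to N(0, I).
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def training_step(encoder, decoder, x):
    mu, logvar = encoder(x)                 # posterior parameters
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps  # reparameterization: z ~ N(mu, sigma^2)
    x_recon = decoder(z)
    return vae_loss(x, x_recon, mu, logvar)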

📊 Experiments

Experiments were conducted on a single NVIDIA A800-PCIe-80GB GPU under fixed settings (batch size: 128, resolution: 64×64). Note that these models use ResNet-based blocks by default; if your GPU runs out of memory, switch to the provided Basic block alternative or reduce the channel count and batch size (see the sketch after the table).

Model         Training Time   Memory Usage
Autoencoder   15 minutes      4 GB
VAE           15 minutes      4 GB
VQ-VAE        15 minutes      4 GB
WGAN-GP       3 hours         10 GB
StyleGAN2     4 hours         15 GB
DDPM          4.6 hours       25 GB
FlowMatch     4.6 hours       25 GB
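
The difference between the two block types comes down to a residual (skip) connection, which helps training but costs extra activation memory. Roughly, with illustrative class names rather than the repo's:

import torch.nn as nn

class BasicBlock(nn.Module):
    """Plain conv -> norm -> activation: the lighter-weight option."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(8, channels),
            nn.SiLU(),
        )

    def forward(self, x):
        return self.body(x)

class ResNetBlock(nn.Module):
    """Same idea with a skip connection: x + f(x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(8, channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)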

🚀 Advanced Training

In this section, we will dive deeper into training a Latent Diffusion Model (LDM) on larger, high-resolution datasets and explore text-conditioned generation to broaden the model's applications. Each step includes not only commands and parameter settings but also pedagogical notes to help you understand the underlying principles and best practices.
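
Conceptually, a latent diffusion training step first compresses each image into a low-dimensional latent with a pretrained autoencoder, then runs ordinary diffusion training in that latent space. A heavily simplified sketch, where vae, unet, and scheduler are placeholders for whatever your config defines:

import torch
import torch.nn.functional as F

def ldm_training_step(vae, unet, scheduler, x, text_emb=None):
    # 1. Encode images into latents (the autoencoder stays frozen).
    with torch.no_grad():
        latents = vae.encode(x)
    # 2. Pick random timesteps and add the matching amount of noise.
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    # 3. Predict the noise, optionally conditioned on text embeddings.
    pred = unet(noisy, t, text_emb)
    return F.mse_loss(pred, noise)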

1. Install Extended Dependencies

Training LDMs requires additional libraries for high-performance data loading, multi-process management, and advanced optimization:

pip install -r requirements+.txt

2. Prepare Dataset and Captions

3. Launch Training with accelerate

The accelerate library simplifies multi-GPU, mixed-precision, and distributed setups:

accelerate launch \
    --num_machines=1 \
    --mixed_precision='no' \
    --num_processes=1 \
    --dynamo_backend='no' \
    examples/train_ldm.py \
    --config ./config/train_ldm.yaml
  • --mixed_precision=no: defaults to full precision (FP32), which runs on all GPUs.
  • On supported hardware (e.g., NVIDIA A100/A800, H100, or Google Cloud TPU v4), enable bf16 by setting --mixed_precision=bf16.
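
Inside a training script, these flags correspond to the Accelerator object from the accelerate library. A minimal runnable sketch of how mixed precision is wired up (the tiny linear model is only a stand-in):

import torch
from torch import nn
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="no")  # or "bf16" on supported GPUs

model = nn.Linear(8, 1)  # stand-in for a real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(4, 8, device=accelerator.device)
loss = model(x).pow(2).mean()
accelerator.backward(loss)  # replaces loss.backward(); handles precision details
optimizer.step()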

4. Sampling Results

After training, generate samples with 200 DDIM steps:
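
In outline, DDIM sampling starts from Gaussian noise and deterministically denoises it over a fixed number of steps. A sketch using a diffusers-style DDIMScheduler (an assumption; the repo may ship its own sampler), with unet and vae as placeholders for the trained models:

import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(200)  # 200 DDIM steps

latents = torch.randn(16, 4, 8, 8)  # latent noise; shape is illustrative
for t in scheduler.timesteps:
    noise_pred = unet(latents, t)  # placeholder model call
    latents = scheduler.step(noise_pred, t, latents).prev_sample
images = vae.decode(latents)  # decode latents back to pixel space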

Unconditional Generation

Samples from Latent Diffusion model trained from scratch with 200 DDIM steps

Text-Conditioned Generation


From Scratch

Samples from text-conditioned Latent Diffusion model trained from scratch with 200 DDIM steps


LoRA Fine-Tuning

Samples from Latent Diffusion model trained by LoRA with 200 DDIM steps
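
LoRA keeps the pretrained weights frozen and learns only a low-rank update, so the effective weight becomes W + (alpha / r) * B @ A. The idea in miniature (a conceptual sketch, not the repo's implementation):

import torch
from torch import nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no effect at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

# Example: wrap one layer of a network before fine-tuning.
layer = LoRALinear(nn.Linear(128, 128), rank=4)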
