VisGen is an accessible and educational project designed for users who want to quickly learn and experiment with classic vision generative models. With a minimalist yet production-quality PyTorch implementation, this project allows you to understand the fundamentals behind popular generative approaches while running experiments with ease.
VisGen provides straightforward implementations of key generative models, including Autoencoder, VAE, VQ-VAE, GAN, and diffusion-based models like DDPM and FlowMatch. Each model is crafted to emphasize the core algorithmic concepts without unnecessary complexity—making it ideal for researchers, developers, or students eager to learn from hands-on coding experiments.
Start by cloning the repository and setting up the environment. The project uses Python 3.11 and relies on PyTorch and associated libraries.
- Clone the repository (requires git):

  ```bash
  git clone https://github.com/Pearisli/VisGen.git
  cd VisGen-main
  ```

- Install dependencies (requires conda):

  ```bash
  conda create -n visgen python=3.11.11 -y
  conda activate visgen
  pip install -r requirements.txt
  ```
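To verify the environment, a quick check (assuming a CUDA-capable GPU is present) confirms that PyTorch can see the device:

```python
import torch

# Prints the installed PyTorch version and whether a CUDA GPU is visible.
print(torch.__version__, torch.cuda.is_available())
```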
Download and extract your dataset, for example the Anime Face dataset (63K images).
Set the project path and launch training. Example for VAE:
```bash
export PYTHONPATH=/path/to/your/project:$PYTHONPATH
python examples/train_vae.py
```

Other model scripts include `train_autoencoder.py`, `train_gan.py`, `train_ddpm.py`, etc. Each script corresponds to a specific model variant.
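For orientation, a VAE such as the one trained by `train_vae.py` is fit by maximizing the ELBO: a reconstruction term plus a KL term, with latents drawn via the reparameterization trick. The sketch below illustrates that objective in plain PyTorch; the function names and tensor shapes are illustrative, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Sample z = mu + sigma * eps so gradients flow through mu and logvar.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x: torch.Tensor, x_recon: torch.Tensor,
             mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Reconstruction term: how closely the decoder reproduces the input.
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    # KL term: pushes the approximate posterior N(mu, sigma^2) toward N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + kl
```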
Experiments were conducted on a single NVIDIA A800-PCIe-80GB GPU under fixed settings (batch size 128, resolution 64×64). Note that the models use ResNet-based blocks by default; if you run out of GPU memory, switch to the provided Basic block alternative, or reduce the channel widths and batch size.
| Model | Training Time | Memory Usage |
|---|---|---|
| Autoencoder | 15 minutes | 4 GB |
| VAE | 15 minutes | 4 GB |
| VQ-VAE | 15 minutes | 4 GB |
| WGAN-GP | 3 hours | 10 GB |
| StyleGAN2 | 4 hours | 15 GB |
| DDPM | 4.6 hours | 25 GB |
| FlowMatch | 4.6 hours | 25 GB |
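If you want to compare your own runs against the memory figures above, PyTorch's allocator statistics give a reasonable estimate; the snippet below is a generic measurement sketch, not part of the training scripts.

```python
import torch

# Call this after a few forward/backward passes on the GPU.
if torch.cuda.is_available():
    peak_bytes = torch.cuda.max_memory_allocated()
    print(f"Peak GPU memory: {peak_bytes / 1024**3:.1f} GiB")
```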
In this section, we will dive deeper into training a Latent Diffusion Model (LDM) on larger, high-resolution datasets and explore text-conditioned generation to broaden the model's applications. Each step includes not only commands and parameter settings but also pedagogical notes to help you understand the underlying principles and best practices.
Training LDMs requires additional libraries for high-performance data loading, multi-process management, and advanced optimization:
```bash
pip install -r requirements+.txt
```

- Dataset: Use the Anime Faces 512×512 dataset (140K images) to train on high-resolution images with finer detail.
- Captions: Generate textual tags with the pretrained DeepDanbooru-PyTorch model to provide rich text conditioning.
- Custom Dataset Structure: For more flexible folder layouts, refer to the Hugging Face Datasets ImageFolder guide (see the loading sketch after this list).
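As a reference for the ImageFolder layout, loading an image directory (optionally with a `metadata.jsonl` file that attaches captions to each image) looks roughly like this; the path below is a placeholder:

```python
from datasets import load_dataset

# Loads every image under data_dir; if a metadata.jsonl with a "file_name"
# column is present, its extra columns (e.g. captions) are attached to each example.
dataset = load_dataset("imagefolder", data_dir="/path/to/anime-faces-512")
print(dataset["train"][0])  # {'image': <PIL.Image.Image ...>, ...}
```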
The accelerate library simplifies multi-GPU, mixed-precision, and distributed setups:
```bash
accelerate launch \
  --num_machines=1 \
  --mixed_precision='no' \
  --num_processes=1 \
  --dynamo_backend='no' \
  examples/train_ldm.py \
  --config ./config/train_ldm.yaml
```

- `--mixed_precision=no`: Defaults to full precision (FP32) to support all GPUs.
- Enable `bf16` on supported hardware (e.g., NVIDIA A100/A800, H100, or Google Cloud TPU v4) by setting `--mixed_precision=bf16`.
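A script launched this way typically wraps its training objects with `accelerate`'s `Accelerator`; the sketch below shows that general pattern with a stand-in model, optimizer, and data (placeholders for illustration, not the classes used in `train_ldm.py`).

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the flags passed to `accelerate launch`

# Stand-in objects purely for illustration.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(128, 16), batch_size=32)

# prepare() moves everything to the right device(s) and wraps it for
# distributed and mixed-precision execution.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```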
After training, generate samples with 200 DDIM steps:
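For intuition, the deterministic DDIM update (eta = 0) that such a sampler iterates is sketched below; the noise-prediction `model`, its call signature, and the `alphas_cumprod` schedule are placeholders rather than the repository's exact interfaces.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alphas_cumprod, num_steps=200, device="cuda"):
    # Evenly spaced subset of the training timesteps, traversed from noisy to clean.
    T = alphas_cumprod.shape[0]
    timesteps = torch.linspace(T - 1, 0, num_steps, device=device).long()

    x = torch.randn(shape, device=device)  # start from pure Gaussian noise
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0, device=device)

        eps = model(x, t.expand(shape[0]))               # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM step
    return x
```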
From Scratch
LoRA Fine-Tuning