VisGen is an accessible and educational project designed for users who want to quickly learn and experiment with classic vision generative models. With a minimalist yet production-quality PyTorch implementation, this project allows you to understand the fundamentals behind popular generative approaches while running experiments with ease.
VisGen provides straightforward implementations of key generative models, including Autoencoder, VAE, VQ-VAE, GAN, and diffusion-based models like DDPM and FlowMatch. Each model is crafted to emphasize the core algorithmic concepts without unnecessary complexity—making it ideal for researchers, developers, or students eager to learn from hands-on coding experiments.
Start by cloning the repository and setting up the environment. The project uses Python 3.11 and relies on PyTorch and associated libraries.
- Clone the repository (requires git):

  ```bash
  git clone https://github.com/Pearisli/VisGen.git
  cd VisGen-main
  ```

- Install dependencies (requires conda):

  ```bash
  conda create -n visgen python=3.11.11 -y
  conda activate visgen
  pip install -r requirements.txt
  ```
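To verify the environment, a quick check (assuming a CUDA-capable GPU is present) confirms that PyTorch can see the device:

```python
import torch

# Prints the installed PyTorch version and whether a CUDA GPU is visible.
print(torch.__version__, torch.cuda.is_available())
```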
Download and extract your dataset, for example the Anime Face dataset (63K images).
Set the project path and launch training. Example for VAE:
```bash
export PYTHONPATH=/path/to/your/project:$PYTHONPATH
python examples/train_vae.py
```

Other model scripts include `train_autoencoder.py`, `train_gan.py`, `train_ddpm.py`, etc. Each script corresponds to a specific model variant.
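For orientation, a VAE such as the one trained by `train_vae.py` is fit by maximizing the ELBO: a reconstruction term plus a KL term, with latents drawn via the reparameterization trick. The sketch below illustrates that objective in plain PyTorch; the function names and tensor shapes are illustrative, not the repository's exact code.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Sample z = mu + sigma * eps so gradients flow through mu and logvar.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x: torch.Tensor, x_recon: torch.Tensor,
             mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Reconstruction term: how closely the decoder reproduces the input.
    recon = F.mse_loss(x_recon, x, reduction="sum") / x.size(0)
    # KL term: pushes the approximate posterior N(mu, sigma^2) toward N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + kl
```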
Experiments were conducted on a single NVIDIA A800-PCIe-80GB GPU under fixed settings (batch size 128, resolution 64×64). Note that the models use ResNet-based blocks by default; if you run out of GPU memory, switch to the provided Basic block alternative, or reduce the channel widths and batch size.
| Model | Training Time | Memory Usage |
|---|---|---|
| Autoencoder | 15 minutes | 4 GB |
| VAE | 15 minutes | 4 GB |
| VQ-VAE | 15 minutes | 4 GB |
| WGAN-GP | 3 hours | 10 GB |
| StyleGAN2 | 4 hours | 15 GB |
| DDPM | 4.6 hours | 25 GB |
| FlowMatch | 4.6 hours | 25 GB |
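If you want to compare your own runs against the memory figures above, PyTorch's allocator statistics give a reasonable estimate; the snippet below is a generic measurement sketch, not part of the training scripts.

```python
import torch

# Call this after a few forward/backward passes on the GPU.
if torch.cuda.is_available():
    peak_bytes = torch.cuda.max_memory_allocated()
    print(f"Peak GPU memory: {peak_bytes / 1024**3:.1f} GiB")
```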
In this section, we will dive deeper into training a Latent Diffusion Model (LDM) on larger, high-resolution datasets and explore text-conditioned generation to broaden the model's applications. Each step includes not only commands and parameter settings but also pedagogical notes to help you understand the underlying principles and best practices.
Training LDMs requires additional libraries for high-performance data loading, multi-process management, and advanced optimization:
```bash
pip install -r requirements+.txt
```

- Dataset: Use the Anime Faces 512×512 dataset (140K images) to train on high-resolution images with finer detail.
- Captions: Generate textual tags with the pretrained DeepDanbooru-PyTorch model to provide rich text conditioning.
- Custom Dataset Structure: For more flexible folder layouts, refer to the Hugging Face Datasets ImageFolder guide (see the loading sketch after this list).
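As a reference for the ImageFolder layout, loading an image directory (optionally with a `metadata.jsonl` file that attaches captions to each image) looks roughly like this; the path below is a placeholder:

```python
from datasets import load_dataset

# Loads every image under data_dir; if a metadata.jsonl with a "file_name"
# column is present, its extra columns (e.g. captions) are attached to each example.
dataset = load_dataset("imagefolder", data_dir="/path/to/anime-faces-512")
print(dataset["train"][0])  # {'image': <PIL.Image.Image ...>, ...}
```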
The accelerate library simplifies multi-GPU, mixed-precision, and distributed setups:
```bash
accelerate launch \
  --num_machines=1 \
  --mixed_precision='no' \
  --num_processes=1 \
  --dynamo_backend='no' \
  examples/train_ldm.py \
  --config ./config/train_ldm.yaml
```

- `--mixed_precision=no`: Defaults to full precision (FP32) to support all GPUs.
- Enable `bf16` on supported hardware (e.g., NVIDIA A100/A800, H100, or Google Cloud TPU v4) by setting `--mixed_precision=bf16`.
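A script launched this way typically wraps its training objects with `accelerate`'s `Accelerator`; the sketch below shows that general pattern with a stand-in model, optimizer, and data (placeholders for illustration, not the classes used in `train_ldm.py`).

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the flags passed to `accelerate launch`

# Stand-in objects purely for illustration.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = torch.utils.data.DataLoader(torch.randn(128, 16), batch_size=32)

# prepare() moves everything to the right device(s) and wraps it for
# distributed and mixed-precision execution.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```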
After training, generate samples with 200 DDIM steps:
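For intuition, the deterministic DDIM update (eta = 0) that such a sampler iterates is sketched below; the noise-prediction `model`, its call signature, and the `alphas_cumprod` schedule are placeholders rather than the repository's exact interfaces.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alphas_cumprod, num_steps=200, device="cuda"):
    # Evenly spaced subset of the training timesteps, traversed from noisy to clean.
    T = alphas_cumprod.shape[0]
    timesteps = torch.linspace(T - 1, 0, num_steps, device=device).long()

    x = torch.randn(shape, device=device)  # start from pure Gaussian noise
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else torch.tensor(1.0, device=device)

        eps = model(x, t.expand(shape[0]))               # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM step
    return x
```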
From Scratch
LoRA Fine-Tuning