🔧 Under the hood of our VAE

Image-Video-VAE

3D Convolutional VAE for encoding and decoding both images and video.

Properties

| Property             | Value                                   |
| -------------------- | --------------------------------------- |
| Spatial compression  | 8x (256x256 to 32x32)                   |
| Temporal compression | 4x (24 frames to 6 latent frames)       |
| Latent channels      | 16                                      |
| Parameters           | 346.6M (170.1M encoder, 176.5M decoder) |
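Given the 8x spatial and 4x temporal compression factors above, the latent shape of an input is simple arithmetic; a quick sanity check (plain illustration, not library code, assuming the dimensions divide evenly):

```python
def latent_shape(frames: int, height: int, width: int,
                 temporal: int = 4, spatial: int = 8,
                 latent_channels: int = 16):
    # assumes frames, height, and width divide evenly by the compression factors
    return (latent_channels, frames // temporal, height // spatial, width // spatial)

# 24 frames at 256x256 -> 6 latent frames at 32x32 with 16 channels
print(latent_shape(24, 256, 256))  # (16, 6, 32, 32)
```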

Quick Start

git clone https://github.com/Linum-AI/image-video-vae.git
cd image-video-vae
uv sync

Image

The VAE was trained on specific resolutions, so by default images are resized to the closest training-resolution size bucket for best reconstruction quality. The script saves both the resized original and the reconstruction so you can compare exactly what the VAE saw versus what it produced.

uv run python encode_decode.py \
  --mode image \
  --input examples/images/original/camel_closeup.jpg

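The exact bucket list and selection criterion are not documented here; a hypothetical sketch of how a "closest size bucket" might be chosen (both the bucket list and the aspect-ratio criterion are my assumptions, not the repo's actual logic):

```python
# Hypothetical buckets; the real training buckets are defined in the repo.
BUCKETS = [(256, 256), (256, 384), (384, 256), (512, 512)]  # (height, width)

def closest_bucket(height: int, width: int):
    # pick the bucket whose aspect ratio best matches the input's
    aspect = width / height
    return min(BUCKETS, key=lambda b: abs(b[1] / b[0] - aspect))

print(closest_bucket(720, 1280))  # a 16:9 input maps to the widest bucket
```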

To skip resizing and encode at native resolution, use --no-resize (both dimensions must be at least 8 pixels):

uv run python encode_decode.py \
  --mode image \
  --input examples/images/original/camel_closeup.jpg \
  --no-resize

Video

Videos are sub-sampled to 24 FPS (videos below 24 FPS are not supported). Spatial dimensions are floored to the nearest multiple of 8. The script saves both the preprocessed original (at 24 FPS) and the reconstruction so you can compare them directly.

uv run python encode_decode.py \
  --mode video \
  --input examples/videos/original/woman_in_breeze.mp4
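The spatial flooring described above is simple arithmetic; a sketch of that step (my illustration, not the script's actual code):

```python
def floor_to_multiple(value: int, multiple: int = 8) -> int:
    """Floor a spatial dimension to the nearest multiple of 8."""
    return (value // multiple) * multiple

# 853 is floored to 848; 480 is already a multiple of 8
print(floor_to_multiple(853), floor_to_multiple(480))  # 848 480
```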

Architecture

Architecture Diagram

Autoencoder (image_video_vae.autoencoder)

A pure tensor-in, tensor-out module that operates on normalized [0, 1] tensors of shape (B, 3, T, H, W). It does not handle file loading, resizing, denormalization, or saving to disk.

from image_video_vae.autoencoder import Autoencoder

model = Autoencoder.from_pretrained(checkpoint_path="vae.safetensors")

# x must be a [0, 1] normalized tensor of shape (B, 3, T, H, W)
dist_params = model.encode(x=x)          # -> (B, 32, T', H', W')
z = model.sample(distribution_params=dist_params)  # -> (B, 16, T', H', W')
decoded = model.decode(z=z)              # -> (B, 3, T, H, W), values in [0, 1]

# Use chunked encoding/decoding to fit in GPU memory
# 24 frames per chunk for 180p/360p videos; 12 frames per chunk for 720p videos
dist_params = model.encode_chunked(x=x, chunk_frames=24)
decoded = model.decode_chunked(z=z, target_chunk_resolution=(24, H, W), chunk_frames=24)
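The 32-to-16 channel split between `encode` and `sample` suggests the standard VAE reparameterization: presumably 16 mean channels plus 16 log-variance channels, with sampling drawing z = mu + sigma * eps. A scalar sketch of that step (assumed behavior, not the model's exact implementation):

```python
import math
import random

def sample_latent(mean: float, logvar: float, rng: random.Random) -> float:
    # reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)
    eps = rng.gauss(0.0, 1.0)
    return mean + math.exp(0.5 * logvar) * eps

rng = random.Random(0)
z = sample_latent(mean=0.5, logvar=-2.0, rng=rng)
```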

I/O utilities (image_video_vae.io)

Handles conversion between files on disk and the normalized tensors the Autoencoder expects. Preprocessing loads images or videos, resizes to training-resolution buckets, and normalizes pixel values from [0, 255] to [0, 1]. Postprocessing reverses this and saves to JPEG or MP4.

from image_video_vae.io import preprocess_image, denormalize_pixels, save_video_as_mp4

# File -> [0, 1] tensor
tensor = preprocess_image(image_path="photo.jpg", size_bucket="XX_LARGE")

# [0, 1] tensor -> [0, 255] uint8 for saving
pixels = denormalize_pixels(frames=decoded).byte()

# [0, 1] video tensor -> MP4 file
save_video_as_mp4(video_tensor=decoded, output_path="output.mp4")
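`denormalize_pixels` presumably maps [0, 1] floats back to the [0, 255] integer range; a minimal sketch of that conversion on plain floats (assumed behavior, not the library's exact code):

```python
def denormalize(value: float) -> int:
    # clamp to [0, 1], scale to [0, 255], round to the nearest integer
    clamped = min(max(value, 0.0), 1.0)
    return round(clamped * 255.0)

print([denormalize(v) for v in (0.0, 0.5, 1.0, 1.2)])  # [0, 128, 255, 255]
```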

Model Weights

Weights are hosted on HuggingFace Hub:

Citation

@online{image_video_vae_2026,
  title = {VAE: Reconstruction vs. Generation},
  author = {Linum AI},
  year = {2026},
  url = {https://www.linum.ai/field-notes/vae-reconstruction-vs-generation}
}

License

This project is licensed under the Apache License 2.0 - see the LICENSE file.

About Linum

Linum is a team of two brothers building a tiny-yet-powerful AI research lab. We train our own generative media models from scratch.

Subscribe to Field Notes — technical deep dives on building generative video models from the ground up, plus updates on new releases from Linum.

Contact: hello@linum.ai
