3D Convolutional VAE for encoding and decoding both images and video.
| Property | Value |
|---|---|
| Spatial compression | 8x (256x256 to 32x32) |
| Temporal compression | 4x (24 frames to 6 latent frames) |
| Latent channels | 16 |
| Parameters | 346.6M (170.1M encoder, 176.5M decoder) |
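Given the compression factors in the table, the latent shape for any input follows from simple integer arithmetic. A quick sketch (the helper name is ours, not part of the library):

```python
def latent_shape(frames: int, height: int, width: int) -> tuple:
    """Latent tensor shape implied by the table: 16 channels,
    4x temporal and 8x spatial compression."""
    return (16, frames // 4, height // 8, width // 8)

# A 24-frame 256x256 clip compresses to 6 latent frames of 32x32:
print(latent_shape(24, 256, 256))  # (16, 6, 32, 32)
```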
```bash
git clone https://github.com/Linum-AI/image-video-vae.git
cd image-video-vae
uv sync
```

The VAE was trained on specific resolutions, so by default images are resized to the closest training-resolution size bucket for best reconstruction quality. The script saves both the resized original and the reconstruction so you can compare exactly what the VAE saw versus what it produced.
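Bucket matching can be pictured as a nearest-area search. A minimal sketch with made-up bucket resolutions — the real bucket names and sizes are defined inside the library, not here:

```python
# Hypothetical buckets for illustration only; the library defines its own.
BUCKETS = {"SMALL": (180, 320), "MEDIUM": (360, 640), "LARGE": (720, 1280)}

def closest_bucket(height: int, width: int) -> str:
    """Pick the bucket whose pixel area is closest to the input's."""
    area = height * width
    return min(BUCKETS, key=lambda name: abs(BUCKETS[name][0] * BUCKETS[name][1] - area))
```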
```bash
uv run python encode_decode.py \
  --mode image \
  --input examples/images/original/camel_closeup.jpg
```

To skip resizing and encode at native resolution, use `--no-resize` (both dimensions must be at least 8 pixels):
```bash
uv run python encode_decode.py \
  --mode image \
  --input examples/images/original/camel_closeup.jpg \
  --no-resize
```

Videos are sub-sampled to 24 FPS (videos below 24 FPS are not supported). Spatial dimensions are floored to the nearest multiple of 8. The script saves both the preprocessed original (at 24 FPS) and the reconstruction so you can compare them directly.
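The spatial flooring described above is plain integer arithmetic; a sketch (the helper name is ours):

```python
def floor_to_multiple(value: int, multiple: int = 8) -> int:
    """Floor a dimension to the nearest lower multiple (the VAE needs multiples of 8)."""
    return (value // multiple) * multiple

# A 1080x1918 video would be floored to 1080x1912:
print(floor_to_multiple(1918))  # 1912
print(floor_to_multiple(1080))  # 1080
```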
```bash
uv run python encode_decode.py \
  --mode video \
  --input examples/videos/original/woman_in_breeze.mp4
```
The `Autoencoder` is a pure tensor-in, tensor-out module that operates on normalized [0, 1] tensors of shape (B, 3, T, H, W). It does not handle file loading, resizing, denormalization, or saving to disk.
```python
from image_video_vae.autoencoder import Autoencoder

model = Autoencoder.from_pretrained(checkpoint_path="vae.safetensors")

# x must be a [0, 1] normalized tensor of shape (B, 3, T, H, W)
dist_params = model.encode(x=x)                    # -> (B, 32, T', H', W')
z = model.sample(distribution_params=dist_params)  # -> (B, 16, T', H', W')
decoded = model.decode(z=z)                        # -> (B, 3, T, H, W), values in [0, 1]

# Use chunked encoding/decoding to fit in GPU memory:
# 24 frames per chunk for 180/360p videos, 12 frames for 720p videos
dist_params = model.encode_chunked(x=x, chunk_frames=24)
decoded = model.decode_chunked(z=z, target_chunk_resolution=(24, H, W), chunk_frames=24)
```

The `image_video_vae.io` module handles conversion between files on disk and the normalized tensors the `Autoencoder` expects.
Preprocessing loads images or videos, resizes to training-resolution buckets, and normalizes
pixel values from [0, 255] to [0, 1]. Postprocessing reverses this and saves to JPEG or MP4.
```python
from image_video_vae.io import preprocess_image, denormalize_pixels, save_video_as_mp4

# File -> [0, 1] tensor
tensor = preprocess_image(image_path="photo.jpg", size_bucket="XX_LARGE")

# [0, 1] tensor -> [0, 255] uint8 for saving
pixels = denormalize_pixels(frames=decoded).byte()

# [0, 1] video tensor -> MP4 file
save_video_as_mp4(video_tensor=decoded, output_path="output.mp4")
```

Weights are hosted on HuggingFace Hub:
```bibtex
@online{image_video_vae_2026,
  title  = {VAE: Reconstruction vs. Generation},
  author = {Linum AI},
  year   = {2026},
  url    = {https://www.linum.ai/field-notes/vae-reconstruction-vs-generation}
}
```

This project is licensed under the Apache License 2.0 - see the LICENSE file.
Linum is a team of two brothers building a tiny-yet-powerful AI research lab. We train our own generative media models from scratch.
Subscribe to Field Notes — technical deep dives on building generative video models from the ground up, plus updates on new releases from Linum.
Contact: hello@linum.ai
