3D Convolutional VAE for encoding and decoding both images and video.
| Property | Value |
|---|---|
| Spatial compression | 8x (256x256 to 32x32) |
| Temporal compression | 4x (24 frames to 6 latent frames) |
| Latent channels | 16 |
| Parameters | 346.6M (170.1M encoder, 176.5M decoder) |
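Given the compression factors in the table, the latent shape for any input follows from simple integer arithmetic. A quick sketch (the helper name is ours, not part of the library):

```python
def latent_shape(frames: int, height: int, width: int) -> tuple:
    """Latent tensor shape implied by the table: 16 channels,
    4x temporal and 8x spatial compression."""
    return (16, frames // 4, height // 8, width // 8)

# A 24-frame 256x256 clip compresses to 6 latent frames of 32x32:
print(latent_shape(24, 256, 256))  # (16, 6, 32, 32)
```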
```bash
git clone https://github.com/Linum-AI/image-video-vae.git
cd image-video-vae
uv sync
```

The VAE was trained on specific resolutions, so by default images are resized to the closest training-resolution size bucket for best reconstruction quality. The script saves both the resized original and the reconstruction so you can compare exactly what the VAE saw versus what it produced.
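Bucket matching can be pictured as a nearest-area search. A minimal sketch with made-up bucket resolutions — the real bucket names and sizes are defined inside the library, not here:

```python
# Hypothetical buckets for illustration only; the library defines its own.
BUCKETS = {"SMALL": (180, 320), "MEDIUM": (360, 640), "LARGE": (720, 1280)}

def closest_bucket(height: int, width: int) -> str:
    """Pick the bucket whose pixel area is closest to the input's."""
    area = height * width
    return min(BUCKETS, key=lambda name: abs(BUCKETS[name][0] * BUCKETS[name][1] - area))
```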
```bash
uv run python encode_decode.py \
  --mode image \
  --input examples/images/original/camel_closeup.jpg
```

To skip resizing and encode at native resolution, use `--no-resize` (both dimensions must be at least 8 pixels):
```bash
uv run python encode_decode.py \
  --mode image \
  --input examples/images/original/camel_closeup.jpg \
  --no-resize
```

Videos are sub-sampled to 24 FPS (videos below 24 FPS are not supported). Spatial dimensions are floored to the nearest multiple of 8. The script saves both the preprocessed original (at 24 FPS) and the reconstruction so you can compare them directly.
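The spatial flooring described above is plain integer arithmetic; a sketch (the helper name is ours):

```python
def floor_to_multiple(value: int, multiple: int = 8) -> int:
    """Floor a dimension to the nearest lower multiple (the VAE needs multiples of 8)."""
    return (value // multiple) * multiple

# A 1080x1918 video would be floored to 1080x1912:
print(floor_to_multiple(1918))  # 1912
print(floor_to_multiple(1080))  # 1080
```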
```bash
uv run python encode_decode.py \
  --mode video \
  --input examples/videos/original/woman_in_breeze.mp4
```
The `Autoencoder` is a pure tensor-in, tensor-out module that operates on normalized [0, 1] tensors of shape (B, 3, T, H, W). It does not handle file loading, resizing, denormalization, or saving to disk.
```python
from image_video_vae.autoencoder import Autoencoder

model = Autoencoder.from_pretrained(checkpoint_path="vae.safetensors")

# x must be a [0, 1] normalized tensor of shape (B, 3, T, H, W)
dist_params = model.encode(x=x)                    # -> (B, 32, T', H', W')
z = model.sample(distribution_params=dist_params)  # -> (B, 16, T', H', W')
decoded = model.decode(z=z)                        # -> (B, 3, T, H, W), values in [0, 1]

# Use chunked encoding/decoding to fit in GPU memory:
# 24 frames per chunk for 180/360p videos, 12 frames for 720p videos
dist_params = model.encode_chunked(x=x, chunk_frames=24)
decoded = model.decode_chunked(z=z, target_chunk_resolution=(24, H, W), chunk_frames=24)
```

The `image_video_vae.io` module handles conversion between files on disk and the normalized tensors the `Autoencoder` expects.
Preprocessing loads images or videos, resizes to training-resolution buckets, and normalizes
pixel values from [0, 255] to [0, 1]. Postprocessing reverses this and saves to JPEG or MP4.
```python
from image_video_vae.io import preprocess_image, denormalize_pixels, save_video_as_mp4

# File -> [0, 1] tensor
tensor = preprocess_image(image_path="photo.jpg", size_bucket="XX_LARGE")

# [0, 1] tensor -> [0, 255] uint8 for saving
pixels = denormalize_pixels(frames=decoded).byte()

# [0, 1] video tensor -> MP4 file
save_video_as_mp4(video_tensor=decoded, output_path="output.mp4")
```

Weights are hosted on HuggingFace Hub:
```bibtex
@online{image_video_vae_2026,
  title  = {VAE: Reconstruction vs. Generation},
  author = {Linum AI},
  year   = {2026},
  url    = {https://www.linum.ai/field-notes/vae-reconstruction-vs-generation}
}
```

This project is licensed under the Apache License 2.0 - see the LICENSE file.
Linum is a team of two brothers building a tiny-yet-powerful AI research lab. We train our own generative media models from scratch.
Subscribe to Field Notes — technical deep dives on building generative video models from the ground up, plus updates on new releases from Linum.
Contact: hello@linum.ai
