
FastVLM: Research Implementation & Extensions

This repository contains our research implementation and extensions of FastVLM: Efficient Vision Encoding for Vision Language Models (CVPR 2025) by Apple Inc.

Overview

Original FastVLM introduces FastViTHD, a hybrid vision encoder that outputs ~100 tokens per image (vs 576 for CLIP).
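The token counts follow from simple patch arithmetic. A minimal sketch, assuming CLIP ViT-L/14 at 336 px input with 14-px patches (the standard configuration behind the 576 figure), and taking FastViTHD's ~100 tokens as the paper's reported number:

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of visual tokens a plain ViT emits: one per patch."""
    return (image_size // patch_size) ** 2

# CLIP ViT-L/14 at 336 px: a 24 x 24 patch grid -> 576 tokens per image.
clip_tokens = vit_token_count(336, 14)
print(clip_tokens)  # 576

# FastViTHD's ~100 tokens is roughly a 5.8x reduction.
print(round(clip_tokens / 100, 1))  # 5.8
```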

Our Study:

  1. Experiment Replications - replication of 4 key experiments from the paper
  2. Video Fine-tuning Adaptation - FastVLM extended for video-text training
  3. Windows Inference Engine - real-time inference with live webcam support

Quick Start

Installation

conda create -n fastvlm python=3.10
conda activate fastvlm
pip install -e .

Download Models

bash get_models.sh  # Downloads to checkpoints/

1) Experiment Replications

We replicated 4 experiments from the FastVLM paper to validate the reported results.

Replicated Experiments

| Experiment            | Paper Ref | Description                                    |
|-----------------------|-----------|------------------------------------------------|
| Encoder Comparison    | Table 3   | ViT-L/14 vs ConvNeXt-L vs FastViT-HD latency   |
| Token Efficiency      | Table 4   | Visual tokens across encoders & resolutions    |
| Resolution Scaling    | Table 5   | FastViT-HD at 256-1024 px                      |
| Model Size Comparison | Table 11  | FastVLM 0.5B vs 1.5B                           |

Details: See experiments/README.md

2) Video Fine-tuning Extension

We adapted FastVLM to support video-text pair training using sparse temporal sampling.

Key Modifications

  • Video Frame Extraction: Uniform sampling of N frames using decord
  • Token Expansion: <image><image><image><image>... (N times)
  • Efficient Processing: ~800 tokens for 8 frames vs 4,608 for CLIP-based models
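The sampling and token-expansion steps above can be sketched as follows. This is an illustrative sketch, not the repository's code: the helper names are hypothetical, and in the actual pipeline the selected indices would be passed to decord's `VideoReader.get_batch` to fetch frames.

```python
import numpy as np

def uniform_frame_indices(total_frames: int, n_frames: int) -> list:
    """Pick n_frames indices spread uniformly across the clip."""
    return np.linspace(0, total_frames - 1, n_frames).round().astype(int).tolist()

def expand_image_tokens(prompt: str, n_frames: int, token: str = "<image>") -> str:
    """Repeat the image placeholder once per sampled frame."""
    return prompt.replace(token, token * n_frames, 1)

# e.g. sample 8 frames from a 240-frame clip
idx = uniform_frame_indices(240, 8)
print(idx)  # [0, 34, 68, 102, 137, 171, 205, 239]

prompt = expand_image_tokens("<image>\nDescribe the clip.", 8)
print(prompt.count("<image>"))  # 8
```

With 8 frames at ~100 tokens each, the visual input stays near 800 tokens, versus 8 x 576 = 4,608 for a CLIP-based encoder.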

Usage

cd scripts
bash finetune_video.sh

Note: The full fine-tuning implementation lives at EdgeVLM-Labs/fastvlm-adaptation, since it modifies core training code in llava/.


Figure: Training metrics from video fine-tuning, showing loss convergence and performance improvements.

Details: See scripts/finetune.md

3) Windows Inference Engine

Interactive Gradio app with real-time webcam inference and optimized performance.

Features

  • Dual Modes: Chat (image upload) + Live (webcam streaming)
  • Performance Optimizations:
    • Prompt caching for repeated queries
    • Frame skipping (adjustable 1-10x)
    • TF32 acceleration on Ampere+ GPUs
  • Real-time Metrics: TTFT, tokens/sec, avg/min/max latency, FPS
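Two of the optimizations above, frame skipping and TTFT measurement, reduce to a few lines of logic. A minimal sketch with hypothetical helper names (not the app's actual classes):

```python
import time

class FrameSkipper:
    """Gate that lets through only every `skip`-th webcam frame (1-10x)."""
    def __init__(self, skip: int = 1):
        self.skip = max(1, skip)
        self._count = 0

    def should_process(self) -> bool:
        process = self._count % self.skip == 0
        self._count += 1
        return process

def measure_ttft(token_stream):
    """Return (first_token, time-to-first-token in seconds) for a generator."""
    start = time.perf_counter()
    first = next(token_stream)
    return first, time.perf_counter() - start

skipper = FrameSkipper(skip=3)
print([skipper.should_process() for _ in range(6)])
# [True, False, False, True, False, False]
```

TF32 acceleration on Ampere+ GPUs is typically enabled with `torch.backends.cuda.matmul.allow_tf32 = True` before loading the model.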

Run

python windows_app/app.py

Details: See windows_app/README.md

License & Attribution

Original Work

FastVLM is developed by Apple Inc. and released under the Apple Sample Code License.

Citation:

@InProceedings{fastvlm2025,
  author    = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and
               Cem Koc and Nate True and Albert Antony and Gokul Santhanam and
               James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title     = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {CVPR},
  year      = {2025}
}

Our Research Extensions

The modifications in this repository (experiment replications, video fine-tuning adaptation, and Windows inference engine) are provided for research and educational purposes only, in accordance with the Apple Sample Code License which permits use, modification, and redistribution for non-commercial research.

Please refer to the LICENSE file for full terms.
