KerasFormers 🚀

📖 Introduction

KerasFormers is a collection of models with pretrained weights, built entirely with Keras 3. It supports a range of tasks: classification, object detection (DETR, RT-DETR, RT-DETRv2, RF-DETR, D-FINE, OWL-ViT, OWLv2), segmentation (SAM, SAM2, SAM3, SegFormer, DeepLabV3, EoMT, MaskFormer, Mask2Former, MobileViT-DeepLabV3), monocular depth estimation (Depth Anything V1, Depth Anything V2), feature extraction (DINO, DINOv2, DINOv3), vision-language modeling (CLIP, SigLIP, SigLIP2, MetaCLIP 2), speech recognition (Whisper, Speech2Text), and more.

The collection includes hybrid architectures such as MaxViT alongside traditional CNNs and pure transformers, and it also provides custom layers and backbone support. Backbones come with multiple pretrained weight variants, such as in1k, in21k, fb_dist_in1k, ms_in22k, fb_in22k_ft_in1k, ns_jft_in1k, aa_in1k, cvnets_in1k, augreg_in21k_ft_in1k, augreg_in21k, and many more.
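The weight variant names listed above appear to follow the timm-style tag convention, in which an optional recipe or source prefix is followed by the pretraining dataset and, after `_ft_`, the fine-tuning dataset. The sketch below splits a tag under that assumption. `WeightTag` and `parse_weight_tag` are hypothetical helpers written for illustration and are not part of kerasformers.

```python
from typing import NamedTuple, Optional


class WeightTag(NamedTuple):
    """Parts of a weight variant tag (hypothetical helper, not a kerasformers API)."""

    recipe: Optional[str]    # recipe or source prefix, e.g. "augreg" or "fb_dist"
    pretrain: str            # dataset named before "_ft_", e.g. "in21k"
    finetune: Optional[str]  # dataset named after "_ft_", if present


def parse_weight_tag(tag: str) -> WeightTag:
    # Assumption: tags look like "<recipe>_<dataset>" or
    # "<recipe>_<dataset>_ft_<dataset>", as in timm's naming scheme.
    head, sep, finetune = tag.partition("_ft_")
    # The last underscore-separated token before "_ft_" is taken as the dataset;
    # everything before it is treated as the recipe prefix.
    recipe, _, pretrain = head.rpartition("_")
    return WeightTag(recipe or None, pretrain, finetune if sep else None)


for tag in ("in1k", "fb_dist_in1k", "augreg_in21k_ft_in1k"):
    print(tag, "->", parse_weight_tag(tag))
```

Under this reading, `augreg_in21k_ft_in1k` denotes weights trained with the `augreg` recipe on `in21k` and then fine-tuned on `in1k`, while a plain `in1k` tag has no recipe prefix and no fine-tuning stage.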

⚡ Installation

From PyPI (recommended)

pip install -U kerasformers

From Source

pip install -U git+https://github.com/IMvision12/KerasFormers
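After either install command, a quick way to confirm that the package is visible to the current Python environment is to query its distribution metadata. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the distribution name `kerasformers` is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "kerasformers") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("kerasformers is not installed in this environment")
else:
    print(f"kerasformers {version} is installed")
```

Reading the metadata avoids importing the package itself, so the check works even before a Keras backend has been installed or configured.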

📑 Documentation

Per-model guides, with architecture notes, usage examples, and available pretrained weights, live in the docs/ folder: one page per model across every supported task (classification, object detection, segmentation, depth estimation, feature extraction, vision-language, and speech recognition). Classification backbones share a single page because they all follow the same XModel / XImageClassify two-class structure; every other model has its own page. Browse docs/ for the complete, up-to-date list.

📑 Models

📝 Text Models

Text LLMs (text → text)

🏷️ Model Name	📜 Reference Paper	📦 Source of Weights
Qwen2	Qwen2 Technical Report	on-the-fly `hf:`
Qwen3	Qwen3 Technical Report	on-the-fly `hf:`
Qwen3.5	Qwen3 Technical Report	on-the-fly `hf:`

👁️ Vision Models

Backbones

🏷️ Model Name	📜 Reference Paper	📦 Source of Weights
CaiT	Going deeper with Image Transformers	`timm`
ConvMixer	Patches Are All You Need?	`timm`
ConvNeXt	A ConvNet for the 2020s	`timm`
ConvNeXt V2	ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders	`timm`
DeiT	Training data-efficient image transformers & distillation through attention	`timm`
DenseNet	Densely Connected Convolutional Networks	`timm`
EfficientNet	EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks	`timm`
EfficientNet-Lite	EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks	`timm`
EfficientNetV2	EfficientNetV2: Smaller Models and Faster Training	`timm`
FlexiViT	FlexiViT: One Model for All Patch Sizes	`timm`
InceptionNeXt	InceptionNeXt: When Inception Meets ConvNeXt	`timm`
Inception-ResNet-v2	Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning	`timm`
Inception-v3	Rethinking the Inception Architecture for Computer Vision	`timm`
Inception-v4	Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning	`timm`
MaxViT	MaxViT: Multi-Axis Vision Transformer	`timm`
MiT	SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers	`transformers`
MLP-Mixer	MLP-Mixer: An all-MLP Architecture for Vision	`timm`
MobileNetV2	MobileNetV2: Inverted Residuals and Linear Bottlenecks	`timm`
MobileNetV3	Searching for MobileNetV3	`keras`
MobileViT	MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer	`transformers`
MobileViTV2	Separable Self-attention for Mobile Vision Transformers	`transformers`
NextViT	Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios	`timm`
PiT	Rethinking Spatial Dimensions of Vision Transformers	`timm`
PoolFormer	MetaFormer is Actually What You Need for Vision	`timm`
Res2Net	Res2Net: A New Multi-scale Backbone Architecture	`timm`
ResMLP	ResMLP: Feedforward networks for image classification with data-efficient training	`timm`
ResNet	Deep Residual Learning for Image Recognition	`timm`
ResNetV2	Identity Mappings in Deep Residual Networks	`timm`
ResNeXt	Aggregated Residual Transformations for Deep Neural Networks	`timm`
SENet	Squeeze-and-Excitation Networks	`timm`
Swin Transformer	Swin Transformer: Hierarchical Vision Transformer using Shifted Windows	`timm`
Swin Transformer V2	Swin Transformer V2: Scaling Up Capacity and Resolution	`timm`
VGG	Very Deep Convolutional Networks for Large-Scale Image Recognition	`timm`
ViT	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale	`timm`
Xception	Xception: Deep Learning with Depthwise Separable Convolutions	`keras`

Object Detection

🏷️ Model Name	📜 Reference Paper	📦 Source of Weights
D-FINE	D-FINE: Redefine Regression Task of DETRs as Fine-grained Distribution Refinement	`transformers`
DETR	End-to-End Object Detection with Transformers	`transformers`
RT-DETR	DETRs Beat YOLOs on Real-time Object Detection	`transformers`
RT-DETRv2	RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformers	`transformers`
RF-DETR	RF-DETR: Neural Architecture Search for Real-Time Detection Transformers	`rfdetr`
OWL-ViT	Simple Open-Vocabulary Object Detection with Vision Transformers	`transformers`
OWLv2	Scaling Open-Vocabulary Object Detection	`transformers`

Segmentation

🏷️ Model Name	📜 Reference Paper	📦 Source of Weights
DeepLabV3	Rethinking Atrous Convolution for Semantic Image Segmentation	`torchvision`
EoMT	Your ViT is Secretly an Image Segmentation Model	`transformers`
MaskFormer	Per-Pixel Classification is Not All You Need for Semantic Segmentation	`transformers`
Mask2Former	Masked-attention Mask Transformer for Universal Image Segmentation	`transformers`
MobileViT	MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer	`transformers`
MobileViTV2	Separable Self-attention for Mobile Vision Transformers	`transformers`
SAM	Segment Anything	`transformers`
SAM2	SAM 2: Segment Anything in Images and Videos	`transformers`
SAM3	SAM 3: Segment Anything with Concepts	`transformers` (gated)
SegFormer	SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers	`transformers`

Feature Extraction

🏷️ Model Name	📜 Reference Paper	📦 Source of Weights
DINO	Emerging Properties in Self-Supervised Vision Transformers	`torch.hub`
DINOv2	DINOv2: Learning Robust Visual Features without Supervision	`transformers`
DINOv3	DINOv3: Self-Supervised Visual Representation Learning at Scale	`transformers` (gated)

Depth Estimation

🏷️ Model Name	📜 Reference Paper	📦 Source of Weights
Depth Anything V1	Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data	`transformers`
Depth Anything V2	Depth Anything V2	`transformers`

🖼️ Multimodal Models

Vision-Language Encoders

🏷️ Model Name	📜 Reference Paper	📦 Source of Weights
CLIP	Learning Transferable Visual Models From Natural Language Supervision	`transformers`
MetaCLIP 2	MetaCLIP 2: A Worldwide Scaling Recipe	`transformers`
SigLIP	Sigmoid Loss for Language Image Pre-Training	`transformers`
SigLIP2	SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features	`transformers`

Multimodal LLMs (image + text → text)

🏷️ Model Name	📜 Reference Paper	📦 Source of Weights
Qwen2-VL	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	on-the-fly `hf:`
Qwen2.5-VL	Qwen2.5-VL Technical Report	on-the-fly `hf:`
Qwen3-VL	Qwen3 Technical Report	on-the-fly `hf:`

🔊 Audio Models

Speech (speech → text)

🏷️ Model Name	📜 Reference Paper	📦 Source of Weights
Whisper	Robust Speech Recognition via Large-Scale Weak Supervision	`transformers`
Speech2Text	fairseq S2T: Fast Speech-to-Text Modeling with fairseq	`transformers`

📜 License

This project leverages timm and transformers for converting pretrained weights from PyTorch to Keras. For licensing details, please refer to the respective repositories.

🔖 kerasformers Code: This repository is licensed under the Apache 2.0 License.

🌟 Credits

The Keras team for their powerful and user-friendly deep learning framework
The Transformers library for its robust tools for loading and adapting pretrained models
The pytorch-image-models (timm) project for pioneering many computer vision model implementations
All contributors to the original papers and architectures implemented in this library

Citing

BibTeX

@misc{gc2025kerasformers,
  author = {Gitesh Chawda},
  title = {KerasFormers},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/IMvision12/KerasFormers}}
}
