KerasFormers is a collection of models with pretrained weights, built entirely with Keras 3. It supports a range of tasks, including classification, object detection (DETR, RT-DETR, RT-DETRv2, RF-DETR, D-FINE, OWL-ViT, OWLv2), segmentation (SAM, SAM2, SAM3, SegFormer, DeepLabV3, EoMT, MaskFormer, Mask2Former, MobileViT-DeepLabV3), monocular depth estimation (Depth Anything V1, Depth Anything V2), feature extraction (DINO, DINOv2, DINOv3), vision-language modeling (CLIP, SigLIP, SigLIP2, MetaCLIP 2), speech recognition (Whisper, Speech2Text), and more. It covers hybrid architectures like MaxViT alongside traditional CNNs and pure transformers, and it provides custom layers and backbone support so models can be reused across applications. Backbones ship with multiple pretrained weight variants, such as in1k, in21k, fb_dist_in1k, ms_in22k, fb_in22k_ft_in1k, ns_jft_in1k, aa_in1k, cvnets_in1k, augreg_in21k_ft_in1k, augreg_in21k, and many more.
From PyPI (recommended)
pip install -U kerasformers
From Source
pip install -U git+https://github.com/IMvision12/KerasFormers
Per-model guides, with architecture notes, usage examples, and available pretrained weights, live in the docs/ folder, one page per model across every supported task (classification, object detection, segmentation, depth estimation, feature extraction, vision-language, and speech recognition). Classification backbones share a single page since they all follow the same XModel / XImageClassify two-class structure; every other model has its own page. Browse docs/ for the complete, always-up-to-date list.
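After running either install command, one quick way to confirm that the package is visible to the current Python environment is to query its installed version. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the distribution name `kerasformers` is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "kerasformers" is the distribution name used in the pip command above.
version = installed_version("kerasformers")
if version is None:
    print("kerasformers is not installed; run: pip install -U kerasformers")
else:
    print(f"kerasformers {version} is installed")
```

The check only reads package metadata, so it does not import Keras or load any model weights.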
Text LLMs (text → text)

| Model Name | Reference Paper | Source of Weights |
|---|---|---|
| Qwen2 | Qwen2 Technical Report | on-the-fly hf: |
| Qwen3 | Qwen3 Technical Report | on-the-fly hf: |
| Qwen3.5 | Qwen3 Technical Report | on-the-fly hf: |

Backbones
Object Detection

| Model Name | Reference Paper | Source of Weights |
|---|---|---|
| D-FINE | D-FINE: Redefine Regression Task of DETRs as Fine-grained Distribution Refinement | transformers |
| DETR | End-to-End Object Detection with Transformers | transformers |
| RT-DETR | DETRs Beat YOLOs on Real-time Object Detection | transformers |
| RT-DETRv2 | RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformers | transformers |
| RF-DETR | RF-DETR: Neural Architecture Search for Real-Time Detection Transformers | rfdetr |
| OWL-ViT | Simple Open-Vocabulary Object Detection with Vision Transformers | transformers |
| OWLv2 | Scaling Open-Vocabulary Object Detection | transformers |

Segmentation

| Model Name | Reference Paper | Source of Weights |
|---|---|---|
| DeepLabV3 | Rethinking Atrous Convolution for Semantic Image Segmentation | torchvision |
| EoMT | Your ViT is Secretly an Image Segmentation Model | transformers |
| MaskFormer | Per-Pixel Classification is Not All You Need for Semantic Segmentation | transformers |
| Mask2Former | Masked-attention Mask Transformer for Universal Image Segmentation | transformers |
| MobileViT | MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer | transformers |
| MobileViTV2 | Separable Self-attention for Mobile Vision Transformers | transformers |
| SAM | Segment Anything | transformers |
| SAM2 | SAM 2: Segment Anything in Images and Videos | transformers |
| SAM3 | SAM 3: Segment Anything with Concepts | transformers (gated) |
| SegFormer | SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers | transformers |

Feature Extraction

| Model Name | Reference Paper | Source of Weights |
|---|---|---|
| DINO | Emerging Properties in Self-Supervised Vision Transformers | torch.hub |
| DINOv2 | DINOv2: Learning Robust Visual Features without Supervision | transformers |
| DINOv3 | DINOv3: Self-Supervised Visual Representation Learning at Scale | transformers (gated) |

Depth Estimation

| Model Name | Reference Paper | Source of Weights |
|---|---|---|
| Depth Anything V1 | Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data | transformers |
| Depth Anything V2 | Depth Anything V2 | transformers |

Vision-Language Encoders

| Model Name | Reference Paper | Source of Weights |
|---|---|---|
| CLIP | Learning Transferable Visual Models From Natural Language Supervision | transformers |
| MetaCLIP 2 | MetaCLIP 2: A Worldwide Scaling Recipe | transformers |
| SigLIP | Sigmoid Loss for Language Image Pre-Training | transformers |
| SigLIP2 | SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features | transformers |

Multimodal LLMs (image + text → text)

| Model Name | Reference Paper | Source of Weights |
|---|---|---|
| Qwen2-VL | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | on-the-fly hf: |
| Qwen2.5-VL | Qwen2.5-VL Technical Report | on-the-fly hf: |
| Qwen3-VL | Qwen3 Technical Report | on-the-fly hf: |

Speech (speech → text)

| Model Name | Reference Paper | Source of Weights |
|---|---|---|
| Whisper | Robust Speech Recognition via Large-Scale Weak Supervision | transformers |
| Speech2Text | fairseq S2T: Fast Speech-to-Text Modeling with fairseq | transformers |

This project leverages timm and transformers for converting pretrained weights from PyTorch to Keras. For licensing details, please refer to the respective repositories.
- kerasformers Code: This repository is licensed under the Apache 2.0 License.
- The Keras team for their powerful and user-friendly deep learning framework
- The Transformers library for its robust tools for loading and adapting pretrained models
- The pytorch-image-models (timm) project for pioneering many computer vision model implementations
- All contributors to the original papers and architectures implemented in this library
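The PyTorch-to-Keras weight conversion mentioned above usually comes down to reordering tensor axes, because the two frameworks store kernels in different layouts. The sketch below illustrates that general idea with NumPy only; it is not this project's conversion code, and the helper names and array shapes are hypothetical. It assumes standard (non-depthwise) 2-D convolutions and dense layers.

```python
import numpy as np


def torch_conv2d_to_keras(weight: np.ndarray) -> np.ndarray:
    """Reorder a PyTorch Conv2d weight (out, in, kh, kw) to Keras Conv2D (kh, kw, in, out)."""
    if weight.ndim != 4:
        raise ValueError(f"expected a 4-D conv weight, got shape {weight.shape}")
    return np.transpose(weight, (2, 3, 1, 0))


def torch_linear_to_keras(weight: np.ndarray) -> np.ndarray:
    """Transpose a PyTorch Linear weight (out, in) to a Keras Dense kernel (in, out)."""
    if weight.ndim != 2:
        raise ValueError(f"expected a 2-D linear weight, got shape {weight.shape}")
    return weight.T


# Stand-ins for arrays read from a PyTorch state_dict (hypothetical shapes).
conv_w = np.arange(8 * 3 * 5 * 5, dtype=np.float32).reshape(8, 3, 5, 5)
dense_w = np.arange(10 * 4, dtype=np.float32).reshape(10, 4)

keras_conv = torch_conv2d_to_keras(conv_w)
keras_dense = torch_linear_to_keras(dense_w)

print(keras_conv.shape)   # (5, 5, 3, 8)
print(keras_dense.shape)  # (4, 10)
```

Bias vectors and normalization parameters are one-dimensional in both frameworks, so they typically copy over without any reordering.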
@misc{gc2025kerasformers,
author = {Gitesh Chawda},
title = {KerasFormers},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/IMvision12/KerasFormers}}