mit1208/Document-AI

Document AI: Advanced Layout Analysis & Multimodal Understanding

This repository showcases a series of advanced projects in Document AI, focusing on document layout analysis, multimodal grounding, and high-fidelity segmentation. Each notebook is designed for experimentation and can be launched directly in Google Colab.


🚀 Key Features

  • Document Layout Analysis: Leveraging LayoutLMv3 and UDOP for structural understanding.
  • Segmentation: Utilizing the Segment Anything Model (SAM) for granular document element identification.
  • Multimodal Grounding: Fine-tuning KOSMOS-2 for vision-language tasks.
  • Hugging Face Integration: Built primarily with the transformers and datasets libraries.

📊 Datasets & Models

Component | Description
----------|--------------------------------
Datasets  | Primarily DocLayNet
Models    | LayoutLMv3, KOSMOS-2, SAM, UDOP
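
LayoutLMv3 takes bounding boxes on a 0 to 1000 scale, while COCO-style annotations (the format DocLayNet is distributed in) store boxes as [x, y, width, height] in pixels. The following is a minimal sketch of that conversion, assuming COCO-style input. The function name `coco_to_layoutlm_box` is hypothetical and is not taken from the notebooks, which may handle this step differently.

```python
def coco_to_layoutlm_box(bbox, page_width, page_height):
    """Convert a COCO-style [x, y, width, height] pixel box into
    [x0, y0, x1, y1] on the 0-1000 scale used by LayoutLM-family models.

    Hypothetical helper for illustration only.
    """
    x, y, w, h = bbox

    def scale(value, size):
        # Multiply before dividing to limit rounding error, then clamp
        # so boxes that spill past the page edge stay in range.
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(x, page_width),
        scale(y, page_height),
        scale(x + w, page_width),
        scale(y + h, page_height),
    ]


if __name__ == "__main__":
    # A 100x50 px box at (50, 25) on a 500x250 px page.
    print(coco_to_layoutlm_box([50, 25, 100, 50], 500, 250))
```

Clamping is a design choice here: it keeps slightly out-of-bounds annotations usable instead of raising an error.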

📓 Notebooks

Project                | Description                                                  | Link
-----------------------|--------------------------------------------------------------|------
LayoutLMv3 Fine-tuning | Fine-tune LayoutLMv3 on DocLayNet using the HF Trainer.      | Colab
KOSMOS-2 Grounding     | Instruction tuning for multimodal grounding tasks.           | Colab
LayoutLMv3 Inference   | Rapid inference and visualization script for LayoutLMv3.     | Colab
SAM Segmentation       | Apply the Segment Anything Model to complex documents.       | Colab
UDOP Encoder Tuning    | Fine-tune the Universal Document Processing encoder.         | Colab
UDOP Inference         | Inference and structural analysis using UDOP.                | Colab
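
For the grounding notebook, it can help to see how a bounding box might be turned into KOSMOS-2 style location tokens. The sketch below assumes a 32 x 32 grid of location bins, tokens of the form `<patch_index_NNNN>`, and boxes given as normalized (x0, y0, x1, y1) in the range 0 to 1. The helper names `bbox_to_patch_tokens` and `grounded_phrase` are hypothetical. In practice the notebook may rely on the model's processor for this step rather than hand-written code.

```python
import math


def bbox_to_patch_tokens(bbox, bins_per_side=32):
    """Map a normalized (x0, y0, x1, y1) box to two location tokens:
    one for the top-left bin and one for the bottom-right bin.

    Hypothetical helper; the grid size and token format are assumptions.
    """
    x0, y0, x1, y1 = bbox
    last = bins_per_side - 1

    def clamp(index):
        # Keep indices inside the grid even when a coordinate is exactly 1.0.
        return max(0, min(last, index))

    ul_x = clamp(math.floor(x0 * bins_per_side))
    ul_y = clamp(math.floor(y0 * bins_per_side))
    lr_x = clamp(math.ceil(x1 * bins_per_side - 1))
    lr_y = clamp(math.ceil(y1 * bins_per_side - 1))

    # Bins are numbered row by row, starting at the top-left corner.
    ul_index = ul_y * bins_per_side + ul_x
    lr_index = lr_y * bins_per_side + lr_x
    return f"<patch_index_{ul_index:04d}>", f"<patch_index_{lr_index:04d}>"


def grounded_phrase(phrase, bbox):
    """Wrap a phrase and its box in grounding markup (assumed format)."""
    upper_left, lower_right = bbox_to_patch_tokens(bbox)
    return f"<phrase>{phrase}</phrase><object>{upper_left}{lower_right}</object>"


if __name__ == "__main__":
    print(grounded_phrase("Table", (0.5, 0.25, 0.75, 0.5)))
```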

🛠️ Requirements

Typical requirements across these notebooks include:

  • transformers
  • datasets
  • torch
  • pillow
  • accelerate

Each notebook contains its own dependency installation cell for convenience.
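
When running outside Colab, a quick check of which of these packages are importable can save a failed run. The following is a small sketch using only the standard library. The function name `missing_packages` is hypothetical, and the mapping reflects the fact that the `pillow` package is imported as `PIL`.

```python
from importlib.util import find_spec

# Package name as installed -> top-level module name as imported.
IMPORT_NAMES = {
    "transformers": "transformers",
    "datasets": "datasets",
    "torch": "torch",
    "pillow": "PIL",
    "accelerate": "accelerate",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the package names whose modules cannot be found.

    Hypothetical helper; it checks availability only, not versions.
    """
    return [
        package
        for package, module in import_names.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```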
