This repository showcases a series of advanced projects in Document AI, focusing on document layout analysis, multimodal grounding, and high-fidelity segmentation. Each notebook is designed for experimentation and can be launched directly in Google Colab.
- Document Layout Analysis: Leveraging LayoutLMv3 and UDOP for structural understanding.
- Segmentation: Utilizing the Segment Anything Model (SAM) for granular document element identification.
- Multimodal Grounding: Fine-tuning KOSMOS-2 for vision-language tasks.
- Hugging Face Integration: Built primarily with the `transformers` and `datasets` libraries.
| Component | Description |
|---|---|
| Datasets | Primarily DocLayNet |
| Models | LayoutLMv3, KOSMOS-2, SAM, UDOP |
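As a small illustration of how the dataset and models in the table fit together: LayoutLMv3 expects bounding boxes as `[x0, y0, x1, y1]` scaled to a 0-1000 range. The sketch below assumes the layout annotations arrive as COCO-style `[x, y, width, height]` boxes in pixels; the function name `coco_to_layoutlm_box` is hypothetical and not taken from the notebooks.

```python
def coco_to_layoutlm_box(bbox, page_width, page_height):
    """Convert a COCO-style [x, y, w, h] pixel box to [x0, y0, x1, y1] on a 0-1000 scale.

    Assumption: `bbox` is in pixels relative to a page of size
    page_width x page_height. This helper is illustrative only.
    """
    x, y, w, h = bbox

    def scale(value, size):
        # Scale to the 0-1000 range and clamp to guard against boxes
        # that slightly overshoot the page edge.
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(x, page_width),
        scale(y, page_height),
        scale(x + w, page_width),
        scale(y + h, page_height),
    ]


if __name__ == "__main__":
    print(coco_to_layoutlm_box([100, 200, 300, 400], page_width=1000, page_height=2000))
```

Clamping is a design choice here: it keeps slightly out-of-bounds annotations usable rather than raising an error, which may or may not suit a given notebook.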
Typical requirements across these notebooks include:
- `transformers`
- `datasets`
- `torch`
- `pillow`
- `accelerate`
Each notebook contains its own dependency installation cell for convenience.
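When running outside Colab, it can help to check which of these packages are already available before opening a notebook. The following is a minimal sketch using only the standard library; the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced for illustration, and the notebooks' own installation cells remain the authoritative source for exact versions.

```python
from importlib.util import find_spec

# Maps the pip package name to the module name used at import time.
# Note that the `pillow` distribution is imported as `PIL`.
REQUIRED = {
    "transformers": "transformers",
    "datasets": "datasets",
    "torch": "torch",
    "pillow": "PIL",
    "accelerate": "accelerate",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, import_name in required.items()
        if find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", " ".join(missing))
    else:
        print("All typical requirements are installed.")
```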