Code and resources for the article “Enhancing Technical Question Answering Quality through Multimodal Document Segmentation”.
This module provides functionality for analyzing the structure of documents (images and PDFs) using computer vision and natural language processing. The main classes are LayoutExtractor (structure analysis) and ImageDescription (document element description).
A class for extracting structural elements from documents and processing them.
- Detecting bounding boxes of document elements using YOLOv10
- Processing PDFs and images
- Merging duplicate and overlapping bounding boxes
- Linking related elements (e.g., images with captions)
- Encoding images in base64
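The base64 step in the last bullet can be done with the standard library alone; a minimal sketch (not the module's exact helper):

```python
import base64
from pathlib import Path

def encode_bytes(data: bytes) -> str:
    """Base64-encode raw image bytes into an ASCII string."""
    return base64.b64encode(data).decode("utf-8")

def encode_image(path: str) -> str:
    """Read an image file from disk and return its contents base64-encoded,
    e.g. for embedding in an API request payload."""
    return encode_bytes(Path(path).read_bytes())
```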
Key methods:
- get_bboxes(): main method for retrieving bounding boxes
- merge_duplicated(): merges duplicate bounding boxes
- _find_closest_bboxes(): finds related elements (e.g., image–caption pairs)
- _merge_related_bboxes(): merges related elements
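A common way to merge duplicate detections is to compare boxes by intersection-over-union (IoU). The sketch below illustrates the idea behind merge_duplicated(); the greedy strategy and the 0.8 threshold are illustrative assumptions, not the repository's actual implementation:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def merge_duplicates(boxes, threshold=0.8):
    """Greedily absorb any box whose IoU with an already-kept box
    exceeds the threshold, expanding the kept box to cover both."""
    merged = []
    for box in boxes:
        for i, kept in enumerate(merged):
            if iou(box, kept) > threshold:
                merged[i] = (min(kept[0], box[0]), min(kept[1], box[1]),
                             max(kept[2], box[2]), max(kept[3], box[3]))
                break
        else:
            merged.append(box)
    return merged
```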
Detected element types:
- Titles, body text
- Images, tables, formulas
- Captions for images/tables/formulas
- Table footnotes
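Linking an image to its caption (as _find_closest_bboxes() does) usually reduces to a proximity heuristic over the boxes above. The sketch below uses one hypothetical rule, picking the horizontally overlapping caption nearest below the image, and is an illustration rather than the repository's method:

```python
def find_closest_caption(image_box, caption_boxes, max_gap=50):
    """Return the caption box closest below image_box, or None.
    A candidate must horizontally overlap the image and start within
    max_gap pixels of the image's bottom edge (assumed heuristic)."""
    x1, y1, x2, y2 = image_box
    best, best_gap = None, max_gap
    for cap in caption_boxes:
        cx1, cy1, cx2, cy2 = cap
        gap = cy1 - y2  # vertical distance from image bottom to caption top
        overlaps = min(x2, cx2) > max(x1, cx1)  # horizontal overlap check
        if overlaps and 0 <= gap < best_gap:
            best, best_gap = cap, gap
    return best
```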
A class for describing document elements using language models.
- Generating text descriptions of document elements
- Recognizing text within bounding boxes
- Supporting both API mode (via OpenAI) and local models (transformers)
- Handling both individual bounding boxes and arbitrary regions
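In API mode, a cropped element is typically sent to a vision-capable chat model with the image inlined as a base64 data URL. A sketch of an OpenAI-style message payload (the prompt text and PNG format are assumptions; the actual request would go through the OpenAI client, e.g. client.chat.completions.create):

```python
def build_vision_message(prompt: str, image_b64: str) -> list:
    """Build a chat message in the OpenAI vision format, pairing a text
    instruction with a base64-encoded image passed as a data URL."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }]
```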
Key methods:
- inference(): main method for obtaining an element description
- _parse_json(): post-processes model output
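Language models often wrap their JSON answer in prose or a markdown code fence, so post-processing like _parse_json() has to tolerate that. A minimal sketch of the idea, assuming the response contains a single JSON object:

```python
import json
import re

def parse_json(raw: str):
    """Extract the first JSON object from a model response, tolerating
    surrounding prose or markdown fences. Returns None if nothing parses."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```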
See run.py and /example/segmentation_example.ipynb for usage examples.