This repository implements a signature, logo and stamp detection pipeline using the Qwen2.5-VL Vision-Language Model. It includes a complete workflow for dataset preparation, fine-tuning (optional), and robust evaluation.
Object_detection_in_documents/
├── utils/ # Utility scripts and helper functions
│ ├── extract_pdf.py # PDF to image extraction
│ ├── Qwen_25_LLM.py # Wrapper handling model loading and inference via Hugging Face
│ ├── signature_benchmark.py # Wrapper handling evaluation of the model on the test data
│ └── train_qwen.py # Script to train the base model
│
├── download_data/ # Dataset download and preparation scripts
│ └── download_signature_data.py # Downloads raw data from Roboflow
├── data_final/ # Final processed dataset (70% Train / 20% Valid / 10% Test), created after running the download script
├── pdf_to_image/ # Output folder for PDF page extractions
├── output/ # Model inference and evaluation results
├── Inference_Qwen_25_VL.ipynb # Main Jupyter notebook for inference
├── main.py # Main script entry point
├── pdf_sample.pdf # Sample PDF for testing
├── requirements.txt # Python dependencies
└── README.md # This file
- utils/: Utility functions including PDF extraction and data processing
- download_data/: Scripts for downloading and preparing the dataset
- data_final/: Processed dataset with train/validation/test splits and augmentation
- output/: Results from model inference and evaluation metrics
Create a virtual environment: create and activate a Python environment, then install the dependencies from requirements.txt.
Before running any notebooks, you must run the data preparation script. This script handles:
- Downloading the raw dataset from Roboflow.
- Aggregating all images (ignoring original splits).
- Creating a custom 70% Train / 20% Valid / 10% Test split.
- Applying 3x Augmentation (Rotation, Noise, Perspective) only to the training set.
- Formatting annotations into the required JSONL format.
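The last step writes one JSON object per line. As a minimal sketch, a single JSONL record might look like the following; the field names and the `[x1, y1, x2, y2]` box convention are illustrative assumptions, and the actual schema produced by download_signature_data.py may differ.

```python
import json

# Hypothetical annotation record; field names are assumptions,
# not the exact schema emitted by download_signature_data.py.
record = {
    "image": "data_final/train/doc_0001.png",
    "objects": [
        # bbox as [x1, y1, x2, y2] in pixel coordinates
        {"label": "signature", "bbox": [120, 540, 310, 610]},
        {"label": "stamp", "bbox": [400, 80, 520, 200]},
    ],
}

# Each record occupies exactly one line in the .jsonl file.
line = json.dumps(record)
print(line)
```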
# Run this once to generate the 'data_final/' folder
python download_data/download_signature_data.py

Once the data_final/ folder is generated, the Jupyter Notebook usage is straightforward: it will automatically load the test set and run the evaluation pipeline.

jupyter notebook Inference_Qwen_25_VL.ipynb

The SignatureBenchmark script evaluates the Qwen-VL model's performance across two primary dimensions: Localization (how accurately it finds objects) and Classification (how accurately it identifies what it found).
These metrics evaluate the model's ability to draw bounding boxes exactly where the ground-truth objects (signatures, logos, stamps) are located, independent of whether it guessed the correct label. A prediction is only considered a "True Positive" if its overlap with the ground truth exceeds the configured iou_threshold (default is 0.70).
- IoU (Intersection over Union): The core spatial metric. It measures the area of overlap between the predicted bounding box and the ground-truth box, divided by the area of their union.
- Mean IoU / Median IoU: The average and median overlap scores, calculated only for successfully matched boxes (True Positives). This indicates how "tight" the bounding boxes are when the model gets it right.
- Precision: The percentage of predicted bounding boxes that correctly matched a ground-truth box ($\frac{TP}{TP + FP}$). High precision means the model rarely hallucinates fake objects.
- Recall: The percentage of actual ground-truth boxes that the model successfully found and bounded ($\frac{TP}{TP + FN}$). High recall means the model rarely misses real objects.
- F1-Score (localization_f1_at_{threshold}): The harmonic mean of Precision and Recall. This is the primary single-number metric for overall detection performance at the specified IoU threshold.
- True Negatives (Document-Level): Evaluates the model's ability to correctly stay silent. If a document contains none of the target objects and the model predicts zero bounding boxes, it is counted as a True Negative.
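The matching logic above can be sketched as follows. This is a minimal, self-contained illustration of IoU-based greedy matching and the resulting precision/recall/F1; the function names are hypothetical and the actual implementation in signature_benchmark.py may match boxes differently.

```python
def iou(box_a, box_b):
    """Intersection over Union for [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def localization_counts(preds, gts, iou_threshold=0.70):
    """Greedy matching: a prediction is a TP only if its best IoU
    with an unmatched ground-truth box reaches the threshold."""
    matched, tp = set(), 0
    for p in preds:
        best, best_iou = None, 0.0
        for i, g in enumerate(gts):
            if i in matched:
                continue
            v = iou(p, g)
            if v > best_iou:
                best, best_iou = i, v
        if best is not None and best_iou >= iou_threshold:
            matched.add(best)
            tp += 1
    fp = len(preds) - tp          # predicted boxes with no match
    fn = len(gts) - tp            # ground-truth boxes never found
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return tp, fp, fn, precision, recall, f1
```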
These metrics are calculated strictly on the successfully localized boxes (True Positives). Once the model successfully draws a box around an object, these metrics evaluate if it assigned the correct text label (e.g., confusing a "stamp" for a "logo").
- Classification Accuracy: The percentage of successfully localized boxes that were assigned the correct ground-truth label.
- Confusion Matrix: A visual plot (confusion_matrix.png) showing the exact misclassifications between labels (e.g., how many times a 'logo' was predicted as a 'stamp').
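In sketch form, both metrics reduce to counting over the matched (True Positive) pairs. The sample pairs below are illustrative, not real benchmark output.

```python
from collections import Counter

# Hypothetical (predicted_label, ground_truth_label) pairs,
# one per successfully localized box (True Positive).
matches = [("signature", "signature"), ("logo", "stamp"), ("stamp", "stamp")]

# Classification accuracy: fraction of matched boxes with the correct label.
accuracy = sum(p == g for p, g in matches) / len(matches)

# Confusion counts keyed as (ground truth, prediction); these are the
# raw numbers behind a plot like confusion_matrix.png.
confusion = Counter((g, p) for p, g in matches)

print(accuracy)                      # 2/3 here
print(confusion[("stamp", "logo")])  # one 'stamp' predicted as 'logo'
```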
Because object detection datasets are often heavily imbalanced (e.g., many more signatures than logos), the script breaks down performance individually for every label.
- Per-Class Precision, Recall, & F1: The standard metrics calculated specifically for a single class (e.g., the F1 score strictly for identifying 'signatures').
- Class TP (True Positives): The model correctly localized and correctly labeled the specific class.
- Class FP (False Positives): The model either hallucinated a box for this class where none existed, or it found an object but assigned it this incorrect label.
- Class FN (False Negatives): The model failed to localize an object of this class, or it localized it but gave it the wrong label.
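Given per-class TP/FP/FN counts as defined above, the per-class breakdown is the standard precision/recall/F1 computation applied one label at a time. The counts in this sketch are illustrative, not results from the actual benchmark.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall, and F1 from one class's TP/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Illustrative per-class counts: label -> (TP, FP, FN).
counts = {"signature": (40, 5, 3), "logo": (8, 2, 2), "stamp": (12, 1, 4)}
for label, (tp, fp, fn) in counts.items():
    p, r, f1 = per_class_metrics(tp, fp, fn)
    print(f"{label}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
```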
While IoU is the primary evaluation metric, the script also calculates center-point distances to help diagnose consistent offset or "sliding box" errors.
- IoP (Intersection over Prediction): Similar to IoU, but the denominator is only the area of the predicted box. Useful for identifying if predictions are consistently too large or too small.
- Center Distance (px): The absolute distance in pixels between the exact center of the predicted box and the center of the ground truth box.
- Normalized Center Distance: The center distance divided by the total diagonal length of the image, providing a resolution-agnostic measure of how far off the prediction's center point drifted.
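These diagnostics can be sketched as below, assuming the same [x1, y1, x2, y2] box convention; the function names are hypothetical and the actual formulas in signature_benchmark.py may differ in detail.

```python
import math

def iop(pred, gt):
    """Intersection over Prediction: overlap area / predicted box area.
    Near 1.0 with low IoU suggests the predicted box is too small;
    low IoP suggests it is too large."""
    x1 = max(pred[0], gt[0]); y1 = max(pred[1], gt[1])
    x2 = min(pred[2], gt[2]); y2 = min(pred[3], gt[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    pred_area = (pred[2] - pred[0]) * (pred[3] - pred[1])
    return inter / pred_area if pred_area else 0.0

def center_distance(pred, gt, img_w=None, img_h=None):
    """Pixel distance between box centers; if image dimensions are
    given, normalize by the image diagonal for a resolution-agnostic value."""
    cx_p, cy_p = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_g, cy_g = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    dist = math.hypot(cx_p - cx_g, cy_p - cy_g)
    if img_w and img_h:
        return dist / math.hypot(img_w, img_h)
    return dist
```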