OverLay++: Dense-Overlap Layout-to-Image Generation Dataset

A pipeline for building a dataset of high-quality images containing multiple overlapping objects, annotated with bounding boxes, global captions, and per-object descriptions. It uses a Qwen3-VL vision-language model (served via vLLM) for detection, validation, and captioning.

Overview

Each sample in the produced dataset consists of:

  • A 1024×1024 image (center-cropped from a larger source image).
  • Per-object metadata: category, bbox, local_prompt, short_local_prompt.
  • Image-level captions: global_caption, short_global_caption.
  • The source URL (joined from the upstream aesthetics dataset metadata).

Bounding boxes are constrained to be reasonably sized and to overlap with at least one other box in the scene.
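
In practice this amounts to a size bound plus a pairwise intersection test over axis-aligned boxes. The sketch below is illustrative only; the helper names and thresholds are assumptions, not the exact filtering logic in process_images.py.

def boxes_overlap(a, b):
    """True if two [x1, y1, x2, y2] boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def filter_detections(detections, min_frac=0.01, max_frac=0.8, size=1024):
    """Keep reasonably sized boxes that overlap at least one other kept box.

    min_frac / max_frac are illustrative bounds, not the repository's values.
    """
    def area(b):
        return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

    sized = [d for d in detections
             if min_frac <= area(d["bbox"]) / (size * size) <= max_frac]
    return [d for d in sized
            if any(boxes_overlap(d["bbox"], o["bbox"]) for o in sized if o is not d)]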

Pipeline

The scripts are intended to be run roughly in this order:

  1. extract_aesthetics_dataset.py — Downloads images and metadata from the upstream aesthetics parquet dataset, filtered by resolution (≥1024×1024) and aesthetic score. Saves images to images/ and one metadata_<part>.jsonl per parquet shard.
  2. process_images.py — Center-crops each image to 1024×1024 (a minimal crop sketch follows this list), runs object detection with VLLMObjectDetector, validates each detection with a per-object yes/no check on the cropped region, and writes per-image *_metadata.json. Images that do not yield enough validated objects are discarded.
  3. add_global_caption.py — Adds a global_caption field to each metadata file.
  4. add_short_global_caption.py — Adds a short (<20 words) short_global_caption field.
  5. add_object_metadata.py — Adds a short_local_prompt (<20 words) for every detected object.
  6. combine_metadata_with_urls.py — Joins per-image metadata with source URLs from the aesthetics JSONL files and writes a single combined_metadata.jsonl. Only samples with all required caption/prompt fields are kept.
  7. annotate_images.py — Renders bounding boxes and labels onto each image using draw_bboxes.py, producing visualization images.
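
The center crop in step 2 is the standard operation; here is a minimal Pillow sketch, with a hypothetical function name rather than the one used in process_images.py:

from PIL import Image

def center_crop(path, size=1024):
    # Center-crop to size x size; assumes both dimensions are at least `size`,
    # which the >=1024x1024 resolution filter in step 1 guarantees.
    img = Image.open(path).convert("RGB")
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))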

Supporting modules

  • detect_objects.py — Defines VLLMObjectDetector, a wrapper around vLLM that handles object detection, batched validation, and caption generation for Qwen-VL models.
  • draw_bboxes.py — Drawing utilities for overlapping bounding boxes with auto-placed, non-overlapping labels.
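
For orientation only, a caller might use the detector roughly as follows; the constructor arguments and method name here are assumptions and may not match the actual VLLMObjectDetector interface in detect_objects.py:

from detect_objects import VLLMObjectDetector

# Hypothetical usage sketch; argument and method names are assumed, not verified.
detector = VLLMObjectDetector(model_name="Qwen/Qwen3-VL-32B-Instruct")
detections = detector.detect("cropped_image.jpg")
for det in detections:
    print(det["category"], det["bbox"])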

Installation

pip install -r requirements.txt

Requires CUDA-capable GPUs for vLLM inference. Multiple GPUs are supported: the available GPU count is detected automatically and used for tensor parallelism.

Usage

Most scripts hard-code paths inside their __main__ block (input image dir, metadata dir, model name, etc.). Edit those before running, then:

python extract_aesthetics_dataset.py [num_workers]
python process_images.py
python add_global_caption.py
python add_short_global_caption.py
python add_object_metadata.py
python combine_metadata_with_urls.py \
    --image-dir /path/to/images \
    --metadata-dir /path/to/metadata \
    --aesthetics-path /path/to/aesthetics_extracted \
    --output /path/to/combined_metadata.jsonl
python annotate_images.py

The default model is Qwen/Qwen3-VL-32B-Instruct. Adjust tensor_parallel_size, gpu_memory_utilization, and batch_size in each script to fit your hardware.
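
For reference, a vLLM engine for the default model can be configured along these lines; the exact construction inside the scripts may differ, but the parameter names below are standard vLLM options:

import torch
from vllm import LLM

# Illustrative vLLM setup mirroring the knobs mentioned above.
llm = LLM(
    model="Qwen/Qwen3-VL-32B-Instruct",
    tensor_parallel_size=torch.cuda.device_count(),  # shard across all visible GPUs
    gpu_memory_utilization=0.9,  # fraction of each GPU's memory vLLM may claim
)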

Output metadata format

Each per-image JSON file looks like:

{
  "detections": [
    {
      "category": "...",
      "bbox": [x1, y1, x2, y2],
      "local_prompt": "...",
      "short_local_prompt": "..."
    }
  ],
  "global_caption": "...",
  "short_global_caption": "..."
}

After running combine_metadata_with_urls.py, each line of combined_metadata.jsonl additionally contains image_filename and url fields.
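
The combined file can be read one JSON record per line, for example:

import json

with open("combined_metadata.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        print(sample["image_filename"], sample["url"], sample["short_global_caption"])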
