# Generation Dataset
A pipeline for building a dataset of high-quality images containing multiple overlapping objects, annotated with bounding boxes, global captions, and per-object descriptions. It uses a Qwen3-VL vision-language model (served via vLLM) for detection, validation, and captioning.
Each sample in the produced dataset consists of:
- A 1024×1024 image (center-cropped from a larger source image).
- Per-object metadata: `category`, `bbox`, `local_prompt`, `short_local_prompt`.
- Image-level captions: `global_caption`, `short_global_caption`.
- A source URL (combined from the upstream aesthetics dataset metadata).
Bounding boxes are constrained to be reasonably sized and to overlap with at least one other box in the scene.
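The exact size thresholds are not stated here; a minimal sketch of this kind of filter, with `min_area` and `max_area` as placeholder values, might look like:

```python
def boxes_overlap(a, b):
    """True if two [x1, y1, x2, y2] boxes intersect with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def filter_boxes(boxes, min_area=64 * 64, max_area=768 * 768):
    """Keep boxes within a size range that also overlap at least one
    other kept-size box. Thresholds here are illustrative placeholders."""
    sized = [b for b in boxes if min_area <= (b[2] - b[0]) * (b[3] - b[1]) <= max_area]
    return [
        b for i, b in enumerate(sized)
        if any(boxes_overlap(b, o) for j, o in enumerate(sized) if i != j)
    ]
```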
The scripts are intended to be run roughly in this order:
1. `extract_aesthetics_dataset.py` — Downloads images and metadata from the upstream aesthetics parquet dataset, filtered by resolution (≥1024×1024) and aesthetic score. Saves images to `images/` and one `metadata_<part>.jsonl` per parquet shard.
2. `process_images.py` — Center-crops each image to 1024×1024, runs object detection with `VLLMObjectDetector`, validates detections (a per-object yes/no check on the cropped region), and writes per-image `*_metadata.json`. Discards images that don't yield enough validated objects.
3. `add_global_caption.py` — Adds a `global_caption` field to each metadata file.
4. `add_short_global_caption.py` — Adds a short (<20 words) `short_global_caption` field.
5. `add_object_metadata.py` — Adds a `short_local_prompt` (<20 words) for every detected object.
6. `combine_metadata_with_urls.py` — Joins per-image metadata with source URLs from the aesthetics JSONL files and writes a single `combined_metadata.jsonl`. Only samples with all required caption/prompt fields are kept.
7. `annotate_images.py` — Renders bounding boxes and labels onto each image using `draw_bboxes.py`, producing visualization images.
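The center-crop step in `process_images.py` reduces to simple box arithmetic. A sketch of that computation (the helper name is illustrative, not from the script):

```python
def center_crop_box(width, height, size=1024):
    """Compute the (left, upper, right, lower) box of a centered size×size crop.

    Usable directly with Pillow, e.g.:
        img.crop(center_crop_box(*img.size))
    """
    if width < size or height < size:
        raise ValueError("image smaller than crop size")
    left = (width - size) // 2
    upper = (height - size) // 2
    return (left, upper, left + size, upper + size)
```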
Supporting modules:
- `detect_objects.py` — Defines `VLLMObjectDetector`, a wrapper around vLLM that handles object detection, batched validation, and caption generation for Qwen-VL models.
- `draw_bboxes.py` — Drawing utilities for overlapping bounding boxes with auto-placed, non-overlapping labels.
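The actual placement algorithm in `draw_bboxes.py` is not reproduced here; a greedy strategy like the following is one common way to keep labels from colliding (candidate positions and dimensions are illustrative assumptions):

```python
def rects_overlap(a, b):
    """True if two (x1, y1, x2, y2) rectangles intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def place_labels(bboxes, label_w=120, label_h=24, canvas=1024):
    """Greedily pick one label rectangle per bbox, trying a few anchor
    positions and skipping any that leave the canvas or hit a placed label."""
    placed = []
    for x1, y1, x2, y2 in bboxes:
        candidates = [
            (x1, y1 - label_h),  # above the top-left corner
            (x1, y2),            # below the bottom-left corner
            (x1, y1),            # inside the top-left corner
        ]
        for cx, cy in candidates:
            rect = (cx, cy, cx + label_w, cy + label_h)
            inside = cx >= 0 and cy >= 0 and rect[2] <= canvas and rect[3] <= canvas
            if inside and not any(rects_overlap(rect, p) for p in placed):
                placed.append(rect)
                break
        else:
            # No free candidate: accept an overlapping default position.
            placed.append((x1, y1, x1 + label_w, y1 + label_h))
    return placed
```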
```shell
pip install -r requirements.txt
```

Requires CUDA-capable GPUs for vLLM inference. Multi-GPU is supported and auto-detected via tensor parallelism.
Most scripts hard-code paths inside their `__main__` block (input image dir, metadata dir, model name, etc.). Edit those before running, then:
```shell
python extract_aesthetics_dataset.py [num_workers]
python process_images.py
python add_global_caption.py
python add_short_global_caption.py
python add_object_metadata.py
python combine_metadata_with_urls.py \
    --image-dir /path/to/images \
    --metadata-dir /path/to/metadata \
    --aesthetics-path /path/to/aesthetics_extracted \
    --output /path/to/combined_metadata.jsonl
python annotate_images.py
```

The default model is `Qwen/Qwen3-VL-32B-Instruct`. Adjust `tensor_parallel_size`, `gpu_memory_utilization`, and `batch_size` in each script to fit your hardware.
Each per-image JSON file looks like:
```json
{
  "detections": [
    {
      "category": "...",
      "bbox": [x1, y1, x2, y2],
      "local_prompt": "...",
      "short_local_prompt": "..."
    }
  ],
  "global_caption": "...",
  "short_global_caption": "..."
}
```

After running `combine_metadata_with_urls.py`, each line of `combined_metadata.jsonl` additionally contains `image_filename` and `url`.
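The join performed by `combine_metadata_with_urls.py` can be sketched as below. The function name, input shapes, and the set of required fields are assumptions for illustration; the real script also checks per-object prompt fields:

```python
import json

def combine(metadata_by_filename, url_by_filename):
    """Join per-image metadata with source URLs, emitting one JSON line per
    sample and dropping samples missing a URL or any required field."""
    required = ("detections", "global_caption", "short_global_caption")
    lines = []
    for filename, meta in metadata_by_filename.items():
        url = url_by_filename.get(filename)
        if url is None or not all(meta.get(key) for key in required):
            continue  # incomplete sample: skip rather than emit partial data
        lines.append(json.dumps({**meta, "image_filename": filename, "url": url}))
    return "\n".join(lines)
```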