# Generation Dataset
A pipeline for building a dataset of high-quality images containing multiple overlapping objects, annotated with bounding boxes, global captions, and per-object descriptions. It uses a Qwen3-VL vision-language model (served via vLLM) for detection, validation, and captioning.
Each sample in the produced dataset consists of:
- A 1024×1024 image (center-cropped from a larger source image).
- Per-object metadata: `category`, `bbox`, `local_prompt`, `short_local_prompt`.
- Image-level captions: `global_caption`, `short_global_caption`.
- A source URL (combined from the upstream aesthetics dataset metadata).
Bounding boxes are constrained to be reasonably sized and to overlap with at least one other box in the scene.
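The exact size thresholds are not stated here; a minimal sketch of this kind of filter, with `min_area` and `max_area` as placeholder values, might look like:

```python
def boxes_overlap(a, b):
    """True if two [x1, y1, x2, y2] boxes intersect with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def filter_boxes(boxes, min_area=64 * 64, max_area=768 * 768):
    """Keep boxes within a size range that also overlap at least one
    other kept-size box. Thresholds here are illustrative placeholders."""
    sized = [b for b in boxes if min_area <= (b[2] - b[0]) * (b[3] - b[1]) <= max_area]
    return [
        b for i, b in enumerate(sized)
        if any(boxes_overlap(b, o) for j, o in enumerate(sized) if i != j)
    ]
```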
The scripts are intended to be run roughly in this order:
1. `extract_aesthetics_dataset.py` — Downloads images and metadata from the upstream aesthetics parquet dataset, filtered by resolution (≥1024×1024) and aesthetic score. Saves images to `images/` and one `metadata_<part>.jsonl` per parquet shard.
2. `process_images.py` — Center-crops each image to 1024×1024, runs object detection with `VLLMObjectDetector`, validates detections (a per-object yes/no check on the cropped region), and writes per-image `*_metadata.json`. Discards images that don't yield enough validated objects.
3. `add_global_caption.py` — Adds a `global_caption` field to each metadata file.
4. `add_short_global_caption.py` — Adds a short (<20 words) `short_global_caption` field.
5. `add_object_metadata.py` — Adds a `short_local_prompt` (<20 words) for every detected object.
6. `combine_metadata_with_urls.py` — Joins per-image metadata with source URLs from the aesthetics JSONL files and writes a single `combined_metadata.jsonl`. Only samples with all required caption/prompt fields are kept.
7. `annotate_images.py` — Renders bounding boxes and labels onto each image using `draw_bboxes.py`, producing visualization images.
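The center-crop step in `process_images.py` reduces to simple box arithmetic. A sketch of that computation (the helper name is illustrative, not from the script):

```python
def center_crop_box(width, height, size=1024):
    """Compute the (left, upper, right, lower) box of a centered size×size crop.

    Usable directly with Pillow, e.g.:
        img.crop(center_crop_box(*img.size))
    """
    if width < size or height < size:
        raise ValueError("image smaller than crop size")
    left = (width - size) // 2
    upper = (height - size) // 2
    return (left, upper, left + size, upper + size)
```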
Supporting modules:
- `detect_objects.py` — Defines `VLLMObjectDetector`, a wrapper around vLLM that handles object detection, batched validation, and caption generation for Qwen-VL models.
- `draw_bboxes.py` — Drawing utilities for overlapping bounding boxes with auto-placed, non-overlapping labels.
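The actual placement algorithm in `draw_bboxes.py` is not reproduced here; a greedy strategy like the following is one common way to keep labels from colliding (candidate positions and dimensions are illustrative assumptions):

```python
def rects_overlap(a, b):
    """True if two (x1, y1, x2, y2) rectangles intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def place_labels(bboxes, label_w=120, label_h=24, canvas=1024):
    """Greedily pick one label rectangle per bbox, trying a few anchor
    positions and skipping any that leave the canvas or hit a placed label."""
    placed = []
    for x1, y1, x2, y2 in bboxes:
        candidates = [
            (x1, y1 - label_h),  # above the top-left corner
            (x1, y2),            # below the bottom-left corner
            (x1, y1),            # inside the top-left corner
        ]
        for cx, cy in candidates:
            rect = (cx, cy, cx + label_w, cy + label_h)
            inside = cx >= 0 and cy >= 0 and rect[2] <= canvas and rect[3] <= canvas
            if inside and not any(rects_overlap(rect, p) for p in placed):
                placed.append(rect)
                break
        else:
            # No free candidate: accept an overlapping default position.
            placed.append((x1, y1, x1 + label_w, y1 + label_h))
    return placed
```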
```shell
pip install -r requirements.txt
```

Requires CUDA-capable GPUs for vLLM inference. Multi-GPU is supported and auto-detected via tensor parallelism.
Most scripts hard-code paths inside their `__main__` block (input image dir, metadata dir, model name, etc.). Edit those before running, then:
```shell
python extract_aesthetics_dataset.py [num_workers]
python process_images.py
python add_global_caption.py
python add_short_global_caption.py
python add_object_metadata.py
python combine_metadata_with_urls.py \
    --image-dir /path/to/images \
    --metadata-dir /path/to/metadata \
    --aesthetics-path /path/to/aesthetics_extracted \
    --output /path/to/combined_metadata.jsonl
python annotate_images.py
```

The default model is `Qwen/Qwen3-VL-32B-Instruct`. Adjust `tensor_parallel_size`, `gpu_memory_utilization`, and `batch_size` in each script to fit your hardware.
Each per-image JSON file looks like:
```json
{
  "detections": [
    {
      "category": "...",
      "bbox": [x1, y1, x2, y2],
      "local_prompt": "...",
      "short_local_prompt": "..."
    }
  ],
  "global_caption": "...",
  "short_global_caption": "..."
}
```

After running `combine_metadata_with_urls.py`, each line of `combined_metadata.jsonl` additionally contains `image_filename` and `url`.
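The join performed by `combine_metadata_with_urls.py` can be sketched as below. The function name, input shapes, and the set of required fields are assumptions for illustration; the real script also checks per-object prompt fields:

```python
import json

def combine(metadata_by_filename, url_by_filename):
    """Join per-image metadata with source URLs, emitting one JSON line per
    sample and dropping samples missing a URL or any required field."""
    required = ("detections", "global_caption", "short_global_caption")
    lines = []
    for filename, meta in metadata_by_filename.items():
        url = url_by_filename.get(filename)
        if url is None or not all(meta.get(key) for key in required):
            continue  # incomplete sample: skip rather than emit partial data
        lines.append(json.dumps({**meta, "image_filename": filename, "url": url}))
    return "\n".join(lines)
```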