LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios
Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities
Introduction
The first LLaVA model [83] demonstrates impressive multimodal chat abilities, sometimes exhibiting behaviors similar to GPT-4V on previously unseen images and instructions for the first time
The first work with chat capabilities similar to GPT-4V
LLaVA-1.5 [81] significantly expands and improves the capabilities by incorporating more academic-related instruction data, achieving SoTA performance on a dozen benchmarks with a data-efficient recipe
Tuned on academic-related instruction data to achieve strong performance
LLaVA-NeXT [82] inherits this property, further pushing performance boundaries through three key techniques: AnyRes for handling high-resolution images, expanding high-quality instruction data, and utilizing the best open LLM available at the time.
Handles high-resolution images through AnyRes and reaches strong performance with high-quality data (NeXT could already cover video to some extent)
The Video blog [169] shows that the image-only-trained LLaVA-NeXT model is surprisingly strong on video tasks with zero-shot modality transfer, due to the design of AnyRes to digest any vision signals as a sequence of images.
It likely did well because AnyRes lets it treat any vision signal as a sequence of images
There are four LLaVA-NeXT blog posts in total: Video, Stronger, Ablation, and Interleave
contributions:
Large multimodal models. We develop LLaVA-OneVision, a family of open large multimodal models (LMMs) that improves the performance boundaries of open LMMs in three important vision settings, including single-image, multi-image, and video scenarios
Emerging Capabilities with Task Transfer. Our design in modeling and data representations allows task transfer across different scenarios, suggesting a simple approach to yield new emerging capabilities. In particular, LLaVA-OneVision demonstrates strong video understanding through task transfer from images.
Open-source. The released assets are the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo
In the conditional probability, the vision signal X_v is always part of the condition, which expresses that every answer is grounded in the vision features
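The conditional probability mentioned above can be written out as a sketch. The form below follows the autoregressive formulation used in the LLaVA family; the symbols X_q (instruction tokens), X_a (answer tokens), and N (answer length) are assumptions, since these notes only name X_v.

```latex
p(X_a \mid X_v, X_q) = \prod_{i=1}^{N} p\left(x_i \mid X_v,\, X_{q,<i},\, X_{a,<i}\right)
```

X_v appears in the condition of every factor, so each answer token x_i is predicted with the visual features available.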
Visual Representations
It relates to two factors, the resolution in the raw pixel space and the number of tokens in the feature space
the visual input representation configuration (resolution, #token)
we observe that the scaling of resolution is more effective than that of token numbers, and recommend an AnyRes strategy with pooling (scaling resolution matters more than scaling the number of tokens, so applying pooling to AnyRes is recommended)
For AnyRes with a configuration of width a, height b, it divides the image into a×b crops, each with the shape (a,b). Each crop has the same resolution suitable for the vision encoder. Assuming there are T tokens per crop, the total number of visual tokens is L = (a×b + 1) × T.
Each crop is a unit the vision encoder can encode on its own. If each unit yields T tokens, the crops give a×b×T tokens, plus another T tokens for the resized view of the whole image
We consider a threshold τ, and reduce the #token per crop, using bilinear interpolation if needed
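If the total number of tokens exceeds the threshold, the number of tokens per crop is reduced through bilinear interpolation
For example, suppose there are 30 tokens per crop, a×b is 3×1, and the threshold is 100. The total L is (3×1 + 1) × 30 = 120 tokens, which exceeds the threshold, so the number of tokens per crop is changed to 100 / (3 + 1) = 100 / 4 = 25, giving 25 × 4 = 100 tokens in total, that is, capped at the threshold. Given a predefined set of (a,b) configurations, the one that yields the smallest number of crops is selected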
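The token-budget arithmetic in the example above can be sketched as a small function. This is a minimal illustration, not the authors' implementation: the function name `anyres_token_budget` is hypothetical, and the floor division is an assumption about rounding when the threshold does not divide evenly.

```python
def anyres_token_budget(a, b, tokens_per_crop, threshold):
    """Return (tokens per crop, total visual tokens) after applying the threshold.

    a, b            -- spatial configuration (the image is split into a*b crops)
    tokens_per_crop -- T, tokens the vision encoder produces for one crop
    threshold       -- tau, the maximum total number of visual tokens
    """
    # a*b crops plus one resized view of the whole image
    num_views = a * b + 1
    total = num_views * tokens_per_crop
    if total <= threshold:
        # Under budget: keep the original per-crop token count
        return tokens_per_crop, total
    # Over budget: shrink tokens per crop so the total fits the threshold
    # (floor division is an assumption made for this sketch)
    reduced = threshold // num_views
    return reduced, reduced * num_views


# The worked example from the notes: T = 30, a x b = 3 x 1, tau = 100
print(anyres_token_budget(3, 1, 30, 100))  # (25, 100)
```

In the real model the reduction to the smaller per-crop count is done by bilinear interpolation on the feature map; this sketch only computes the counts.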
We illustrate the configuration in Figure 3, describe the details in Section C.1, and provide high-level encoding strategies below
Single-image
consider a large maximum spatial configuration (a,b) for single-image representation to maintain the original image resolution without resizing
By representing an image with a long sequence that mimics video representation, we facilitate a smoother capability transfer from image to video understanding [169, 64]
Multi-image
Only the base image resolution is considered
eliminating the need for multi-crop of high resolution image and thus saving computational resources
Video
Each frame of the video is resized to the base image resolution and processed by the vision encoder to generate feature maps. Bilinear interpolation is employed to reduce the number of tokens, allowing the consideration of a larger number of frames by reducing tokens per frame
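To make the per-frame token reduction concrete, here is a minimal pure-Python sketch of bilinear interpolation on a 2D grid. It is only an illustration under stated assumptions: the function name `bilinear_resize` is hypothetical, each grid cell holds a single float rather than a feature vector, and output positions are mapped so that the corners align (one of several possible conventions). The source does not say which convention or library the authors use.

```python
def bilinear_resize(grid, out_h, out_w):
    """Resize a 2D grid (list of rows of floats) to out_h x out_w bilinearly."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row index to a fractional input row (corner-aligned)
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            # Map the output column index to a fractional input column
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# A 4x4 "feature map" (16 tokens) reduced to 2x2 (4 tokens)
frame = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
small = bilinear_resize(frame, 2, 2)
print(len(small) * len(small[0]))  # 4 tokens per frame instead of 16
```

With fewer tokens per frame, the same total token budget covers more frames, which is the trade-off the passage describes.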
Data
quality over quantity
4.1 High-Quality Knowledge
The web-scale public image-text data is often of low quality, rendering the data scaling of multimodal pre-training less efficient
Instead, we recommend focusing on high-quality knowledge learning, given a limited compute budget. This approach acknowledges that the pre-trained LLMs and ViTs already possess a substantial knowledge base, and the goal is to refine and enhance this knowledge with carefully curated data.
three major categories for high-quality knowledge learning
Re-Captioned Detailed Description Data:
We used the model to generate new captions for the images from the following datasets: COCO118K, BLIP558K, and CC3M. We combined them to form the Re-Captioned Detailed Description Data, totaling 3.5M samples.
Document / OCR Data
Text Reading subset from the UReader dataset, totaling 100K, which is easily accessible through PDF rendering
We used this text reading data along with the SynDOG EN/CN, to form the Document / OCR Data, totaling 1.1M samples
Chinese and Language Data
used the original ShareGPT4V [20] images and utilized GPT-4V provided by the Azure API to generate 92K detailed Chinese caption data, aiming to improve the model’s capability in Chinese.
We collected 143K samples from the Evo-Instruct dataset
almost all (accounting for 99.8%) of the high-quality knowledge data is synthetic.
Author
Bo Li2,♡ Yuanhan Zhang2,♡ Dong Guo1 Renrui Zhang3,♡ Feng Li4,♡ Hao Zhang4,♡ Kaichen Zhang2 Peiyuan Zhang2 Yanwei Li3,♡ Ziwei Liu2 Chunyuan Li1
https://llava-vl.github.io/blog/2024-08-05-llava-onevision/