[LLaVA OneVision] Easy Visual Task Transfer


# Author
- Bo Li (2,♡), Yuanhan Zhang (2,♡), Dong Guo (1), Renrui Zhang (3,♡), Feng Li (4,♡), Hao Zhang (4,♡), Kaichen Zhang (2), Peiyuan Zhang (2), Yanwei Li (3,♡), Ziwei Liu (2), Chunyuan Li (1)
  - 1: ByteDance, 2: S-Lab, NTU, 3: CUHK, 4: HKUST

- https://llava-vl.github.io/blog/2024-08-05-llava-onevision/

# Abstract
- LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: `single-image, multi-image, and video`
- Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities

# Introduction
- The first LLaVA model [83] demonstrates impressive multimodal chat abilities and, for the first time, sometimes exhibits behaviors similar to GPT-4V on previously unseen images and instructions
  - The first work to show GPT-4V-like chat ability
- LLaVA-1.5 [81] significantly expands and improves the capabilities by incorporating more academic-related instruction data, achieving SoTA performance on a dozen benchmarks with a data-efficient recipe
  - Tuned on academic-related instruction data, which yields strong performance
- LLaVA-NeXT [82] inherits this property, further pushing performance boundaries through three key techniques: AnyRes for handling high-resolution images, expanding high-quality instruction data, and utilizing the best open LLM available at the time.
  - Handles high-resolution images through AnyRes and reaches strong performance with high-quality data (NeXT could already cover video to some extent)
    - The Video blog [169] shows that the image-only-trained LLaVA-NeXT model is surprisingly strong on video tasks with zero-shot modality transfer, due to the design of AnyRes to digest any vision signals as a sequence of images.
      - It likely did well because AnyRes lets the model treat any vision signal as a sequence of images
    - There are four LLaVA-NeXT blog posts in total: Video, Stronger, Ablation, and Interleave
- contributions:
  - **Large multimodal models**. We develop LLaVA-OneVision, a family of open large multimodal models (LMMs) that improves the performance boundaries of open LMMs in three important vision settings, `including single-image, multi-image, and video scenarios`
  - **Emerging Capabilities with Task Transfer**. Our design in modeling and data representations allows task transfer across different scenarios, suggesting a simple approach to yield new emerging capabilities. In particular, LLaVA-OneVision demonstrates **strong video understanding through task transfer from images**.
  - **Open-source**. We release the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo

# Modeling

![Image](https://github.com/user-attachments/assets/42412f70-d093-49d9-84d7-0343e88abc84)

## Network Architecture
- LLM: Qwen-2, Vision Encoder: SigLIP, Projector: 2-layer MLP
- In the conditional probability below, the vision signal X_v always appears in the conditioning, which expresses that every answer is grounded in the visual features
![Image](https://github.com/user-attachments/assets/ef7e35b7-6ae4-408a-a542-43fa20a43c6c)
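
For reference, the conditional probability in the figure above can be written out in text form. This is a transcription of the usual LLaVA-style autoregressive objective, assuming X_q denotes the instruction tokens, X_a the answer tokens, and L the length of the target sequence (this L is unrelated to the visual-token count L used in the AnyRes discussion). If the notation differs from the figure, the figure is authoritative.

```latex
p(X_a \mid X_v, X_q) = \prod_{i=1}^{L} p\left(x_i \mid X_v, X_{q,<i}, X_{a,<i}\right)
```

Every factor is conditioned on X_v, which is the sense in which all answers are grounded in the visual input.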

## Visual Representations
- It relates to two factors, **the resolution in the raw pixel space** and **the number of tokens in the feature space**
  - the visual input representation configuration (resolution, #token)
  - we observe that scaling the resolution is more effective than scaling the number of tokens, and therefore recommend an AnyRes strategy with pooling

<img width="728" alt="Image" src="https://github.com/user-attachments/assets/ae7dabda-d4c2-4759-80eb-41b143a35168" />

<img width="736" alt="Image" src="https://github.com/user-attachments/assets/55917ae5-80ff-49e8-8b2c-a3163a0019b7" />

- For AnyRes with a configuration of width a, height b, it divides the image into a×b crops, each with the shape (a,b). Each crop has the same resolution suitable for the vision encoder. Assuming there are T tokens per crop, the total number of visual tokens is L = (a×b + 1) × T.
  - The vision encoder encodes each crop separately. If each crop yields T tokens, the result is a×b×T tokens for the crops plus 1×T tokens for the resized view of the whole image
- We consider a threshold τ, and reduce the #token per crop, using bilinear interpolation if needed
  - If the total number of tokens exceeds the threshold, bilinear interpolation is used to reduce the number of tokens per crop
  - For example, suppose each crop has 30 tokens, a×b is 3×1, and the threshold is 100. The total L is then (3×1 + 1) × 30 = 120, which exceeds the threshold, so the number of tokens per crop is changed to 100 / (3 + 1) = 100 / 4 = 25, giving 25 × 4 = 100 tokens in total, i.e. exactly the threshold. Given a predefined set of (a,b) configurations, the one that results in the fewest crops is selected
- We illustrate the configuration in Figure 3, describe the details in Section C.1, and provide high-level encoding strategies below
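
The token-budget arithmetic for AnyRes described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `anyres_token_budget` is hypothetical, and rounding down with integer division is an assumption (the notes only state that tokens per crop are reduced so the total fits the threshold).

```python
def anyres_token_budget(a, b, tokens_per_crop, threshold):
    """Return (tokens per crop after reduction, total visual tokens).

    a, b            : AnyRes grid configuration (a x b crops)
    tokens_per_crop : T, tokens the vision encoder emits per crop
    threshold       : tau, maximum number of visual tokens allowed
    """
    num_views = a * b + 1                 # a*b crops plus one resized base view
    total = num_views * tokens_per_crop   # L = (a*b + 1) * T
    if total <= threshold:
        return tokens_per_crop, total     # already within budget, nothing to do
    # Assumption: floor division, so the total never exceeds the threshold.
    reduced = threshold // num_views
    return reduced, num_views * reduced


# Worked example from the notes: T = 30, a x b = 3 x 1, threshold = 100.
print(anyres_token_budget(3, 1, 30, 100))  # (25, 100)
# A case that is already under the threshold stays unchanged.
print(anyres_token_budget(2, 2, 30, 200))  # (30, 150)
```

The first call reproduces the 120 → 100 example: four views at 30 tokens each exceed the budget, so each view is cut to 25 tokens.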

<img width="725" alt="Image" src="https://github.com/user-attachments/assets/8195f76e-9346-4550-993b-115ac6ec459b" />

  - Single-image
    - consider a large maximum spatial configuration (a,b) for single-image representation to maintain the original image resolution without resizing
    - By representing an image with a long sequence that mimics video representation, we facilitate a smoother capability transfer from image to video understanding [169, 64]
  - Multi-image
    - Only the base image resolution is considered
    - eliminating the need for multi-crop of high resolution image and thus saving computational resources
  - Video
    - Each frame of the video is resized to the base image resolution and processed by the vision encoder to generate feature maps. **Bilinear interpolation** is employed to reduce the number of tokens, allowing the consideration of a larger number of frames by reducing tokens per frame
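
To make the token-reduction step concrete, here is a small pure-Python sketch of bilinear interpolation on a single-channel feature map. It is only an illustration under stated assumptions: `bilinear_resize` is a hypothetical helper, the corner-aligned sampling convention is an assumption (the notes do not say which convention is used), and the 4×4 → 3×3 sizes are toy values. A real pipeline would more likely call a library routine on the full (height, width, channels) feature tensor.

```python
def bilinear_resize(grid, out_h, out_w):
    """Resize a 2D grid of floats to (out_h, out_w) via bilinear interpolation."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row to a fractional input row (corners aligned).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Toy 4x4 feature map (one channel): 16 tokens reduced to 3x3 = 9 tokens.
feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pooled = bilinear_resize(feature_map, 3, 3)
print(pooled)  # [[0.0, 1.5, 3.0], [6.0, 7.5, 9.0], [12.0, 13.5, 15.0]]
```

Applied per frame, the same operation shrinks the token grid, so more frames fit into a fixed token budget.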

<img width="1409" alt="Image" src="https://github.com/user-attachments/assets/89aec9f9-2604-4efe-af7d-ee561d827203" />


# Data
- quality over quantity
## 4.1 High-Quality Knowledge
- The web-scale public image-text data is often of low quality, rendering the data scaling of multimodal pre-training less efficient
- Instead, we recommend focusing on high-quality knowledge learning, given a limited compute budget. This approach acknowledges that the pre-trained LLMs and ViTs already possess a substantial knowledge base, and the goal is to refine and enhance this knowledge with carefully curated data.
- three major categories for high-quality knowledge learning
  - Re-Captioned Detailed Description Data:
    - We used the model (LLaVA-NeXT-34B in the paper) to generate new captions for the images from the following datasets: **COCO118K, BLIP558K, and CC3M**. We combined them to form the Re-Captioned Detailed Description Data, totaling 3.5M samples.
  - Document / OCR Data
    - Text Reading subset from the UReader dataset, totaling 100K, which is easily accessible through PDF rendering
    - We used this text reading data along with SynDOG EN/CN to form the Document / OCR Data, totaling 1.1M samples
  - Chinese and Language Data
    - We used the original ShareGPT4V [20] images and GPT-4V (provided by the Azure API) to generate 92K detailed Chinese caption samples, aiming to improve the model’s capability in Chinese.
    - We collected 143K samples from the Evo-Instruct dataset
- almost all (accounting for 99.8%) of the high-quality knowledge data is **synthetic**.
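
As a quick sanity check on the mixture, the sample counts quoted above can be tallied as follows. The numbers are taken directly from these notes (3.5M and 1.1M are rounded figures, so the shares are approximate), and the dictionary keys are just descriptive labels chosen for this sketch.

```python
# Sample counts in thousands, as quoted in the notes above.
high_quality_knowledge = {
    "Re-Captioned Detailed Description": 3500,
    "Document / OCR": 1100,
    "Chinese captions (ShareGPT4V images)": 92,
    "Language (Evo-Instruct)": 143,
}

total_k = sum(high_quality_knowledge.values())
shares = {
    name: round(100 * count / total_k, 1)
    for name, count in high_quality_knowledge.items()
}

print(total_k)  # 4835 (thousand samples, i.e. roughly 4.8M)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The re-captioned description data dominates the stage, with the Chinese and language subsets making up only a few percent.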

## 4.2 Visual Instruction Tuning Data

<img width="710" alt="Image" src="https://github.com/user-attachments/assets/db1a3e59-e68e-4f8c-a6c6-d3c87d7bec18" />

<img width="716" alt="Image" src="https://github.com/user-attachments/assets/b00f6df9-1469-4519-9f7d-35b69464f995" />

<img width="717" alt="Image" src="https://github.com/user-attachments/assets/6ec4cb3e-845a-49e9-a8b4-4607f10ee87f" />

<img width="716" alt="Image" src="https://github.com/user-attachments/assets/4deb05cb-07dc-4296-a84b-84fe201b4586" />


# 5 Training Strategies

<img width="944" alt="Image" src="https://github.com/user-attachments/assets/f8890953-fdee-447e-aaf1-af98ccacb0bd" />
