Motivation
Background
As we add video generation workloads to the endpoints framework, we need to decide how video transmission between client and server should be handled, and whether that transmission/compression cost should be counted in inference performance measurement.
Metrics: throughput
Single stream: Videos-per-second
~220 MB/video as raw bytes tensor (or 5-10MB mp4 videos if compressed)
Concurrency = 1 (GB200x4) --> need to scale up to GB200/GB300x72
Output is a single video blob; the model does not emit frame-by-frame output.
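The ~220 MB raw figure is consistent with, for example, an 80-frame 720p RGB uint8 tensor. A quick sanity check (the resolution, frame count, and dtype here are illustrative assumptions, not from the spec):

```python
# Back-of-envelope size of one generated clip as a raw uint8 RGB tensor.
# Assumed dimensions (illustrative only): 1280x720, 3 channels, 80 frames.
width, height, channels = 1280, 720, 3
frames = 80  # e.g. ~5 s at 16 fps

raw_bytes = width * height * channels * frames  # 1 byte per uint8 value
raw_mb = raw_bytes / 1e6
print(f"raw tensor: {raw_mb:.0f} MB per video")  # ~221 MB, matching ~220 MB/video

mp4_mb = 7.5  # midpoint of the 5-10 MB compressed range above
print(f"compression ratio: ~{raw_mb / mp4_mb:.0f}x")
```

At a ~30x size difference, whether the raw tensor or the MP4 crosses the wire dominates the transmission cost being debated below.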
Problem Statement
The endpoints framework requires video transmission between client and server.
Several design questions:
- Should transmission be counted toward inference perf? I.e., does the measurement cover click-to-download/play, or is the video passed directly?
Workload sizes:
- 248 videos for Accuracy Mode
- 50 videos for Performance Mode (though perf is currently collected with 100 videos)
Key Questions
- Does video transmission count as inference performance?
- Option A: Measure only the model inference time; transmission/compression is out-of-scope
- Option B: Include transmission in the latency/throughput measurement (click-to-download, or click-to-play with video streaming)
- Should the response carry only a path/hash, or the full video blob?
- What is the API response-complete signal?
- When is a request considered "done" — when the model finishes generating, or when the encoded video is available for download?
- Does MP4 compression count in inference perf?
- MP4 is required for VBench accuracy scoring
- Compression could be folded into the accuracy phase (download → compress → score), keeping it out of the performance-phase critical path
- Hardware path for encoding:
- Is there a GPU-accelerated path for encoding/decoding (e.g., on B200)?
- Could compression be offloaded to a separate hardware unit?
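The Option A / Option B split can be made concrete with a timing sketch. This is a minimal mock, not the real endpoints client; `generate` and `transfer` are placeholder callables:

```python
import time

def measure(generate, transfer, include_transfer: bool) -> float:
    """Return the measured latency for one request.

    Option A: include_transfer=False -> only model generation time counts.
    Option B: include_transfer=True  -> transmission sits on the critical path.
    """
    start = time.perf_counter()
    video = generate()             # model inference (blocking)
    infer_done = time.perf_counter()
    transfer(video)                # download / stream the blob to the client
    end = time.perf_counter()
    return (end - start) if include_transfer else (infer_done - start)

# Stand-in workloads (placeholders, not real endpoints):
fake_generate = lambda: time.sleep(0.05) or b"video-bytes"
fake_transfer = lambda v: time.sleep(0.02)

a = measure(fake_generate, fake_transfer, include_transfer=False)
b = measure(fake_generate, fake_transfer, include_transfer=True)
assert b > a  # Option B latency strictly includes the transfer cost
```

Whichever option is chosen, the "done" signal in the API should match the measurement boundary, so that client-side timestamps and server-side metrics agree.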
Proposed Solution
See the Key Questions above; this issue is requesting a decision on each of them.
Alternatives Considered
No response
Additional Context
No response