Motivation
Background
As we add video generation workloads to the endpoints framework, we need to decide how video transmission between client and server should be handled, and whether that transmission/compression cost should be counted in inference performance measurement.
Metrics: throughput
Single stream: Videos-per-second
~220 MB/video as raw bytes tensor (or 5-10MB mp4 videos if compressed)
Concurrency = 1 (GB200x4) --> need to scale up to GB200/GB300x72
Output is a single video blob; the model does not emit frame-by-frame output.
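The ~220 MB raw figure is consistent with, for example, an 80-frame 720p RGB uint8 tensor. A quick sanity check (the resolution, frame count, and dtype here are illustrative assumptions, not from the spec):

```python
# Back-of-envelope size of one generated clip as a raw uint8 RGB tensor.
# Assumed dimensions (illustrative only): 1280x720, 3 channels, 80 frames.
width, height, channels = 1280, 720, 3
frames = 80  # e.g. ~5 s at 16 fps

raw_bytes = width * height * channels * frames  # 1 byte per uint8 value
raw_mb = raw_bytes / 1e6
print(f"raw tensor: {raw_mb:.0f} MB per video")  # ~221 MB, matching ~220 MB/video

mp4_mb = 7.5  # midpoint of the 5-10 MB compressed range above
print(f"compression ratio: ~{raw_mb / mp4_mb:.0f}x")
```

At a ~30x size difference, whether the raw tensor or the MP4 crosses the wire dominates the transmission cost being debated below.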
Problem Statement
The endpoints framework requires video transmission between client and server.
Several design questions:
- Should transmission be counted toward inference perf? I.e., does the measurement cover click-to-download/play, or is the video passed directly?
Workload sizes:
- 248 videos for Accuracy Mode
- 50 videos for Performance Mode (though perf is currently collected with 100 videos)
Key Questions
- Does video transmission count as inference performance?
- Option A: Measure only the model inference time; transmission/compression is out-of-scope
- Option B: Include transmission in the latency/throughput measurement (click-to-download, or click-to-play with video streaming)
- Should the response carry only a path/hash, or the full video blob?
- What is the API response-complete signal?
- When is a request considered "done" — when the model finishes generating, or when the encoded video is available for download?
- Does MP4 compression count in inference perf?
- MP4 is required for VBench accuracy scoring
- Compression could be folded into the accuracy phase (download → compress → score), keeping it out of the performance-phase critical path
- Hardware path for encoding:
- Is there a GPU-accelerated path for encoding/decoding (e.g., on B200)?
- Could compression be offloaded to a separate hardware unit?
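The Option A / Option B split can be made concrete with a timing sketch. This is a minimal mock, not the real endpoints client; `generate` and `transfer` are placeholder callables:

```python
import time

def measure(generate, transfer, include_transfer: bool) -> float:
    """Return the measured latency for one request.

    Option A: include_transfer=False -> only model generation time counts.
    Option B: include_transfer=True  -> transmission sits on the critical path.
    """
    start = time.perf_counter()
    video = generate()             # model inference (blocking)
    infer_done = time.perf_counter()
    transfer(video)                # download / stream the blob to the client
    end = time.perf_counter()
    return (end - start) if include_transfer else (infer_done - start)

# Stand-in workloads (placeholders, not real endpoints):
fake_generate = lambda: time.sleep(0.05) or b"video-bytes"
fake_transfer = lambda v: time.sleep(0.02)

a = measure(fake_generate, fake_transfer, include_transfer=False)
b = measure(fake_generate, fake_transfer, include_transfer=True)
assert b > a  # Option B latency strictly includes the transfer cost
```

Whichever option is chosen, the "done" signal in the API should match the measurement boundary, so that client-side timestamps and server-side metrics agree.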
Proposed Solution
See the Key Questions above; this issue is requesting a decision on each of them.
Alternatives Considered
No response
Additional Context
No response