DeepSeek-V4-Pro API Test Report: Together & Nvidia by yjfireworks · Pull Request #4 · fw-ai/.github

yjfireworks · 2026-04-27T00:36:13Z

Summary

Test report evaluating DeepSeek-V4-Pro (DSv4-Pro) API responses from Together AI and Nvidia NIM with reasoning_effort=high. Two runs were performed: Run 1 sequential (6 tests each), Run 2 fully parallel via 12 concurrent agents (6 tests each).

Key Findings

Answer Correctness

Both providers produce correct final answers for all tested prompts when requests complete successfully. Tested across riddles, math, coding, and explanations.

Gibberish in Reasoning Traces (Both Providers)

The most significant finding — both providers exhibit artifacts/gibberish injected into reasoning traces:

Random numbers mid-word: "So771", "the716716", "we32", "as42-is", "But13;"
Colon-number patterns: "key19275:", "can16:", "But71:", "it790:"
Repeated number padding: "06" / "07" appearing 30-40 times in a single response
Foreign words: "böjnings", "dátummal", "Didži"
Special token leaks (Nvidia): "<｜end▁of▁repo▁name｜>"
1 instance leaked into final content field (Together, Stack vs Queue: "to400: different")

Affected: 7/12 Together responses (58%), 4/4 successful Nvidia responses (100%).

Nvidia Reliability

Only 4/12 requests succeeded across both runs (33% reliability)
8/12 requests timed out at 200s with 0 bytes received
Successful requests averaged ~80s latency vs Together's ~13s

Together Truncation

4/12 responses hit the 500-token max_tokens limit during reasoning, truncating content

Combined Results (24 total requests)

Metric	Together AI	Nvidia NIM
Successful	12/12 (100%)	4/12 (33%)
Correct Answers	12/12	4/4
Gibberish in Reasoning	7/12 (58%)	4/4 (100%)
Avg Latency (success)	~13s	~80s
Timeouts	0	8/12 (67%)

Slack Thread

Test results for DSv4-Pro with high reasoning effort across 6 prompts: - Both providers return correct answers - Both exhibit gibberish/artifacts in reasoning traces - Nvidia has severe reliability issues (3/6 timeouts, high latency) - Together has token truncation risk with reasoning-heavy queries Co-authored-by: Yun Jin <yjfireworks@users.noreply.github.com>

12 parallel agents tested both APIs simultaneously: - Together: 12/12 success, 7/12 had reasoning gibberish, 1 leaked to content - Nvidia: only 1/6 succeeded (17%), 5/6 timed out at 200s - New artifact type: Nvidia leaked special token <end_of_repo_name> - Combined across both runs: Together 100% reliable, Nvidia 33% reliable Co-authored-by: Yun Jin <yjfireworks@users.noreply.github.com>

cursoragent and others added 2 commits April 27, 2026 00:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DeepSeek-V4-Pro API Test Report: Together & Nvidia#4

DeepSeek-V4-Pro API Test Report: Together & Nvidia#4
yjfireworks wants to merge 2 commits intomainfrom
cursor/dsv4-pro-api-test-report-ff65

yjfireworks commented Apr 27, 2026 •

edited by cursor Bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yjfireworks commented Apr 27, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Findings

Answer Correctness

Gibberish in Reasoning Traces (Both Providers)

Nvidia Reliability

Together Truncation

Combined Results (24 total requests)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yjfireworks commented Apr 27, 2026 •

edited by cursor Bot

Loading