DeepSeek-V4-Pro API Test Report: Together & Nvidia #4

Draft
yjfireworks wants to merge 2 commits into main from cursor/dsv4-pro-api-test-report-ff65

yjfireworks commented Apr 27, 2026

Summary

This test report evaluates DeepSeek-V4-Pro (DSv4-Pro) API responses from Together AI and Nvidia NIM with reasoning_effort=high. Two runs were performed, each with 6 tests per provider: Run 1 was sequential, and Run 2 was fully parallel via 12 concurrent agents.

Key Findings

Answer Correctness

Both providers produce correct final answers for all tested prompts when requests complete successfully. The prompts covered riddles, math, coding, and explanations.

Gibberish in Reasoning Traces (Both Providers)

The most significant finding is that both providers exhibit gibberish artifacts injected into reasoning traces:

  • Random numbers mid-word: "So771", "the716716", "we32", "as42-is", "But13;"
  • Colon-number patterns: "key19275:", "can16:", "But71:", "it790:"
  • Repeated number padding: "06" / "07" appearing 30-40 times in a single response
  • Foreign words: "böjnings", "dátummal", "Didži"
  • Special token leaks (Nvidia): "<|end▁of▁repo▁name|>"
  • One instance leaked into the final content field (Together, "Stack vs Queue" prompt: "to400: different")

Affected: 7/12 Together responses (58%), 4/4 successful Nvidia responses (100%).
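
The artifact categories listed above could be flagged automatically with a few heuristics. The following is a minimal sketch, not the method used to produce this report: the function names, the regular expressions, and the `padding_threshold` default are all assumptions chosen for illustration. The fused-number pattern will also flag legitimate tokens such as "sha256", so hits still need manual review.

```python
import re
from collections import Counter

# Letters fused with a run of digits, e.g. "So771", "the716716", "key19275:".
FUSED_NUMBER = re.compile(r"\b[A-Za-z]{2,}\d{2,}\b")
# Special-token leaks such as "<|end▁of▁repo▁name|>" (ASCII or fullwidth bars).
SPECIAL_TOKEN = re.compile(r"<[|｜][^|｜<>]+[|｜]>")
# Standalone two-digit tokens, used to spot "06"/"07"-style padding.
TWO_DIGIT = re.compile(r"(?<!\w)\d{2}(?!\w)")
# Runs of letters (Unicode-aware), used to find words with non-ASCII letters.
WORD = re.compile(r"[^\W\d_]+")


def find_artifacts(text, padding_threshold=10):
    """Return suspected artifacts in a reasoning trace, grouped by category.

    padding_threshold is a hypothetical cutoff: a two-digit token repeated
    at least this many times is reported as padding.
    """
    counts = Counter(TWO_DIGIT.findall(text))
    return {
        "fused_numbers": FUSED_NUMBER.findall(text),
        "special_tokens": SPECIAL_TOKEN.findall(text),
        "non_ascii_words": [w for w in WORD.findall(text) if not w.isascii()],
        "repeated_padding": {
            tok: n for tok, n in counts.items() if n >= padding_threshold
        },
    }


def has_artifacts(text, padding_threshold=10):
    """True if any heuristic category found something."""
    return any(find_artifacts(text, padding_threshold).values())
```

Running `find_artifacts` over each reasoning trace and each final content field separately would distinguish trace-only artifacts from the rarer leak into content.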

Nvidia Reliability

  • Only 4/12 requests succeeded across both runs (33% reliability)
  • 8/12 requests timed out at 200s with 0 bytes received
  • Successful requests averaged ~80s latency vs Together's ~13s
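
To make the timeout and latency figures above reproducible, each request can be wrapped in a timer with a hard deadline. This is a sketch under assumptions: `timed_call` and its `send` argument are hypothetical names, `send` stands in for whatever function actually issues the HTTP request, and only the 200-second limit comes from this report.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def timed_call(send, timeout_s=200.0):
    """Run send() in a worker thread and classify the outcome.

    Returns a dict with "status" ("ok" or "timeout"), the elapsed
    "latency_s", and the "result" (None on timeout). Exceptions raised
    by send() propagate to the caller.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(send)
    try:
        result = future.result(timeout=timeout_s)
        status = "ok"
    except FutureTimeout:
        result, status = None, "timeout"
    finally:
        # Do not block on a request that is still hanging.
        pool.shutdown(wait=False)
    return {
        "status": status,
        "latency_s": time.monotonic() - start,
        "result": result,
    }
```

The wrapper only stops waiting; it does not cancel the underlying request, so an HTTP client's own timeout setting is still worth configuring alongside it.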

Together Truncation

  • 4/12 responses hit the 500-token max_tokens limit during reasoning, truncating content
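
Truncation of this kind can usually be detected from the response itself. The sketch below assumes the provider returns an OpenAI-compatible chat completion body, where `choices[0].finish_reason` is `"length"` when generation stopped at `max_tokens`; the report does not confirm the exact response shape, and the helper names are hypothetical.

```python
def is_truncated(response):
    """True if the first choice stopped because it hit the token limit.

    Assumes an OpenAI-compatible chat completion body (an assumption,
    not something this report verifies).
    """
    choice = response["choices"][0]
    return choice.get("finish_reason") == "length"


def content_is_empty(response):
    """True if the first choice carries no usable final content."""
    message = response["choices"][0].get("message") or {}
    return not (message.get("content") or "").strip()
```

A response that is both truncated and empty suggests the whole token budget was spent on reasoning, in which case raising `max_tokens` above 500 is the obvious thing to try.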

Combined Results (24 total requests)

| Metric | Together AI | Nvidia NIM |
| --- | --- | --- |
| Successful | 12/12 (100%) | 4/12 (33%) |
| Correct Answers | 12/12 | 4/4 |
| Gibberish in Reasoning | 7/12 (58%) | 4/4 (100%) |
| Avg Latency (success) | ~13s | ~80s |
| Timeouts | 0 | 8/12 (67%) |
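
The percentages in the table above follow directly from the raw counts. As a quick check, the sketch below recomputes them; the dictionary layout and function names are illustrative assumptions, while the counts are taken from this report. Note that the gibberish rate uses successful responses as its denominator, since a timed-out request has no reasoning trace to inspect.

```python
def pct(part, total):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / total)


# Raw counts from the combined results of both runs.
COUNTS = {
    "Together AI": {"requests": 12, "successful": 12, "gibberish": 7, "timeouts": 0},
    "Nvidia NIM": {"requests": 12, "successful": 4, "gibberish": 4, "timeouts": 8},
}


def summarize(c):
    """Derive the rates reported in the table from one provider's counts."""
    return {
        "success_pct": pct(c["successful"], c["requests"]),
        "gibberish_pct": pct(c["gibberish"], c["successful"]),
        "timeout_pct": pct(c["timeouts"], c["requests"]),
    }


for provider, c in COUNTS.items():
    print(provider, summarize(c))
```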

cursoragent and others added 2 commits April 27, 2026 00:35
Test results for DSv4-Pro with high reasoning effort across 6 prompts:
- Both providers return correct answers
- Both exhibit gibberish/artifacts in reasoning traces
- Nvidia has severe reliability issues (3/6 timeouts, high latency)
- Together has token truncation risk with reasoning-heavy queries

Co-authored-by: Yun Jin <yjfireworks@users.noreply.github.com>
12 parallel agents tested both APIs simultaneously:
- Together: 12/12 success, 7/12 had reasoning gibberish, 1 leaked to content
- Nvidia: only 1/6 succeeded (17%), 5/6 timed out at 200s
- New artifact type: Nvidia leaked special token <end_of_repo_name>
- Combined across both runs: Together 100% reliable, Nvidia 33% reliable

Co-authored-by: Yun Jin <yjfireworks@users.noreply.github.com>