# Generation Model Benchmark Update

This is a short benchmark summary for the new `UNDERSTANDING_FIRST` generation path in SourceBridge.

## Test setup

- Path tested: repository cliff notes on the new `UNDERSTANDING_FIRST` flow
- Fixture repo: `tests/fixtures/multi-lang-repo`
- Audience/depth: `DEVELOPER` + `MEDIUM`
- Benchmark harness: `scripts/run_generation_mode_benchmark.py`

## Inference server

The benchmark runs were executed against an Ollama OpenAI-compatible endpoint:

- Provider: `ollama`
- Base URL: `http://192.168.10.108:11434/v1`

This was exercised through the local SourceBridge benchmark stack:

- API: `http://127.0.0.1:18084`
- Worker: local gRPC worker on `127.0.0.1:15053`
- SurrealDB: benchmark database `benchmark_modes_0413f`

## What we measured
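The endpoint above speaks the OpenAI-compatible chat-completions protocol, so a single generation call reduces to a JSON POST. A minimal sketch of that request body follows; the prompt text and the `build_chat_request` helper are illustrative, not the actual harness code:

```python
import json

# Base URL of the Ollama OpenAI-compatible endpoint from the setup above.
BASE_URL = "http://192.168.10.108:11434/v1"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON body for POST {BASE_URL}/chat/completions.

    Hypothetical helper: the real benchmark drives requests through the
    SourceBridge worker rather than calling the endpoint directly.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Summarize this repository."},
            {"role": "user", "content": prompt},
        ],
        "stream": False,
    }

body = build_chat_request("mistral-small3.1:24b", "Repository file tree: ...")
print(json.dumps(body, indent=2))
```

Any OpenAI-compatible client can submit this body; the benchmark numbers below were gathered through the full SourceBridge stack, not raw HTTP calls.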
The important thing here is not just whether a model finishes. It also has to produce a usable repository understanding artifact. Several models either failed with DEGRADED_COMPUTE or completed with shallow/generic output that did not actually represent the repo well.
## Results

| Model | Result | Time | Quality read |
| --- | --- | --- | --- |
| `mistral-small3.1:24b` | READY | 3m 55s | Best overall. Completed with grounded, usable cliff notes and the best repo-shape fidelity of the models tested. |
| `magistral:24b` | READY | 1m 50s | Fast, but quality was too generic. It collapsed into a simple Go server story and missed the real multi-language shape. |
| `gemma3:27b` | READY | 2m 47s | Faster than Mistral Small 3.1, but quality was noticeably weaker and overly generic. |
| `devstral:24b` | READY | 1m 26s | Very fast, but poor for this workload. It over-focused on `go/main.go` and missed the broader repository. |
| `mistral-nemo:12b` | READY | 0m 53s | Fastest among the Mistral-family tests, but quality was poor and it started inventing domain details. |
| `llama3.3:70b-instruct-q4_K_M` | READY | completed | Completed, but artifact quality/parsing was poor and several sections were effectively unusable. |
| `qwen3.5:122b-a10b` | FAILED | 3m 15s | Failed with `DEGRADED_COMPUTE`; not viable on this inference stack for this path. |
| `qwen3:14b` | FAILED | 6m 23s | Failed with `DEGRADED_COMPUTE`. Better than the smaller Qwen run, but still not stable enough. |
| `qwen3.5:9b` | FAILED | 9m 20s | Failed badly with `DEGRADED_COMPUTE`; not worth pursuing for this path. |
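For quick sanity checks, the table can be treated as data. The sketch below (rows copied from the table; `parse_duration` is a hypothetical helper, not harness code) shows why raw speed is a misleading ranking:

```python
import re

def parse_duration(text):
    """Parse an 'Xm Ys' time cell into seconds; None for non-time cells."""
    m = re.fullmatch(r"(\d+)m (\d+)s", text)
    return 60 * int(m.group(1)) + int(m.group(2)) if m else None

# (model, result, time) rows copied from the table above.
rows = [
    ("mistral-small3.1:24b", "READY", "3m 55s"),
    ("magistral:24b", "READY", "1m 50s"),
    ("gemma3:27b", "READY", "2m 47s"),
    ("devstral:24b", "READY", "1m 26s"),
    ("mistral-nemo:12b", "READY", "0m 53s"),
    ("llama3.3:70b-instruct-q4_K_M", "READY", "completed"),
    ("qwen3.5:122b-a10b", "FAILED", "3m 15s"),
    ("qwen3:14b", "FAILED", "6m 23s"),
    ("qwen3.5:9b", "FAILED", "9m 20s"),
]

ready = [r for r in rows if r[1] == "READY"]
failed = [r for r in rows if r[1] == "FAILED"]
print(f"READY: {len(ready)}, FAILED: {len(failed)}")  # READY: 6, FAILED: 3

# Fastest completed run is not the recommended one (quality is the tiebreaker).
timed = [(m, parse_duration(t)) for m, _, t in ready if parse_duration(t)]
fastest = min(timed, key=lambda x: x[1])
print("fastest READY:", fastest[0])  # fastest READY: mistral-nemo:12b
```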
## Recommendation

Current recommendation for SourceBridge repository understanding and cliff notes: use `mistral-small3.1:24b`.

Why: it is the best model we tested on the combined metric that actually matters:

- completion reliability
- artifact quality
- grounded repo understanding

Several faster models completed, but they produced artifacts that were too generic or materially wrong. Several larger or more reasoning-heavy models simply failed on this workload and stack.
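That combined metric can be read as a lexicographic rule: completion first, then artifact quality, then speed. A minimal sketch under that assumption follows; the 0-3 quality scores are our illustrative reading of the table notes, not numbers produced by the harness:

```python
# Each tuple: (model, completed, quality_score, seconds).
# quality_score is an illustrative 0-3 reading of the quality notes
# above, not a value emitted by the benchmark harness.
runs = [
    ("mistral-small3.1:24b", True, 3, 235),
    ("gemma3:27b", True, 1, 167),
    ("magistral:24b", True, 1, 110),
    ("devstral:24b", True, 1, 86),
    ("mistral-nemo:12b", True, 0, 53),
    ("qwen3:14b", False, 0, 383),
]

# Reliability first, quality second, and only then prefer speed
# (negated seconds so that faster sorts higher on the final key).
best = max(runs, key=lambda r: (r[1], r[2], -r[3]))
print(best[0])  # mistral-small3.1:24b
```

Under this rule a slower model wins whenever it is the only one that both completes and stays grounded, which matches the outcome above.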
## Takeaway

The headline result is straightforward:

- Fastest is not the same as best.
- Bigger is not the same as more reliable.
- `mistral-small3.1:24b` is currently the best tested model for SourceBridge’s new understanding-first generation flow.
## Notes

- The benchmark path is now repeatable with the harness script above.
- During this work, we also fixed a real worker/protobuf mismatch, so diagnostics and benchmark runs are now stable and reproducible.