# Generation Model Benchmark Update

This is a short benchmark summary for the new `UNDERSTANDING_FIRST` generation path in SourceBridge.

## Test setup

- Path tested: repository cliff notes on the new `UNDERSTANDING_FIRST` flow
- Fixture repo: `tests/fixtures/multi-lang-repo`
- Audience/depth: `DEVELOPER` + `MEDIUM`
- Benchmark harness: `scripts/run_generation_mode_benchmark.py`

## Inference server

The benchmark runs were executed against an Ollama OpenAI-compatible endpoint:

- Provider: `ollama`
- Base URL: `http://192.168.10.108:11434/v1`

This was exercised through the local SourceBridge benchmark stack:

- API: `http://127.0.0.1:18084`
- Worker: local gRPC worker on `127.0.0.1:15053`
- SurrealDB: benchmark database `benchmark_modes_0413f`

## What we measured
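The endpoint above speaks the OpenAI-compatible chat-completions protocol, so a single generation call reduces to a JSON POST. A minimal sketch of that request body follows; the prompt text and the `build_chat_request` helper are illustrative, not the actual harness code:

```python
import json

# Base URL of the Ollama OpenAI-compatible endpoint from the setup above.
BASE_URL = "http://192.168.10.108:11434/v1"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON body for POST {BASE_URL}/chat/completions.

    Hypothetical helper: the real benchmark drives requests through the
    SourceBridge worker rather than calling the endpoint directly.
    """
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Summarize this repository."},
            {"role": "user", "content": prompt},
        ],
        "stream": False,
    }

body = build_chat_request("mistral-small3.1:24b", "Repository file tree: ...")
print(json.dumps(body, indent=2))
```

Any OpenAI-compatible client can submit this body; the benchmark numbers below were gathered through the full SourceBridge stack, not raw HTTP calls.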
The important thing here is not just whether a model finishes. It also has to produce a usable repository understanding artifact. Several models either failed with DEGRADED_COMPUTE or completed with shallow/generic output that did not actually represent the repo well.
## Results

| Model | Result | Time | Quality read |
| --- | --- | --- | --- |
| `mistral-small3.1:24b` | READY | 3m 55s | Best overall. Completed with grounded, usable cliff notes and the best repo-shape fidelity of the models tested. |
| `magistral:24b` | READY | 1m 50s | Fast, but quality was too generic. It collapsed into a simple Go server story and missed the real multi-language shape. |
| `gemma3:27b` | READY | 2m 47s | Faster than Mistral Small 3.1, but quality was noticeably weaker and overly generic. |
| `devstral:24b` | READY | 1m 26s | Very fast, but poor for this workload. It over-focused on `go/main.go` and missed the broader repository. |
| `mistral-nemo:12b` | READY | 0m 53s | Fastest among the Mistral-family tests, but quality was poor and it started inventing domain details. |
| `llama3.3:70b-instruct-q4_K_M` | READY | completed | Completed, but artifact quality/parsing was poor and several sections were effectively unusable. |
| `qwen3.5:122b-a10b` | FAILED | 3m 15s | Failed with `DEGRADED_COMPUTE`; not viable on this inference stack for this path. |
| `qwen3:14b` | FAILED | 6m 23s | Failed with `DEGRADED_COMPUTE`. Better than the smaller Qwen run, but still not stable enough. |
| `qwen3.5:9b` | FAILED | 9m 20s | Failed badly with `DEGRADED_COMPUTE`; not worth pursuing for this path. |
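For quick sanity checks, the table can be treated as data. The sketch below (rows copied from the table; `parse_duration` is a hypothetical helper, not harness code) shows why raw speed is a misleading ranking:

```python
import re

def parse_duration(text):
    """Parse an 'Xm Ys' time cell into seconds; None for non-time cells."""
    m = re.fullmatch(r"(\d+)m (\d+)s", text)
    return 60 * int(m.group(1)) + int(m.group(2)) if m else None

# (model, result, time) rows copied from the table above.
rows = [
    ("mistral-small3.1:24b", "READY", "3m 55s"),
    ("magistral:24b", "READY", "1m 50s"),
    ("gemma3:27b", "READY", "2m 47s"),
    ("devstral:24b", "READY", "1m 26s"),
    ("mistral-nemo:12b", "READY", "0m 53s"),
    ("llama3.3:70b-instruct-q4_K_M", "READY", "completed"),
    ("qwen3.5:122b-a10b", "FAILED", "3m 15s"),
    ("qwen3:14b", "FAILED", "6m 23s"),
    ("qwen3.5:9b", "FAILED", "9m 20s"),
]

ready = [r for r in rows if r[1] == "READY"]
failed = [r for r in rows if r[1] == "FAILED"]
print(f"READY: {len(ready)}, FAILED: {len(failed)}")  # READY: 6, FAILED: 3

# Fastest completed run is not the recommended one (quality is the tiebreaker).
timed = [(m, parse_duration(t)) for m, _, t in ready if parse_duration(t)]
fastest = min(timed, key=lambda x: x[1])
print("fastest READY:", fastest[0])  # fastest READY: mistral-nemo:12b
```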
## Recommendation

Current recommendation for SourceBridge repository understanding and cliff notes: use `mistral-small3.1:24b`.

Why: it is the best model we tested on the combined metric that actually matters:

- completion reliability
- artifact quality
- grounded repo understanding

Several faster models completed, but they produced artifacts that were too generic or materially wrong. Several larger or more reasoning-heavy models simply failed on this workload and stack.
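That combined metric can be read as a lexicographic rule: completion first, then artifact quality, then speed. A minimal sketch under that assumption follows; the 0-3 quality scores are our illustrative reading of the table notes, not numbers produced by the harness:

```python
# Each tuple: (model, completed, quality_score, seconds).
# quality_score is an illustrative 0-3 reading of the quality notes
# above, not a value emitted by the benchmark harness.
runs = [
    ("mistral-small3.1:24b", True, 3, 235),
    ("gemma3:27b", True, 1, 167),
    ("magistral:24b", True, 1, 110),
    ("devstral:24b", True, 1, 86),
    ("mistral-nemo:12b", True, 0, 53),
    ("qwen3:14b", False, 0, 383),
]

# Reliability first, quality second, and only then prefer speed
# (negated seconds so that faster sorts higher on the final key).
best = max(runs, key=lambda r: (r[1], r[2], -r[3]))
print(best[0])  # mistral-small3.1:24b
```

Under this rule a slower model wins whenever it is the only one that both completes and stays grounded, which matches the outcome above.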
## Takeaway

The headline result is straightforward:

- Fastest is not the same as best.
- Bigger is not the same as more reliable.
- `mistral-small3.1:24b` is currently the best tested model for SourceBridge’s new understanding-first generation flow.
## Notes

- The benchmark path is now repeatable with the harness script above.
- During this work, we also fixed a real worker/protobuf mismatch, so diagnostics and benchmark runs are now stable and reproducible.