Goal
Create a reproducible benchmark that explains why graphify-ts can perform worse on a backend-only folder but better on the full monorepo/workspace.
This is a diagnostic spike before changing the graph methodology.
Background
Recent usage showed inconsistent behavior:
- Running on only the GoValidate backend was slower and produced worse/higher-token results.
- Running on the full GoValidate workspace was faster and produced better/lower-token results.
This suggests the issue may not be raw repo size. It may be graph topology, workspace boundaries, hub nodes, noisy edges, or missing semantic context.
Scope
Build a benchmark fixture/workflow that compares at least the following (a harness sketch follows the list):
- Backend-only graph generation and retrieval.
- Full workspace/monorepo graph generation and retrieval.
- Same prompt set against both graphs.
- Same token budgets and command flags.
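As a concrete starting point, the harness could be a small TypeScript script that generates and queries both graphs with the same prompt list and the same token budget. Everything below is a sketch: the graphify-ts "query" subcommand, the --root/--budget/--prompt flags, and the file paths are assumptions standing in for whatever the real CLI exposes; the shared prompts module it imports is sketched under the prompt set below.

```ts
// compare.ts -- sketch of the comparison harness (command names and flags are assumptions).
import { execFileSync } from "node:child_process";
import { mkdirSync, writeFileSync } from "node:fs";
import { prompts } from "./prompts"; // shared prompt set, sketched below

interface Target {
  label: "backend-only" | "monorepo";
  root: string; // folder the graph is generated from
}

const targets: Target[] = [
  { label: "backend-only", root: "./backend" },
  { label: "monorepo", root: "." },
];

const TOKEN_BUDGET = 8000; // keep identical for both targets

mkdirSync("results", { recursive: true });

for (const target of targets) {
  prompts.forEach((prompt, i) => {
    const started = Date.now();
    // Hypothetical invocation: swap in the real graphify-ts command and flags.
    const output = execFileSync(
      "graphify-ts",
      ["query", "--root", target.root, "--budget", String(TOKEN_BUDGET), "--prompt", prompt],
      { encoding: "utf8" },
    );
    writeFileSync(`results/${target.label}-prompt-${i}.txt`, output);
    console.log(`${target.label} prompt ${i}: ${Date.now() - started} ms`);
  });
}
```

Keeping both targets in one loop makes it harder for the two runs to drift apart in flags, budgets, or prompt wording between reruns.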
Suggested prompt set
Use 8–12 realistic prompts, for example (a shared prompt file is sketched after the list):
- Explain the validation report generation flow end-to-end.
- Why is validation report generation slow?
- What can break if the research agent changes?
- Review the current backend diff for risky changes.
- Which tests are likely relevant for report generation?
- Find the call path from controller to final report persistence.
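So that both graphs answer exactly the same questions, the prompt set can live in one versioned module that the harness imports. The module below is only a sketch seeded with the examples above; extend it to the full 8–12 prompts.

```ts
// prompts.ts -- one shared, versioned prompt list so both graphs answer identical questions.
export const prompts: string[] = [
  "Explain the validation report generation flow end-to-end.",
  "Why is validation report generation slow?",
  "What can break if the research agent changes?",
  "Review the current backend diff for risky changes.",
  "Which tests are likely relevant for report generation?",
  "Find the call path from controller to final report persistence.",
];
```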
Metrics to capture
For every run, capture the following (a suggested record shape follows the list):
- command used
- runtime by phase if available
- total selected nodes/files
- output/context token count
- number of snippets/code blocks emitted
- number of handle_ids or expandable refs
- top selected files/nodes
- answer quality notes if using compare
- missing-context notes
- whether output stayed within budget
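To keep these fields consistent across runs, each run can be recorded as a single typed object. The interface below is only a suggested shape mirroring the list above; the field names are placeholders, not an existing graphify-ts schema.

```ts
// metrics.ts -- suggested per-run record; field names are ours, not a graphify-ts schema.
export interface BenchmarkRun {
  command: string;                          // exact CLI invocation used
  target: "backend-only" | "monorepo";
  prompt: string;
  phaseTimingsMs?: Record<string, number>;  // e.g. { generate: ..., traverse: ..., pack: ... }
  selectedNodes: number;
  selectedFiles: number;
  contextTokens: number;                    // output/context token count
  snippetCount: number;                     // snippets/code blocks emitted
  handleIdCount: number;                    // handle_ids / expandable refs
  topSelections: string[];                  // top selected files/nodes
  qualityNotes?: string;                    // answer quality notes (e.g. from compare)
  missingContextNotes?: string;
  withinBudget: boolean;
}
```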
Deliverables
- A committed benchmark folder, e.g. docs/benchmarks/<date>-backend-vs-monorepo/.
- A script that can reproduce the comparison.
- A short Markdown report explaining observed failure modes.
- Clear notes on whether the problem appears to be graph generation, graph traversal, ranking, packing, or workspace detection.
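The Markdown report could be generated from the captured run records rather than written by hand, so reruns stay directly comparable. The sketch below assumes the harness has already written all BenchmarkRun records to a hypothetical results/runs.json and only compares context token counts per prompt; the other metrics would be added the same way.

```ts
// report.ts -- sketch: turn captured runs into the Markdown comparison report.
// Assumes the harness has written all BenchmarkRun records to results/runs.json.
import { readFileSync, writeFileSync } from "node:fs";
import type { BenchmarkRun } from "./metrics";

const runs: BenchmarkRun[] = JSON.parse(readFileSync("results/runs.json", "utf8"));

// Group runs by prompt so backend-only and monorepo can be compared side by side.
const byPrompt = new Map<string, Partial<Record<BenchmarkRun["target"], BenchmarkRun>>>();
for (const run of runs) {
  const entry = byPrompt.get(run.prompt) ?? {};
  entry[run.target] = run;
  byPrompt.set(run.prompt, entry);
}

let md = "| Prompt | backend-only tokens | monorepo tokens | delta |\n|---|---|---|---|\n";
for (const [prompt, pair] of byPrompt) {
  const backend = pair["backend-only"];
  const mono = pair.monorepo;
  if (!backend || !mono) continue; // skip prompts missing one side
  md += `| ${prompt} | ${backend.contextTokens} | ${mono.contextTokens} | ${backend.contextTokens - mono.contextTokens} |\n`;
}
writeFileSync("results/report.md", md);
```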
Acceptance criteria
- The benchmark can be rerun locally with one script.
- Results compare backend-only vs monorepo using the same prompts.
- The report identifies the top 3 reasons for inconsistent behavior.
- No engine rewrite is required in this issue; this is a measurement spike.
Non-goals
- Do not redesign the graph yet.
- Do not tune hardcoded behavior just to make GoValidate pass.
- Do not hide bad results; the point is to expose them clearly.