
Spike: benchmark backend-only vs monorepo context quality and runtime #69

Goal

Create a reproducible benchmark that explains why graphify-ts can perform worse on a backend-only folder but better on the full monorepo/workspace.

This is a diagnostic spike before changing the graph methodology.

Background

Recent usage showed inconsistent behavior:

  • Running on only the GoValidate backend was slower and produced worse/higher-token results.
  • Running on the full GoValidate workspace was faster and produced better/lower-token results.

This suggests the issue may not be raw repo size. It may be graph topology, workspace boundaries, hub nodes, noisy edges, or missing semantic context.

Scope

Build a benchmark fixture/workflow that compares at least the following (a runner sketch follows this list):

  1. Backend-only graph generation and retrieval.
  2. Full workspace/monorepo graph generation and retrieval.
  3. Same prompt set against both graphs.
  4. Same token budgets and command flags.
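
A minimal runner sketch under these constraints, assuming graphify-ts exposes a query-style CLI. The subcommand, flags, backend path, and token budget below are placeholders, not the real interface, and the prompt module is the one sketched under "Suggested prompt set". It executes the same prompts against both scopes with identical flags and records raw output plus wall-clock runtime per run:

```ts
// runner.ts: execute the same prompts against two graph scopes with identical flags.
// NOTE: the "graphify-ts" subcommand and flags here are placeholders, not the real CLI.
import { execFileSync } from "node:child_process";
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { prompts } from "./prompts"; // shared prompt set, see "Suggested prompt set"

// Two scopes under test: the backend package alone vs the workspace root.
const scopes = [
  { name: "backend-only", root: "packages/backend" }, // placeholder path
  { name: "monorepo", root: "." },
];

const date = new Date().toISOString().slice(0, 10);
const outDir = join("docs", "benchmarks", `${date}-backend-vs-monorepo`);
mkdirSync(outDir, { recursive: true });

for (const scope of scopes) {
  const runs = prompts.map((prompt) => {
    const start = Date.now();
    // Identical command and token budget for both scopes; only --root differs.
    const rawOutput = execFileSync(
      "graphify-ts",
      ["query", "--root", scope.root, "--prompt", prompt, "--token-budget", "8000"],
      { encoding: "utf8" },
    );
    return { prompt, runtimeMs: Date.now() - start, rawOutput };
  });
  writeFileSync(join(outDir, `${scope.name}.json`), JSON.stringify(runs, null, 2));
}
```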

Suggested prompt set

Use 8–12 realistic prompts, for example (a shared prompt module is sketched after this list):

  • Explain the validation report generation flow end-to-end.
  • Why is validation report generation slow?
  • What can break if the research agent changes?
  • Review the current backend diff for risky changes.
  • Which tests are likely relevant for report generation?
  • Find the call path from controller to final report persistence.
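
So that both graph configurations consume exactly the same input, the prompt set can live in a small committed module next to the benchmark script. The sketch below simply mirrors the examples above; the file name is illustrative:

```ts
// prompts.ts: the shared prompt set; both graph configurations must consume this
// exact list so that results stay comparable.
export const prompts: string[] = [
  "Explain the validation report generation flow end-to-end.",
  "Why is validation report generation slow?",
  "What can break if the research agent changes?",
  "Review the current backend diff for risky changes.",
  "Which tests are likely relevant for report generation?",
  "Find the call path from controller to final report persistence.",
  // ...extend to 8-12 prompts that reflect real usage.
];
```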

Metrics to capture

For every run, capture the following (a possible record shape is sketched after this list):

  • command used
  • runtime by phase if available
  • total selected nodes/files
  • output/context token count
  • number of snippets/code blocks emitted
  • number of handle_ids or expandable refs
  • top selected files/nodes
  • answer quality notes if using compare
  • missing-context notes
  • whether output stayed within budget
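
One possible shape for a per-(scope, prompt) record, assuming results are stored as JSON. Field names are illustrative, not an existing graphify-ts schema, and should track whatever the tool actually reports:

```ts
// One record per (scope, prompt) run; fields map one-to-one onto the list above.
// Names are illustrative, not an existing graphify-ts schema.
export interface BenchmarkRun {
  command: string;                            // exact CLI invocation used
  runtimeMsByPhase?: Record<string, number>;  // per-phase timings, if available
  selectedNodes: number;                      // total selected nodes/files
  contextTokens: number;                      // output/context token count
  snippetCount: number;                       // snippets/code blocks emitted
  handleRefCount: number;                     // handle_ids or expandable refs
  topSelections: string[];                    // top selected files/nodes
  qualityNotes?: string;                      // answer quality notes if using compare
  missingContextNotes?: string;               // what context was missing
  withinBudget: boolean;                      // whether output stayed within budget
}
```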

Deliverables

  • A committed benchmark folder, e.g. docs/benchmarks/<date>-backend-vs-monorepo/.
  • A script that can reproduce the comparison (a summarizer sketch follows this list).
  • A short Markdown report explaining observed failure modes.
  • Clear notes on whether the problem appears to be graph generation, graph traversal, ranking, packing, or workspace detection.
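
A small summarizer could seed the Markdown report from the runner's JSON output with a side-by-side table; the failure-mode narrative is still written by hand. File names and fields below match the hypothetical runner sketch above:

```ts
// summarize.ts: build a side-by-side Markdown table from the runner's JSON output.
// Usage (assuming tsx): npx tsx summarize.ts docs/benchmarks/<date>-backend-vs-monorepo
import { readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

interface Run { prompt: string; runtimeMs: number; contextTokens?: number }

const dir = process.argv[2] ?? ".";
const backend: Run[] = JSON.parse(readFileSync(join(dir, "backend-only.json"), "utf8"));
const mono: Run[] = JSON.parse(readFileSync(join(dir, "monorepo.json"), "utf8"));

const rows = backend.map((b, i) => {
  const m = mono[i];
  return `| ${b.prompt} | ${b.runtimeMs} | ${m.runtimeMs} | ${b.contextTokens ?? "?"} | ${m.contextTokens ?? "?"} |`;
});

const report = [
  "# Backend-only vs monorepo benchmark",
  "",
  "| Prompt | Backend ms | Monorepo ms | Backend tokens | Monorepo tokens |",
  "| --- | --- | --- | --- | --- |",
  ...rows,
  "",
  "## Observed failure modes",
  "",
  "_Fill in by hand: graph generation, traversal, ranking, packing, or workspace detection._",
].join("\n");

writeFileSync(join(dir, "report.md"), report);
```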

Acceptance criteria

  • The benchmark can be rerun locally with one script.
  • Results compare backend-only vs monorepo using the same prompts.
  • The report identifies the top 3 reasons for inconsistent behavior.
  • No engine rewrite is required in this issue; this is a measurement spike.

Non-goals

  • Do not redesign the graph yet.
  • Do not tune hardcoded behavior just to make GoValidate pass.
  • Do not hide bad results; the point is to expose them clearly.

Labels

  • context-quality: Quality of the compiled context pack
  • enhancement: New feature or request
  • performance: Runtime, latency, throughput, or token-cost work
  • research: Research spike or measurement work
