
Spike: benchmark backend-only vs monorepo context quality and runtime #69

Goal

Create a reproducible benchmark that explains why graphify-ts can perform worse on a backend-only folder but better on the full monorepo/workspace.

This is a diagnostic spike before changing the graph methodology.

Background

Recent usage showed inconsistent behavior:

  • Running on only the GoValidate backend was slower and produced worse/higher-token results.
  • Running on the full GoValidate workspace was faster and produced better/lower-token results.

This suggests the issue may not be raw repo size. It may be graph topology, workspace boundaries, hub nodes, noisy edges, or missing semantic context.

Scope

Build a benchmark fixture/workflow that compares at least the following (a runner sketch follows this list):

  1. Backend-only graph generation and retrieval.
  2. Full workspace/monorepo graph generation and retrieval.
  3. Same prompt set against both graphs.
  4. Same token budgets and command flags.
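
A minimal runner sketch under these constraints, assuming graphify-ts exposes a query-style CLI. The subcommand, flags, backend path, and token budget below are placeholders, not the real interface, and the prompt module is the one sketched under "Suggested prompt set". It executes the same prompts against both scopes with identical flags and records raw output plus wall-clock runtime per run:

```ts
// runner.ts: execute the same prompts against two graph scopes with identical flags.
// NOTE: the "graphify-ts" subcommand and flags here are placeholders, not the real CLI.
import { execFileSync } from "node:child_process";
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { prompts } from "./prompts"; // shared prompt set, see "Suggested prompt set"

// Two scopes under test: the backend package alone vs the workspace root.
const scopes = [
  { name: "backend-only", root: "packages/backend" }, // placeholder path
  { name: "monorepo", root: "." },
];

const date = new Date().toISOString().slice(0, 10);
const outDir = join("docs", "benchmarks", `${date}-backend-vs-monorepo`);
mkdirSync(outDir, { recursive: true });

for (const scope of scopes) {
  const runs = prompts.map((prompt) => {
    const start = Date.now();
    // Identical command and token budget for both scopes; only --root differs.
    const rawOutput = execFileSync(
      "graphify-ts",
      ["query", "--root", scope.root, "--prompt", prompt, "--token-budget", "8000"],
      { encoding: "utf8" },
    );
    return { prompt, runtimeMs: Date.now() - start, rawOutput };
  });
  writeFileSync(join(outDir, `${scope.name}.json`), JSON.stringify(runs, null, 2));
}
```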

Suggested prompt set

Use 8–12 realistic prompts, for example (a shared prompt module is sketched after this list):

  • Explain the validation report generation flow end-to-end.
  • Why is validation report generation slow?
  • What can break if the research agent changes?
  • Review the current backend diff for risky changes.
  • Which tests are likely relevant for report generation?
  • Find the call path from controller to final report persistence.
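
So that both graph configurations consume exactly the same input, the prompt set can live in a small committed module next to the benchmark script. The sketch below simply mirrors the examples above; the file name is illustrative:

```ts
// prompts.ts: the shared prompt set; both graph configurations must consume this
// exact list so that results stay comparable.
export const prompts: string[] = [
  "Explain the validation report generation flow end-to-end.",
  "Why is validation report generation slow?",
  "What can break if the research agent changes?",
  "Review the current backend diff for risky changes.",
  "Which tests are likely relevant for report generation?",
  "Find the call path from controller to final report persistence.",
  // ...extend to 8-12 prompts that reflect real usage.
];
```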

Metrics to capture

For every run, capture the following (a possible record shape is sketched after this list):

  • command used
  • runtime by phase if available
  • total selected nodes/files
  • output/context token count
  • number of snippets/code blocks emitted
  • number of handle_ids or expandable refs
  • top selected files/nodes
  • answer quality notes if using compare
  • missing-context notes
  • whether output stayed within budget
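
One possible shape for a per-(scope, prompt) record, assuming results are stored as JSON. Field names are illustrative, not an existing graphify-ts schema, and should track whatever the tool actually reports:

```ts
// One record per (scope, prompt) run; fields map one-to-one onto the list above.
// Names are illustrative, not an existing graphify-ts schema.
export interface BenchmarkRun {
  command: string;                            // exact CLI invocation used
  runtimeMsByPhase?: Record<string, number>;  // per-phase timings, if available
  selectedNodes: number;                      // total selected nodes/files
  contextTokens: number;                      // output/context token count
  snippetCount: number;                       // snippets/code blocks emitted
  handleRefCount: number;                     // handle_ids or expandable refs
  topSelections: string[];                    // top selected files/nodes
  qualityNotes?: string;                      // answer quality notes if using compare
  missingContextNotes?: string;               // what context was missing
  withinBudget: boolean;                      // whether output stayed within budget
}
```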

Deliverables

  • A committed benchmark folder, e.g. docs/benchmarks/<date>-backend-vs-monorepo/.
  • A script that can reproduce the comparison (a summarizer sketch follows this list).
  • A short Markdown report explaining observed failure modes.
  • Clear notes on whether the problem appears to be graph generation, graph traversal, ranking, packing, or workspace detection.
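
A small summarizer could seed the Markdown report from the runner's JSON output with a side-by-side table; the failure-mode narrative is still written by hand. File names and fields below match the hypothetical runner sketch above:

```ts
// summarize.ts: build a side-by-side Markdown table from the runner's JSON output.
// Usage (assuming tsx): npx tsx summarize.ts docs/benchmarks/<date>-backend-vs-monorepo
import { readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

interface Run { prompt: string; runtimeMs: number; contextTokens?: number }

const dir = process.argv[2] ?? ".";
const backend: Run[] = JSON.parse(readFileSync(join(dir, "backend-only.json"), "utf8"));
const mono: Run[] = JSON.parse(readFileSync(join(dir, "monorepo.json"), "utf8"));

const rows = backend.map((b, i) => {
  const m = mono[i];
  return `| ${b.prompt} | ${b.runtimeMs} | ${m.runtimeMs} | ${b.contextTokens ?? "?"} | ${m.contextTokens ?? "?"} |`;
});

const report = [
  "# Backend-only vs monorepo benchmark",
  "",
  "| Prompt | Backend ms | Monorepo ms | Backend tokens | Monorepo tokens |",
  "| --- | --- | --- | --- | --- |",
  ...rows,
  "",
  "## Observed failure modes",
  "",
  "_Fill in by hand: graph generation, traversal, ranking, packing, or workspace detection._",
].join("\n");

writeFileSync(join(dir, "report.md"), report);
```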

Acceptance criteria

  • The benchmark can be rerun locally with one script.
  • Results compare backend-only vs monorepo using the same prompts.
  • The report identifies the top 3 reasons for inconsistent behavior.
  • No engine rewrite is required in this issue; this is a measurement spike.

Non-goals

  • Do not redesign the graph yet.
  • Do not tune hardcoded behavior just to make GoValidate pass.
  • Do not hide bad results; the point is to expose them clearly.

Labels

  • context-quality: Quality of the compiled context pack
  • enhancement: New feature or request
  • performance: Runtime, latency, throughput, or token-cost work
  • research: Research spike or measurement work
