Merged
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,17 @@ All notable changes to the TypeScript package will be documented in this file.

## [Unreleased]

### Added

- **Opt-in task-conditioned slicing v1**: retrieval can now run with `retrievalStrategy: 'slice-v1'` to anchor on explicit symbols/paths, take bounded explain/debug/impact/review-oriented slices, suppress barrel-like nodes, and emit `slice` metadata (`mode`, `anchors`, `directions`, `selected_paths`) alongside the selected pack.
- **Real-workspace benchmark flow**: `docs/benchmarks/2026-05-11-spi-vs-legacy/` now ships `run-real-workspace.sh`, `summarize-real-workspaces.mjs`, `prompts.real-workspace.example.json`, and `REAL_WORKSPACE_REPORT_TEMPLATE.md` so backend-only and monorepo workspaces can be benchmarked locally without committing private paths or artifacts.

### Changed

- **Sketch semantics are richer but still deterministic**: `resolution: 'sketch'` now surfaces `reads env`, config reads, and compact side-effect hints such as `external_http`, `llm_call`, and `db_write` when graph evidence exists, while preserving dependency-record output for lighter nodes.
- **Slice-v1 is exposed safely in CLI/MCP**: CLI `pack`, MCP `retrieve`, and MCP `context_pack` now accept `retrieval_strategy: 'default' | 'slice-v1'`, validate unsupported values clearly, and keep compact output unchanged unless the caller opts in.
- **Benchmark analysis is broader and more honest**: the SPI probe now records resolution comparisons (`detail` / `signature` / `sketch`), slice-v1 runs, retrieval-gate metadata, top files, and a value-per-token calibration summary instead of implying a token win.
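The opt-in contract described above can be sketched as a small guard; the helper name and error text here are illustrative, not the package's actual implementation:

```javascript
// Illustrative only: mirrors the documented contract that callers get
// 'default' behavior unless they explicitly opt in to 'slice-v1'.
const SUPPORTED_RETRIEVAL_STRATEGIES = new Set(['default', 'slice-v1'])

function resolveRetrievalStrategy(value) {
  // Compact output stays unchanged unless the caller opts in.
  if (value === undefined) return 'default'
  if (!SUPPORTED_RETRIEVAL_STRATEGIES.has(value)) {
    throw new Error(`unsupported retrieval_strategy '${value}'; expected 'default' or 'slice-v1'`)
  }
  return value
}
```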

## [0.21.0] - 2026-05-11

### Changed
3 changes: 2 additions & 1 deletion README.md
@@ -94,7 +94,7 @@ NestJS + Next.js SaaS, 1,268 files, ~860K words. Same question, same Claude Opus

PR-review proof on a real diff: prompt tokens 63,024 → **8,690** (**7.25× fewer**). Receipts: [`docs/benchmarks/2026-05-02-govalidate-pr-review/`](docs/benchmarks/2026-05-02-govalidate-pr-review/).

`--spi` benchmark (bundled fixture, 7 prompts): pack tokens **−26%**, graph.json size **−32%**, cache-hit rebuild **−27%** vs legacy. Receipts: [`docs/benchmarks/2026-05-11-spi-vs-legacy/`](docs/benchmarks/2026-05-11-spi-vs-legacy/).
`--spi` benchmark (bundled fixture, 7 prompts): **better framework-shaped correctness**, **operational retrieval-level expansion**, **graph.json size −32%**, **cache-hit rebuild −27% vs legacy**, but **no measured explain-pack token win on that fixture**. Receipts: [`docs/benchmarks/2026-05-11-spi-vs-legacy/`](docs/benchmarks/2026-05-11-spi-vs-legacy/).

[Reproduce them](docs/benchmarks/2026-04-30-govalidate/verify.sh) with one shell script against the committed evidence files.

@@ -142,6 +142,7 @@ graphify-ts generate . # build the graph
graphify-ts generate . --spi # opt-in SPI pipeline (framework metadata + disk cache)
graphify-ts watch . # rebuild on file change
graphify-ts pack "how does auth work?" --task explain # compact CLI context payload
graphify-ts pack "why does auth fail?" --task explain --retrieval-strategy slice-v1
graphify-ts prompt "how does auth work?" --provider claude # provider-ready compiled prompt
graphify-ts review-compare graphify-out/graph.json --exec '...' --yes # PR review benchmark
graphify-ts compare "How does auth work?" --exec '...' --yes # general benchmark
24 changes: 24 additions & 0 deletions docs/benchmarks/2026-05-11-spi-vs-legacy/README.md
@@ -77,6 +77,7 @@ The runner now produces:
1. `legacy.json`, `spi-cold.json`, `spi-warm.json`
2. `spi-cold.analysis.json` — strategy comparison + retrieval-level sweep
3. `summary.json` — top-level aggregate report
4. `edge_count` in each variant JSON

### Optional: point the runner at another local repo

@@ -101,6 +102,29 @@ node docs/benchmarks/2026-05-11-spi-vs-legacy/probe.mjs \

If GoValidate is available locally, use the template above for both the backend-only checkout and the monorepo checkout. This repo does **not** commit any private-path defaults or fake results for those runs.

### Real-workspace matrix runner

You can benchmark two local workspaces side by side without committing private paths or artifacts:

```bash
GRAPHIFY_BENCH_BACKEND=/absolute/path/to/backend \
GRAPHIFY_BENCH_MONOREPO=/absolute/path/to/monorepo \
bash docs/benchmarks/2026-05-11-spi-vs-legacy/run-real-workspace.sh
```

Defaults:

- prompts file: `docs/benchmarks/2026-05-11-spi-vs-legacy/prompts.real-workspace.example.json`
- output bundle: `docs/benchmarks/2026-05-11-spi-vs-legacy/results/real-workspaces/<timestamp>/`

Artifacts:

1. one normal benchmark run per workspace (`backend/summary.json`, `monorepo/summary.json`)
2. `real-workspaces.summary.json` — side-by-side aggregate summary
3. `REAL_WORKSPACE_REPORT_TEMPLATE.md` — sharing template with privacy disclaimer

The aggregate summary keeps objective metrics separate from qualitative notes and does not claim any private-repo numbers unless you run the benchmark locally.

## Caveats / limitations

- **Fixture is synthetic.** It is still small enough that the new `value-per-token` scorer does not beat evidence-order on final pack size.
@@ -0,0 +1,32 @@
# Real workspace benchmark report template

This benchmark can be run on private repos locally.
No private paths or artifacts are committed.
If GoValidate is unavailable, no GoValidate-specific numbers are claimed.

## Workspace matrix

| Workspace | Variant | Build time (ms) | Graph size (bytes) | Nodes | Edges |
|---|---|---:|---:|---:|---:|

## Strategy / resolution comparisons

| Workspace | Prompt | Strategy | Resolution | Tokens | Nodes | Quality | Notes |
|---|---|---|---|---:|---:|---:|---|

## Retrieval-level comparisons

| Workspace | Prompt | Retrieval level | Tokens | Nodes | Gate reason |
|---|---|---:|---:|---:|---|

## Value-per-token calibration

- Where value-per-token helps:
- Where it does not change output:
- Where it hurts or increases tokens:
- Suggested scoring adjustments:

## Qualitative notes

- Objective metrics are listed separately from qualitative notes.
- Private workspace paths must be redacted before sharing any report excerpt.
26 changes: 26 additions & 0 deletions docs/benchmarks/2026-05-11-spi-vs-legacy/graph-stats.mjs
@@ -0,0 +1,26 @@
#!/usr/bin/env node

import { readFileSync } from 'node:fs'

const graphPath = process.argv[2]
if (!graphPath) {
  console.error('usage: graph-stats.mjs <graph.json>')
  process.exit(2)
}

const graphJson = readFileSync(graphPath, 'utf8')
let graph
try {
  graph = JSON.parse(graphJson)
} catch (error) {
  const message = error instanceof Error ? error.message : String(error)
  console.error(`failed to parse graph JSON at ${graphPath}: ${message}`)
  process.exit(1)
}
const nodeCount = Array.isArray(graph.nodes) ? graph.nodes.length : 0
const edgeCount = Array.isArray(graph.edges) ? graph.edges.length : 0

console.log(JSON.stringify({
  node_count: nodeCount,
  edge_count: edgeCount,
}))
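The stats contract is easy to check inline: given a graph JSON whose top-level `nodes` and `edges` are arrays, the script prints their lengths. The `graph` object below is a made-up stand-in, not a real graphify-ts graph:

```javascript
// Hypothetical graph.json content, mirroring the defensive
// Array.isArray checks used by graph-stats.mjs.
const graph = {
  nodes: [{ id: 'a' }, { id: 'b' }, { id: 'c' }],
  edges: [{ from: 'a', to: 'b' }],
}

const stats = {
  node_count: Array.isArray(graph.nodes) ? graph.nodes.length : 0,
  edge_count: Array.isArray(graph.edges) ? graph.edges.length : 0,
}

console.log(JSON.stringify(stats))
```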
80 changes: 79 additions & 1 deletion docs/benchmarks/2026-05-11-spi-vs-legacy/probe.mjs
@@ -4,6 +4,9 @@ import { readFileSync } from 'node:fs'
import { basename, relative, resolve } from 'node:path'

import { computeContextPackDiagnostics } from '../../../dist/src/runtime/context-pack-diagnostics.js'
import { estimateContextPackEntryTokens } from '../../../dist/src/runtime/context-pack.js'
import { applyContextPackResolution } from '../../../dist/src/runtime/context-pack-resolution.js'
import { classifyCalibrationBucket } from '../../../dist/src/runtime/benchmark/probe-calibration.js'
import { contextPackFromRetrieveResult, retrieveContext } from '../../../dist/src/runtime/retrieve.js'
import { loadGraph } from '../../../dist/src/runtime/serve.js'

@@ -31,16 +34,50 @@ function summarizeRun(result) {
    result.matched_nodes
      .map((node) => node.framework_role)
      .filter((value) => typeof value === 'string' && value.length > 0),
    ),
  ).sort()
  const topFiles = Array.from(
    new Set(
      result.matched_nodes
        .map((node) => node.source_file)
        .filter((value) => typeof value === 'string' && value.length > 0),
    ),
  ).slice(0, 5)
  const resolvedSummaries = Object.fromEntries(
    ['detail', 'signature', 'sketch'].map((resolution) => {
      const resolved = resolution === 'detail'
        ? {
            nodes: pack.nodes,
            bytes_saved: 0,
          }
        : applyContextPackResolution(pack.nodes, {
            resolution,
            relationships: pack.relationships,
          })
      const tokenCount = resolved.nodes.reduce(
        (total, node) => total + estimateContextPackEntryTokens(node.label, node.source_file, node.line_number, node.snippet),
        0,
      )
      return [resolution, {
        token_count: tokenCount,
        bytes_saved: resolved.bytes_saved,
        representation_types: Array.from(new Set(resolved.nodes.map((node) => node.representation_type ?? 'detail'))).sort(),
      }]
    }),
  )

  return {
    token_count: result.token_count,
    node_count: result.matched_nodes.length,
    labels: result.matched_nodes.map((node) => node.label),
    top_files: topFiles,
    framework_roles: frameworkRoles,
    quality_score: diagnostics.quality_score,
    warnings: diagnostics.warnings.map((warning) => warning.kind),
    retrieval_gate: result.retrieval_gate ?? null,
    retrieval_strategy: result.retrieval_strategy ?? 'default',
    slice: result.slice ?? null,
    resolutions: resolvedSummaries,
    selection_strategy: result.selection_diagnostics?.selection_strategy,
    used_tokens: result.selection_diagnostics?.used_tokens ?? result.token_count,
    required_overflow: result.selection_diagnostics?.required_overflow ?? false,
@@ -70,6 +107,12 @@ const promptAnalyses = prompts.map((prompt) => {
    budget,
    selectionStrategy: 'value-per-token',
  })
  const sliceV1 = retrieveContext(graph, {
    question: prompt.text,
    budget,
    selectionStrategy: 'value-per-token',
    retrievalStrategy: 'slice-v1',
  })

  return {
    id: prompt.id,
@@ -78,10 +121,13 @@
    strategies: {
      evidence_order: summarizeRun(evidenceOrder),
      value_per_token: summarizeRun(valuePerToken),
      slice_v1: summarizeRun(sliceV1),
    },
    deltas: {
      token_count: valuePerToken.token_count - evidenceOrder.token_count,
      node_count: valuePerToken.matched_nodes.length - evidenceOrder.matched_nodes.length,
      slice_token_count: sliceV1.token_count - valuePerToken.token_count,
      slice_node_count: sliceV1.matched_nodes.length - valuePerToken.matched_nodes.length,
    },
    retrieval_levels: retrievalLevels.map((level) => ({
      level,
@@ -95,8 +141,40 @@
  }
})

const calibration = promptAnalyses.reduce((summary, prompt) => {
  const evidenceOrder = prompt.strategies.evidence_order
  const valuePerToken = prompt.strategies.value_per_token
  const tokenDelta = valuePerToken.token_count - evidenceOrder.token_count
  const qualityDelta = valuePerToken.quality_score - evidenceOrder.quality_score
  const labelDelta = valuePerToken.labels.filter((label) => !evidenceOrder.labels.includes(label))
  const note = {
    prompt: prompt.id,
    token_delta: tokenDelta,
    quality_delta: qualityDelta,
    added_labels: labelDelta,
  }

  switch (classifyCalibrationBucket({ tokenDelta, qualityDelta })) {
    case 'helps':
      summary.helps.push(note)
      break
    case 'hurts_or_expands':
      summary.hurts_or_expands.push(note)
      break
    default:
      summary.no_material_change.push(note)
      break
  }
  return summary
}, {
  helps: [],
  no_material_change: [],
  hurts_or_expands: [],
})

console.log(JSON.stringify({
  graph_path: graphPathForOutput,
  budget,
  prompts: promptAnalyses,
  calibration,
}, null, 2))
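The bucketing above delegates to `classifyCalibrationBucket` from `probe-calibration.js`, which is not shown in this diff. A plausible sketch of the semantics implied by the bucket names (an assumption, not the real implementation) is:

```javascript
// Assumed semantics: value-per-token "helps" when it saves tokens without
// hurting quality, and "hurts_or_expands" when it costs tokens or quality.
// The actual thresholds live in dist/src/runtime/benchmark/probe-calibration.js.
function classifyCalibrationBucket({ tokenDelta, qualityDelta }) {
  if (tokenDelta < 0 && qualityDelta >= 0) return 'helps'
  if (tokenDelta > 0 || qualityDelta < 0) return 'hurts_or_expands'
  return 'no_material_change'
}
```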
@@ -0,0 +1,55 @@
{
  "schema_version": 1,
  "prompts": [
    {
      "id": "auth-flow",
      "intent": "explain",
      "text": "Explain auth flow end to end."
    },
    {
      "id": "report-generation",
      "intent": "explain",
      "text": "Explain validation report generation end to end."
    },
    {
      "id": "report-generation-slow",
      "intent": "debug",
      "text": "Why is validation report generation slow?"
    },
    {
      "id": "research-agent-impact",
      "intent": "impact",
      "text": "What can break if the research agent changes?"
    },
    {
      "id": "report-generation-tests",
      "intent": "explain",
      "text": "Which tests are relevant for report generation?"
    },
    {
      "id": "controller-to-persistence",
      "intent": "explain",
      "text": "Find the call path from controller to final report persistence."
    },
    {
      "id": "config-runtime-effect",
      "intent": "debug",
      "text": "Where does this env/config variable affect runtime behavior?"
    },
    {
      "id": "auth-config-impact",
      "intent": "impact",
      "text": "What can break if session/cookie/auth config changes?"
    },
    {
      "id": "review-current-diff",
      "intent": "review",
      "text": "Review current backend diff for risky changes."
    },
    {
      "id": "onboarding-routes",
      "intent": "explain",
      "text": "Which routes/controllers/services are involved in onboarding?"
    }
  ]
}
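A custom prompts file can be sanity-checked against the shape above before a run. The intent vocabulary here is inferred from the example entries, not from a published schema, so treat it as an assumption:

```javascript
// Inferred shape check; graphify-ts may accept more intents than these four.
const validIntents = new Set(['explain', 'debug', 'impact', 'review'])

function validatePromptsFile(file) {
  if (file.schema_version !== 1) return false
  if (!Array.isArray(file.prompts)) return false
  return file.prompts.every((p) =>
    typeof p.id === 'string' &&
    validIntents.has(p.intent) &&
    typeof p.text === 'string' && p.text.length > 0
  )
}

console.log(validatePromptsFile({
  schema_version: 1,
  prompts: [{ id: 'auth-flow', intent: 'explain', text: 'Explain auth flow end to end.' }],
}))
```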
45 changes: 45 additions & 0 deletions docs/benchmarks/2026-05-11-spi-vs-legacy/run-real-workspace.sh
@@ -0,0 +1,45 @@
#!/usr/bin/env bash

set -euo pipefail

HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TS="$(date -u +%Y-%m-%dT%H%M%SZ)"
BUNDLE_DIR="${GRAPHIFY_BENCH_REAL_RESULTS_DIR:-$HERE/results/real-workspaces/$TS}"
PROMPTS_FILE="${GRAPHIFY_BENCH_REAL_PROMPTS:-$HERE/prompts.real-workspace.example.json}"

if [[ ! -f "$PROMPTS_FILE" ]]; then
  echo "GRAPHIFY_BENCH_REAL_PROMPTS must point to an existing prompts JSON file: $PROMPTS_FILE" >&2
  exit 2
fi

run_workspace() {
  local workspace_name="$1"
  local workspace_path="$2"
  local workspace_var_name="$3"
  if [[ -z "$workspace_path" ]]; then
    return
  fi
  if [[ ! -d "$workspace_path" ]]; then
    echo "$workspace_var_name must point to an existing workspace directory: $workspace_path" >&2
    exit 2
  fi

  mkdir -p "$BUNDLE_DIR/$workspace_name"
  echo "[real-workspace] $workspace_name -> $workspace_path"
  GRAPHIFY_BENCH_FIXTURE="$workspace_path" \
    GRAPHIFY_BENCH_PROMPTS="$PROMPTS_FILE" \
    GRAPHIFY_BENCH_RESULTS_DIR="$BUNDLE_DIR/$workspace_name" \
    bash "$HERE/run.sh"
}

if [[ -z "${GRAPHIFY_BENCH_BACKEND:-}" && -z "${GRAPHIFY_BENCH_MONOREPO:-}" ]]; then
  echo "Set GRAPHIFY_BENCH_BACKEND and/or GRAPHIFY_BENCH_MONOREPO before running." >&2
  exit 2
fi

mkdir -p "$BUNDLE_DIR"
run_workspace "backend" "${GRAPHIFY_BENCH_BACKEND:-}" "GRAPHIFY_BENCH_BACKEND"
run_workspace "monorepo" "${GRAPHIFY_BENCH_MONOREPO:-}" "GRAPHIFY_BENCH_MONOREPO"

node "$HERE/summarize-real-workspaces.mjs" "$BUNDLE_DIR" > "$BUNDLE_DIR/real-workspaces.summary.json"
cat "$BUNDLE_DIR/real-workspaces.summary.json"