Merged
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,17 @@ All notable changes to the TypeScript package will be documented in this file.

## [Unreleased]

### Added

- **Opt-in task-conditioned slicing v1**: retrieval can now run with `retrievalStrategy: 'slice-v1'` to anchor on explicit symbols/paths, take bounded explain/debug/impact/review-oriented slices, suppress barrel-like nodes, and emit `slice` metadata (`mode`, `anchors`, `directions`, `selected_paths`) alongside the selected pack.
- **Real-workspace benchmark flow**: `docs/benchmarks/2026-05-11-spi-vs-legacy/` now ships `run-real-workspace.sh`, `summarize-real-workspaces.mjs`, `prompts.real-workspace.example.json`, and `REAL_WORKSPACE_REPORT_TEMPLATE.md` so backend-only and monorepo workspaces can be benchmarked locally without committing private paths or artifacts.

### Changed

- **Sketch semantics are richer but still deterministic**: `resolution: 'sketch'` now surfaces `reads env`, config reads, and compact side-effect hints such as `external_http`, `llm_call`, and `db_write` when graph evidence exists, while preserving dependency-record output for lighter nodes.
- **Slice-v1 is exposed safely in CLI/MCP**: CLI `pack`, MCP `retrieve`, and MCP `context_pack` now accept `retrieval_strategy: 'default' | 'slice-v1'`, validate unsupported values clearly, and keep compact output unchanged unless the caller opts in.
- **Benchmark analysis is broader and more honest**: the SPI probe now records resolution comparisons (`detail` / `signature` / `sketch`), slice-v1 runs, retrieval-gate metadata, top files, and a value-per-token calibration summary instead of implying a token win.
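The opt-in contract described above can be sketched as a small guard; the helper name and error text here are illustrative, not the package's actual implementation:

```javascript
// Illustrative only: mirrors the documented contract that callers get
// 'default' behavior unless they explicitly opt in to 'slice-v1'.
const SUPPORTED_RETRIEVAL_STRATEGIES = new Set(['default', 'slice-v1'])

function resolveRetrievalStrategy(value) {
  // Compact output stays unchanged unless the caller opts in.
  if (value === undefined) return 'default'
  if (!SUPPORTED_RETRIEVAL_STRATEGIES.has(value)) {
    throw new Error(`unsupported retrieval_strategy '${value}'; expected 'default' or 'slice-v1'`)
  }
  return value
}
```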

## [0.21.0] - 2026-05-11

### Changed
3 changes: 2 additions & 1 deletion README.md
@@ -94,7 +94,7 @@ NestJS + Next.js SaaS, 1,268 files, ~860K words. Same question, same Claude Opus

PR-review proof on a real diff: prompt tokens 63,024 → **8,690** (**7.25× fewer**). Receipts: [`docs/benchmarks/2026-05-02-govalidate-pr-review/`](docs/benchmarks/2026-05-02-govalidate-pr-review/).

`--spi` benchmark (bundled fixture, 7 prompts): pack tokens **−26%**, graph.json size **−32%**, cache-hit rebuild **−27%** vs legacy. Receipts: [`docs/benchmarks/2026-05-11-spi-vs-legacy/`](docs/benchmarks/2026-05-11-spi-vs-legacy/).
`--spi` benchmark (bundled fixture, 7 prompts): **better framework-shaped correctness**, **operational retrieval-level expansion**, **graph.json size −32%**, **cache-hit rebuild −27% vs legacy**, but **no measured explain-pack token win on that fixture**. Receipts: [`docs/benchmarks/2026-05-11-spi-vs-legacy/`](docs/benchmarks/2026-05-11-spi-vs-legacy/).

[Reproduce them](docs/benchmarks/2026-04-30-govalidate/verify.sh) with one shell script against the committed evidence files.

@@ -142,6 +142,7 @@ graphify-ts generate . # build the graph
graphify-ts generate . --spi # opt-in SPI pipeline (framework metadata + disk cache)
graphify-ts watch . # rebuild on file change
graphify-ts pack "how does auth work?" --task explain # compact CLI context payload
graphify-ts pack "why does auth fail?" --task explain --retrieval-strategy slice-v1
graphify-ts prompt "how does auth work?" --provider claude # provider-ready compiled prompt
graphify-ts review-compare graphify-out/graph.json --exec '...' --yes # PR review benchmark
graphify-ts compare "How does auth work?" --exec '...' --yes # general benchmark
24 changes: 24 additions & 0 deletions docs/benchmarks/2026-05-11-spi-vs-legacy/README.md
@@ -77,6 +77,7 @@ The runner now produces:
1. `legacy.json`, `spi-cold.json`, `spi-warm.json`
2. `spi-cold.analysis.json` — strategy comparison + retrieval-level sweep
3. `summary.json` — top-level aggregate report
4. `edge_count` in each variant JSON

### Optional: point the runner at another local repo

@@ -101,6 +102,29 @@ node docs/benchmarks/2026-05-11-spi-vs-legacy/probe.mjs \

If GoValidate is available locally, use the template above for both the backend-only checkout and the monorepo checkout. This repo does **not** commit any private-path defaults or fake results for those runs.

### Real-workspace matrix runner

You can benchmark two local workspaces side by side without committing private paths or artifacts:

```bash
GRAPHIFY_BENCH_BACKEND=/absolute/path/to/backend \
GRAPHIFY_BENCH_MONOREPO=/absolute/path/to/monorepo \
bash docs/benchmarks/2026-05-11-spi-vs-legacy/run-real-workspace.sh
```

Defaults:

- prompts file: `docs/benchmarks/2026-05-11-spi-vs-legacy/prompts.real-workspace.example.json`
- output bundle: `docs/benchmarks/2026-05-11-spi-vs-legacy/results/real-workspaces/<timestamp>/`

Artifacts:

1. one normal benchmark run per workspace (`backend/summary.json`, `monorepo/summary.json`)
2. `real-workspaces.summary.json` — side-by-side aggregate summary
3. `REAL_WORKSPACE_REPORT_TEMPLATE.md` — sharing template with privacy disclaimer

The aggregate summary keeps objective metrics separate from qualitative notes and does not claim any private-repo numbers unless you run the benchmark locally.

## Caveats / limitations

- **Fixture is synthetic.** It is still small enough that the new `value-per-token` scorer does not beat evidence-order on final pack size.
@@ -0,0 +1,32 @@
# Real workspace benchmark report template

This benchmark can be run on private repos locally.
No private paths or artifacts are committed.
If GoValidate is unavailable, no GoValidate-specific numbers are claimed.

## Workspace matrix

| Workspace | Variant | Build time (ms) | Graph size (bytes) | Nodes | Edges |
|---|---|---:|---:|---:|---:|

## Strategy / resolution comparisons

| Workspace | Prompt | Strategy | Resolution | Tokens | Nodes | Quality | Notes |
|---|---|---|---|---:|---:|---:|---|

## Retrieval-level comparisons

| Workspace | Prompt | Retrieval level | Tokens | Nodes | Gate reason |
|---|---|---:|---:|---:|---|

## Value-per-token calibration

- Where value-per-token helps:
- Where it does not change output:
- Where it hurts or increases tokens:
- Suggested scoring adjustments:

## Qualitative notes

- Objective metrics are listed separately from qualitative notes.
- Private workspace paths must be redacted before sharing any report excerpt.
26 changes: 26 additions & 0 deletions docs/benchmarks/2026-05-11-spi-vs-legacy/graph-stats.mjs
@@ -0,0 +1,26 @@
#!/usr/bin/env node

import { readFileSync } from 'node:fs'

const graphPath = process.argv[2]
if (!graphPath) {
  console.error('usage: graph-stats.mjs <graph.json>')
  process.exit(2)
}

const graphJson = readFileSync(graphPath, 'utf8')
let graph
try {
  graph = JSON.parse(graphJson)
} catch (error) {
  const message = error instanceof Error ? error.message : String(error)
  console.error(`failed to parse graph JSON at ${graphPath}: ${message}`)
  process.exit(1)
}
const nodeCount = Array.isArray(graph.nodes) ? graph.nodes.length : 0
const edgeCount = Array.isArray(graph.edges) ? graph.edges.length : 0

console.log(JSON.stringify({
  node_count: nodeCount,
  edge_count: edgeCount,
}))
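The stats contract is easy to check inline: given a graph JSON whose top-level `nodes` and `edges` are arrays, the script prints their lengths. The `graph` object below is a made-up stand-in, not a real graphify-ts graph:

```javascript
// Hypothetical graph.json content, mirroring the defensive
// Array.isArray checks used by graph-stats.mjs.
const graph = {
  nodes: [{ id: 'a' }, { id: 'b' }, { id: 'c' }],
  edges: [{ from: 'a', to: 'b' }],
}

const stats = {
  node_count: Array.isArray(graph.nodes) ? graph.nodes.length : 0,
  edge_count: Array.isArray(graph.edges) ? graph.edges.length : 0,
}

console.log(JSON.stringify(stats))
```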
80 changes: 79 additions & 1 deletion docs/benchmarks/2026-05-11-spi-vs-legacy/probe.mjs
@@ -4,6 +4,9 @@ import { readFileSync } from 'node:fs'
import { basename, relative, resolve } from 'node:path'

import { computeContextPackDiagnostics } from '../../../dist/src/runtime/context-pack-diagnostics.js'
import { estimateContextPackEntryTokens } from '../../../dist/src/runtime/context-pack.js'
import { applyContextPackResolution } from '../../../dist/src/runtime/context-pack-resolution.js'
import { classifyCalibrationBucket } from '../../../dist/src/runtime/benchmark/probe-calibration.js'
import { contextPackFromRetrieveResult, retrieveContext } from '../../../dist/src/runtime/retrieve.js'
import { loadGraph } from '../../../dist/src/runtime/serve.js'

@@ -31,16 +34,50 @@ function summarizeRun(result) {
    result.matched_nodes
      .map((node) => node.framework_role)
      .filter((value) => typeof value === 'string' && value.length > 0),
    ),
  ).sort()
  const topFiles = Array.from(
    new Set(
      result.matched_nodes
        .map((node) => node.source_file)
        .filter((value) => typeof value === 'string' && value.length > 0),
    ),
  ).slice(0, 5)
  const resolvedSummaries = Object.fromEntries(
    ['detail', 'signature', 'sketch'].map((resolution) => {
      const resolved = resolution === 'detail'
        ? {
            nodes: pack.nodes,
            bytes_saved: 0,
          }
        : applyContextPackResolution(pack.nodes, {
            resolution,
            relationships: pack.relationships,
          })
      const tokenCount = resolved.nodes.reduce(
        (total, node) => total + estimateContextPackEntryTokens(node.label, node.source_file, node.line_number, node.snippet),
        0,
      )
      return [resolution, {
        token_count: tokenCount,
        bytes_saved: resolved.bytes_saved,
        representation_types: Array.from(new Set(resolved.nodes.map((node) => node.representation_type ?? 'detail'))).sort(),
      }]
    }),
  )

  return {
    token_count: result.token_count,
    node_count: result.matched_nodes.length,
    labels: result.matched_nodes.map((node) => node.label),
    top_files: topFiles,
    framework_roles: frameworkRoles,
    quality_score: diagnostics.quality_score,
    warnings: diagnostics.warnings.map((warning) => warning.kind),
    retrieval_gate: result.retrieval_gate ?? null,
    retrieval_strategy: result.retrieval_strategy ?? 'default',
    slice: result.slice ?? null,
    resolutions: resolvedSummaries,
    selection_strategy: result.selection_diagnostics?.selection_strategy,
    used_tokens: result.selection_diagnostics?.used_tokens ?? result.token_count,
    required_overflow: result.selection_diagnostics?.required_overflow ?? false,
@@ -70,6 +107,12 @@ const promptAnalyses = prompts.map((prompt) => {
    budget,
    selectionStrategy: 'value-per-token',
  })
  const sliceV1 = retrieveContext(graph, {
    question: prompt.text,
    budget,
    selectionStrategy: 'value-per-token',
    retrievalStrategy: 'slice-v1',
  })

  return {
    id: prompt.id,
@@ -78,10 +121,13 @@
    strategies: {
      evidence_order: summarizeRun(evidenceOrder),
      value_per_token: summarizeRun(valuePerToken),
      slice_v1: summarizeRun(sliceV1),
    },
    deltas: {
      token_count: valuePerToken.token_count - evidenceOrder.token_count,
      node_count: valuePerToken.matched_nodes.length - evidenceOrder.matched_nodes.length,
      slice_token_count: sliceV1.token_count - valuePerToken.token_count,
      slice_node_count: sliceV1.matched_nodes.length - valuePerToken.matched_nodes.length,
    },
    retrieval_levels: retrievalLevels.map((level) => ({
      level,
@@ -95,8 +141,40 @@
  }
})

const calibration = promptAnalyses.reduce((summary, prompt) => {
  const evidenceOrder = prompt.strategies.evidence_order
  const valuePerToken = prompt.strategies.value_per_token
  const tokenDelta = valuePerToken.token_count - evidenceOrder.token_count
  const qualityDelta = valuePerToken.quality_score - evidenceOrder.quality_score
  const labelDelta = valuePerToken.labels.filter((label) => !evidenceOrder.labels.includes(label))
  const note = {
    prompt: prompt.id,
    token_delta: tokenDelta,
    quality_delta: qualityDelta,
    added_labels: labelDelta,
  }

  switch (classifyCalibrationBucket({ tokenDelta, qualityDelta })) {
    case 'helps':
      summary.helps.push(note)
      break
    case 'hurts_or_expands':
      summary.hurts_or_expands.push(note)
      break
    default:
      summary.no_material_change.push(note)
      break
  }
  return summary
}, {
  helps: [],
  no_material_change: [],
  hurts_or_expands: [],
})

console.log(JSON.stringify({
  graph_path: graphPathForOutput,
  budget,
  prompts: promptAnalyses,
  calibration,
}, null, 2))
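The bucketing above delegates to `classifyCalibrationBucket` from `probe-calibration.js`, which is not shown in this diff. A plausible sketch of the semantics implied by the bucket names (an assumption, not the real implementation) is:

```javascript
// Assumed semantics: value-per-token "helps" when it saves tokens without
// hurting quality, and "hurts_or_expands" when it costs tokens or quality.
// The actual thresholds live in dist/src/runtime/benchmark/probe-calibration.js.
function classifyCalibrationBucket({ tokenDelta, qualityDelta }) {
  if (tokenDelta < 0 && qualityDelta >= 0) return 'helps'
  if (tokenDelta > 0 || qualityDelta < 0) return 'hurts_or_expands'
  return 'no_material_change'
}
```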
@@ -0,0 +1,55 @@
{
  "schema_version": 1,
  "prompts": [
    {
      "id": "auth-flow",
      "intent": "explain",
      "text": "Explain auth flow end to end."
    },
    {
      "id": "report-generation",
      "intent": "explain",
      "text": "Explain validation report generation end to end."
    },
    {
      "id": "report-generation-slow",
      "intent": "debug",
      "text": "Why is validation report generation slow?"
    },
    {
      "id": "research-agent-impact",
      "intent": "impact",
      "text": "What can break if the research agent changes?"
    },
    {
      "id": "report-generation-tests",
      "intent": "explain",
      "text": "Which tests are relevant for report generation?"
    },
    {
      "id": "controller-to-persistence",
      "intent": "explain",
      "text": "Find the call path from controller to final report persistence."
    },
    {
      "id": "config-runtime-effect",
      "intent": "debug",
      "text": "Where does this env/config variable affect runtime behavior?"
    },
    {
      "id": "auth-config-impact",
      "intent": "impact",
      "text": "What can break if session/cookie/auth config changes?"
    },
    {
      "id": "review-current-diff",
      "intent": "review",
      "text": "Review current backend diff for risky changes."
    },
    {
      "id": "onboarding-routes",
      "intent": "explain",
      "text": "Which routes/controllers/services are involved in onboarding?"
    }
  ]
}
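A custom prompts file can be sanity-checked against the shape above before a run. The intent vocabulary here is inferred from the example entries, not from a published schema, so treat it as an assumption:

```javascript
// Inferred shape check; graphify-ts may accept more intents than these four.
const validIntents = new Set(['explain', 'debug', 'impact', 'review'])

function validatePromptsFile(file) {
  if (file.schema_version !== 1) return false
  if (!Array.isArray(file.prompts)) return false
  return file.prompts.every((p) =>
    typeof p.id === 'string' &&
    validIntents.has(p.intent) &&
    typeof p.text === 'string' && p.text.length > 0
  )
}

console.log(validatePromptsFile({
  schema_version: 1,
  prompts: [{ id: 'auth-flow', intent: 'explain', text: 'Explain auth flow end to end.' }],
}))
```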
45 changes: 45 additions & 0 deletions docs/benchmarks/2026-05-11-spi-vs-legacy/run-real-workspace.sh
@@ -0,0 +1,45 @@
#!/usr/bin/env bash

set -euo pipefail

HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TS="$(date -u +%Y-%m-%dT%H%M%SZ)"
BUNDLE_DIR="${GRAPHIFY_BENCH_REAL_RESULTS_DIR:-$HERE/results/real-workspaces/$TS}"
PROMPTS_FILE="${GRAPHIFY_BENCH_REAL_PROMPTS:-$HERE/prompts.real-workspace.example.json}"

if [[ ! -f "$PROMPTS_FILE" ]]; then
  echo "GRAPHIFY_BENCH_REAL_PROMPTS must point to an existing prompts JSON file: $PROMPTS_FILE" >&2
  exit 2
fi

run_workspace() {
  local workspace_name="$1"
  local workspace_path="$2"
  local workspace_var_name="$3"
  if [[ -z "$workspace_path" ]]; then
    return
  fi
  if [[ ! -d "$workspace_path" ]]; then
    echo "$workspace_var_name must point to an existing workspace directory: $workspace_path" >&2
    exit 2
  fi

  mkdir -p "$BUNDLE_DIR/$workspace_name"
  echo "[real-workspace] $workspace_name -> $workspace_path"
  GRAPHIFY_BENCH_FIXTURE="$workspace_path" \
    GRAPHIFY_BENCH_PROMPTS="$PROMPTS_FILE" \
    GRAPHIFY_BENCH_RESULTS_DIR="$BUNDLE_DIR/$workspace_name" \
    bash "$HERE/run.sh"
}

if [[ -z "${GRAPHIFY_BENCH_BACKEND:-}" && -z "${GRAPHIFY_BENCH_MONOREPO:-}" ]]; then
  echo "Set GRAPHIFY_BENCH_BACKEND and/or GRAPHIFY_BENCH_MONOREPO before running." >&2
  exit 2
fi

mkdir -p "$BUNDLE_DIR"
run_workspace "backend" "${GRAPHIFY_BENCH_BACKEND:-}" "GRAPHIFY_BENCH_BACKEND"
run_workspace "monorepo" "${GRAPHIFY_BENCH_MONOREPO:-}" "GRAPHIFY_BENCH_MONOREPO"

node "$HERE/summarize-real-workspaces.mjs" "$BUNDLE_DIR" > "$BUNDLE_DIR/real-workspaces.summary.json"
cat "$BUNDLE_DIR/real-workspaces.summary.json"