RFC: Phase 6 validation rollout#645

Draft
mariusvniekerk wants to merge 36 commits into main from quality/phase-6-validation-rollout

@mariusvniekerk

Collaborator

This is intentionally a draft and should be read more like an RFC or idea review than a finished rollout. The branch sketches a direction for Phase 6 validation, calibration, evidence, and generated-insight review surfaces so we can evaluate whether the quality flow is useful before treating the scoring as final. Final calibration should still wait for Phases 2-5 to settle.

The status quo gave deterministic scores and generated insight text, but not enough reviewer-facing context to judge whether the categories were actionable. This draft tries to make the proposed workflow concrete: scoped scored insights, evidence drilldowns, generated insight archives, automated-session filtering, weaker-signal downgrades, and contracts around stale signal versions.

Screenshots below use synthetic mocked API data so the PR can show the proposed UI without exposing local archive content.

Screenshots

Quality patterns

Evidence drilldown

Generated insight archive

Review Focus

  • Whether the proposed quality categories and evidence links make the page worth using instead of just producing vague labels.
  • Whether automated session filtering should be the default for scored and generated insights.
  • Whether the downgraded weak signals and frustration markers are the right calibration direction.
  • Whether the generated-insight archive, retry/copy/delete flows, and validation failures are understandable enough for iterative use.
  • What should remain RFC-only until Phases 2-5 land.

Validation

  • go test ./internal/parser ./internal/insight ./internal/server ./internal/signals ./internal/sync ./internal/service
  • go test ./internal/db -run 'TestAnalyticsFilterTermination|TestWriteSessionBatch|TestGetAnalyticsSignalSessionsRejectsUnsupportedSignal|TestParseEvidenceTime|TestBackfillSignals'
  • npm --prefix frontend run check
  • Generated screenshots from mocked API responses against the Vite dev UI at http://127.0.0.1:5173/insights.

Known residual: full go test ./internal/db still hits the pre-existing TestReplaceSessionMessages_LargeSession / messages_fts fixture issue, so the PR validation uses the focused DB tests that cover this work.

mariusvniekerk and others added 30 commits May 26, 2026 18:43
Phase 1 needs an implementation-ready mapping from AI Engineering Coach rules to agentsview signal families before any scorer, schema, API, or UI work begins. This records which heuristics are safe per-session score candidates, which belong in aggregate analytics or insights, and which are blocked until agentsview stores more precise request, mode, approval, context, or review metadata.

Validation: verified all 45 Coach rule IDs from /tmp/AI-Engineering-Coach/src/core/rules are referenced in the artifact; ran git diff --cached --check.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Phase 2 scorers were being used to calculate health scores, but their evidence counts were not first-class session data. Persisting the scalar counts makes detail explanations stable, lets analytics aggregate the new signal families, and allows PostgreSQL mirrors to stay consistent with SQLite archives without recreating the archive database.

The new backfill marker intentionally causes existing sessions with messages to flow through the established signal recomputation path so old rows get a quality-signal version instead of being confused with computed-zero rows.

Validation: go test ./internal/db -run 'Test(UpdateSessionSignals|PendingSignalSessions|BackfillSignals|MarkSignalsBackfillDone|GetAnalyticsSignals)'; go test ./internal/service ./internal/server ./internal/sync ./internal/postgres ./cmd/agentsview; npm test -- --run src/lib/stores/analytics.test.ts; npm run check; git diff --check. Full go test ./internal/db still hits the pre-existing TestReplaceSessionMessages_LargeSession FTS fixture failure: no such table: messages_fts.
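The backfill rule described in this commit can be sketched as below. This is a minimal illustration, not the branch's code: the `Session` struct, its field names, and `currentSignalsVersion` are hypothetical stand-ins for whatever the archive schema actually stores.

```go
package main

import "fmt"

// currentSignalsVersion is a hypothetical version number for the scorer set.
const currentSignalsVersion = 2

// Session is a hypothetical, trimmed-down view of a stored session row.
type Session struct {
	ID             string
	MessageCount   int
	SignalsVersion int // 0 means signals were never computed for this row
	RetryCount     int // example of a persisted evidence count
}

// NeedsBackfill reports whether a session should flow through signal
// recomputation. A row with messages and an old (or missing) version is
// re-scored; a row already at the current version is left alone even if
// all of its counts are zero, because those zeros were actually computed.
func NeedsBackfill(s Session) bool {
	return s.MessageCount > 0 && s.SignalsVersion < currentSignalsVersion
}

func main() {
	sessions := []Session{
		{ID: "legacy", MessageCount: 12},                                      // never scored
		{ID: "clean", MessageCount: 8, SignalsVersion: currentSignalsVersion}, // computed zero
		{ID: "empty"}, // no messages, nothing to score
	}
	for _, s := range sessions {
		fmt.Printf("%s needs backfill: %v\n", s.ID, NeedsBackfill(s))
	}
}
```

Keeping the version next to the counts is what lets a zero count be read as "scored and clean" rather than "not yet scored".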

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
The first persistence pass exposed quality signal columns through most SQLite and PostgreSQL paths, but two PG guardrails still treated quality-only updates as invisible: compatibility probing and push fingerprints. That could let old PostgreSQL schemas pass readiness checks and could skip mirror updates when only persisted quality evidence changed.

HTTP-backed service calls also need to round-trip the grouped quality_signals JSON back into the scalar session fields used by shared service code, otherwise CLI calls through a daemon lose the new detail evidence while direct reads retain it.

Validation: go test ./internal/postgres -run 'TestSessionPushFingerprint|TestEnsureSchema|TestCheckSchemaCompat|TestPushSingleSession'; go test ./internal/service -run 'TestHTTPBackend_Get_Roundtrip|TestDirectBackend_Get_HealthBreakdownIncludesHeuristics'; go test ./internal/db -run 'TestUpdateSessionSignals'; go test ./internal/db -run 'Test(UpdateSessionSignals|PendingSignalSessions|BackfillSignals|MarkSignalsBackfillDone|GetAnalyticsSignals)'; go test ./internal/service ./internal/server ./internal/sync ./internal/postgres ./cmd/agentsview; npm test -- --run src/lib/stores/analytics.test.ts; npm run check; git diff --check. Full go test ./... still hits the pre-existing internal/db TestReplaceSessionMessages_LargeSession FTS fixture failure: no such table: messages_fts.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Phase 4 needs the Insights tab to lead with inspectable quality facts before any generated recommendation text. This moves the page onto the Phase 3 aggregate quality signal contract, keeping prompt maturity, context health, workflow hygiene, and tool reliability visibly rule-based and backed by concrete counts, trends, and evidence groupings.

Generated insights remain available only as a separate archive section so saved generated text cannot be confused with scored facts or canonical recommendations. The transform layer stays frontend-only and deterministic; it does not add LLM generation.

Validation: npm run check; npm run test -- qualityPatterns.test.ts; npm run build; git diff --check; mocked Playwright smoke checks for populated desktop/mobile Insights states, including mobile no-overflow.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Roborev found that Insights reused Analytics signal fetching without clearing date/hour drill-down state, which could make the dashboard display a hidden subset while the toolbar showed the full range. The Insights path now uses an explicit signal fetch that clears those drill-down filters before querying.

The design review also called out that pattern cards mixed concrete affected-session counts with proxy trend metrics. Trend labels now state whether the sparkline is score pressure, interrupted sessions, or average failure signals, and severity thresholds are named and covered by boundary tests. Generated insight copy now makes the archive-only Phase 4 behavior explicit.

Validation: npm run check; npm run test -- qualityPatterns.test.ts analytics.test.ts; npm run build; git diff --check.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
The Coach comparison showed two persisted Phase 3 signal fields were available to Insights but not displayed as named deterministic facts. Keeping those counts hidden made the Phase 4 dashboard underrepresent the quality-pattern contract and could let regressions silently drop code-context or runaway-loop evidence.

This keeps the hardening in the UI transform layer because the lower stack already stores and exposes the counts; no LLM generation or mocked recommendation behavior is added.

Validation: npm run test -- qualityPatterns.test.ts; npm run check; npm run build; git diff --check.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Phase 5 needs generated recommendations to be explicitly requested, cacheable, and separate from deterministic quality scoring. This adds a fixed-template path that consumes existing aggregate signal and usage data, validates structured JSON output, renders generated recommendation text server-side, and stores provenance/cache metadata alongside legacy insight content.

The generated path intentionally never writes session scores, signal columns, PR comments, or instruction files. The UI exposes the canned recommendation mode as an opt-in template selection while preserving existing daily and agent-analysis generation.

Validation: go test ./internal/insight ./internal/server ./internal/postgres; go test ./internal/db -run 'TestInsights|TestOpenMigratesColumnsWithoutDrop|TestOpenBackfillsLegacyTokenCoverageFlags|TestCopyInsightsFrom'; npm run check; npm test -- --run src/lib/stores/insights.test.ts src/lib/api/client.test.ts. Full go test ./internal/db still fails in TestReplaceSessionMessages_LargeSession with no such table: messages_fts, matching the pre-existing FTS test setup issue observed outside the touched insight paths.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Roborev found two frontend state regressions and several ambiguity points in the new canned insight flow. Cache hits now move existing rows without duplicating ids, quality recommendations keep their visible date range controls, and cache-hit responses carry consistent response-only provenance while preserving the stored fresh row.

The custom focus field is explicitly bounded and surfaced in the fixed prompt, and PG serve-mode comments now make the local-only insight boundary clear so remote readers do not infer partial metadata support.

Validation: go test ./internal/insight ./internal/server ./internal/postgres; go test ./internal/db -run 'TestInsights|TestOpenMigratesColumnsWithoutDrop|TestOpenBackfillsLegacyTokenCoverageFlags|TestCopyInsightsFrom'; npm run check; npm test -- --run src/lib/stores/insights.test.ts src/lib/api/client.test.ts. go test ./... still fails at internal/db TestReplaceSessionMessages_LargeSession with no such table: messages_fts, the same pre-existing FTS setup issue.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Canned recommendation focus is part of the deterministic payload hash and cache key, so accepting raw whitespace while validating only the trimmed text could fragment cache entries and store effectively equivalent requests differently.

Normalize the focus once before validation and downstream payload construction so bounds, provenance, prompts, and cache semantics all use the same value.
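A minimal sketch of the normalize-once idea follows. `NormalizeFocus`, `CacheKey`, and the 200-rune bound are hypothetical; the commit does not state the real limit or function names. The point is only that validation and hashing both see the same trimmed value.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// maxFocusRunes is an assumed bound; the real limit is not given in the PR.
const maxFocusRunes = 200

// NormalizeFocus trims the raw focus once and validates the trimmed value,
// so every later consumer (prompt, provenance, cache key) sees one string.
func NormalizeFocus(raw string) (string, error) {
	focus := strings.TrimSpace(raw)
	if utf8.RuneCountInString(focus) > maxFocusRunes {
		return "", errors.New("focus exceeds maximum length")
	}
	return focus, nil
}

// CacheKey hashes the template name together with the normalized focus.
// The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
func CacheKey(template, focus string) string {
	sum := sha256.Sum256([]byte(template + "\x00" + focus))
	return hex.EncodeToString(sum[:])
}

func main() {
	padded, _ := NormalizeFocus("  flaky tests \n")
	plain, _ := NormalizeFocus("flaky tests")
	// Both requests normalize to the same focus and share one cache entry.
	fmt.Println(CacheKey("context-setup", padded) == CacheKey("context-setup", plain))
}
```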

Validation: go test ./internal/server -run 'TestGenerateCannedInsight_(RejectsOversizedFocus|NormalizesFocusBeforeCaching|SaveCacheAndPreserveSignals)'; go test ./internal/insight ./internal/server ./internal/postgres. go test ./... still fails at internal/db TestReplaceSessionMessages_LargeSession with no such table: messages_fts, matching the existing FTS fixture issue.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Phase 5 is meant to give opt-in LLM recommendations richer agent-workflow context than simple aggregate threshold matching, while still keeping generated prose away from canonical scores and signal rows. The canned payload now includes deterministic AI-Engineering-Coach-shaped inputs for intent mix, spec-driven starts, prompt maturity, and repeated workflow clusters so the fixed templates have the same families of evidence the source Coach explored.

These derived fields are provenance-bearing LLM inputs only. They are filtered through the same request scope, included in the aggregate cache hash, and validated by unit coverage so cache semantics remain tied to exactly what the model saw.

Validation: go test ./internal/insight ./internal/server ./internal/postgres; go test ./internal/db -run 'TestInsights|TestOpenMigratesColumnsWithoutDrop|TestOpenBackfillsLegacyTokenCoverageFlags|TestCopyInsightsFrom'; npm run check. Full go test ./... still fails at internal/db TestReplaceSessionMessages_LargeSession with no such table: messages_fts, matching the existing FTS fixture issue.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Roborev found that the Coach-derived canned insight context could diverge from the rest of the aggregate payload for large ranges and could churn cache keys because workflow cluster IDs depended on map iteration order.

The canned path now pages through the full matching session scope before building Coach inputs, so the LLM sees all sessions covered by the date/project scope instead of only the first ListSessions page. Workflow cluster IDs are derived from stable cluster content so equivalent source data produces the same payload hash and cache key.
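One way to derive cluster IDs from stable content, rather than from map iteration order, is sketched here. `ClusterID`, the `wc-` prefix, and the choice of member prompts as the cluster content are assumptions for illustration; the branch may hash different fields.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// ClusterID derives an identifier from the cluster's members alone.
// Sorting a copy first makes the ID independent of discovery order,
// and copying avoids mutating the caller's slice.
func ClusterID(members []string) string {
	sorted := append([]string(nil), members...)
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, "\x00")))
	return "wc-" + hex.EncodeToString(sum[:6])
}

func main() {
	clusters := map[string][]string{
		"testing": {"run tests", "fix lint"},
		"docs":    {"update readme"},
	}
	// Go randomizes map iteration order, but each ID depends only on the
	// cluster's content, so repeated runs print the same name/ID pairs.
	for name, members := range clusters {
		fmt.Println(name, ClusterID(members))
	}
}
```

Because the ID is a pure function of content, equivalent source data feeds the same bytes into the payload hash and therefore reuses the same cache entry.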

Validation: go test ./internal/insight -run 'TestBuildCannedCoachSummary(UsesCoachInsightFamilies|StableWorkflowClusterIDs)|TestCannedEvidenceRefsIncludesCoachSummary'; go test ./internal/server -run 'TestGenerateCannedInsight_(CoachSummaryUsesAllPages|SaveCacheAndPreserveSignals|NormalizesFocusBeforeCaching)'; go test ./internal/insight ./internal/server ./internal/postgres; go test ./internal/db -run 'TestInsights|TestOpenMigratesColumnsWithoutDrop|TestOpenBackfillsLegacyTokenCoverageFlags|TestCopyInsightsFrom'; npm run check. Full go test ./... still fails at internal/db TestReplaceSessionMessages_LargeSession with no such table: messages_fts, matching the existing FTS fixture issue.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Phase 6 needs durable validation coverage before calibration can be finalized against real archives. The new goldens make score deltas and explanation fields reviewable, and the resync assertion protects persisted quality signals when source files have disappeared.

The Insights e2e coverage locks in the rollout boundary that deterministic recommendation metadata remains visible while read-only mode prevents generation. The rollout document records that final weight calibration waits for Phases 2-5 to settle.

Validation: go test ./internal/signals -count=1; go test -run TestResyncAllPreservesTrashedSessionData ./internal/sync -count=1; npm --prefix frontend run check; npm --prefix frontend run e2e -- insights-quality.spec.ts --project=chromium.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
The Phase 6 validation branch now sits on the restacked deterministic Insights dashboard, where saved generated recommendations live inside the generated archive rather than the earlier sidebar/detail layout. The rollout e2e should validate the current UI contract instead of failing on obsolete selectors.

Read-only coverage now asserts the archived Generate action is disabled in the generated archive section, matching the restacked UI while preserving the LLM-disabled rollout guard.

Validation: go test ./internal/signals -count=1; go test -run TestResyncAllPreservesTrashedSessionData ./internal/sync -count=1; npm --prefix frontend run check; npm --prefix frontend run e2e -- insights-quality.spec.ts --project=chromium.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Roborev found that the Phase 6 e2e scaffolding was not fully proving the intended rollout guards: one mock targeted an obsolete update-check URL, and the read-only assertion could pass before version data applied.

The test now stubs the actual update-check endpoint and waits for the read-only-specific Generate title before asserting the disabled state, so failures map to the behavior Phase 6 is meant to protect.

Validation: npm --prefix frontend run check; npm --prefix frontend run e2e -- insights-quality.spec.ts --project=chromium; go test ./internal/signals -count=1; go test -run TestResyncAllPreservesTrashedSessionData ./internal/sync -count=1.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Generated recommendations and scored quality signals were both missing an explicit way to keep automated review sessions from dominating the aggregate. Make human sessions the default scope, add all and automated-only scopes, and include that scope in generated payload hashes and provenance so cache hits never mix materially different inputs.
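The scope handling described above might look roughly like this. The `Scope` type, its string values, and `PayloadHash` are hypothetical names; only the behaviour (human by default, scope folded into the hash) comes from the commit message.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Scope selects which sessions feed scored and generated insights.
type Scope string

const (
	ScopeHuman     Scope = "human"     // default: automated sessions excluded
	ScopeAll       Scope = "all"       // human and automated sessions
	ScopeAutomated Scope = "automated" // automated sessions only
)

// ParseScope maps a request parameter to a Scope. An empty value falls
// back to the human default; unknown values are rejected.
func ParseScope(raw string) (Scope, error) {
	switch Scope(raw) {
	case "":
		return ScopeHuman, nil
	case ScopeHuman, ScopeAll, ScopeAutomated:
		return Scope(raw), nil
	}
	return "", fmt.Errorf("unknown session scope %q", raw)
}

// PayloadHash includes the scope so two requests that differ only in
// scope can never share a cached generation.
func PayloadHash(scope Scope, project, from, to string) string {
	h := sha256.New()
	for _, part := range []string{string(scope), project, from, to} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // field separator
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	scope, _ := ParseScope("")
	fmt.Println(scope)
	fmt.Println(PayloadHash(ScopeHuman, "agentsview", "2026-05-01", "2026-05-27") !=
		PayloadHash(ScopeAll, "agentsview", "2026-05-01", "2026-05-27"))
}
```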

The insights UI now uses the shared custom option/typeahead control family for project, agent, template, and scope selection; native selects are removed from this surface. Failed generated tasks can also be retried with their original request snapshot.

Validation: npm --prefix frontend run check; npm --prefix frontend run test -- src/lib/stores/insights.test.ts src/lib/stores/analytics.test.ts; npm --prefix frontend run e2e -- insights-quality.spec.ts --project=chromium; go test ./internal/insight ./internal/server; go test ./internal/db -run 'TestSessionFilterExcludeAutomated|TestListSessionsWithIncludeChildrenAndExcludeAutomated|TestGetDailyUsage_ExcludeAutomated|TestGetTopSessionsByCost|TestGetUsageSessionCounts|TestGetAnalyticsSignals|TestAnalytics'. Full go test ./internal/db still fails at TestReplaceSessionMessages_LargeSession with no such table: messages_fts, matching the existing FTS fixture issue.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Roborev found that the new insight scope controls only fully applied on the SQLite path and could drift when older includeAutomated state still drove analytics requests. That made pg serve and legacy UI sync paths report different session populations for the same selection.

The dev proxy Origin rewrite also needed to preserve backend CSRF semantics for cross-origin requests, and generated insight links needed to keep the desktop marker even when it is represented as a bare query key.

Validation: npm --prefix frontend run check; npm --prefix frontend run test -- src/lib/stores/analytics.test.ts src/lib/stores/insights.test.ts; npm --prefix frontend run e2e -- insights-quality.spec.ts --project=chromium; go test ./internal/postgres ./internal/server ./internal/insight; browser layout probe at http://127.0.0.1:5173/insights. Full go test ./... still fails at internal/db TestReplaceSessionMessages_LargeSession with the existing messages_fts fixture issue.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
At the embedded browser width, the insights toolbar switches to column layout. The filter row inherited the desktop flex shorthand, so its 560px flex-basis became height and pushed the content below the fold.

Reset the filter group flex basis in the narrow layout so the controls keep their natural height while still spanning the available width.

Validation: npm --prefix frontend run check; Playwright layout probe against http://127.0.0.1:5173/insights?insight=6 at 896px CSS width confirmed .filter-group computes to 26px high with flex 0 1 auto.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
The insights toolbar project filter was allowed to grow to 340px, which made the medium-width layout feel lopsided and forced related controls onto another row.

Keep the scored-insights project picker compact so project, agent, and session scope read as one filter group at embedded-browser widths.

Validation: npm --prefix frontend run check; Playwright layout probe at 1086px and 896px confirmed the project filter is 220px and the filter group stays one row.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
The session scope typeahead widened when opened because the search input placeholder had a larger intrinsic width than the closed trigger. That made the toolbar jump during normal interaction.

Give the scope control and its typeahead a fixed toolbar width so closed and open states occupy the same space.

Validation: npm --prefix frontend run check; Playwright interaction probe at 1120px confirmed the scope control remains 220px and the input remains 128px before and after opening.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
Context pressure coverage is optional in the deterministic aggregate. When selected sessions have no context-pressure rows, the previous template still steered the model toward pressure and produced a low-value finding about missing telemetry.

Make pressure conditional in the context setup template and add prompt rules that redirect zero-coverage requests toward compactions, mid-task context loss, missing code context, spec-driven starts, and prompt maturity evidence. The template version is bumped so older cached generations are not reused.
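The conditional-pressure idea can be illustrated with Go's `text/template`. The template wording, the `contextInputs` struct, and `RenderPrompt` are invented for this sketch and are not the branch's actual prompt; only the zero-coverage redirect mirrors the commit.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// contextInputs is a hypothetical slice of the aggregate payload.
type contextInputs struct {
	PressureSessions int // sessions with context-pressure rows
	Compactions      int // recorded compaction events
}

// promptTmpl mentions pressure only when coverage exists; otherwise it
// steers the model toward evidence that was actually recorded.
const promptTmpl = `Review context setup for the selected sessions.
{{- if gt .PressureSessions 0}}
Context pressure was recorded in {{.PressureSessions}} sessions; weigh it first.
{{- else}}
No context-pressure rows exist for this scope; do not report missing telemetry.
Lean on compactions ({{.Compactions}}) and other recorded evidence instead.
{{- end}}
`

// RenderPrompt executes the template against the given inputs.
func RenderPrompt(in contextInputs) (string, error) {
	t, err := template.New("context").Parse(promptTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, in); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := RenderPrompt(contextInputs{Compactions: 4})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Any change to template text like this alters what the model sees, which is why bumping the template version to invalidate older cached generations goes with it.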

Validation: go test ./internal/insight; go test ./internal/server -run 'TestGenerateCannedInsight|TestFold|TestCanned'.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
mariusvniekerk and others added 6 commits May 27, 2026 09:35
Roborev found that the new evidence drilldowns could show stale sessions after filter changes and that calibration semantics were too loose for outcome cohorts. It also caught that the narrowed short-start heuristic needed a persisted signal-version boundary so existing archives would not mix old and new counts.

This makes the signal evidence endpoint fail closed on unknown ids, keeps outcome cohorts out of diagnostic calibration, and documents the weak-signal and API contract expectations so future signal additions update SQLite, PostgreSQL, server, frontend, tests, and docs together.
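Failing closed on unknown signal ids usually comes down to an explicit allowlist lookup, sketched below. The signal ids, column names, and `SignalColumn` are hypothetical; the real endpoint and its supported set live in the branch.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnsupportedSignal is returned for any id outside the allowlist.
var ErrUnsupportedSignal = errors.New("unsupported signal")

// signalColumns is an assumed allowlist mapping public signal ids to the
// stored count columns they drill into.
var signalColumns = map[string]string{
	"short_start":     "short_start_count",
	"repeated_prompt": "repeated_prompt_count",
}

// SignalColumn resolves a requested id to a column. Unknown ids produce
// an error instead of an empty or default result, so a typo or a stale
// client cannot silently return sessions for the wrong signal.
func SignalColumn(id string) (string, error) {
	col, ok := signalColumns[id]
	if !ok {
		return "", fmt.Errorf("%w: %q", ErrUnsupportedSignal, id)
	}
	return col, nil
}

func main() {
	for _, id := range []string{"short_start", "made_up_signal"} {
		col, err := SignalColumn(id)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println(id, "->", col)
	}
}
```

An allowlist also gives future signal additions a single place that must be updated alongside the schema, server, frontend, tests, and docs.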

Validation: go test ./internal/db -run 'TestBackfillSignals|TestGetAnalyticsSignals|TestGetAnalyticsSignalSessionsRejectsUnsupportedSignal|TestBuildSignalExamplesUsesObservedOrdinal|TestParseEvidenceTime'; go test ./internal/server -run 'TestAnalyticsSignalSessionsRejectsUnsupportedSignal|TestAnalyticsErrorRedaction|TestAnalyticsEndpoints_DefaultParams'; go test ./internal/postgres; go test ./internal/signals ./internal/sync; npm --prefix frontend run check; npm --prefix frontend run test -- src/lib/components/insights/qualityPatterns.test.ts. Full go test ./internal/db still fails at TestReplaceSessionMessages_LargeSession with the existing messages_fts fixture issue.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
This branch is being prepared as a draft PR for RFC-style review of the Phase 6 validation, calibration, and rollout direction. The screenshots give reviewers concrete UI surfaces to react to without exposing local archive data, and final calibration is still expected to wait for Phases 2-5.

The commit also carries hook-required lint cleanup from the local Go toolchain and stale service assertions for the renamed repeated-prompt penalty, so the branch passes the same validation path as normal code changes.

Validation: generated screenshots from mocked API responses against the Vite dev UI at http://127.0.0.1:5173/insights; go test ./internal/parser ./internal/insight ./internal/server ./internal/signals ./internal/sync ./internal/service; go test ./internal/db -run 'TestAnalyticsFilterTermination|TestWriteSessionBatch|TestGetAnalyticsSignalSessionsRejectsUnsupportedSignal|TestParseEvidenceTime|TestBackfillSignals'; npm --prefix frontend run check.

Generated with Codex
Co-authored-by: Codex <codex@openai.com>
@roborev-ci

roborev-ci Bot commented Jun 11, 2026


roborev: Combined Review (1c68fc1)

Medium confidence: one medium issue needs attention before merge.

Medium

  • internal/server/insights.go:719 — Canned insight generation rebuilds aggregates with only date/project/automated scope and hard-coded UTC, ignoring the active analytics filters used by the visible dashboard such as session agent, machine, termination, min-user-message, one-shot inclusion, and browser timezone. This can generate and cache recommendations for a broader or different cohort than the user is viewing.

    Fix: Add a separate session-filter payload to generation requests, send the current analytics filter params, apply them to AnalyticsFilter, UsageFilter, and SessionFilter, and include them in provenance/cache keys.


Panel: ci_default_security | Synthesis: codex, 7s | Members: codex_default (codex/default, done, 17m30s), codex_security (codex/security, done, 4m51s) | Total: 22m28s

@cpcloud

cpcloud commented Jun 12, 2026

Contributor

I think my only feedback right now is that I would probably like to see all the recommendations first and then be presented with the detailed data.
