fix(skills): stop dashboard-intent leaking away from databricks-aibi-dashboards #144
pkosiec wants to merge 2 commits
simonfaltum left a comment
Review note:
One small non-experimental metadata nit: skills/databricks-app-design/agents/openai.yaml updates default_prompt, but short_description still says "Design analytics/BI/AI data app UX, bound to AppKit". Since openai.yaml is hand-preserved by the generator, this will not get refreshed from the new SKILL.md frontmatter. I would update that short description to say custom-code Databricks App/AppKit data screens too, so the Codex marketplace metadata matches the routing change.
The core routing changes and generated manifest validation look good.
…eference
Follow-up to PR #144 review:
- databricks-app-design/agents/openai.yaml: refresh short_description to the custom-code Databricks App (AppKit/React) framing so it matches the updated default_prompt. The generator preserves hand-authored openai.yaml, so the tagline does not refresh from SKILL.md frontmatter on its own.
- databricks-aibi-dashboards: add a "when a custom app fits better" callout and a Related Skills entry pointing to databricks-apps for genuine custom-app needs (write-back, bespoke UI, in-app Genie/chat). Points at the databricks-apps parent (not the databricks-app-design subskill) so the scaffold/data-access gate isn't skipped. The "Ask Genie button on this dashboard" case stays in aibi.
No frontmatter description or manifest.json changes.
Co-authored-by: Isaac
Thanks for the review, Simon! Follow-up commit (66f2837) addresses both notes:
No frontmatter
Running evals before merging the PR 👍
App Evals run on this PR: failure investigation
The eval run triggered for this PR (
Run summary
Per-app scores vs last green PR-preset baseline (
| App | Baseline | This run | Delta |
|---|---|---|---|
| cb_brickhouse_simple | 1.00 | n/a (eval timed out) | — |
| cb_genie_chat_advanced | 1.00 | 1.00 | 0 |
| cb_pixels_simple | 1.00 | 1.00 | 0 |
| city_performance_app | 1.00 | 0.125 | −0.875 |
| devhub_saas_tracker | 1.00 | 1.00 | 0 |
| genie_taxi_chat | 1.00 | 1.00 | 0 |
| parts_catalog_app | 0.55 | 0.125 | −0.425 |
| property_search_app | 1.00 | 0.125 | −0.875 |
| serving_chat | 1.00 | 1.00 | 0 |
| taxi_zones_map | 1.00 | 0.125 | −0.875 |
The 4 regressed apps are exactly the databricks_v2 promptset entries (SQL-Warehouse-backed apps). All 5 non-Warehouse apps held at 1.00.
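The deltas and the regressed/held split in the table can be recomputed with a few lines of Python. This is only a sketch: the scores are copied from the table above, while the names `SCORES`, `delta`, `regressed_apps`, and `held_apps` are hypothetical helpers for illustration and not part of the eval pipeline.

```python
from typing import Optional

# (baseline, this run) per app, copied from the table above.
# None marks the run that timed out and therefore has no score.
SCORES: dict[str, tuple[float, Optional[float]]] = {
    "cb_brickhouse_simple": (1.00, None),
    "cb_genie_chat_advanced": (1.00, 1.00),
    "cb_pixels_simple": (1.00, 1.00),
    "city_performance_app": (1.00, 0.125),
    "devhub_saas_tracker": (1.00, 1.00),
    "genie_taxi_chat": (1.00, 1.00),
    "parts_catalog_app": (0.55, 0.125),
    "property_search_app": (1.00, 0.125),
    "serving_chat": (1.00, 1.00),
    "taxi_zones_map": (1.00, 0.125),
}


def delta(baseline: float, run: Optional[float]) -> Optional[float]:
    """Score change versus baseline, or None when the run produced no score."""
    return None if run is None else round(run - baseline, 3)


def regressed_apps(scores: dict[str, tuple[float, Optional[float]]]) -> list[str]:
    """Apps that produced a score lower than their baseline."""
    return sorted(app for app, (b, r) in scores.items() if r is not None and r < b)


def held_apps(scores: dict[str, tuple[float, Optional[float]]]) -> list[str]:
    """Apps that produced a score equal to their baseline."""
    return sorted(app for app, (b, r) in scores.items() if r is not None and r == b)
```

Calling `regressed_apps(SCORES)` returns the four warehouse-backed apps, and `held_apps(SCORES)` returns the five apps that stayed at 1.00; the timed-out app appears in neither list.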
Root cause
All 4 regressed apps' eval.json shows the same failure: npm_install failed, cascading every downstream step (build / unit / smoke / typecheck / apps_validate) to "npm install failed". The actual error in the iteration logs:
```
> appkit generate-types
[appkit:type-generator:query-registry]
DESCRIBE rejected for city_booking_trends:
Response from server (Forbidden)
{"error_code":"PERMISSION_DENIED",
 "message":"You do not have permission to use the SQL Warehouse.",
 "details":[{"resource_type":"warehouse",
 "resource_name":"75d3c8bdec7d1569",
 "description":"user is not authorized to use this warehouse"}]}
Error: Type generation failed: 5 queries could not be described.
npm error code 1 (postinstall: npm run typegen)
```
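For triage across many iteration logs, the JSON body of that error can be parsed to pull out the blocked warehouse ID. The sketch below uses only the Python standard library; the payload is copied from the log above, and `denied_warehouses` is a hypothetical helper name, not something the eval tooling is known to provide.

```python
import json

# JSON body copied from the iteration log above.
PAYLOAD = """{"error_code":"PERMISSION_DENIED",
 "message":"You do not have permission to use the SQL Warehouse.",
 "details":[{"resource_type":"warehouse",
 "resource_name":"75d3c8bdec7d1569",
 "description":"user is not authorized to use this warehouse"}]}"""


def denied_warehouses(payload: str) -> list[str]:
    """Return warehouse IDs named in a PERMISSION_DENIED error body."""
    body = json.loads(payload)
    if body.get("error_code") != "PERMISSION_DENIED":
        return []
    return [
        detail["resource_name"]
        for detail in body.get("details", [])
        if detail.get("resource_type") == "warehouse"
    ]
```

For the payload above, `denied_warehouses(PAYLOAD)` yields the single ID `75d3c8bdec7d1569`.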
appkit generate-types runs as the postinstall step of npm install and hits the SQL warehouse 75d3c8bdec7d1569. The eval cluster's identity is not authorized for that warehouse, so typegen errors → postinstall errors → install is marked failed → every downstream eval step short-circuits.
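The cascade described above can be modelled in a few lines. This is a simplified sketch and not the eval harness itself: the step names come from the failure report, while `run_eval`, the `runners` mapping, and the assumption that downstream steps are otherwise independent of each other are illustrative.

```python
from typing import Callable

# Step names taken from the failure report; npm_install gates all the others.
DOWNSTREAM_STEPS = ["build", "unit", "smoke", "typecheck", "apps_validate"]


def run_eval(runners: dict[str, Callable[[], bool]]) -> dict[str, str]:
    """Run npm_install first; if it fails, mark every downstream step as blocked."""
    results: dict[str, str] = {}
    install_ok = runners["npm_install"]()
    results["npm_install"] = "passed" if install_ok else "failed"
    for step in DOWNSTREAM_STEPS:
        if not install_ok:
            # Short-circuit: the step never runs, it only records the cause.
            results[step] = "npm install failed"
        else:
            results[step] = "passed" if runners[step]() else "failed"
    return results


def always(value: bool) -> Callable[[], bool]:
    """Hypothetical stand-in for a real step runner."""
    return lambda: value


# A postinstall typegen failure makes npm_install fail, which blocks everything else.
blocked = run_eval({"npm_install": always(False), **{s: always(True) for s in DOWNSTREAM_STEPS}})
```

In `blocked`, every downstream step carries the reason "npm install failed" even though each step runner would have passed, which matches the pattern seen in the four regressed eval.json files.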
The two other failures (eval timeout on cb_brickhouse_simple, edit_app error_max_turns on drop_unrequested_feature/parts_catalog_app) hit the same wall — the agent burned 31 turns retrying typegen against the inaccessible warehouse.
The pipeline's own trajectory-analysis artifact (trajectory_analysis.md on the MLflow run) corroborates this from the generation side: 5 of 9 trajectories burned 10–60 steps each tripping over warehouse 75d3c8bdec7d1569.
Verdict on this PR
Not blamed. The failures cascade from a postinstall infra step the agent doesn't control — the agent never reaches the part of the workflow this PR (fix(skills): stop dashboard-intent leaking away from databricks-aibi-dashboards) changes. The 5 apps that don't depend on the broken warehouse all scored 1.00, same as the baseline.
Next steps to unblock
- Re-grant USAGE on warehouse 75d3c8bdec7d1569 to the eval cluster's identity on e2-dogfood (or whichever SP runs npm install on the eval cluster). The baseline run succeeded yesterday on the same workspace, so this is a recent permission change.
- Once unblocked, re-trigger this PR's eval run; only then will it actually exercise the skill changes.
- Longer-term: stop hardcoding 75d3c8bdec7d1569 as the default warehouse in databricks apps init (separate AppKit/template issue surfaced by the trajectory analysis as the dominant generation-time friction across databricks_v2 apps).
…dashboards
A routing A/B (176 headless Claude Code sessions, realistic bundle of databricks-aibi-dashboards + databricks-app-design + databricks-apps) found plain "create a dashboard" prompts routed to aibi only 14/18 (78%). Root cause: databricks-apps lists "dashboards" as a trigger and hard-routes any data-displaying app to databricks-app-design → AppKit, so aibi was never considered.
- databricks-app-design: lead the description with custom-code Databricks Apps (AppKit/React) screens instead of "dashboards, KPI pages…"; add an explicit rule that a plain "create a dashboard" = managed AI/BI (Lakeview) → databricks-aibi-dashboards. Mirror the rule in "When to use / when NOT".
- databricks-apps: drop "dashboards" from the trigger list; add the same deference rule to the description and an "Is this even a Databricks App?" paragraph in the body.
- databricks-app-design Codex openai.yaml: reword default_prompt so it no longer leads with dashboards.
- Regenerate manifest.json.
Validated empirically: dashboard-intent → 18/18 aibi in the bundle, ambiguous prompts → aibi or a clarifying question, app-design keeps 6/6 of its legit app-UX traffic (zero over-correction).
Co-authored-by: Isaac
Problem
A routing A/B test (176 headless Claude Code sessions, 6 install conditions, 11 prompts × 3 trials) measured which skill is invoked when databricks-aibi-dashboards, databricks-app-design (#125), and databricks-apps are all installed.
The fear that databricks-app-design "steals" dashboard requests is mostly unfounded head-to-head: its "NOT for Lakeview" disclaimer holds, and explicit "AI/BI"/"Lakeview" phrasing routed correctly 100% of the time. The real leak is databricks-apps: its description lists "dashboards" as a trigger and its body hard-routes any data-displaying app to databricks-app-design → AppKit, never mentioning databricks-aibi-dashboards. So "sales dashboard with a region filter" went apps → app-design → AppKit in 3/3 trials, and aibi was never even in the consideration set (0 mentions across all funnel transcripts). Ambiguous prompts hit that funnel 8/9 times.
Fix
- databricks-app-design: lead the description with "custom-code Databricks Apps (AppKit/React) screens" instead of bare "dashboards, KPI pages…", and add an explicit rule: a plain "create a dashboard" means a managed AI/BI (Lakeview) dashboard → databricks-aibi-dashboards. Mirrored in the "When to use / when NOT" section.
- databricks-apps: drop "dashboards" from the trigger list; add the same deference rule to the description and an "Is this even a Databricks App?" paragraph at the top of the body.
- Reword the databricks-app-design/agents/openai.yaml default_prompt so it no longer leads with dashboards (same leak on the OpenAI side).
- Regenerate manifest.json.
Result (validated empirically, not just proposed)
With both edits: dashboard-intent routing → 18/18 (100%) aibi in the bundle, ambiguous prompts now go to aibi or a clarifying question, and databricks-app-design keeps 6/6 of its legitimate app-UX traffic, with zero over-correction.
This pull request and its description were written by Isaac.
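The routing figures reported for this PR (14/18 before, 18/18 after) are simple hit rates over per-trial outcomes. A minimal sketch of that tally follows, assuming each trial is recorded as the name of the skill that was invoked; `routing_rate` is a hypothetical helper, and the destination of the four pre-fix misses is illustrative because the source gives only the count.

```python
from collections import Counter

TARGET = "databricks-aibi-dashboards"


def routing_rate(routed: list[str], target: str = TARGET) -> tuple[int, int, int]:
    """Return (hits, trials, percent rounded to a whole number) for one skill."""
    hits = Counter(routed)[target]
    return hits, len(routed), round(100 * hits / len(routed))


# Before the fix: 14 of 18 dashboard-intent trials reached aibi.
# Which skill took the other 4 is an assumption made for this example.
before = [TARGET] * 14 + ["databricks-apps"] * 4
# After the fix: all 18 trials reached aibi.
after = [TARGET] * 18
```

With these inputs, `routing_rate(before)` gives 14 hits out of 18 (78%) and `routing_rate(after)` gives 18 out of 18 (100%), matching the numbers quoted above.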