fix(skills): stop dashboard-intent leaking away from databricks-aibi-dashboards by pkosiec · Pull Request #144 · databricks/databricks-agent-skills

pkosiec · 2026-06-12T13:36:41Z

Problem

A routing A/B test (176 headless Claude Code sessions, 6 install conditions, 11 prompts × 3 trials) measured which skill is invoked when databricks-aibi-dashboards, databricks-app-design (#125), and databricks-apps are all installed.

Install condition	dashboard-intent → aibi
aibi alone (baseline)	12/12
aibi + app-design	17/18
realistic bundle (aibi + app-design + databricks-apps)	14/18 (78%) 🚨

The fear that databricks-app-design "steals" dashboard requests is mostly unfounded head-to-head — its "NOT for Lakeview" disclaimer holds, and explicit "AI/BI"/"Lakeview" phrasing routed correctly 100% of the time. The real leak is databricks-apps: its description lists "dashboards" as a trigger and its body hard-routes any data-displaying app to databricks-app-design → AppKit, never mentioning databricks-aibi-dashboards. So "sales dashboard with a region filter" went apps → app-design → AppKit in 3/3 trials, and aibi was never even in the consideration set (0 mentions across all funnel transcripts). Ambiguous prompts hit that funnel 8/9 times.
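The rates in the table above can be recomputed from the raw counts. This is a minimal sketch: the counts are copied from the table, while the names `ROUTING_COUNTS`, `routing_rate`, and `leak_vs_baseline` are hypothetical and not part of the A/B harness.

```python
# Counts copied from the install-condition table:
# (sessions routed to databricks-aibi-dashboards, dashboard-intent sessions).
ROUTING_COUNTS = {
    "aibi alone (baseline)": (12, 12),
    "aibi + app-design": (17, 18),
    "realistic bundle": (14, 18),
}


def routing_rate(hits: int, total: int) -> float:
    """Fraction of dashboard-intent sessions that reached the aibi skill."""
    if total <= 0:
        raise ValueError("total must be positive")
    return hits / total


def leak_vs_baseline(counts: dict, baseline: str) -> dict:
    """Drop in routing rate relative to the baseline condition, per condition."""
    base = routing_rate(*counts[baseline])
    return {name: base - routing_rate(*pair) for name, pair in counts.items()}


if __name__ == "__main__":
    for name, (hits, total) in ROUTING_COUNTS.items():
        print(f"{name}: {hits}/{total} = {routing_rate(hits, total):.0%}")
```

Comparing each condition against the aibi-alone baseline makes it visible that most of the drop appears only once databricks-apps joins the bundle.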

Fix

databricks-app-design — lead the description with "custom-code Databricks Apps (AppKit/React) screens" instead of bare "dashboards, KPI pages…", and add an explicit rule: a plain "create a dashboard" means a managed AI/BI (Lakeview) dashboard → databricks-aibi-dashboards. Mirrored in the "When to use / when NOT" section.
databricks-apps — drop "dashboards" from the trigger list; add the same deference rule to the description and an "Is this even a Databricks App?" paragraph at the top of the body.
Codex metadata — reword databricks-app-design/agents/openai.yaml default_prompt so it no longer leads with dashboards (same leak on the OpenAI side).
Regenerate manifest.json.
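The deference rule described above could, in principle, be guarded by a small lint over skill descriptions. The sketch below is purely illustrative: `leaks_dashboard_intent` is a hypothetical helper, the sample descriptions are invented for the example, and nothing like this check is stated to exist in the repository.

```python
import re

AIBI_SKILL = "databricks-aibi-dashboards"


def leaks_dashboard_intent(skill_name: str, description: str) -> bool:
    """Return True if a non-aibi skill mentions dashboards without deferring to aibi.

    Assumption: "deferring" is approximated by the description naming the
    aibi skill explicitly; a real check might need something stricter.
    """
    if skill_name == AIBI_SKILL:
        return False
    mentions_dashboard = re.search(r"\bdashboards?\b", description, re.IGNORECASE) is not None
    defers_to_aibi = AIBI_SKILL in description
    return mentions_dashboard and not defers_to_aibi


if __name__ == "__main__":
    # Invented descriptions, not the real SKILL.md frontmatter.
    before = "Build Databricks Apps: dashboards, internal tools, data apps."
    after = (
        "Build custom-code Databricks Apps. A plain 'create a dashboard' "
        "request belongs to databricks-aibi-dashboards."
    )
    print(leaks_dashboard_intent("databricks-apps", before))
    print(leaks_dashboard_intent("databricks-apps", after))
```

A keyword check like this cannot replace the routing A/B test; it would only catch the specific regression of a bare "dashboards" trigger being reintroduced.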

Result (validated empirically, not just proposed)

With both edits: dashboard-intent routing → 18/18 (100%) aibi in the bundle, ambiguous prompts now go to aibi or a clarifying question, and databricks-app-design keeps 6/6 of its legitimate app-UX traffic — zero over-correction.

This pull request and its description were written by Isaac.

simonfaltum

Review note:

One small non-experimental metadata nit: skills/databricks-app-design/agents/openai.yaml updates default_prompt, but short_description still says "Design analytics/BI/AI data app UX, bound to AppKit". Since openai.yaml is hand-preserved by the generator, this will not get refreshed from the new SKILL.md frontmatter. I would update that short description to say custom-code Databricks App/AppKit data screens too, so the Codex marketplace metadata matches the routing change.

The core routing changes and generated manifest validation look good.

…eference Follow-up to PR #144 review: - databricks-app-design/agents/openai.yaml: refresh short_description to the custom-code Databricks App (AppKit/React) framing so it matches the updated default_prompt. The generator preserves hand-authored openai.yaml, so the tagline does not refresh from SKILL.md frontmatter on its own. - databricks-aibi-dashboards: add a "when a custom app fits better" callout and a Related Skills entry pointing to databricks-apps for genuine custom-app needs (write-back, bespoke UI, in-app Genie/chat). Points at the databricks-apps parent (not the databricks-app-design subskill) so the scaffold/data-access gate isn't skipped. The "Ask Genie button on this dashboard" case stays in aibi. No frontmatter description or manifest.json changes. Co-authored-by: Isaac

pkosiec · 2026-06-15T10:57:30Z

Thanks for the review, Simon! Follow-up commit (66f2837) addresses both notes:

Refreshed databricks-app-design's Codex short_description to the custom-code Databricks App (AppKit/React) framing — matches the updated default_prompt. (The generator preserves hand-authored openai.yaml, so it needed the manual nudge.)
Added a databricks-aibi-dashboards → databricks-apps pointer for genuine custom-app cases (write-back, bespoke UI, in-app Genie/chat), targeting the parent entry skill rather than the databricks-app-design subskill so the scaffold/data-access gate isn't skipped. The "Ask Genie button on this dashboard" case stays in aibi.

No frontmatter description or manifest.json changes.

pkosiec · 2026-06-15T12:56:02Z

Running evals before merging the PR 👍

pkosiec · 2026-06-16T13:12:16Z

App Evals run on this PR — failure investigation

The eval run triggered for this PR (pr_skills144_20260615_120505) hit SUCCESS_WITH_FAILURES and tripped the 0.85 gate (avg_appeval 0.61 < 0.85). The failures are not caused by this PR — they cascade from an eval-environment infra issue the agent never gets past.

Run summary

Field	Value
Job run	`814658487513252`
MLflow run	`c9d18c20412747e1940b162d5b3f4ad2`
Workspace	e2-dogfood
Prompt preset	`pr` (10 prompts)
Tags	`skills_pr:144`, `trigger:one_time`
Skills ref	`66f283719a717ec5eaf89e470e8bca71c8b74bbb`
CLI	`v1.3.0`
AppKit	`0.38.1`

Per-app scores vs last green PR-preset baseline (`916364192518713` / `appkit_pr:346`)

App	Baseline	This run	Delta
cb_brickhouse_simple	1.00	n/a (eval timed out)	—
cb_genie_chat_advanced	1.00	1.00	0
cb_pixels_simple	1.00	1.00	0
city_performance_app	1.00	0.125	−0.875
devhub_saas_tracker	1.00	1.00	0
genie_taxi_chat	1.00	1.00	0
parts_catalog_app	0.55	0.125	−0.425
property_search_app	1.00	0.125	−0.875
serving_chat	1.00	1.00	0
taxi_zones_map	1.00	0.125	−0.875

The 4 regressed apps are exactly the databricks_v2 promptset entries (SQL-Warehouse-backed apps). All 5 non-Warehouse apps held at 1.00.
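The gate arithmetic can be checked against the table. This is a sketch under one assumption: that the timed-out app is excluded from the average rather than counted as zero. That choice reproduces the reported 0.61, but the pipeline's actual aggregation is not shown in this thread. The names `SCORES`, `average_score`, and `regressions` are hypothetical.

```python
from typing import Dict, Optional, Tuple

GATE = 0.85  # threshold quoted in the failure investigation

# (baseline, this run); None marks the eval that timed out.
SCORES: Dict[str, Tuple[float, Optional[float]]] = {
    "cb_brickhouse_simple": (1.00, None),
    "cb_genie_chat_advanced": (1.00, 1.00),
    "cb_pixels_simple": (1.00, 1.00),
    "city_performance_app": (1.00, 0.125),
    "devhub_saas_tracker": (1.00, 1.00),
    "genie_taxi_chat": (1.00, 1.00),
    "parts_catalog_app": (0.55, 0.125),
    "property_search_app": (1.00, 0.125),
    "serving_chat": (1.00, 1.00),
    "taxi_zones_map": (1.00, 0.125),
}


def average_score(scores: Dict[str, Tuple[float, Optional[float]]]) -> float:
    """Mean of this run's scores, skipping apps with no score (assumption)."""
    completed = [run for _, run in scores.values() if run is not None]
    return sum(completed) / len(completed)


def regressions(scores: Dict[str, Tuple[float, Optional[float]]]) -> Dict[str, float]:
    """Apps whose score dropped below baseline, mapped to the delta."""
    return {
        app: run - base
        for app, (base, run) in scores.items()
        if run is not None and run < base
    }


if __name__ == "__main__":
    avg = average_score(SCORES)
    print(f"avg_appeval = {avg:.2f}, gate passed: {avg >= GATE}")
    for app, delta in sorted(regressions(SCORES).items()):
        print(f"{app}: {delta:+.3f}")
```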

Root cause

All 4 regressed apps' eval.json shows the same failure: npm_install failed, cascading every downstream step (build / unit / smoke / typecheck / apps_validate) to "npm install failed". The actual error in the iteration logs:

> appkit generate-types
[appkit:type-generator:query-registry]
  DESCRIBE rejected for city_booking_trends:
  Response from server (Forbidden)
  {"error_code":"PERMISSION_DENIED",
   "message":"You do not have permission to use the SQL Warehouse.",
   "details":[{"resource_type":"warehouse",
               "resource_name":"75d3c8bdec7d1569",
               "description":"user is not authorized to use this warehouse"}]}
Error: Type generation failed: 5 queries could not be described.
npm error code 1 (postinstall: npm run typegen)

appkit generate-types runs as the postinstall step of npm install and hits the SQL warehouse 75d3c8bdec7d1569. The eval cluster's identity is not authorized for that warehouse, so typegen errors → postinstall errors → install is marked failed → every downstream eval step short-circuits.
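The cascade described above can be modelled as a short-circuit over the eval steps. This is only an illustration of the failure shape: the step names come from the eval.json description in this comment, while the `evaluate` function and its report format are hypothetical and not the eval harness's real code.

```python
from typing import Dict

# Downstream steps named in the root-cause description.
DOWNSTREAM_STEPS = ["build", "unit", "smoke", "typecheck", "apps_validate"]


def evaluate(install_ok: bool, step_results: Dict[str, bool]) -> Dict[str, str]:
    """Build a per-step report; a failed install short-circuits every later step."""
    report = {"npm_install": "passed" if install_ok else "failed"}
    for step in DOWNSTREAM_STEPS:
        if not install_ok:
            # The step never runs, so it inherits the install failure.
            report[step] = "npm install failed"
        else:
            report[step] = "passed" if step_results.get(step, False) else "failed"
    return report


if __name__ == "__main__":
    # Typegen fails in postinstall, so install fails and nothing else is exercised.
    would_have_passed = {step: True for step in DOWNSTREAM_STEPS}
    for step, status in evaluate(False, would_have_passed).items():
        print(f"{step}: {status}")
```

The point of the sketch is that the downstream results carry no information about the app itself once the install step fails, which is why the low scores say nothing about the skill changes in this PR.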

The two other failures (eval timeout on cb_brickhouse_simple, edit_app error_max_turns on drop_unrequested_feature/parts_catalog_app) hit the same wall — the agent burned 31 turns retrying typegen against the inaccessible warehouse.

The pipeline's own trajectory-analysis artifact (trajectory_analysis.md on the MLflow run) corroborates this from the generation side: 5 of 9 trajectories burned 10–60 steps each tripping over warehouse 75d3c8bdec7d1569.

Verdict on this PR

Not blamed. The failures cascade from a postinstall infra step the agent doesn't control — the agent never reaches the part of the workflow this PR (fix(skills): stop dashboard-intent leaking away from databricks-aibi-dashboards) changes. The 5 apps that don't depend on the broken warehouse all scored 1.00, same as the baseline.

Next steps to unblock

Re-grant USAGE on warehouse 75d3c8bdec7d1569 to the eval cluster's identity on e2-dogfood (or whichever SP runs npm install on the eval cluster). The baseline run succeeded yesterday on the same workspace, so this is a recent permission change.
Once unblocked, re-trigger this PR's eval run — only then will it actually exercise the skill changes.
Longer-term: stop hardcoding 75d3c8bdec7d1569 as the default warehouse in databricks apps init (separate AppKit/template issue surfaced by the trajectory analysis as the dominant generation-time friction across databricks_v2 apps).

…dashboards A routing A/B (176 headless Claude Code sessions, realistic bundle of databricks-aibi-dashboards + databricks-app-design + databricks-apps) found plain "create a dashboard" prompts routed to aibi only 14/18 (78%). Root cause: databricks-apps lists "dashboards" as a trigger and hard-routes any data-displaying app to databricks-app-design → AppKit, so aibi was never considered. - databricks-app-design: lead the description with custom-code Databricks Apps (AppKit/React) screens instead of "dashboards, KPI pages…"; add an explicit rule that a plain "create a dashboard" = managed AI/BI (Lakeview) → databricks-aibi-dashboards. Mirror the rule in "When to use / when NOT". - databricks-apps: drop "dashboards" from the trigger list; add the same deference rule to the description and an "Is this even a Databricks App?" paragraph in the body. - databricks-app-design Codex openai.yaml: reword default_prompt so it no longer leads with dashboards. - Regenerate manifest.json. Validated empirically: dashboard-intent → 18/18 aibi in the bundle, ambiguous prompts → aibi or a clarifying question, app-design keeps 6/6 of its legit app-UX traffic (zero over-correction). Co-authored-by: Isaac


pkosiec force-pushed the pkosiec/improve-data-dashboards-skill branch 2 times, most recently from 61468b2 to 260c92d, on June 12, 2026 13:43

simonfaltum reviewed Jun 15, 2026


pkosiec marked this pull request as ready for review June 15, 2026 10:57

pkosiec requested review from a team, dustinvannoy-db and lennartkats-db as code owners June 15, 2026 10:57

simonfaltum approved these changes Jun 15, 2026


pkosiec added 2 commits June 18, 2026 12:24

pkosiec force-pushed the pkosiec/improve-data-dashboards-skill branch from 66f2837 to b2af9a7 on June 18, 2026 10:24

pkosiec wants to merge 2 commits into main from pkosiec/improve-data-dashboards-skill
