
fix(skills): stop dashboard-intent leaking away from databricks-aibi-dashboards #144

Open: pkosiec wants to merge 2 commits into main from pkosiec/improve-data-dashboards-skill

pkosiec (Member) commented Jun 12, 2026

Problem

A routing A/B test (176 headless Claude Code sessions, 6 install conditions, 11 prompts × 3 trials) measured which skill is invoked when databricks-aibi-dashboards, databricks-app-design (#125), and databricks-apps are all installed.

| Install condition | dashboard-intent → aibi |
| --- | --- |
| aibi alone (baseline) | 12/12 |
| aibi + app-design | 17/18 |
| realistic bundle (aibi + app-design + databricks-apps) | 14/18 (78%) 🚨 |

The fear that databricks-app-design "steals" dashboard requests is mostly unfounded head-to-head — its "NOT for Lakeview" disclaimer holds, and explicit "AI/BI"/"Lakeview" phrasing routed correctly 100% of the time. The real leak is databricks-apps: its description lists "dashboards" as a trigger and its body hard-routes any data-displaying app to databricks-app-design → AppKit, never mentioning databricks-aibi-dashboards. So "sales dashboard with a region filter" went apps → app-design → AppKit in 3/3 trials, and aibi was never even in the consideration set (0 mentions across all funnel transcripts). Ambiguous prompts hit that funnel 8/9 times.
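
For readers who want the table as rates, here is a minimal sketch that recomputes them. The counts are copied from the table above; the dictionary keys and the helper name `routing_rate` are illustrative only and are not part of the A/B harness.

```python
# Counts copied from the table above: (sessions routed to aibi, dashboard-intent sessions).
ROUTED = {
    "aibi alone (baseline)": (12, 12),
    "aibi + app-design": (17, 18),
    "realistic bundle": (14, 18),
}


def routing_rate(hits: int, total: int) -> float:
    """Fraction of dashboard-intent sessions that reached databricks-aibi-dashboards."""
    return hits / total


for name, (hits, total) in ROUTED.items():
    leaked = total - hits  # sessions that went to some other skill
    print(f"{name}: {hits}/{total} = {routing_rate(hits, total):.0%} (leaked: {leaked})")
```

The realistic bundle is the only condition that loses more than one session, which is why the fix targets databricks-apps rather than databricks-app-design alone.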

Fix

  • databricks-app-design — lead the description with "custom-code Databricks Apps (AppKit/React) screens" instead of bare "dashboards, KPI pages…", and add an explicit rule: a plain "create a dashboard" means a managed AI/BI (Lakeview) dashboard → databricks-aibi-dashboards. Mirrored in the "When to use / when NOT" section.
  • databricks-apps — drop "dashboards" from the trigger list; add the same deference rule to the description and an "Is this even a Databricks App?" paragraph at the top of the body.
  • Codex metadata — reword databricks-app-design/agents/openai.yaml default_prompt so it no longer leads with dashboards (same leak on the OpenAI side).
  • Regenerate manifest.json.
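
The deference rule in the list above could be guarded by a simple check like the sketch below. This is not part of the PR: the function name and both description strings are hypothetical and paraphrased for illustration, not the actual SKILL.md frontmatter.

```python
AIBI_SKILL = "databricks-aibi-dashboards"


def leaks_dashboard_intent(description: str) -> bool:
    """Return True if a skill description mentions dashboards without deferring to aibi."""
    text = description.lower()
    return "dashboard" in text and AIBI_SKILL not in text


# Hypothetical descriptions, written only to illustrate the before/after shape.
before = "Build Databricks Apps: dashboards, data apps, internal tools."
after = (
    "Build custom-code Databricks Apps (AppKit/React). A plain 'create a dashboard' "
    "request means a managed AI/BI (Lakeview) dashboard: use databricks-aibi-dashboards."
)

print(leaks_dashboard_intent(before))  # description lists dashboards with no deference
print(leaks_dashboard_intent(after))   # description defers to the aibi skill
```

A substring check like this only catches the missing pointer; whether the model actually routes correctly still has to be measured, as in the A/B test.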

Result (validated empirically, not just proposed)

With both edits: dashboard-intent routing → 18/18 (100%) aibi in the bundle, ambiguous prompts now go to aibi or a clarifying question, and databricks-app-design keeps 6/6 of its legitimate app-UX traffic — zero over-correction.

This pull request and its description were written by Isaac.

pkosiec force-pushed the pkosiec/improve-data-dashboards-skill branch 2 times, most recently from 61468b2 to 260c92d, June 12, 2026 13:43

simonfaltum (Member) left a comment

Review note:

One small non-experimental metadata nit: skills/databricks-app-design/agents/openai.yaml updates default_prompt, but short_description still says "Design analytics/BI/AI data app UX, bound to AppKit". Since openai.yaml is hand-preserved by the generator, this will not get refreshed from the new SKILL.md frontmatter. I would update that short description to say custom-code Databricks App/AppKit data screens too, so the Codex marketplace metadata matches the routing change.

The core routing changes and generated manifest validation look good.

pkosiec added a commit that referenced this pull request Jun 15, 2026
…eference

Follow-up to PR #144 review:

- databricks-app-design/agents/openai.yaml: refresh short_description to the
  custom-code Databricks App (AppKit/React) framing so it matches the updated
  default_prompt. The generator preserves hand-authored openai.yaml, so the
  tagline does not refresh from SKILL.md frontmatter on its own.
- databricks-aibi-dashboards: add a "when a custom app fits better" callout and
  a Related Skills entry pointing to databricks-apps for genuine custom-app
  needs (write-back, bespoke UI, in-app Genie/chat). Points at the databricks-apps
  parent (not the databricks-app-design subskill) so the scaffold/data-access
  gate isn't skipped. The "Ask Genie button on this dashboard" case stays in aibi.

No frontmatter description or manifest.json changes.

Co-authored-by: Isaac

pkosiec (Member, Author) commented Jun 15, 2026

Thanks for the review, Simon! Follow-up commit (66f2837) addresses both notes:

  • Refreshed databricks-app-design's Codex short_description to the custom-code Databricks App (AppKit/React) framing — matches the updated default_prompt. (The generator preserves hand-authored openai.yaml, so it needed the manual nudge.)
  • Added a databricks-aibi-dashboards → databricks-apps pointer for genuine custom-app cases (write-back, bespoke UI, in-app Genie/chat), targeting the parent entry skill rather than the databricks-app-design subskill so the scaffold/data-access gate isn't skipped. The "Ask Genie button on this dashboard" case stays in aibi.

No frontmatter description or manifest.json changes.

pkosiec marked this pull request as ready for review June 15, 2026 10:57

pkosiec (Member, Author) commented Jun 15, 2026

Running evals before merging the PR 👍

pkosiec (Member, Author) commented Jun 16, 2026

App Evals run on this PR — failure investigation

The eval run triggered for this PR (pr_skills144_20260615_120505) hit SUCCESS_WITH_FAILURES and tripped the 0.85 gate (avg_appeval 0.61 < 0.85). The failures are not caused by this PR — they cascade from an eval-environment infra issue the agent never gets past.

Run summary

| Field | Value |
| --- | --- |
| Job run | 814658487513252 |
| MLflow run | c9d18c20412747e1940b162d5b3f4ad2 |
| Workspace | e2-dogfood |
| Prompt preset | pr (10 prompts) |
| Tags | skills_pr:144, trigger:one_time |
| Skills ref | 66f283719a717ec5eaf89e470e8bca71c8b74bbb |
| CLI | v1.3.0 |
| AppKit | 0.38.1 |

Per-app scores vs last green PR-preset baseline (916364192518713 / appkit_pr:346)

| App | Baseline | This run | Delta |
| --- | --- | --- | --- |
| cb_brickhouse_simple | 1.00 | n/a (eval timed out) | |
| cb_genie_chat_advanced | 1.00 | 1.00 | 0 |
| cb_pixels_simple | 1.00 | 1.00 | 0 |
| city_performance_app | 1.00 | 0.125 | −0.875 |
| devhub_saas_tracker | 1.00 | 1.00 | 0 |
| genie_taxi_chat | 1.00 | 1.00 | 0 |
| parts_catalog_app | 0.55 | 0.125 | −0.425 |
| property_search_app | 1.00 | 0.125 | −0.875 |
| serving_chat | 1.00 | 1.00 | 0 |
| taxi_zones_map | 1.00 | 0.125 | −0.875 |

The 4 regressed apps are exactly the databricks_v2 promptset entries (SQL-Warehouse-backed apps). All 5 non-Warehouse apps held at 1.00.
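
As a sanity check on the gate number, the sketch below recomputes the average from the per-app scores above. It assumes the timed-out app is excluded from the mean and that the gate passes when the average is at least 0.85; the pipeline's actual aggregation is not shown in this thread, but that assumption does reproduce the reported 0.61.

```python
from statistics import mean

GATE = 0.85  # threshold quoted in the run report

# "This run" scores from the table above; None marks the eval that timed out.
scores = {
    "cb_brickhouse_simple": None,
    "cb_genie_chat_advanced": 1.00,
    "cb_pixels_simple": 1.00,
    "city_performance_app": 0.125,
    "devhub_saas_tracker": 1.00,
    "genie_taxi_chat": 1.00,
    "parts_catalog_app": 0.125,
    "property_search_app": 0.125,
    "serving_chat": 1.00,
    "taxi_zones_map": 0.125,
}

# Assumption: apps without a score are left out of the average.
scored = [s for s in scores.values() if s is not None]
avg_appeval = mean(scored)
passed = avg_appeval >= GATE

print(f"avg_appeval = {avg_appeval:.2f} over {len(scored)} apps, gate passed: {passed}")
```

Five apps at 1.00 plus four at 0.125 give 5.5 / 9, so four install failures alone are enough to trip the gate.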

Root cause

All 4 regressed apps' eval.json shows the same failure: npm_install failed, cascading every downstream step (build / unit / smoke / typecheck / apps_validate) to "npm install failed". The actual error in the iteration logs:

> appkit generate-types
[appkit:type-generator:query-registry]
  DESCRIBE rejected for city_booking_trends:
  Response from server (Forbidden)
  {"error_code":"PERMISSION_DENIED",
   "message":"You do not have permission to use the SQL Warehouse.",
   "details":[{"resource_type":"warehouse",
               "resource_name":"75d3c8bdec7d1569",
               "description":"user is not authorized to use this warehouse"}]}
Error: Type generation failed: 5 queries could not be described.
npm error code 1 (postinstall: npm run typegen)

appkit generate-types runs as the postinstall step of npm install and hits the SQL warehouse 75d3c8bdec7d1569. The eval cluster's identity is not authorized for that warehouse, so typegen errors → postinstall errors → install is marked failed → every downstream eval step short-circuits.
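
The cascade described above can be sketched as a short-circuiting pipeline. This is an illustrative model only, not the eval harness's code: the `evaluate` function and the stand-in callables are hypothetical, and only the step names and the "npm install failed" marker come from the report.

```python
from typing import Callable, Dict

DOWNSTREAM_STEPS = ["build", "unit", "smoke", "typecheck", "apps_validate"]


def evaluate(
    install: Callable[[], bool],
    downstream: Dict[str, Callable[[], bool]],
) -> Dict[str, str]:
    """Run install first; if it fails, mark every downstream step without running it."""
    results: Dict[str, str] = {}
    if not install():
        results["npm_install"] = "failed"
        for name in downstream:
            results[name] = "npm install failed"  # short-circuit, step never executes
        return results
    results["npm_install"] = "passed"
    for name, step in downstream.items():
        results[name] = "passed" if step() else "failed"
    return results


def failing_install() -> bool:
    # Stand-in for npm install whose postinstall typegen hits PERMISSION_DENIED.
    return False


steps = {name: (lambda: True) for name in DOWNSTREAM_STEPS}
report = evaluate(failing_install, steps)
print(report)
```

Even though every downstream step would pass on its own here, none of them runs, which is why the affected apps' scores say nothing about the skill changes.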

The two other failures (eval timeout on cb_brickhouse_simple, edit_app error_max_turns on drop_unrequested_feature/parts_catalog_app) hit the same wall — the agent burned 31 turns retrying typegen against the inaccessible warehouse.

The pipeline's own trajectory-analysis artifact (trajectory_analysis.md on the MLflow run) corroborates this from the generation side: 5 of 9 trajectories burned 10–60 steps each tripping over warehouse 75d3c8bdec7d1569.

Verdict on this PR

Not blamed. The failures cascade from a postinstall infra step the agent doesn't control — the agent never reaches the part of the workflow this PR (fix(skills): stop dashboard-intent leaking away from databricks-aibi-dashboards) changes. The 5 apps that don't depend on the broken warehouse all scored 1.00, same as the baseline.

Next steps to unblock

  1. Re-grant the CAN USE permission on warehouse 75d3c8bdec7d1569 to the eval cluster's identity on e2-dogfood (or whichever service principal runs npm install on the eval cluster). The baseline run succeeded yesterday on the same workspace, so this is a recent permission change.
  2. Once unblocked, re-trigger this PR's eval run — only then will it actually exercise the skill changes.
  3. Longer-term: stop hardcoding 75d3c8bdec7d1569 as the default warehouse in databricks apps init (separate AppKit/template issue surfaced by the trajectory analysis as the dominant generation-time friction across databricks_v2 apps).

pkosiec added 2 commits June 18, 2026 12:24
…dashboards

A routing A/B (176 headless Claude Code sessions, realistic bundle of
databricks-aibi-dashboards + databricks-app-design + databricks-apps) found
plain "create a dashboard" prompts routed to aibi only 14/18 (78%). Root cause:
databricks-apps lists "dashboards" as a trigger and hard-routes any
data-displaying app to databricks-app-design → AppKit, so aibi was never
considered.

- databricks-app-design: lead the description with custom-code Databricks Apps
  (AppKit/React) screens instead of "dashboards, KPI pages…"; add an explicit
  rule that a plain "create a dashboard" = managed AI/BI (Lakeview) →
  databricks-aibi-dashboards. Mirror the rule in "When to use / when NOT".
- databricks-apps: drop "dashboards" from the trigger list; add the same
  deference rule to the description and an "Is this even a Databricks App?"
  paragraph in the body.
- databricks-app-design Codex openai.yaml: reword default_prompt so it no
  longer leads with dashboards.
- Regenerate manifest.json.

Validated empirically: dashboard-intent → 18/18 aibi in the bundle, ambiguous
prompts → aibi or a clarifying question, app-design keeps 6/6 of its legit
app-UX traffic (zero over-correction).

Co-authored-by: Isaac
…eference

Follow-up to PR #144 review:

- databricks-app-design/agents/openai.yaml: refresh short_description to the
  custom-code Databricks App (AppKit/React) framing so it matches the updated
  default_prompt. The generator preserves hand-authored openai.yaml, so the
  tagline does not refresh from SKILL.md frontmatter on its own.
- databricks-aibi-dashboards: add a "when a custom app fits better" callout and
  a Related Skills entry pointing to databricks-apps for genuine custom-app
  needs (write-back, bespoke UI, in-app Genie/chat). Points at the databricks-apps
  parent (not the databricks-app-design subskill) so the scaffold/data-access
  gate isn't skipped. The "Ask Genie button on this dashboard" case stays in aibi.

No frontmatter description or manifest.json changes.

Co-authored-by: Isaac
pkosiec force-pushed the pkosiec/improve-data-dashboards-skill branch from 66f2837 to b2af9a7, June 18, 2026 10:24