Refactor databricks-apps around capability composition + warehouse mutations #132
MarioCadenas wants to merge 7 commits into
cf2c711 to 4c43aa3
Document Delta/UC DML via custom routes, unify write-path guidance across skills, and expand Lakebase scaffolding and deployment notes.
The prior reorder conflated Lakebase deploy-first with the full lifecycle; keep Scaffold → Develop → Validate → Deploy and call out the OLTP exception.
Add data-patterns and lifecycle guides, slim SKILL.md to a 5-step agent workflow, dedupe overview and plugin guides, and broaden skill frontmatter for multi-plugin apps.
Extract OLTP and synced-read guides from the monolithic lakebase doc, add a thin router, point data-patterns and cross-skill links at the right targets, and trim custom-endpoints/proto-first duplication.
f238e3f to 2cc1f99
Adds a Local-vs-agentic-mode split keyed to DATABRICKS_APPS_AGENTIC_MODE, plus the P1-P3 review fixes from the data-path refactor.
Agentic mode (DATABRICKS_APPS_AGENTIC_MODE=true):
- New references/appkit/environments.md as the canonical Local-vs-agentic delta; Step-0 detection branch in SKILL.md.
- In agentic mode the app is pre-scaffolded and all plugin resources are provisioned: read enabled plugins from appkit.plugins.json / app.yaml (don't infer); ambient auth (no profile, omit --profile); run only design+discovery gates; skip provisioning gates, scaffold, deploy, and smoke tests; npm run dev hits live resources; still run databricks apps validate. Stop and surface if a needed plugin isn't wired.
- Short agentic callouts in lifecycle, data-patterns, lakebase-oltp, genie, model-serving, files, jobs, overview, sql-queries, warehouse-mutations.
Doc fixes:
- Capability flags marked as concepts, not --features values.
- Single canonical write-path table in data-patterns; custom-endpoints and warehouse-mutations now guard-and-link instead of restating it.
- warehouse-mutations leads with the inline pattern; generic is optional.
- Reframed the warehouse smoke test to a non-mutating check.
- Simplified the lifecycle phase matrix; standardized await createApp.
Co-authored-by: Isaac
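The Step-0 branch described in this commit message can be sketched as a small planner. This is only an illustration of the stated rule (agentic mode skips provisioning gates, scaffold, deploy, and smoke tests); the step labels and function names are hypothetical, and the exact-match check on `"true"` is an assumption about how the variable is read.

```python
import os

# Illustrative step labels for this sketch; they are NOT identifiers from the skill.
LOCAL_STEPS = [
    "design_gates", "discovery_gates", "provisioning_gates",
    "scaffold", "develop", "validate", "deploy", "smoke_test",
]

# Per the commit message: agentic mode skips provisioning gates, scaffold,
# deploy, and smoke tests, but still runs design + discovery gates and validate.
SKIPPED_IN_AGENTIC = {"provisioning_gates", "scaffold", "deploy", "smoke_test"}


def is_agentic(env=None):
    """Return True when DATABRICKS_APPS_AGENTIC_MODE is exactly 'true' (assumed check)."""
    env = os.environ if env is None else env
    return env.get("DATABRICKS_APPS_AGENTIC_MODE", "") == "true"


def plan_steps(env=None):
    """Return the ordered steps to run for the detected environment."""
    if is_agentic(env):
        return [s for s in LOCAL_STEPS if s not in SKIPPED_IN_AGENTIC]
    return list(LOCAL_STEPS)
```

Calling `plan_steps({"DATABRICKS_APPS_AGENTIC_MODE": "true"})` leaves only the design and discovery gates, development, and validation in the plan.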
🧪 Dev eval run kicked off for this PR
Running the Setup
Status: generation in progress (~45–60 min). I'll follow up with per-app
Note: a prod nightly is running concurrently and shares the Anthropic API key, so an isolated
✅ Eval results: no generation-quality regression from this PR
Run
Generation: all 1.0 (build + unit + smoke + typecheck +
The one miss (
| Run | Skill | genie_taxi_chat | Layout |
|---|---|---|---|
| original | this branch | 0.0 | genie-taxi-chat/genie-taxi-chat/ ✗ |
| re-run | this branch | 1.0 | genie-taxi-chat/ ✓ |
| control | stock main | 1.0 | genie-taxi-chat/ ✓ |
On re-run the PR skill produced a correct app identical to stock — the double-nesting was a one-off. Soft flag: worth a glance at whether the lifecycle/scaffold guidance reorg makes an extra wrapper directory slightly more likely, but it is not deterministic.
Edits — 0 build regressions
| Edit | Δ appeval | Note |
|---|---|---|
| property_search_app · add_emoji | −0.17 | smoke test pass→fail (build + unit OK), likely flaky selector |
| city_performance_app · fix_critical_issue | 0 | no critical issue found (legit no-op) |
| taxi_zones_map · simplify_code | 0 | clean |
| parts_catalog_app · drop_unrequested_feature | 0 | clean |
| parts_catalog_app · multi_turn_additive | 0 | clean |
Bottom line
The capability-composition refactor generates apps on par with stock skills across warehouse-read, Lakebase OLTP, Genie, model-serving, and devhub prompts — no build or quality regression. Skill confirmed installed from this branch (Using skills version refactor-app-capability-composition, 9 skills, no rate-limit on CLI v1.2.1).
Caveat: a prod nightly shared the Anthropic API key during the main run; it didn't materially affect results (the single transient slip cleared on re-run).
🧪 Full eval set now running on this PR
Following the
I'll follow up with the aggregate
Full eval-set results + merge assessment
Full
Headline: ✅ no generation/edit quality regression:
| Signal | Result |
|---|---|
| Apps at appeval_100 = 1.0 | 67 / 89 |
| Aggregate appeval_100 | 0.8407 (gate is ≥0.85 → run is SUCCESS_WITH_FAILURES, same steady state as stock prod nightlies) |
| Edit build regressions | 0 / 52 |
| Confirmed PR-attributable defects | 1 pattern (scaffold nesting) |
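The gate logic in the table can be sketched as follows. This is a minimal illustration, assuming the aggregate is the plain mean of per-app `appeval_100` scores; the thread does not state how the aggregate is computed, and the `"SUCCESS"` status name is an assumption (only `SUCCESS_WITH_FAILURES` appears above).

```python
def run_status(app_scores, gate=0.85):
    """Return (aggregate, status) for a list of per-app scores.

    Assumes the aggregate is an unweighted mean, which the thread does not confirm.
    """
    aggregate = sum(app_scores) / len(app_scores)
    # "SUCCESS" is a hypothetical name for the passing state.
    status = "SUCCESS" if aggregate >= gate else "SUCCESS_WITH_FAILURES"
    return round(aggregate, 4), status
```

Under that assumption, a handful of hard 0.0 scores is enough to pull a run below the 0.85 gate even when most apps score 1.0.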
The 14 hard/near failures, attributed
- 🔴 Doubly-nested scaffold, PR-attributable (4): `booking_calendar`, `cb_ontos_full_spec`, `driver_opportunity_map`, `customer_segment_clusters`. The agent emits a valid, build- and smoke-passing app one directory too deep (`<app>/<app>/package.json`), so the harness reports "Missing package.json" → hard 0.0. Direct proof it's this PR: the same 4 apps scaffold at the correct depth (`<app>/package.json`) on stock skills, so 4/4 nested on this branch vs 0/4 on stock. (It's intermittent: `genie_taxi_chat` nested once in the smoke run but scaffolded correctly on re-run. It is still clearly correlated with this branch.)
- 🟡 `wall_clock_timeout`, environmental, not the PR (5): `cb_aichemy_full_spec`, `cb_pixels_advanced`, `cb_pixels_full_spec`, `cb_support_dashboard_full_spec`, `devhub_lakebase_cdc`. Hit the 2400s/40-min cap with zero artifacts produced; a prod nightly shared the Anthropic API key during this run's generation. Heaviest cookbook specs; stock struggles with these too.
- ⚪ Non-standard app / stock also fails, not a regression (5): `cb_apx_full_spec` (Python CLI tool), `cb_dbt_docs_full_spec`/`_advanced` (static docs), `devhub_tpl_inventory_intelligence`, `devhub_off_platform_lakebase`. No `package.json`/`build` script by nature; stock scores the same.
Plus 6 minor (0.83–0.97) — almost all smoke-test or runability partials, not build failures.
Edits (52)
0 build regressions. 5 smoke-only regressions (4× add_emoji, 1× drop_unrequested_feature; each −0.167, app still builds/typechecks/validates — classic emoji-shifts-a-Playwright-selector flake). 4 edit-task errors (3× fix_critical_issue, 1× multi_turn_refactor) — within the normal edit-agent noise seen on stock nightlies. 42 clean, 1 improved.
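The −0.167 per-edit delta is what one failed check out of six equally weighted checks would produce. The sketch below only illustrates that arithmetic: the six-check list and the equal weighting are assumptions pieced together from check names mentioned in this thread, not the harness's documented scoring.

```python
# Hypothetical check list: the thread names build, unit, smoke, typecheck,
# validate, and runability, but the real appeval composition is not shown.
CHECKS = ("build", "unit", "smoke", "typecheck", "validate", "runability")


def appeval(passed):
    """Equal-weight score: the fraction of checks that passed."""
    return sum(1 for c in CHECKS if passed.get(c, False)) / len(CHECKS)


before = {c: True for c in CHECKS}      # every check passes before the edit
after = dict(before, smoke=False)       # only the smoke test flips to fail
delta = round(appeval(after) - appeval(before), 3)
```

With these assumptions `delta` is −0.167 and the post-edit score rounds to 0.83, which matches the lower end of the "minor" band reported above.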
Verdict: Fix the scaffold nesting before merge; everything else is clean.
The capability-composition refactor itself looks sound — 67/89 perfect, 0 build regressions on edits, and the aggregate sits exactly where stock prod nightlies sit (the sub-0.85 is heavy-app timeouts, not this PR). I would not merge as-is only because of one genuine, PR-introduced regression: the doubly-nested scaffold, which hard-fails otherwise-valid apps purely on directory placement and is proven against stock (4/4 vs 0/4).
To unblock: adjust the lifecycle/scaffold guidance so the agent scaffolds at <app>/ rather than wrapping it in a second <app>/<app>/ directory (e.g. be explicit that databricks apps init targets the existing app directory, not a new child). A defensive harness tweak (auto-descend one wrapper level before scoring) would also recover these, but the real fix belongs in the skill. After the fix, re-running just these 4 apps (+ genie_taxi_chat) should confirm recovery — happy to do that.
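The defensive harness tweak mentioned above (auto-descend one wrapper level before scoring) could look roughly like this. It is a sketch under assumptions: `resolve_app_root` is a hypothetical helper, not a function from the eval harness, and it treats the presence of `package.json` as the only signal of the app root.

```python
from pathlib import Path


def resolve_app_root(work_dir, app):
    """Return the directory holding package.json, descending at most one level.

    Hypothetical helper: checks <work_dir>/<app>/ first, then <work_dir>/<app>/<app>/.
    Returns None when neither location contains package.json.
    """
    root = Path(work_dir) / app
    if (root / "package.json").is_file():
        return root
    nested = root / app
    if (nested / "package.json").is_file():
        return nested
    return None
```

Limiting the descent to a single level keeps the harness from silently accepting arbitrarily misplaced apps.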
Caveats: single full run (n=1) on dev-dogfood, appkit 0.38.1, with a concurrent prod nightly sharing the API key (inflated the 5 timeouts). The scaffold-nesting conclusion is the robust one — it's a layout signal independent of appkit/contention.
The capability-composition scaffold guidance let the agent occasionally emit the app one directory too deep (work_dir/<app>/<app>/package.json) instead of work_dir/<app>/. The eval harness (and `databricks apps` tooling) expect the app at <app>/, so a valid, build- and smoke-passing app scored a hard 0.0 on "Missing package.json". Seen in a full nightly-lakebase eval of this branch: 4/89 apps nested on this skill vs 0/4 of the same apps on stock skills, plus an intermittent hit on genie_taxi_chat (cleared on re-run). Make step 3 explicit: run `apps init` from the working dir (it creates <app>/), don't mkdir/cd/re-init first, and verify <app>/package.json — lifting the inner dir up one level if a doubled <app>/<app>/ appears.
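The "lift the inner dir up one level" recovery described in this commit message can be sketched as below. The function name is hypothetical, and the sketch assumes the outer `<app>/` directory contains nothing except the doubled inner `<app>/`; it refuses to act otherwise rather than risk overwriting files.

```python
import shutil
from pathlib import Path


def flatten_double_nesting(work_dir, app):
    """Lift <work_dir>/<app>/<app>/* up to <work_dir>/<app>/ if the scaffold is doubled.

    Returns True when a lift happened, False when the layout was already correct
    or no nested package.json was found.
    """
    outer = Path(work_dir) / app
    inner = outer / app
    if (outer / "package.json").is_file() or not (inner / "package.json").is_file():
        return False
    # Assumption: the wrapper holds only the inner app directory.
    extras = [p for p in outer.iterdir() if p != inner]
    if extras:
        raise RuntimeError(f"unexpected files next to nested app: {extras}")
    # Move the inner directory aside first so a child also named <app> cannot collide.
    staging = outer.parent / f".{app}.lift"
    shutil.move(str(inner), str(staging))
    for child in staging.iterdir():
        shutil.move(str(child), str(outer / child.name))
    staging.rmdir()
    return True
```

Running it a second time is a no-op, since `<app>/package.json` already exists after the first lift.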
🔧 Pushed a fix for the scaffold-nesting regression —
✅ Fix verified: scaffold nesting resolved
Re-ran the affected apps on the patched branch (commit
5/5 now produce a single-level
Net for this PR
With the scaffold fix in, the one PR-attributable regression from the full-set eval is resolved. Combined with the earlier results (67/89 apps at 1.0, the sub-0.85 aggregate being the same heavy-app-timeout steady state as stock nightlies, and 0 edit build regressions), the capability-composition refactor is good to merge. The remaining full-run misses were all environmental (5 wall-clock timeouts under shared-API-key contention) or non-regressions (5 non-standard apps that fail on stock too).
Final number for the one pending app:
So all 5 affected apps are recovered: 4 → 1.0,
🔍 Agent-consumability review: inconsistencies that would confuse a consuming agent
Fresh static review of the full skill content on this branch (5 parallel reviewers over file groups + manual verification of every quoted claim against the files). Review bar: would another agent do the wrong thing, hit a contradiction, or have to guess? Eval results above tested the happy path; this hunts the paths evals don't cover.
Overall: the refactor's architecture holds up well. The canonical-table + guard-and-link model is real (no leftover divergent pattern tables; 143 relative links with only 1 hard break; genie CLI usage consistent; no capability-vs-
P0: agent ships broken code
P1: agent does the wrong thing or must guess
P2: confusing but recoverable
Verdict
Fix-before-merge recommended. Nothing architectural; the P0 and the agentic-guard P1s are one-to-few-line fixes each, but they sit exactly where agents will trip: the canonical mutation example, the orchestrator's step 0, and the two guides that missed agentic callouts. Happy to push these as a follow-up commit (same pattern as the scaffold fix) if useful.
Fixes the issues found by the static review on this PR (see review
comment): the contradictions and missing agentic-mode guards that would
make a consuming agent do the wrong thing.
- warehouse-mutations: sql.int() -> sql.number() in the canonical
example (sql-queries.md lists sql.int under "DO NOT exist"); widen
the extracted-routes param generic to include sql.number; fix the
invented appkit.jobs.run() -> appkit.jobs("<key>").runNow(); split
the hybrid --features sentence (read warehouse + write Delta is
analytics-only, not analytics,lakebase)
- SKILL.md step 0: agentic mode skips step 3 + deploy only — step 2's
design + discovery gates (write_path, read_path, data_discovery)
still run, matching step 2 and environments.md
- custom-endpoints: add the missing agentic-mode callout (never
manifest/--profile; read appkit.plugins.json / server.ts instead)
- lakebase-synced-reads: add the missing agentic-mode callout
(mirrors lakebase-oltp.md)
- genie: agentic-mode note for Multi-Space Deployment (skip bundle
provisioning subsections; client-side patterns still apply)
- proto-first: fix self-contradicting "When to Use" row (multi-plugin
apps only); fix broken link references/plugin-contracts.md ->
proto-contracts.md; schema-qualify the example migration (SP cannot
use `public`) and state migrations run in onPluginsReady with
deploy-first still applying
python3.10 scripts/skills.py validate: clean.
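The broken-link fix above (`references/plugin-contracts.md` -> `proto-contracts.md`) is the kind of issue a relative-link check catches. The following is a minimal sketch of such a check, written for illustration only; it is not the logic of `scripts/skills.py`, and the regex handles only plain inline `[text](target)` links.

```python
import re
from pathlib import Path

# Matches inline markdown links and captures the target, e.g. [text](path.md#anchor).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# URL schemes such as https: or mailto: are not relative links.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(root):
    """Return (file, target) pairs for relative links whose target file is missing."""
    root = Path(root)
    broken = []
    for md in sorted(root.rglob("*.md")):
        for target in LINK_RE.findall(md.read_text(encoding="utf-8")):
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue  # external URL or same-file anchor
            path = target.split("#", 1)[0]  # drop any #fragment before resolving
            if not (md.parent / path).exists():
                broken.append((str(md.relative_to(root)), target))
    return broken
```

An empty return value means every relative link in the tree resolves to an existing file.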
🔧 Review fixes shipped —
✅ Review fixes verified: eval clean (avg 0.98, aggregate gate passed)
Re-ran the
No regression from the review fixes; the scaffold-nesting fix continues to hold; this is the cleanest run of the series. One watch-item, not a blocker: avg generation turns were 69 (≈690s/app), elevated vs the 20–30 healthy band, though without a like-for-like turns baseline for this preset I can't attribute it (concurrent load on the shared API key is the usual suspect). That closes out everything from our side: evals (smoke + full 89-app), failure attribution, scaffold fix + verification, consumability review, review fixes + this verification. 🚢
Summary
Unifies the former #135 + #132 stack into a single PR (based on main). Refactors `databricks-apps` so agents compose apps from capabilities (`reads_warehouse`, `writes_oltp`, `genie`, `files`, …) instead of monolithic archetype docs, adds the warehouse-mutations write path, and teaches the skill to handle two environments (local vs agentic mode).
Capability refactor
- `warehouse-mutations.md`: Delta/UC DML via `appkit.analytics.query()` in custom routes
- `data-patterns.md`: canonical capability catalog, conditional gates, write/read paths, recipes, checklist slices
- `lifecycle.md`: dev / validate / deploy ordering
- `SKILL.md` to a thin orchestrator
- `lakebase.md` into router + `lakebase-oltp.md` + `lakebase-synced-reads.md`
- `custom-endpoints.md` → points at data-patterns; mark `proto-first.md` advanced-only
Agentic mode (`DATABRICKS_APPS_AGENTIC_MODE=true`)
- `environments.md` as the canonical Local-vs-agentic delta; Step-0 detection branch in `SKILL.md`
- `appkit.plugins.json` / `app.yaml` (don't infer); ambient auth (no profile, omit `--profile`); run only design+discovery gates; skip provisioning gates, scaffold, deploy, and smoke tests; `npm run dev` hits live resources; still run `databricks apps validate`. Stop and surface if a needed plugin isn't wired.
Review fixes (P1–P3)
- `--features` values
- `warehouse-mutations.md` leads with the simple inline pattern (generic optional); non-mutating smoke check; simplified lifecycle matrix; standardized `await createApp`
Supersedes #135 (its commits are included here).
Test plan
- `python3 scripts/skills.py generate && python3 scripts/skills.py validate`
- `appkit.analytics.query()` supports DML on the shipped AppKit version before relying on `warehouse-mutations.md`
- `DATABRICKS_APPS_AGENTIC_MODE=true` → reads `appkit.plugins.json`, no scaffold/deploy, ambient auth