Refactor databricks-apps around capability composition + warehouse mutations by MarioCadenas · Pull Request #132 · databricks/databricks-agent-skills

MarioCadenas · 2026-06-09T16:44:26Z

Summary

Unifies the former #135 + #132 stack into a single PR (based on main). Refactors databricks-apps so agents compose apps from capabilities (reads_warehouse, writes_oltp, genie, files, …) instead of monolithic archetype docs, adds the warehouse-mutations write path, and teaches the skill to handle two environments (local vs agentic mode).

Capability refactor

Add warehouse-mutations.md — Delta/UC DML via appkit.analytics.query() in custom routes
Add data-patterns.md — canonical capability catalog, conditional gates, write/read paths, recipes, checklist slices
Add lifecycle.md — dev / validate / deploy ordering
Slim SKILL.md to a thin orchestrator
Split lakebase.md into router + lakebase-oltp.md + lakebase-synced-reads.md
Strip duplicate pattern tables from plugin guides; trim custom-endpoints.md → points at data-patterns; mark proto-first.md advanced-only

Agentic mode (`DATABRICKS_APPS_AGENTIC_MODE=true`)

Add environments.md as the canonical Local-vs-agentic delta; Step-0 detection branch in SKILL.md
In agentic mode the app is pre-scaffolded and all plugin resources are provisioned: read enabled plugins from appkit.plugins.json / app.yaml (don't infer); ambient auth (no profile, omit --profile); run only design+discovery gates; skip provisioning gates, scaffold, deploy, and smoke tests; npm run dev hits live resources; still run databricks apps validate. Stop and surface if a needed plugin isn't wired.
Short agentic callouts across lifecycle, data-patterns, lakebase-oltp, genie, model-serving, files, jobs, overview, sql-queries, warehouse-mutations

Review fixes (P1–P3)

Capability flags marked as concepts, not --features values
One canonical write-path table; downstream guides guard-and-link instead of restating it
warehouse-mutations.md leads with the simple inline pattern (generic optional); non-mutating smoke check; simplified lifecycle matrix; standardized await createApp

Supersedes #135 (its commits are included here).

Test plan

python3 scripts/skills.py generate && python3 scripts/skills.py validate
Verify appkit.analytics.query() supports DML on the shipped AppKit version before relying on warehouse-mutations.md
Spot-check local flow: multi-plugin request → data-patterns gates → correct deep guide
Spot-check agentic flow: DATABRICKS_APPS_AGENTIC_MODE=true → reads appkit.plugins.json, no scaffold/deploy, ambient auth
Spot-check: form/CRUD → Lakebase; Delta write → warehouse mutations; dashboard → SQL queries

Document Delta/UC DML via custom routes, unify write-path guidance across skills, and expand Lakebase scaffolding and deployment notes.

The prior reorder conflated Lakebase deploy-first with the full lifecycle; keep Scaffold → Develop → Validate → Deploy and call out the OLTP exception.

Add data-patterns and lifecycle guides, slim SKILL.md to a 5-step agent workflow, dedupe overview and plugin guides, and broaden skill frontmatter for multi-plugin apps.

Extract OLTP and synced-read guides from the monolithic lakebase doc, add a thin router, point data-patterns and cross-skill links at the right targets, and trim custom-endpoints/proto-first duplication.

Adds a Local-vs-agentic-mode split keyed to DATABRICKS_APPS_AGENTIC_MODE, plus the P1-P3 review fixes from the data-path refactor. Agentic mode (DATABRICKS_APPS_AGENTIC_MODE=true): - New references/appkit/environments.md as the canonical Local-vs-agentic delta; Step-0 detection branch in SKILL.md. - In agentic mode the app is pre-scaffolded and all plugin resources are provisioned: read enabled plugins from appkit.plugins.json / app.yaml (don't infer); ambient auth (no profile, omit --profile); run only design+discovery gates; skip provisioning gates, scaffold, deploy, and smoke tests; npm run dev hits live resources; still run databricks apps validate. Stop and surface if a needed plugin isn't wired. - Short agentic callouts in lifecycle, data-patterns, lakebase-oltp, genie, model-serving, files, jobs, overview, sql-queries, warehouse-mutations. Doc fixes: - Capability flags marked as concepts, not --features values. - Single canonical write-path table in data-patterns; custom-endpoints and warehouse-mutations now guard-and-link instead of restating it. - warehouse-mutations leads with the inline pattern; generic is optional. - Reframed the warehouse smoke test to a non-mutating check. - Simplified the lifecycle phase matrix; standardized await createApp. Co-authored-by: Isaac

keugenek · 2026-06-10T11:47:31Z

🧪 Dev eval run kicked off for this PR

Running the app-evals pipeline with the databricks-apps skill pinned to this branch, to check the capability-composition refactor doesn't regress generated-app quality.

Setup

Skill loaded from this branch — confirmed on the eval cluster:

Installing Databricks skills with ref override: refactor-app-capability-composition
Using skills version refactor-app-capability-composition
Installed 9 skills (databricks-core, databricks-apps, databricks-lakebase, databricks-model-serving, …)

Preset: custom-pr — 10 apps spanning every promptset (warehouse reads, Lakebase OLTP cb_brickhouse, Genie, model serving, devhub) + 5 edit tasks. Covers the surfaces this PR reorganizes (capability composition, warehouse mutations, lakebase split, genie/serving/files guides).
Env: dev-dogfood, CLI v1.2.1, run 119124075098867 (internal Databricks workspace).

Status: generation in progress (~45–60 min). I'll follow up with per-app appeval_100 (build / unit / smoke / typecheck / apps validate / local runability), the average against the ≥ 0.85 health gate, and any edit regressions — compared against last night's stock-skills nightly as a baseline for the overlapping apps.

Note: a prod nightly is running concurrently and shares the Anthropic API key, so an isolated wall_clock_timeout on a heavy app may be capacity contention rather than a skill regression — I'll flag it if that happens.

keugenek · 2026-06-10T12:33:44Z

✅ Eval results — no generation-quality regression from this PR

Run 119124075098867 — custom-pr preset (10 apps across every promptset + 5 edits), databricks-apps skill pinned to this branch.

Metric	Result
Apps at `appeval_100` = 1.0	9 / 10
Avg `appeval_100`	0.90 (health gate ≥ 0.85 ✓)
Edit build regressions	0

Generation — all 1.0 (build + unit + smoke + typecheck + apps validate + local runability): property_search_app, parts_catalog_app, taxi_zones_map, city_performance_app, cb_brickhouse_simple (Lakebase), cb_genie_chat_advanced, cb_pixels_simple, serving_chat, devhub_saas_tracker.

The one miss (`genie_taxi_chat` = 0.0) was a non-reproducible scaffold slip, not a regression

In the main run the agent scaffolded the app one directory too deep (genie-taxi-chat/genie-taxi-chat/), so the harness couldn't find package.json → build skipped → 0.0. Generation otherwise reported success but took 48 turns (high), consistent with the agent fumbling the layout. I re-ran it both ways on the same appkit (0.38.1):

Run	Skill	`genie_taxi_chat`	Layout
original	this branch	0.0	`genie-taxi-chat/``genie-taxi-chat/` ✗
re-run	this branch	1.0	`genie-taxi-chat/` ✓
control	stock `main`	1.0	`genie-taxi-chat/` ✓

On re-run the PR skill produced a correct app identical to stock — the double-nesting was a one-off. Soft flag: worth a glance at whether the lifecycle/scaffold guidance reorg makes an extra wrapper directory slightly more likely, but it is not deterministic.

Edits — 0 build regressions

Edit	Δ appeval	Note
`property_search_app` · add_emoji	−0.17	smoke test pass→fail (build + unit OK) — likely flaky selector
`city_performance_app` · fix_critical_issue	0	no critical issue found (legit no-op)
`taxi_zones_map` · simplify_code	0	clean
`parts_catalog_app` · drop_unrequested_feature	0	clean
`parts_catalog_app` · multi_turn_additive	0	clean

Bottom line

The capability-composition refactor generates apps on par with stock skills across warehouse-read, Lakebase OLTP, Genie, model-serving, and devhub prompts — no build or quality regression. Skill confirmed installed from this branch (Using skills version refactor-app-capability-composition, 9 skills, no rate-limit on CLI v1.2.1).

Caveat: a prod nightly shared the Anthropic API key during the main run; it didn't materially affect results (the single transient slip cleared on re-run).

keugenek · 2026-06-10T13:20:43Z

🧪 Full eval set now running on this PR

Following the custom-pr smoke result above, I kicked off the full nightly-lakebase set (89 apps) with the skill pinned to this branch — the same comprehensive sweep the canonical prod nightly runs (73 nightly + 16 Lakebase, every promptset), plus the edit suite (~50 edit tasks across all 5 edit types).

Run: 526143550950576 on dev-dogfood (internal Databricks workspace), CLI v1.2.1.
Skill: confirmed installed from this branch again (Using skills version refactor-app-capability-composition).
Status: generation in progress (89 apps; ~1.5–2 h end-to-end).

I'll follow up with the aggregate appeval_100 vs the ≥ 0.85 health gate, a per-promptset breakdown, every generation failure / edit regression with a skill-vs-appkit-vs-flaky attribution, and a comparison against the stock-skills nightly. Any ambiguous skill-suspected failure gets a matched stock re-run to confirm (as with genie_taxi_chat above).

keugenek · 2026-06-11T07:44:06Z

Full eval-set results + merge assessment

Full nightly-lakebase set: 89 apps + 52 edits, databricks-apps skill pinned to this branch (run 526143550950576). Each failure was attributed via a parallel per-app investigation with adversarial verification, then cross-checked against stock skills.

Headline: ✅ no generation/edit quality regression — ⚠️ one real, fixable scaffold bug

Signal	Result
Apps at `appeval_100` = 1.0	67 / 89
Aggregate `appeval_100`	0.8407 (gate is ≥0.85 → run is `SUCCESS_WITH_FAILURES`, same steady state as stock prod nightlies)
Edit build regressions	0 / 52
Confirmed PR-attributable defects	1 pattern (scaffold nesting)

The 14 hard/near failures, attributed

🔴 Doubly-nested scaffold — PR-attributable (4): booking_calendar, cb_ontos_full_spec, driver_opportunity_map, customer_segment_clusters. The agent emits a valid, build- and smoke-passing app one directory too deep (<app>/<app>/package.json), so the harness reports "Missing package.json" → hard 0.0. Direct proof it's this PR: the same 4 apps scaffold at the correct depth (<app>/package.json) on stock skills — 4/4 nested on this branch vs 0/4 on stock. (It's intermittent — genie_taxi_chat nested once in the smoke run but scaffolded correctly on re-run — but clearly correlated with this branch.)
🟡 wall_clock_timeout — environmental, not the PR (5): cb_aichemy_full_spec, cb_pixels_advanced, cb_pixels_full_spec, cb_support_dashboard_full_spec, devhub_lakebase_cdc. Hit the 2400s/40-min cap with zero artifacts produced; a prod nightly shared the Anthropic API key during this run's generation. Heaviest cookbook specs; stock struggles with these too.
⚪ Non-standard app / stock also fails — not a regression (5): cb_apx_full_spec (Python CLI tool), cb_dbt_docs_full_spec/_advanced (static docs), devhub_tpl_inventory_intelligence, devhub_off_platform_lakebase. No package.json/build script by nature; stock scores the same.

Plus 6 minor (0.83–0.97) — almost all smoke-test or runability partials, not build failures.

Edits (52)

0 build regressions. 5 smoke-only regressions (4× add_emoji, 1× drop_unrequested_feature; each −0.167, app still builds/typechecks/validates — classic emoji-shifts-a-Playwright-selector flake). 4 edit-task errors (3× fix_critical_issue, 1× multi_turn_refactor) — within the normal edit-agent noise seen on stock nightlies. 42 clean, 1 improved.

Verdict: Fix the scaffold nesting before merge; everything else is clean.

The capability-composition refactor itself looks sound — 67/89 perfect, 0 build regressions on edits, and the aggregate sits exactly where stock prod nightlies sit (the sub-0.85 is heavy-app timeouts, not this PR). I would not merge as-is only because of one genuine, PR-introduced regression: the doubly-nested scaffold, which hard-fails otherwise-valid apps purely on directory placement and is proven against stock (4/4 vs 0/4).

To unblock: adjust the lifecycle/scaffold guidance so the agent scaffolds at <app>/ rather than wrapping it in a second <app>/<app>/ directory (e.g. be explicit that databricks apps init targets the existing app directory, not a new child). A defensive harness tweak (auto-descend one wrapper level before scoring) would also recover these, but the real fix belongs in the skill. After the fix, re-running just these 4 apps (+ genie_taxi_chat) should confirm recovery — happy to do that.

Caveats: single full run (n=1) on dev-dogfood, appkit 0.38.1, with a concurrent prod nightly sharing the API key (inflated the 5 timeouts). The scaffold-nesting conclusion is the robust one — it's a layout signal independent of appkit/contention.

The capability-composition scaffold guidance let the agent occasionally emit the app one directory too deep (work_dir/<app>/<app>/package.json) instead of work_dir/<app>/. The eval harness (and `databricks apps` tooling) expect the app at <app>/, so a valid, build- and smoke-passing app scored a hard 0.0 on "Missing package.json". Seen in a full nightly-lakebase eval of this branch: 4/89 apps nested on this skill vs 0/4 of the same apps on stock skills, plus an intermittent hit on genie_taxi_chat (cleared on re-run). Make step 3 explicit: run `apps init` from the working dir (it creates <app>/), don't mkdir/cd/re-init first, and verify <app>/package.json — lifting the inner dir up one level if a doubled <app>/<app>/ appears.

keugenek · 2026-06-11T10:55:45Z

🔧 Pushed a fix for the scaffold-nesting regression — `d24c3c5`

Committed directly to this branch to address the one PR-attributable defect from the full-set eval (the doubly-nested scaffold). 2 files, +5 lines:

SKILL.md (step 3 · Scaffold) and references/appkit/data-patterns.md (Scaffolding) — made directory placement explicit and added a verify-and-correct step:

Run apps init from the working directory — it creates the app under <app>/. Don't mkdir/cd into an app dir first or re-run init, or the app nests at <app>/<app>/. After init, verify <app>/package.json exists; if a doubled <app>/<app>/ appears, lift the inner app up one level.

Why this is the fix: the agent was running the correct single databricks apps init --name <slug> command, but — intermittently, and nudged by the new Step-0 "the app is already scaffolded" framing — emitting the app at <slug>/<slug>/. The harness (and databricks apps tooling) resolve the app at <slug>/, so it read the empty outer dir and reported "Missing package.json" → 0.0. The new verify-and-correct step is deterministic, so even on an intermittent nesting the agent now self-corrects rather than shipping a mislocated app.

python3 scripts/skills.py validate → "Everything is up to date" (body-only change; frontmatter/manifest untouched).

Verification in progress: re-running the 4 confirmed-nesting apps + genie_taxi_chat on the patched branch (run 423984765726779) to confirm they recover to a correct <app>/ layout. I'll follow up with the result.

Note: nesting is intermittent (it didn't reproduce on a clean genie_taxi_chat re-run earlier), so the durable guarantee here is the explicit verify-and-correct instruction, not just the reworded init step.

keugenek · 2026-06-11T11:25:31Z

✅ Fix verified — scaffold nesting resolved

Re-ran the affected apps on the patched branch (commit d24c3c5, run 423984765726779). All previously-nested apps now scaffold at the correct <app>/ depth and recover their scores:

App	Pre-fix (full run)	Patched branch
`booking_calendar`	0.0 — `<app>/<app>/` ✗	1.0 — `<app>/` ✓
`driver_opportunity_map`	0.0 — `<app>/<app>/` ✗	1.0 — `<app>/` ✓
`customer_segment_clusters`	0.0 — `<app>/<app>/` ✗	1.0 — `<app>/` ✓
`genie_taxi_chat`	0.0 — nested	1.0 — `<app>/` ✓
`cb_ontos_full_spec`	0.0 — `<app>/<app>/` ✗	layout `<app>/` ✓ — eval finishing

5/5 now produce a single-level <app>/package.json (the layout signal is the root cause; 4/5 evals already back to a perfect 1.0, cb_ontos_full_spec's eval is completing).

Net for this PR

With the scaffold fix in, the one PR-attributable regression from the full-set eval is resolved. Combined with the earlier results — 67/89 apps at 1.0, the sub-0.85 aggregate being the same heavy-app-timeout steady state as stock nightlies, and 0 edit build regressions — the capability-composition refactor is good to merge. The remaining full-run misses were all environmental (5 wall-clock timeouts under shared-API-key contention) or non-regressions (5 non-standard apps that fail on stock too).

keugenek · 2026-06-11T11:28:52Z

Final number for the one pending app: cb_ontos_full_spec → 0.833 on the patched branch (build ✓, typecheck ✓, validate ✓, runability ✓; only a smoke-test miss — exactly its stock-skills score, so no regression). It was 0.0 before purely from the nesting.

So all 5 affected apps are recovered: 4 → 1.0, cb_ontos_full_spec → 0.833 (= stock). Scaffold-nesting fix confirmed end-to-end. ✅

keugenek · 2026-06-11T12:10:41Z

🔍 Agent-consumability review — inconsistencies that would confuse a consuming agent

Fresh static review of the full skill content on this branch (5 parallel reviewers over file groups + manual verification of every quoted claim against the files). Review bar: would another agent do the wrong thing, hit a contradiction, or have to guess? Eval results above tested the happy path; this hunts the paths evals don't cover.

Overall: the refactor's architecture holds up well — the canonical-table + guard-and-link model is real (no leftover divergent pattern tables; 143 relative links with only 1 hard break; genie CLI usage consistent; no capability-vs---features conflation in any command example; env-var names match environments.md). The systematic gap is that agentic-mode guards are applied unevenly: the guides that got callouts are fine, but two files got none, one callout misses its provisioning-heavy section, and SKILL.md's own step 0 contradicts its step 2. Plus one hard API contradiction in the new warehouse-mutations centerpiece.

P0 — agent ships broken code

warehouse-mutations.md:77 uses sql.int(...) in the canonical copy-paste example, but sql-queries.md:114 says it doesn't exist. The primary mutation route example has rating: sql.int(parsed.data.rating), while sql-queries.md's helper list states // sql.int() - use sql.number() under "❌ These DO NOT exist". An agent copies the canonical route and ships a nonexistent helper — or reads both files and must guess which is right. Fix: sql.number(...) in the example (or, if the server-side @databricks/appkit export genuinely differs from @databricks/appkit-ui/js, scope the ❌ list per package — the two files must agree).

P1 — agent does the wrong thing or must guess

SKILL.md step 0 vs step 2 (agentic gates). Step 0: "Skip steps 2–3 and the deploy in step 5". Step 2's own note + environments.md: design + discovery gates (write_path, read_path, data_discovery) still run in agentic mode. Step 0's "Still do" list mentions only data discovery. An agent following step 0 literally skips the write/read-path decision tables and picks the wrong architecture. Fix: step 0 → "skip step 3 and the deploy in step 5; in step 2 run only the design + discovery gates."
warehouse-mutations.md:207 invents an API: appkit.jobs.run(...). jobs.md:84-97 defines the real one: appkit.jobs("key") → .runNow(...) / .runAndWait(...). Fix the handoff line to match.
warehouse-mutations.md:39 prescribes the wrong features for one of its two hybrids. "Hybrid apps (read warehouse + write Lakebase, or read SQL files + write Delta) use --features analytics,lakebase" — the second case is analytics-only; adding lakebase forces three lakebase.postgres.* --set values for resources the app doesn't have (and contradicts data-patterns' "do not add Lakebase just in case"). Fix: split the sentence.
custom-endpoints.md has zero agentic-mode guards — the only mutation-path guide with none — and its "ALWAYS complete these checks" section mandates databricks apps manifest --version <VERSION> --profile <PROFILE> (lines 41-44). environments.md: never run manifest, omit --profile in agentic mode. An agentic agent wiring a route follows the ALWAYS checklist into two forbidden commands. Fix: add the standard callout (read appkit.plugins.json / server/server.ts instead).
lakebase-synced-reads.md has no agentic callout (its sibling lakebase-oltp.md:9 has a thorough one). Its Scaffolding section instructs --set flags and line 7 instructs creating synced tables + SP grants via the databricks-lakebase skill — all forbidden in agentic mode. Fix: mirror the OLTP callout.
genie.md:7 callout's skip list omits Multi-Space Deployment (lines 144-204), which opens with pure bundle provisioning (databricks.yml variables/resources, app.yaml valueFrom). The section's client-side patterns (version stamp, stale-conversation cleanup) are still valid in agentic mode, so an agent can't just skip the whole section either. Fix: a note at the top of Multi-Space Deployment splitting provisioning (skip; use injected DATABRICKS_GENIE_SPACE_*; if a space isn't wired, stop and tell the user) from client-side patterns (still apply).
proto-first.md contradicts itself on when to use it. Header (line 3): "Advanced / optional … For most apps, skip this." Decision table (line 19): "| Creating a new Databricks app | YES — define contracts before databricks apps init |". An agent landing on the table proto-firsts a simple dashboard. Fix the row to "new multi-plugin app" + add a "single-plugin app → NO" row.
proto-first.md:308 broken link — [Plugin Contract Details](references/plugin-contracts.md) resolves to references/appkit/references/plugin-contracts.md (doesn't exist). The actual file is proto-contracts.md in the same directory — and this broken link is its only inbound reference (orphaned otherwise).
proto-first.md:249 example migration violates the skill's own Lakebase rules. CREATE TABLE IF NOT EXISTS runs ( — unqualified → public schema, which lakebase-oltp.md says the app SP cannot access ("Always create a custom schema"; troubleshooting lists exactly this permission denied for schema public failure). The file also never says migrations execute in onPluginsReady or that deploy-first still applies. Fix: schema-qualify the example + one line on where migrations run.

P2 — confusing but recoverable

Unguarded local-isms (class): SKILL.md Generic guidelines mandate --profile on validate and smoke-selector updates with no agentic exception (steps 0/4/5 carve them out, but the guidelines read as global rules); data-patterns.md slices "Deploy before local dev" (:161) and the "All AppKit apps" smoke item (:187); warehouse-mutations.md Validation/UC sections (--profile smoke check :156, smoke-spec update :215, "verify grants after deploy"); jobs.md:132 Resource Requirements manifest --profile. One-line "(local only)" guards fix the class.
data-patterns.md:5 "Derive --features as the union of capability flags below" — conflates concepts with CLI values; its own line 20 says never pass capability names (and the union cardinality is wrong: reads_synced+writes_oltp → one lakebase). Reword to "union of the Plugin column".
sql-queries.md:5 "Do not put DML in config/queries/ for client-side execution" — the qualifier implies server-side DML there might be OK; the slice rule is absolute (SELECT-only). Drop the qualifier.
sql-queries.md:102/267 "ALWAYS use sql.* … from @databricks/appkit-ui/js" — but server routes (warehouse-mutations) correctly import sql from @databricks/appkit. Scope the rule: UI package in client code, server package in routes.
warehouse-mutations.md:106 extracted-routes generic constrains params to Record<string, ReturnType<typeof sql.string>>, which can't typecheck the numeric param used by the very handler it claims to contain ("same /api/feedback handler as above").
genie.md single-space "Adding Genie to an Existing App" yml (72-97) omits user_api_scopes: dashboards.genie; multi-space includes it and troubleshooting (:295) calls it required for the does not have required scopes: genie error. State when the scope is needed (or add it to the single-space example).
model-serving.md:21 hardcodes --set "serving.serving-endpoint.name=…" with no derive-from-manifest caveat — every sibling guide (jobs.md:19, genie.md:68, data-patterns) mandates deriving keys; genie's troubleshooting documents the exact wrong-key failure.
lakebase.md router has no row for the both-capabilities hybrid; "two different patterns. Do not conflate them" reads as pick-one. One line: "Both? Follow both guides — separate tables/routes; OLTP deploy-first applies."
lifecycle.md:23 "Recommended slice order … After init: 1. Genie space" vs :16 "create or reuse the space before/with init". Make the slice item "Genie wiring (space already created at the gate)".
Cross-skill relative links (data-patterns.md:71, lakebase-synced-reads.md:7 → ../../../databricks-lakebase/…) escape the skill root — they work in this repo but break for standalone installs. Prefer naming the skill + doc.
proto-contracts.md:130 Jobs contract says "Jobs are invoked via @databricks/sdk-experimental", "Auth: Workspace token or OAuth", jobs.create(config) — contradicting jobs.md's jobs() plugin (env-discovered, SP-executed, trigger-only; authoring → databricks-jobs skill). Today the broken link (Reinforce repo policy: CODEOWNERS, security, contributing guidelines #9) ironically hides this stale page — fix the content along with the link.

Verdict

Fix-before-merge recommended — nothing architectural; the P0 and the agentic-guard P1s are one-to-few-line fixes each, but they sit exactly where agents will trip: the canonical mutation example, the orchestrator's step 0, and the two guides that missed agentic callouts. Happy to push these as a follow-up commit (same pattern as the scaffold fix) if useful.

Fixes the issues found by the static review on this PR (see review comment): the contradictions and missing agentic-mode guards that would make a consuming agent do the wrong thing. - warehouse-mutations: sql.int() -> sql.number() in the canonical example (sql-queries.md lists sql.int under "DO NOT exist"); widen the extracted-routes param generic to include sql.number; fix the invented appkit.jobs.run() -> appkit.jobs("<key>").runNow(); split the hybrid --features sentence (read warehouse + write Delta is analytics-only, not analytics,lakebase) - SKILL.md step 0: agentic mode skips step 3 + deploy only — step 2's design + discovery gates (write_path, read_path, data_discovery) still run, matching step 2 and environments.md - custom-endpoints: add the missing agentic-mode callout (never manifest/--profile; read appkit.plugins.json / server.ts instead) - lakebase-synced-reads: add the missing agentic-mode callout (mirrors lakebase-oltp.md) - genie: agentic-mode note for Multi-Space Deployment (skip bundle provisioning subsections; client-side patterns still apply) - proto-first: fix self-contradicting "When to Use" row (multi-plugin apps only); fix broken link references/plugin-contracts.md -> proto-contracts.md; schema-qualify the example migration (SP cannot use `public`) and state migrations run in onPluginsReady with deploy-first still applying python3.10 scripts/skills.py validate: clean.

keugenek · 2026-06-11T15:21:59Z

🔧 Review fixes shipped — `501cbfa`

All P0 + P1 findings from the consumability review above are now fixed on this branch (+19/−8 across 6 files; scripts/skills.py validate clean):

P0 warehouse-mutations.md — canonical example now uses sql.number(...) (was sql.int(...), which sql-queries.md lists as nonexistent); also widened the extracted-routes param generic to admit numeric params.
SKILL.md step 0 — agentic mode now skips step 3 + deploy only; step 2's design + discovery gates explicitly still run (matches step 2 + environments.md).
warehouse-mutations.md — appkit.jobs.run(...) → appkit.jobs("<key>").runNow(...); hybrid --features sentence split (read warehouse + write Delta = analytics only).
custom-endpoints.md + lakebase-synced-reads.md — added the missing agentic-mode callouts (never manifest/--profile; read appkit.plugins.json / server.ts; skip provisioning + handoffs).
genie.md — agentic note for Multi-Space Deployment (skip bundle subsections; client-side patterns still apply; stop if a space isn't wired).
proto-first.md — "When to Use" row scoped to multi-plugin apps (+ explicit single-plugin → NO row); broken link → proto-contracts.md; example migration schema-qualified (app_data.runs — SP can't use public) with a note that migrations run in onPluginsReady and deploy-first applies.

Remaining P2s (unguarded local-isms class, proto-contracts.md stale Jobs section, cross-skill links, genie user_api_scopes ambiguity, etc.) are listed in the review and left for the author's judgment — none block merge.

With these in, everything flagged as fix-before-merge is resolved; combined with the eval results above (no quality regression, scaffold fix verified), this PR looks good to merge from my side. 🚢

keugenek · 2026-06-12T10:45:42Z

✅ Review fixes verified — eval clean (avg 0.98, aggregate gate passed)

Re-ran the pr preset (10 apps + 5 edits) with the skill pinned to this branch including the review fixes (501cbfa) — run 1110944036077050:

Metric	Pre-fix smoke run (06-10)	This run (with `501cbfa`)
Generated	10/10	10/10, 0 failures
Apps at 1.0	9/10 (`genie_taxi_chat` 0.0 — scaffold nesting)	9/10 (`parts_catalog_app` 0.83 — smoke-flake only: build/types/validate/runability all pass)
Avg `appeval_100`	0.90	0.98
Aggregate gate (≥0.85)	passed	passed — task fully SUCCESS (not even `SUCCESS_WITH_FAILURES`)
Scaffold nesting	1 occurrence	0
Edit build regressions	0/5	0/5 (1 smoke-only `add_emoji` −0.17, same known-flaky class)
Apps-validate / runability	—	1.00 / 1.00

No regression from the review fixes; the scaffold-nesting fix continues to hold; this is the cleanest run of the series.

One watch-item, not a blocker: avg generation turns were 69 (≈690s/app) — elevated vs the 20–30 healthy band, though without a like-for-like turns baseline for this preset I can't attribute it (concurrent load on the shared API key is the usual suspect). appeval is unaffected.

That closes out everything from our side: evals (smoke + full 89-app), failure attribution, scaffold fix + verification, consumability review, review fixes + this verification. 🚢

MarioCadenas requested review from a team, lennartkats-db and simonfaltum as code owners June 9, 2026 16:44

MarioCadenas force-pushed the replace-trpc-with-custom-endpoints branch from cf2c711 to 4c43aa3 Compare June 10, 2026 09:15

MarioCadenas added 4 commits June 10, 2026 12:09

Add warehouse mutations guide and sharpen app data-path docs.

eea9958

Document Delta/UC DML via custom routes, unify write-path guidance across skills, and expand Lakebase scaffolding and deployment notes.

Restore develop-before-validate workflow in AppKit overview.

76f7afe

The prior reorder conflated Lakebase deploy-first with the full lifecycle; keep Scaffold → Develop → Validate → Deploy and call out the OLTP exception.

Refactor databricks-apps skills around capability composition.

ce35cb6

Add data-patterns and lifecycle guides, slim SKILL.md to a 5-step agent workflow, dedupe overview and plugin guides, and broaden skill frontmatter for multi-plugin apps.

Split Lakebase docs and finish capability-refactor polish.

2cc1f99

Extract OLTP and synced-read guides from the monolithic lakebase doc, add a thin router, point data-patterns and cross-skill links at the right targets, and trim custom-endpoints/proto-first duplication.

MarioCadenas force-pushed the refactor-app-capability-composition branch from f238e3f to 2cc1f99 Compare June 10, 2026 10:10

MarioCadenas changed the base branch from replace-trpc-with-custom-endpoints to app-data-path-docs June 10, 2026 10:11

MarioCadenas changed the title ~~Refactor databricks-apps around capability composition~~ Refactor databricks-apps around capability composition + warehouse mutations Jun 10, 2026

MarioCadenas mentioned this pull request Jun 10, 2026

Add warehouse mutations guide and sharpen app data-path docs #135

Closed

2 tasks

MarioCadenas changed the base branch from app-data-path-docs to main June 10, 2026 10:28

keugenek approved these changes Jun 11, 2026

View reviewed changes

MarioCadenas added the do not merge label Jun 11, 2026

simonfaltum approved these changes Jun 12, 2026

View reviewed changes

pkosiec removed the do not merge label Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor databricks-apps around capability composition + warehouse mutations#132

Refactor databricks-apps around capability composition + warehouse mutations#132
MarioCadenas wants to merge 7 commits into
mainfrom
refactor-app-capability-composition

MarioCadenas commented Jun 9, 2026 •

edited

Loading

Uh oh!

keugenek commented Jun 10, 2026

Uh oh!

keugenek commented Jun 10, 2026

Uh oh!

keugenek commented Jun 10, 2026

Uh oh!

keugenek commented Jun 11, 2026

Uh oh!

keugenek commented Jun 11, 2026

Uh oh!

keugenek commented Jun 11, 2026

Uh oh!

keugenek commented Jun 11, 2026

Uh oh!

keugenek commented Jun 11, 2026

Uh oh!

keugenek commented Jun 11, 2026

Uh oh!

keugenek commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

MarioCadenas commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Capability refactor

Agentic mode (DATABRICKS_APPS_AGENTIC_MODE=true)

Review fixes (P1–P3)

Test plan

Uh oh!

keugenek commented Jun 10, 2026

Uh oh!

keugenek commented Jun 10, 2026

✅ Eval results — no generation-quality regression from this PR

The one miss (genie_taxi_chat = 0.0) was a non-reproducible scaffold slip, not a regression

Edits — 0 build regressions

Bottom line

Uh oh!

keugenek commented Jun 10, 2026

🧪 Full eval set now running on this PR

Uh oh!

keugenek commented Jun 11, 2026

Full eval-set results + merge assessment

Headline: ✅ no generation/edit quality regression — ⚠️ one real, fixable scaffold bug

The 14 hard/near failures, attributed

Edits (52)

Verdict: Fix the scaffold nesting before merge; everything else is clean.

Uh oh!

keugenek commented Jun 11, 2026

🔧 Pushed a fix for the scaffold-nesting regression — d24c3c5

Uh oh!

keugenek commented Jun 11, 2026

✅ Fix verified — scaffold nesting resolved

Net for this PR

Uh oh!

keugenek commented Jun 11, 2026

Uh oh!

keugenek commented Jun 11, 2026

🔍 Agent-consumability review — inconsistencies that would confuse a consuming agent

P0 — agent ships broken code

P1 — agent does the wrong thing or must guess

P2 — confusing but recoverable

Verdict

Uh oh!

keugenek commented Jun 11, 2026

🔧 Review fixes shipped — 501cbfa

Uh oh!

keugenek commented Jun 12, 2026

✅ Review fixes verified — eval clean (avg 0.98, aggregate gate passed)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

MarioCadenas commented Jun 9, 2026 •

edited

Loading

Agentic mode (`DATABRICKS_APPS_AGENTIC_MODE=true`)

The one miss (`genie_taxi_chat` = 0.0) was a non-reproducible scaffold slip, not a regression

🔧 Pushed a fix for the scaffold-nesting regression — `d24c3c5`

🔧 Review fixes shipped — `501cbfa`