skills: split into databricks-model-serving (ops) + databricks-ml-training (experimental) #110
Phase 1 of databricks#73's TODO #1b. Adds references/fm-api-endpoints.md with the curated Foundation Model API endpoint table (chat/instruct + embedding models) from databricks-solutions/ai-dev-kit's model-serving skill, plus common defaults and query examples (CLI + SDK).
Stripped: the cloud/language prefix on the docs link, and the leftover MCP-tool references in the source. The endpoint table itself is static catalog data, with no MCP coupling.
SKILL.md updates:
- bump version to 0.2.0
- point Endpoint Types table at the new reference
- point the Foundation Model discovery bullet at the new reference
Subsequent phases (separate PRs / commits) port the remaining dev-side content: classical-ml autolog patterns, Custom PyFunc signatures, ResponsesAgent with the create_text_output_item gotcha, UCFunctionToolkit + VectorSearchRetrieverTool resource passthrough.
Co-authored-by: Isaac
Aligns the verbatim a-d-k port with the live docs.databricks.com
supported-models page (validated via WebFetch on 2026-05-26):
ADDED (missing from a-d-k snapshot):
- databricks-claude-opus-4-7 (now most capable Claude)
- databricks-gpt-5-5-pro, 5-5
- databricks-gpt-5-4, 5-4-mini, 5-4-nano
- databricks-gpt-5-3-codex, 5-2-codex
- databricks-gemini-3-1-flash-lite, 3-5-flash
- databricks-qwen35-122b-a10b (Preview)
REMOVED (retired, no longer in docs):
- databricks-claude-3-7-sonnet
- databricks-meta-llama-3-1-405b-instruct
UPDATED notes:
- claude-opus-4-6 no longer "Most capable"
- gpt-5-2 no longer "Latest"
- gpt-5-1-codex-{max,mini} + gpt-5-2-codex marked retiring 2026-07-16
- gemini-3-pro marked retired 2026-03-26 with redirect through 2026-06-07
- Several Gemini / Codex endpoints annotated with cross-geo requirement
- qwen3-next-80b annotated as Preview
OPENING PARAGRAPH:
- "available in every workspace" -> "available in supported Model Serving
regions"; calls out cross-geo requirement for several endpoints
NOT TOUCHED (out of scope: not docs-validatable from supported-models page):
- served_entities[].entity_name guidance (line 3 second half)
- SKILL.md "system.ai.* catalog" claim on the pay-per-token row
These remain as in the a-d-k snapshot and should be revisited if/when
docs cover them directly.
Test plan: `scripts/skills.py validate` -> "Everything is up to date";
`scripts/skills.py generate` -> only refreshes manifest.json timestamps.
Co-authored-by: Isaac
…ot static catalog
Quentin pointed out (PR databricks#84) that the prior two commits actually ported from `main:databricks-skills/databricks-model-serving/`, not `experimental:databricks-skills/databricks-ml-training-serving/` as the PR description claimed. The two skills take opposite approaches:
- `main` ships a static catalog table of FM API endpoint names.
- `experimental` deliberately rejects that ("a static skill list goes stale fast — always list at runtime instead of hard-coding names") and ships a `databricks serving-endpoints list | jq ...` one-liner plus runtime-resolved defaults (highest-numbered Claude Sonnet for agents, highest-numbered `-codex-max` for code).
Re-port to match the experimental philosophy:
- `references/fm-api-endpoints.md`: replace the static catalog with the runtime-list snippet (filtered by `databricks-` name prefix AND `system.ai.*` served entity, to exclude non-FM endpoints sharing the prefix), runtime-resolved family defaults, and CLI + SDK query examples that use a placeholder endpoint name rather than a hard-coded model.
- `SKILL.md`: update the Endpoint Types row + the Foundation-Model discovery bullet to reframe the reference as "discover at runtime" rather than "curated table".
Version stays at 0.2.0 (frontmatter unchanged → manifest unchanged).
The 2026-05-26 catalog refresh in the previous commit is dropped here: the experimental skill's point is that no static table is the right shape, so curating one against docs.databricks.com isn't useful for the stable skill either.
Co-authored-by: Isaac
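The runtime-discovery rule described in this commit (name prefix AND `system.ai.*` served entity, then highest-numbered member of a family) can be sketched in Python roughly as follows. This is a minimal illustration, not the skill's actual `jq` one-liner: the JSON field names `config.served_entities[].entity_name`, the sample endpoints, and the helper names are all assumptions made for the example.

```python
import re
from typing import Any, Dict, List, Optional, Tuple


def is_fm_endpoint(ep: Dict[str, Any]) -> bool:
    """Keep endpoints named 'databricks-*' that serve a 'system.ai.*' entity."""
    if not ep.get("name", "").startswith("databricks-"):
        return False
    # Assumed response shape: config.served_entities[].entity_name
    entities = ep.get("config", {}).get("served_entities", [])
    return any(e.get("entity_name", "").startswith("system.ai.") for e in entities)


def version_key(name: str) -> Tuple[int, ...]:
    """Purely numeric name segments as a tuple, e.g. '...-sonnet-4-6' -> (4, 6)."""
    return tuple(int(n) for n in re.findall(r"-(\d+)(?=-|$)", name))


def pick_default(endpoints: List[Dict[str, Any]], family: str) -> Optional[str]:
    """Highest-numbered FM endpoint whose name contains `family`, or None."""
    names = [ep["name"] for ep in endpoints if is_fm_endpoint(ep) and family in ep["name"]]
    return max(names, key=version_key, default=None)


# Hypothetical sample standing in for parsed `serving-endpoints list` output.
sample = [
    {"name": "databricks-claude-sonnet-4-5",
     "config": {"served_entities": [{"entity_name": "system.ai.claude-sonnet-4-5"}]}},
    {"name": "databricks-claude-sonnet-4-6",
     "config": {"served_entities": [{"entity_name": "system.ai.claude-sonnet-4-6"}]}},
    # Shares the prefix but serves a user model, so the filter drops it.
    {"name": "databricks-claude-sonnet-9-9",
     "config": {"served_entities": [{"entity_name": "main.ml.my_model"}]}},
]

default_agent_llm = pick_default(sample, "claude-sonnet")
```

Filtering on both conditions matters because a user-created endpoint can share the `databricks-` prefix; comparing version tuples rather than strings avoids ordering `4-10` below `4-6`.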
…ental port
Previous commit (c148500) restated the experimental section in my own words and added a "Querying" section + provisioned-throughput aside + docs-link gloss that aren't in the upstream skill. The PR's stated goal is to port from experimental, so do an actual port, not a paraphrase.
`references/fm-api-endpoints.md` now mirrors the `## Foundation Model API endpoints` section of `experimental:databricks-ml-training-serving/SKILL.md` verbatim (heading promoted from `##` to `#` since this is a standalone file): intro paragraph + the `databricks serving-endpoints list | jq ...` one-liner + the family-based default-picking rule. Nothing else.
Also trim the SKILL.md discovery bullet back toward its original shape: link to the reference file for the runtime-list snippet, then the same `system.ai` / `serving-endpoints list` / `get-open-api` alternatives that were already there.
Co-authored-by: Isaac
…ntal
Expands the port from the FM-endpoints-only scope to cover every
section of `experimental:databricks-ml-training-serving/`. Mirrors
the experimental skill's 3-file structure 1:1 into stable's
`references/` directory; the standalone fm-api-endpoints.md added in
earlier commits goes away (its content lives inline in
training-and-serving.md exactly as it does in experimental's SKILL.md).
Added (all verbatim ports, mechanical adjustments only):
references/training-and-serving.md
Ports experimental SKILL.md content. Mechanical changes only:
frontmatter stripped (destination is a reference file, not a
SKILL.md); `1-custom-pyfunc.md` → `custom-pyfunc.md`,
`2-genai-agents.md` → `genai-agents.md` (filename renames);
`../<skill>/SKILL.md` → `../../<skill>/SKILL.md` (one more level
of nesting since this file is in references/ rather than at the
skill root). Content covers: canonical train/register/serve flow,
`mlflow.{sklearn,xgboost,…}.autolog()` patterns, UC alias-based
promotion, batch scoring via `spark_udf`, real-time endpoint
create + zero-downtime version swap, `state.ready` vs
`state.config_update` poll-both gotcha, `jobs submit --no-wait`
serverless deploy pattern, Foundation Model API endpoints
runtime-list, and the full gotchas trap-table.
references/custom-pyfunc.md
Ports experimental 1-custom-pyfunc.md verbatim.
Mechanical change: `[SKILL.md]` → `[training-and-serving.md]`
where the original cross-referenced its parent SKILL.md.
Content: file-based PyFunc ("Models from Code"),
`infer_signature`, `code_paths`, pre-deploy validation via
`mlflow.models.predict(env_manager="uv")`.
references/genai-agents.md
Ports experimental 2-genai-agents.md verbatim.
Mechanical changes: cross-skill paths bumped one level deeper;
`[SKILL.md]` → `[training-and-serving.md]`. Content covers:
`ResponsesAgent` interface, LangGraph agent with
`UCFunctionToolkit` + `VectorSearchRetrieverTool`, the
`create_text_output_item` raw-dict-silently-fails gotcha, the
`resources=[...]` passthrough-auth list (DatabricksServingEndpoint,
DatabricksFunction, DatabricksVectorSearchIndex, DatabricksLakebase),
async deploy via `agents.deploy()` from a serverless job, query
via CLI and OpenAI-compatible client.
Removed:
references/fm-api-endpoints.md
Standalone file from earlier commits; its content lives inline
in training-and-serving.md exactly as it does in experimental's
SKILL.md, so the deliberate split is no longer needed.
Stable SKILL.md updates (minimal, ops-focus preserved):
- FM-endpoint link targets updated from `references/fm-api-endpoints.md`
to `references/training-and-serving.md#foundation-model-api-endpoints`
in the Endpoint Types table row and the FM-discovery bullet.
- New `### Develop & deploy new models` subsection under "What's Next"
with a 3-row table pointing at the new dev-side references, framed
as "this skill is ops-focused; for the dev-side flow, see below".
Manifest regenerated.
Co-authored-by: Isaac
- The mechanical `../` → `../../` rewrite in the verbatim port assumed every peer skill is stable, but 4 of them live in `experimental/`. `../../<skill>/SKILL.md` resolved to `skills/<skill>/SKILL.md`, which does not exist for `databricks-agent-bricks`, `databricks-mlflow-evaluation`, `databricks-vector-search`, `databricks-unity-catalog`. Repointed to `../../../experimental/<skill>/SKILL.md`. `databricks-jobs` link unchanged (it's stable).
- SKILL.md frontmatter `description` only described the ops surface, so agents wouldn't route dev-side asks (train, register, PyFunc, ResponsesAgent) to this skill. Broadened to cover both ops and the new dev surface.
- Version bumped 0.2.0 → 0.3.0 + manifest regenerated.
Co-authored-by: Isaac
…-phase1
# Conflicts:
# manifest.json
Per @simonfaltum review: before resubmitting a deploy serverless job, agents should check whether a run is already in flight (active job runs filtered on run_name) or whether the target endpoint already exists in the right state. Avoids wasting ~15 min of serverless and racing for the same endpoint name.
Co-authored-by: Isaac
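As a rough sketch of that pre-submit check, assuming the active-run listing and the endpoint description have already been parsed from JSON into Python dicts: the field names `run_name`, `state.ready`, and `state.config_update` follow names used elsewhere in this PR, but the exact response shapes and the meaning of "right state" (taken here as ready and not updating) are assumptions, and the helper names are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional


def has_active_run(active_runs: Iterable[Dict[str, Any]], run_name: str) -> bool:
    """True when any in-flight run already carries this run_name."""
    return any(run.get("run_name") == run_name for run in active_runs)


def should_submit(
    active_runs: Iterable[Dict[str, Any]],
    run_name: str,
    endpoint: Optional[Dict[str, Any]] = None,
) -> bool:
    """Decide whether submitting a new deploy run is worthwhile."""
    if has_active_run(active_runs, run_name):
        return False  # a deploy is already in flight; don't race it
    if endpoint is not None:
        state = endpoint.get("state", {})
        if state.get("config_update") == "IN_PROGRESS":
            return False  # someone else is updating this endpoint right now
        if state.get("ready") == "READY" and state.get("config_update") == "NOT_UPDATING":
            return False  # assumed "right state": nothing left to do
    return True


# Hypothetical inputs for illustration.
runs = [{"run_id": 101, "run_name": "deploy-turbine-risk"}]
skip_because_running = should_submit(runs, "deploy-turbine-risk")
submit_fresh = should_submit([], "deploy-turbine-risk", endpoint=None)
```

A caller that needs to roll out a new model version onto an endpoint that is already ready would have to compare served versions as well; that comparison is left out of this sketch.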
…icks-ml-training
Splits the post-port databricks-model-serving skill into two skills with clean responsibility boundaries: databricks-model-serving keeps the endpoint lifecycle / ops surface, and a new experimental databricks-ml-training owns the dev-side training, MLflow tracking, UC registration, custom PyFunc, and hand-rolled ResponsesAgent content.
Also closes five small gaps in databricks-model-serving where non-obvious serving behavior from the original a-d-k port had fallen through the cracks (Python deployments client gotchas, zero-downtime version swap, two-field readiness rationale, classical-ML query shape, Serving-UI SP filter).
Co-authored-by: Isaac
simonfaltum left a comment
Reviewed the proposed end state. Note this PR is one commit stacked on the still-open #84, so what shows here is the combined ~712-line delta against main; merge #84 first, then re-check this PR's true delta (relevant to the vector-search link below).
Verdict: fix-then-merge - no blockers, but a few things to address, flagged inline.
Headline items:
- The HPO "train and register" example silently promotes the wrong model (autolog registers every trial; promotion picks latest-by-version, not best-by-metric).
- A cross-link to `databricks-jobs` points at a section (and content) that doesn't exist.
- The `databricks-vector-search` link will break once rebased onto current `main` (vector-search moved to `skills/`).
- MLflow pins (`mlflow==2.22.0`) contradict the "MLflow 3" text and the skill's own pin-to-runtime rule.
Verified clean (so you see coverage):
- `python3 scripts/skills.py validate` passes; manifest / Codex metadata / icons in sync. (The model-serving manifest description staying short is by design - stable skills get a curated description, experimental ones derive from frontmatter.)
- The MLflow APIs are real, not invented: `ResponsesAgent`, the `ResponsesAgentRequest/Response/StreamEvent` classes, `output_to_responses_items_stream` / `to_chat_completions_input` (match the official ResponsesAgent docs), and `DatabricksLakebase(database_instance_name=...)`.
- CLI flags verified against the CLI: `jobs submit --no-wait`, `jobs list-runs --active-only -o json`.
- No real credentials / workspace IDs; placeholders throughout; no destructive defaults.
- Strategy is strong: climbs to MLflow + UC registry + serverless jobs + serving, lands a durable governed gold UC table, delegates no-code agents to `databricks-agent-bricks`, public APIs only.
The model-serving additions (Deployments-client gotchas, alias + update_endpoint version swap, two-field readiness, dataframe_records query, runtime FM-API listing, SP-filter troubleshooting) are high-signal - one small client-variable issue noted inline.
Posted as a COMMENT (advisory, non-blocking).
| client = MlflowClient(registry_uri="databricks-uc") | ||
| latest = max(client.search_model_versions(f"name='{FULL_NAME}'"), | ||
| key=lambda v: int(v.version)) | ||
| client.set_registered_model_alias(FULL_NAME, "prod", latest.version) |
Promotes the wrong model. With autolog(registered_model_name=FULL_NAME) (line 85), every trial's .fit() logs and registers a version, so 20 trials produce ~20 versions. max(..., key=version) here then picks the last trial to finish, not the best by AUC, so @prod lands on an arbitrary model and the Optuna search is wasted. The prose at line 45 ("the best one is what gets registered") is inaccurate.
Fix: after study.optimize, either retrain once on study.best_params in a single parent run and register that, or select the winning run explicitly, e.g. client.search_runs(experiment_ids=[...], order_by=["metrics.<auc> DESC"], max_results=1) and alias that version.
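The difference between latest-by-version and best-by-metric can be seen in a small stand-alone sketch. It deliberately avoids the MLflow client so it runs anywhere; the `Version` dataclass, the run IDs, and the AUC values are hypothetical stand-ins for what `search_model_versions` and the trial runs would return.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Version:
    """Stand-in for a registered model version (version numbers are strings)."""
    version: str
    run_id: str


# Hypothetical per-trial validation AUC, keyed by the run that produced each version.
trial_auc: Dict[str, float] = {"run-a": 0.88, "run-b": 0.93, "run-c": 0.85}

versions: List[Version] = [
    Version("1", "run-a"),
    Version("2", "run-b"),
    Version("3", "run-c"),  # last trial to finish, not the best one
]

# What the reviewed snippet does: highest version number wins.
latest = max(versions, key=lambda v: int(v.version))

# What promotion should do: rank versions by the metric of their source run.
best = max(versions, key=lambda v: trial_auc[v.run_id])
```

Here `latest` is version 3 while `best` is version 2, so aliasing `latest` as `@prod` would ship the weakest of the three trials.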
| # → '{"model_version":"3","val_auc":0.91,"rows_scored":124,"endpoint":"turbine-risk-endpoint"}' | ||
| ``` | ||
|
|
||
| For the four `jobs submit` traps (`spec.client: "4"` requirement, TASK-vs-submit run_id, `print()` unreliable, tags rejected) and full debugging flow, see **[databricks-jobs](../../skills/databricks-jobs/SKILL.md#one-time-runs-jobs-submit--async-pattern-for-notebooks)**. |
This anchor doesn't resolve. databricks-jobs/SKILL.md has no heading matching #one-time-runs-jobs-submit--async-pattern-for-notebooks, and the "four jobs submit traps / full debugging flow" content isn't in that skill at all (grep for jobs submit, --no-wait, notebook_output finds nothing; only spec.client: "4" exists, and it's in its references/task-types.md). An agent following the link lands at the top of databricks-jobs and never finds what's promised here.
Fix: inline the four traps here, point at the real location, or add the section to databricks-jobs.
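A check for this class of broken anchor can be sketched in a few lines. The slug function below only approximates GitHub's heading-anchor rules (lowercase, drop punctuation, spaces to hyphens) and ignores details such as duplicate-heading suffixes and headings inside code fences; the function names and the sample document are hypothetical.

```python
import re
from typing import Set


def github_slug(heading: str) -> str:
    """Approximate GitHub's anchor for a heading's text."""
    text = heading.strip().lower()
    text = re.sub(r"[^\w\s-]", "", text)  # drop punctuation, keep word chars, spaces, hyphens
    return text.replace(" ", "-")


def heading_slugs(markdown: str) -> Set[str]:
    """Collect anchors for every ATX heading (# through ######)."""
    slugs = set()
    for line in markdown.splitlines():
        match = re.match(r"#{1,6}\s+(.*)", line)
        if match:
            slugs.add(github_slug(match.group(1)))
    return slugs


def anchor_resolves(markdown: str, anchor: str) -> bool:
    """True when `anchor` (with or without the leading '#') matches a heading."""
    return anchor.lstrip("#") in heading_slugs(markdown)


# Hypothetical target document for illustration.
doc = "# Jobs\n## Train and register (the 90% case)\n"
ok = anchor_resolves(doc, "#train-and-register-the-90-case")
broken = anchor_resolves(doc, "#one-time-runs-jobs-submit--async-pattern-for-notebooks")
```

Running such a check over every `file.md#anchor` link would catch the case flagged here, where the target file exists but the heading does not.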
| - **[databricks-model-serving](../../skills/databricks-model-serving/SKILL.md)** — serving-endpoint lifecycle (create, query, update-config, version-swap, AI Gateway, Foundation Model API endpoints). | ||
| - **[databricks-agent-bricks](../databricks-agent-bricks/SKILL.md)** — no-code Knowledge Assistants and Supervisor Agents. Prefer this over hand-rolling agents. | ||
| - **[databricks-mlflow-evaluation](../databricks-mlflow-evaluation/SKILL.md)** — evaluate model/agent quality before promoting `@prod`. | ||
| - **[databricks-vector-search](../databricks-vector-search/SKILL.md)** — vector indexes used as retrieval tools in agents. |
This link will break once the PR rebases onto current main. databricks-vector-search was promoted from experimental/ to skills/ on main (it's now skills/databricks-vector-search/, gone from experimental/). ../databricks-vector-search/SKILL.md resolves on this branch only because it's based on an older main.
Fix: ../../skills/databricks-vector-search/SKILL.md.
| resources=resources, # auto-auth — DO NOT skip | ||
| input_example={"input": [{"role": "user", "content": "What's the maintenance history for turbine WTG-12?"}]}, | ||
| pip_requirements=[ | ||
| "mlflow==2.22.0", |
This pin contradicts the surrounding text and the skill's own rule. The text calls ResponsesAgent "MLflow 3's standardized agent interface" and notes "DBR 16.1+ has mlflow 3.x", but pins mlflow==2.22.0 here (and in custom-pyfunc.md:72, SKILL.md:182). The Gotchas table (SKILL.md:236) warns that a pip_requirements mismatch crashes the endpoint at load and says to pin to the runtime. Logging from a DBR-3.x runtime but forcing serving to 2.22.0 is exactly that skew.
Fix: pin the MLflow 3.x version DBR ships, or use the live f"mlflow=={version('mlflow')}" pattern the skill already recommends. (ResponsesAgent does exist in 2.22.0, so it's not a guaranteed import error, but the version skew with databricks-langchain / langgraph is the real risk.)
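A minimal sketch of the pin-to-runtime pattern mentioned in the fix, built on the standard library's `importlib.metadata.version`. The `runtime_pin` helper and its injectable `lookup` parameter are hypothetical conveniences added so the behavior can be exercised without MLflow installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable


def runtime_pin(package: str, lookup: Callable[[str], str] = version) -> str:
    """Return 'package==<installed version>' for the current environment."""
    try:
        return f"{package}=={lookup(package)}"
    except PackageNotFoundError as exc:
        # Fail loudly instead of silently logging an unpinned requirement.
        raise RuntimeError(f"{package} is not installed where the model is logged") from exc


# With a stubbed lookup, to show the resulting string shape:
example_pin = runtime_pin("mlflow", lookup=lambda _name: "3.1.0")
```

In a logging notebook the default lookup would be used, e.g. `pip_requirements=[runtime_pin("mlflow")]`, so the served environment matches whatever version the runtime actually ships.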
| @@ -0,0 +1,257 @@ | |||
| --- | |||
| name: databricks-ml-training | |||
| description: "Classical ML and custom-agent model training, MLflow tracking, and Unity Catalog model registration on Databricks. Use when the user asks to: train models (with MLflow, sklearn, XGBoost, LightGBM, PyTorch, custom pyfunc, etc.); run hyperparameter tuning with Optuna; register models to Unity Catalog and promote versions with `@prod` / `@challenger` aliases; load a registered model for batch scoring via `mlflow.pyfunc.spark_udf`; run inferences as batch, build custom MLflow PyFunc models (Models from Code); author a custom MLflow `ResponsesAgent` (LangGraph, OpenAI-compatible chat) with UC Function or Vector Search tools. NOT for: managing existing serving endpoints (use databricks-model-serving); no-code Knowledge Assistants or Supervisor Agents (use databricks-agent-bricks); MLflow evaluation / scorers (use databricks-mlflow-evaluation)." | |||
At ~850 characters this is the longest skill description in the repo (the experimental median is ~250; the next-longest is ~740). It's the agent's routing input. The explicit "Use when / NOT for" triage is genuinely useful and worth keeping, but the framing around it could be trimmed to bring it closer to siblings.
| from langgraph.prebuilt.tool_node import ToolNode | ||
| from typing import Annotated, Generator, Sequence, TypedDict | ||
|
|
||
| LLM_ENDPOINT = "databricks-claude-sonnet-4-6" # resolve at runtime — see training-and-serving.md |
Stale reference. training-and-serving.md doesn't exist anywhere in the repo (the PR description itself says no such paths remain). Drop the comment or point at the live source for resolving the LLM endpoint at runtime.
| model_name = sys.argv[1] | ||
| version = sys.argv[2] | ||
| endpoint_name = sys.argv[3] if len(sys.argv) > 3 else None | ||
|
|
||
| # Always pass endpoint_name explicitly — auto-derived names are | ||
| # `agents_<catalog>-<schema>-<model>` with dots → dashes, which is unpredictable. | ||
| kwargs = {"tags": {"aidevkit_project": "ai-dev-kit"}} | ||
| if endpoint_name: | ||
| kwargs["endpoint_name"] = endpoint_name | ||
|
|
||
| deployment = agents.deploy(model_name, version, **kwargs) | ||
|
|
||
| # Land structured output via dbutils.notebook.exit — print() unreliable on serverless. | ||
| dbutils.notebook.exit(json.dumps({ | ||
| "endpoint_name": deployment.endpoint_name, | ||
| "query_endpoint": deployment.query_endpoint, | ||
| })) |
This won't run as written. The block reads parameters from sys.argv[1..3], but line 220 says to submit it "as the notebook" via jobs submit, and the submit JSON passes no parameters. Notebook tasks receive parameters via dbutils.widgets, not sys.argv, so sys.argv[1] raises IndexError. (dbutils.notebook.exit at line 214 is notebook-only, confirming this is meant as a notebook.)
Fix: read params via dbutils.widgets.get(...) and pass base_parameters in the submit, or run it as a spark_python_task.
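One way to wire the parameters, sketched as the submit payload a caller might build. The notebook path, task key, run name, and helper name are hypothetical; compute and environment settings are omitted, so this is not a complete serverless submit spec.

```python
import json
from typing import Any, Dict


def build_submit_payload(
    notebook_path: str, model_name: str, version: str, endpoint_name: str
) -> Dict[str, Any]:
    """One-time run payload that passes deploy parameters to a notebook task."""
    return {
        "run_name": f"deploy-{endpoint_name}",
        "tasks": [
            {
                "task_key": "deploy_agent",
                "notebook_task": {
                    "notebook_path": notebook_path,
                    # Notebook parameters are string-valued and surface as widgets.
                    "base_parameters": {
                        "model_name": model_name,
                        "version": str(version),
                        "endpoint_name": endpoint_name,
                    },
                },
            }
        ],
    }


payload = build_submit_payload(
    "/Workspace/Users/someone@example.com/deploy_agent",  # hypothetical path
    "main.ml.turbine_agent",                              # hypothetical model
    "3",
    "turbine-agent-endpoint",
)
payload_json = json.dumps(payload, indent=2)

# Inside the notebook, the matching reads would then be:
#   model_name = dbutils.widgets.get("model_name")
#   version = dbutils.widgets.get("version")
#   endpoint_name = dbutils.widgets.get("endpoint_name")
```

Keeping the parameter names identical on both sides is the whole contract: a key missing from `base_parameters` has no widget to read in the notebook.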
| client.set_registered_model_alias(FULL_NAME, "prod", new_version) | ||
| client.update_endpoint(endpoint=ENDPOINT_NAME, config={ | ||
| "served_entities": [{"entity_name": FULL_NAME, "entity_version": new_version, | ||
| "workload_size": "Small", "scale_to_zero_enabled": True}], | ||
| "traffic_config": {"routes": [ | ||
| {"served_model_name": f"{NAME}-{new_version}", "traffic_percentage": 100} | ||
| ]}, | ||
| }) |
These two calls are on different client types but share one client variable. set_registered_model_alias(...) is a method on mlflow.tracking.MlflowClient; update_endpoint(...) is on the MLflow Deployments client (mlflow.deployments.get_deploy_client("databricks")). As written, a single client can't do both - whichever it is, the other call raises AttributeError.
Fix: use two distinctly-named clients, e.g. mlflow_client.set_registered_model_alias(...) and deploy_client.update_endpoint(...).
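A sketch of the two-client shape, written against small protocols so it runs without MLflow installed. In real use `registry` would presumably be an `MlflowClient(registry_uri="databricks-uc")` and `deploy` the object returned by `mlflow.deployments.get_deploy_client("databricks")`; the `swap_version` helper and the protocol names are hypothetical, and the config body mirrors the reviewed snippet.

```python
from typing import Any, Dict, Protocol


class Registry(Protocol):
    """The one registry method this flow needs."""
    def set_registered_model_alias(self, name: str, alias: str, version: str) -> Any: ...


class Deployments(Protocol):
    """The one deployments method this flow needs."""
    def update_endpoint(self, endpoint: str, config: Dict[str, Any]) -> Any: ...


def swap_version(
    registry: Registry,
    deploy: Deployments,
    full_name: str,
    short_name: str,
    endpoint_name: str,
    new_version: str,
) -> Dict[str, Any]:
    """Repoint the @prod alias, then move endpoint traffic to the new version."""
    registry.set_registered_model_alias(full_name, "prod", new_version)
    config = {
        "served_entities": [{
            "entity_name": full_name,
            "entity_version": new_version,
            "workload_size": "Small",
            "scale_to_zero_enabled": True,
        }],
        "traffic_config": {"routes": [
            {"served_model_name": f"{short_name}-{new_version}", "traffic_percentage": 100},
        ]},
    }
    deploy.update_endpoint(endpoint=endpoint_name, config=config)
    return config
```

Passing the two clients in under separate names makes the mix-up described above impossible to write by accident, and lets the flow be tested with simple fakes.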
| scored = features.withColumn("risk_score", predict(*[features[c] for c in feature_cols])) | ||
|
|
||
| # Overwrite-per-run pattern for "latest score per entity": | ||
| scored.select("turbine_id", "risk_score", F.current_timestamp().alias("scored_at")) \ |
Two copy-paste snags: feature_cols is used but never defined, and F.current_timestamp() needs from pyspark.sql import functions as F (not auto-imported). Worth fixing so the example runs verbatim.
- ml-training/SKILL.md:
- Autolog without registered_model_name; retrain best params explicitly and
register that single model (avoids max-by-version landing on last trial).
- Add `from pyspark.sql import functions as F` import; define feature_cols
in the batch-scoring block.
- Drop the bogus `#one-time-runs-jobs-submit--async-pattern-for-notebooks`
anchor; reword the jobs-submit traps inline and point at databricks-jobs.
- Normalize `../../skills/X/SKILL.md` -> `../X/SKILL.md` (install-flatten).
- Pin `mlflow==3.1.0`.
- ml-training/references/custom-pyfunc.md: pin `mlflow==3.1.0`; normalize
the model-serving link to install-flatten path.
- ml-training/references/genai-agents.md: pin `mlflow==3.1.0`; replace stale
`training-and-serving.md` pointer with `databricks-model-serving`; replace
`sys.argv[...]` with `dbutils.widgets` for notebook param wiring; update
default deploy tag to `ai_generated_source=databricks-agent-skills`.
- model-serving/SKILL.md: split shared `client` into `registry` (MlflowClient)
and `deploy` (Deployments client) — the two surfaces are different objects
even though both happened to be called `client` before.
Co-authored-by: Isaac
Thanks for the thorough review, @simonfaltum. Addressed in 68bf9e5:
- experimental/databricks-ml-training/SKILL.md
- references/custom-pyfunc.md
- references/genai-agents.md
- skills/databricks-model-serving/SKILL.md
Description length: left as-is per discussion; the long discriminator helps Claude pick the right skill when there are several adjacent ML skills.
…l-training-split
# Conflicts:
# manifest.json
# skills/databricks-model-serving/SKILL.md
…able
After merging upstream/main (which brought in PR databricks#84's pre-split references/{training-and-serving,custom-pyfunc,genai-agents}.md and a stale "What's Next" table pointing at them), clean up the residuals and tighten both skill descriptions so an LLM orchestrator routes correctly:
- skills/databricks-model-serving/SKILL.md:
  - Drop the duplicate `### Develop & deploy new models` block left by the upstream merge. The earlier copy pointed at the now-removed `references/training-and-serving.md`, `references/custom-pyfunc.md`, `references/genai-agents.md`. Keep the single pointer to databricks-ml-training, and normalize its path to the install-flatten convention (`../databricks-ml-training/SKILL.md`).
  - Expand the YAML `description:` to mention OpenAPI schema retrieval, serving logs/metrics/permissions inspection, and off-platform streaming (Vercel AI SDK v6 / standalone Node.js into AI Gateway). Those are all already in the body / off-platform-streaming.md but weren't in the description's trigger list, so user phrasings like "stream from my Next.js app to a Databricks model" or "get the OpenAPI spec for my endpoint" wouldn't route here.
- experimental/databricks-ml-training/SKILL.md:
  - Add `submit a train-and-deploy notebook as a databricks jobs submit --no-wait serverless one-time run` to the description trigger list. That pattern has its own section in the body but wasn't in the description.
Regenerated manifest.json.
Co-authored-by: Isaac
@simonfaltum @dustin-anchorage when you get a chance, could you do a final pass? Conflict with upstream/main is resolved (merge brought in #84's pre-split refs which I removed since the dev-side content lives in the new ml-training skill), and the two skill frontmatter descriptions were also tightened to cover OpenAPI schemas, off-platform streaming, serving logs/metrics, and the jobs-submit serverless pattern. Body content already covered these but the description didn't, so an LLM orchestrator would have missed those user phrasings. Tip is e9692af. Thanks!
Both descriptions were the longest in the repo (model-serving 895 chars,
ml-training 949) vs an experimental median of ~250 and a next-longest of
~559. The "Use when / NOT for" triage is load-bearing for routing and
stays, but the framing was over-verbose.
- Collapse the CRUD verb list ("create, query, update, scale, or delete
serving endpoints") to "CRUD serving endpoints" — same trigger surface,
much denser.
- Drop redundant qualifiers in NOT-for lists ("no-code ... use X" →
"(X)"; "MLflow evaluation / scorers" → "MLflow evaluation"; etc.).
- ml-training: drop the framework enumeration ("MLflow, sklearn,
XGBoost, ..."), fold hyperparameter-tuning into the train trigger,
shorten "Models from Code" and `ResponsesAgent` qualifiers — the body
retains the full detail; description only needs to route.
Result: model-serving 895→717 chars, ml-training 949→586. Still on the
high end (these skills genuinely have many triggers) but no longer
outliers.
Co-authored-by: Isaac
Add `classification/regression` and the four most common framework names (XGBoost, scikit-learn, LightGBM, PyTorch) to the train trigger. The body already covers these, but agents routing on a phrase like "train an XGBoost classifier" or "regression model" benefit from explicit keyword hits in the description.
Co-authored-by: Isaac
@simonfaltum @dustin-anchorage when you have a minute, could you do a final pass? Since the last round:
Tip is a8ebfbb. Thanks!
dustinvannoy-db left a comment
One suggestion I'd like you to accept to fix a statement that contradicts docs. Then you can merge.
|
|
||
| ## Train and register (the 90% case) | ||
|
|
||
| `mlflow.autolog()` captures params, metrics, code, and the model artifact for every run; `registered_model_name=...` auto-registers the best run to UC (auto-incremented version). Wrap training with **Optuna** so each trial is a child run and the best one is what gets registered. |
| `mlflow.autolog()` captures params, metrics, code, and the model artifact for every run; `registered_model_name=...` auto-registers the best run to UC (auto-incremented version). Wrap training with **Optuna** so each trial is a child run and the best one is what gets registered. | |
| `mlflow.autolog()` captures params, metrics, code, and the model artifact for every run. Wrap training with **Optuna** so each trial is a nested run. **Don't** set `registered_model_name=…` on autolog — it registers a new UC version on *every* trial; instead retrain once on `study.best_params` and register only that winning model (below). |
Why this PR exists
PR #84 lands the model-serving content (endpoint create, query, update, traffic config, AI Gateway, Foundation Model API discovery) into `databricks-model-serving`. That's the right shape for a serving-ops skill, and it's what reviewers should expect a skill called "model serving" to contain.
The remaining a-d-k content (training a model with MLflow autolog, registering it to Unity Catalog, promoting versions via `@prod` aliases, custom PyFunc authoring, hand-rolled `ResponsesAgent` code) is a different lifecycle. It runs before an endpoint exists, often in a notebook submitted as a serverless job, and an agent asked to "train an XGBoost model and deploy it" needs both concerns surfaced cleanly rather than blended into one skill description.
This PR lands the dev-side content as a separate `databricks-ml-training` experimental skill, and weaves a few small but high-leverage serving-side fixes from the original a-d-k content into `databricks-model-serving` where they belong.
What this PR improves
A focused dev-side skill. New `experimental/databricks-ml-training/` owns the training → register → consume narrative: MLflow autolog with Optuna for hyperparameter tuning, `mlflow.set_registry_uri("databricks-uc")` + experiment-parent-folder pre-creation, alias-based promotion (`@prod` / `@challenger`), batch scoring via `mlflow.pyfunc.spark_udf`, custom PyFunc with the file-based "Models from Code" pattern, hand-rolled `ResponsesAgent` with LangGraph + UC Function + Vector Search tools, and the `databricks jobs submit --no-wait` train-and-deploy pattern.
Frontmatter triggers that actually triage. Each skill's description lists what it IS for and what it explicitly is NOT for, with cross-pointers (`databricks-ml-training` says "use databricks-model-serving for endpoint ops"; `databricks-model-serving` says "use databricks-ml-training for training and PyFunc authoring"). When the user says "train a model and deploy it," the orchestrator pulls both skills exactly once each.
Cross-skill links that resolve. Every `databricks-model-serving` → `databricks-ml-training` link and every reverse link uses the right relative path for the stable-`skills/` ↔ experimental-`experimental/` layout. No broken anchors, no stale paths to the old `training-and-serving.md` filename anywhere.
Five small but high-leverage gaps closed in `databricks-model-serving`. The original a-d-k port left a few non-obvious serving behaviors implicit. Each fix is woven into the existing section that already covers the topic: no new mega-sections, no duplication of MLflow boilerplate an LLM already knows from training data. The result: serving-side behavior an agent would otherwise have to guess at is now explicit and signposted.
Summary of changes
- `experimental/databricks-ml-training/` with SKILL.md + agents/ + assets/ + references/{custom-pyfunc.md, genai-agents.md}. Owns the full dev-side narrative (autolog + Optuna, UC registration, alias promotion, batch scoring, custom PyFunc, custom ResponsesAgent, train-and-deploy serverless job pattern).
- `databricks-model-serving/SKILL.md` inline (was previously linked into the relocated training file).
- `databricks-model-serving/SKILL.md`: MLflow Deployments client gotchas (`tags=` top-level, `served_model_name` derivation), zero-downtime version-swap pattern (alias-repoint AND `update_endpoint`), two-state-field readiness rationale (`state.ready` lies during version-swap), classical-ML `dataframe_records` query example, Serving-UI "Owned by me" SP-filter troubleshooting row. Each merged into the existing section that already covered the topic.
- `databricks-model-serving` bumped to 0.4.0 (description retightened, gaps closed). New `databricks-ml-training` at 0.1.0 under `experimental/`. Manifest regenerated; 27 skills total.
Reviewer aid
The split is on the natural seam: anything that runs before `mlflow.deployments.get_deploy_client(...).create_endpoint(...)` is dev-side and lives in `databricks-ml-training`, anything from `create_endpoint` onward is ops-side and lives in `databricks-model-serving`. The Python `create_endpoint(...)` / `update_endpoint(...)` call itself is canonically a serving operation and is documented there with the two non-obvious gotchas.
Validation: `python3 scripts/skills.py validate` passes; zero broken links across both touched skills.
This pull request and its description were written by Isaac.