databrickslabs · FiifiB · Jun 25, 2026
@@ -0,0 +1,81 @@
+# SPEC: agent_supervisor
+
+> Required by `.cursor/12-ai-feature-lifecycle.mdc`.
+
+## 1. Purpose
+
+`agent_supervisor` is a Databricks Agent Bricks Multi-Agent Supervisor (MAS) that
+orchestrates OntoBricks entity/relationship mapping. It deterministically scores a
+domain's complexity (from source metadata + ontology) and routes the mapping task
+to either the heavyweight PGE engine (`agent_mapping_pge`) or the original simple
+single-agent engine (`agent_auto_assignment`). The routing decision is computed by
+a Unity Catalog function (`assess_domain_complexity`) and acted on via the
+supervisor's natural-language instructions.
+
+## 2. Identity
+
+| Field | Value |
+|---|---|
+| `agent_name` | `agent_supervisor` |
+| `module_path` | `src/agents/agent_supervisor/` |
+| `model_endpoint` | Agent Bricks MAS endpoint (provisioned via `mas.py`) |
+| `temperature` | `0.0` (assessment is deterministic; routing is rule-driven) |
+| `mlflow_experiment` | `/Shared/ontobricks/agents/supervisor` |
+
+## 3. Tool surface
+
+| Tool name | Input | Output | Purpose |
+|---|---|---|---|
+| `assess_domain_complexity` (UC fn) | `metadata_json`, `ontology_json` | JSON `{score, tier, recommended_engine, signals, rationale}` | Deterministic engine recommendation |
+| `pge_mapping` (endpoint) | mapping `custom_inputs` | mapping result + PGE extras | Run `agent_mapping_pge` |
+| `simple_mapping` (endpoint) | mapping `custom_inputs` | mapping result | Run `agent_auto_assignment` |
+
+## 4. Success criteria
+
+1. A 3-source domain sharing an NHS-number key with ~17 classes is routed to `pge`.
+2. A single-table, 2-class domain is routed to `simple`.
+3. The supervisor always calls `assess_domain_complexity` before routing and never
+   overrides its `recommended_engine`.
+
+## 5. Eval dimensions
+
+| Dimension | Metric | Threshold | Weight | Judge |
+|---|---|---|---|---|
+| `routing_accuracy` | predicted engine == expected engine over the baseline set | `0.95` | `0.50` | rule-based (`complexity.assess`) |
+| `determinism` | identical input yields identical recommendation across runs | `1.00` | `0.20` | rule-based |
+| `assessor_called_first` | supervisor calls `assess_domain_complexity` before any engine | `1.00` | `0.20` | trace inspection |
+| `latency_p95` | p95 assessment latency in seconds (excludes the engine run) | `<= 2.0` | `0.10` | wall-clock |
+
+**Aggregate threshold:** ≥ `0.90` to pass.
+
+## 6. Failure modes
+
+| Symptom | Detection | Mitigation |
+|---|---|---|
+| Supervisor skips the assessor and guesses | trace shows no `assess_domain_complexity` call | strengthen instructions; the assessor verdict is authoritative |
+| Complex domain routed to simple engine | `routing_accuracy` drop on cross-source cases | re-tune weights/threshold in `complexity.py` + `uc_function.sql` (keep in sync) |
+| UC function / Python drift | `test_uc_function_parity` shared-constant check | edit both files together |
+
+## 7. Eval dataset
+
+- **Baseline:** `tests/eval/datasets/agent_supervisor/baseline.jsonl` (≥20 examples;
+  mix of single-source/simple and multi-source/complex domains with the expected
+  engine).
+- **Regression:** added on first production mis-route.
+
+## 8. MLflow tracing
+
+The mapping-engine ResponsesAgents (`responses_agent.py`) trace via the shared
+MLflow `ResponsesAgent` plumbing; the assessment is logged at INFO. The MAS
+endpoint is traced by Agent Bricks.
+
+## 9. Plan reference
+
+`docs/plans/2026-06-25-goal-loop-and-pge-eval-design.md` (PGE family) + the PR-split
+plan tracked in session memory.
+
+## 10. Sign-off
+
+- [x] Sections 4, 5, 6, 7 filled.
+- [ ] Baseline eval run URI pasted into PR body.
+- [x] Aggregate threshold declared in §5.
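The §5 table gives weights and per-dimension thresholds but does not say how they combine into the aggregate. A minimal sketch, assuming each dimension contributes its full weight when it meets its threshold and nothing otherwise. That pass/fail scoring rule and the names `aggregate_score` and `passes_gate` are assumptions for illustration, not taken from the spec:

```python
from typing import Callable, Dict, Tuple

# (weight, pass predicate) per dimension. Weights and thresholds are copied
# from the section 5 table; the pass/fail combination rule is an ASSUMPTION.
DIMENSIONS: Dict[str, Tuple[float, Callable[[float], bool]]] = {
    "routing_accuracy": (0.50, lambda v: v >= 0.95),
    "determinism": (0.20, lambda v: v >= 1.00),
    "assessor_called_first": (0.20, lambda v: v >= 1.00),
    "latency_p95": (0.10, lambda v: v <= 2.0),  # seconds, lower is better
}
AGGREGATE_THRESHOLD = 0.90


def aggregate_score(measured: Dict[str, float]) -> float:
    """Sum the weights of every dimension whose measured value meets its threshold."""
    total = 0.0
    for name, (weight, passes) in DIMENSIONS.items():
        if passes(measured[name]):
            total += weight
    # Round to two decimals so float summation noise cannot flip the gate.
    return round(total, 2)


def passes_gate(measured: Dict[str, float]) -> bool:
    """True when the weighted aggregate reaches the declared threshold."""
    return aggregate_score(measured) >= AGGREGATE_THRESHOLD
```

Under this assumed rule, a run may miss only `latency_p95` and still pass (0.50 + 0.20 + 0.20 = 0.90); missing any other dimension drops the aggregate below the gate.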
@@ -0,0 +1,181 @@
+# 2026-06-25 — feat(ontology): PGE Evaluator stage for owl-generator
+
+## Context
+
+The owl-generator agent had single-shot generation plus a pitfall-tool fix loop,
+but no deterministic Evaluator stage — so structural defects (orphan classes,
+dangling domain/range, naming violations, duplicate classes) could survive into
+the delivered ontology. This change turns owl-generation into a real
+Planner→Generator→Evaluator (PGE) loop: after the pitfall loop settles, a
+deterministic Stage-1 evaluator scores the ontology against the source metadata
+and feeds concrete retry-hints back to the generator, bounded by a hard cap.
+
+The Evaluator reuses a small, usecase-agnostic ontology-metrics module
+(`agents.pge_eval.ontology_metrics`) — gold-free, computed purely from the
+generated ontology + source schema. Only the ontology slice of the metrics
+package is introduced here; the full scorecard/CLI lands separately.
+
+## Changes
+
+1. `src/agents/agent_owl_generator/engine.py`
+   - Add `MAX_OWL_EVAL_ROUNDS` (bounded Evaluator retry cap) and
+     `_evaluate_ontology_stage()` — parses the Turtle, runs the deterministic
+     Tier-1 ontology checks, and returns a retry-hint string on hard defects
+     (orphan / dangling domain-range / naming / duplicate). Fails open: any
+     parse/dep error returns `None` so a check failure never blocks delivery.
+   - Wire the Evaluator into the agent loop after the pitfall loop; only retry
+     when an iteration remains, so a usable ontology is never discarded by
+     exhausting `MAX_ITERATIONS`.
+   - Raise `max_tokens` to `MAX_OUTPUT_TOKENS = 16000` so exhaustive attribute
+     coverage isn't silently truncated past the old 4096 ceiling.
+   - Strengthen the system prompt: `# ATTRIBUTE COVERAGE` section + a
+     `get_table_detail`-per-table workflow step driving exhaustive (not curated)
+     datatype-property coverage.
+2. `src/agents/pge_eval/__init__.py` — new package (minimal root; importers
+   depend on the concrete submodule to avoid coupling to later modules).
+3. `src/agents/pge_eval/normalize.py` — shared name/metadata/ontology
+   normalization primitives (stdlib-only).
+4. `src/agents/pge_eval/ontology_metrics.py` — `evaluate_ontology()`:
+   deterministic Stage-1 checks + footprint coverage, no stored reference.
+5. Tests: `tests/units/pge_eval/{__init__,_fixtures}.py`,
+   `test_ontology_metrics.py`, `test_owl_evaluator_stage.py`.
+
+## Modified / added files
+
+- M src/agents/agent_owl_generator/engine.py
+- A src/agents/pge_eval/__init__.py
+- A src/agents/pge_eval/normalize.py
+- A src/agents/pge_eval/ontology_metrics.py
+- A tests/units/pge_eval/__init__.py
+- A tests/units/pge_eval/_fixtures.py
+- A tests/units/pge_eval/test_ontology_metrics.py
+- A tests/units/pge_eval/test_owl_evaluator_stage.py
+
+## Tests
+
+`uv run pytest tests/units/pge_eval/test_ontology_metrics.py
+tests/units/pge_eval/test_owl_evaluator_stage.py
+tests/units/ontology/test_owl_generator.py -q` → **39 passed**.
+# 2026-06-25 — feat(mapping): PGE loop for entity/relationship mapping
+
+## Context
+
+Entity/relationship mapping previously ran through `agent_auto_assignment` —
+a single-agent "implementer marks its own homework" loop with no planning or
+independent evaluation. This change introduces `agent_mapping_pge`, a
+Planner→Generator→Evaluator (PGE) mapping engine, **additively**: the original
+`agent_auto_assignment` engine is retained and still reachable via
+`AgentClient.run_auto_assignment`, so a downstream orchestrator can choose which
+engine to run.
+
+The PGE engine plans a source model, generates entity and relationship SQL per
+ontology item, and gates each with a deterministic evaluator + a semantic
+critic. Coverage is engine-enforced (computed from the ontology, not left to LLM
+discretion), with abstract-superclass UNION derivation and a synthetic-endpoint
+fallback so a single failed hub never cascades to drop all relationships.
+
+## Changes
+
+1. NEW package `src/agents/agent_mapping_pge/` — Planner (`planner.py`),
+   generators (`generators/{entity,relationship}.py`), evaluator
+   (`evaluator/{deterministic,critic,report}.py`), engine orchestrator
+   (`engine.py`, bounded ThreadPool walk + monotonic progress), `contracts.py`
+   (SourceModel/EvalReport), and `coverage.py` (deterministic ontology-derived
+   coverage; `skip[]` is advisory and never removes an item).
+2. NEW `src/agents/tools/planner.py` + `src/agents/tools/evaluation.py` —
+   planner/evaluation terminal tools (submit_source_model, submit_evaluation,
+   normalized_value_overlap) used by the PGE agents.
+3. `src/agents/tools/context.py` — ADD `source_model` + `semantic_eval_report`
+   fields (forward-ref typed to avoid a circular import). `warehouse_id` and all
+   existing fields are preserved.
+4. `src/agents/tools/mapping.py` — additive PGE tool-schema plumbing
+   (`unmapped_attributes`, `MAPPING_TOOL_DEFINITIONS_BY_NAME`).
+5. `src/back/core/agents/AgentClient.py` — ADD `run_mapping_pge()` gateway
+   (→ `agent_mapping_pge`). `run_auto_assignment()` is unchanged and still
+   points at `agent_auto_assignment` (the simple engine is retained).
+6. `src/back/objects/mapping/Mapping.py` — run the PGE engine in the auto-assign
+   flow and accumulate the PGE extras (`source_model`, `mapping_evaluations`,
+   `mapping_run_log`) across chunks and single-item runs;
+   `save_mappings_to_session` gains three OPTIONAL params (default `None`, so the
+   legacy path is unaffected). The upstream `_canonicalize_imported_uris` helper
+   is preserved.
+7. Tests: `tests/agents/agent_mapping_pge/` — contracts, coverage, planner,
+   entity/relationship generators, deterministic evaluator, critic, engine.
+
+## Modified / added files
+
+27 files changed, 12047 insertions(+), 8 deletions(-). New `agent_mapping_pge`
+package (12 modules) + 2 new tools + 9 test modules; 4 additive modifications
+(`context.py`, `mapping.py`, `AgentClient.py`, `Mapping.py`).
+
+## Tests
+
+- `uv run pytest tests/agents/agent_mapping_pge -q` → **90 passed**.
+- `uv run pytest tests/units/agents tests/units/mapping -q` → **208 passed**.
+- Imports resolve on the upstream base (origin/master, v0.5.2).
+
+# 2026-06-25 — feat(agents): Agent Bricks Supervisor for engine selection
+
+## Context
+
+PR1 (ontology PGE) and PR2 (mapping PGE) introduce the heavyweight PGE engines
+alongside the retained simple engine. This change adds the orchestration layer:
+a Databricks **Agent Bricks Multi-Agent Supervisor (MAS)** that, per domain,
+**deterministically** assesses complexity and routes the mapping task to the PGE
+engine (`agent_mapping_pge`) or the simple engine (`agent_auto_assignment`).
+
+Routing is the requested hybrid: a deterministic Unity Catalog function provides
+the hard recommendation, and the supervisor's natural-language instructions act
+on it. (Stacked on PR1 + PR2.)
+
+## Changes
+
+1. NEW `src/agents/agent_supervisor/`:
+   - `complexity.py` — `ComplexityAssessor`: weighted, deterministic score over
+     #tables, #columns, #classes, #relationships, cross-source key-sharing, and
+     schema-naming heterogeneity → tier + recommended engine. Reuses
+     `pge_eval.normalize` for input parsing.
+   - `engine.py` — `SupervisorEngine`: assess → select → dispatch via
+     `AgentClient` (mapping has the genuine PGE-vs-simple choice; ontology uses
+     the single owl-generator).
+   - `responses_agent.py` — `MappingEngineResponsesAgent`: MLflow ResponsesAgent
+     serving one engine per endpoint (`assess`/`run` modes; long runs handled by
+     the caller as a task).
+   - `mas.py` — `SupervisorProvisioner.build_config` (pure) + `provision`; the
+     MAS wires the complexity UC function + the two engine endpoints with NL
+     routing instructions.
+   - `uc_function.sql` — `assess_domain_complexity` UC function, a self-contained
+     mirror of `complexity.py` (constants guarded by `test_uc_function_parity`).
+   - `log_model.py` — logs both engine endpoints.
+2. `scripts/provision_supervisor.py` — end-to-end provisioning orchestration.
+3. `.planning/agents/agent_supervisor/SPEC.md` + eval dataset
+   `tests/eval/datasets/agent_supervisor/baseline.jsonl` (20 examples).
+4. Tests: `tests/agents/agent_supervisor/{test_complexity,test_engine}.py`.
+
+## Tests
+
+- `uv run pytest tests/agents/agent_supervisor -q` → **35 passed** (incl. baseline
+  routing-accuracy 20/20 and Python↔SQL constant parity).
+- Full stacked-branch regression `tests/agents tests/units/{agents,mapping,pge_eval,ontology}`
+  → **759 passed, 11 skipped**.
+
+# 2026-06-25 — refactor(agents): simplify supervisor engine for reviewability
+
+## Context
+
+Post-review simplification pass on the new supervisor code (behavior-preserving).
+
+## Changes
+
+1. `src/agents/agent_supervisor/engine.py` — Remove Middle Man: deleted the
+   `_run_mapping(**kw)` indirection that packed→unpacked→repacked identical
+   kwargs; `run()` now selects `run_mapping_pge` vs `run_auto_assignment` inline
+   (−15 lines), keeping the dispatch decision beside its call.
+2. `src/agents/agent_supervisor/responses_agent.py` — moved the `assess()` call
+   into the `assess` branch that consumes it (the `run` path recomputes it in
+   `SupervisorEngine.run`), inlined a one-use local, tightened `_text_event`
+   `custom_outputs` to `Optional[dict]`.
+
+## Tests
+
+`uv run pytest tests/agents/agent_supervisor -q` → **35 passed** (unchanged).
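The changelog describes `ComplexityAssessor` as a weighted, deterministic score over table, column, class and relationship counts plus cross-source key-sharing and naming heterogeneity. A minimal sketch of that shape follows. The weights, the threshold, the tier labels and the signal key names are all hypothetical; the real constants live in `complexity.py` and `uc_function.sql` and are not shown here.

```python
from typing import Dict

# HYPOTHETICAL weights and threshold, for illustration only. The real values
# live in complexity.py and uc_function.sql and are not shown in this PR text.
WEIGHTS: Dict[str, float] = {
    "tables": 2.0,
    "columns": 0.05,
    "classes": 0.5,
    "relationships": 0.5,
    "shared_keys": 5.0,           # cross-source key-sharing
    "naming_heterogeneity": 3.0,  # 0.0 (uniform) .. 1.0 (mixed conventions)
}
PGE_THRESHOLD = 15.0


def assess(signals: Dict[str, float]) -> dict:
    """Score a domain from its signals and recommend an engine.

    Pure function: no randomness, clock, or I/O, so equal inputs give equal outputs.
    """
    # Iterate in sorted key order so the float summation order never varies.
    raw = sum(WEIGHTS[k] * float(signals.get(k, 0)) for k in sorted(WEIGHTS))
    score = round(raw, 2)
    is_complex = score >= PGE_THRESHOLD
    return {
        "score": score,
        "tier": "complex" if is_complex else "simple",  # tier labels are assumed
        "recommended_engine": "pge" if is_complex else "simple",
        "signals": {k: signals.get(k, 0) for k in sorted(WEIGHTS)},
        "rationale": f"score {score} vs threshold {PGE_THRESHOLD}",
    }


# Example inputs loosely modelled on success criteria 1 and 2 of the spec.
multi_source = {"tables": 3, "columns": 60, "classes": 17, "relationships": 12,
                "shared_keys": 1, "naming_heterogeneity": 0.5}
single_table = {"tables": 1, "columns": 8, "classes": 2, "relationships": 1}
```

With these illustrative weights the three-source, 17-class domain lands above the threshold and the single-table, 2-class domain lands below it, which is the routing the spec's success criteria ask for. Keeping the function pure is what makes the `determinism` eval dimension checkable by simply calling it twice.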
@@ -0,0 +1,102 @@
+"""Provision the OntoBricks mapping Supervisor (Agent Bricks MAS) end to end.
+
+Run from the repo root after PR1+PR2 land. Steps:
+
+1. Register the deterministic complexity UC function from ``uc_function.sql``
+   (substituting ${CATALOG}/${SCHEMA}).
+2. Log the two mapping-engine ResponsesAgents (the operator deploys them as endpoints).
+3. Build the Supervisor (MAS) config and create/update it via Agent Bricks.
+
+This script does workspace I/O and is intended to run inside a configured
+Databricks environment (CLI profile or SP creds). It is deliberately thin — the
+testable logic lives in ``agents.agent_supervisor.{complexity,engine,mas}``.
+
+Usage::
+
+    CATALOG=fiifi_cdm_demo_catalog SCHEMA=ontobricks WAREHOUSE_ID=<warehouse-id> \\
+    PGE_ENDPOINT=ob-mapping-pge SIMPLE_ENDPOINT=ob-mapping-simple \\
+    python scripts/provision_supervisor.py
+"""
+
+import os
+import sys
+
+sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "src"))
+
+from agents.agent_supervisor.mas import SupervisorProvisioner  # noqa: E402
+from back.core.logging import get_logger  # noqa: E402
+
+logger = get_logger(__name__)
+
+
+def register_uc_function(catalog: str, schema: str, warehouse_id: str) -> None:
+    """Execute uc_function.sql with the catalog/schema substituted."""
+    from databricks import sql as dbsql  # local import: deploy-time dep
+
+    sql_path = os.path.join(
+        os.path.dirname(os.path.abspath(__file__)),
+        "..",
+        "src",
+        "agents",
+        "agent_supervisor",
+        "uc_function.sql",
+    )
+    with open(sql_path) as fh:
+        ddl = fh.read().replace("${CATALOG}", catalog).replace("${SCHEMA}", schema)
+
+    host = os.environ["DATABRICKS_HOST"].replace("https://", "").rstrip("/")
+    with dbsql.connect(
+        server_hostname=host,
+        http_path=f"/sql/1.0/warehouses/{warehouse_id}",
+        access_token=os.environ["DATABRICKS_TOKEN"],
+    ) as conn:
+        with conn.cursor() as cur:
+            cur.execute(ddl)
+    logger.info("Registered %s.%s.assess_domain_complexity", catalog, schema)
+
+
+def deploy_engine_endpoints(experiment: str) -> dict:
+    """Log both mapping-engine ResponsesAgents and return their target endpoint names."""
+    from agents.agent_supervisor.log_model import log_engine_agent
+
+    endpoints = {}
+    for engine, env_key, default in (
+        ("pge", "PGE_ENDPOINT", "ob-mapping-pge"),
+        ("simple", "SIMPLE_ENDPOINT", "ob-mapping-simple"),
+    ):
+        uri = log_engine_agent(engine, experiment)
+        endpoint = os.environ.get(env_key, default)
+        logger.info("Logged %s engine -> %s; deploy as endpoint %r", engine, uri, endpoint)
+        # Deployment to Model Serving (e.g. databricks.agents.deploy on the
+        # registered model version) is left to the operator so this script stays
+        # idempotent and credential-agnostic.
+        endpoints[engine] = endpoint
+    return endpoints
+
+
+def main() -> None:
+    catalog = os.environ.get("CATALOG", "main")
+    schema = os.environ.get("SCHEMA", "ontobricks")
+    warehouse_id = os.environ.get("WAREHOUSE_ID", "")
+    experiment = os.environ.get("ONTOBRICKS_MLFLOW_EXPERIMENT", "ontobricks-agents")
+
+    if warehouse_id:
+        register_uc_function(catalog, schema, warehouse_id)
+    else:
+        logger.warning("WAREHOUSE_ID unset — skipping UC function registration")
+
+    endpoints = deploy_engine_endpoints(experiment)
+
+    config = SupervisorProvisioner.build_config(
+        catalog=catalog,
+        schema=schema,
+        pge_endpoint=endpoints["pge"],
+        simple_endpoint=endpoints["simple"],
+    )
+    logger.info("Supervisor config built with %d agents", len(config["agents"]))
+    tile_id = SupervisorProvisioner.provision(config)
+    logger.info("Supervisor provisioned — tile_id=%s", tile_id)
+
+
+if __name__ == "__main__":
+    main()
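`register_uc_function` substitutes `${CATALOG}` and `${SCHEMA}` with plain `str.replace`, so a misspelled or newly added placeholder would be sent to the warehouse unrendered. A minimal sketch of a guard for that, plus a host normalizer that tolerates a trailing slash. The helper names `render_ddl` and `normalize_host` are hypothetical and not part of the PR:

```python
import re

# Matches any ${UPPER_CASE} placeholder that survived substitution.
_PLACEHOLDER = re.compile(r"\$\{[A-Z_]+\}")


def render_ddl(template: str, catalog: str, schema: str) -> str:
    """Substitute the two known placeholders and fail loudly on any leftover."""
    ddl = template.replace("${CATALOG}", catalog).replace("${SCHEMA}", schema)
    leftover = _PLACEHOLDER.findall(ddl)
    if leftover:
        raise ValueError(f"unrendered placeholders in DDL: {sorted(set(leftover))}")
    return ddl


def normalize_host(url: str) -> str:
    """Strip the scheme and any trailing slash, leaving a bare hostname."""
    return re.sub(r"^https?://", "", url.strip()).rstrip("/")
```

Plain `str.replace` is kept here instead of `string.Template`, because `Template` treats `$$` as an escape for a literal `$` and would therefore rewrite any `$$` delimiters in the SQL body.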