81 changes: 81 additions & 0 deletions .planning/agents/agent_supervisor/SPEC.md
@@ -0,0 +1,81 @@
# SPEC: agent_supervisor

> Required by `.cursor/12-ai-feature-lifecycle.mdc`.

## 1. Purpose

`agent_supervisor` is a Databricks Agent Bricks Multi-Agent Supervisor (MAS) that
orchestrates OntoBricks entity/relationship mapping. It deterministically scores a
domain's complexity (from source metadata + ontology) and routes the mapping task
to either the heavyweight PGE engine (`agent_mapping_pge`) or the original simple
single-agent engine (`agent_auto_assignment`). The routing decision is computed by
a Unity Catalog function (`assess_domain_complexity`) and acted on via the
supervisor's natural-language instructions.

## 2. Identity

| Field | Value |
|---|---|
| `agent_name` | `agent_supervisor` |
| `module_path` | `src/agents/agent_supervisor/` |
| `model_endpoint` | Agent Bricks MAS endpoint (provisioned via `mas.py`) |
| `temperature` | `0.0` (assessment is deterministic; routing is rule-driven) |
| `mlflow_experiment` | `/Shared/ontobricks/agents/supervisor` |

## 3. Tool surface

| Tool name | Input | Output | Purpose |
|---|---|---|---|
| `assess_domain_complexity` (UC fn) | `metadata_json`, `ontology_json` | JSON `{score, tier, recommended_engine, signals, rationale}` | Deterministic engine recommendation |
| `pge_mapping` (endpoint) | mapping `custom_inputs` | mapping result + PGE extras | Run `agent_mapping_pge` |
| `simple_mapping` (endpoint) | mapping `custom_inputs` | mapping result | Run `agent_auto_assignment` |

## 4. Success criteria

1. A 3-source domain sharing an NHS-number key with ~17 classes is routed to `pge`.
2. A single-table, 2-class domain is routed to `simple`.
3. The supervisor always calls `assess_domain_complexity` before routing and never
overrides its `recommended_engine`.
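
A minimal sketch of how a deterministic assessor could satisfy criteria 1 and 2, returning the JSON shape listed in §3. The weights, the threshold, the signal names and the `tier` labels are hypothetical and chosen only for illustration; the real values live in `complexity.py` and `uc_function.sql`.

```python
# Hypothetical weights and threshold for illustration only; the real values
# live in complexity.py and uc_function.sql and may differ.
WEIGHTS = {"tables": 2.0, "classes": 0.5, "relationships": 0.5, "shared_keys": 3.0}
PGE_THRESHOLD = 10.0


def assess(tables: int, classes: int, relationships: int, shared_keys: int) -> dict:
    """Score a domain and recommend an engine; a pure function of its inputs."""
    signals = {
        "tables": tables,
        "classes": classes,
        "relationships": relationships,
        "shared_keys": shared_keys,
    }
    score = sum(WEIGHTS[name] * value for name, value in signals.items())
    engine = "pge" if score >= PGE_THRESHOLD else "simple"
    return {
        "score": score,
        "tier": "complex" if engine == "pge" else "simple",  # hypothetical labels
        "recommended_engine": engine,
        "signals": signals,
        "rationale": f"score {score:.1f} vs threshold {PGE_THRESHOLD:.1f}",
    }


# Criterion 1: three sources, one shared key, ~17 classes.
multi_source = assess(tables=3, classes=17, relationships=12, shared_keys=1)
# Criterion 2: a single table with two classes.
single_table = assess(tables=1, classes=2, relationships=1, shared_keys=0)
```

Because the function has no randomness and no I/O, the same inputs always produce the same verdict, which is what the `determinism` dimension in §5 measures.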

## 5. Eval dimensions

| Dimension | Metric | Threshold | Weight | Judge |
|---|---|---|---|---|
| `routing_accuracy` | predicted engine == expected engine over the baseline set | `0.95` | `0.50` | rule-based (`complexity.assess`) |
| `determinism` | identical input yields identical recommendation across runs | `1.00` | `0.20` | rule-based |
| `assessor_called_first` | supervisor calls `assess_domain_complexity` before any engine | `1.00` | `0.20` | trace inspection |
| `latency_p95` | assessment seconds (excludes the engine run) | `<= 2.0` | `0.10` | wall-clock |

**Aggregate threshold:** ≥ `0.90` to pass.
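
The aggregate can be read as a weighted sum of the four dimensions. The sketch below assumes each dimension is first reduced to a score in [0, 1] (for example `1.0` when `latency_p95` meets its bound); that reduction is an assumption, not something this SPEC states.

```python
# Weights copied from the eval-dimension table in this section.
DIMENSION_WEIGHTS = {
    "routing_accuracy": 0.50,
    "determinism": 0.20,
    "assessor_called_first": 0.20,
    "latency_p95": 0.10,
}
AGGREGATE_THRESHOLD = 0.90


def aggregate(scores: dict) -> float:
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    return sum(weight * scores[name] for name, weight in DIMENSION_WEIGHTS.items())


def passes(scores: dict) -> bool:
    """True when the weighted aggregate reaches the declared threshold."""
    return aggregate(scores) >= AGGREGATE_THRESHOLD


# Routing at its minimum threshold, every other dimension fully met.
at_threshold = {
    "routing_accuracy": 0.95,
    "determinism": 1.0,
    "assessor_called_first": 1.0,
    "latency_p95": 1.0,
}
# Same run, but the recommendation differed between repeats.
non_deterministic = dict(at_threshold, determinism=0.0)
```

Under these assumptions the first run aggregates to roughly 0.975 and passes, while losing `determinism` alone drops the aggregate to roughly 0.775 and fails.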

## 6. Failure modes

| Symptom | Detection | Mitigation |
|---|---|---|
| Supervisor skips the assessor and guesses | trace shows no `assess_domain_complexity` call | strengthen instructions; the assessor verdict is authoritative |
| Complex domain routed to simple engine | `routing_accuracy` drop on cross-source cases | re-tune weights/threshold in `complexity.py` + `uc_function.sql` (keep in sync) |
| UC function / Python drift | `test_uc_function_parity` shared-constant check | edit both files together |
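
One way a shared-constant parity check could work is sketched below. It assumes the SQL file carries its constants as `NAME = value` lines (plausible if the UC function embeds a Python body, but not confirmed here); the constant names and values are hypothetical, and the actual `test_uc_function_parity` may be implemented differently.

```python
import re

# Hypothetical constants standing in for those defined in complexity.py.
PY_CONSTANTS = {"W_TABLES": 2.0, "W_SHARED_KEYS": 3.0, "PGE_THRESHOLD": 10.0}

# Hypothetical excerpt standing in for the body of uc_function.sql.
SQL_TEXT = """
-- constants mirrored from complexity.py
W_TABLES = 2.0
W_SHARED_KEYS = 3.0
PGE_THRESHOLD = 10.0
"""

# Matches one upper-case NAME = number assignment per line.
_ASSIGNMENT = re.compile(r"^[ \t]*([A-Z_]+)[ \t]*=[ \t]*(\d+(?:\.\d+)?)[ \t]*$", re.MULTILINE)


def sql_constants(sql_text: str) -> dict:
    """Extract NAME = number assignments from the SQL text."""
    return {name: float(value) for name, value in _ASSIGNMENT.findall(sql_text)}


def drifted(py_constants: dict, sql_text: str) -> list:
    """Names whose SQL value is missing or differs from the Python value."""
    found = sql_constants(sql_text)
    return sorted(name for name, value in py_constants.items() if found.get(name) != value)
```

An empty list from `drifted(PY_CONSTANTS, SQL_TEXT)` means the two files agree; any returned name points at the constant to fix in both places.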

## 7. Eval dataset

- **Baseline:** `tests/eval/datasets/agent_supervisor/baseline.jsonl` (≥20 examples;
mix of single-source/simple and multi-source/complex domains with the expected
engine).
- **Regression:** added on first production mis-route.
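
A sketch of how `routing_accuracy` could be computed over a JSONL baseline. The record fields `input` and `expected_engine` are hypothetical, since the dataset schema is not shown in this SPEC, and `recommend` stands in for whatever callable returns the engine name.

```python
import json
from typing import Callable, Iterable


def routing_accuracy(jsonl_lines: Iterable[str], recommend: Callable[[dict], str]) -> float:
    """Fraction of examples whose recommended engine equals the expected one."""
    total = 0
    hits = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dataset file
        example = json.loads(line)
        total += 1
        # "input" and "expected_engine" are assumed field names.
        if recommend(example["input"]) == example["expected_engine"]:
            hits += 1
    return hits / total if total else 0.0


# Two in-memory examples standing in for baseline.jsonl.
SAMPLE = [
    '{"input": {"tables": 3}, "expected_engine": "pge"}',
    '{"input": {"tables": 1}, "expected_engine": "simple"}',
]


def by_table_count(domain: dict) -> str:
    """Toy recommender used only to exercise the metric."""
    return "pge" if domain["tables"] > 1 else "simple"
```

In a real run the lines would come from opening `tests/eval/datasets/agent_supervisor/baseline.jsonl`, and the result would be compared against the `0.95` threshold in §5.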

## 8. MLflow tracing

The mapping-engine ResponsesAgents (`responses_agent.py`) trace via the shared
MLflow `ResponsesAgent` plumbing; the assessment is logged at INFO. The MAS
endpoint is traced by Agent Bricks.

## 9. Plan reference

`docs/plans/2026-06-25-goal-loop-and-pge-eval-design.md` (PGE family) + the PR-split
plan tracked in session memory.

## 10. Sign-off

- [x] Sections 4, 5, 6, 7 filled.
- [ ] Baseline eval run URI pasted into PR body.
- [x] Aggregate threshold declared in §5.
181 changes: 181 additions & 0 deletions changelogs/v0.5.2/FiifiB_2026-06-25.log
@@ -0,0 +1,181 @@
# 2026-06-25 — feat(ontology): PGE Evaluator stage for owl-generator

## Context

The owl-generator agent had a single-shot generation + a pitfall-tool fix loop,
but no deterministic Evaluator stage — so structural defects (orphan classes,
dangling domain/range, naming violations, duplicate classes) could survive into
the delivered ontology. This change turns owl-generation into a real
Planner→Generator→Evaluator (PGE) loop: after the pitfall loop settles, a
deterministic Stage-1 evaluator scores the ontology against the source metadata
and feeds concrete retry-hints back to the generator, bounded by a hard cap.

The Evaluator reuses a small, usecase-agnostic ontology-metrics module
(`agents.pge_eval.ontology_metrics`) — gold-free, computed purely from the
generated ontology + source schema. Only the ontology slice of the metrics
package is introduced here; the full scorecard/CLI lands separately.
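
The bounded, fail-open Evaluator loop described above can be sketched as follows. `generate` and `check` are hypothetical callables standing in for the LLM generator and the deterministic checks, and the cap value shown is illustrative; only the constant name `MAX_OWL_EVAL_ROUNDS` comes from this change.

```python
from typing import Callable, List, Optional

MAX_OWL_EVAL_ROUNDS = 2  # illustrative value; the real cap is set in engine.py


def evaluate_stage(ontology: str, check: Callable[[str], List[str]]) -> Optional[str]:
    """Return a retry hint on hard defects, or None when clean or on any error."""
    try:
        defects = check(ontology)
    except Exception:
        return None  # fail open: a broken check never blocks delivery
    if not defects:
        return None
    return "Fix these defects: " + "; ".join(defects)


def generate_with_evaluator(
    generate: Callable[[Optional[str]], str],
    check: Callable[[str], List[str]],
    max_rounds: int = MAX_OWL_EVAL_ROUNDS,
) -> str:
    """Generate once, then retry with hints at most max_rounds times."""
    ontology = generate(None)
    for _ in range(max_rounds):
        hint = evaluate_stage(ontology, check)
        if hint is None:
            break  # clean ontology, or the evaluator failed open
        ontology = generate(hint)
    return ontology
```

The loop always returns the most recent ontology, so exhausting the cap still delivers a result instead of raising.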

## Changes

1. `src/agents/agent_owl_generator/engine.py`
   - Add `MAX_OWL_EVAL_ROUNDS` (bounded Evaluator retry cap) and
     `_evaluate_ontology_stage()` — parses the Turtle, runs the deterministic
     Tier-1 ontology checks, and returns a retry-hint string on hard defects
     (orphan / dangling domain-range / naming / duplicate). Fails open: any
     parse/dep error returns `None` so a check failure never blocks delivery.
   - Wire the Evaluator into the agent loop after the pitfall loop; only retry
     when an iteration remains, so a usable ontology is never discarded by
     exhausting `MAX_ITERATIONS`.
   - Raise `max_tokens` to `MAX_OUTPUT_TOKENS = 16000` so exhaustive attribute
     coverage isn't silently truncated past the old 4096 ceiling.
   - Strengthen the system prompt: `# ATTRIBUTE COVERAGE` section + a
     `get_table_detail`-per-table workflow step driving exhaustive (not curated)
     datatype-property coverage.
2. `src/agents/pge_eval/__init__.py` — new package (minimal root; importers
   depend on the concrete submodule to avoid coupling to later modules).
3. `src/agents/pge_eval/normalize.py` — shared name/metadata/ontology
   normalization primitives (stdlib-only).
4. `src/agents/pge_eval/ontology_metrics.py` — `evaluate_ontology()`:
   deterministic Stage-1 checks + footprint coverage, no stored reference.
5. Tests: `tests/units/pge_eval/{__init__,_fixtures}.py`,
   `test_ontology_metrics.py`, `test_owl_evaluator_stage.py`.

## Modified / added files

- M src/agents/agent_owl_generator/engine.py
- A src/agents/pge_eval/__init__.py
- A src/agents/pge_eval/normalize.py
- A src/agents/pge_eval/ontology_metrics.py
- A tests/units/pge_eval/__init__.py
- A tests/units/pge_eval/_fixtures.py
- A tests/units/pge_eval/test_ontology_metrics.py
- A tests/units/pge_eval/test_owl_evaluator_stage.py

## Tests

`uv run pytest tests/units/pge_eval/test_ontology_metrics.py
tests/units/pge_eval/test_owl_evaluator_stage.py
tests/units/ontology/test_owl_generator.py -q` → **39 passed**.

# 2026-06-25 — feat(mapping): PGE loop for entity/relationship mapping

## Context

Entity/relationship mapping previously ran through `agent_auto_assignment` —
a single-agent "implementer marks its own homework" loop with no planning or
independent evaluation. This change introduces `agent_mapping_pge`, a
Planner→Generator→Evaluator (PGE) mapping engine, **additively**: the original
`agent_auto_assignment` engine is retained and still reachable via
`AgentClient.run_auto_assignment`, so a downstream orchestrator can choose which
engine to run.

The PGE engine plans a source-model, generates entity and relationship SQL per
ontology item, and gates each with a deterministic evaluator + a semantic
critic. Coverage is engine-enforced (computed from the ontology, not left to LLM
discretion), with abstract-superclass UNION derivation and a synthetic-endpoint
fallback so a single failed hub never cascades to drop all relationships.
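
To illustrate what engine-enforced coverage means here, the sketch below derives the required items from the ontology alone and treats the planner's skip list as a note, not a filter. The function names and data shapes are hypothetical and are not taken from `coverage.py`.

```python
from typing import Iterable, List, Tuple


def required_items(
    classes: Iterable[str],
    relationships: Iterable[str],
    skip: Iterable[str] = (),
) -> Tuple[List[str], List[str]]:
    """Return (required, advisory_skips); skipped items remain required."""
    required = sorted(set(classes) | set(relationships))
    # The skip list is only recorded; it never shrinks the required set.
    advisory = sorted(set(skip) & set(required))
    return required, advisory


def uncovered(required: Iterable[str], mapped: Iterable[str]) -> List[str]:
    """Items the ontology requires that no mapping has been produced for."""
    done = set(mapped)
    return [item for item in required if item not in done]


required, advisory = required_items(
    classes=["Patient", "Encounter"],
    relationships=["hasEncounter"],
    skip=["Encounter"],
)
missing = uncovered(required, mapped=["Patient"])
```

Here `Encounter` is flagged as an advisory skip yet still appears in `missing`, so the engine keeps asking for it.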

## Changes

1. NEW package `src/agents/agent_mapping_pge/` — Planner (`planner.py`),
generators (`generators/{entity,relationship}.py`), evaluator
(`evaluator/{deterministic,critic,report}.py`), engine orchestrator
(`engine.py`, bounded ThreadPool walk + monotonic progress), `contracts.py`
(SourceModel/EvalReport), and `coverage.py` (deterministic ontology-derived
coverage; `skip[]` is advisory and never removes an item).
2. NEW `src/agents/tools/planner.py` + `src/agents/tools/evaluation.py` —
planner/evaluation terminal tools (submit_source_model, submit_evaluation,
normalized_value_overlap) used by the PGE agents.
3. `src/agents/tools/context.py` — ADD `source_model` + `semantic_eval_report`
fields (forward-ref typed to avoid a circular import). `warehouse_id` and all
existing fields are preserved.
4. `src/agents/tools/mapping.py` — additive PGE tool-schema plumbing
(`unmapped_attributes`, `MAPPING_TOOL_DEFINITIONS_BY_NAME`).
5. `src/back/core/agents/AgentClient.py` — ADD `run_mapping_pge()` gateway
(→ `agent_mapping_pge`). `run_auto_assignment()` is unchanged and still
points at `agent_auto_assignment` (the simple engine is retained).
6. `src/back/objects/mapping/Mapping.py` — run the PGE engine in the auto-assign
flow and accumulate the PGE extras (`source_model`, `mapping_evaluations`,
`mapping_run_log`) across chunks and single-item runs;
`save_mappings_to_session` gains three OPTIONAL params (default `None`, so the
legacy path is unaffected). The upstream `_canonicalize_imported_uris` helper
is preserved.
7. Tests: `tests/agents/agent_mapping_pge/` — contracts, coverage, planner,
entity/relationship generators, deterministic evaluator, critic, engine.

## Modified / added files

27 files changed, 12047 insertions(+), 8 deletions(-). New `agent_mapping_pge`
package (12 modules) + 2 new tools + 9 test modules; 4 additive modifications
(`context.py`, `mapping.py`, `AgentClient.py`, `Mapping.py`).

## Tests

- `uv run pytest tests/agents/agent_mapping_pge -q` → **90 passed**.
- `uv run pytest tests/units/agents tests/units/mapping -q` → **208 passed**.
- Imports resolve on the upstream base (origin/master, v0.5.2).

# 2026-06-25 — feat(agents): Agent Bricks Supervisor for engine selection

## Context

PR1 (ontology PGE) and PR2 (mapping PGE) introduce the heavyweight PGE engines
alongside the retained simple engine. This change adds the orchestration layer:
a Databricks **Agent Bricks Multi-Agent Supervisor (MAS)** that, per domain,
**deterministically** assesses complexity and routes the mapping task to the PGE
engine (`agent_mapping_pge`) or the simple engine (`agent_auto_assignment`).

Routing is the requested hybrid: a deterministic Unity Catalog function provides
the hard recommendation, and the supervisor's natural-language instructions act
on it. (Stacked on PR1 + PR2.)

## Changes

1. NEW `src/agents/agent_supervisor/`:
   - `complexity.py` — `ComplexityAssessor`: weighted, deterministic score over
     #tables, #columns, #classes, #relationships, cross-source key-sharing, and
     schema-naming heterogeneity → tier + recommended engine. Reuses
     `pge_eval.normalize` for input parsing.
   - `engine.py` — `SupervisorEngine`: assess → select → dispatch via
     `AgentClient` (mapping has the genuine PGE-vs-simple choice; ontology uses
     the single owl-generator).
   - `responses_agent.py` — `MappingEngineResponsesAgent`: MLflow ResponsesAgent
     serving one engine per endpoint (`assess`/`run` modes; long runs handled by
     the caller as a task).
   - `mas.py` — `SupervisorProvisioner.build_config` (pure) + `provision`; the
     MAS wires the complexity UC function + the two engine endpoints with NL
     routing instructions.
   - `uc_function.sql` — `assess_domain_complexity` UC function, a self-contained
     mirror of `complexity.py` (constants guarded by `test_uc_function_parity`).
   - `log_model.py` — logs both engine endpoints.
2. `scripts/provision_supervisor.py` — end-to-end provisioning orchestration.
3. `.planning/agents/agent_supervisor/SPEC.md` + eval dataset
`tests/eval/datasets/agent_supervisor/baseline.jsonl` (20 examples).
4. Tests: `tests/agents/agent_supervisor/{test_complexity,test_engine}.py`.
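
The assess → select → dispatch flow in `engine.py` can be sketched roughly as below. The callables passed in are hypothetical stand-ins for the assessor and for `AgentClient.run_mapping_pge` / `AgentClient.run_auto_assignment`; the real `SupervisorEngine` signature is not shown in this log.

```python
from typing import Callable, Dict


def run_mapping(
    domain: dict,
    assess: Callable[[dict], dict],
    engines: Dict[str, Callable[[dict], dict]],
) -> dict:
    """Assess the domain, then dispatch to the engine the assessor recommends."""
    verdict = assess(domain)
    engine_name = verdict["recommended_engine"]
    if engine_name not in engines:
        raise ValueError(f"unknown engine: {engine_name!r}")
    # The verdict is authoritative: no second-guessing between assess and dispatch.
    result = engines[engine_name](domain)
    return {"engine": engine_name, "assessment": verdict, "result": result}


# Stand-in engines and assessor, used only to exercise the dispatch.
ENGINES = {
    "pge": lambda domain: {"ran": "agent_mapping_pge"},
    "simple": lambda domain: {"ran": "agent_auto_assignment"},
}


def toy_assess(domain: dict) -> dict:
    """Toy rule: more than one source table means the PGE engine."""
    return {"recommended_engine": "pge" if domain["tables"] > 1 else "simple"}
```

Keeping the assessor as an injected callable means the routing rule can be unit-tested without any workspace access.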

## Tests

- `uv run pytest tests/agents/agent_supervisor -q` → **35 passed** (incl. baseline
routing-accuracy 20/20 and Python↔SQL constant parity).
- Full stacked-branch regression `tests/agents tests/units/{agents,mapping,pge_eval,ontology}`
→ **759 passed, 11 skipped**.

# 2026-06-25 — refactor(agents): simplify supervisor engine for reviewability

## Context

Post-review simplification pass on the new supervisor code (behavior-preserving).

## Changes

1. `src/agents/agent_supervisor/engine.py` — Remove Middle Man: deleted the
`_run_mapping(**kw)` indirection that packed→unpacked→repacked identical
kwargs; `run()` now selects `run_mapping_pge` vs `run_auto_assignment` inline
(−15 lines), keeping the dispatch decision beside its call.
2. `src/agents/agent_supervisor/responses_agent.py` — moved the `assess()` call
into the `assess` branch that consumes it (the `run` path recomputes it in
`SupervisorEngine.run`), inlined a one-use local, tightened `_text_event`
`custom_outputs` to `Optional[dict]`.

## Tests

`uv run pytest tests/agents/agent_supervisor -q` → **35 passed** (unchanged).
102 changes: 102 additions & 0 deletions scripts/provision_supervisor.py
@@ -0,0 +1,102 @@
"""Provision the OntoBricks mapping Supervisor (Agent Bricks MAS) end to end.

Run from the repo root after PR1+PR2 land. Steps:

1. Register the deterministic complexity UC function from ``uc_function.sql``
(substituting ${CATALOG}/${SCHEMA}).
2. Log + deploy the two mapping-engine ResponsesAgents as Model Serving endpoints.
3. Build the Supervisor (MAS) config and create/update it via Agent Bricks.

This script does workspace I/O and is intended to run inside a configured
Databricks environment (CLI profile or SP creds). It is deliberately thin — the
testable logic lives in ``agents.agent_supervisor.{complexity,engine,mas}``.

Usage::

    CATALOG=fiifi_cdm_demo_catalog SCHEMA=ontobricks \\
    PGE_ENDPOINT=ob-mapping-pge SIMPLE_ENDPOINT=ob-mapping-simple \\
    python scripts/provision_supervisor.py
"""

import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "src"))

from agents.agent_supervisor.mas import SupervisorProvisioner # noqa: E402
from back.core.logging import get_logger # noqa: E402

logger = get_logger(__name__)


def register_uc_function(catalog: str, schema: str, warehouse_id: str) -> None:
    """Execute uc_function.sql with the catalog/schema substituted."""
    from databricks import sql as dbsql  # local import: deploy-time dep

    sql_path = os.path.join(
        os.path.dirname(os.path.abspath(__file__)),
        "..",
        "src",
        "agents",
        "agent_supervisor",
        "uc_function.sql",
    )
    with open(sql_path) as fh:
        ddl = fh.read().replace("${CATALOG}", catalog).replace("${SCHEMA}", schema)

    host = os.environ["DATABRICKS_HOST"].replace("https://", "")
    with dbsql.connect(
        server_hostname=host,
        http_path=f"/sql/1.0/warehouses/{warehouse_id}",
        access_token=os.environ["DATABRICKS_TOKEN"],
    ) as conn:
        with conn.cursor() as cur:
            cur.execute(ddl)
    logger.info("Registered %s.%s.assess_domain_complexity", catalog, schema)


def deploy_engine_endpoints(experiment: str) -> dict:
    """Log + deploy both mapping-engine ResponsesAgents. Returns endpoint names."""
    from agents.agent_supervisor.log_model import log_engine_agent

    endpoints = {}
    for engine, env_key, default in (
        ("pge", "PGE_ENDPOINT", "ob-mapping-pge"),
        ("simple", "SIMPLE_ENDPOINT", "ob-mapping-simple"),
    ):
        uri = log_engine_agent(engine, experiment)
        endpoint = os.environ.get(env_key, default)
        logger.info("Logged %s engine -> %s; deploy as endpoint %r", engine, uri, endpoint)
        # Deployment to Model Serving is done via databricks.agents.deploy(uri,
        # endpoint) or the agents SDK; left to the operator so this script stays
        # idempotent and credential-agnostic.
        endpoints[engine] = endpoint
    return endpoints


def main() -> None:
    catalog = os.environ.get("CATALOG", "main")
    schema = os.environ.get("SCHEMA", "ontobricks")
    warehouse_id = os.environ.get("WAREHOUSE_ID", "")
    experiment = os.environ.get("ONTOBRICKS_MLFLOW_EXPERIMENT", "ontobricks-agents")

    if warehouse_id:
        register_uc_function(catalog, schema, warehouse_id)
    else:
        logger.warning("WAREHOUSE_ID unset — skipping UC function registration")

    endpoints = deploy_engine_endpoints(experiment)

    config = SupervisorProvisioner.build_config(
        catalog=catalog,
        schema=schema,
        pge_endpoint=endpoints["pge"],
        simple_endpoint=endpoints["simple"],
    )
    logger.info("Supervisor config built with %d agents", len(config["agents"]))
    tile_id = SupervisorProvisioner.provision(config)
    logger.info("Supervisor provisioned — tile_id=%s", tile_id)


if __name__ == "__main__":
    main()