diff --git a/docs/osi-compatibility.md b/docs/osi-compatibility.md new file mode 100644 index 00000000..e03c56d3 --- /dev/null +++ b/docs/osi-compatibility.md @@ -0,0 +1,115 @@ +# BSL <> OSI Compatibility Gap Analysis + +**OSI Spec Version**: 0.1.1 (2025-12-11) +**BSL Version**: 0.3.11 + +## Overview + +The [Open Semantic Interchange (OSI)](https://github.com/open-semantic-interchange/OSI) spec defines +a vendor-neutral YAML format for semantic model definitions. This document maps BSL's current format +to OSI and identifies gaps that need bridging for full compatibility. + +## Structural Differences + +| Aspect | OSI | BSL | Status | +|--------|-----|-----|--------| +| Top-level wrapper | `semantic_model: [{name, datasets, ...}]` | Flat model-per-key | DIFFERENT | +| Version field | `version: "0.1.1"` (required) | None | MISSING | +| Dataset/model | `datasets` array with `name` + `source` | Top-level keys with `table` | DIFFERENT | +| Fields | `fields` array (name + expression object) | `dimensions` dict with `_.expr` syntax | DIFFERENT | +| Metrics | `metrics` array at semantic_model level | `measures`/`calculated_measures` dicts at model level | DIFFERENT | +| Relationships | `relationships` array with `from`/`to` + column arrays | `joins` dict with `model`/`type`/`left_on`/`right_on` | DIFFERENT | + +## Field-Level Gaps + +### Missing in BSL (needed for OSI export) + +| OSI Field | Location | Description | Priority | +|-----------|----------|-------------|----------| +| `ai_context` | Every level | String or structured object with instructions/synonyms/examples | HIGH | +| `primary_key` | Dataset | Array of column names forming PK | MEDIUM | +| `unique_keys` | Dataset | Array of unique key arrays | LOW | +| `custom_extensions` | Every level | Vendor-specific metadata (`vendor_name` + `data` JSON) | LOW | +| `label` | Field | Categorization label (e.g., "filter") | LOW | +| Multi-dialect `expression` | Field/Metric | `dialects: [{dialect, expression}]` | MEDIUM | + +### BSL has, OSI doesn't + +| BSL Field | Description | Handling | +|-----------|-------------|----------| +| `profile` | Database connection config | Omit from OSI export | +| `filter` | Model-level filter expression | Store in `custom_extensions` | +| `is_entity` | Entity/PK marker on dimension | Map to `primary_key` on dataset | +| `is_event_timestamp` | Event timestamp marker | Store in `custom_extensions` | +| `smallest_time_grain` | Time granularity | Store in `custom_extensions` | +| `derived_dimensions` | Auto-derived time parts | Store in `custom_extensions` | +| `calculated_measures` | Derived metrics referencing other measures | Export as metrics with `custom_extensions` | +| `join.type` / `join.how` | Join cardinality + method | Inferred from relationship + `custom_extensions` | + +## Concept Mapping + +### Dimensions -> Fields + +``` +BSL dimension with is_time_dimension=true + -> OSI field with dimension.is_time=true + +BSL dimension with is_entity=true + -> OSI dataset.primary_key includes this field's column + +BSL dimension expression: _.column_name + -> OSI expression.dialects[{dialect: "ANSI_SQL", expression: "column_name"}] + +BSL computed dimension: _.first_name.concat(' ', _.last_name) + -> OSI expression: "first_name || ' ' || last_name" (SQL form) +``` + +### Measures -> Metrics + +``` +BSL measure: _.column.sum() + -> OSI metric expression: "SUM(dataset_name.column)" + +BSL measure: _.count() + -> OSI metric expression: "COUNT(*)" + +BSL calculated_measure: _.meas1 / _.meas2 + -> OSI metric with custom_extension noting it's derived +``` + +### Joins -> Relationships + +``` +BSL join: + carriers: + model: carriers + type: one + left_on: carrier + right_on: code + + -> OSI relationship: + name: flights_to_carriers + from: flights + to: carriers + from_columns: [carrier] + to_columns: [code] +``` + +## Implementation Plan + +### Phase 1: Bidirectional Converter (this PR) + +1. **`osi.py`** - New module with `to_osi()` and `from_osi()` functions +2. **`ai_context`** on `Dimension` and `Measure` - Optional field for round-trip fidelity +3. **Tests** - Round-trip conversion tests +4. **Example** - flights.yml converted to OSI format + +### Expression Translation Strategy + +BSL uses Ibis Deferred expressions (`_.column`), while OSI uses SQL strings. +For simple column references, extraction is straightforward. For complex expressions, +we serialize to ANSI SQL via Ibis's SQL compiler. + +**Simple**: `_.column_name` -> `"column_name"` +**Computed**: `_.first_name.concat(' ', _.last_name)` -> `"first_name || ' ' || last_name"` +**Aggregate**: `_.amount.sum()` -> `"SUM(amount)"` diff --git a/examples/flights_osi.yaml b/examples/flights_osi.yaml new file mode 100644 index 00000000..5d6260ae --- /dev/null +++ b/examples/flights_osi.yaml @@ -0,0 +1,436 @@ +# OSI (Open Semantic Interchange) version of the flights example +# Spec: https://github.com/open-semantic-interchange/OSI (v0.1.1) +# +# This is the OSI-compliant equivalent of flights.yml. +# Use from_osi_yaml() to load this into BSL: +# +# from boring_semantic_layer.osi import from_osi_yaml +# models = from_osi_yaml("flights_osi.yaml", tables=tables) + +version: "0.1.1" + +semantic_model: + - name: flights_analytics + description: Flight data analytics model with carriers, aircraft, and airports + ai_context: + instructions: > + Use this model for analyzing US domestic flight data including + delays, distances, carriers, aircraft, and airports. + examples: + - "What are the top 10 carriers by total flights?" + - "What is the average flight distance by origin airport?" + - "Show me the distribution of departure delays" + + datasets: + # ---------------------------------------------------------------- + # Carriers + # ---------------------------------------------------------------- + - name: carriers + source: carriers_tbl + primary_key: [code] + description: Airline carrier information + fields: + - name: code + expression: + dialects: + - dialect: ANSI_SQL + expression: code + dimension: + is_time: false + description: Airline carrier code + + - name: name + expression: + dialects: + - dialect: ANSI_SQL + expression: name + dimension: + is_time: false + description: Full airline name + + - name: nickname + expression: + dialects: + - dialect: ANSI_SQL + expression: nickname + dimension: + is_time: false + + # ---------------------------------------------------------------- + # Aircraft Models + # ---------------------------------------------------------------- + - name: aircraft_models + source: aircraft_models_tbl + primary_key: [aircraft_model_code] + description: Aircraft model specifications + fields: + - name: aircraft_model_code + expression: + dialects: + - dialect: ANSI_SQL + expression: aircraft_model_code + dimension: + is_time: false + + - name: manufacturer + expression: + dialects: + - dialect: ANSI_SQL + expression: manufacturer + dimension: + is_time: false + + - name: model + expression: + dialects: + - dialect: ANSI_SQL + expression: model + dimension: + is_time: false + + - name: aircraft_type_id + expression: + dialects: + - dialect: ANSI_SQL + expression: aircraft_type_id + dimension: + is_time: false + + - name: aircraft_engine_type_id + expression: + dialects: + - dialect: ANSI_SQL + expression: aircraft_engine_type_id + dimension: + is_time: false + + - name: aircraft_category_id + expression: + dialects: + - dialect: ANSI_SQL + expression: aircraft_category_id + dimension: + is_time: false + + # ---------------------------------------------------------------- + # Aircraft (Individual Planes) + # ---------------------------------------------------------------- + - name: aircraft + source: aircraft_tbl + primary_key: [tail_num] + description: Individual aircraft registry + fields: + - name: tail_num + expression: + dialects: + - dialect: ANSI_SQL + expression: tail_num + dimension: + is_time: false + + - name: aircraft_serial + expression: + dialects: + - dialect: ANSI_SQL + expression: aircraft_serial + dimension: + is_time: false + + - name: aircraft_model_code + expression: + dialects: + - dialect: ANSI_SQL + expression: aircraft_model_code + dimension: + is_time: false + + - name: aircraft_engine_code + expression: + dialects: + - dialect: ANSI_SQL + expression: aircraft_engine_code + dimension: + is_time: false + + - name: aircraft_type_id + expression: + dialects: + - dialect: ANSI_SQL + expression: aircraft_type_id + dimension: + is_time: false + + # ---------------------------------------------------------------- + # Airports + # ---------------------------------------------------------------- + - name: airports + source: airports_tbl + primary_key: [code] + description: Airport locations and information + ai_context: + synonyms: + - "airport" + - "airfield" + - "aerodrome" + fields: + - name: code + expression: + dialects: + - dialect: ANSI_SQL + expression: code + dimension: + is_time: false + + - name: city + expression: + dialects: + - dialect: ANSI_SQL + expression: city + dimension: + is_time: false + + - name: county + expression: + dialects: + - dialect: ANSI_SQL + expression: county + dimension: + is_time: false + + - name: state + expression: + dialects: + - dialect: ANSI_SQL + expression: state + dimension: + is_time: false + + - name: full_name + expression: + dialects: + - dialect: ANSI_SQL + expression: full_name + dimension: + is_time: false + + - name: latitude + expression: + dialects: + - dialect: ANSI_SQL + expression: latitude + dimension: + is_time: false + + - name: longitude + expression: + dialects: + - dialect: ANSI_SQL + expression: longitude + dimension: + is_time: false + + - name: elevation + expression: + dialects: + - dialect: ANSI_SQL + expression: elevation + dimension: + is_time: false + + # ---------------------------------------------------------------- + # Flights + # ---------------------------------------------------------------- + - name: flights + source: flights_tbl + description: Flight data with origin, destination, and metrics + ai_context: + synonyms: + - "flight records" + - "air travel" + - "flight segments" + fields: + - name: origin + expression: + dialects: + - dialect: ANSI_SQL + expression: origin + dimension: + is_time: false + description: Origin airport code + ai_context: + synonyms: + - "departure airport" + - "from" + + - name: destination + expression: + dialects: + - dialect: ANSI_SQL + expression: destination + dimension: + is_time: false + description: Destination airport code + ai_context: + synonyms: + - "arrival airport" + - "to" + + - name: carrier + expression: + dialects: + - dialect: ANSI_SQL + expression: carrier + dimension: + is_time: false + + - name: tail_num + expression: + dialects: + - dialect: ANSI_SQL + expression: tail_num + dimension: + is_time: false + + - name: dep_time + expression: + dialects: + - dialect: ANSI_SQL + expression: dep_time + dimension: + is_time: false + description: Departure timestamp + + - name: arr_time + expression: + dialects: + - dialect: ANSI_SQL + expression: arr_time + dimension: + is_time: true + description: Arrival timestamp + custom_extensions: + - vendor_name: COMMON + data: '{"is_event_timestamp": false, "smallest_time_grain": "TIME_GRAIN_DAY"}' + + - name: distance + expression: + dialects: + - dialect: ANSI_SQL + expression: distance + dimension: + is_time: false + description: Flight distance in miles + + relationships: + - name: flights_to_carriers + from: flights + to: carriers + from_columns: [carrier] + to_columns: [code] + + - name: aircraft_to_aircraft_models + from: aircraft + to: aircraft_models + from_columns: [aircraft_model_code] + to_columns: [aircraft_model_code] + + - name: flights_to_aircraft + from: flights + to: aircraft + from_columns: [tail_num] + to_columns: [tail_num] + + - name: flights_to_origin_airport + from: flights + to: airports + from_columns: [origin] + to_columns: [code] + + metrics: + - name: carrier_count + expression: + dialects: + - dialect: ANSI_SQL + expression: COUNT(*) + description: Number of carriers + + - name: model_count + expression: + dialects: + - dialect: ANSI_SQL + expression: COUNT(*) + description: Number of aircraft models + + - name: avg_seats + expression: + dialects: + - dialect: ANSI_SQL + expression: AVG(aircraft_models.seats) + description: Average number of seats + + - name: aircraft_count + expression: + dialects: + - dialect: ANSI_SQL + expression: COUNT(*) + description: Number of aircraft + + - name: airport_count + expression: + dialects: + - dialect: ANSI_SQL + expression: COUNT(*) + description: Number of airports + + - name: avg_elevation + expression: + dialects: + - dialect: ANSI_SQL + expression: AVG(airports.elevation) + description: Average airport elevation + + - name: flight_count + expression: + dialects: + - dialect: ANSI_SQL + expression: COUNT(*) + description: Total number of flights + ai_context: + synonyms: + - "number of flights" + - "total flights" + + - name: total_distance + expression: + dialects: + - dialect: ANSI_SQL + expression: SUM(flights.distance) + description: Total distance flown + + - name: avg_distance + expression: + dialects: + - dialect: ANSI_SQL + expression: AVG(flights.distance) + description: Average flight distance + + - name: max_distance + expression: + dialects: + - dialect: ANSI_SQL + expression: MAX(flights.distance) + description: Maximum flight distance + + - name: avg_delay + expression: + dialects: + - dialect: ANSI_SQL + expression: AVG(flights.dep_delay) + description: Average departure delay + + - name: max_delay + expression: + dialects: + - dialect: ANSI_SQL + expression: MAX(flights.dep_delay) + description: Maximum departure delay diff --git a/src/boring_semantic_layer/__init__.py b/src/boring_semantic_layer/__init__.py index fe92c358..608d05bd 100644 --- a/src/boring_semantic_layer/__init__.py +++ b/src/boring_semantic_layer/__init__.py @@ -43,6 +43,10 @@ from_config, from_yaml, ) +from .osi import ( + to_osi, + to_osi_yaml, +) __all__ = [ "to_semantic_table", @@ -57,6 +61,8 @@ "Measure", "from_config", "from_yaml", + "to_osi", + "to_osi_yaml", "MCPSemanticModel", "LangGraphBackend", "options", diff --git a/src/boring_semantic_layer/api.py b/src/boring_semantic_layer/api.py index b3564c20..6bfb2003 100644 --- a/src/boring_semantic_layer/api.py +++ b/src/boring_semantic_layer/api.py @@ -18,7 +18,10 @@ def to_semantic_table( - ibis_table: ir.Table, name: str | None = None, description: str | None = None + ibis_table: ir.Table, + name: str | None = None, + description: str | None = None, + ai_context: str | dict | None = None, ) -> SemanticModel: """Create a SemanticModel from an Ibis table. @@ -26,6 +29,8 @@ def to_semantic_table( ibis_table: An Ibis table expression (can be regular ibis or xorq vendored ibis) name: Optional name for the semantic table description: Optional description for the semantic table + ai_context: Optional AI context (string or structured object with + instructions, synonyms, examples) per the OSI specification Returns: A new SemanticModel wrapping the table @@ -41,6 +46,7 @@ def to_semantic_table( calc_measures=None, name=name, description=description, + ai_context=ai_context, ) diff --git a/src/boring_semantic_layer/expr.py b/src/boring_semantic_layer/expr.py index b08f9984..8ff5a4d0 100644 --- a/src/boring_semantic_layer/expr.py +++ b/src/boring_semantic_layer/expr.py @@ -433,6 +433,7 @@ def __init__( calc_measures: Mapping[str, Any] | None = None, name: str | None = None, description: str | None = None, + ai_context: str | dict | None = None, _source_join: Any | None = None, ) -> None: # Keep tables in regular ibis - only convert to xorq at execution time if needed @@ -452,6 +453,11 @@ def __init__( derived_name = name or _derive_name(table) + # Serialize dict ai_context to JSON string for ibis hashability + import json as _json + + _ai_ctx = _json.dumps(ai_context, sort_keys=True) if isinstance(ai_context, dict) else ai_context + op = SemanticTableOp( table=table, dimensions=dims, @@ -459,6 +465,7 @@ def __init__( calc_measures=calc_meas, name=derived_name, description=description, + ai_context=_ai_ctx, _source_join=_source_join, ) @@ -525,6 +532,7 @@ def with_dimensions(self, **dims) -> SemanticModel: calc_measures=self.get_calculated_measures(), name=self.name, description=self.description, + ai_context=self.op().ai_context, ) def with_measures(self, **meas) -> SemanticModel: @@ -548,6 +556,7 @@ def with_measures(self, **meas) -> SemanticModel: calc_measures=new_calc_meas, name=self.name, description=self.description, + ai_context=self.op().ai_context, ) def join_one( diff --git a/src/boring_semantic_layer/ops.py b/src/boring_semantic_layer/ops.py index 3702816f..73e62c1e 100644 --- a/src/boring_semantic_layer/ops.py +++ b/src/boring_semantic_layer/ops.py @@ -626,22 +626,24 @@ def _infer_unnest(fn: Callable, table: Any) -> tuple[str, ...]: return () -def _extract_measure_metadata(fn_or_expr: Any) -> tuple[Any, str | None, tuple]: +def _extract_measure_metadata(fn_or_expr: Any) -> tuple[Any, str | None, tuple, Any]: """Extract metadata from various measure representations.""" if isinstance(fn_or_expr, dict): return ( fn_or_expr["expr"], fn_or_expr.get("description"), tuple(fn_or_expr.get("requires_unnest", [])), + fn_or_expr.get("ai_context"), ) elif isinstance(fn_or_expr, Measure): return ( fn_or_expr.expr, fn_or_expr.description, fn_or_expr.requires_unnest, + fn_or_expr.ai_context, ) else: - return (fn_or_expr, None, ()) + return (fn_or_expr, None, (), None) _AGG_METHODS = frozenset({"sum", "mean", "avg", "count", "min", "max"}) @@ -733,6 +735,7 @@ def _make_base_measure( expr: Any, description: str | None, requires_unnest: tuple, + ai_context: Any = None, ) -> Measure: """Create a base measure with proper callable wrapping using functional patterns.""" @@ -788,6 +791,7 @@ def wrapped_expr(t): description=description, requires_unnest=requires_unnest, original_expr=raw_expr, + ai_context=ai_context, ) if callable(expr): @@ -806,6 +810,7 @@ def wrapped_expr(t): description=description, requires_unnest=requires_unnest, original_expr=raw_expr, + ai_context=ai_context, ) else: return Measure( @@ -813,12 +818,13 @@ def wrapped_expr(t): description=description, requires_unnest=requires_unnest, original_expr=raw_expr, + ai_context=ai_context, ) def _classify_measure(fn_or_expr: Any, scope: Any) -> tuple[str, Any]: """Classify measure as 'calc' or 'base' with appropriate handling.""" - expr, description, requires_unnest = _extract_measure_metadata(fn_or_expr) + expr, description, requires_unnest, ai_context = _extract_measure_metadata(fn_or_expr) resolved = safe(lambda: _resolve_expr(expr, scope))().map( lambda val: ("calc", val) if _is_calculated_measure(val) else None @@ -833,7 +839,7 @@ def _classify_measure(fn_or_expr: Any, scope: Any) -> tuple[str, Any]: inferred_unnest = _infer_unnest(expr, table) requires_unnest = requires_unnest or inferred_unnest - return ("base", _make_base_measure(expr, description, requires_unnest)) + return ("base", _make_base_measure(expr, description, requires_unnest, ai_context)) def _build_json_definition( @@ -953,6 +959,8 @@ class Dimension: is_event_timestamp: bool = False smallest_time_grain: str | None = None derived_dimensions: tuple[str, ...] = () + ai_context: str | dict | None = None + label: str | None = None def __call__(self, table: ir.Table, _dims: dict | None = None) -> ir.Value: try: @@ -990,6 +998,10 @@ def to_json(self) -> Mapping[str, Any]: base["smallest_time_grain"] = self.smallest_time_grain if self.derived_dimensions: base["derived_dimensions"] = list(self.derived_dimensions) + if self.ai_context: + base["ai_context"] = self.ai_context + if self.label: + base["label"] = self.label return base def __hash__(self) -> int: @@ -1011,6 +1023,7 @@ class Measure: description: str | None = None requires_unnest: tuple[str, ...] = () # Internal: Arrays that must be unnested original_expr: Any = field(default=None, eq=False, hash=False) + ai_context: str | dict | None = None def __call__(self, table: ir.Table) -> ir.Value: return self.expr.resolve(table) if _is_deferred(self.expr) else self.expr(table) @@ -1026,6 +1039,8 @@ def to_json(self) -> Mapping[str, Any]: base["locality"] = self.locality if self.requires_unnest: base["requires_unnest"] = list(self.requires_unnest) + if self.ai_context: + base["ai_context"] = self.ai_context return base def __hash__(self) -> int: @@ -1047,10 +1062,25 @@ class SemanticTableOp(Relation): calc_measures: FrozenDict[str, Any] name: str | None = None description: str | None = None + ai_context: str | None = None # JSON string when dict; plain string otherwise _source_join: Any = field( default=None, repr=False ) # Track if this wraps a join (SemanticJoinOp) for optimization + def get_ai_context(self) -> str | dict | None: + """Return ai_context, deserializing JSON-encoded dicts.""" + val = self.ai_context + if val is None: + return None + try: + import json + parsed = json.loads(val) + if isinstance(parsed, dict): + return parsed + except (json.JSONDecodeError, TypeError, ValueError): + pass + return val + def __init__( self, table: ir.Table, @@ -1059,6 +1089,7 @@ def __init__( calc_measures: dict[str, Any] | FrozenDict[str, Any], name: str | None = None, description: str | None = None, + ai_context: str | dict | None = None, _source_join: Any = None, ) -> None: # Accept both regular ibis and xorq tables without conversion @@ -1074,6 +1105,7 @@ def __init__( else calc_measures, name=name, description=description, + ai_context=ai_context, _source_join=_source_join, ) diff --git a/src/boring_semantic_layer/osi.py b/src/boring_semantic_layer/osi.py new file mode 100644 index 00000000..b4c1b3c2 --- /dev/null +++ b/src/boring_semantic_layer/osi.py @@ -0,0 +1,434 @@ +""" +OSI (Open Semantic Interchange) support for Boring Semantic Layer. + +Export: ``to_osi`` / ``to_osi_yaml`` convert BSL models to OSI v0.1.1 YAML. +Import: ``from_yaml`` / ``from_config`` natively detect and parse OSI YAML + (no separate import step needed). ``from_osi`` / ``from_osi_yaml`` + are kept as convenience aliases. + +See: https://github.com/open-semantic-interchange/OSI +""" + +from __future__ import annotations + +import re +from collections.abc import Mapping +from typing import Any + +from ibis.common.deferred import Deferred + +from .expr import SemanticModel, SemanticTable +from .ops import Dimension, Measure, SemanticTableOp, _is_deferred +from .utils import expr_to_ibis_string + +OSI_VERSION = "0.1.1" + + +# --------------------------------------------------------------------------- +# Expression helpers (BSL -> SQL for OSI export) +# --------------------------------------------------------------------------- + + +def _deferred_to_sql(expr: Deferred) -> str: + """Convert an Ibis Deferred expression to a simple SQL-like string.""" + return _ibis_string_to_sql(str(expr)) + + +def _ibis_string_to_sql(s: str) -> str: + """Convert an Ibis deferred string repr to SQL expression.""" + s = s.strip() + + if s == "_.count()": + return "COUNT(*)" + + agg_map = {"sum": "SUM", "mean": "AVG", "max": "MAX", "min": "MIN"} + for ibis_fn, sql_fn in agg_map.items(): + m = re.match(rf"^_\.(.+)\.{ibis_fn}\(\)$", s) + if m: + return f"{sql_fn}({m.group(1)})" + + m = re.match(r"^_\.(.+)\.nunique\(\)$", s) + if m: + return f"COUNT(DISTINCT {m.group(1)})" + + m = re.match(r"^_\.(\w+)$", s) + if m: + return m.group(1) + + # For non-trivial expressions (concat, case, etc.) that don't match + # known patterns, return None to signal the caller that we can't produce + # valid SQL. Stripping "_." would leak Ibis method syntax. + return None + + +def _expr_to_sql_string(expr: Any) -> str | None: + """Best-effort conversion of a BSL expression to a SQL string.""" + if _is_deferred(expr): + return _deferred_to_sql(expr) + + from returns.result import Success + + result = expr_to_ibis_string(expr) + if isinstance(result, Success): + val = result.unwrap() + if val is not None: + return _ibis_string_to_sql(val) + + return None + + +# --------------------------------------------------------------------------- +# Export helpers +# --------------------------------------------------------------------------- + + +def _json_dumps(obj: Any) -> str: + import json + return json.dumps(obj) + + +def _make_osi_expression(sql_expr: str, dialect: str = "ANSI_SQL") -> dict: + return {"dialects": [{"dialect": dialect, "expression": sql_expr}]} + + +def _dimension_to_osi_field(name: str, dim: Dimension) -> dict: + """Convert a BSL Dimension to an OSI field dict.""" + sql = _expr_to_sql_string(dim.expr) + field: dict[str, Any] = { + "name": name, + "expression": _make_osi_expression(sql or name), + } + + field["dimension"] = {"is_time": bool(dim.is_time_dimension or dim.is_event_timestamp)} + + if dim.description: + field["description"] = dim.description + if dim.label: + field["label"] = dim.label + if dim.ai_context: + field["ai_context"] = dim.ai_context + + bsl_data: dict[str, Any] = {} + if dim.is_entity: + bsl_data["is_entity"] = True + if dim.is_event_timestamp: + bsl_data["is_event_timestamp"] = True + if dim.smallest_time_grain: + bsl_data["smallest_time_grain"] = dim.smallest_time_grain + if dim.derived_dimensions: + bsl_data["derived_dimensions"] = list(dim.derived_dimensions) + + if bsl_data: + field["custom_extensions"] = [ + {"vendor_name": "COMMON", "data": _json_dumps(bsl_data)} + ] + + return field + + +def _prefix_columns_in_sql(sql: str, dataset: str) -> str: + """``SUM(amount)`` -> ``SUM(dataset.amount)``""" + if sql == "COUNT(*)": + return sql + + def _prefix_match(m: re.Match) -> str: + fn, inner = m.group(1), m.group(2).strip() + if inner.upper().startswith("DISTINCT "): + col = inner[9:].strip() + if "." not in col: + return f"{fn}(DISTINCT {dataset}.{col})" + return m.group(0) + if "." not in inner and inner != "*": + return f"{fn}({dataset}.{inner})" + return m.group(0) + + return re.sub(r"(\w+)\(([^)]+)\)", _prefix_match, sql) + + +def _measure_to_osi_metric(name: str, measure: Measure, dataset_name: str | None = None) -> dict: + sql = _expr_to_sql_string(measure.expr) + if sql and dataset_name: + sql = _prefix_columns_in_sql(sql, dataset_name) + + metric: dict[str, Any] = { + "name": name, + "expression": _make_osi_expression(sql or name), + } + if measure.description: + metric["description"] = measure.description + if measure.ai_context: + metric["ai_context"] = measure.ai_context + return metric + + +def _walk_predicate_for_columns( + op_node, left_cols: set[str], right_cols: set[str] +) -> tuple[list[str], list[str]]: + """Extract (from_columns, to_columns) from an ibis Equals expression tree.""" + from_list: list[str] = [] + to_list: list[str] = [] + + def _extract_eq(node): + """Handle Equals(Field, Field) nodes.""" + cls_name = type(node).__name__ + if cls_name == "Equals": + args = node.args + names = [] + for arg in args: + if hasattr(arg, "name") and isinstance(arg.name, str): + names.append(arg.name) + if len(names) == 2: + a, b = names + if a in left_cols and b in right_cols: + from_list.append(a) + to_list.append(b) + elif b in left_cols and a in right_cols: + from_list.append(b) + to_list.append(a) + elif cls_name == "And": + for arg in node.args: + _extract_eq(arg) + # Recurse into args that are ops + if not from_list: + for arg in getattr(node, "args", []): + if hasattr(arg, "args"): + _extract_eq(arg) + + _extract_eq(op_node) + return from_list, to_list + + +def _extract_join_info(model: SemanticModel) -> list[dict]: + """Extract relationship info from a model's join chain.""" + from .ops import SemanticJoinOp + + relationships: list[dict] = [] + op = model.op() + + def _name(node) -> str: + if isinstance(node, SemanticTableOp): + return node.name or "unnamed" + if hasattr(node, "name") and node.name: + return node.name + if hasattr(node, "table"): + inner = node.table if not hasattr(node.table, "op") else node.table.op() + return _name(inner) + return "unnamed" + + def _extract_join_cols(node) -> tuple[list[str], list[str]]: + """Try to extract join column names from a SemanticJoinOp predicate.""" + if node.on is None: + return [], [] + try: + # Build mock tables from left/right schemas + import ibis + left_tbl = node.left.to_expr() if hasattr(node.left, "to_expr") else None + right_tbl = node.right.to_expr() if hasattr(node.right, "to_expr") else None + if left_tbl is None or right_tbl is None: + return [], [] + # Evaluate the predicate to get an ibis expression + pred = node.on(left_tbl, right_tbl) + # Walk the expression tree to find Equals(field, field) patterns + op_node = pred.op() + from_cols, to_cols = _walk_predicate_for_columns( + op_node, set(left_tbl.columns), set(right_tbl.columns) + ) + return from_cols, to_cols + except Exception: + return [], [] + + def _walk(node): + if isinstance(node, SemanticJoinOp): + from_cols, to_cols = _extract_join_cols(node) + rel: dict[str, Any] = { + "name": f"{_name(node.left)}_{_name(node.right)}", + "from": _name(node.left), + "to": _name(node.right), + "from_columns": from_cols or ["unknown"], + "to_columns": to_cols or ["unknown"], + } + if hasattr(node, "cardinality"): + rel["custom_extensions"] = [ + {"vendor_name": "COMMON", "data": _json_dumps({"cardinality": node.cardinality})} + ] + relationships.append(rel) + _walk(node.left) + _walk(node.right) + elif hasattr(node, "table"): + _walk(node.table if not hasattr(node.table, "op") else node.table.op()) + + _walk(op) + return relationships + + +# --------------------------------------------------------------------------- +# Public: Export BSL -> OSI +# --------------------------------------------------------------------------- + + +def to_osi( + models: dict[str, SemanticModel] | SemanticModel, + name: str = "semantic_model", + description: str | None = None, + ai_context: str | dict | None = None, +) -> dict[str, Any]: + """Convert BSL SemanticModel(s) to an OSI-compliant dict. + + Args: + models: A single SemanticModel or dict of name -> SemanticModel + name: Name for the OSI semantic model + description: Optional description + ai_context: Optional AI context + + Returns: + Dict that can be serialized to OSI YAML via ``yaml.dump()`` + + Example:: + + >>> from boring_semantic_layer import from_yaml + >>> from boring_semantic_layer.osi import to_osi + >>> models = from_yaml("flights.yml") + >>> osi = to_osi(models, name="flights_analytics") + """ + if isinstance(models, (SemanticModel, SemanticTable)): + op = models.op() + model_name = op.name or "model" + models = {model_name: models} + + datasets = [] + all_metrics: list[dict] = [] + all_relationships: list[dict] = [] + seen_relationship_names: set[str] = set() + + for model_name, model in models.items(): + op = model.op() + + dataset: dict[str, Any] = {"name": model_name} + + try: + source_table = op.to_untagged() + if hasattr(source_table, "get_name"): + dataset["source"] = source_table.get_name() + elif hasattr(source_table, "op") and hasattr(source_table.op(), "name"): + dataset["source"] = source_table.op().name or model_name + else: + dataset["source"] = model_name + except Exception: + dataset["source"] = model_name + + pk_cols = [] + dims = op.get_dimensions() + for dim_name, dim in dims.items(): + if dim.is_entity: + sql = _expr_to_sql_string(dim.expr) + pk_cols.append(sql or dim_name) + if pk_cols: + dataset["primary_key"] = pk_cols + + if op.description: + dataset["description"] = op.description + ds_ai_ctx = op.get_ai_context() + if ds_ai_ctx: + dataset["ai_context"] = ds_ai_ctx + + fields = [_dimension_to_osi_field(n, d) for n, d in dims.items()] + if fields: + dataset["fields"] = fields + + datasets.append(dataset) + + for meas_name, meas in op.get_measures().items(): + all_metrics.append(_measure_to_osi_metric(meas_name, meas, model_name)) + + for cm_name, cm_fn in op.get_calculated_measures().items(): + # Try to extract the formula from the calculated measure + cm_expr_str = None + if isinstance(cm_fn, Measure) and cm_fn.original_expr is not None: + cm_expr_str = _expr_to_sql_string(cm_fn.original_expr) + if cm_expr_str is None and isinstance(cm_fn, Measure): + cm_expr_str = _expr_to_sql_string(cm_fn.expr) + # Try inspecting closure for the source expression string + if cm_expr_str is None and callable(cm_fn): + import inspect + try: + closure = inspect.getclosurevars(cm_fn) + if "source" in closure.nonlocals: + src = closure.nonlocals["source"] + # Convert _.meas_a / _.meas_b style to readable form + cm_expr_str = src.replace("_.", "") + except Exception: + pass + + metric: dict[str, Any] = { + "name": cm_name, + "expression": _make_osi_expression(cm_expr_str or cm_name), + } + if isinstance(cm_fn, Measure) and cm_fn.description: + metric["description"] = cm_fn.description + metric.setdefault("custom_extensions", []).append( + {"vendor_name": "COMMON", "data": _json_dumps({"bsl_type": "calculated_measure"})} + ) + all_metrics.append(metric) + + for rel in _extract_join_info(model): + if rel["name"] not in seen_relationship_names: + all_relationships.append(rel) + seen_relationship_names.add(rel["name"]) + + semantic_model: dict[str, Any] = {"name": name, "datasets": datasets} + if description: + semantic_model["description"] = description + if ai_context: + semantic_model["ai_context"] = ai_context + if all_relationships: + semantic_model["relationships"] = all_relationships + if all_metrics: + semantic_model["metrics"] = all_metrics + + return {"version": OSI_VERSION, "semantic_model": [semantic_model]} + + +def to_osi_yaml( + models: dict[str, SemanticModel] | SemanticModel, + name: str = "semantic_model", + description: str | None = None, + ai_context: str | dict | None = None, +) -> str: + """Convert BSL models to an OSI YAML string.""" + import yaml + return yaml.dump( + to_osi(models, name=name, description=description, ai_context=ai_context), + sort_keys=False, + default_flow_style=False, + ) + + +# --------------------------------------------------------------------------- +# Convenience aliases — import delegates to from_config (which auto-detects) +# --------------------------------------------------------------------------- + + +def from_osi( + osi_config: dict[str, Any], + tables: Mapping[str, Any] | None = None, +) -> dict[str, SemanticModel]: + """Parse an OSI config dict into BSL models. + + This is a convenience alias for ``from_config(osi_config, tables=tables)``. + You can also call ``from_config`` or ``from_yaml`` directly — they + auto-detect OSI format. + """ + from .yaml import from_config + return from_config(osi_config, tables=tables) + + +def from_osi_yaml( + yaml_path: str, + tables: Mapping[str, Any] | None = None, +) -> dict[str, SemanticModel]: + """Load BSL models from an OSI YAML file. + + This is a convenience alias for ``from_yaml(yaml_path, tables=tables)``. + """ + from .yaml import from_yaml + return from_yaml(yaml_path, tables=tables) diff --git a/src/boring_semantic_layer/tests/test_osi.py b/src/boring_semantic_layer/tests/test_osi.py new file mode 100644 index 00000000..51ee8fed --- /dev/null +++ b/src/boring_semantic_layer/tests/test_osi.py @@ -0,0 +1,546 @@ +"""Tests for OSI (Open Semantic Interchange) support. + +Import tests use ``from_config`` (the native entry point) to verify that OSI +format is auto-detected and parsed correctly — no separate ``from_osi`` call +needed. Export tests use ``to_osi`` / ``to_osi_yaml``. +""" + +import json + +import ibis +import pytest + +from boring_semantic_layer import ( + Dimension, + Measure, + from_config, + to_semantic_table, +) +from boring_semantic_layer.osi import ( + OSI_VERSION, + _deferred_to_sql, + _ibis_string_to_sql, + to_osi, + to_osi_yaml, +) +from boring_semantic_layer.yaml import ( + _is_osi_config, + _sql_to_deferred, + _strip_dataset_prefix, +) + + +# --------------------------------------------------------------------------- +# Expression conversion tests +# --------------------------------------------------------------------------- + + +class TestIbisStringToSql: + def test_simple_column(self): + assert _ibis_string_to_sql("_.column_name") == "column_name" + + def test_count(self): + assert _ibis_string_to_sql("_.count()") == "COUNT(*)" + + def test_sum(self): + assert _ibis_string_to_sql("_.amount.sum()") == "SUM(amount)" + + def test_mean(self): + assert _ibis_string_to_sql("_.amount.mean()") == "AVG(amount)" + + def test_max(self): + assert _ibis_string_to_sql("_.amount.max()") == "MAX(amount)" + + def test_min(self): + assert _ibis_string_to_sql("_.amount.min()") == "MIN(amount)" + + def test_nunique(self): + assert _ibis_string_to_sql("_.customer_id.nunique()") == "COUNT(DISTINCT customer_id)" + + +class TestDeferredToSql: + def test_simple_column(self): + assert _deferred_to_sql(ibis._.column_name) == "column_name" + + def test_count(self): + assert _deferred_to_sql(ibis._.count()) == "COUNT(*)" + + def test_sum(self): + assert _deferred_to_sql(ibis._.amount.sum()) == "SUM(amount)" + + def test_mean(self): + assert _deferred_to_sql(ibis._.amount.mean()) == "AVG(amount)" + + +class TestSqlToDeferred: + def test_simple_column(self): + d = _sql_to_deferred("column_name") + assert "column_name" in str(d) + + def test_count_star(self): + d = _sql_to_deferred("COUNT(*)") + assert "count" in str(d).lower() + + def test_sum(self): + d = _sql_to_deferred("SUM(amount)") + assert "sum" in str(d).lower() or "amount" in str(d).lower() + + def test_avg(self): + d = _sql_to_deferred("AVG(price)") + assert "mean" in str(d).lower() or "price" in str(d).lower() + + +class TestStripDatasetPrefix: + def test_sum_with_prefix(self): + assert _strip_dataset_prefix("SUM(flights.distance)") == "SUM(distance)" + + def test_count_star(self): + assert _strip_dataset_prefix("COUNT(*)") == "COUNT(*)" + + def test_count_distinct_with_prefix(self): + assert _strip_dataset_prefix("COUNT(DISTINCT customers.id)") == "COUNT(DISTINCT id)" + + def test_no_prefix(self): + assert _strip_dataset_prefix("SUM(distance)") == "SUM(distance)" + + +# --------------------------------------------------------------------------- +# Format detection +# --------------------------------------------------------------------------- + + +class TestFormatDetection: + def test_detects_osi(self): + assert _is_osi_config({"version": "0.1.1", "semantic_model": []}) + + def test_rejects_bsl(self): + assert not _is_osi_config({"flights": {"table": "flights_tbl"}}) + + def test_rejects_partial(self): + assert not _is_osi_config({"version": "0.1.1"}) + assert not _is_osi_config({"semantic_model": []}) + + +# --------------------------------------------------------------------------- +# Export tests: BSL -> OSI +# --------------------------------------------------------------------------- + + +@pytest.fixture +def simple_model(): + """A simple BSL model with dimensions and measures.""" + table = ibis.table( + {"order_id": "int64", "customer_id": "int64", "amount": "float64", "created_at": "timestamp"}, + name="orders", + ) + model = to_semantic_table(table, name="orders", description="Order transactions") + model = model.with_dimensions( + order_id=Dimension(expr=ibis._.order_id, description="Order ID", is_entity=True), + customer_id=Dimension(expr=ibis._.customer_id, description="Customer ID"), + created_at=Dimension( + expr=ibis._.created_at, + description="Order creation timestamp", + is_time_dimension=True, + smallest_time_grain="TIME_GRAIN_DAY", + ), + ) + model = model.with_measures( + order_count=Measure(expr=ibis._.count(), description="Total orders"), + total_amount=Measure(expr=ibis._.amount.sum(), description="Total order amount"), + avg_amount=Measure(expr=ibis._.amount.mean(), description="Average order amount"), + ) + return model + + +@pytest.fixture +def model_with_ai_context(): + """A BSL model with ai_context on dimensions and measures.""" + table = ibis.table( + {"product_id": "int64", "name": "string", "price": "float64"}, + name="products", + ) + model = to_semantic_table(table, name="products", description="Product catalog") + model = model.with_dimensions( + product_id=Dimension( + expr=ibis._.product_id, + description="Product identifier", + is_entity=True, + ai_context={"synonyms": ["SKU", "item ID"]}, + ), + name=Dimension( + expr=ibis._.name, + description="Product name", + ai_context="Product display name shown to customers", + ), + ) + model = model.with_measures( + avg_price=Measure( + expr=ibis._.price.mean(), + description="Average product price", + ai_context={"synonyms": ["mean price", "price average"]}, + ), + ) + return model + + +class TestToOsi: + def test_basic_structure(self, simple_model): + osi = to_osi(simple_model, name="test_model") + assert osi["version"] == OSI_VERSION + assert len(osi["semantic_model"]) == 1 + sm = osi["semantic_model"][0] + assert sm["name"] == "test_model" + assert "datasets" in sm + + def test_dataset_fields(self, simple_model): + osi = to_osi(simple_model) + ds = osi["semantic_model"][0]["datasets"][0] + assert ds["name"] == "orders" + assert ds["description"] == "Order transactions" + field_names = {f["name"] for f in ds["fields"]} + assert {"order_id", "customer_id", "created_at"} <= field_names + + def test_primary_key_from_entity(self, simple_model): + osi = to_osi(simple_model) + ds = osi["semantic_model"][0]["datasets"][0] + assert "order_id" in ds["primary_key"] + + def test_time_dimension(self, simple_model): + osi = to_osi(simple_model) + ds = osi["semantic_model"][0]["datasets"][0] + created_at = next(f for f in ds["fields"] if f["name"] == "created_at") + assert created_at["dimension"]["is_time"] is True + + def test_non_time_dimension(self, simple_model): + osi = to_osi(simple_model) + ds = osi["semantic_model"][0]["datasets"][0] + customer = next(f for f in ds["fields"] if f["name"] == "customer_id") + assert customer["dimension"]["is_time"] is False + + def test_metrics(self, simple_model): + osi = to_osi(simple_model) + metric_names = {m["name"] for m in osi["semantic_model"][0]["metrics"]} + assert {"order_count", "total_amount", "avg_amount"} <= metric_names + + def test_metric_expressions(self, simple_model): + osi = to_osi(simple_model) + count_metric = next(m for m in osi["semantic_model"][0]["metrics"] if m["name"] == "order_count") + assert count_metric["expression"]["dialects"][0]["dialect"] == "ANSI_SQL" + + def test_field_expression_format(self, simple_model): + osi = to_osi(simple_model) + ds = osi["semantic_model"][0]["datasets"][0] + f = next(f for f in ds["fields"] if f["name"] == "order_id") + assert f["expression"]["dialects"][0] == {"dialect": "ANSI_SQL", "expression": "order_id"} + + def test_custom_extensions_for_bsl_metadata(self, simple_model): + osi = to_osi(simple_model) + ds = osi["semantic_model"][0]["datasets"][0] + f = next(f for f in ds["fields"] if f["name"] == "order_id") + data = json.loads(f["custom_extensions"][0]["data"]) + assert data["is_entity"] is True + + def test_time_grain_in_custom_extensions(self, simple_model): + osi = to_osi(simple_model) + ds = osi["semantic_model"][0]["datasets"][0] + f = next(f for f in ds["fields"] if f["name"] == "created_at") + data = json.loads(f["custom_extensions"][0]["data"]) + assert data["smallest_time_grain"] == "TIME_GRAIN_DAY" + + def test_ai_context_on_dimensions(self, model_with_ai_context): + osi = to_osi(model_with_ai_context) + ds = osi["semantic_model"][0]["datasets"][0] + f = next(f for f in ds["fields"] if f["name"] == "product_id") + assert f["ai_context"]["synonyms"] == ["SKU", "item ID"] + + def test_ai_context_on_metrics(self, model_with_ai_context): + osi = to_osi(model_with_ai_context) + m = next(m for m in osi["semantic_model"][0]["metrics"] if m["name"] == "avg_price") + assert m["ai_context"]["synonyms"] == ["mean price", "price average"] + + def test_with_description_and_ai_context(self, simple_model): + osi = to_osi(simple_model, name="my_model", description="A test", ai_context={"instructions": "test"}) + sm = osi["semantic_model"][0] + assert sm["description"] == "A test" + assert sm["ai_context"]["instructions"] == "test" + + def test_multiple_models(self): + m1 = to_semantic_table(ibis.table({"id": "int64", "amount": "float64"}, name="o"), name="orders") + m1 = m1.with_dimensions(id=Dimension(expr=ibis._.id)) + m1 = m1.with_measures(total=Measure(expr=ibis._.amount.sum())) + m2 = to_semantic_table(ibis.table({"id": "int64", "name": "string"}, name="c"), name="customers") + m2 = m2.with_dimensions(id=Dimension(expr=ibis._.id)) + osi = to_osi({"orders": m1, "customers": m2}, name="ecommerce") + assert len(osi["semantic_model"][0]["datasets"]) == 2 + + +class TestToOsiYaml: + def test_yaml_output(self, simple_model): + s = to_osi_yaml(simple_model, name="test") + assert "version:" in s + assert "semantic_model:" in s + assert "datasets:" in s + + +# --------------------------------------------------------------------------- +# Import tests: OSI parsed natively via from_config +# --------------------------------------------------------------------------- + + +@pytest.fixture +def osi_config(): + """A minimal OSI config dict.""" + return { + "version": "0.1.1", + "semantic_model": [ + { + "name": "test_model", + "datasets": [ + { + "name": "orders", + "source": "orders_table", + "primary_key": ["order_id"], + "description": "Order transactions", + "fields": [ + { + "name": "order_id", + "expression": {"dialects": [{"dialect": "ANSI_SQL", "expression": "order_id"}]}, + "dimension": {"is_time": False}, + "description": "Order identifier", + "custom_extensions": [{"vendor_name": "COMMON", "data": json.dumps({"is_entity": True})}], + }, + { + "name": "created_at", + "expression": {"dialects": [{"dialect": "ANSI_SQL", "expression": "created_at"}]}, + "dimension": {"is_time": True}, + "description": "Creation timestamp", + }, + { + "name": "amount", + "expression": {"dialects": [{"dialect": "ANSI_SQL", "expression": "amount"}]}, + "dimension": {"is_time": False}, + }, + ], + }, + ], + "metrics": [ + { + "name": "total_amount", + "expression": {"dialects": [{"dialect": "ANSI_SQL", "expression": "SUM(orders.amount)"}]}, + "description": "Total order amount", + }, + { + "name": "order_count", + "expression": {"dialects": [{"dialect": "ANSI_SQL", "expression": "COUNT(*)"}]}, + "description": "Number of orders", + }, + ], + } + ], + } + + +class TestFromConfig_OSI: + """Test that from_config auto-detects and parses OSI format.""" + + def test_basic_import(self, osi_config): + models = from_config(osi_config) + assert "orders" in models + + def test_model_description(self, osi_config): + models = from_config(osi_config) + assert models["orders"].op().description == "Order transactions" + + def test_dimensions_imported(self, osi_config): + dims = from_config(osi_config)["orders"].op().get_dimensions() + assert {"order_id", "created_at", "amount"} == set(dims.keys()) + + def test_dimension_descriptions(self, osi_config): + dims = from_config(osi_config)["orders"].op().get_dimensions() + assert dims["order_id"].description == "Order identifier" + assert dims["created_at"].description == "Creation timestamp" + + def test_time_dimension_flag(self, osi_config): + dims = from_config(osi_config)["orders"].op().get_dimensions() + assert dims["created_at"].is_time_dimension is True + assert dims["order_id"].is_time_dimension is False + + def test_entity_from_custom_extensions(self, osi_config): + dims = from_config(osi_config)["orders"].op().get_dimensions() + assert dims["order_id"].is_entity is True + + def test_measures_imported(self, osi_config): + measures = from_config(osi_config)["orders"].op().get_measures() + assert {"total_amount", "order_count"} == set(measures.keys()) + + def test_measure_descriptions(self, osi_config): + measures = from_config(osi_config)["orders"].op().get_measures() + assert measures["total_amount"].description == "Total order amount" + + def test_with_real_table(self, osi_config): + con = ibis.duckdb.connect() + con.raw_sql("CREATE TABLE orders_table (order_id INT, created_at TIMESTAMP, amount DOUBLE)") + models = from_config(osi_config, tables={"orders": con.table("orders_table")}) + assert "orders" in models + + def test_ai_context_preserved(self): + config = { + "version": "0.1.1", + "semantic_model": [{ + "name": "test", + "datasets": [{ + "name": "items", + "source": "items", + "fields": [{ + "name": "item_id", + "expression": {"dialects": [{"dialect": "ANSI_SQL", "expression": "item_id"}]}, + "ai_context": {"synonyms": ["SKU", "product_id"]}, + }], + }], + }], + } + dims = from_config(config)["items"].op().get_dimensions() + assert dims["item_id"].ai_context == {"synonyms": ["SKU", "product_id"]} + + + def test_primary_key_sets_is_entity(self): + """primary_key field names should become is_entity=True dimensions.""" + config = { + "version": "0.1.1", + "semantic_model": [{ + "name": "test", + "datasets": [{ + "name": "users", + "source": "users", + "primary_key": ["user_id"], + "fields": [ + { + "name": "user_id", + "expression": {"dialects": [{"dialect": "ANSI_SQL", "expression": "user_id"}]}, + }, + { + "name": "email", + "expression": {"dialects": [{"dialect": "ANSI_SQL", "expression": "email"}]}, + }, + ], + }], + }], + } + dims = from_config(config)["users"].op().get_dimensions() + assert dims["user_id"].is_entity is True + assert dims["email"].is_entity is False + + def test_dataset_ai_context(self): + """Dataset-level ai_context should be stored on the model.""" + config = { + "version": "0.1.1", + "semantic_model": [{ + "name": "test", + "datasets": [{ + "name": "products", + "source": "products", + "ai_context": {"synonyms": ["items", "SKUs"]}, + "fields": [{ + "name": "id", + "expression": {"dialects": [{"dialect": "ANSI_SQL", "expression": "id"}]}, + }], + }], + }], + } + op = from_config(config)["products"].op() + assert op.get_ai_context() == {"synonyms": ["items", "SKUs"]} + + def test_field_label(self): + """Field label should be stored on the Dimension.""" + config = { + "version": "0.1.1", + "semantic_model": [{ + "name": "test", + "datasets": [{ + "name": "events", + "source": "events", + "fields": [{ + "name": "status", + "expression": {"dialects": [{"dialect": "ANSI_SQL", "expression": "status"}]}, + "label": "filter", + }], + }], + }], + } + dims = from_config(config)["events"].op().get_dimensions() + assert dims["status"].label == "filter" + + def test_label_round_trip(self): + """label survives BSL -> OSI -> BSL.""" + import ibis + t = ibis.table({"status": "string"}, name="events") + m = to_semantic_table(t, name="events") + m = m.with_dimensions(status=Dimension(expr=ibis._.status, label="filter")) + osi = to_osi(m) + models = from_config(osi) + assert models["events"].op().get_dimensions()["status"].label == "filter" + + def test_dataset_ai_context_round_trip(self): + """Dataset ai_context survives BSL -> OSI -> BSL.""" + import ibis + t = ibis.table({"id": "int64"}, name="items") + m = to_semantic_table(t, name="items", ai_context={"synonyms": ["products"]}) + m = m.with_dimensions(id=Dimension(expr=ibis._.id)) + osi = to_osi(m) + op = from_config(osi)["items"].op() + assert op.get_ai_context() == {"synonyms": ["products"]} + + +# --------------------------------------------------------------------------- +# Round-trip tests +# --------------------------------------------------------------------------- + + +class TestRoundTrip: + def test_bsl_to_osi_to_bsl(self, simple_model): + """BSL -> OSI -> BSL preserves key semantics.""" + osi = to_osi(simple_model, name="round_trip_test") + models = from_config(osi) # auto-detects OSI + assert "orders" in models + + orig_dims = simple_model.op().get_dimensions() + new_dims = models["orders"].op().get_dimensions() + assert set(orig_dims.keys()) == set(new_dims.keys()) + + for name in orig_dims: + assert orig_dims[name].description == new_dims[name].description + + assert new_dims["created_at"].is_time_dimension is True + assert new_dims["order_id"].is_entity is True + + def test_bsl_to_osi_to_bsl_measures(self, simple_model): + osi = to_osi(simple_model) + models = from_config(osi) + + orig = simple_model.op().get_measures() + new = models["orders"].op().get_measures() + assert set(orig.keys()) == set(new.keys()) + for name in orig: + assert orig[name].description == new[name].description + + def test_osi_to_bsl_to_osi(self, osi_config): + models = from_config(osi_config) + osi_out = to_osi(models, name="round_trip") + + assert osi_out["version"] == OSI_VERSION + ds_out = osi_out["semantic_model"][0]["datasets"][0] + assert ds_out["name"] == "orders" + assert ds_out["description"] == "Order transactions" + ds_in = osi_config["semantic_model"][0]["datasets"][0] + assert len(ds_out["fields"]) == len(ds_in["fields"]) + + def test_ai_context_round_trip(self, model_with_ai_context): + osi = to_osi(model_with_ai_context) + models = from_config(osi) + + dims = models["products"].op().get_dimensions() + assert dims["product_id"].ai_context == {"synonyms": ["SKU", "item ID"]} + assert dims["name"].ai_context == "Product display name shown to customers" + + measures = models["products"].op().get_measures() + assert measures["avg_price"].ai_context == {"synonyms": ["mean price", "price average"]} diff --git a/src/boring_semantic_layer/yaml.py b/src/boring_semantic_layer/yaml.py index 71af2bfc..f2690d86 100644 --- a/src/boring_semantic_layer/yaml.py +++ b/src/boring_semantic_layer/yaml.py @@ -1,7 +1,13 @@ """ YAML loader for Boring Semantic Layer models using the semantic API. + +Supports both BSL native YAML format and OSI (Open Semantic Interchange) +v0.1.1 format. The format is auto-detected based on the presence of +``version`` and ``semantic_model`` keys. """ +import json +import re from collections.abc import Mapping from typing import Any @@ -14,6 +20,344 @@ from .utils import read_yaml_file, safe_eval +# --------------------------------------------------------------------------- +# Format detection +# --------------------------------------------------------------------------- + + +def _is_osi_config(config: Mapping[str, Any]) -> bool: + """Return True if *config* looks like an OSI YAML document.""" + return "semantic_model" in config and "version" in config + + +# --------------------------------------------------------------------------- +# OSI expression helpers (SQL <-> Ibis Deferred) +# --------------------------------------------------------------------------- + + +def _sql_to_deferred(sql: str): + """Convert a simple SQL expression to an Ibis Deferred. + + Handles: + "column_name" -> _.column_name + "SUM(column)" -> _.column.sum() + "AVG(column)" -> _.column.mean() + "COUNT(*)" -> _.count() + "COUNT(DISTINCT column)" -> _.column.nunique() + """ + sql = sql.strip() + + if sql == "COUNT(*)": + return safe_eval("_.count()", context={"_": _}).unwrap() + + # COUNT(DISTINCT col) + m = re.match(r"^COUNT\(DISTINCT\s+(\w+)\)$", sql, re.IGNORECASE) + if m: + return safe_eval(f"_.{m.group(1)}.nunique()", context={"_": _}).unwrap() + + # AGG(col) patterns + sql_to_ibis = {"SUM": "sum", "AVG": "mean", "MAX": "max", "MIN": "min"} + for sql_fn, ibis_fn in sql_to_ibis.items(): + m = re.match(rf"^{sql_fn}\((\w+)\)$", sql, re.IGNORECASE) + if m: + return safe_eval(f"_.{m.group(1)}.{ibis_fn}()", context={"_": _}).unwrap() + + # Simple column reference + if re.match(r"^\w+$", sql): + return safe_eval(f"_.{sql}", context={"_": _}).unwrap() + + # Fallback: try eval as-is with underscore prefix + try: + return safe_eval(f"_.{sql}", context={"_": _}).unwrap() + except Exception: + return safe_eval( + f"_.{sql.split('.')[0] if '.' in sql else sql}", context={"_": _} + ).unwrap() + + +def _parse_osi_expression(expr_obj: dict, prefer_dialect: str = "ANSI_SQL") -> str: + """Extract the SQL expression string from an OSI expression object.""" + dialects = expr_obj.get("dialects", []) + if not dialects: + raise ValueError("OSI expression has no dialects") + for d in dialects: + if d.get("dialect") == prefer_dialect: + return d["expression"] + return dialects[0]["expression"] + + +def _strip_dataset_prefix(sql: str) -> str: + """Remove dataset.column prefixes from SQL aggregates. + + ``SUM(flights.distance)`` -> ``SUM(distance)`` + """ + + def _strip_match(m: re.Match) -> str: + fn = m.group(1) + inner = m.group(2).strip() + if inner.upper().startswith("DISTINCT "): + rest = inner[9:].strip() + if "." in rest: + return f"{fn}(DISTINCT {rest.split('.')[-1]})" + return m.group(0) + if "." in inner and inner != "*": + return f"{fn}({inner.split('.')[-1]})" + return m.group(0) + + return re.sub(r"(\w+)\(([^)]+)\)", _strip_match, sql) + + +# --------------------------------------------------------------------------- +# OSI field / metric -> BSL Dimension / Measure +# --------------------------------------------------------------------------- + + +def _osi_field_to_dimension( + field: dict, primary_key_cols: set[str] | None = None +) -> tuple[str, Dimension]: + """Convert an OSI field dict to a ``(name, Dimension)`` pair. + + Args: + field: OSI field definition. + primary_key_cols: Column names from the dataset's ``primary_key``. + If the field's expression matches one of these, the dimension + is marked ``is_entity=True``. + """ + name = field["name"] + sql_expr = _parse_osi_expression(field["expression"]) + deferred = _sql_to_deferred(sql_expr) + + kwargs: dict[str, Any] = { + "expr": deferred, + "description": field.get("description"), + } + + dim_meta = field.get("dimension", {}) + if dim_meta.get("is_time"): + kwargs["is_time_dimension"] = True + + if "ai_context" in field: + kwargs["ai_context"] = field["ai_context"] + + if field.get("label"): + kwargs["label"] = field["label"] + + # Mark as entity if the field appears in the dataset's primary_key + if primary_key_cols and (sql_expr in primary_key_cols or name in primary_key_cols): + kwargs["is_entity"] = True + + # Recover BSL-specific metadata stored in custom_extensions + for ext in field.get("custom_extensions", []): + if ext.get("vendor_name") == "COMMON": + try: + data = json.loads(ext["data"]) + if data.get("is_entity"): + kwargs["is_entity"] = True + if data.get("is_event_timestamp"): + kwargs["is_event_timestamp"] = True + if data.get("smallest_time_grain"): + kwargs["smallest_time_grain"] = data["smallest_time_grain"] + if data.get("derived_dimensions"): + kwargs["derived_dimensions"] = tuple(data["derived_dimensions"]) + except (json.JSONDecodeError, KeyError): + pass + + return name, Dimension(**kwargs) + + +def _osi_metric_to_measure(metric: dict) -> tuple[str, Measure]: + """Convert an OSI metric dict to a ``(name, Measure)`` pair.""" + name = metric["name"] + sql_expr = _parse_osi_expression(metric["expression"]) + sql_expr = _strip_dataset_prefix(sql_expr) + deferred = _sql_to_deferred(sql_expr) + + kwargs: dict[str, Any] = { + "expr": deferred, + "description": metric.get("description"), + } + if "ai_context" in metric: + kwargs["ai_context"] = metric["ai_context"] + + return name, Measure(**kwargs) + + +# --------------------------------------------------------------------------- +# OSI config -> BSL models (called from from_config when OSI is detected) +# --------------------------------------------------------------------------- + + +def _create_placeholder_table(dataset: dict): + """Create a placeholder ibis table from OSI field definitions.""" + import ibis + + fields = dataset.get("fields", []) + if not fields: + return None + schema = {f["name"]: "string" for f in fields} + try: + return ibis.table(schema, name=dataset["name"]) + except Exception: + return None + + +def _from_osi_config( + config: Mapping[str, Any], + tables: Mapping[str, Any] | None = None, + profile: str | None = None, + profile_path: str | None = None, +) -> dict[str, SemanticModel]: + """Parse an OSI config dict into BSL SemanticModel instances. + + This is an internal entry-point invoked by :func:`from_config` when it + detects OSI format. Users should call ``from_config`` / ``from_yaml`` + directly — those work for *both* BSL and OSI files. + """ + tables = dict(tables) if tables else {} + + # Load tables from profile if not provided + if not tables: + profile_config = profile or config.get("profile") + if profile_config or profile_path: + connection = get_connection( + profile_config or profile_path, + profile_file=profile_path if profile_config else None, + ) + tables = {name: connection.table(name) for name in connection.list_tables()} + + semantic_models = config.get("semantic_model", []) + if not semantic_models: + raise ValueError("No semantic_model found in OSI config") + + result: dict[str, SemanticModel] = {} + + for sm in semantic_models: + datasets = sm.get("datasets", []) + metrics = sm.get("metrics", []) + relationships = sm.get("relationships", []) + dataset_names = {ds["name"] for ds in datasets} + + for ds in datasets: + ds_name = ds["name"] + + # Resolve backing table + if ds_name in tables: + table = tables[ds_name] + elif ds.get("source") and ds["source"] in tables: + table = tables[ds["source"]] + else: + table = _create_placeholder_table(ds) + if table is None: + continue + + model = to_semantic_table( + table, + name=ds_name, + description=ds.get("description"), + ai_context=ds.get("ai_context"), + ) + + # primary_key column names for is_entity detection + pk_cols = set(ds.get("primary_key") or []) + + # Fields -> Dimensions + dimensions: dict[str, Dimension] = {} + for field in ds.get("fields", []): + dim_name, dim = _osi_field_to_dimension(field, pk_cols) + dimensions[dim_name] = dim + if dimensions: + model = model.with_dimensions(**dimensions) + + # Metrics -> Measures (assign to the dataset they reference) + ds_measures: dict[str, Measure] = {} + for metric in metrics: + sql_expr = _parse_osi_expression(metric["expression"]) + # Explicitly references this dataset + if f"{ds_name}." in sql_expr: + meas_name, meas = _osi_metric_to_measure(metric) + ds_measures[meas_name] = meas + if ds_measures: + model = model.with_measures(**ds_measures) + + result[ds_name] = model + + # Second pass: assign unqualified metrics (no dataset. prefix) to the + # first dataset only, to avoid duplicating them across all datasets. + if datasets and metrics: + first_ds_name = datasets[0]["name"] + if first_ds_name in result: + unqualified: dict[str, Measure] = {} + for metric in metrics: + sql_expr = _parse_osi_expression(metric["expression"]) + # Skip if it explicitly references any dataset + if any(f"{dn}." in sql_expr for dn in dataset_names): + continue + meas_name, meas = _osi_metric_to_measure(metric) + # Only add if not already present on this model + if meas_name not in result[first_ds_name].op().get_measures(): + unqualified[meas_name] = meas + if unqualified: + result[first_ds_name] = result[first_ds_name].with_measures( + **unqualified + ) + + # Relationships -> Joins (supports multi-column keys + cardinality) + if tables and relationships: + for rel in relationships: + from_ds = rel.get("from", "") + to_ds = rel.get("to", "") + if from_ds in result and to_ds in result: + from_cols = rel.get("from_columns", []) + to_cols = rel.get("to_columns", []) + if ( + from_cols + and to_cols + and len(from_cols) == len(to_cols) + and from_cols[0] != "unknown" + ): + + def _make_join_cond(lcols, rcols): + def cond(left, right): + pairs = [ + getattr(left, lc) == getattr(right, rc) + for lc, rc in zip(lcols, rcols) + ] + res = pairs[0] + for p in pairs[1:]: + res = res & p + return res + + return cond + + # Detect cardinality from custom_extensions + cardinality = "one" + for ext in rel.get("custom_extensions", []): + if ext.get("vendor_name") == "COMMON": + try: + data = json.loads(ext["data"]) + if data.get("cardinality") in ("one", "many"): + cardinality = data["cardinality"] + except (json.JSONDecodeError, KeyError): + pass + + on_cond = _make_join_cond(from_cols, to_cols) + if cardinality == "many": + result[from_ds] = result[from_ds].join_many( + result[to_ds], on=on_cond + ) + else: + result[from_ds] = result[from_ds].join_one( + result[to_ds], on=on_cond + ) + + return result + + +# --------------------------------------------------------------------------- +# BSL native YAML helpers +# --------------------------------------------------------------------------- + + def _parse_expression_config(name: str, config: str | dict, metric_type: str): """Extract expression string, description, and extra kwargs from config.""" if isinstance(config, str): @@ -30,6 +374,8 @@ def _parse_expression_config(name: str, config: str | dict, metric_type: str): extra_kwargs["is_time_dimension"] = config.get("is_time_dimension", False) extra_kwargs["smallest_time_grain"] = config.get("smallest_time_grain") extra_kwargs["derived_dimensions"] = tuple(config.get("derived_dimensions") or ()) + if "ai_context" in config: + extra_kwargs["ai_context"] = config["ai_context"] return config["expr"], config.get("description"), extra_kwargs else: raise ValueError(f"Invalid {metric_type} format for '{name}'. Must be a string or dict") @@ -55,11 +401,14 @@ def _parse_dimension_or_measure( expr_str, description, extra_kwargs = _parse_expression_config(name, config, metric_type) deferred = safe_eval(expr_str, context={"_": _}).unwrap() base_kwargs = {"expr": deferred, "description": description} - return ( - Dimension(**base_kwargs, **extra_kwargs) - if metric_type == "dimension" - else Measure(**base_kwargs) - ) + if metric_type == "dimension": + return Dimension(**base_kwargs, **extra_kwargs) + else: + # Pass through ai_context for measures too + meas_kwargs = base_kwargs + if "ai_context" in extra_kwargs: + meas_kwargs["ai_context"] = extra_kwargs["ai_context"] + return Measure(**meas_kwargs) def _parse_calc_measure(name: str, config: str | dict) -> Measure: @@ -320,6 +669,11 @@ def _load_table_for_yaml_model( return tables, tables[table_name] +# --------------------------------------------------------------------------- +# Public API +# --------------------------------------------------------------------------- + + def from_config( config: Mapping[str, Any], tables: Mapping[str, Any] | None = None, @@ -329,48 +683,27 @@ def from_config( """ Load semantic tables from a configuration dictionary. - This is useful when you have already loaded your configuration through - custom logic (e.g., Kedro catalog, external config management) and want - to construct SemanticTable objects without going through YAML file loading. + Accepts **both** BSL native YAML format and OSI (Open Semantic Interchange) + v0.1.1 format. The format is auto-detected: if the dict contains + ``version`` and ``semantic_model`` keys it is treated as OSI, otherwise + as BSL native. Args: - config: Configuration dictionary with model definitions + config: Configuration dictionary (BSL or OSI format) tables: Optional mapping of table names to ibis table expressions profile: Optional profile name to load tables from profile_path: Optional path to profile file Returns: Dict mapping model names to SemanticModel instances - - Example config format: - { - "flights": { - "table": "flights_tbl", - "description": "Flight data model", - "database": ["analytics", "prod"], # optional: catalog.schema - "dimensions": { - "origin": {"expr": "_.origin", "description": "Origin airport"}, - "destination": "_.destination", - }, - "measures": { - "flight_count": "_.count()", - "avg_distance": "_.distance.mean()", - }, - } - } - - The optional 'database' field can be a string or list for multi-part identifiers - (e.g., ["catalog", "schema"] for catalog.schema.table). This is passed to - ibis connection.table() and is useful for loading tables from different - databases/schemas under the same connection. - - Example usage with pre-loaded tables: - >>> import ibis - >>> con = ibis.duckdb.connect() - >>> flights_tbl = con.table("flights") - >>> config = {"flights": {"table": "flights_tbl", "dimensions": {...}}} - >>> models = from_config(config, tables={"flights_tbl": flights_tbl}) """ + # ---- Auto-detect OSI format ---- + if _is_osi_config(config): + return _from_osi_config( + config, tables=tables, profile=profile, profile_path=profile_path + ) + + # ---- BSL native format ---- tables = _load_tables_from_references(dict(tables) if tables else {}) # Load tables from profile if not provided @@ -451,12 +784,13 @@ def from_yaml( profile_path: str | None = None, ) -> dict[str, SemanticModel]: """ - Load semantic tables from a YAML file with optional profile-based table loading. + Load semantic tables from a YAML file. - This is a convenience wrapper around from_config() that loads the YAML file first. + Accepts **both** BSL native YAML format and OSI (Open Semantic Interchange) + v0.1.1 format. The format is auto-detected. Args: - yaml_path: Path to the YAML configuration file + yaml_path: Path to the YAML configuration file (BSL or OSI format) tables: Optional mapping of table names to ibis table expressions profile: Optional profile name to load tables from profile_path: Optional path to profile file @@ -464,45 +798,14 @@ def from_yaml( Returns: Dict mapping model names to SemanticModel instances - Example YAML format: - flights: - table: flights_tbl - description: "Flight data model" - database: # optional: for loading from specific database/schema - - analytics - - prod - dimensions: - origin: - expr: _.origin - description: "Origin airport code" - is_entity: true - destination: _.destination - carrier: _.carrier - arr_time: - expr: _.arr_time - description: "Arrival time" - is_event_timestamp: true - is_time_dimension: true - smallest_time_grain: "TIME_GRAIN_DAY" - measures: - flight_count: _.count() - avg_distance: _.distance.mean() - total_distance: - expr: _.distance.sum() - description: "Total distance flown" - calculated_measures: - avg_per_flight: - expr: _.total_distance / _.flight_count - description: "Average distance per flight" - pct_of_total: - expr: _.total_distance / _.all(_.total_distance) * 100 - description: "Percentage of total distance" - joins: - carriers: - model: carriers - type: one - left_on: carrier - right_on: code + Examples: + Load a BSL native YAML file:: + + models = from_yaml("flights.yml") + + Load an OSI YAML file:: + + models = from_yaml("flights_osi.yaml", tables=tables) """ yaml_configs = read_yaml_file(yaml_path) return from_config(yaml_configs, tables=tables, profile=profile, profile_path=profile_path)