Add datatype field to Field and Metric; reframe is_time as role marker by jonmmease · Pull Request #113 · open-semantic-interchange/OSI

jonmmease · 2026-04-24T19:00:15Z

AI Disclosure: I drove this PR and description below iteratively with Claude Code, and have reviewed it fully

Motivation

Closes #84.

This PR is the outcome of the discussion I started in #25 (converted to #84), also related to #17.

The issue originally framed this as "use datatype rather than is_time". After a bit more thought and investigation into existing semantic layer implementations, I think it makes more sense to add data types as a new field/metric attribute and keep is_time as a role attribute. This fits cleanly with the example in this repo, where d_year, d_quarter_name, and d_month_name have different data types but all play temporal roles. Snowflake Semantic Views (YAML form) and LookML both split the same two axes.

So this PR adds datatype alongside is_time and documents them as independent properties.

What changes

New datatype enum on Field and Metric (optional, top-level). Values: string, integer, number, boolean, date, time, timestamp, timestamp_tz, other. Logical types, not SQL-physical. Use other plus custom_extensions for types outside the enum.
dimension.is_time stays valid. It is documented as a temporal-role marker, independent of datatype. A field may carry datatype: integer with is_time: true (e.g. d_year), or datatype: timestamp with no is_time (a plain timestamp column), or is_time: true alone (legacy-style).
Default rule for is_time. When unset, is_time defaults to true if datatype is one of date, time, timestamp, timestamp_tz, and false otherwise. Explicit is_time always wins. This lets authors opt a temporal-typed column (e.g. an audit created_at) out of time-dimension classification with is_time: false.
Snowflake converter updated. _classify_field now honors explicit is_time first, then falls back to the temporal-datatype default. Nine new tests cover the datatype paths, the mixed-metadata case (d_year with datatype: integer and is_time: true), and the opt-out case. The five original is_time tests are unchanged, proving back-compat.
Example updated. examples/tpcds_semantic_model.yaml demonstrates three coexistence patterns: datatype only (majority of fields), datatype plus is_time: true coexisting (d_year), and is_time: true only (legacy style, retained on two fields as a compatibility demonstration). 25 fields and 5 metrics gained datatype.
Docs. converters/index.md explains the type-vs-role split and the converter classification rule; docs/index.md glossary mentions datatype on the Field entry. A new Datatype and is_time: type vs. role section in core-spec/spec.md covers the combinations, default rule, consumer guidance, and migration recipe, with a Precedent note citing Snowflake Semantic Views and LookML.
Also added .gitignore for __pycache__ directories.
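
The default rule and the converter fallback described above can be sketched in a few lines. This is a minimal illustration, not the converter's actual code: `resolve_is_time` and `classify` are hypothetical names, the real `_classify_field` signature is not shown in this PR description, and the sketch assumes the field already has a dimension block.

```python
from typing import Optional

# Temporal members of the datatype enum, as listed in the PR description.
TEMPORAL_DATATYPES = {"date", "time", "timestamp", "timestamp_tz"}


def resolve_is_time(datatype: Optional[str], is_time: Optional[bool]) -> bool:
    """Return the effective is_time flag for a field (hypothetical helper)."""
    # An explicit is_time always wins, including an explicit False.
    if is_time is not None:
        return is_time
    # Otherwise fall back to the temporal-datatype default.
    return datatype in TEMPORAL_DATATYPES


def classify(datatype: Optional[str] = None, is_time: Optional[bool] = None) -> str:
    """Classify a field that has a dimension block (assumed here)."""
    return "time_dimension" if resolve_is_time(datatype, is_time) else "dimension"


# The combinations called out in the description:
print(classify("integer", True))     # d_year: integer type, temporal role
print(classify("timestamp"))         # temporal type, is_time unset
print(classify("timestamp", False))  # audit created_at opted out
print(classify(None, True))          # legacy style, is_time only
```

Keeping the explicit flag as a three-state value (`True`, `False`, unset) is what lets the opt-out case differ from the unset case.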

Impact on existing implementations

The change is additive and backward-compatible:

Models with only is_time. Continue to validate and convert unchanged. The Snowflake converter's time_dimension classification is preserved for every is_time: true field, including non-temporal grain columns like d_year.
New models with only datatype. Work as expected. A temporal datatype on a field with a dimension block classifies as a time_dimension via the default rule.
Models that set is_time: false explicitly on a temporal-typed column. Classification is now dimension, not time_dimension. This is intentional: the author has opted out. The only way to reach this case before this PR was to author an unusual model where is_time: false was on a column that was implicitly temporal in the consumer's view. If such models exist, they will flip classification; we judged this the right behavior because ignoring an explicit is_time: false is what consumers had implicitly been doing, and that silent override is worse than the opt-out.

I didn't update the Snowflake exporter to emit the data type itself, because Snowflake already tracks every column's type internally. If we were to add a Snowflake importer, it would be important to map the Snowflake data types onto these OSI data types.
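
For a sense of what such an importer mapping might involve, here is a rough sketch. Nothing in it comes from this PR: `snowflake_to_osi` is a hypothetical function, the table covers only a handful of common Snowflake type names, and choices such as sending `TIMESTAMP_LTZ` to `timestamp_tz` or everything unrecognized to `other` are assumptions that would need to be settled in the spec.

```python
import re

# Hypothetical mapping from Snowflake type names to the OSI datatype enum.
_SIMPLE_TYPES = {
    "VARCHAR": "string", "CHAR": "string", "STRING": "string", "TEXT": "string",
    "INT": "integer", "INTEGER": "integer", "BIGINT": "integer", "SMALLINT": "integer",
    "FLOAT": "number", "DOUBLE": "number", "REAL": "number",
    "BOOLEAN": "boolean",
    "DATE": "date",
    "TIME": "time",
    "TIMESTAMP_NTZ": "timestamp", "DATETIME": "timestamp",
    "TIMESTAMP_TZ": "timestamp_tz", "TIMESTAMP_LTZ": "timestamp_tz",  # LTZ choice is an assumption
}
_FIXED_POINT = {"NUMBER", "DECIMAL", "NUMERIC"}

# Matches "NAME", "NAME(p)" or "NAME(p, s)".
_TYPE_RE = re.compile(r"\s*([A-Za-z_]+)\s*(?:\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\))?\s*")


def snowflake_to_osi(sf_type: str) -> str:
    """Map a Snowflake type string to an OSI datatype value (sketch only)."""
    match = _TYPE_RE.fullmatch(sf_type)
    if match is None:
        return "other"
    base = match.group(1).upper()
    # Snowflake fixed-point types default to scale 0 when no scale is given.
    scale = int(match.group(3)) if match.group(3) else 0
    if base in _FIXED_POINT:
        return "integer" if scale == 0 else "number"
    return _SIMPLE_TYPES.get(base, "other")


print(snowflake_to_osi("NUMBER(38,0)"))
print(snowflake_to_osi("NUMBER(10, 2)"))
print(snowflake_to_osi("TIMESTAMP_NTZ(9)"))
print(snowflake_to_osi("GEOGRAPHY"))
```

Note that precision and scale are discarded here, which is the information loss discussed in the review thread below.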

Introduces a top-level `datatype` on Field and Metric with a closed logical enum: string, integer, number, boolean, date, time, timestamp, timestamp_tz, other. Addresses issue open-semantic-interchange#84.

`datatype` and `dimension.is_time` are independent and orthogonal:
- `datatype` declares the field's logical data type (casting/serialization).
- `is_time` is a temporal-role marker (time-series analysis, temporal filtering).

A field with `is_time: true` may carry any `datatype` (e.g. integer for a year grain, string for a month name, date for a calendar date). When `is_time` is unset, it defaults to `true` for temporal datatypes (`date`, `time`, `timestamp`, `timestamp_tz`) and `false` otherwise. Explicit `is_time` always wins, so authors can set `is_time: false` on an audit `created_at` to keep it off the time axis.

Taxonomy and type/role split were chosen after benchmarking 14 peer semantic layers and 5 portable type standards. Notable precedent: Snowflake Semantic Views' YAML authoring form has a `time_dimensions:` collection whose entries can carry any `data_type` (the published example annotates `order_year` with `data_type: NUMBER`); LookML's `dimension_group` accepts `date`, `datetime`, `timestamp`, `epoch`, and `yyyymmdd`.

Snowflake converter updated: `_classify_field` honors explicit `is_time` first, then falls back to the temporal-datatype default. 9 new tests cover the datatype paths and the mixed-metadata cases (`d_year` with `datatype: integer` and `is_time: true`, audit timestamp opt-out, etc.).

tpcds_semantic_model.yaml demonstrates three coexistence patterns: datatype-only, datatype + is_time, and is_time-only.

Also added .gitignore for __pycache__ directories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

KSDaemon · 2026-04-27T11:30:06Z

+|----------|-------------|
+| `string` | Variable-length Unicode character data. |
+| `integer` | Signed integer with no scale. |
+| `number` | Real number (floating-point or decimal) with unspecified precision. |


One concern here is that using a generic number type isn’t really a valid interoperability strategy. Different data warehouses interpret and implement it differently, so it’s not interchangeable in practice.

For example, you can’t take a model built against data in Snowflake and expect it to behave identically in Databricks just because both use a number type. As we’ve already discussed in other issues/threads, precision and scale vary across systems, and that directly affects results.

Without explicitly specifying those parameters, this approach is likely to introduce subtle bugs. If strict type checking is enforced and there’s no automatic conversion, things will simply break. Even with implicit or explicit casting at the physical layer, you can still end up with precision loss, which is not acceptable for many use cases.

So I think we need a different approach here. Either we make the type definition explicit (e.g., always specifying precision and scale for numeric types), or we introduce some abstraction that preserves correctness across backends instead of relying on a loosely defined number.
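
To make the precision concern concrete, here is a small illustration using Python's standard `decimal` module rather than any warehouse. It is only an analogy: `cast_to_scale` is a hypothetical helper, and the half-up rounding mode is an assumption, since each backend defines its own behavior when a value is cast to a smaller scale.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floating point cannot represent 0.1 or 0.2 exactly; decimal can.
float_sum = 0.1 + 0.2
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(float_sum == 0.3)               # False
print(decimal_sum == Decimal("0.3"))  # True


def cast_to_scale(value: Decimal, scale: int) -> Decimal:
    """Simulate storing a value in a fixed-point column with the given scale."""
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives Decimal("0.01")
    return value.quantize(quantum, rounding=ROUND_HALF_UP)


# The same logical value lands differently depending on the declared scale.
print(cast_to_scale(Decimal("123.456"), 2))  # 123.46
print(cast_to_scale(Decimal("1.5"), 0))      # 2
```

A bare `number` type tells a consumer nothing about which of these outcomes to expect.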

Thanks for the feedback @KSDaemon. The frame of reference I'm coming from is that these types need to be precise enough to generate SQL on top of (e.g. a BI tool building on top of the semantic definitions). In my experience, the SQL required for integers vs float/decimal can differ for warehouses that use truncation semantics for integers. I have not come across cases where I've needed to generate different SQL depending on whether the underlying type is a float or decimal.

But it is worth taking a step back to ask what else folks would rely on these logical data types for. Do you have any particular use cases in mind?
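
As a sketch of the integer-versus-number distinction mentioned above, a SQL generator might cast integer operands before dividing so that warehouses with truncating integer division return a fractional result. The function name `ratio_sql` and the `DOUBLE` cast target are illustrative assumptions; the right cast type varies by dialect.

```python
def ratio_sql(numerator: str, denominator: str, numerator_datatype: str) -> str:
    """Build a division expression that avoids integer truncation (sketch)."""
    if numerator_datatype == "integer":
        # Assumption: the target dialect accepts CAST(... AS DOUBLE).
        numerator = f"CAST({numerator} AS DOUBLE)"
    # NULLIF guards against division by zero in most SQL dialects.
    return f"{numerator} / NULLIF({denominator}, 0)"


print(ratio_sql("SUM(units_sold)", "COUNT(order_id)", "integer"))
print(ratio_sql("SUM(net_paid)", "COUNT(order_id)", "number"))

# The same truncation effect, shown with Python's own operators:
print(7 // 2)  # 3
print(7 / 2)   # 3.5
```

In this sketch the generated SQL depends only on whether the type is `integer`; a float and a decimal operand take the same path.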

Following up. Are there cases beyond the generation of SQL for semantic queries that you have in mind to support here? I'm happy to add decimal as data type, but I'd like to understand a motivating scenario for how the spec would be used in such a way that number and decimal would be treated differently.

Using JSON Schema data types (https://json-schema.org/understanding-json-schema/reference/type) as a minimum would be a good idea.

xavipereztr · 2026-04-27T14:31:18Z

Hi @jonmmease !

Thanks for your work on this PR! Some comments, based on the discussions we have opened about including spatial semantics in the OSI spec:

Spatial dimension type: extending dimension with a spatial descriptor for geometry/geography and spatial index data #114
Support field datatype rather than is_time #84

On geometry and geography as first-class datatype values

We'd like to make the case for adding geometry and geography to the
datatype enum rather than leaving spatial columns to fall through to other.

Both are first-class native types in the leading cloud data warehouses:

Snowflake: GEOGRAPHY and GEOMETRY
BigQuery: GEOGRAPHY (WGS84 point/line/polygon)
Databricks: GEOMETRY
PostGIS: geometry and geography

The geometry / geography distinction is also meaningful and standardized
(OGC/ISO SQL): geography operates on a spherical earth model with real-world
units, while geometry operates on a flat Cartesian plane. Consumers need to
know which they're dealing with to generate correct spatial SQL.

Leaving these as other + custom_extensions means every OSI implementation
dealing with spatial data falls back to vendor-specific metadata, which we at CARTO believe should not happen.

On dimension.spatial as the spatial role marker

The type/role split you've introduced gives spatial data a natural place to
live. We're working on a proposal (related to #69)
that would add a spatial object to the dimension block as the spatial
equivalent of is_time — a role marker that carries the metadata consuming
tools need: geometry type, SRID, spatial index system and resolution, and
geographic level. It would look like this:

- name: geom
  datatype: geometry              # proposed new enum value
  dimension:
    spatial:                      # proposed role marker + metadata
      type: polygon
      srid: 4326
      geographic_level: "census_block_group"
      geographic_hierarchy:
        parent: "census_tract"
        children: []
  description: "Block group boundary polygon (WGS84)"

- name: h3
  datatype: string                # H3 index is stored as a string
  dimension:
    spatial:
      type: spatial_index
      index:
        system: h3
        resolution: 8
        rollup_resolutions: [7, 6, 5]
      geographic_level: "h3_cell_res8"
  description: "H3 cell index at resolution 8"

This maps exactly onto the pattern you're establishing: datatype for the
data type question, dimension.spatial for the role/semantics question.
We can open a separate discussion on the spatial descriptor, but wanted to
flag it here so the geometry/geography datatype values are on the radar
before the enum is locked in.

On number precision — +1 to @KSDaemon's concern. We raised the same
point on issue #84 and in a comment there. Worth resolving before this merges.

jonmmease · 2026-04-27T17:17:24Z

Hi @xavipereztr, your descriptions of geometry and geography make good sense to me as additional data types, though I'd lean toward incorporating these as type additions in your proposal in #69 rather than this PR.

I'm not opposed to adding a decimal type, but I'd like to see a concrete case where a consumer of the semantic model would act differently for a float vs. a decimal column or expression type.

You mentioned in the other issue:

So a value like 1.5 in a column declared as unqualified NUMBER/NUMERIC survives on BigQuery but is silently truncated to 2 on Snowflake and Databricks. Precision ceilings also diverge (29 vs 38 total digits).

If this value is in a column with a decimal type in BigQuery or Snowflake, it's either already truncated or the precision accommodates it. If I'm writing SQL generation logic on top of this dimension, are there cases where this needs to differ when the underlying type is decimal vs a float type?

Another factor, for translation purposes, is whether any mainstream semantic layers already distinguish these numeric types (Cube and LookML do not, as far as I've seen). If so, that would also be a good argument for separating decimal into a dedicated type.

Grounding this in a use case would also help me think through whether decimal by itself is useful without the specific scale and precision parameters, or whether these parameters would be necessary.

xavipereztr · 2026-04-28T09:45:09Z

Hi @jonmmease - on the geometry and geography aspect, are you suggesting we create a PR based on #69 or shall we wait for further discussion with OSI members?

On the numeric simplification, I guess you are right that distinguishing between integer and number is largely enough for most use cases.

jonmmease · 2026-04-28T13:12:25Z

I only meant that I think it makes sense to attach the additional geometry/geography data types to the dimension.spatial_data proposal in #69, and for these to be evaluated together through the OSI standards process (the details of which I'm not familiar with).

Add Ontology & Semantic Interoperability as a top-level current effort with links to discussions open-semantic-interchange#22, open-semantic-interchange#101, open-semantic-interchange#108, open-semantic-interchange#68 and PRs open-semantic-interchange#124, open-semantic-interchange#125. Update existing sections with recently opened discussions, PRs, and converters: metric trees (open-semantic-interchange#40), primary key semantics (open-semantic-interchange#15, open-semantic-interchange#119), reusable datasets (open-semantic-interchange#103, open-semantic-interchange#109), datatype/is_time reframe (PR open-semantic-interchange#113), spatial dimension types (open-semantic-interchange#114), default_aggregation (open-semantic-interchange#115), positive direction (open-semantic-interchange#41), physical metadata (open-semantic-interchange#110), and new converter PRs for Salesforce (open-semantic-interchange#118), dbt (open-semantic-interchange#116), and Databricks (open-semantic-interchange#120). Also adds CONTRIBUTING.md (open-semantic-interchange#122) and working groups page (open-semantic-interchange#123) to Developer Experience, and lists merged converters as existing artifacts. Made-with: Cursor


jochenchrist · 2026-06-03T17:59:10Z

+| `boolean` | Logical two-valued truth type. |
+| `date` | Calendar date with no time-of-day component. |
+| `time` | Time-of-day with no date component. |
+| `timestamp` | Instant-in-time without timezone offset (naive / local). |


Maybe the default timestamp should be with tz (this is the most common way to use timestamps, and it avoids typical errors). Then add a timestamp_ntz for the special case when no timezone is really expected.
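
The kind of error at stake can be shown with Python's standard `datetime` module, which draws the same naive/aware line. This is only an analogy for the `timestamp` vs. `timestamp_tz` distinction, not a statement about how any warehouse behaves.

```python
from datetime import datetime, timedelta, timezone

# A naive timestamp: no offset, so the instant it refers to is ambiguous.
naive = datetime(2026, 1, 1, 12, 0)

# Two aware timestamps that name the same instant in different offsets.
aware_utc = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
aware_plus2 = datetime(2026, 1, 1, 14, 0, tzinfo=timezone(timedelta(hours=2)))

print(aware_utc == aware_plus2)  # True: same instant

# Ordering a naive value against an aware one is rejected outright.
try:
    naive < aware_utc
    mixed_comparison_ok = True
except TypeError:
    mixed_comparison_ok = False
print(mixed_comparison_ok)  # False
```

Whichever name ends up as the default, consumers need the spec to say unambiguously which of the two kinds a column holds.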

jonmmease mentioned this pull request Apr 24, 2026

Support field datatype rather than is_time #84

Open

KSDaemon reviewed Apr 27, 2026

View reviewed changes

STHITAPRAJNAS mentioned this pull request May 8, 2026

Add OSI ↔ Databricks Unity Catalog Metric View converter #120

Open

5 tasks

jklahr mentioned this pull request May 18, 2026

Add ontology section and update roadmap with latest discussions #126

Merged

jochenchrist reviewed Jun 3, 2026

View reviewed changes

jonmmease wants to merge 1 commit into open-semantic-interchange:main from jonmmease:add-datatype-to-spec
