
Add datatype field to Field and Metric; reframe is_time as role marker#113

Open
jonmmease wants to merge 1 commit into
open-semantic-interchange:mainfrom
jonmmease:add-datatype-to-spec

Conversation

@jonmmease

AI Disclosure: I drove this PR and the description below iteratively with Claude Code, and have reviewed both fully.

Motivation

Closes #84.

This PR is the outcome of the discussion I started in #25 (converted to #84), also related to #17.

The issue originally framed this as "use datatype rather than is_time". After more thought and investigation into existing semantic layer implementations, I think it makes more sense to add data types as a new field/metric attribute and keep is_time as a role attribute. This fits cleanly with the example in this repo, where d_year, d_quarter_name, and d_month_name have different data types but all play temporal roles. Snowflake Semantic Views (YAML form) and LookML both split the same two axes.

So this PR adds datatype alongside is_time and documents them as independent properties.

What changes

  1. New datatype enum on Field and Metric (optional, top-level). Values: string, integer, number, boolean, date, time, timestamp, timestamp_tz, other. Logical types, not SQL-physical. Use other plus custom_extensions for types outside the enum.

  2. dimension.is_time stays valid. It is documented as a temporal-role marker, independent of datatype. A field may carry datatype: integer with is_time: true (e.g. d_year), or datatype: timestamp with no is_time (a plain timestamp column), or is_time: true alone (legacy-style).

  3. Default rule for is_time. When unset, is_time defaults to true if datatype is one of date, time, timestamp, timestamp_tz, and false otherwise. Explicit is_time always wins. This lets authors opt a temporal-typed column (e.g. an audit created_at) out of time-dimension classification with is_time: false.

  4. Snowflake converter updated. _classify_field now honors explicit is_time first, then falls back to the temporal-datatype default. Nine new tests cover the datatype paths, the mixed-metadata case (d_year with datatype: integer and is_time: true), and the opt-out case. The five original is_time tests are unchanged, proving back-compat.

  5. Example updated. examples/tpcds_semantic_model.yaml demonstrates three coexistence patterns: datatype only (majority of fields), datatype plus is_time: true coexisting (d_year), and is_time: true only (legacy style, retained on two fields as a compatibility demonstration). 25 fields and 5 metrics gained datatype.

  6. Docs. converters/index.md explains the type-vs-role split and the converter classification rule; docs/index.md glossary mentions datatype on the Field entry. A new Datatype and is_time: type vs. role section in core-spec/spec.md covers the combinations, default rule, consumer guidance, and migration recipe, with a Precedent note citing Snowflake Semantic Views and LookML.

  7. Also added .gitignore for pycache directories.
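The default rule in item 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the converter's actual code: `resolve_is_time` is a hypothetical helper name, and the enum and temporal sets are copied from items 1 and 3 above.

```python
from typing import Optional

# Enum values as listed in item 1 of this description.
DATATYPES = frozenset({
    "string", "integer", "number", "boolean",
    "date", "time", "timestamp", "timestamp_tz", "other",
})
# Datatypes that default to is_time: true (item 3).
TEMPORAL_DATATYPES = frozenset({"date", "time", "timestamp", "timestamp_tz"})


def resolve_is_time(datatype: Optional[str], is_time: Optional[bool]) -> bool:
    """Hypothetical helper: return the effective is_time for a field."""
    if datatype is not None and datatype not in DATATYPES:
        raise ValueError(f"unknown datatype: {datatype!r}")
    if is_time is not None:
        # An explicit is_time always wins over the datatype default.
        return is_time
    # Unset is_time: true only for temporal datatypes.
    return datatype in TEMPORAL_DATATYPES


print(resolve_is_time("integer", True))     # d_year style: True
print(resolve_is_time("timestamp", None))   # default rule: True
print(resolve_is_time("timestamp", False))  # audit created_at opt-out: False
print(resolve_is_time(None, True))          # legacy is_time only: True
print(resolve_is_time("string", None))      # plain string column: False
```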

Impact on existing implementations

The change is additive and backward-compatible:

  • Models with only is_time. Continue to validate and convert unchanged. The Snowflake converter's time_dimension classification is preserved for every is_time: true field, including non-temporal grain columns like d_year.
  • New models with only datatype. Work as expected. A temporal datatype on a field with a dimension block classifies as a time_dimension via the default rule.
  • Models that set is_time: false explicitly on a temporal-typed column. Classification is now dimension, not time_dimension. This is intentional: the author has opted out. The only way to reach this case before this PR was to author an unusual model where is_time: false was on a column that was implicitly temporal in the consumer's view. If such models exist, they will flip classification; we judged this the right behavior because ignoring an explicit is_time: false is what consumers had implicitly been doing, and that silent override is worse than the opt-out.
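To check which of the three cases above a given model falls into, a small audit pass over its fields may help. The sketch below is hypothetical: it assumes fields are plain dicts with a top-level `datatype` and an optional `dimension` block holding `is_time`, and the function name and category labels are invented for illustration.

```python
from typing import Any, Mapping, Optional

TEMPORAL_DATATYPES = frozenset({"date", "time", "timestamp", "timestamp_tz"})


def impact_category(field: Mapping[str, Any]) -> Optional[str]:
    """Hypothetical audit helper: label a field by the impact case it matches."""
    dimension = field.get("dimension")
    if dimension is None:
        return None  # fields without a dimension block are not modeled here
    datatype = field.get("datatype")
    is_time = dimension.get("is_time")
    if datatype is None and is_time is not None:
        return "is_time_only"       # first bullet: converts unchanged
    if datatype is not None and is_time is None:
        return "datatype_only"      # second bullet: default rule applies
    if datatype in TEMPORAL_DATATYPES and is_time is False:
        return "explicit_opt_out"   # third bullet: classified as dimension
    return "other_combination"      # e.g. d_year: integer plus is_time: true


fields = [
    {"name": "d_date", "dimension": {"is_time": True}},
    {"name": "sold_at", "datatype": "timestamp", "dimension": {}},
    {"name": "created_at", "datatype": "timestamp", "dimension": {"is_time": False}},
    {"name": "d_year", "datatype": "integer", "dimension": {"is_time": True}},
]
for f in fields:
    print(f["name"], impact_category(f))
```

Only the fields reported as `explicit_opt_out` are candidates for the classification flip described in the third bullet.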

I didn't update the Snowflake exporter to add the data type itself, as Snowflake has the types of everything internally. If we were to add a Snowflake importer, it would be important to map the Snowflake data types onto these OSI data types.
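As a rough idea of what such an importer mapping might look like, here is a hypothetical sketch. It is not part of this PR: the function name, the choice to treat scale-0 fixed-point types as `integer`, and the choice to map `TIMESTAMP_LTZ` to `timestamp_tz` are all assumptions that would need review.

```python
import re

# Assumed mapping from Snowflake type names to the proposed OSI enum.
_SIMPLE_TYPES = {
    "VARCHAR": "string", "CHAR": "string", "STRING": "string", "TEXT": "string",
    "BOOLEAN": "boolean",
    "FLOAT": "number", "DOUBLE": "number", "REAL": "number",
    "INT": "integer", "INTEGER": "integer", "BIGINT": "integer", "SMALLINT": "integer",
    "DATE": "date",
    "TIME": "time",
    "TIMESTAMP": "timestamp",      # assumes the default TIMESTAMP_TYPE_MAPPING (NTZ)
    "TIMESTAMP_NTZ": "timestamp",
    "TIMESTAMP_TZ": "timestamp_tz",
    "TIMESTAMP_LTZ": "timestamp_tz",  # assumption: session-local zone treated as tz-aware
}
_FIXED_POINT = {"NUMBER", "DECIMAL", "NUMERIC"}
# Matches e.g. "NUMBER", "NUMBER(38)", "NUMBER(10, 2)", "TIMESTAMP_NTZ(9)".
_TYPE_RE = re.compile(r"^\s*([A-Za-z_]+)\s*(?:\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\))?\s*$")


def snowflake_to_osi(type_text: str) -> str:
    """Hypothetical importer helper: map a Snowflake type string to an OSI datatype."""
    match = _TYPE_RE.match(type_text)
    if match is None:
        return "other"
    base = match.group(1).upper()
    if base in _FIXED_POINT:
        # Snowflake fixed-point types default to scale 0 when no scale is given.
        scale = int(match.group(3)) if match.group(3) else 0
        return "integer" if scale == 0 else "number"
    # Anything unmapped (GEOGRAPHY, VARIANT, ...) falls through to "other".
    return _SIMPLE_TYPES.get(base, "other")


for text in ["NUMBER(38,0)", "NUMBER(10, 2)", "VARCHAR(16777216)", "TIMESTAMP_NTZ(9)", "GEOGRAPHY"]:
    print(text, "->", snowflake_to_osi(text))
```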

Introduces a top-level `datatype` on Field and Metric with a closed logical
enum: string, integer, number, boolean, date, time, timestamp, timestamp_tz,
other. Addresses issue open-semantic-interchange#84.

`datatype` and `dimension.is_time` are independent and orthogonal:

- `datatype` declares the field's logical data type (casting/serialization).
- `is_time` is a temporal-role marker (time-series analysis, temporal
  filtering). A field with `is_time: true` may carry any `datatype` (e.g.
  integer for a year grain, string for a month name, date for a calendar
  date).

When `is_time` is unset, it defaults to `true` for temporal datatypes
(`date`, `time`, `timestamp`, `timestamp_tz`) and `false` otherwise.
Explicit `is_time` always wins, so authors can set `is_time: false` on
an audit `created_at` to keep it off the time axis.

Taxonomy and type/role split were chosen after benchmarking 14 peer
semantic layers and 5 portable type standards. Notable precedent:
Snowflake Semantic Views' YAML authoring form has a `time_dimensions:`
collection whose entries can carry any `data_type` (the published example
annotates `order_year` with `data_type: NUMBER`); LookML's `dimension_group`
accepts `date`, `datetime`, `timestamp`, `epoch`, and `yyyymmdd`.

Snowflake converter updated: `_classify_field` honors explicit `is_time`
first, then falls back to the temporal-datatype default. 9 new tests
cover the datatype paths and the mixed-metadata cases (`d_year` with
`datatype: integer` and `is_time: true`, audit timestamp opt-out, etc.).

tpcds_semantic_model.yaml demonstrates three coexistence patterns:
datatype-only, datatype + is_time, and is_time-only.

Also added .gitignore for __pycache__ directories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread core-spec/spec.md
|----------|-------------|
| `string` | Variable-length Unicode character data. |
| `integer` | Signed integer with no scale. |
| `number` | Real number (floating-point or decimal) with unspecified precision. |

One concern here is that using a generic number type isn’t really a valid interoperability strategy. Different data warehouses interpret and implement it differently, so it’s not interchangeable in practice.

For example, you can’t take a model built against data in Snowflake and expect it to behave identically in Databricks just because both use a number type. As we’ve already discussed in other issues/threads, precision and scale vary across systems, and that directly affects results.

Without explicitly specifying those parameters, this approach is likely to introduce subtle bugs. If strict type checking is enforced and there’s no automatic conversion, things will simply break. Even with implicit or explicit casting at the physical layer, you can still end up with precision loss, which is not acceptable for many use cases.

So I think we need a different approach here. Either we make the type definition explicit (e.g., always specifying precision and scale for numeric types), or we introduce some abstraction that preserves correctness across backends instead of relying on a loosely defined number.
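The precision-loss concern can be illustrated outside any warehouse with Python's own float and Decimal types. This is only an analogy for the float-versus-decimal distinction, not a statement about how Snowflake or Databricks behave.

```python
from decimal import Decimal

# A value with more significant digits than a 64-bit float can hold.
exact = Decimal("12345678901234567890.12")
round_tripped = Decimal(float(exact))
print(round_tripped == exact)  # False: digits were lost in the cast to float

# Repeated addition drifts with binary floats but not with decimals.
float_total = sum([0.1] * 10)
decimal_total = sum(Decimal("0.1") for _ in range(10))
print(float_total == 1.0)              # False
print(decimal_total == Decimal("1.0"))  # True
```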

Author


Thanks for the feedback @KSDaemon. The frame of reference I'm coming from is that these types need to be precise enough to generate SQL on top of (e.g. a BI tool building on top of the semantic definitions). In my experience, the SQL required for integers vs float/decimal can differ for warehouses that use truncation semantics for integers. I have not come across cases where I've needed to generate different SQL depending on whether the underlying type is a float or decimal.
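A minimal sketch of the integer-versus-non-integer case described above, assuming a hypothetical SQL generator that is told whether the target warehouse truncates integer division. The function name, its parameters, and the `* 1.0` promotion trick are illustrative assumptions, not code from any existing BI tool.

```python
def ratio_sql(
    numerator: str,
    denominator: str,
    numerator_type: str,
    denominator_type: str,
    integer_division_truncates: bool,
) -> str:
    """Hypothetical generator: build a ratio expression from OSI datatypes."""
    safe_denominator = f"NULLIF({denominator}, 0)"  # avoid division by zero
    both_integer = numerator_type == "integer" and denominator_type == "integer"
    if both_integer and integer_division_truncates:
        # Promote to a non-integer type so 1 / 2 yields 0.5 rather than 0.
        return f"({numerator} * 1.0) / {safe_denominator}"
    # For number operands the same SQL works for float and decimal alike.
    return f"{numerator} / {safe_denominator}"


print(ratio_sql("returns", "orders", "integer", "integer", True))
print(ratio_sql("revenue", "orders", "number", "integer", True))
```

In this sketch the `integer` versus `number` distinction changes the generated SQL, while float versus decimal does not.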

But it is worth taking a step back to ask what else folks would rely on these logical data types for. Do you have any particular use cases in mind?

Author


Following up. Are there cases beyond the generation of SQL for semantic queries that you have in mind to support here? I'm happy to add decimal as a data type, but I'd like to understand a motivating scenario in which a user of the spec would treat number and decimal differently.

Contributor


Using JSON Schema data types (https://json-schema.org/understanding-json-schema/reference/type) as a minimum would be a good idea.

@xavipereztr

Hi @jonmmease !

Thanks for your work on this PR! Some comments, based on the discussions we have opened about including spatial semantics in the OSI spec:

On geometry and geography as first-class datatype values

We'd like to make the case for adding geometry and geography to the
datatype enum rather than leaving spatial columns to fall through to other.

Both are first-class native types in the leading cloud data warehouses:

  • Snowflake: GEOGRAPHY and GEOMETRY
  • BigQuery: GEOGRAPHY (WGS84 point/line/polygon)
  • Databricks: GEOMETRY
  • PostGIS: geometry and geography

The geometry / geography distinction is also meaningful and standardized
(OGC/ISO SQL): geography operates on a spherical earth model with real-world
units, while geometry operates on a flat Cartesian plane. Consumers need to
know which they're dealing with to generate correct spatial SQL.
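To make the distinction concrete, the sketch below computes a distance between the same coordinate pairs under both models. It is a plain-Python illustration with an assumed mean Earth radius, not the implementation used by any of the warehouses listed above.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in meters


def planar_distance(a, b):
    """Geometry-style distance: Cartesian, in the units of the coordinates."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def great_circle_distance_m(a, b):
    """Geography-style distance: haversine on a sphere, in meters."""
    lon1, lat1 = map(math.radians, a)
    lon2, lat2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


# One degree of longitude at the equator and at 60 degrees north, as (lon, lat).
equator = ((0.0, 0.0), (1.0, 0.0))
north = ((0.0, 60.0), (1.0, 60.0))

print(planar_distance(*equator), planar_distance(*north))  # both 1.0 (degrees)
print(great_circle_distance_m(*equator))  # roughly 111 km
print(great_circle_distance_m(*north))    # roughly 56 km
```

The planar result is identical for both pairs, while the spherical result differs by about a factor of two, which is why a consumer generating spatial SQL needs to know which type it is dealing with.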

Leaving these as other + custom_extensions means every OSI implementation
dealing with spatial data falls back to vendor-specific metadata, something that we at CARTO believe should not happen.

On dimension.spatial as the spatial role marker

The type/role split you've introduced gives spatial data a natural place to
live. We're working on a proposal (related to #69)
that would add a spatial object to the dimension block as the spatial
equivalent of is_time — a role marker that carries the metadata consuming
tools need: geometry type, SRID, spatial index system and resolution, and
geographic level. It would look like this:

- name: geom
  datatype: geometry              # proposed new enum value
  dimension:
    spatial:                      # proposed role marker + metadata
      type: polygon
      srid: 4326
      geographic_level: "census_block_group"
      geographic_hierarchy:
        parent: "census_tract"
        children: []
  description: "Block group boundary polygon (WGS84)"

- name: h3
  datatype: string                # H3 index is stored as a string
  dimension:
    spatial:
      type: spatial_index
      index:
        system: h3
        resolution: 8
        rollup_resolutions: [7, 6, 5]
      geographic_level: "h3_cell_res8"
  description: "H3 cell index at resolution 8"

This maps exactly onto the pattern you're establishing: datatype for the
data type question, dimension.spatial for the role/semantics question.
We can open a separate discussion on the spatial descriptor, but wanted to
flag it here so the geometry/geography datatype values are on the radar
before the enum is locked in.

On number precision — +1 to @KSDaemon's concern. We raised the same
point on issue #84 and in a comment there. Worth resolving before this merges.

@jonmmease
Author

Hi @xavipereztr, your descriptions of geometry and geography make good sense to me as additional data types, though I'd lean toward incorporating these as type additions in your proposal in #69 rather than this PR.

I'm not opposed to adding a decimal type, but I'd be interested to articulate a case where a consumer of the semantic model would act differently for a float vs decimal column or expression type.

You mentioned in the other issue:

So a value like 1.5 in a column declared as unqualified NUMBER/NUMERIC survives on BigQuery but is silently truncated to 2 on Snowflake and Databricks. Precision ceilings also diverge (29 vs 38 total digits).

If this value is in a column with a decimal type in BigQuery or Snowflake, it's either already truncated or the precision accommodates it. If I'm writing SQL generation logic on top of this dimension, are there cases where this needs to differ when the underlying type is decimal vs a float type?

Another factor, for translation purposes, is whether any mainstream semantic layers already distinguish these numeric types (Cube and LookML do not, as far as I've seen). If so, that would also be a good argument for separating decimal out as a dedicated type.

Grounding this in a use case would also help me think through whether decimal by itself is useful without the specific scale and precision parameters, or whether these parameters would be necessary.

@xavipereztr

Hi @jonmmease - on the geometry and geography aspect, are you suggesting we create a PR based on #69 or shall we wait for further discussion with OSI members?

On the numeric simplification, I think you are right that the integer vs. number distinction is enough for most use cases.

@jonmmease
Author

I only meant that I think it makes sense to attach the additional geometry/geography data types to the dimension.spatial_data proposal in #69, and for these to be evaluated together through the OSI standards process (whose details I'm not familiar with).

jklahr pushed a commit to jklahr/jklahr-osi that referenced this pull request May 18, 2026
Add Ontology & Semantic Interoperability as a top-level current effort
with links to discussions open-semantic-interchange#22, open-semantic-interchange#101, open-semantic-interchange#108, open-semantic-interchange#68 and PRs open-semantic-interchange#124, open-semantic-interchange#125.

Update existing sections with recently opened discussions, PRs, and
converters: metric trees (open-semantic-interchange#40), primary key semantics (open-semantic-interchange#15, open-semantic-interchange#119),
reusable datasets (open-semantic-interchange#103, open-semantic-interchange#109), datatype/is_time reframe (PR open-semantic-interchange#113),
spatial dimension types (open-semantic-interchange#114), default_aggregation (open-semantic-interchange#115), positive
direction (open-semantic-interchange#41), physical metadata (open-semantic-interchange#110), and new converter PRs for
Salesforce (open-semantic-interchange#118), dbt (open-semantic-interchange#116), and Databricks (open-semantic-interchange#120). Also adds
CONTRIBUTING.md (open-semantic-interchange#122) and working groups page (open-semantic-interchange#123) to Developer
Experience, and lists merged converters as existing artifacts.

Made-with: Cursor
Comment thread core-spec/spec.md
| `boolean` | Logical two-valued truth type. |
| `date` | Calendar date with no time-of-day component. |
| `time` | Time-of-day with no date component. |
| `timestamp` | Instant-in-time without timezone offset (naive / local). |
Contributor


Maybe the default timestamp should be with tz (this is the most common use for timestamps and avoids typical errors). Then add a timestamp_ntz for the special case where no timezone is expected.
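One example of the "typical errors" in question, shown with Python's datetime module as an analogy: timezone-aware values compare as instants, while mixing naive and aware values fails outright. The helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

naive = datetime(2026, 1, 1, 12, 0, 0)  # no offset: a wall-clock reading
aware_utc = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
aware_plus2 = datetime(2026, 1, 1, 14, 0, 0, tzinfo=timezone(timedelta(hours=2)))


def can_subtract(a: datetime, b: datetime) -> bool:
    """Hypothetical helper: report whether a - b is well defined."""
    try:
        a - b
    except TypeError:
        return False
    return True


# Aware timestamps denote instants, so equal instants compare equal.
print(aware_utc == aware_plus2)            # True
print(can_subtract(aware_utc, aware_plus2))  # True
# A naive timestamp has no defined instant, so mixing the two kinds fails.
print(can_subtract(aware_utc, naive))      # False
```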


Development

Successfully merging this pull request may close these issues.

Support field datatype rather than is_time

4 participants