Add datatype field to Field and Metric; reframe is_time as role marker #113
jonmmease wants to merge 1 commit into
Introduces a top-level `datatype` on Field and Metric with a closed logical enum: string, integer, number, boolean, date, time, timestamp, timestamp_tz, other. Addresses issue open-semantic-interchange#84.

`datatype` and `dimension.is_time` are independent and orthogonal:
- `datatype` declares the field's logical data type (casting/serialization).
- `is_time` is a temporal-role marker (time-series analysis, temporal filtering).

A field with `is_time: true` may carry any `datatype` (e.g. integer for a year grain, string for a month name, date for a calendar date). When `is_time` is unset, it defaults to `true` for temporal datatypes (`date`, `time`, `timestamp`, `timestamp_tz`) and `false` otherwise. Explicit `is_time` always wins, so authors can set `is_time: false` on an audit `created_at` to keep it off the time axis.

Taxonomy and type/role split were chosen after benchmarking 14 peer semantic layers and 5 portable type standards. Notable precedent: Snowflake Semantic Views' YAML authoring form has a `time_dimensions:` collection whose entries can carry any `data_type` (the published example annotates `order_year` with `data_type: NUMBER`); LookML's `dimension_group` accepts `date`, `datetime`, `timestamp`, `epoch`, and `yyyymmdd`.

Snowflake converter updated: `_classify_field` honors explicit `is_time` first, then falls back to the temporal-datatype default. 9 new tests cover the datatype paths and the mixed-metadata cases (`d_year` with `datatype: integer` and `is_time: true`, audit timestamp opt-out, etc.).

tpcds_semantic_model.yaml demonstrates three coexistence patterns: datatype-only, datatype + is_time, and is_time-only.

Also added .gitignore for __pycache__ directories.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|----------|-------------|
| `string` | Variable-length Unicode character data. |
| `integer` | Signed integer with no scale. |
| `number` | Real number (floating-point or decimal) with unspecified precision. |
One concern here is that using a generic number type isn’t really a valid interoperability strategy. Different data warehouses interpret and implement it differently, so it’s not interchangeable in practice.
For example, you can’t take a model built against data in Snowflake and expect it to behave identically in Databricks just because both use a number type. As we’ve already discussed in other issues/threads, precision and scale vary across systems, and that directly affects results.
Without explicitly specifying those parameters, this approach is likely to introduce subtle bugs. If strict type checking is enforced and there’s no automatic conversion, things will simply break. Even with implicit or explicit casting at the physical layer, you can still end up with precision loss, which is not acceptable for many use cases.
So I think we need a different approach here. Either we make the type definition explicit (e.g., always specifying precision and scale for numeric types), or we introduce some abstraction that preserves correctness across backends instead of relying on a loosely defined number.
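To make the precision concern concrete, here is a minimal Python sketch, not tied to any warehouse, of what happens when a fixed-point value is pushed through a 64-bit float. The helper name `survives_float_round_trip` is made up for illustration, and the sample value is only assumed to be the kind of number a `DECIMAL(38, 2)` column could hold.

```python
from decimal import Decimal

# A value with 19 significant digits, more than a float64 can carry.
exact = Decimal("12345678901234567.89")
as_float = float(exact)  # nearest float64 to the exact value


def survives_float_round_trip(value: Decimal) -> bool:
    """Return True when converting to float64 and back preserves the value."""
    return Decimal(float(value)) == value


print(survives_float_round_trip(Decimal("0.5")))  # 0.5 is exactly representable
print(survives_float_round_trip(exact))           # the fractional digits are lost
print(as_float)
```

Whether a given backend takes this path depends on how it maps a generic `number` onto its physical types, which is the open question in this thread.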
Thanks for the feedback @KSDaemon. The frame of reference I'm coming from is that these types need to be precise enough to generate SQL on top of (e.g. a BI tool building on top of the semantic definitions). In my experience, the SQL required for integers vs float/decimal can differ for warehouses that use truncation semantics for integers. I have not come across cases where I've needed to generate different SQL depending on whether the underlying type is a float or decimal.
But it is worth taking a step back to ask what else folks would rely on these logical data types for. Do you have any particular use cases in mind?
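As a rough illustration of the integer case described above, the sketch below shows how a SQL generator might branch on the logical type. `ratio_sql` and its `integer_division_truncates` flag are hypothetical names, and the `DOUBLE PRECISION` cast is an assumption about the target dialect rather than anything the spec prescribes.

```python
def ratio_sql(
    numerator: str,
    denominator: str,
    numerator_datatype: str,
    integer_division_truncates: bool,
) -> str:
    """Build a ratio expression, casting integer numerators when the
    (assumed) target dialect truncates integer division."""
    if numerator_datatype == "integer" and integer_division_truncates:
        numerator = f"CAST({numerator} AS DOUBLE PRECISION)"
    # NULLIF guards against division by zero in either case.
    return f"{numerator} / NULLIF({denominator}, 0)"


# Python's own operators show the two semantics side by side.
truncated = 7 // 2   # integer division drops the fraction
fractional = 7 / 2   # true division keeps it

print(ratio_sql("SUM(qty)", "COUNT(*)", "integer", True))
print(ratio_sql("SUM(price)", "COUNT(*)", "number", True))
```

Under this sketch a `number` column needs no special handling regardless of whether it is physically a float or a decimal, which is the point being made in the comment.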
Following up. Are there cases beyond the generation of SQL for semantic queries that you have in mind to support here? I'm happy to add decimal as a data type, but I'd like to understand a motivating scenario for how the spec would be used in such a way that number and decimal would be treated differently.
Using JSON Schema data types (https://json-schema.org/understanding-json-schema/reference/type) as a minimum would be a good idea.
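For reference, a quick comparison of the proposed enum against the JSON Schema primitive `type` names. The enum list is copied from this PR; the variable names are only for illustration.

```python
# The seven primitive type names defined by JSON Schema.
JSON_SCHEMA_TYPES = {"string", "number", "integer", "object", "array", "boolean", "null"}

# The closed enum proposed in this PR.
OSI_DATATYPES = [
    "string", "integer", "number", "boolean",
    "date", "time", "timestamp", "timestamp_tz", "other",
]

# Enum values that share a name with a JSON Schema primitive type.
shared = [d for d in OSI_DATATYPES if d in JSON_SCHEMA_TYPES]
# Enum values with no same-named JSON Schema primitive type.
not_shared = [d for d in OSI_DATATYPES if d not in JSON_SCHEMA_TYPES]

print(shared)
print(not_shared)
```

The four scalar names line up by name; the temporal values have no primitive counterpart, since JSON Schema handles dates and times through `format` annotations on strings.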
Hi @jonmmease! Thanks for your work on this PR! Some comments based on the discussions we have opened about including spatial semantics in the OSI spec:
On We'd like to make the case for adding Both are first-class native types in the leading cloud data warehouses:
The Leaving these as On The type/role split you've introduced gives spatial data a natural place to
- name: geom
datatype: geometry # proposed new enum value
dimension:
spatial: # proposed role marker + metadata
type: polygon
srid: 4326
geographic_level: "census_block_group"
geographic_hierarchy:
parent: "census_tract"
children: []
description: "Block group boundary polygon (WGS84)"
- name: h3
datatype: string # H3 index is stored as a string
dimension:
spatial:
type: spatial_index
index:
system: h3
resolution: 8
rollup_resolutions: [7, 6, 5]
geographic_level: "h3_cell_res8"
    description: "H3 cell index at resolution 8"
This maps exactly onto the pattern you're establishing: On
Hi @xavipereztr, your descriptions of I'm not opposed to adding a decimal type, but I'd be interested to articulate a case where a consumer of the semantic model would act differently for a float vs decimal column or expression type. You mentioned in the other issue:
If this value is in a column with a decimal type in BigQuery or Snowflake, it's either already truncated or the precision accommodates it. If I'm writing SQL generation logic on top of this dimension, are there cases where this needs to differ when the underlying type is a decimal vs. a float type? Another factor would be, for translation purposes, whether any mainstream semantic layers already distinguish these numeric types (Cube and LookML do not, as far as I've seen). If so, that would also be a good argument to separate decimal as a dedicated type. Grounding this in a use case would also help me think through whether
Hi @jonmmease - on the On the
I only meant that I think it makes sense to attach the additional
Add Ontology & Semantic Interoperability as a top-level current effort with links to discussions open-semantic-interchange#22, open-semantic-interchange#101, open-semantic-interchange#108, open-semantic-interchange#68 and PRs open-semantic-interchange#124, open-semantic-interchange#125. Update existing sections with recently opened discussions, PRs, and converters: metric trees (open-semantic-interchange#40), primary key semantics (open-semantic-interchange#15, open-semantic-interchange#119), reusable datasets (open-semantic-interchange#103, open-semantic-interchange#109), datatype/is_time reframe (PR open-semantic-interchange#113), spatial dimension types (open-semantic-interchange#114), default_aggregation (open-semantic-interchange#115), positive direction (open-semantic-interchange#41), physical metadata (open-semantic-interchange#110), and new converter PRs for Salesforce (open-semantic-interchange#118), dbt (open-semantic-interchange#116), and Databricks (open-semantic-interchange#120). Also adds CONTRIBUTING.md (open-semantic-interchange#122) and working groups page (open-semantic-interchange#123) to Developer Experience, and lists merged converters as existing artifacts. Made-with: Cursor
| `boolean` | Logical two-valued truth type. |
| `date` | Calendar date with no time-of-day component. |
| `time` | Time-of-day with no date component. |
| `timestamp` | Instant-in-time without timezone offset (naive / local). |
Maybe the default `timestamp` should be with tz (this is the most common use for timestamps and avoids typical errors), then add a `timestamp_ntz` for the special case when no timezone is really expected.
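The kind of error being alluded to can be shown with Python's standard `datetime` module, used here purely as an analogy for the naive vs. timezone-aware distinction. The helper `is_timezone_aware` and the sample values are illustrative only.

```python
from datetime import datetime, timedelta, timezone

naive = datetime(2024, 3, 10, 2, 30)                        # no offset attached
aware_utc = datetime(2024, 3, 10, 2, 30, tzinfo=timezone.utc)
aware_pst = datetime(2024, 3, 10, 2, 30, tzinfo=timezone(timedelta(hours=-8)))


def is_timezone_aware(value: datetime) -> bool:
    """True when the value carries a usable UTC offset."""
    return value.tzinfo is not None and value.utcoffset() is not None


# Ordering a naive value against an aware one is rejected outright.
try:
    naive < aware_utc
    comparable = True
except TypeError:
    comparable = False

# The same wall-clock reading at different offsets is a different instant.
gap = aware_pst - aware_utc

print(is_timezone_aware(naive), is_timezone_aware(aware_utc))
print(comparable)
print(gap)
```

A naive value only becomes an instant once some zone is assumed, which is why the choice of default matters for interchange.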
AI Disclosure: I drove this PR and description below iteratively with Claude Code, and have reviewed it fully
Motivation
Closes #84.
This PR is the outcome of the discussion I started in #25 (converted to #84), also related to #17.
The issue originally framed this as "use `datatype` rather than `is_time`". After a bit more thought and investigation into existing semantic layer implementations, I think it makes more sense to add data types as a new `field`/`metric` attribute, and keep `is_time` as a role attribute. This fits cleanly with the example in this repo where `d_year`, `d_quarter_name`, `d_month_name` have different data types, but temporal roles. Snowflake Semantic Views (YAML form) and LookML both split the same two axes.

So this PR adds `datatype` alongside `is_time` and documents them as independent properties.

What changes

- New `datatype` enum on `Field` and `Metric` (optional, top-level). Values: `string`, `integer`, `number`, `boolean`, `date`, `time`, `timestamp`, `timestamp_tz`, `other`. Logical types, not SQL-physical. Use `other` plus `custom_extensions` for types outside the enum.
- `dimension.is_time` stays valid. It is documented as a temporal-role marker, independent of `datatype`. A field may carry `datatype: integer` with `is_time: true` (e.g. `d_year`), or `datatype: timestamp` with no `is_time` (a plain timestamp column), or `is_time: true` alone (legacy-style).
- Default rule for `is_time`. When unset, `is_time` defaults to `true` if `datatype` is one of `date`, `time`, `timestamp`, `timestamp_tz`, and `false` otherwise. Explicit `is_time` always wins. This lets authors opt a temporal-typed column (e.g. an audit `created_at`) out of time-dimension classification with `is_time: false`.
- Snowflake converter updated. `_classify_field` now honors explicit `is_time` first, then falls back to the temporal-datatype default. Nine new tests cover the datatype paths, the mixed-metadata case (`d_year` with `datatype: integer` and `is_time: true`), and the opt-out case. The five original `is_time` tests are unchanged, proving back-compat.
- Example updated. `examples/tpcds_semantic_model.yaml` demonstrates three coexistence patterns: `datatype` only (majority of fields), `datatype` plus `is_time: true` coexisting (`d_year`), and `is_time: true` only (legacy style, retained on two fields as a compatibility demonstration). 25 fields and 5 metrics gained `datatype`.
- Docs. `converters/index.md` explains the type-vs-role split and the converter classification rule; `docs/index.md` glossary mentions `datatype` on the Field entry. A new `Datatype and is_time: type vs. role` section in `core-spec/spec.md` covers the combinations, default rule, consumer guidance, and migration recipe, with a Precedent note citing Snowflake Semantic Views and LookML.

Also added .gitignore for `__pycache__` directories.
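The default rule can be sketched in a few lines of Python. This is an illustration of the rule as described in this PR, not the converter's actual `_classify_field`; the function names `resolve_is_time` and `classify_dimension` and the dict-shaped field are assumptions, and only fields that carry a `dimension` block are considered.

```python
from typing import Any, Mapping

# Datatypes for which is_time defaults to true when it is left unset.
TEMPORAL_DATATYPES = frozenset({"date", "time", "timestamp", "timestamp_tz"})


def resolve_is_time(field: Mapping[str, Any]) -> bool:
    """Explicit is_time wins; otherwise fall back to the datatype default."""
    dimension = field.get("dimension") or {}
    explicit = dimension.get("is_time")
    if explicit is not None:
        return bool(explicit)
    return field.get("datatype") in TEMPORAL_DATATYPES


def classify_dimension(field: Mapping[str, Any]) -> str:
    """Classify a field that has a dimension block (hypothetical helper)."""
    return "time_dimension" if resolve_is_time(field) else "dimension"


d_year = {"name": "d_year", "datatype": "integer", "dimension": {"is_time": True}}
created_at = {"name": "created_at", "datatype": "timestamp", "dimension": {"is_time": False}}
sold_at = {"name": "sold_at", "datatype": "timestamp", "dimension": {}}

for f in (d_year, created_at, sold_at):
    print(f["name"], classify_dimension(f))
```

`created_at` and `sold_at` are made-up field names; `d_year` mirrors the mixed-metadata case from the example model.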
Impact on existing implementations
The change is additive and backward-compatible:
- `is_time`. Continue to validate and convert unchanged. The Snowflake converter's `time_dimension` classification is preserved for every `is_time: true` field, including non-temporal grain columns like `d_year`.
- `datatype`. Work as expected. A temporal `datatype` on a field with a `dimension` block classifies as a `time_dimension` via the default rule.
- `is_time: false` explicitly on a temporal-typed column. Classification is now `dimension`, not `time_dimension`. This is intentional: the author has opted out. The only way to reach this case before this PR was to author an unusual model where `is_time: false` was on a column that was implicitly temporal in the consumer's view. If such models exist, they will flip classification; we judged this the right behavior because ignoring an explicit `is_time: false` is what consumers had implicitly been doing, and that silent override is worse than the opt-out.

I didn't update the Snowflake exporter to add the data type itself, as Snowflake has the types of everything internally. If we were to add a Snowflake importer, it would be important to map the Snowflake data types onto these OSI data types.
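For the importer direction, a mapping could look roughly like the sketch below. Nothing like this exists in the PR: `snowflake_to_osi_datatype` is a hypothetical helper, and the choices to treat scale-0 `NUMBER` as `integer`, `TIMESTAMP_LTZ` as `timestamp_tz`, and bare `TIMESTAMP` as `timestamp` (Snowflake's default mapping, which is configurable) are assumptions that would need discussion.

```python
import re

_FIXED_POINT = re.compile(
    r"(NUMBER|DECIMAL|NUMERIC)\s*(?:\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\))?"
)

_BY_NAME = {
    **dict.fromkeys(["INT", "INTEGER", "BIGINT", "SMALLINT", "TINYINT", "BYTEINT"], "integer"),
    **dict.fromkeys(["FLOAT", "FLOAT4", "FLOAT8", "DOUBLE", "DOUBLE PRECISION", "REAL"], "number"),
    **dict.fromkeys(["VARCHAR", "CHAR", "CHARACTER", "STRING", "TEXT"], "string"),
    "BOOLEAN": "boolean",
    "DATE": "date",
    "TIME": "time",
    **dict.fromkeys(["TIMESTAMP", "TIMESTAMP_NTZ", "DATETIME"], "timestamp"),
    **dict.fromkeys(["TIMESTAMP_TZ", "TIMESTAMP_LTZ"], "timestamp_tz"),
}


def snowflake_to_osi_datatype(sf_type: str) -> str:
    """Map a Snowflake type string to the proposed OSI enum (sketch only)."""
    normalized = sf_type.strip().upper()
    fixed = _FIXED_POINT.fullmatch(normalized)
    if fixed:
        # Scale defaults to 0 when omitted, e.g. NUMBER or NUMBER(38).
        scale = int(fixed.group(3)) if fixed.group(3) else 0
        return "integer" if scale == 0 else "number"
    base = normalized.split("(")[0].strip()  # drop length/precision suffix
    return _BY_NAME.get(base, "other")       # VARIANT, GEOGRAPHY, ... -> other


for t in ["NUMBER(38,0)", "NUMBER(10,2)", "VARCHAR(16)", "TIMESTAMP_NTZ(9)", "GEOGRAPHY"]:
    print(t, "->", snowflake_to_osi_datatype(t))
```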