Add Apache Spark converter module #93
Introduces a Java/Maven converter under converters/spark/ that parses OSI semantic model YAML files and generates PySpark code including dataset loaders, join helpers, and metric functions.
A very simple converter for Apache Spark, showing the kind of code we can have in the repo.
/**
 * Java representation of an OSI semantic model parsed from YAML.
 */
public class OsiModel {
There was a problem hiding this comment.
The Java representation code seems to be duplicated in both the spark and polaris modules. How about we follow this directory structure instead:
core-spec/
│ ├── spec.yaml
│ ├── spec.json
│ └── spec.proto
We could then use the proto version of the spec to dynamically generate the Java/Python/Go bindings for the spec.
I think dbt was also suggesting that. Many applications can also consume the proto version of the spec directly.
For the purposes of this PR, I think it is fine to add the Java bindings, but I would still keep them in one folder.
There was a problem hiding this comment.
I followed the existing layout: the other converters are directly in the converters module (the Python-based ones, for instance).
I'm happy to open a preliminary PR to reorganize the modules for multi-language bindings.
public class OsiModel {

    private String version;
    private List<SemanticModel> semanticModels = new ArrayList<>();
There was a problem hiding this comment.
Also, there is no parent OSI model that holds a list of semantic_models.
The OSI spec describes a single semantic_model.
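As a rough illustration of this point, the root Java type could hold a single semantic model directly rather than a wrapper with a list. This is only a sketch: the class name `SemanticModelSketch`, the `name` field, and the use of plain strings for datasets and metrics are assumptions for illustration, not the actual OSI schema or the classes in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical root type: one semantic model per spec file, no wrapper list.
public class SemanticModelSketch {
    private String name;
    // Plain strings stand in for the real dataset and metric types (assumption).
    private final List<String> datasets = new ArrayList<>();
    private final List<String> metrics = new ArrayList<>();

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public List<String> getDatasets() { return datasets; }
    public List<String> getMetrics() { return metrics; }

    public static void main(String[] args) {
        SemanticModelSketch model = new SemanticModelSketch();
        model.setName("tpcds");
        model.getDatasets().add("store_sales");
        model.getMetrics().add("total_sales");
        System.out.println(model.getName() + ": "
                + model.getDatasets() + " " + model.getMetrics());
    }
}
```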
sb.append(" return spark.sql(\"SELECT ").append(escapeString(expr))
    .append(" AS ").append(metric.getName()).append("\")\n\n");
There was a problem hiding this comment.
Hey @jbonofre, as far as I know, Spark does not have native support for querying a semantic model.
This code generates some SQL, but I don't think that query will just work in Spark. We need to align on the query semantics in OSI and then have a query rewriter implementation in Spark.
I suggest we keep code generation out of scope for now and instead add the logic to load the semantic model into the Spark catalog?
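To make the concern concrete, here is a minimal Java sketch of the string that the two append calls in the snippet above would build. Only those append calls come from the diff. The class name `MetricSqlSketch`, the `render` helper, the four-space indent, and the sample metric `total_sales` with expression `SUM(ss_net_paid)` are assumptions for illustration, and `escapeString` is left out.

```java
public class MetricSqlSketch {

    // Hypothetical helper mirroring the two append calls from the diff.
    // escapeString(...) is omitted; the sample expression needs no escaping.
    static String render(String name, String expr) {
        StringBuilder sb = new StringBuilder();
        sb.append("    return spark.sql(\"SELECT ").append(expr)
          .append(" AS ").append(name).append("\")\n\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample metric values are made up for this sketch.
        System.out.print(render("total_sales", "SUM(ss_net_paid)"));
    }
}
```

As far as the visible snippet shows, the emitted statement has a SELECT list but no FROM clause, so Spark would most likely fail to resolve the column the metric expression refers to.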
There was a problem hiding this comment.
That's reasonable. I will update accordingly.
khush-bhatia left a comment
Left some comments, thanks.
Summary
Adds converters/spark/, a Java/Maven module that reads OSI semantic model YAML files and generates PySpark code
Generates dataset loaders (spark.table() + computed columns via F.expr()), join helpers from relationship definitions, metric aggregation functions via spark.sql(), and a convenience load_all_datasets() function
Test plan
mvn clean test (4 tests covering parsing, field extraction, code generation, and multi-dialect support)
examples/tpcds_semantic_model.yaml produces valid PySpark output