fix(optimizer)!: annotate type for databricks REGR_AVGY, REGR_COUNT, REGR_INTERCEPT, REGR_R2, REGR_SLOPE#7820

Open
fivetran-amrutabhimsenayachit wants to merge 2 commits into main from type-inference-batch-3

@fivetran-amrutabhimsenayachit fivetran-amrutabhimsenayachit commented Jul 1, 2026

Collaborator

Summary

Adds Databricks type inference support for REGR_AVGY (DOUBLE), REGR_COUNT (BIGINT), REGR_INTERCEPT (DOUBLE), REGR_R2 (DOUBLE), and REGR_SLOPE (DOUBLE), plus fixture coverage for all five functions.

Issue: REGR_FUNC(DISTINCT col1, col2) raised a parse error in Databricks because the base parser's DISTINCT handler consumed all comma-separated arguments into a single node, leaving the second required argument missing.

Fix: Added a custom parser method in DatabricksParser that reads only the first argument under DISTINCT, then parses the rest normally.
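The difference between the two behaviours can be sketched in plain Python. This is a toy model only, not the sqlglot parser: `Distinct`, `split_distinct`, `parse_greedy`, and `parse_first_only` are hypothetical names, and arguments are split naively on commas.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Distinct:
    """Toy stand-in for a DISTINCT wrapper node (hypothetical, not sqlglot's class)."""
    expressions: list[str]


def split_distinct(arg_sql: str) -> tuple[bool, list[str]]:
    """Split 'DISTINCT a, b' into (True, ['a', 'b']). No nested commas supported."""
    text = arg_sql.strip()
    has_distinct = text.upper().startswith("DISTINCT ")
    if has_distinct:
        text = text[len("DISTINCT "):]
    return has_distinct, [part.strip() for part in text.split(",")]


def parse_greedy(arg_sql: str) -> list:
    """Models the issue: DISTINCT swallows every comma-separated argument."""
    has_distinct, args = split_distinct(arg_sql)
    return [Distinct(args)] if has_distinct else args


def parse_first_only(arg_sql: str) -> list:
    """Models the fix: DISTINCT wraps only the first argument."""
    has_distinct, args = split_distinct(arg_sql)
    if not has_distinct:
        return args
    return [Distinct(args[:1]), *args[1:]]


sql_args = "DISTINCT tbl.double_col, tbl.double_col"
greedy = parse_greedy(sql_args)
fixed = parse_first_only(sql_args)
print(len(greedy), len(fixed))  # 1 2
```

With the greedy variant a two-argument function sees a single argument, which is the shape of the failure described above. The first-only variant keeps both argument slots filled.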

Tickets

  • RD-1229638 (REGR_AVGY) — DOUBLE
  • RD-1229639 (REGR_COUNT) — BIGINT
  • RD-1229640 (REGR_INTERCEPT) — DOUBLE
  • RD-1229641 (REGR_R2) — DOUBLE
  • RD-1229642 (REGR_SLOPE) — DOUBLE
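The ticket list amounts to a function-to-type table. A minimal sketch of that mapping in plain Python follows; `REGR_RETURN_TYPES` and `regr_return_type` are hypothetical names for illustration and do not reflect how sqlglot stores its annotators.

```python
# Return types as listed in the tickets above (illustrative lookup only).
REGR_RETURN_TYPES = {
    "REGR_AVGY": "DOUBLE",
    "REGR_COUNT": "BIGINT",
    "REGR_INTERCEPT": "DOUBLE",
    "REGR_R2": "DOUBLE",
    "REGR_SLOPE": "DOUBLE",
}


def regr_return_type(function_name: str) -> str:
    """Look up the annotated type; raises KeyError for functions not in this PR."""
    return REGR_RETURN_TYPES[function_name.upper()]


print(regr_return_type("regr_count"))  # BIGINT
```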

Test plan

python3 -c "import sqlglot; print(repr(sqlglot.parse_one('SELECT REGR_AVGY(DISTINCT tbl.double_col, tbl.double_col) FROM tbl', dialect='databricks').expressions[0]))"


RegrAvgy(
  this=Distinct(
    expressions=[
      Column(
        this=Identifier(this=double_col, quoted=False),
        table=Identifier(this=tbl, quoted=False))]),
  expression=Column(
    this=Identifier(this=double_col, quoted=False),
    table=Identifier(this=tbl, quoted=False)))

  • make style — PASS
  • make unit — PASS

github-actions Bot commented Jul 1, 2026

Contributor

SQLGlot Integration Test Results

✅ All tests passed

Comparing:

  • this branch (sqlglot:type-inference-batch-3 @ sqlglot b8a2c8a)
  • baseline (main @ sqlglot aedf83a)

Overall

main: 192416 total, 153530 passed (pass rate: 79.8%)

sqlglot:type-inference-batch-3: 180222 total, 142385 passed (pass rate: 79.0%)

Transitions:
No change

Dialect pair changes: 0 previous results not found, 3 current results not found

@geooo109 geooo109 self-assigned this Jul 2, 2026
Comment threads (5): tests/fixtures/optimizer/annotate_functions.sql
@geooo109 geooo109 changed the title from "feat(typing): add databricks type inference for REGR_AVGY, REGR_COUNT, REGR_INTERCEPT, REGR_R2, REGR_SLOPE" to "fix(optimizer)!: annotate type for databricks REGR_AVGY, REGR_COUNT, REGR_INTERCEPT, REGR_R2, REGR_SLOPE" Jul 2, 2026

@geooo109 geooo109 left a comment

Collaborator

Left a comment. We should also add roundtrip tests for ALL/DISTINCT if they are missing.

Comment on lines +35 to +41
**SparkParser.FUNCTION_PARSERS,
"REGR_AVGX": lambda self: self._parse_regr(exp.RegrAvgx),
"REGR_AVGY": lambda self: self._parse_regr(exp.RegrAvgy),
"REGR_COUNT": lambda self: self._parse_regr(exp.RegrCount),
"REGR_INTERCEPT": lambda self: self._parse_regr(exp.RegrIntercept),
"REGR_R2": lambda self: self._parse_regr(exp.RegrR2),
"REGR_SLOPE": lambda self: self._parse_regr(exp.RegrSlope),

@geooo109 geooo109 Jul 3, 2026

Collaborator

Did you verify what DISTINCT applies to for each function in the REGR list (one argument, or both arguments)?

For example, in REGR_AVGX and REGR_AVGY it seems DISTINCT is applied to one argument (x and y, respectively). For REGR_COUNT, on the other hand, DISTINCT is applied to both arguments (as a tuple). So the parsing function should separate the arguments based on this, rather than separating them for every function in the REGR_ list.

So, let's verify each function and parse accordingly.
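One way to picture the per-function handling is a small dispatch table. This is a toy sketch in plain Python, not sqlglot code: `Distinct`, `DISTINCT_SCOPE`, and `parse_regr_args` are hypothetical names, and the scope entries encode only the tentative reading above, which still needs to be verified against Databricks. Functions without a verified scope are rejected rather than guessed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Distinct:
    """Toy stand-in for a DISTINCT wrapper node (hypothetical)."""
    expressions: list[str]


# Assumption: tentative scopes from the review comment, not verified behaviour.
DISTINCT_SCOPE = {
    "REGR_AVGX": "first",  # DISTINCT seems to apply to one argument
    "REGR_AVGY": "first",
    "REGR_COUNT": "all",   # DISTINCT seems to apply to both arguments as a tuple
}


def parse_regr_args(name: str, args: list[str], distinct: bool) -> list:
    """Wrap arguments in Distinct according to the function's recorded scope."""
    if not distinct:
        return list(args)
    scope = DISTINCT_SCOPE.get(name.upper())
    if scope == "first":
        return [Distinct([args[0]]), *args[1:]]
    if scope == "all":
        return [Distinct(list(args))]
    # Scope unknown: refuse to guess until the function has been checked.
    raise NotImplementedError(f"DISTINCT scope not verified for {name}")


print(parse_regr_args("REGR_AVGY", ["x", "y"], distinct=True))
print(parse_regr_args("REGR_COUNT", ["x", "y"], distinct=True))
```

Keeping the scope in data rather than in separate parser methods would make it easy to fill in REGR_INTERCEPT, REGR_R2, and REGR_SLOPE once each has been checked.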

return self.expression(exp.ClusterProperty(this=self._prev.text.upper()))
return super()._parse_cluster_property()

def _parse_regr(self, expr_type: type[exp.AggFunc]) -> exp.AggFunc:

Collaborator

Looks pretty similar to hive's _parse_quantile_function, right?

2 participants