fix(databricks)!: string-promote COALESCE/IF/CASE per findWiderCommonType [CLAUDE]#7682

Closed
RichardHughes-amp wants to merge 2 commits into tobymao:main from RichardHughes-amp:fix-databricks-lct-string-promotion

@RichardHughes-amp
Contributor

Problem

For the databricks dialect, least-common-type functions — COALESCE (and IFNULL/NVL, which parse to Coalesce), IF, and CASE — annotate as the numeric type when an argument is text and the rest are numeric. Real Databricks resolves these to string (Spark's findWiderCommonType string promotion).

from sqlglot.optimizer.annotate_types import annotate_types
from sqlglot import parse_one

schema = {"tbl": {"int_col": "INT", "str_col": "STRING"}}
# Parse with the Databricks dialect, then annotate types against the schema.
e = parse_one("SELECT COALESCE(tbl.int_col, tbl.str_col) FROM tbl", dialect="databricks")
print(annotate_types(e, schema=schema, dialect="databricks").selects[0].type)
# before: BIGINT   after: STRING

spark/spark2/hive are already correct (they string-promote via their lattice); only databricks is affected.

Root cause

Databricks defines its own COERCES_TO lattice in the opposite direction from Hive/Spark — text coerces into numeric/temporal rather than numeric/temporal into text. LCT functions fold through that lattice via _annotate_by_args_maybe_coerce, so the numeric type always wins regardless of argument order.
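A minimal sketch of why the fold is independent of argument order. This is a toy model, not sqlglot's implementation: the rank tables and `fold_common_type` are hypothetical names, and the ranks only illustrate the two coercion directions described above.

```python
from functools import reduce

# Hypothetical rank tables: a type coerces into any type with a higher rank.
# Databricks-style direction (as described above): text coerces into numeric.
DATABRICKS_STYLE = {"STRING": 0, "BIGINT": 1, "DOUBLE": 2}
# Hive/Spark-style direction: numeric coerces into text.
SPARK_STYLE = {"BIGINT": 0, "DOUBLE": 1, "STRING": 2}


def fold_common_type(arg_types, ranks):
    """Fold argument types pairwise, keeping the higher-ranked type at each step."""
    return reduce(lambda a, b: a if ranks[a] >= ranks[b] else b, arg_types)


# Taking the maximum over a total order does not depend on argument order,
# so the numeric type wins in the Databricks-style table either way.
print(fold_common_type(["BIGINT", "STRING"], DATABRICKS_STYLE))  # BIGINT
print(fold_common_type(["STRING", "BIGINT"], DATABRICKS_STYLE))  # BIGINT
print(fold_common_type(["BIGINT", "STRING"], SPARK_STYLE))       # STRING
```

Swapping the rank table is the only change between the two results, which is why the difference comes down to the direction of the lattice rather than to the functions themselves.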

That lattice was introduced deliberately in #5096 on the premise that Databricks defaults to ANSI mode (where open-source Spark's AnsiTypeCoercion.findWiderTypeForString returns LONG for string + int, not string).

Why the ANSI premise doesn't hold for Databricks

Verified on a Databricks serverless warehouse (which is always ANSI):

SELECT
  typeof(coalesce(cast(1 as int), 'abc')),                  -- string
  typeof(coalesce(cast(1.5 as double), 'abc')),             -- string
  typeof(coalesce(cast('2020-01-01' as date), 'abc')),      -- string
  typeof(coalesce(interval 1 day, 'abc')),                  -- string
  typeof(if(true, cast(1 as int), 'abc')),                  -- string
  typeof(case when true then cast(1 as int) else 'abc' end); -- string

Databricks string-promotes these functions even under ANSI, contrary to open-source Spark's AnsiTypeCoercion. It follows the non-ANSI stringPromotion rule — (StringType, AtomicType) if t2 != BinaryType && t2 != BooleanType (plus interval). The excluded combinations error:

coalesce(cast(1 as boolean), 'abc')  -> [DATATYPE_MISMATCH.DATA_DIFF_TYPES]
coalesce(cast(1 as binary), 'abc')   -> [DATATYPE_MISMATCH.DATA_DIFF_TYPES]

Fix

Three Databricks-scoped annotators for Coalesce/If/Case: if a value argument is text and none is boolean/binary, the result is text; otherwise fall back to the existing numeric-widening behavior. boolean/binary + text and GREATEST/LEAST (which require a common type and error on mixed text/numeric in Databricks) are intentionally left on the fallback — there is no representable "type mismatch error" annotation, so their current best-effort type is retained.
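The rule proposed here can be sketched in plain Python. This is an illustration under stated assumptions, not the code in this PR: `annotate_lct`, `existing_behavior`, and the type-name sets are hypothetical, and types are modelled as plain strings rather than sqlglot data types.

```python
# Hypothetical type-name sets, for illustration only.
TEXT_TYPES = {"STRING", "VARCHAR", "CHAR"}
NO_PROMOTION_TYPES = {"BOOLEAN", "BINARY"}


def annotate_lct(arg_types, fallback):
    """Return the result type for the value arguments of COALESCE/IF/CASE.

    arg_types: type names of the value arguments (conditions are excluded).
    fallback: callable that implements the pre-existing annotation.
    """
    has_text = any(t in TEXT_TYPES for t in arg_types)
    has_excluded = any(t in NO_PROMOTION_TYPES for t in arg_types)
    if has_text and not has_excluded:
        return "STRING"
    # boolean/binary + text, and all-non-text inputs, keep the old behavior.
    return fallback(arg_types)


def existing_behavior(arg_types):
    # Stand-in only: the real fallback is whatever the annotator did before.
    return "FALLBACK"


print(annotate_lct(["INT", "STRING"], existing_behavior))      # STRING
print(annotate_lct(["BOOLEAN", "STRING"], existing_behavior))  # FALLBACK
print(annotate_lct(["INT", "DOUBLE"], existing_behavior))      # FALLBACK
```

Passing the fallback as a callable keeps the string-promotion check separate from the existing widening logic, which mirrors the "otherwise fall back" wording above.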

Scoped to Databricks only; Spark/Hive and all other dialects are unchanged. Arithmetic coercion ('5' + 3 -> INT) is unaffected — it goes through BINARY_COERCIONS, not COERCES_TO.

Tests

tests/fixtures/optimizer/annotate_functions.sql: the Databricks COALESCE/IF rows for text + numeric/date/interval now assert STRING; boolean/binary rows keep their current annotation; added COALESCE/NVL/CASE cases and an all-numeric regression. Each asserted result matches the empirical typeof(...) output above.

RichardHughes-amp and others added 2 commits May 26, 2026 19:23
…LAUDE]

Databricks string-promotes least-common-type functions when an argument
is text and the rest are non-boolean/non-binary atomics. Verified on a
Databricks serverless (always-ANSI) warehouse:

  typeof(coalesce(cast(1 as int), 'abc'))  -> string
  typeof(coalesce(cast(1.5 as double), 'abc')) -> string
  typeof(coalesce(cast('2020-01-01' as date), 'abc')) -> string
  typeof(coalesce(interval 1 day, 'abc')) -> string
  typeof(if(true, cast(1 as int), 'abc')) -> string
  typeof(case when true then cast(1 as int) else 'abc' end) -> string

boolean+string and binary+string raise DATATYPE_MISMATCH in Databricks,
so those rows keep their current (best-effort) annotation.

These fixtures fail until the annotator fix lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ype [CLAUDE]

Databricks resolves least-common-type functions (COALESCE/IFNULL/NVL,
IF, CASE) with Spark's findWiderCommonType string promotion: when a value
argument is text and the rest are non-boolean/non-binary atomics, the
result is text. Previously these folded through Databricks' COERCES_TO
lattice (text coerces into numeric/temporal), so coalesce(int_col,
str_col) annotated as BIGINT instead of STRING, independent of argument
order.

Verified on a Databricks serverless (always-ANSI) warehouse: coalesce/if/
case of text + numeric/date/interval -> string; text + boolean/binary
raises DATATYPE_MISMATCH, so those defer to the existing numeric-widening
fallback. Note this diverges from open-source Spark AnsiTypeCoercion
(which promotes string+int to long under ANSI) -- Databricks SQL string-
promotes these functions regardless of ANSI mode.

Scoped to Databricks; Spark/Hive (non-ANSI, already string-promoting via
their inverted lattice) and other dialects are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@RichardHughes-amp
Contributor Author

We're just gonna ignore this one for a while until we've got some consensus as to why I'm seeing completely different behavior in Databricks from what seems to be expected.

@geooo109
Collaborator

@RichardHughes-amp thanks for the PR once again, let me take a look.

@geooo109
Collaborator

geooo109 commented May 27, 2026

@RichardHughes-amp can you run SET -v in Databricks and confirm that ansi_mode is true?

If it isn't, set it with SET ansi_mode = true.

In Databricks, ANSI mode defaults to true.

Comment on lines +412 to +414
# dialect: databricks
CASE WHEN cond THEN tbl.int_col ELSE tbl.str_col END;
STRING;
Collaborator

@geooo109 geooo109 May 27, 2026

I didn't work on CASE in the original PR. It seems Databricks promotes the type here to BIGINT; let's check Hive/Spark and use promote=true if that is the case.

Comment on lines 397 to +398
  COALESCE(tbl.interval_col, tbl.str_col);
- INTERVAL;
+ STRING;
Collaborator

@geooo109 geooo109 May 27, 2026

Let's remove this test entirely. (isn't valid for ANSI, probably missed it)

@geooo109 geooo109 self-assigned this May 27, 2026
@RichardHughes-amp
Contributor Author

It turns out I had ANSI mode off! All of my confusion stemmed from me previously getting incorrect instructions on how to check if my Databricks warehouse was running in ANSI mode. The pre-existing behavior is correct, and I believe this PR is unnecessary.
