fix(databricks)!: string-promote COALESCE/IF/CASE per findWiderCommonType [CLAUDE] #7682
…LAUDE]
Databricks string-promotes least-common-type functions when an argument
is text and the rest are non-boolean/non-binary atomics. Verified on a
Databricks serverless (always-ANSI) warehouse:
typeof(coalesce(cast(1 as int), 'abc')) -> string
typeof(coalesce(cast(1.5 as double), 'abc')) -> string
typeof(coalesce(cast('2020-01-01' as date), 'abc')) -> string
typeof(coalesce(interval 1 day, 'abc')) -> string
typeof(if(true, cast(1 as int), 'abc')) -> string
typeof(case when true then cast(1 as int) else 'abc' end) -> string
boolean+string and binary+string raise DATATYPE_MISMATCH in Databricks,
so those rows keep their current (best-effort) annotation.
These fixtures fail until the annotator fix lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ype [CLAUDE]
Databricks resolves least-common-type functions (COALESCE/IFNULL/NVL, IF, CASE) with Spark's findWiderCommonType string promotion: when a value argument is text and the rest are non-boolean/non-binary atomics, the result is text.
Previously these folded through Databricks' COERCES_TO lattice (text coerces into numeric/temporal), so coalesce(int_col, str_col) annotated as BIGINT instead of STRING, independent of argument order.
Verified on a Databricks serverless (always-ANSI) warehouse: coalesce/if/case of text + numeric/date/interval -> string; text + boolean/binary raises DATATYPE_MISMATCH, so those defer to the existing numeric-widening fallback.
Note this diverges from open-source Spark AnsiTypeCoercion (which promotes string+int to long under ANSI) -- Databricks SQL string-promotes these functions regardless of ANSI mode.
Scoped to Databricks; Spark/Hive (non-ANSI, already string-promoting via their inverted lattice) and other dialects are unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
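The order-independence described in this commit message can be illustrated with a toy model. This is only a sketch: the two dictionaries, `coerce`, and `fold` are hypothetical stand-ins written for this example, not sqlglot's actual COERCES_TO tables or its _maybe_coerce implementation.

```python
from functools import reduce

# Toy lattices: each type maps to the set of types it may be coerced into.
# The entries are illustrative assumptions, not sqlglot's real tables.
TEXT_INTO_NUMERIC = {  # direction the commit message attributes to Databricks
    "STRING": {"INT", "BIGINT", "DOUBLE", "DATE"},
    "INT": {"BIGINT", "DOUBLE"},
    "BIGINT": {"DOUBLE"},
}
NUMERIC_INTO_TEXT = {  # inverted direction attributed to Hive/Spark
    "INT": {"BIGINT", "DOUBLE", "STRING"},
    "BIGINT": {"DOUBLE", "STRING"},
    "DOUBLE": {"STRING"},
    "DATE": {"STRING"},
}


def coerce(left, right, lattice):
    """Return whichever of the two types the other one coerces into."""
    if left == right:
        return left
    if right in lattice.get(left, set()):
        return right
    if left in lattice.get(right, set()):
        return left
    raise TypeError(f"no common type for {left} and {right}")


def fold(arg_types, lattice):
    """Fold a function's argument types pairwise through the lattice."""
    return reduce(lambda a, b: coerce(a, b, lattice), arg_types)


# The winner depends on the lattice direction, not on argument order.
print(fold(["BIGINT", "STRING"], TEXT_INTO_NUMERIC))
print(fold(["STRING", "BIGINT"], TEXT_INTO_NUMERIC))
print(fold(["BIGINT", "STRING"], NUMERIC_INTO_TEXT))
```

In this model, swapping the arguments never changes the result; only swapping the lattice does, which is why the commit targets the annotators rather than argument handling.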
We're just gonna ignore this one for a while until we've got some consensus as to why I'm seeing completely different behavior in Databricks from what seems to be expected.
@RichardHughes-amp thanks for the PR once again, let me take a look.
@RichardHughes-amp can you run in databricks the command if it's not set it by |
# dialect: databricks
CASE WHEN cond THEN tbl.int_col ELSE tbl.str_col END;
STRING;
I haven't worked on CASE in the original PR, as it seems Databricks promotes the type here to BIGINT. Let's check Hive/Spark and use promote=true if this is true.
COALESCE(tbl.interval_col, tbl.str_col);
INTERVAL;
STRING;
Let's remove this test entirely (it isn't valid under ANSI; I probably missed it).
It turns out I had ANSI mode off! All of my confusion stemmed from me previously getting incorrect instructions on how to check if my Databricks warehouse was running in ANSI mode. The pre-existing behavior is correct, and I believe this PR is unnecessary.
Problem
For the databricks dialect, least-common-type functions, namely COALESCE (and IFNULL/NVL, which parse to Coalesce), IF, and CASE, annotate as the numeric type when an argument is text and the rest are numeric. Real Databricks resolves these to string (Spark's findWiderCommonType string promotion). spark/spark2/hive are already correct (they string-promote via their lattice); only databricks is affected.
Root cause
Databricks defines its own COERCES_TO lattice in the opposite direction from Hive/Spark: text coerces into numeric/temporal rather than numeric/temporal into text. LCT functions fold through that lattice via _annotate_by_args → _maybe_coerce, so the numeric type always wins regardless of argument order.
That lattice was introduced deliberately in #5096 on the premise that Databricks defaults to ANSI mode (where open-source Spark's AnsiTypeCoercion.findWiderTypeForString returns LONG for string + int, not string).
Why the ANSI premise doesn't hold for Databricks
Verified on a Databricks serverless warehouse (which is always ANSI); the typeof(...) results are the ones listed in the first commit message above.
Databricks string-promotes these functions even under ANSI, contrary to open-source Spark's AnsiTypeCoercion. It follows the non-ANSI stringPromotion rule, (StringType, AtomicType) if t2 != BinaryType && t2 != BooleanType (plus interval). The excluded combinations error: boolean + string and binary + string raise DATATYPE_MISMATCH.
Fix
Three Databricks-scoped annotators for Coalesce/If/Case: if a value argument is text and none is boolean/binary, the result is text; otherwise fall back to the existing numeric-widening behavior.
boolean/binary + text and GREATEST/LEAST (which require a common type and error on mixed text/numeric in Databricks) are intentionally left on the fallback: there is no representable "type mismatch error" annotation, so their current best-effort type is retained.
Scoped to Databricks only; Spark/Hive and all other dialects are unchanged. Arithmetic coercion ('5' + 3 -> INT) is unaffected; it goes through BINARY_COERCIONS, not COERCES_TO.
Tests
tests/fixtures/optimizer/annotate_functions.sql: the Databricks COALESCE/IF rows for text + numeric/date/interval now assert STRING; boolean/binary rows keep their current annotation; added COALESCE/NVL/CASE cases and an all-numeric regression. Each asserted result matches the empirical typeof(...) output above.
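The rule described under Fix can be sketched in plain Python for anyone who wants to reason about the fixture rows without running the annotator. This is an assumption-laden illustration: the type names are bare strings, and `lct_result_type` and `widest_numeric` are hypothetical helpers invented here, not the annotators added by this PR or any sqlglot API.

```python
TEXT = {"STRING", "VARCHAR", "CHAR"}
NO_PROMOTION = {"BOOLEAN", "BINARY"}  # text + these errors in Databricks


def widest_numeric(arg_types):
    """Hypothetical stand-in for the existing numeric-widening fallback."""
    order = ["INT", "BIGINT", "DOUBLE"]
    ranked = [t for t in arg_types if t in order]
    # With no rankable numeric argument, keep the first type as a best effort.
    return max(ranked, key=order.index) if ranked else arg_types[0]


def lct_result_type(arg_types, fallback=widest_numeric):
    """Result type for the value arguments of COALESCE/IF/CASE.

    Text wins when any value argument is text and none is boolean/binary;
    every other combination is handed to the fallback unchanged.
    """
    has_text = any(t in TEXT for t in arg_types)
    blocked = any(t in NO_PROMOTION for t in arg_types)
    if has_text and not blocked:
        return "STRING"
    return fallback(arg_types)


# Rows mirroring the kinds of cases the fixture file is said to cover.
for args in (["INT", "STRING"], ["INTERVAL", "STRING"],
             ["BOOLEAN", "STRING"], ["INT", "BIGINT"]):
    print(args, "->", lct_result_type(args))
```

Keeping the fallback as a parameter mirrors the stated design: the proposed annotators only intercept the text-promotion case and leave every other combination to whatever the dialect already did.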