fix(spark): CONCAT/PAD type leaks non-string arg when string args aren't TEXT [CLAUDE] by RichardHughes-amp · Pull Request #7661 · tobymao/sqlglot

RichardHughes-amp · 2026-05-19T20:40:46Z

SQLGlot and the native spark query analyzer were returning disparate results in testing for a query that had a clause of the form:

CONCAT(varchar2_type_column, '-', date_type_column)

...Where SQLGlot parsed this CONCAT as emitting a Date type, and the native parser saw it as emitting a Varchar type.

We tracked down the root cause to a subtle error in the logic of sqlglot/typing/spark2.py:_annotate_by_similar_args; when it swept the arguments of CONCAT (or LPAD) for types, it searched for exp.DType.TEXT, but it didn't have any special handling for other text types like VARCHAR2. So it went with whatever the last type was: in this case, date.

CONCAT can also handle concatenation of arrays. However, we backed away from fully implementing that, on account of SQLGlot intentionally skipping type coercions for nested structures today. Rather, we went for a simplified model with three possible outputs: TEXT, UNKNOWN, or BINARY.

…s are VARCHAR/CHAR [CLAUDE] `_annotate_by_similar_args` only short-circuited on an exact `is_type(TEXT)` match. VARCHAR, CHAR, NVARCHAR, and string literals (annotated as VARCHAR) are not `TEXT`, so for `CONCAT(varchar_col, '-', date_col)` the loop fell through every argument and the final `last_datatype = expr.type` left the result as the *last* arg's type (DATE), not a string. Accept a tuple of acceptable match types and use its first element as the canonical result. Spark's CONCAT/PAD now pass `(TEXT, *TEXT_TYPES)`, so any string-family arg yields TEXT. Also stop overwriting `last_datatype` on every non-matching arg — keep the first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…UDE] Replace varchar_col-based cases (which required a schema addition) with literal-based equivalents — string literals parse as VARCHAR, supplying the same TEXT_TYPES coverage without an extra schema column. Add missing cases: date-first CONCAT, array round-trips, and LPAD with date args. Remove varchar_col from the test schema. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ations [CLAUDE] Array mixed with string, binary, or a literal should resolve to UNKNOWN since Spark rejects these at analysis time with DATATYPE_MISMATCH. The annotator currently returns the wrong type for all six cases; these tests are expected to fail until the fix lands. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…e combinations [CLAUDE]" This reverts commit 494c9be.

…annotate_pad [CLAUDE] The old helper accumulated type info arg-by-arg against a target_type, which failed to recognize that VARCHAR/CHAR (TEXT_TYPES but not DType.TEXT) are valid string-concat participants. Replace with two dedicated annotators whose dispatch matches Spark's actual type rules: CONCAT: UNKNOWN-in → UNKNOWN; all-binary → BINARY; all-array of identical type → that ARRAY type; else → TEXT. PAD: ARRAY arg → UNKNOWN (invalid); else same binary/text dispatch as CONCAT, but without the array-propagation path. The ARRAY branch in _annotate_concat uses type.sql() equality so that ARRAY<STRING> and ARRAY<ARRAY<STRING>> are not treated as the same type. Also correct a pre-existing fixture expectation: CONCAT(str_col, unknown) should return UNKNOWN, not STRING — if any arg type is unknown the output type is unknown. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…CLAUDE] Per reviewer feedback (sqlglot#7661), array-overload typing is out of scope. The annotator now does: - all BINARY args → BINARY - any arg with a known, non-array, non-binary type → STRING (binary excluded to preserve binary+unknown → UNKNOWN, since Spark can't disambiguate the string vs. binary overload there) - otherwise → UNKNOWN Drops _common_array_element_type and the ARRAY<...> fixture cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

georgesittas

Much simpler, thank you! Just a question about test coverage.

georgesittas · 2026-05-26T18:13:11Z

+    - All-BINARY → BINARY (the binary overload).
+    - Otherwise, if any arg has a known, non-array, non-binary type → STRING.
+      Spark coerces scalars (dates, ints, etc.) to string when mixed with a
+      string-resolving arg. The binary exclusion preserves the binary+unknown
+      case as UNKNOWN: Spark can't disambiguate the string vs. binary overload
+      there.
+    - Else → UNKNOWN. Covers all-unknown, binary+unknown, and anything
+      involving arrays (array handling is intentionally out of scope here).


Do we have test coverage for all of these cases?

For every case except 'anything involving arrays'.

**Branch 1 — all-BINARY → BINARY** - `CONCAT(bin, bin)` → BINARY (line 233) - `LPAD(bin, 1, bin)` → BINARY (281) - `RPAD(bin, 1, bin)` → BINARY (285) **Branch 2 — any known non-array/non-binary arg → STRING** - string+string: `CONCAT(str, str)` (245) - binary+string (both orders): `CONCAT(bin, str)` / `CONCAT(str, bin)` (237, 241); plus all four LPAD/RPAD bin↔str combos (289–301) - string+unknown: `CONCAT(str, unknown)` (249) - date/int scalars: `CONCAT('x', date)` (261), `CONCAT(date, date)` (265), `CONCAT(date, int)` (273), `CONCAT('x', bin)` (269), `LPAD('x', 10, date)` (277) **Branch 3 — else → UNKNOWN** - all-unknown: `CONCAT(unknown, unknown)` (257) - binary+unknown: `CONCAT(bin, unknown)` (253) ← the load-bearing case the comment calls out

RichardHughes-amp force-pushed the fix-spark-concat-text-type branch 2 times, most recently from 750219e to bbea140 Compare May 19, 2026 21:00

RichardHughes-amp force-pushed the fix-spark-concat-text-type branch from bbea140 to 7c31366 Compare May 19, 2026 21:18

RichardHughes-amp marked this pull request as ready for review May 19, 2026 21:22

georgesittas requested changes May 20, 2026

View reviewed changes

Comment thread sqlglot/typing/spark2.py Outdated

Comment thread sqlglot/typing/spark2.py Outdated

Comment thread sqlglot/typing/spark2.py Outdated

Comment thread tests/test_optimizer.py Outdated

Comment thread sqlglot/typing/spark2.py Outdated

RichardHughes-amp closed this May 20, 2026

RichardHughes-amp reopened this May 20, 2026

RichardHughes-amp and others added 3 commits May 20, 2026 15:27

Revert "test(spark): add failing UNKNOWN cases for invalid CONCAT typ…

181ad62

…e combinations [CLAUDE]" This reverts commit 494c9be.

RichardHughes-amp force-pushed the fix-spark-concat-text-type branch 11 times, most recently from f824499 to 69139fb Compare May 21, 2026 01:18

RichardHughes-amp force-pushed the fix-spark-concat-text-type branch from 69139fb to 080d0de Compare May 21, 2026 01:56

georgesittas reviewed May 21, 2026

View reviewed changes

Comment thread sqlglot/typing/spark2.py Outdated

RichardHughes-amp requested a review from georgesittas May 22, 2026 18:21

georgesittas self-assigned this May 25, 2026

georgesittas approved these changes May 26, 2026

View reviewed changes

georgesittas merged commit c3fa1ce into tobymao:main May 27, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(spark): CONCAT/PAD type leaks non-string arg when string args aren't TEXT [CLAUDE]#7661

fix(spark): CONCAT/PAD type leaks non-string arg when string args aren't TEXT [CLAUDE]#7661
georgesittas merged 6 commits into
tobymao:mainfrom
RichardHughes-amp:fix-spark-concat-text-type

RichardHughes-amp commented May 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

georgesittas left a comment

Uh oh!

georgesittas May 26, 2026

Uh oh!

RichardHughes-amp May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

RichardHughes-amp commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

georgesittas left a comment

Choose a reason for hiding this comment

Uh oh!

georgesittas May 26, 2026

Choose a reason for hiding this comment

Uh oh!

RichardHughes-amp May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

RichardHughes-amp commented May 19, 2026 •

edited

Loading