fix(spark): CONCAT/PAD type leaks non-string arg when string args aren't TEXT [CLAUDE]#7661
Merged
georgesittas merged 6 commits intoMay 27, 2026
Conversation
750219e to
bbea140
Compare
…s are VARCHAR/CHAR [CLAUDE] `_annotate_by_similar_args` only short-circuited on an exact `is_type(TEXT)` match. VARCHAR, CHAR, NVARCHAR, and string literals (annotated as VARCHAR) are not `TEXT`, so for `CONCAT(varchar_col, '-', date_col)` the loop fell through every argument and the final `last_datatype = expr.type` left the result as the *last* arg's type (DATE), not a string. Accept a tuple of acceptable match types and use its first element as the canonical result. Spark's CONCAT/PAD now pass `(TEXT, *TEXT_TYPES)`, so any string-family arg yields TEXT. Also stop overwriting `last_datatype` on every non-matching arg — keep the first. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bbea140 to
7c31366
Compare
georgesittas
requested changes
May 20, 2026
…UDE] Replace varchar_col-based cases (which required a schema addition) with literal-based equivalents — string literals parse as VARCHAR, supplying the same TEXT_TYPES coverage without an extra schema column. Add missing cases: date-first CONCAT, array round-trips, and LPAD with date args. Remove varchar_col from the test schema. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ations [CLAUDE] Array mixed with string, binary, or a literal should resolve to UNKNOWN since Spark rejects these at analysis time with DATATYPE_MISMATCH. The annotator currently returns the wrong type for all six cases; these tests are expected to fail until the fix lands. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e combinations [CLAUDE]" This reverts commit 494c9be.
f824499 to
69139fb
Compare
…annotate_pad [CLAUDE]
The old helper accumulated type info arg-by-arg against a target_type,
which failed to recognize that VARCHAR/CHAR (TEXT_TYPES but not DType.TEXT)
are valid string-concat participants. Replace with two dedicated annotators
whose dispatch matches Spark's actual type rules:
CONCAT: UNKNOWN-in → UNKNOWN; all-binary → BINARY;
all-array of identical type → that ARRAY type; else → TEXT.
PAD: ARRAY arg → UNKNOWN (invalid); else same binary/text dispatch as
CONCAT, but without the array-propagation path.
The ARRAY branch in _annotate_concat uses type.sql() equality so that
ARRAY<STRING> and ARRAY<ARRAY<STRING>> are not treated as the same type.
Also correct a pre-existing fixture expectation: CONCAT(str_col, unknown)
should return UNKNOWN, not STRING — if any arg type is unknown the output
type is unknown.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
69139fb to
080d0de
Compare
…CLAUDE]
Per reviewer feedback (sqlglot#7661), array-overload typing is out of
scope. The annotator now does:
- all BINARY args → BINARY
- any arg with a known, non-array, non-binary type → STRING
(binary excluded to preserve binary+unknown → UNKNOWN, since Spark
can't disambiguate the string vs. binary overload there)
- otherwise → UNKNOWN
Drops _common_array_element_type and the ARRAY<...> fixture cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
georgesittas
approved these changes
May 26, 2026
Collaborator
georgesittas
left a comment
There was a problem hiding this comment.
Much simpler, thank you! Just a question about test coverage.
Comment on lines
+19
to
+26
| - All-BINARY → BINARY (the binary overload). | ||
| - Otherwise, if any arg has a known, non-array, non-binary type → STRING. | ||
| Spark coerces scalars (dates, ints, etc.) to string when mixed with a | ||
| string-resolving arg. The binary exclusion preserves the binary+unknown | ||
| case as UNKNOWN: Spark can't disambiguate the string vs. binary overload | ||
| there. | ||
| - Else → UNKNOWN. Covers all-unknown, binary+unknown, and anything | ||
| involving arrays (array handling is intentionally out of scope here). |
Collaborator
There was a problem hiding this comment.
Do we have test coverage for all of these cases?
Contributor
Author
There was a problem hiding this comment.
For every case except 'anything involving arrays'.
**Branch 1 — all-BINARY → BINARY**
- `CONCAT(bin, bin)` → BINARY (line 233)
- `LPAD(bin, 1, bin)` → BINARY (281)
- `RPAD(bin, 1, bin)` → BINARY (285)
**Branch 2 — any known non-array/non-binary arg → STRING**
- string+string: `CONCAT(str, str)` (245)
- binary+string (both orders): `CONCAT(bin, str)` / `CONCAT(str, bin)` (237, 241); plus all four LPAD/RPAD bin↔str combos (289–301)
- string+unknown: `CONCAT(str, unknown)` (249)
- date/int scalars: `CONCAT('x', date)` (261), `CONCAT(date, date)` (265), `CONCAT(date, int)` (273), `CONCAT('x', bin)` (269), `LPAD('x', 10, date)` (277)
**Branch 3 — else → UNKNOWN**
- all-unknown: `CONCAT(unknown, unknown)` (257)
- binary+unknown: `CONCAT(bin, unknown)` (253) ← the load-bearing case the comment calls out
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
SQLGlot and the native spark query analyzer were returning disparate results in testing for a query that had a clause of the form:
...Where SQLGlot parsed this CONCAT as emitting a Date type, and the native parser saw it as emitting a Varchar type.
We tracked down the root cause to a subtle error in the logic of
sqlglot/typing/spark2.py:_annotate_by_similar_args; when it swept the arguments of CONCAT (or LPAD) for types, it searched for exp.DType.TEXT, but it didn't have any special handling for other text types like VARCHAR2. So it went with whatever the last type was: in this case, date.CONCAT can also handle concatenation of arrays. However, we backed away from fully implementing that, on account of SQLGlot intentionally skipping type coercions for nested structures today. Rather, we went for a simplified model with three possible outputs: TEXT, UNKNOWN, or BINARY.