JSON AST with properties #1
Open
peter-leonov-ch wants to merge 29 commits into
Replace the positional `children` array on `ASTFunction` nodes with named
JSON slots so consumers no longer have to know the in-class convention to
tell `arguments` from `parameters`.
`enrichNode` now returns a `handled_children` flag; when set,
`formatASTAsJSON` skips the generic `children` walk for that node. For
`ASTFunction` it emits:
- `arguments`: always an array (possibly empty), inlining the inner
`ASTExpressionList` wrapper;
- `parameters`: only for parametric aggregates, same inlining;
- `window_definition`: the node as-is (not an `ASTExpressionList`);
- `window_name`: as before.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
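A minimal consumer-side sketch of what the named slots buy, in Python (the helper name `render_call` and the `TOP_K` node are hypothetical; `COUNT` is copied from the examples in the PR description). It assumes only what the commit message states: `arguments` is always an array and `parameters` appears only on parametric aggregates.

```python
def render_call(node: dict) -> str:
    """Render a Function node as name(parameters)(arguments)."""

    def render(expr: dict) -> str:
        if expr["type"] == "Function":
            return render_call(expr)
        if expr["type"] == "Identifier":
            return expr["name"]
        if expr["type"] == "Literal":
            return str(expr["value"])
        return "<" + expr["type"] + ">"

    # `arguments` is always present (possibly empty); `parameters` is optional.
    args = ", ".join(render(a) for a in node["arguments"])
    params = node.get("parameters")
    if params is None:
        return node["name"] + "(" + args + ")"
    params_text = ", ".join(render(p) for p in params)
    return node["name"] + "(" + params_text + ")(" + args + ")"


# Copied from the PR description: count() has an empty `arguments` array.
COUNT = {"type": "Function", "alias": "c", "name": "count", "arguments": []}

# Hand-written for illustration (not captured output): topK(3)(x).
TOP_K = {
    "type": "Function",
    "name": "topK",
    "parameters": [{"type": "Literal", "value_type": "UInt64", "value": "3"}],
    "arguments": [{"type": "Identifier", "name": "x"}],
}
```

With a positional `children` array the consumer would have to know which child holds which list; here the slot name carries that information.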
Expose `ASTOrderByElement` sub-nodes through named slots — `expression`, `collation`, `fill_from`, `fill_to`, `fill_step`, `fill_staleness` — and suppress the positional `children` array, reusing the `handled_children` mechanism introduced for `ASTFunction`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drive the JSON walker from `ASTSelectQuery`'s `Expression` positions: emit one named slot per clause (`with`, `select`, `tables`, `where`, `group_by`, `order_by`, `limit_length`, ...) and suppress the opaque positional `children` array. List-shaped clauses inline their `ASTExpressionList` wrapper; single-expression clauses are emitted as nodes; absent clauses are omitted. `ALIASES` / `CTE_ALIASES` are analyzer state and do not appear on the parsed ASTs returned by `EXPLAIN AST`, so they are left out of the table. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update the design note to reflect the landed implementation: the `handled_children` mechanism, the `inlineExpressionList` / `addNodeSlot` helpers, and the named slots for `ASTFunction`, `ASTOrderByElement`, and `ASTSelectQuery`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`ASTSelectIntersectExceptQuery` derives from `ASTSelectQuery` but stores its operand selects in the positional `children` array rather than in the `Expression` slots. The `dynamic_cast<const ASTSelectQuery *>` branch therefore caught it, found no slots, and returned `handled_children = true` — silently dropping both operand selects from the JSON. Match `ASTSelectIntersectExceptQuery` explicitly before `ASTSelectQuery`, emit its `operator`, and leave `children` intact. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
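The ordering problem described above can be reproduced in a few lines of Python, with `isinstance` standing in for the C++ `dynamic_cast` chain. This is an analogy under stated assumptions, not the PR's code: the class names mirror the AST classes, while `enrich_buggy`, `enrich_fixed`, `to_json`, and the `"INTERSECT"` operator string are hypothetical.

```python
class SelectQuery:
    """Stand-in for ASTSelectQuery: clauses live in named slots."""

    def __init__(self, slots=None, children=None):
        self.slots = slots or {}
        self.children = children or []


class SelectIntersectExceptQuery(SelectQuery):
    """Stand-in for ASTSelectIntersectExceptQuery: operands live in `children`."""

    def __init__(self, operator, children):
        super().__init__(children=children)
        self.operator = operator


def enrich_buggy(node):
    # Base class checked first: the subclass also matches this branch,
    # has no slots, and still reports handled_children=True.
    if isinstance(node, SelectQuery):
        return dict(node.slots), True
    return {}, False


def enrich_fixed(node):
    # Most-derived class checked first; children are left to the generic walk.
    if isinstance(node, SelectIntersectExceptQuery):
        return {"operator": node.operator}, False
    if isinstance(node, SelectQuery):
        return dict(node.slots), True
    return {}, False


def to_json(node, enrich):
    out = {"type": type(node).__name__}
    fields, handled_children = enrich(node)
    out.update(fields)
    if not handled_children:
        out["children"] = [to_json(child, enrich) for child in node.children]
    return out
```

Serializing an intersect node with `enrich_buggy` yields no `children` key at all, which is the silent drop the commit fixes; `enrich_fixed` keeps both operands.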
Add stateless tests covering the JSON dump over a wider slice of SQL: literal value types, lambda functions, operators (which the parser normalizes to ordinary functions), CAST forms, the NULLS action, and compound / qualified identifiers. The union test also guards against the `ASTSelectIntersectExceptQuery` regression: its operand selects must remain present in `children`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a runnable showcase: one deeply nested query exercising most named slots and node types at once (CTE + scalar-subquery WITH, DISTINCT, lambda, window functions inline and named, CASE, CAST, JOIN over a subquery, PREWHERE, WHERE with IN-subquery and BETWEEN, GROUP BY WITH ROLLUP, HAVING, WINDOW, QUALIFY, ORDER BY WITH FILL INTERPOLATE, LIMIT BY, LIMIT OFFSET, SETTINGS). The full JSON AST is dumped without node filtering so the reference doubles as a readable example of the format. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend the named-slots treatment from the expression/clause nodes to the
structural wrappers that produced the long single-child `children` chains:
- ASTSelectWithUnionQuery: `selects` (inlines `list_of_selects`);
- ASTSubquery: `cte_name`, `query`;
- ASTWithElement: `name`, `subquery`, `aliases`;
- ASTTablesInSelectQueryElement: `table_join`, `table_expression`,
`array_join`;
- ASTTableExpression: `database_and_table_name` / `table_function` /
`subquery`, `final`, `sample_size`, `sample_offset`, `column_aliases`;
- ASTTableJoin: `kind`, `strictness`, `locality`, `using`, `on`;
- ASTArrayJoin: `kind`, `expressions`;
- ASTWindowListElement: `name`, `definition`;
- ASTWindowDefinition: `parent_window_name`, `partition_by`, `order_by`
and the frame fields when the frame is non-default;
- ASTInterpolateElement: `column`, `expr`.
After this, `children` survives only on the homogeneous lists
(`ExpressionList`, `TablesInSelectQuery`). As a side effect this also
exposes the `ASTWindowListElement` definition, which was previously absent
from the JSON entirely (it is not stored in `children`). Design notes in
AST3.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cover joins (INNER/ON, LEFT/USING, GLOBAL ANY strictness+locality, CROSS, COMMA), LEFT ARRAY JOIN, table function and subquery table expressions, FINAL + SAMPLE, a named CTE whose subquery collapses to `query.selects`, a RANGE window frame, inlined UNION `selects`, and an interpolate element. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The last nodes rendering as a bare `{"type": ...}` held non-AST state in
private members rather than in `children`. Expose it:
- ASTSetQuery (the SETTINGS clause): `changes` as a name -> value object,
plus `default_settings`;
- ASTSampleRatio: `numerator` / `denominator` (exact rationals, emitted
as strings since they can exceed UInt64);
- ASTAsterisk / ASTQualifiedAsterisk: `expression` / `qualifier` and
`transformers`;
- COLUMNS matchers (`pattern`, `columns`) and transformers (`func_name`,
`parameters`, `lambda`, `is_strict`, replacement `name`).
A plain `SELECT *` still serializes as `{"type": "Asterisk"}` — that node
has no attached state. `children` now survives only on homogeneous lists
(ExpressionList, TablesInSelectQuery, and the EXCEPT / REPLACE column
lists). Design notes in AST3.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Merge the AST2.md / AST3.md working notes into a single contributor-facing AST.md: how to run it, the output contract, the per-class schema, the dispatch-ordering caveat, how to extend it to a new node, and the test list. Remove the superseded AST2.md and AST3.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A standalone set of paired <name>.sql / <name>.json files under tests/ast_json_fixtures/cases (41 cases) capturing the `EXPLAIN AST json = 1` output for a broad spread of SQL, plus a generate.sh to (re)produce them and a README describing how to use them as golden files for verifying an alternative parser. Not wired into CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new `EXPLAIN AST json = 1` mode that serializes the parsed ClickHouse AST into a structured JSON format (named slots for semantic children, `children` kept for homogeneous lists), along with documentation, stateless golden tests, and a standalone fixture corpus for external parser/serializer validation.
Changes:
- Implement `formatASTAsJSON()` (a JSONBuilder-based AST serializer) and integrate it into `EXPLAIN AST` via a new `json` option.
- Add stateless tests covering core AST node families (functions, ORDER BY, SELECT clauses, wrappers, UNION/INTERSECT/EXCEPT, leaf state).
- Add the `AST.md` spec plus a separate `tests/ast_json_fixtures/` corpus and generator script.
Reviewed changes
Copilot reviewed 104 out of 104 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| AST.md | Documents the JSON AST contract, schema, and extension/testing guidance. |
| src/Interpreters/InterpreterExplainQuery.cpp | Adds json option to EXPLAIN AST and routes output through formatASTAsJSON(). |
| src/Parsers/DumpASTNode.h | Declares formatASTAsJSON() and documents JSON AST conventions. |
| src/Parsers/DumpASTNode.cpp | Implements JSON AST formatting and per-node enrichment logic. |
| tests/queries/0_stateless/03917_explain_ast_json_function.sh | Stateless test coverage for Function node slots. |
| tests/queries/0_stateless/03917_explain_ast_json_function.reference | Golden output for function JSON AST extraction. |
| tests/queries/0_stateless/03918_explain_ast_json_order_by.sh | Stateless test coverage for OrderByElement slots. |
| tests/queries/0_stateless/03918_explain_ast_json_order_by.reference | Golden output for ORDER BY JSON AST extraction. |
| tests/queries/0_stateless/03919_explain_ast_json_select_query.sh | Stateless test coverage for SelectQuery clause slots and ordering. |
| tests/queries/0_stateless/03919_explain_ast_json_select_query.reference | Golden output for SELECT clause slot behavior. |
| tests/queries/0_stateless/03920_explain_ast_json_expressions.sh | Stateless test coverage for literals/operators/lambda/cast/nulls/identifiers. |
| tests/queries/0_stateless/03920_explain_ast_json_expressions.reference | Golden output for expression JSON AST extraction. |
| tests/queries/0_stateless/03921_explain_ast_json_union.sh | Stateless test coverage for UNION modes and INTERSECT/EXCEPT operator+operands. |
| tests/queries/0_stateless/03921_explain_ast_json_union.reference | Golden output for set-ops JSON AST extraction. |
| tests/queries/0_stateless/03922_explain_ast_json_showcase.sh | Full AST JSON “showcase” query (no filtering). |
| tests/queries/0_stateless/03922_explain_ast_json_showcase.reference | Golden pretty-printed JSON AST for showcase query. |
| tests/queries/0_stateless/03923_explain_ast_json_wrappers.sh | Stateless test coverage for structural wrapper nodes (JOIN, subquery, window, interpolate, etc.). |
| tests/queries/0_stateless/03923_explain_ast_json_wrappers.reference | Golden output for wrapper-node JSON AST extraction. |
| tests/queries/0_stateless/03924_explain_ast_json_leaf_state.sh | Stateless test coverage for leaf-ish node state (SETTINGS, SAMPLE, asterisks, COLUMNS). |
| tests/queries/0_stateless/03924_explain_ast_json_leaf_state.reference | Golden output for leaf-state JSON AST extraction. |
| tests/ast_json_fixtures/README.md | Describes the standalone SQL→JSON-AST fixture corpus and regeneration/verification workflow. |
| tests/ast_json_fixtures/generate.sh | Script to regenerate the fixture corpus via EXPLAIN AST json = 1. |
| tests/ast_json_fixtures/cases/01_literals.sql | Fixture input: literals. |
| tests/ast_json_fixtures/cases/01_literals.json | Fixture output: literals AST JSON. |
| tests/ast_json_fixtures/cases/02_identifiers.sql | Fixture input: identifiers. |
| tests/ast_json_fixtures/cases/02_identifiers.json | Fixture output: identifiers AST JSON. |
| tests/ast_json_fixtures/cases/03_function_call.sql | Fixture input: function call. |
| tests/ast_json_fixtures/cases/03_function_call.json | Fixture output: function call AST JSON. |
| tests/ast_json_fixtures/cases/04_no_args.sql | Fixture input: zero-arg function. |
| tests/ast_json_fixtures/cases/04_no_args.json | Fixture output: zero-arg function AST JSON. |
| tests/ast_json_fixtures/cases/05_operators.sql | Fixture input: operators. |
| tests/ast_json_fixtures/cases/05_operators.json | Fixture output: operators normalized to functions. |
| tests/ast_json_fixtures/cases/06_subscript.sql | Fixture input: subscripts/tuple element access. |
| tests/ast_json_fixtures/cases/06_subscript.json | Fixture output: subscript AST JSON. |
| tests/ast_json_fixtures/cases/07_cast.sql | Fixture input: CAST forms. |
| tests/ast_json_fixtures/cases/07_cast.json | Fixture output: CAST AST JSON. |
| tests/ast_json_fixtures/cases/08_lambda.sql | Fixture input: lambda expression. |
| tests/ast_json_fixtures/cases/08_lambda.json | Fixture output: lambda AST JSON. |
| tests/ast_json_fixtures/cases/09_parametric_agg.sql | Fixture input: parametric aggregate. |
| tests/ast_json_fixtures/cases/09_parametric_agg.json | Fixture output: parametric aggregate AST JSON. |
| tests/ast_json_fixtures/cases/10_nulls_action.sql | Fixture input: NULLS action. |
| tests/ast_json_fixtures/cases/10_nulls_action.json | Fixture output: NULLS action AST JSON. |
| tests/ast_json_fixtures/cases/11_window_named.sql | Fixture input: named window. |
| tests/ast_json_fixtures/cases/11_window_named.json | Fixture output: named window AST JSON. |
| tests/ast_json_fixtures/cases/12_window_inline.sql | Fixture input: inline window definition + frame. |
| tests/ast_json_fixtures/cases/12_window_inline.json | Fixture output: inline window AST JSON. |
| tests/ast_json_fixtures/cases/13_select_star.sql | Fixture input: SELECT * with WHERE. |
| tests/ast_json_fixtures/cases/13_select_star.json | Fixture output: SELECT * AST JSON. |
| tests/ast_json_fixtures/cases/14_distinct.sql | Fixture input: DISTINCT. |
| tests/ast_json_fixtures/cases/14_distinct.json | Fixture output: DISTINCT AST JSON. |
| tests/ast_json_fixtures/cases/15_group_by_rollup.sql | Fixture input: GROUP BY ROLLUP + HAVING. |
| tests/ast_json_fixtures/cases/15_group_by_rollup.json | Fixture output: GROUP BY ROLLUP AST JSON. |
| tests/ast_json_fixtures/cases/16_grouping_sets.sql | Fixture input: GROUPING SETS. |
| tests/ast_json_fixtures/cases/16_grouping_sets.json | Fixture output: GROUPING SETS AST JSON. |
| tests/ast_json_fixtures/cases/17_qualify.sql | Fixture input: QUALIFY. |
| tests/ast_json_fixtures/cases/17_qualify.json | Fixture output: QUALIFY AST JSON. |
| tests/ast_json_fixtures/cases/18_order_by_fill.sql | Fixture input: ORDER BY WITH FILL + INTERPOLATE. |
| tests/ast_json_fixtures/cases/18_order_by_fill.json | Fixture output: ORDER BY fill/interpolate AST JSON. |
| tests/ast_json_fixtures/cases/19_limits.sql | Fixture input: LIMIT BY + LIMIT/OFFSET. |
| tests/ast_json_fixtures/cases/19_limits.json | Fixture output: LIMITs AST JSON. |
| tests/ast_json_fixtures/cases/20_settings.sql | Fixture input: SETTINGS clause. |
| tests/ast_json_fixtures/cases/20_settings.json | Fixture output: SETTINGS AST JSON. |
| tests/ast_json_fixtures/cases/21_join_on.sql | Fixture input: JOIN ... ON. |
| tests/ast_json_fixtures/cases/21_join_on.json | Fixture output: JOIN ON AST JSON. |
| tests/ast_json_fixtures/cases/22_join_using.sql | Fixture input: JOIN ... USING. |
| tests/ast_json_fixtures/cases/22_join_using.json | Fixture output: JOIN USING AST JSON. |
| tests/ast_json_fixtures/cases/23_join_global_any.sql | Fixture input: GLOBAL ANY join. |
| tests/ast_json_fixtures/cases/23_join_global_any.json | Fixture output: GLOBAL ANY join AST JSON. |
| tests/ast_json_fixtures/cases/24_cross_comma.sql | Fixture input: CROSS + COMMA joins. |
| tests/ast_json_fixtures/cases/24_cross_comma.json | Fixture output: CROSS/COMMA AST JSON. |
| tests/ast_json_fixtures/cases/25_array_join.sql | Fixture input: ARRAY JOIN. |
| tests/ast_json_fixtures/cases/25_array_join.json | Fixture output: ARRAY JOIN AST JSON. |
| tests/ast_json_fixtures/cases/26_table_function.sql | Fixture input: table function. |
| tests/ast_json_fixtures/cases/26_table_function.json | Fixture output: table function AST JSON. |
| tests/ast_json_fixtures/cases/27_subquery_from.sql | Fixture input: subquery in FROM. |
| tests/ast_json_fixtures/cases/27_subquery_from.json | Fixture output: subquery FROM AST JSON. |
| tests/ast_json_fixtures/cases/28_sample_final.sql | Fixture input: FINAL + SAMPLE. |
| tests/ast_json_fixtures/cases/28_sample_final.json | Fixture output: FINAL/SAMPLE AST JSON. |
| tests/ast_json_fixtures/cases/29_cte.sql | Fixture input: CTE. |
| tests/ast_json_fixtures/cases/29_cte.json | Fixture output: CTE AST JSON. |
| tests/ast_json_fixtures/cases/30_scalar_with.sql | Fixture input: scalar WITH binding. |
| tests/ast_json_fixtures/cases/30_scalar_with.json | Fixture output: scalar WITH AST JSON. |
| tests/ast_json_fixtures/cases/31_union_all.sql | Fixture input: UNION ALL. |
| tests/ast_json_fixtures/cases/31_union_all.json | Fixture output: UNION ALL AST JSON. |
| tests/ast_json_fixtures/cases/32_union_distinct.sql | Fixture input: UNION DISTINCT. |
| tests/ast_json_fixtures/cases/32_union_distinct.json | Fixture output: UNION DISTINCT AST JSON. |
| tests/ast_json_fixtures/cases/33_intersect.sql | Fixture input: INTERSECT. |
| tests/ast_json_fixtures/cases/33_intersect.json | Fixture output: INTERSECT AST JSON. |
| tests/ast_json_fixtures/cases/34_except.sql | Fixture input: EXCEPT. |
| tests/ast_json_fixtures/cases/34_except.json | Fixture output: EXCEPT AST JSON. |
| tests/ast_json_fixtures/cases/35_qualified_star.sql | Fixture input: qualified asterisk. |
| tests/ast_json_fixtures/cases/35_qualified_star.json | Fixture output: qualified asterisk AST JSON. |
| tests/ast_json_fixtures/cases/36_except_transform.sql | Fixture input: * EXCEPT. |
| tests/ast_json_fixtures/cases/36_except_transform.json | Fixture output: * EXCEPT AST JSON. |
| tests/ast_json_fixtures/cases/37_apply_transform.sql | Fixture input: * APPLY. |
| tests/ast_json_fixtures/cases/37_apply_transform.json | Fixture output: * APPLY AST JSON. |
| tests/ast_json_fixtures/cases/38_replace_transform.sql | Fixture input: * REPLACE. |
| tests/ast_json_fixtures/cases/38_replace_transform.json | Fixture output: * REPLACE AST JSON. |
| tests/ast_json_fixtures/cases/39_columns_regexp.sql | Fixture input: COLUMNS('regexp'). |
| tests/ast_json_fixtures/cases/39_columns_regexp.json | Fixture output: COLUMNS('regexp') AST JSON. |
| tests/ast_json_fixtures/cases/40_columns_list.sql | Fixture input: COLUMNS(a, b). |
| tests/ast_json_fixtures/cases/40_columns_list.json | Fixture output: COLUMNS(a, b) AST JSON. |
| tests/ast_json_fixtures/cases/41_showcase.sql | Fixture input: maximal showcase query. |
| tests/ast_json_fixtures/cases/41_showcase.json | Fixture output: maximal showcase AST JSON. |
Add harvest_stateless.py: it extracts statements from tests/queries/0_stateless/*.sql (a custom quote/comment-aware splitter), drops query-parameter placeholders and duplicates, and runs each through `EXPLAIN AST json = 1`, writing SQL -> JSON golden pairs. ~94k unique param-free statements remain, at ~100% parse yield. Execution is batched through one `clickhouse local` per batch with `--multiquery --ignore-error`; since EXPLAIN only parses (never executes) this is side-effect-free and resynchronises past rejects, so a full harvest runs in well under a minute. The bulk output is gitignored, not committed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
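A rough idea of what a quote/comment-aware splitter involves, as a Python sketch. This is not the code of `harvest_stateless.py`: the function name `split_statements` is hypothetical, and the sketch deliberately ignores ClickHouse specifics such as heredocs, `#` comments, and inline data after `FORMAT`.

```python
def split_statements(text: str) -> list:
    """Split SQL text on ';' while ignoring ';' inside quotes and comments."""
    statements, buf = [], []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        two = text[i:i + 2]
        if two == "--":  # line comment: skip up to (not including) the newline
            j = text.find("\n", i)
            i = n if j == -1 else j
            continue
        if two == "/*":  # block comment: replace with a single space
            j = text.find("*/", i + 2)
            i = n if j == -1 else j + 2
            buf.append(" ")
            continue
        if ch in ("'", '"', "`"):  # quoted run: copy verbatim to the closing quote
            j = i + 1
            while j < n:
                if text[j] == "\\":  # backslash escape: skip the escaped char
                    j += 2
                    continue
                if text[j] == ch:
                    if text[j + 1:j + 2] == ch:  # doubled quote, e.g. 'it''s'
                        j += 2
                        continue
                    break
                j += 1
            buf.append(text[i:j + 1])
            i = j + 1
            continue
        if ch == ";":  # statement boundary
            stmt = "".join(buf).strip()
            if stmt:
                statements.append(stmt)
            buf = []
            i += 1
            continue
        buf.append(ch)
        i += 1
    tail = "".join(buf).strip()
    if tail:
        statements.append(tail)
    return statements
```

Because `EXPLAIN` only parses, a statement that such a splitter cuts incorrectly just shows up as a parse reject and does not affect the rest of the batch.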
Add `--dedupe shape` to the harvester: keep one representative per distinct AST shape (a structural signature of node types, slot keys, enum/flag scalars and literal value_type, dropping leaf values), choosing the shortest statement so the selection is deterministic. `--shape-depth D` trades coverage for size (depth 3 -> 1440 shapes, depth 4 -> 5141, ... , unlimited -> 16733 over the stateless corpus). Commit the depth-3 set under tests/ast_json_fixtures/shapes/: 1440 structurally-distinct, shortest-representative SQL -> JSON pairs (~12 MB) — a compact, high-coverage corpus for verifying an alternative parser. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
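The shape-based selection can be sketched as follows in Python. It is a simplification, not the harvester's implementation: `shape` and `dedupe_by_shape` are hypothetical names, only `type`, `value_type`, and boolean flags are kept (the real signature also keeps enum scalars), and counting depth per nested object is an assumption about how `--shape-depth` works.

```python
def shape(node, depth=None):
    """Return a hashable structural signature of a JSON AST value."""
    if isinstance(node, list):
        return tuple(shape(item, depth) for item in node)
    if not isinstance(node, dict):
        # Leaf scalar: keep boolean flags, drop names and literal values.
        return node if isinstance(node, bool) else None
    if depth is not None and depth <= 0:
        return ("...",)  # everything below the depth limit is ignored
    next_depth = None if depth is None else depth - 1
    parts = []
    for key in sorted(node):
        if key in ("type", "value_type"):
            parts.append((key, node[key]))
        else:
            parts.append((key, shape(node[key], next_depth)))
    return tuple(parts)


def dedupe_by_shape(pairs, depth=None):
    """Keep the shortest SQL per shape; ties break lexicographically."""
    best = {}
    for sql, ast in pairs:
        signature = shape(ast, depth)
        current = best.get(signature)
        if current is None or (len(sql), sql) < (len(current), current):
            best[signature] = sql
    return sorted(best.values())
```

Breaking ties on `(len(sql), sql)` is what makes the selection independent of input order; a smaller depth merges more statements into one shape, which matches the size/coverage trade-off quoted above.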
Enrich the highest-frequency DDL/DML nodes so they no longer fall back to
positional `children`:
- ASTCreateQuery: flags (attach/temporary/if_not_exists/is_*_view/
is_dictionary/replace_*/create_or_replace), database, table,
columns_list, aliases, storage, as_table_function, as_database/as_table,
select, targets, comment, dictionary_attributes, dictionary;
- ASTColumns: columns / indices / constraints / projections (inlined),
primary_key, primary_key_from_columns;
- ASTColumnDeclaration: name, data_type, default_specifier,
default_expression, null_modifier, ephemeral_default,
primary_key_specifier, comment, codec, statistics, ttl, collation,
settings;
- ASTDataType: name, arguments (inlined) — exposes the type name that was
previously hidden;
- ASTStorage: engine, partition_by, primary_key, order_by, sample_by,
ttl_table, settings;
- ASTInsertQuery: database, table, table_function, columns, format,
partition_by, settings, select, infile, compression.
Adds `IAST *` overloads of `addNodeSlot` / `inlineExpressionList` for the
raw-pointer members these classes hold alongside `children`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CREATE-TABLE family / INSERT nodes now expose named slots, so their AST shapes carry more structure. Regenerating the depth-3 corpus grows it from 1440 to 1812 distinct shapes — finer coverage of the DDL surface. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… ops
Group 1 (CREATE-TABLE sub-elements):
- ASTIndexDeclaration: name, expression, index_type, granularity;
- ASTConstraintDeclaration: name, constraint_type, expression;
- ASTProjectionDeclaration: name, query, index;
- ASTProjectionSelectQuery: with / select / group_by / order_by (inlined);
- ASTTTLElement: mode, ttl, destination_type/name (MOVE), where,
recompression_codec;
- ASTPartition: all, value, id.
Group 3 (lightweight DML and simple table ops):
- ASTAssignment: column, expression;
- ASTDeleteQuery / ASTUpdateQuery: table target, predicate, assignments,
partition;
- ASTDropQuery (covers DETACH / TRUNCATE): kind + flags + table target;
- ASTOptimizeQuery: table target, partition, final, deduplicate, cleanup.
Adds an `addTableTarget` helper for the shared database/table/temporary
slots and small enum mappers (TTL mode, data destination, constraint type,
drop kind). AlterQuery/AlterCommand and the dictionary nodes remain (noted
in AST.md).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Group 2 (ALTER):
- ASTAlterQuery: alter_object, table target, cluster, commands (inlined);
- ASTAlterCommand: command_type (full Type enum mapped to a name), flags,
and the applicable sub-node slots (column_declaration, column, index*,
partition, predicate, assignments, ttl, settings_changes, select,
rename_to, ...) plus the from*/to*/move_destination_name strings.
Group 4 (dictionaries / functions):
- ASTCreateFunctionQuery: function_name, function_core, flags;
- ASTDictionary + sub-elements (layout, lifetime, range, settings);
- ASTDictionaryAttributeDeclaration;
- ASTFunctionWithKeyValueArguments + ASTPair (key/value);
- ASTViewTargets.
Known wart documented in AST.md: ASTDictionary and its sub-elements share
type "Dictionary" because getID trims "Dictionary <word>" at the space;
their named slots disambiguate them.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…enrichment The ALTER family and dictionary nodes now expose named slots, so their AST shapes carry more structure. The depth-3 corpus grows from 1812 to 2024 distinct shapes. The big DDL/DML node types no longer fall back to positional children; only a rare long tail (Explain, Describe, Show*, Kill, SYSTEM, ...) remains. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Enrich the remaining common statement/element nodes:
- ASTExplainQuery, ASTDescribeQuery, ASTShowTablesQuery;
- ASTCreateIndexQuery, ASTDropIndexQuery, ASTCheckTableQuery, ASTUseQuery;
- ASTKillQueryQuery, ASTRenameQuery, ASTSystemQuery;
- ASTStatisticsDeclaration, ASTStorageOrderByElement, ASTNameTypePair;
- the qualified COLUMNS matchers.
Plus a generic ASTQueryWithTableAndOutput fallback (placed after every richer subclass) that gives database/table/temporary to the simple table-scoped statements (EXISTS, SHOW CREATE, UNDROP, ...).
Remaining unenriched: access-management (CreateUser/AuthenticationData/ShowGrants/...) and BACKUP/RESTORE; ParallelWithQuery keeps children as a homogeneous list. Noted in AST.md.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nrichment Depth-3 corpus grows from 2024 to 2165 distinct shapes. The remaining positional-children nodes are just the access-management statements and BACKUP/RESTORE (plus the homogeneous ParallelWithQuery list and the intentionally child-keeping StorageOrderByElement). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `type` id is `IAST::getID(' ')` trimmed at the first space. That works
because getID packs auxiliary data after a space delimiter — but
ASTDictionary and its sub-elements return descriptive getIDs that embed a
space ("Dictionary lifetime", "Dictionary layout", ...), so trimming
collapsed all five onto "Dictionary".
Factor the type derivation into `astTypeName` and special-case the four
dictionary sub-elements (DictionaryLayout / DictionaryLifetime /
DictionaryRange / DictionarySettings); everything else keeps the trim.
JSONMap only appends, so this can't live in enrichNode (which runs after
`type` is emitted) — hence a small dedicated helper.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
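The trim-plus-special-case rule can be illustrated in Python. This is a sketch of the idea only: `ast_type_name` is a hypothetical analogue of `astTypeName`, it matches on the id string (the C++ helper may well dispatch on the node class instead), and `"Function equals"` is an assumed example id. Only the two ids quoted in the commit message are listed.

```python
# Ids that embed a space as part of a descriptive name rather than as the
# delimiter before auxiliary data. Only the two ids quoted above are listed;
# the remaining dictionary sub-elements are left out because their exact
# id strings are not given here.
SPECIAL_IDS = {
    "Dictionary layout": "DictionaryLayout",
    "Dictionary lifetime": "DictionaryLifetime",
}


def ast_type_name(node_id: str) -> str:
    """Derive the JSON `type` from an id of the form '<Type> <aux data>'."""
    special = SPECIAL_IDS.get(node_id)
    if special is not None:
        return special
    # Default rule: everything before the first space is the type.
    return node_id.split(" ", 1)[0]
```

Without the special cases, both dictionary ids above would trim to `Dictionary` and become indistinguishable from the parent node by `type` alone.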
The dictionary sub-element nodes now carry distinct type ids, changing the shape signatures of dictionary-bearing statements. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pre-freeze format changes requested by the clickhouse-js-parser consumer
(which will pin reference fixtures against this output):
1. fieldToJSON emits UInt64/Int64 (and Settings `changes` values) as JSON
strings — values above 2^53 are corrupted by JS JSON.parse.
2. ASTQueryParameter gets a branch (`name`, `param_type`); executeQuery's
parameterized-view detection for EXPLAIN now uses getExplainedQuery() so
the params survive when EXPLAIN settings (json=1) are present.
3. Remaining positional-children nodes converted to named slots:
SelectIntersectExceptQuery -> `selects`; ColumnsExceptTransformer ->
`columns`; ColumnsReplaceTransformer -> `replacements`; Replacement ->
`expression`; StorageOrderByElement -> `expression`.
4. Asterisk / QualifiedAsterisk / COLUMNS matchers inline `transformers` as
an array (the ASTColumnsTransformerList wrapper is dropped).
5. Window frame: `frame_begin` / `frame_end` are `{type, offset?,
preceding?}` objects (preceding only when true), with `frame_type`.
6. Slot renames: `tables`->`from`, `limit_length`->`limit`,
`limit_offset`->`offset`; LIMIT BY grouped into `limit_by: {length,
offset?, by}`; ASTSetQuery type id -> `Settings`.
7. Versioned document wrapper `{ "version": 1, "ast": {...} }`; AST.md
documents the fallback FieldVisitorToString stringification, the full
enum value sets, and known source-fidelity divergences. New tests:
query parameters, big integers, and a node-type-id snapshot.
Non-JSON EXPLAIN AST text output is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
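A consumer-side sketch of items 1 and 7, in Python (the names `unwrap`, `literal_value`, and `SUPPORTED_VERSION` are hypothetical). Python integers are arbitrary precision, so the sketch uses a float round-trip to show the kind of loss a double-based JSON parser incurs above 2^53.

```python
import json

SUPPORTED_VERSION = 1  # the `version` this consumer was written against


def unwrap(document_text: str) -> dict:
    """Parse the versioned document and return the inner `ast` node."""
    document = json.loads(document_text)
    if document.get("version") != SUPPORTED_VERSION:
        raise ValueError("unsupported AST JSON version: %r" % document.get("version"))
    return document["ast"]


def literal_value(node: dict):
    """Decode a Literal node; 64-bit integers arrive as JSON strings."""
    if node["value_type"] in ("UInt64", "Int64"):
        return int(node["value"])
    return node["value"]


# A double cannot tell 2^53 + 1 from 2^53, which is why the value travels
# as a string instead of a JSON number.
LOSSY = float(2**53 + 1) == float(2**53)

# Literal node as it appears in the PR description's examples.
EIGHTEEN = {"type": "Literal", "value_type": "UInt64", "value": "18"}
```

Rejecting unknown versions up front is one way to turn a silent format drift into an explicit failure when pinned fixtures are regenerated.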
Regenerate the curated cases/ (41) and the depth-3 shapes/ (2144) corpus
under the new versioned document shape and slot changes. The harvester now
computes shape signatures on the unwrapped `ast` (not the { version, ast }
wrapper) so depth granularity is measured from the real root. README notes
the version wrapper and JS-safe integer strings.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The interpreter calls the versioned-document wrapper, not formatASTAsJSON directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "Downstream consumer" section explaining that the JSON is consumed by the TypeScript parser (which vendors tests/ast_json_fixtures as a reference suite), which format decisions exist to line up with its AST types, and the bump-version + regenerate-fixtures + flag-the-change workflow when the format changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
AST as JSON draft. Covers basic cases.
Contributing
Ask an LLM to load context from `AST.md` in the repo root and go from there. The rest is already set up by the upstream repo.
Design choices
Format
The top-level value is a versioned document: `{ "version": 1, "ast": { ... } }`. `version` (`AST_JSON_FORMAT_VERSION`) is bumped on any backwards-incompatible change so external consumers (e.g. clickhouse-js-parser) can pin fixtures and detect breaks.
Each node has a `type`; sub-nodes appear under named slots; `children` survives only on genuinely homogeneous lists (`ExpressionList`, `TablesInSelectQuery`). Absent optional slots are omitted (no `null`s). 64-bit integer literal values (and settings values) are emitted as JSON strings, because values above 2^53 are corrupted by JavaScript's `JSON.parse`; `value_type` says how to read them. See `AST.md` for the full schema, the enum value sets, the fallback stringification, and known source-fidelity divergences.
Examples
Run with:

    clickhouse local --format TSVRaw -q "EXPLAIN AST json = 1 <query>"

`--format TSVRaw` is what gives the clean multi-line JSON (otherwise the result is a single TSV cell with newlines escaped as `\n`). The query only needs to parse; referenced tables/columns need not exist.
A few patterns visible across all of them: operators are ordinary functions (`>=` → `greaterOrEquals`, `=` → `equals`, `is_operator: true`); absent clauses are simply omitted; integer literal `value`s are strings; and `children` survives only on `TablesInSelectQuery` here, everything else is a named slot.

    SELECT * FROM foo WHERE x = 1

    {
      "version": 1,
      "ast": {
        "type": "SelectWithUnionQuery",
        "selects": [
          {
            "type": "SelectQuery",
            "select": [ { "type": "Asterisk" } ],
            "from": {
              "type": "TablesInSelectQuery",
              "children": [
                {
                  "type": "TablesInSelectQueryElement",
                  "table_expression": {
                    "type": "TableExpression",
                    "database_and_table_name": { "type": "TableIdentifier", "name": "foo" }
                  }
                }
              ]
            },
            "where": {
              "type": "Function",
              "name": "equals",
              "is_operator": true,
              "arguments": [
                { "type": "Identifier", "name": "x" },
                { "type": "Literal", "value_type": "UInt64", "value": "1" }
              ]
            }
          }
        ]
      }
    }

    SELECT name, count() AS c FROM users WHERE age >= 18 GROUP BY name ORDER BY c DESC LIMIT 10

Every clause is a named slot on `SelectQuery`; the aggregate carries its `alias`.

    {
      "version": 1,
      "ast": {
        "type": "SelectWithUnionQuery",
        "selects": [
          {
            "type": "SelectQuery",
            "select": [
              { "type": "Identifier", "name": "name" },
              { "type": "Function", "alias": "c", "name": "count", "arguments": [] }
            ],
            "from": {
              "type": "TablesInSelectQuery",
              "children": [
                {
                  "type": "TablesInSelectQueryElement",
                  "table_expression": {
                    "type": "TableExpression",
                    "database_and_table_name": { "type": "TableIdentifier", "name": "users" }
                  }
                }
              ]
            },
            "where": {
              "type": "Function",
              "name": "greaterOrEquals",
              "is_operator": true,
              "arguments": [
                { "type": "Identifier", "name": "age" },
                { "type": "Literal", "value_type": "UInt64", "value": "18" }
              ]
            },
            "group_by": [ { "type": "Identifier", "name": "name" } ],
            "order_by": [
              {
                "type": "OrderByElement",
                "expression": { "type": "Identifier", "name": "c" },
                "direction": "DESC"
              }
            ],
            "limit": { "type": "Literal", "value_type": "UInt64", "value": "10" }
          }
        ]
      }
    }

    SELECT u.name, o.total FROM users AS u INNER JOIN orders AS o ON u.id = o.user_id

The `from` (`TablesInSelectQuery`) list holds the base table plus a JOIN element carrying both `table_join` and `table_expression`; compound identifiers expose `name_parts`, and table aliases ride on the identifier as `alias`.

    {
      "version": 1,
      "ast": {
        "type": "SelectWithUnionQuery",
        "selects": [
          {
            "type": "SelectQuery",
            "select": [
              { "type": "Identifier", "name": "u.name", "name_parts": ["u", "name"] },
              { "type": "Identifier", "name": "o.total", "name_parts": ["o", "total"] }
            ],
            "from": {
              "type": "TablesInSelectQuery",
              "children": [
                {
                  "type": "TablesInSelectQueryElement",
                  "table_expression": {
                    "type": "TableExpression",
                    "database_and_table_name": { "type": "TableIdentifier", "alias": "u", "name": "users" }
                  }
                },
                {
                  "type": "TablesInSelectQueryElement",
                  "table_join": {
                    "type": "TableJoin",
                    "kind": "INNER",
                    "on": {
                      "type": "Function",
                      "name": "equals",
                      "is_operator": true,
                      "arguments": [
                        { "type": "Identifier", "name": "u.id", "name_parts": ["u", "id"] },
                        { "type": "Identifier", "name": "o.user_id", "name_parts": ["o", "user_id"] }
                      ]
                    }
                  },
                  "table_expression": {
                    "type": "TableExpression",
                    "database_and_table_name": { "type": "TableIdentifier", "alias": "o", "name": "orders" }
                  }
                }
              ]
            }
          }
        ]
      }
    }

    INSERT INTO events (id, ts) VALUES (1, now())

The inline row data is not part of the parsed AST; the column list is inlined and the `format` is exposed.

    {
      "version": 1,
      "ast": {
        "type": "InsertQuery",
        "table": { "type": "Identifier", "name": "events" },
        "columns": [
          { "type": "Identifier", "name": "id" },
          { "type": "Identifier", "name": "ts" }
        ],
        "format": "Values"
      }
    }

    CREATE TABLE hits (id UInt64, url String) ENGINE = MergeTree ORDER BY id

Each `ColumnDeclaration` exposes its `data_type` (a `DataType` with the real type name), and the engine clause becomes `storage`.

    {
      "version": 1,
      "ast": {
        "type": "CreateQuery",
        "table": { "type": "Identifier", "name": "hits" },
        "columns_list": {
          "type": "Columns",
          "columns": [
            {
              "type": "ColumnDeclaration",
              "name": "id",
              "data_type": { "type": "DataType", "name": "UInt64" }
            },
            {
              "type": "ColumnDeclaration",
              "name": "url",
              "data_type": { "type": "DataType", "name": "String" }
            }
          ]
        },
        "storage": {
          "type": "Storage",
          "engine": { "type": "Function", "name": "MergeTree", "kind": "TABLE_ENGINE", "arguments": [] },
          "order_by": { "type": "Identifier", "name": "id" }
        }
      }
    }

Changelog category
Changelog entry
Added `EXPLAIN AST json = 1`, which serializes the parsed AST of a query to JSON as a versioned document `{ "version": 1, "ast": ... }`. Each node exposes its sub-nodes through named slots (e.g. `arguments`, `where`, `table_expression`) instead of a positional `children` array, which is kept only for homogeneous lists. 64-bit integer values are emitted as strings to stay safe for JSON consumers.
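Because every node carries a `type` and sub-nodes sit either under named slots or in `children`, the format described in this PR can be walked without a per-class schema. The Python sketch below (the name `iter_nodes` is hypothetical) treats any JSON object with a `type` key as a node and uses the `SELECT * FROM foo WHERE x = 1` output from the description as input.

```python
import json

# Output for `SELECT * FROM foo WHERE x = 1`, copied from the PR description.
DOCUMENT = json.loads("""
{ "version": 1, "ast": { "type": "SelectWithUnionQuery", "selects": [
  { "type": "SelectQuery",
    "select": [ { "type": "Asterisk" } ],
    "from": { "type": "TablesInSelectQuery", "children": [
      { "type": "TablesInSelectQueryElement",
        "table_expression": { "type": "TableExpression",
          "database_and_table_name": { "type": "TableIdentifier", "name": "foo" } } } ] },
    "where": { "type": "Function", "name": "equals", "is_operator": true,
      "arguments": [ { "type": "Identifier", "name": "x" },
                     { "type": "Literal", "value_type": "UInt64", "value": "1" } ] } } ] } }
""")


def iter_nodes(value):
    """Yield every object that has a `type` key, in document (pre-)order."""
    if isinstance(value, dict):
        if "type" in value:
            yield value
        for child in value.values():
            yield from iter_nodes(child)
    elif isinstance(value, list):
        for item in value:
            yield from iter_nodes(item)


NODE_TYPES = [node["type"] for node in iter_nodes(DOCUMENT["ast"])]
```

The walk visits named slots and the `children` list alike, so it needs no changes when a node class moves from positional children to named slots.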