JSON AST with properties #1

Open
peter-leonov-ch wants to merge 29 commits into master from json_ast

@peter-leonov-ch peter-leonov-ch commented Jun 10, 2026

Owner

AST as JSON draft. Covers basic cases.

Contributing

Ask an LLM to load context from AST.md in the repo root and proceed from there. The rest is already set up by the upstream repo.

Design choices

  • the node-type-aware serializer is a downcast switch-case chain with a default fallback (as opposed to implementing overloaded methods): this covers all the core nodes non-intrusively and leaves the AST code alone for now
  • the AST itself is object-based (as opposed to positional s-expressions): this is for readability and semantic meaning per se (no need to introduce wrapper nodes for if/else blocks etc.); the main downside is that the traverser has to be aware of all the "children" slots to be able to iterate over the tree
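
The traversal downside noted above can be softened on the consumer side. A minimal Python sketch, assuming (as the examples in this description suggest, not as a guarantee of the format) that every node is a JSON object with a `type` key and that sub-nodes are either such objects or lists of them. `iter_nodes` is a hypothetical helper name, not part of this PR:

```python
from typing import Any, Iterator


def iter_nodes(node: Any) -> Iterator[dict]:
    """Yield every AST node (a dict carrying a "type" key) in pre-order."""
    if isinstance(node, dict):
        if "type" in node:
            yield node
        # Assumption: any dict or list value may hold sub-nodes, whatever the slot is called.
        for value in node.values():
            yield from iter_nodes(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_nodes(item)


where = {
    "type": "Function",
    "name": "equals",
    "is_operator": True,
    "arguments": [
        {"type": "Identifier", "name": "x"},
        {"type": "Literal", "value_type": "UInt64", "value": "1"},
    ],
}
types = [n["type"] for n in iter_nodes(where)]
```

A walker like this does not need a per-type slot table, but it also cannot tell which slot a child came from. Consumers that care about slot semantics would still need the schema in AST.md.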

Format

The top-level value is a versioned document: { "version": 1, "ast": { ... } }. version (AST_JSON_FORMAT_VERSION) is bumped on any backwards-incompatible change so external consumers (e.g. clickhouse-js-parser) can pin fixtures and detect breaks.
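
A consumer can pin the format version before touching the tree. The following Python sketch only illustrates that idea; `load_ast` and `SUPPORTED_VERSION` are hypothetical names, and rejecting every other version is an assumed policy rather than something this PR prescribes:

```python
import json

SUPPORTED_VERSION = 1  # the version shown in this description


def load_ast(text: str) -> dict:
    """Parse a versioned document and return the root AST node."""
    doc = json.loads(text)
    version = doc.get("version")
    if version != SUPPORTED_VERSION:
        # Assumed policy: refuse anything this consumer was not written against.
        raise ValueError(f"unsupported AST JSON version: {version!r}")
    return doc["ast"]


root = load_ast('{"version": 1, "ast": {"type": "Asterisk"}}')
```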

Each node has a type; sub-nodes appear under named slots; children survives only on genuinely homogeneous lists (ExpressionList, TablesInSelectQuery). Absent optional slots are omitted (no nulls). 64-bit integer literal values (and settings values) are emitted as JSON strings — values above 2^53 are corrupted by JavaScript's JSON.parse; value_type says how to read them. See AST.md for the full schema, the enum value sets, the fallback stringification, and known source-fidelity divergences.
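
The reason for the string encoding is that JavaScript numbers are IEEE 754 doubles with a 53-bit significand. Python floats are the same doubles, so the effect can be reproduced there. The sketch below also shows one way a consumer might read a literal back; `literal_value` is a hypothetical helper, and the assumption that only `UInt64` and `Int64` are string-encoded integers comes from this description, not from the full schema in AST.md:

```python
# A double has a 53-bit significand, so 2^53 + 1 is not representable.
LIMIT = 2 ** 53
collides = float(LIMIT + 1) == float(LIMIT)  # both round to 2^53

# Assumption: these are the value_type names whose values arrive as strings.
INTEGER_VALUE_TYPES = {"UInt64", "Int64"}


def literal_value(node: dict):
    """Return a Literal's value, converting string-encoded 64-bit integers to int."""
    if node.get("value_type") in INTEGER_VALUE_TYPES:
        return int(node["value"])
    return node["value"]


big = literal_value(
    {"type": "Literal", "value_type": "UInt64", "value": "18446744073709551615"}
)
```

Python integers are arbitrary precision, so `int(...)` is lossless here; a JavaScript consumer would need `BigInt` or a similar type for the same step.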

Examples

Run with:

clickhouse local --format TSVRaw -q "EXPLAIN AST json = 1 <query>"

--format TSVRaw is what gives the clean multi-line JSON (otherwise the result is a single TSV cell with newlines escaped as \n). The query only needs to parse — referenced tables/columns need not exist.
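
For scripted use, the same command can be wrapped in a few lines. This is a sketch under the assumption that a `clickhouse` binary built from this branch is on `PATH`; `explain_ast_argv` and `explain_ast_json` are hypothetical helper names:

```python
import subprocess
from typing import List


def explain_ast_argv(query: str, binary: str = "clickhouse") -> List[str]:
    """Build the argv for the command shown above."""
    return [binary, "local", "--format", "TSVRaw", "-q", f"EXPLAIN AST json = 1 {query}"]


def explain_ast_json(query: str) -> str:
    """Run the command and return its stdout (needs the binary to be installed)."""
    result = subprocess.run(
        explain_ast_argv(query), capture_output=True, text=True, check=True
    )
    return result.stdout


argv = explain_ast_argv("SELECT 1")
```

Passing the query as a single argv element avoids shell quoting issues that the inline `-q "..."` form can run into.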

A few patterns are visible across all of them: operators are ordinary functions (>= becomes greaterOrEquals, = becomes equals, each with is_operator: true); absent clauses are simply omitted; integer literal values are strings; and children survives only on TablesInSelectQuery here, while everything else is a named slot.

SELECT * FROM foo WHERE x = 1

{
  "version": 1,
  "ast": {
    "type": "SelectWithUnionQuery",
    "selects": [
      {
        "type": "SelectQuery",
        "select": [
          { "type": "Asterisk" }
        ],
        "from": {
          "type": "TablesInSelectQuery",
          "children": [
            {
              "type": "TablesInSelectQueryElement",
              "table_expression": {
                "type": "TableExpression",
                "database_and_table_name": { "type": "TableIdentifier", "name": "foo" }
              }
            }
          ]
        },
        "where": {
          "type": "Function",
          "name": "equals",
          "is_operator": true,
          "arguments": [
            { "type": "Identifier", "name": "x" },
            { "type": "Literal", "value_type": "UInt64", "value": "1" }
          ]
        }
      }
    ]
  }
}

SELECT name, count() AS c FROM users WHERE age >= 18 GROUP BY name ORDER BY c DESC LIMIT 10

Every clause is a named slot on SelectQuery; the aggregate carries its alias.

{
  "version": 1,
  "ast": {
    "type": "SelectWithUnionQuery",
    "selects": [
      {
        "type": "SelectQuery",
        "select": [
          { "type": "Identifier", "name": "name" },
          { "type": "Function", "alias": "c", "name": "count", "arguments": [] }
        ],
        "from": {
          "type": "TablesInSelectQuery",
          "children": [
            {
              "type": "TablesInSelectQueryElement",
              "table_expression": {
                "type": "TableExpression",
                "database_and_table_name": { "type": "TableIdentifier", "name": "users" }
              }
            }
          ]
        },
        "where": {
          "type": "Function",
          "name": "greaterOrEquals",
          "is_operator": true,
          "arguments": [
            { "type": "Identifier", "name": "age" },
            { "type": "Literal", "value_type": "UInt64", "value": "18" }
          ]
        },
        "group_by": [
          { "type": "Identifier", "name": "name" }
        ],
        "order_by": [
          {
            "type": "OrderByElement",
            "expression": { "type": "Identifier", "name": "c" },
            "direction": "DESC"
          }
        ],
        "limit": { "type": "Literal", "value_type": "UInt64", "value": "10" }
      }
    ]
  }
}

SELECT u.name, o.total FROM users AS u INNER JOIN orders AS o ON u.id = o.user_id

The from (TablesInSelectQuery) list holds the base table plus a JOIN element carrying both table_join and table_expression; compound identifiers expose name_parts, and table aliases ride on the identifier as alias.

{
  "version": 1,
  "ast": {
    "type": "SelectWithUnionQuery",
    "selects": [
      {
        "type": "SelectQuery",
        "select": [
          { "type": "Identifier", "name": "u.name", "name_parts": ["u", "name"] },
          { "type": "Identifier", "name": "o.total", "name_parts": ["o", "total"] }
        ],
        "from": {
          "type": "TablesInSelectQuery",
          "children": [
            {
              "type": "TablesInSelectQueryElement",
              "table_expression": {
                "type": "TableExpression",
                "database_and_table_name": { "type": "TableIdentifier", "alias": "u", "name": "users" }
              }
            },
            {
              "type": "TablesInSelectQueryElement",
              "table_join": {
                "type": "TableJoin",
                "kind": "INNER",
                "on": {
                  "type": "Function",
                  "name": "equals",
                  "is_operator": true,
                  "arguments": [
                    { "type": "Identifier", "name": "u.id", "name_parts": ["u", "id"] },
                    { "type": "Identifier", "name": "o.user_id", "name_parts": ["o", "user_id"] }
                  ]
                }
              },
              "table_expression": {
                "type": "TableExpression",
                "database_and_table_name": { "type": "TableIdentifier", "alias": "o", "name": "orders" }
              }
            }
          ]
        }
      }
    ]
  }
}
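
As an illustration of reading this shape, the Python sketch below collects the table aliases from the `from` slot. It relies only on the slot names visible in the example above; `table_aliases` is a hypothetical helper, and table functions or subqueries in FROM are deliberately skipped:

```python
def table_aliases(select_query: dict) -> dict:
    """Map alias (or bare table name when there is no alias) to table name."""
    result = {}
    for element in select_query.get("from", {}).get("children", []):
        table = element.get("table_expression", {}).get("database_and_table_name")
        if table is None:
            continue  # table function or subquery: not handled in this sketch
        result[table.get("alias", table["name"])] = table["name"]
    return result


# Condensed from the JOIN example above (the `on` expression is omitted).
select_query = {
    "type": "SelectQuery",
    "from": {
        "type": "TablesInSelectQuery",
        "children": [
            {
                "type": "TablesInSelectQueryElement",
                "table_expression": {
                    "type": "TableExpression",
                    "database_and_table_name": {"type": "TableIdentifier", "alias": "u", "name": "users"},
                },
            },
            {
                "type": "TablesInSelectQueryElement",
                "table_join": {"type": "TableJoin", "kind": "INNER"},
                "table_expression": {
                    "type": "TableExpression",
                    "database_and_table_name": {"type": "TableIdentifier", "alias": "o", "name": "orders"},
                },
            },
        ],
    },
}
aliases = table_aliases(select_query)
```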

INSERT INTO events (id, ts) VALUES (1, now())

The inline row data is not part of the parsed AST; the column list is inlined and the format is exposed.

{
  "version": 1,
  "ast": {
    "type": "InsertQuery",
    "table": { "type": "Identifier", "name": "events" },
    "columns": [
      { "type": "Identifier", "name": "id" },
      { "type": "Identifier", "name": "ts" }
    ],
    "format": "Values"
  }
}

CREATE TABLE hits (id UInt64, url String) ENGINE = MergeTree ORDER BY id

Each ColumnDeclaration exposes its data_type (a DataType with the real type name), and the engine clause becomes storage.

{
  "version": 1,
  "ast": {
    "type": "CreateQuery",
    "table": { "type": "Identifier", "name": "hits" },
    "columns_list": {
      "type": "Columns",
      "columns": [
        { "type": "ColumnDeclaration", "name": "id", "data_type": { "type": "DataType", "name": "UInt64" } },
        { "type": "ColumnDeclaration", "name": "url", "data_type": { "type": "DataType", "name": "String" } }
      ]
    },
    "storage": {
      "type": "Storage",
      "engine": { "type": "Function", "name": "MergeTree", "kind": "TABLE_ENGINE", "arguments": [] },
      "order_by": { "type": "Identifier", "name": "id" }
    }
  }
}
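
A short Python sketch of consuming this output, assuming only the slots shown in the example above (`columns_list.columns`, `name`, `data_type.name`); `column_types` is a hypothetical helper:

```python
def column_types(create_query: dict) -> dict:
    """Map column name to its declared type name for a CreateQuery node."""
    columns = create_query.get("columns_list", {}).get("columns", [])
    # Columns without an explicit data_type are skipped in this sketch.
    return {c["name"]: c["data_type"]["name"] for c in columns if "data_type" in c}


create_query = {
    "type": "CreateQuery",
    "table": {"type": "Identifier", "name": "hits"},
    "columns_list": {
        "type": "Columns",
        "columns": [
            {"type": "ColumnDeclaration", "name": "id", "data_type": {"type": "DataType", "name": "UInt64"}},
            {"type": "ColumnDeclaration", "name": "url", "data_type": {"type": "DataType", "name": "String"}},
        ],
    },
}
schema = column_types(create_query)
```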

Changelog category

  • New Feature

Changelog entry

Added EXPLAIN AST json = 1 which serializes the parsed AST of a query to JSON as a versioned document { "version": 1, "ast": ... }. Each node exposes its sub-nodes through named slots (e.g. arguments, where, table_expression) instead of a positional children array, which is kept only for homogeneous lists. 64-bit integer values are emitted as strings to stay safe for JSON consumers.

peter-leonov-ch and others added 13 commits May 22, 2026 21:00
Replace the positional `children` array on `ASTFunction` nodes with named
JSON slots so consumers no longer have to know the in-class convention to
tell `arguments` from `parameters`.

`enrichNode` now returns a `handled_children` flag; when set,
`formatASTAsJSON` skips the generic `children` walk for that node. For
`ASTFunction` it emits:

  - `arguments`: always an array (possibly empty), inlining the inner
    `ASTExpressionList` wrapper;
  - `parameters`: only for parametric aggregates, same inlining;
  - `window_definition`: the node as-is (not an `ASTExpressionList`);
  - `window_name`: as before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Expose `ASTOrderByElement` sub-nodes through named slots — `expression`,
`collation`, `fill_from`, `fill_to`, `fill_step`, `fill_staleness` — and
suppress the positional `children` array, reusing the `handled_children`
mechanism introduced for `ASTFunction`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drive the JSON walker from `ASTSelectQuery`'s `Expression` positions: emit
one named slot per clause (`with`, `select`, `tables`, `where`, `group_by`,
`order_by`, `limit_length`, ...) and suppress the opaque positional
`children` array. List-shaped clauses inline their `ASTExpressionList`
wrapper; single-expression clauses are emitted as nodes; absent clauses are
omitted.

`ALIASES` / `CTE_ALIASES` are analyzer state and do not appear on the
parsed ASTs returned by `EXPLAIN AST`, so they are left out of the table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update the design note to reflect the landed implementation: the
`handled_children` mechanism, the `inlineExpressionList` / `addNodeSlot`
helpers, and the named slots for `ASTFunction`, `ASTOrderByElement`, and
`ASTSelectQuery`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`ASTSelectIntersectExceptQuery` derives from `ASTSelectQuery` but stores
its operand selects in the positional `children` array rather than in the
`Expression` slots. The `dynamic_cast<const ASTSelectQuery *>` branch
therefore caught it, found no slots, and returned `handled_children =
true` — silently dropping both operand selects from the JSON.

Match `ASTSelectIntersectExceptQuery` explicitly before `ASTSelectQuery`,
emit its `operator`, and leave `children` intact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
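
The ordering problem fixed above is generic to any downcast chain: a base-class test placed first also matches every subclass. A Python analogue using `isinstance` in place of `dynamic_cast`; the classes are stand-ins for illustration, not the real ClickHouse types:

```python
class SelectQuery:
    """Stand-in for the base class."""


class SelectIntersectExceptQuery(SelectQuery):
    """Stand-in for the subclass that stores its operands differently."""


def classify_base_first(node) -> str:
    if isinstance(node, SelectQuery):  # also true for the subclass
        return "SelectQuery"
    if isinstance(node, SelectIntersectExceptQuery):  # never reached
        return "SelectIntersectExceptQuery"
    return "fallback"


def classify_subclass_first(node) -> str:
    if isinstance(node, SelectIntersectExceptQuery):
        return "SelectIntersectExceptQuery"
    if isinstance(node, SelectQuery):
        return "SelectQuery"
    return "fallback"


node = SelectIntersectExceptQuery()
wrong = classify_base_first(node)
right = classify_subclass_first(node)
```
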
Add stateless tests covering the JSON dump over a wider slice of SQL:
literal value types, lambda functions, operators (which the parser
normalizes to ordinary functions), CAST forms, the NULLS action, and
compound / qualified identifiers.

The union test also guards against the `ASTSelectIntersectExceptQuery`
regression: its operand selects must remain present in `children`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a runnable showcase: one deeply nested query exercising most named
slots and node types at once (CTE + scalar-subquery WITH, DISTINCT,
lambda, window functions inline and named, CASE, CAST, JOIN over a
subquery, PREWHERE, WHERE with IN-subquery and BETWEEN, GROUP BY WITH
ROLLUP, HAVING, WINDOW, QUALIFY, ORDER BY WITH FILL INTERPOLATE, LIMIT BY,
LIMIT OFFSET, SETTINGS). The full JSON AST is dumped without node
filtering so the reference doubles as a readable example of the format.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend the named-slots treatment from the expression/clause nodes to the
structural wrappers that produced the long single-child `children` chains:

  - ASTSelectWithUnionQuery: `selects` (inlines `list_of_selects`);
  - ASTSubquery: `cte_name`, `query`;
  - ASTWithElement: `name`, `subquery`, `aliases`;
  - ASTTablesInSelectQueryElement: `table_join`, `table_expression`,
    `array_join`;
  - ASTTableExpression: `database_and_table_name` / `table_function` /
    `subquery`, `final`, `sample_size`, `sample_offset`, `column_aliases`;
  - ASTTableJoin: `kind`, `strictness`, `locality`, `using`, `on`;
  - ASTArrayJoin: `kind`, `expressions`;
  - ASTWindowListElement: `name`, `definition`;
  - ASTWindowDefinition: `parent_window_name`, `partition_by`, `order_by`
    and the frame fields when the frame is non-default;
  - ASTInterpolateElement: `column`, `expr`.

After this, `children` survives only on the homogeneous lists
(`ExpressionList`, `TablesInSelectQuery`). As a side effect this also
exposes the `ASTWindowListElement` definition, which was previously absent
from the JSON entirely (it is not stored in `children`). Design notes in
AST3.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cover joins (INNER/ON, LEFT/USING, GLOBAL ANY strictness+locality, CROSS,
COMMA), LEFT ARRAY JOIN, table function and subquery table expressions,
FINAL + SAMPLE, a named CTE whose subquery collapses to `query.selects`, a
RANGE window frame, inlined UNION `selects`, and an interpolate element.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The last nodes rendering as a bare `{"type": ...}` held non-AST state in
private members rather than in `children`. Expose it:

  - ASTSetQuery (the SETTINGS clause): `changes` as a name -> value object,
    plus `default_settings`;
  - ASTSampleRatio: `numerator` / `denominator` (exact rationals, emitted
    as strings since they can exceed UInt64);
  - ASTAsterisk / ASTQualifiedAsterisk: `expression` / `qualifier` and
    `transformers`;
  - COLUMNS matchers (`pattern`, `columns`) and transformers (`func_name`,
    `parameters`, `lambda`, `is_strict`, replacement `name`).

A plain `SELECT *` still serializes as `{"type": "Asterisk"}` — that node
has no attached state. `children` now survives only on homogeneous lists
(ExpressionList, TablesInSelectQuery, and the EXCEPT / REPLACE column
lists). Design notes in AST3.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Merge the AST2.md / AST3.md working notes into a single contributor-facing
AST.md: how to run it, the output contract, the per-class schema, the
dispatch-ordering caveat, how to extend it to a new node, and the test
list. Remove the superseded AST2.md and AST3.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@peter-leonov-ch peter-leonov-ch changed the title from "Json ast" to "JSON AST with properties" Jun 10, 2026
A standalone set of paired <name>.sql / <name>.json files under
tests/ast_json_fixtures/cases (41 cases) capturing the `EXPLAIN AST
json = 1` output for a broad spread of SQL, plus a generate.sh to
(re)produce them and a README describing how to use them as golden files
for verifying an alternative parser. Not wired into CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Adds a new EXPLAIN AST json = 1 mode that serializes the parsed ClickHouse AST into a structured JSON format (named slots for semantic children, children kept for homogeneous lists), along with documentation, stateless golden tests, and a standalone fixture corpus for external parser/serializer validation.

Changes:

  • Implement formatASTAsJSON() (JSONBuilder-based AST serializer) and integrate it into EXPLAIN AST via a new json option.
  • Add stateless tests covering core AST node families (functions, ORDER BY, SELECT clauses, wrappers, UNION/INTERSECT/EXCEPT, leaf state).
  • Add AST.md spec plus a separate tests/ast_json_fixtures/ corpus + generator script.

Reviewed changes

Copilot reviewed 104 out of 104 changed files in this pull request and generated 3 comments.

File Description
AST.md Documents the JSON AST contract, schema, and extension/testing guidance.
src/Interpreters/InterpreterExplainQuery.cpp Adds json option to EXPLAIN AST and routes output through formatASTAsJSON().
src/Parsers/DumpASTNode.h Declares formatASTAsJSON() and documents JSON AST conventions.
src/Parsers/DumpASTNode.cpp Implements JSON AST formatting and per-node enrichment logic.
tests/queries/0_stateless/03917_explain_ast_json_function.sh Stateless test coverage for Function node slots.
tests/queries/0_stateless/03917_explain_ast_json_function.reference Golden output for function JSON AST extraction.
tests/queries/0_stateless/03918_explain_ast_json_order_by.sh Stateless test coverage for OrderByElement slots.
tests/queries/0_stateless/03918_explain_ast_json_order_by.reference Golden output for ORDER BY JSON AST extraction.
tests/queries/0_stateless/03919_explain_ast_json_select_query.sh Stateless test coverage for SelectQuery clause slots and ordering.
tests/queries/0_stateless/03919_explain_ast_json_select_query.reference Golden output for SELECT clause slot behavior.
tests/queries/0_stateless/03920_explain_ast_json_expressions.sh Stateless test coverage for literals/operators/lambda/cast/nulls/identifiers.
tests/queries/0_stateless/03920_explain_ast_json_expressions.reference Golden output for expression JSON AST extraction.
tests/queries/0_stateless/03921_explain_ast_json_union.sh Stateless test coverage for UNION modes and INTERSECT/EXCEPT operator+operands.
tests/queries/0_stateless/03921_explain_ast_json_union.reference Golden output for set-ops JSON AST extraction.
tests/queries/0_stateless/03922_explain_ast_json_showcase.sh Full AST JSON “showcase” query (no filtering).
tests/queries/0_stateless/03922_explain_ast_json_showcase.reference Golden pretty-printed JSON AST for showcase query.
tests/queries/0_stateless/03923_explain_ast_json_wrappers.sh Stateless test coverage for structural wrapper nodes (JOIN, subquery, window, interpolate, etc.).
tests/queries/0_stateless/03923_explain_ast_json_wrappers.reference Golden output for wrapper-node JSON AST extraction.
tests/queries/0_stateless/03924_explain_ast_json_leaf_state.sh Stateless test coverage for leaf-ish node state (SETTINGS, SAMPLE, asterisks, COLUMNS).
tests/queries/0_stateless/03924_explain_ast_json_leaf_state.reference Golden output for leaf-state JSON AST extraction.
tests/ast_json_fixtures/README.md Describes the standalone SQL→JSON-AST fixture corpus and regeneration/verification workflow.
tests/ast_json_fixtures/generate.sh Script to regenerate the fixture corpus via EXPLAIN AST json = 1.
tests/ast_json_fixtures/cases/01_literals.sql Fixture input: literals.
tests/ast_json_fixtures/cases/01_literals.json Fixture output: literals AST JSON.
tests/ast_json_fixtures/cases/02_identifiers.sql Fixture input: identifiers.
tests/ast_json_fixtures/cases/02_identifiers.json Fixture output: identifiers AST JSON.
tests/ast_json_fixtures/cases/03_function_call.sql Fixture input: function call.
tests/ast_json_fixtures/cases/03_function_call.json Fixture output: function call AST JSON.
tests/ast_json_fixtures/cases/04_no_args.sql Fixture input: zero-arg function.
tests/ast_json_fixtures/cases/04_no_args.json Fixture output: zero-arg function AST JSON.
tests/ast_json_fixtures/cases/05_operators.sql Fixture input: operators.
tests/ast_json_fixtures/cases/05_operators.json Fixture output: operators normalized to functions.
tests/ast_json_fixtures/cases/06_subscript.sql Fixture input: subscripts/tuple element access.
tests/ast_json_fixtures/cases/06_subscript.json Fixture output: subscript AST JSON.
tests/ast_json_fixtures/cases/07_cast.sql Fixture input: CAST forms.
tests/ast_json_fixtures/cases/07_cast.json Fixture output: CAST AST JSON.
tests/ast_json_fixtures/cases/08_lambda.sql Fixture input: lambda expression.
tests/ast_json_fixtures/cases/08_lambda.json Fixture output: lambda AST JSON.
tests/ast_json_fixtures/cases/09_parametric_agg.sql Fixture input: parametric aggregate.
tests/ast_json_fixtures/cases/09_parametric_agg.json Fixture output: parametric aggregate AST JSON.
tests/ast_json_fixtures/cases/10_nulls_action.sql Fixture input: NULLS action.
tests/ast_json_fixtures/cases/10_nulls_action.json Fixture output: NULLS action AST JSON.
tests/ast_json_fixtures/cases/11_window_named.sql Fixture input: named window.
tests/ast_json_fixtures/cases/11_window_named.json Fixture output: named window AST JSON.
tests/ast_json_fixtures/cases/12_window_inline.sql Fixture input: inline window definition + frame.
tests/ast_json_fixtures/cases/12_window_inline.json Fixture output: inline window AST JSON.
tests/ast_json_fixtures/cases/13_select_star.sql Fixture input: SELECT * with WHERE.
tests/ast_json_fixtures/cases/13_select_star.json Fixture output: SELECT * AST JSON.
tests/ast_json_fixtures/cases/14_distinct.sql Fixture input: DISTINCT.
tests/ast_json_fixtures/cases/14_distinct.json Fixture output: DISTINCT AST JSON.
tests/ast_json_fixtures/cases/15_group_by_rollup.sql Fixture input: GROUP BY ROLLUP + HAVING.
tests/ast_json_fixtures/cases/15_group_by_rollup.json Fixture output: GROUP BY ROLLUP AST JSON.
tests/ast_json_fixtures/cases/16_grouping_sets.sql Fixture input: GROUPING SETS.
tests/ast_json_fixtures/cases/16_grouping_sets.json Fixture output: GROUPING SETS AST JSON.
tests/ast_json_fixtures/cases/17_qualify.sql Fixture input: QUALIFY.
tests/ast_json_fixtures/cases/17_qualify.json Fixture output: QUALIFY AST JSON.
tests/ast_json_fixtures/cases/18_order_by_fill.sql Fixture input: ORDER BY WITH FILL + INTERPOLATE.
tests/ast_json_fixtures/cases/18_order_by_fill.json Fixture output: ORDER BY fill/interpolate AST JSON.
tests/ast_json_fixtures/cases/19_limits.sql Fixture input: LIMIT BY + LIMIT/OFFSET.
tests/ast_json_fixtures/cases/19_limits.json Fixture output: LIMITs AST JSON.
tests/ast_json_fixtures/cases/20_settings.sql Fixture input: SETTINGS clause.
tests/ast_json_fixtures/cases/20_settings.json Fixture output: SETTINGS AST JSON.
tests/ast_json_fixtures/cases/21_join_on.sql Fixture input: JOIN ... ON.
tests/ast_json_fixtures/cases/21_join_on.json Fixture output: JOIN ON AST JSON.
tests/ast_json_fixtures/cases/22_join_using.sql Fixture input: JOIN ... USING.
tests/ast_json_fixtures/cases/22_join_using.json Fixture output: JOIN USING AST JSON.
tests/ast_json_fixtures/cases/23_join_global_any.sql Fixture input: GLOBAL ANY join.
tests/ast_json_fixtures/cases/23_join_global_any.json Fixture output: GLOBAL ANY join AST JSON.
tests/ast_json_fixtures/cases/24_cross_comma.sql Fixture input: CROSS + COMMA joins.
tests/ast_json_fixtures/cases/24_cross_comma.json Fixture output: CROSS/COMMA AST JSON.
tests/ast_json_fixtures/cases/25_array_join.sql Fixture input: ARRAY JOIN.
tests/ast_json_fixtures/cases/25_array_join.json Fixture output: ARRAY JOIN AST JSON.
tests/ast_json_fixtures/cases/26_table_function.sql Fixture input: table function.
tests/ast_json_fixtures/cases/26_table_function.json Fixture output: table function AST JSON.
tests/ast_json_fixtures/cases/27_subquery_from.sql Fixture input: subquery in FROM.
tests/ast_json_fixtures/cases/27_subquery_from.json Fixture output: subquery FROM AST JSON.
tests/ast_json_fixtures/cases/28_sample_final.sql Fixture input: FINAL + SAMPLE.
tests/ast_json_fixtures/cases/28_sample_final.json Fixture output: FINAL/SAMPLE AST JSON.
tests/ast_json_fixtures/cases/29_cte.sql Fixture input: CTE.
tests/ast_json_fixtures/cases/29_cte.json Fixture output: CTE AST JSON.
tests/ast_json_fixtures/cases/30_scalar_with.sql Fixture input: scalar WITH binding.
tests/ast_json_fixtures/cases/30_scalar_with.json Fixture output: scalar WITH AST JSON.
tests/ast_json_fixtures/cases/31_union_all.sql Fixture input: UNION ALL.
tests/ast_json_fixtures/cases/31_union_all.json Fixture output: UNION ALL AST JSON.
tests/ast_json_fixtures/cases/32_union_distinct.sql Fixture input: UNION DISTINCT.
tests/ast_json_fixtures/cases/32_union_distinct.json Fixture output: UNION DISTINCT AST JSON.
tests/ast_json_fixtures/cases/33_intersect.sql Fixture input: INTERSECT.
tests/ast_json_fixtures/cases/33_intersect.json Fixture output: INTERSECT AST JSON.
tests/ast_json_fixtures/cases/34_except.sql Fixture input: EXCEPT.
tests/ast_json_fixtures/cases/34_except.json Fixture output: EXCEPT AST JSON.
tests/ast_json_fixtures/cases/35_qualified_star.sql Fixture input: qualified asterisk.
tests/ast_json_fixtures/cases/35_qualified_star.json Fixture output: qualified asterisk AST JSON.
tests/ast_json_fixtures/cases/36_except_transform.sql Fixture input: * EXCEPT.
tests/ast_json_fixtures/cases/36_except_transform.json Fixture output: * EXCEPT AST JSON.
tests/ast_json_fixtures/cases/37_apply_transform.sql Fixture input: * APPLY.
tests/ast_json_fixtures/cases/37_apply_transform.json Fixture output: * APPLY AST JSON.
tests/ast_json_fixtures/cases/38_replace_transform.sql Fixture input: * REPLACE.
tests/ast_json_fixtures/cases/38_replace_transform.json Fixture output: * REPLACE AST JSON.
tests/ast_json_fixtures/cases/39_columns_regexp.sql Fixture input: COLUMNS('regexp').
tests/ast_json_fixtures/cases/39_columns_regexp.json Fixture output: COLUMNS('regexp') AST JSON.
tests/ast_json_fixtures/cases/40_columns_list.sql Fixture input: COLUMNS(a, b).
tests/ast_json_fixtures/cases/40_columns_list.json Fixture output: COLUMNS(a, b) AST JSON.
tests/ast_json_fixtures/cases/41_showcase.sql Fixture input: maximal showcase query.
tests/ast_json_fixtures/cases/41_showcase.json Fixture output: maximal showcase AST JSON.

Comment thread AST.md Outdated
Comment thread tests/ast_json_fixtures/README.md Outdated
Comment thread src/Parsers/DumpASTNode.h Outdated
peter-leonov-ch and others added 12 commits June 10, 2026 18:21
Add harvest_stateless.py: it extracts statements from
tests/queries/0_stateless/*.sql (a custom quote/comment-aware splitter),
drops query-parameter placeholders and duplicates, and runs each through
`EXPLAIN AST json = 1`, writing SQL -> JSON golden pairs.

~94k unique param-free statements remain, at ~100% parse yield. Execution
is batched through one `clickhouse local` per batch with
`--multiquery --ignore-error`; since EXPLAIN only parses (never executes)
this is side-effect-free and resynchronises past rejects, so a full
harvest runs in well under a minute. The bulk output is gitignored, not
committed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add `--dedupe shape` to the harvester: keep one representative per distinct
AST shape (a structural signature of node types, slot keys, enum/flag
scalars and literal value_type, dropping leaf values), choosing the
shortest statement so the selection is deterministic. `--shape-depth D`
trades coverage for size (depth 3 -> 1440 shapes, depth 4 -> 5141, ... ,
unlimited -> 16733 over the stateless corpus).

Commit the depth-3 set under tests/ast_json_fixtures/shapes/: 1440
structurally-distinct, shortest-representative SQL -> JSON pairs (~12 MB) —
a compact, high-coverage corpus for verifying an alternative parser.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
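
The shape-based dedupe described above can be approximated as follows. This is a simplified Python sketch, not the code in harvest_stateless.py: it keeps `type`, `value_type`, and boolean flags, drops every other scalar (so string-valued enums are lost, unlike in the description above), and breaks length ties lexicographically, which is an assumption. `shape`, `shape_key`, and `dedupe_by_shape` are hypothetical names:

```python
import json


def shape(node, depth=None):
    """Structural signature of a JSON AST value; leaf values are dropped."""
    if isinstance(node, list):
        return [shape(item, depth) for item in node]
    if not isinstance(node, dict):
        return None
    if depth is not None and depth <= 0:
        return node.get("type")  # below the depth limit only the type survives
    next_depth = None if depth is None else depth - 1
    sig = {}
    for key, value in node.items():
        if key in ("type", "value_type") or isinstance(value, bool):
            sig[key] = value
        elif isinstance(value, (dict, list)):
            sig[key] = shape(value, next_depth)
        else:
            sig[key] = None  # keep the slot key, drop the leaf value
    return sig


def shape_key(node, depth=None) -> str:
    return json.dumps(shape(node, depth), sort_keys=True)


def dedupe_by_shape(pairs, depth=None):
    """Keep the shortest SQL statement per distinct shape."""
    best = {}
    for sql, ast in pairs:
        key = shape_key(ast, depth)
        if key not in best or (len(sql), sql) < (len(best[key]), best[key]):
            best[key] = sql
    return list(best.values())


pairs = [
    ("SELECT 22", {"type": "Literal", "value_type": "UInt64", "value": "22"}),
    ("SELECT 1", {"type": "Literal", "value_type": "UInt64", "value": "1"}),
    ("SELECT 'a'", {"type": "Literal", "value_type": "String", "value": "a"}),
]
kept = dedupe_by_shape(pairs)
```
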
Enrich the highest-frequency DDL/DML nodes so they no longer fall back to
positional `children`:

  - ASTCreateQuery: flags (attach/temporary/if_not_exists/is_*_view/
    is_dictionary/replace_*/create_or_replace), database, table,
    columns_list, aliases, storage, as_table_function, as_database/as_table,
    select, targets, comment, dictionary_attributes, dictionary;
  - ASTColumns: columns / indices / constraints / projections (inlined),
    primary_key, primary_key_from_columns;
  - ASTColumnDeclaration: name, data_type, default_specifier,
    default_expression, null_modifier, ephemeral_default,
    primary_key_specifier, comment, codec, statistics, ttl, collation,
    settings;
  - ASTDataType: name, arguments (inlined) — exposes the type name that was
    previously hidden;
  - ASTStorage: engine, partition_by, primary_key, order_by, sample_by,
    ttl_table, settings;
  - ASTInsertQuery: database, table, table_function, columns, format,
    partition_by, settings, select, infile, compression.

Adds `IAST *` overloads of `addNodeSlot` / `inlineExpressionList` for the
raw-pointer members these classes hold alongside `children`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The CREATE-TABLE family / INSERT nodes now expose named slots, so their AST
shapes carry more structure. Regenerating the depth-3 corpus grows it from
1440 to 1812 distinct shapes — finer coverage of the DDL surface.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… ops

Group 1 (CREATE-TABLE sub-elements):
  - ASTIndexDeclaration: name, expression, index_type, granularity;
  - ASTConstraintDeclaration: name, constraint_type, expression;
  - ASTProjectionDeclaration: name, query, index;
  - ASTProjectionSelectQuery: with / select / group_by / order_by (inlined);
  - ASTTTLElement: mode, ttl, destination_type/name (MOVE), where,
    recompression_codec;
  - ASTPartition: all, value, id.

Group 3 (lightweight DML and simple table ops):
  - ASTAssignment: column, expression;
  - ASTDeleteQuery / ASTUpdateQuery: table target, predicate, assignments,
    partition;
  - ASTDropQuery (covers DETACH / TRUNCATE): kind + flags + table target;
  - ASTOptimizeQuery: table target, partition, final, deduplicate, cleanup.

Adds an `addTableTarget` helper for the shared database/table/temporary
slots and small enum mappers (TTL mode, data destination, constraint type,
drop kind). AlterQuery/AlterCommand and the dictionary nodes remain (noted
in AST.md).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Group 2 (ALTER):
  - ASTAlterQuery: alter_object, table target, cluster, commands (inlined);
  - ASTAlterCommand: command_type (full Type enum mapped to a name), flags,
    and the applicable sub-node slots (column_declaration, column, index*,
    partition, predicate, assignments, ttl, settings_changes, select,
    rename_to, ...) plus the from*/to*/move_destination_name strings.

Group 4 (dictionaries / functions):
  - ASTCreateFunctionQuery: function_name, function_core, flags;
  - ASTDictionary + sub-elements (layout, lifetime, range, settings);
  - ASTDictionaryAttributeDeclaration;
  - ASTFunctionWithKeyValueArguments + ASTPair (key/value);
  - ASTViewTargets.

Known wart documented in AST.md: ASTDictionary and its sub-elements share
type "Dictionary" because getID trims "Dictionary <word>" at the space;
their named slots disambiguate them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…enrichment

The ALTER family and dictionary nodes now expose named slots, so their AST
shapes carry more structure. The depth-3 corpus grows from 1812 to 2024
distinct shapes. The big DDL/DML node types no longer fall back to
positional children; only a rare long tail (Explain, Describe, Show*, Kill,
SYSTEM, ...) remains.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Enrich the remaining common statement/element nodes:
  - ASTExplainQuery, ASTDescribeQuery, ASTShowTablesQuery;
  - ASTCreateIndexQuery, ASTDropIndexQuery, ASTCheckTableQuery, ASTUseQuery;
  - ASTKillQueryQuery, ASTRenameQuery, ASTSystemQuery;
  - ASTStatisticsDeclaration, ASTStorageOrderByElement, ASTNameTypePair;
  - the qualified COLUMNS matchers.

Plus a generic ASTQueryWithTableAndOutput fallback (placed after every
richer subclass) that gives database/table/temporary to the simple
table-scoped statements — EXISTS, SHOW CREATE, UNDROP, ...

Remaining unenriched: access-management (CreateUser/AuthenticationData/
ShowGrants/...) and BACKUP/RESTORE; ParallelWithQuery keeps children as a
homogeneous list. Noted in AST.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nrichment

Depth-3 corpus grows from 2024 to 2165 distinct shapes. The remaining
positional-children nodes are just the access-management statements and
BACKUP/RESTORE (plus the homogeneous ParallelWithQuery list and the
intentionally child-keeping StorageOrderByElement).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `type` id is `IAST::getID(' ')` trimmed at the first space. That works
because getID packs auxiliary data after a space delimiter — but
ASTDictionary and its sub-elements return descriptive getIDs that embed a
space ("Dictionary lifetime", "Dictionary layout", ...), so trimming
collapsed all five onto "Dictionary".

Factor the type derivation into `astTypeName` and special-case the four
dictionary sub-elements (DictionaryLayout / DictionaryLifetime /
DictionaryRange / DictionarySettings); everything else keeps the trim.
JSONMap only appends, so this can't live in enrichNode (which runs after
`type` is emitted) — hence a small dedicated helper.
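
The derivation described above can be modelled roughly as follows. This is a Python sketch of the idea, not the C++ `astTypeName` helper itself: the four target names come from this message, while the exact "Dictionary range" and "Dictionary settings" spellings and the `Identifier x` sample id are assumptions.

```python
# Assumed getID strings for the four dictionary sub-elements. Only
# "Dictionary lifetime" and "Dictionary layout" are quoted in the message;
# the other two spellings are guesses made for this sketch.
DICTIONARY_SUB_ELEMENTS = {
    "Dictionary layout": "DictionaryLayout",
    "Dictionary lifetime": "DictionaryLifetime",
    "Dictionary range": "DictionaryRange",
    "Dictionary settings": "DictionarySettings",
}


def ast_type_name(node_id: str) -> str:
    """Derive the JSON `type` from a getID(' ')-style string."""
    special = DICTIONARY_SUB_ELEMENTS.get(node_id)
    if special is not None:
        return special
    # Everything else keeps the trim at the first space.
    return node_id.split(" ", 1)[0]
```

With this model, `ast_type_name("Dictionary lifetime")` yields `DictionaryLifetime`, while a plain `Dictionary` id and any id of the form `<Type> <auxiliary data>` still reduce to the leading word.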

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The dictionary sub-element nodes now carry distinct type ids, changing the
shape signatures of dictionary-bearing statements.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pre-freeze format changes requested by the clickhouse-js-parser consumer
(which will pin reference fixtures against this output):

1. fieldToJSON emits UInt64/Int64 (and Settings `changes` values) as JSON
   strings, because values above 2^53 lose precision in JS JSON.parse.
2. ASTQueryParameter gets a branch (`name`, `param_type`); executeQuery's
   parameterized-view detection for EXPLAIN now uses getExplainedQuery() so
   the params survive when EXPLAIN settings (json=1) are present.
3. Remaining positional-children nodes converted to named slots:
   SelectIntersectExceptQuery -> `selects`; ColumnsExceptTransformer ->
   `columns`; ColumnsReplaceTransformer -> `replacements`; Replacement ->
   `expression`; StorageOrderByElement -> `expression`.
4. Asterisk / QualifiedAsterisk / COLUMNS matchers inline `transformers` as
   an array (the ASTColumnsTransformerList wrapper is dropped).
5. Window frame: `frame_begin` / `frame_end` are `{type, offset?,
   preceding?}` objects (preceding only when true), with `frame_type`.
6. Slot renames: `tables`->`from`, `limit_length`->`limit`,
   `limit_offset`->`offset`; LIMIT BY grouped into `limit_by: {length,
   offset?, by}`; ASTSetQuery type id -> `Settings`.
7. Versioned document wrapper `{ "version": 1, "ast": {...} }`; AST.md
   documents the fallback FieldVisitorToString stringification, the full
   enum value sets, and known source-fidelity divergences. New tests:
   query parameters, big integers, and a node-type-id snapshot.
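
Item 1 can be illustrated from the consumer side. The sketch below is hypothetical Python, not part of this change: the `Literal` node shape is assumed, and `literal_value` is an illustrative helper name. Python integers are arbitrary precision, so the double round trip stands in for what a JavaScript number would do.

```python
import json

# Type names taken from item 1; no other value_type names are assumed here.
INTEGER_VALUE_TYPES = {"UInt64", "Int64"}


def literal_value(node: dict):
    """Decode a literal's value, turning string-encoded 64-bit integers into int."""
    value = node["value"]
    if node.get("value_type") in INTEGER_VALUE_TYPES and isinstance(value, str):
        return int(value)
    return value


# Assumed node shape, for illustration only.
document = json.loads(
    '{"version": 1, "ast": {"type": "Literal",'
    ' "value": "18446744073709551615", "value_type": "UInt64"}}'
)
exact = literal_value(document["ast"])

# A 64-bit double cannot tell 2^53 and 2^53 + 1 apart, which is the
# precision loss the string encoding avoids.
collides = float(2**53 + 1) == float(2**53)
```

Here `exact` keeps the full UInt64 maximum and `collides` is `True`.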

Non-JSON EXPLAIN AST text output is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Regenerate the curated cases/ (41) and the depth-3 shapes/ (2144) corpus
under the new versioned document shape and slot changes. The harvester now
computes shape signatures on the unwrapped `ast` (not the { version, ast }
wrapper) so depth granularity is measured from the real root. README notes
the version wrapper and JS-safe integer strings.
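
Why the signature is taken on the unwrapped `ast` can be seen with a toy depth-limited signature. The harvester's real algorithm is not shown in this message, so the function below is an assumed, simplified definition (type name plus named sub-node slots, cut off at a depth), and the sample document shape is also an assumption.

```python
def shape_signature(node, depth: int = 3) -> str:
    """Toy depth-limited shape signature: type names plus nested slots."""
    if isinstance(node, list):
        # Lists do not consume a depth level; their items do.
        return "[" + ",".join(shape_signature(item, depth) for item in node) + "]"
    if not isinstance(node, dict):
        return "_"
    name = str(node.get("type", "?"))
    if depth <= 1:
        return name
    slots = [
        f"{key}={shape_signature(value, depth - 1)}"
        for key, value in sorted(node.items())
        if isinstance(value, (dict, list))
    ]
    return f"{name}({','.join(slots)})" if slots else name


def document_signature(document: dict, depth: int = 3) -> str:
    """Measure depth from the real root by unwrapping { version, ast } first."""
    return shape_signature(document["ast"], depth)


sample = {
    "version": 1,
    "ast": {
        "type": "SelectQuery",
        "select": {
            "type": "ExpressionList",
            "children": [{"type": "Literal", "value": "1", "value_type": "UInt64"}],
        },
    },
}
```

In this model, computing the signature on the wrapper spends one of the three levels on the wrapper object, so the `Literal` level is lost; unwrapping first keeps it.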

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
peter-leonov-ch and others added 2 commits June 14, 2026 14:56
The interpreter calls the versioned-document wrapper, not formatASTAsJSON
directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "Downstream consumer" section covering three things: the JSON is
consumed by the TypeScript parser (which vendors tests/ast_json_fixtures as
a reference suite); which format decisions exist to line up with its AST
types; and the workflow to follow when the format changes (bump the
version, regenerate the fixtures, flag the change).
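
A consumer-side check against the pinned version might look like the sketch below. It is hypothetical Python, not code from the TypeScript parser: `PINNED_VERSION`, `FormatVersionMismatch`, and `load_ast` are illustrative names, and the only facts taken from this PR are the `{ "version": 1, "ast": {...} }` wrapper and the rule that the version is bumped on incompatible changes.

```python
import json

PINNED_VERSION = 1  # the version the vendored fixtures were generated with


class FormatVersionMismatch(Exception):
    """Raised when the document version differs from the pinned one."""


def load_ast(text: str) -> dict:
    """Parse a versioned document and return the unwrapped `ast`."""
    document = json.loads(text)
    version = document.get("version")
    if version != PINNED_VERSION:
        # A bump signals a backwards-incompatible change: fixtures need
        # regenerating before the output can be trusted.
        raise FormatVersionMismatch(
            f"expected version {PINNED_VERSION}, got {version!r}"
        )
    return document["ast"]
```

Failing loudly on a mismatch, rather than attempting a best-effort parse, keeps a format change from passing unnoticed through the reference suite.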

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>