99 commits
2fcb7c0
Ignore tmp directory
takaebato May 6, 2026
3ffccd1
Add scoped relation binder
takaebato May 6, 2026
a4cd1c5
Refine relation scope resolution
takaebato May 6, 2026
18f8d91
Create AGENTS.md
takaebato May 6, 2026
15332cf
Refactor table extraction around resolver diagnostics
takaebato May 6, 2026
ba3b621
Rename relation binder result type
takaebato May 6, 2026
faa57f9
rename
takaebato May 16, 2026
ec6bdba
output schema
takaebato May 16, 2026
ee2e1e4
scope aware table extraction
takaebato May 16, 2026
8d7ba05
Add extract_operations API with catalog scaffolding
takaebato May 16, 2026
768d938
rename
takaebato May 16, 2026
844b84c
Replace AGENTS.md with CLAUDE.md
takaebato May 16, 2026
444965f
Mark codecov gates as informational
takaebato May 16, 2026
4d2091c
Populate table_flows via scope-kind gating
takaebato May 16, 2026
f37a2c8
Split table operations into reads / writes / flows
takaebato May 16, 2026
2b6144b
Strip alias from TableReference
takaebato May 16, 2026
6d569c2
Drop unused OperationDiagnosticCode variants
takaebato May 16, 2026
d92f00a
Distinguish rustdoc from inline comment guidance in CLAUDE.md
takaebato May 17, 2026
d9a231f
Add column operation extractor type skeleton
takaebato May 17, 2026
3aab952
Phase 5.2a: qualified column reads + writes via resolver collection
takaebato May 17, 2026
c419ee3
Phase 5.2b: scope-chain resolution for unqualified column reads
takaebato May 17, 2026
1b185f6
Phase 5.3: pull column flow facts via ResolvedQuery
takaebato May 17, 2026
d89270a
Phase 5.5: compose flows through CTEs and derived tables
takaebato May 17, 2026
05d4af9
Phase 5.6a: classify column reads with ReadKind
takaebato May 17, 2026
75d0e5c
Tidy resolver walking-context state
takaebato May 17, 2026
6dde3b0
Phase 5.6b: classify GROUP BY and ORDER BY refs
takaebato May 17, 2026
5a03f20
Phase 5.6c: classify OVER (...) refs as Window
takaebato May 17, 2026
47c012a
Phase 5.6d: Aggregation ColumnFlowKind
takaebato May 17, 2026
e0d3d42
Phase 5.6e: Conditional ReadKind modifier for CASE WHEN conditions
takaebato May 17, 2026
8997e70
Phase 5.8: column-level writes for CTAS / CREATE VIEW / ALTER VIEW
takaebato May 17, 2026
d20308c
Phase 5.7: MERGE column-level flows and writes
takaebato May 17, 2026
d6258c2
Phase 5.10: honor CTE and derived-table column rename clauses
takaebato May 17, 2026
22bdaea
Refactor: collapse repeated flow-emission patterns
takaebato May 17, 2026
034e1e5
Refresh CLAUDE.md to match the current resolver and column extractor
takaebato May 17, 2026
59013ff
Group walking-context state into WalkContext + fix subquery leak
takaebato May 17, 2026
87f3d10
Phase 6 (first): catalog-driven INSERT column pairing
takaebato May 17, 2026
b4c2dda
Rename RelationBinding → Binding, WalkContext → VisitContext
takaebato May 17, 2026
73a0eaa
Split relation_resolver module into responsibility files
takaebato May 17, 2026
37d249f
Drop misleading Relation prefix from resolver types
takaebato May 17, 2026
fd2dbbb
Refresh CLAUDE.md for renamed types and module layout
takaebato May 17, 2026
8e41faa
Run CI on pull requests targeting any branch
takaebato May 17, 2026
e7247b1
Unify diagnostic surface on Diagnostic
takaebato May 17, 2026
dee872f
Extend DiagnosticKind with column / wildcard kinds
takaebato May 17, 2026
e1837ef
Attach Option<Span> to Diagnostic for structured source locations
takaebato May 17, 2026
cc804a4
Refresh crate doc and README for the operation-extraction surface
takaebato May 17, 2026
d4d66c5
Add per-variant rustdoc on public enums
takaebato May 17, 2026
4b5d040
Add runnable examples for the main extraction paths
takaebato May 17, 2026
c438ae9
Rename operation_extractor module to table_operation_extractor
takaebato May 17, 2026
abfd7fc
Link the runnable examples from the README
takaebato May 17, 2026
9ae2875
Group column-extractor tests into nested mods + introduce run_cases
takaebato May 17, 2026
498032a
Replace run_cases runner with simple assert_* helpers
takaebato May 17, 2026
41c2aa0
Group table-extractor tests into nested mods by statement family
takaebato May 17, 2026
f2ddcb2
Nest table-extractor and crud-extractor tests + fix update_statement …
takaebato May 17, 2026
7e13182
Consolidate column-extractor test mods to one-per-topic granularity
takaebato May 17, 2026
76e9fdb
Expand integration tests to cover the operation extraction surface
takaebato May 17, 2026
f0fc4fb
Convert table_op tests to whole-value assert_ops + diag helper
takaebato May 17, 2026
86e12c4
Convert remaining partial-assertion tests to whole-value assert_ops
takaebato May 17, 2026
1572ce2
Compress crud-extractor tests with TableReference builders
takaebato May 17, 2026
c1cb05e
Begin column_op whole-value migration: merge + ctas_view mods
takaebato May 17, 2026
9e581af
column_op whole-value migration: delete + composition mods
takaebato May 17, 2026
df37d72
column_op whole-value migration: cte_derived_rename + flows mods
takaebato May 17, 2026
95be6bd
column_op whole-value migration: reads + writes + read_kinds mods
takaebato May 17, 2026
d9626b1
column_op whole-value migration: diagnostics + catalog_strict + cleanup
takaebato May 17, 2026
987ef2f
Add set operations coverage (UNION/INTERSECT/EXCEPT)
takaebato May 19, 2026
6b1b859
Add LATERAL / correlation and ON CONFLICT coverage
takaebato May 19, 2026
7777175
Add JOIN USING / NATURAL JOIN coverage
takaebato May 19, 2026
f2a16b5
Add cross-extractor invariants over a corpus
takaebato May 19, 2026
e29a615
Pin down precise spans on two diagnostic kinds
takaebato May 19, 2026
75e009f
Add rustdoc doctest to extract_column_operations
takaebato May 19, 2026
912ce29
Walk INSERT.on for ON CONFLICT / ON DUPLICATE KEY UPDATE
takaebato May 19, 2026
ade9186
CTAS / CREATE VIEW with UNION body: pair against left-branch names
takaebato May 19, 2026
79f1b7f
EXCLUDED composes through INSERT source via body_projections
takaebato May 19, 2026
e77de31
Walk RETURNING clauses on INSERT / UPDATE / DELETE
takaebato May 19, 2026
89e535a
Cover CUBE / GROUPING SETS / mixed GROUP BY modifiers
takaebato May 19, 2026
5641971
ALTER TABLE: surface column-level writes for column-naming ops
takaebato May 19, 2026
f140e1d
Pin down 3-part qualifier resolution
takaebato May 19, 2026
0aa2e5f
Cover VALUES as derived table / CTE body / scope-permissive row
takaebato May 19, 2026
c0c4ecc
Support WITH-prefixed DML + window-frame coverage
takaebato May 20, 2026
76c137c
Cover scalar subquery / simple CASE / set-op tail / IS NULL
takaebato May 20, 2026
9ec1849
Simplify column lineage model: two flow kinds, plain read/write lists
takaebato May 23, 2026
6a84e89
Collapse table reads/writes to plain TableReference lists
takaebato May 23, 2026
7cf40e9
Rename operation surfaces to lineage vocabulary
takaebato May 24, 2026
6ceb47b
Split diagnostics by granularity; drop pre-1.0 non_exhaustive
takaebato May 24, 2026
1dbf8e8
Document catalog as load-bearing for column lineage
takaebato May 24, 2026
56bb9be
Rename ColumnTarget::Persisted to Relation
takaebato May 24, 2026
aed2083
Align resolver-internal vocabulary to lineage / relation
takaebato May 24, 2026
e991475
Align test lineage-edge helpers off "flow" naming
takaebato May 24, 2026
96f3db0
Rename test names off "flow" to lineage / edge vocabulary
takaebato May 24, 2026
4432a5c
Purge remaining "flow" from comments, loop vars, and examples
takaebato May 24, 2026
cd7f41b
README: finish lineage wording in prose
takaebato May 24, 2026
451a7c0
Drop dead ResolvedQuery.scope_id and stale dead_code allows
takaebato May 24, 2026
4fd14f0
Collapse single-field VisitContext and Column wrappers
takaebato May 24, 2026
8f9aa80
Move ColumnReference into relation.rs next to TableReference
takaebato May 24, 2026
385a3ba
Rename relation module to reference
takaebato May 24, 2026
daf969b
Collapse BindingKey to a single normalized name
takaebato May 24, 2026
4c22a65
Document BindingKey case-folding rationale
takaebato May 24, 2026
5fb1e02
Resolve qualified column refs through aliases to the real table
takaebato May 24, 2026
f5f19d9
Document the Catalog open-world and folding-boundary semantics
takaebato May 24, 2026
c2fc96a
Collapse the CTAS write check into a match guard
takaebato May 24, 2026
1 change: 0 additions & 1 deletion .github/workflows/rust.yaml
@@ -4,7 +4,6 @@ on:
   push:
     branches: [ "master" ]
   pull_request:
-    branches: [ "master" ]

 env:
   CARGO_TERM_COLOR: always
2 changes: 2 additions & 0 deletions .gitignore
@@ -2,3 +2,5 @@
 /Cargo.lock
 .DS_Store
 .idea
+tmp/
+coverage/
169 changes: 169 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,169 @@
# CLAUDE.md

## Project

Rust workspace: `sql-insight` library + `sql-insight-cli`. SQL parsing is
built on `sqlparser-rs`; always work against its AST, never re-parse SQL
by hand.

## Commands

- Format: `cargo fmt`
- Test: `cargo test --all`
- Lint: `cargo clippy --all-targets -- -D warnings` (zero-warning policy)

## Architecture

- The `resolver` module walks a `Statement` once and produces a
  `Resolution`:
  - a scope arena of `Binding`s (`Table` / `Cte` / `DerivedTable` /
    `TableFunction`),
  - a buffer of `RawColumnRef`s captured at walk time with
    resolved-table + synthetic-vs-real + clause-kind metadata,
  - a buffer of `FlowEdge`s emitted directly during the walk.

  Two post-passes on `into_resolution` compose the flow graph
  end-to-end through CTE / derived intermediates and filter reads
  down to references whose walk-time owner was a real `Table`.
  Sub-modules are split by responsibility: `binding` (scope arena),
  `context` (`VisitContext`), `column_ref`, `projection`, `flow`,
  `composition`, `rename`; walker files (`expr` / `query` /
  `statement` / `table`) live as siblings and add `visit_*` methods
  via `impl Resolver` blocks.
- Pull-style design: `resolve_query` returns a `ResolvedQuery`
carrying the body's `projections: Vec<ProjectionGroup>`. Callers
(visit_insert / CTAS / scalar subqueries / etc.) decide what to do
with them — pair with target columns, emit `QueryOutput` edges,
bubble up through `SetExpr::Query`, etc.
- The resolver takes an optional `&dyn Catalog`. With a catalog,
Table bindings come back with `Known` schemas and unqualified
column resolution becomes strict (typos surface as `table: None`).
Without a catalog the resolver is best-effort.
- Extractors consume the resolver's output:
  - `table_extractor`: flat list of `TableReference`s (legacy API).
  - `crud_table_extractor`: CRUD-bucketed tables (legacy API).
  - `table_operation_extractor`: `extract_table_operations` returns
    `TableOperation { statement_kind, reads, writes,
    lineage, diagnostics }` per parsed statement.
  - `column_operation_extractor`: `extract_column_operations`
    returns `ColumnOperation { statement_kind, reads,
    writes, lineage, diagnostics }` at column granularity. `reads` /
    `writes` are plain occurrence lists; `lineage` edges carry
    `kind: ColumnLineageKind`.
- Per-statement output convention: extractors return
`Vec<Result<X, Error>>` so one bad statement does not kill the
rest.
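
The per-statement convention above can be sketched roughly as follows. This is a toy illustration, not the crate's code: `TableOperation` is cut down to a single field, `ExtractError` stands in for the crate's `Error`, and `extract_all` with its whitespace-splitting "parser" is hypothetical.

```rust
// Hypothetical stand-ins: the real `TableOperation` and error types carry more.
#[derive(Debug, PartialEq)]
struct TableOperation {
    reads: Vec<String>,
}

#[derive(Debug, PartialEq)]
struct ExtractError(String);

// One `Result` per input statement, so a failure stays local to the
// statement that caused it and later statements are still processed.
fn extract_all(statements: &[&str]) -> Vec<Result<TableOperation, ExtractError>> {
    statements
        .iter()
        .map(|stmt| {
            let words: Vec<&str> = stmt.split_whitespace().collect();
            // Toy recognizer: only `SELECT <col> FROM <table>` is understood.
            match words.as_slice() {
                ["SELECT", _, "FROM", table] => Ok(TableOperation {
                    reads: vec![table.to_string()],
                }),
                _ => Err(ExtractError(format!("cannot handle: {stmt}"))),
            }
        })
        .collect()
}

fn main() {
    let results = extract_all(&["SELECT a FROM t1", "GARBAGE", "SELECT b FROM t2"]);
    assert_eq!(results.len(), 3);
    assert!(results[0].is_ok());
    assert!(results[1].is_err());
    // The statement after the failing one is still extracted.
    assert_eq!(
        results[2],
        Ok(TableOperation { reads: vec!["t2".to_string()] })
    );
}
```

Callers that want fail-fast behavior can still get it with `.into_iter().collect::<Result<Vec<_>, _>>()`; the per-statement shape leaves that choice to the consumer.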

## Vocabulary

- `TableOperation` carries three parallel surfaces:
  - `reads: Vec<TableReference>`: every table the statement reads
    from (occurrence-based; a table read more than once appears more
    than once).
  - `writes: Vec<TableReference>`: every table the statement writes
    to.
  - `lineage: Vec<TableLineageEdge>`: directed `source → target`
    edges, only for statements that physically move data (INSERT /
    UPDATE / MERGE / CTAS / CREATE VIEW). A table that plays both
    roles (e.g. `DELETE t1 FROM t1`) appears in both `reads` and
    `writes`.
- `ColumnOperation` mirrors the same surfaces at column
  granularity:
  - `reads: Vec<ColumnReference>`: every column reference, as a
    plain occurrence list with no clause tag. References whose
    walk-time owning binding was synthetic (CTE / derived / table
    function) are dropped; only real-storage references and
    unresolved names surface.
  - `writes: Vec<ColumnReference>`: INSERT column lists, UPDATE SET
    targets, CTAS / CREATE VIEW / ALTER VIEW columns, MERGE
    WHEN-clause writes.
  - `lineage: Vec<ColumnLineageEdge>`: `source → target` edges with
    `kind: ColumnLineageKind` (`Passthrough` / `Transformation`).
    Sources flowing through CTE / derived intermediates are composed
    end-to-end; composition yields `Transformation` if any step
    transforms. Targets: `QueryOutput { name, position }` for
    transient SELECT outputs, `Relation(ColumnReference)` for
    writes into a named relation (table or view).
- The value-vs-filter distinction is structural, not a tag: a value
contributor is a `lineage` source; a filter-only column is in
`reads` but not `lineage`.
- `StatementKind` — the verb of the statement; combined with the
`reads` / `writes` split recovers every granularity distinction.
- Internal-only `TableRole` (Read / Write) lives inside the resolver
for binding metadata. It is not exposed via the public API —
surface it through `reads` / `writes` instead.
- `TableReference` is identity-only (`catalog` / `schema` / `name`).
Alias is a use-site decoration, not part of a table's identity,
so `HashSet<TableReference>` dedup and cross-statement comparison
behave intuitively. Resolver bindings carry alias as a separate
field; the public API does not currently surface it.
- `ColumnReference` is identity-only too (`table: Option<TableReference>`,
`name: Ident`). `table` is `Option` for cases where resolution
fails (ambiguous, no candidate); the column name still surfaces.
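
A minimal sketch of why keeping the alias out of `TableReference` makes dedup behave, assuming plain `String` fields in place of the crate's real field types. `UseSite` and `distinct_tables` are hypothetical names introduced only for this illustration.

```rust
use std::collections::HashSet;

// Assumed simplification of the identity-only shape described above.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct TableReference {
    catalog: Option<String>,
    schema: Option<String>,
    name: String,
}

// Hypothetical: a use site pairs the identity with its alias, which is
// kept outside the identity type.
struct UseSite {
    table: TableReference,
    alias: Option<String>,
}

// Collapses use sites to the set of distinct tables they refer to.
fn distinct_tables(sites: &[UseSite]) -> HashSet<TableReference> {
    sites.iter().map(|site| site.table.clone()).collect()
}

fn main() {
    let orders = TableReference {
        catalog: None,
        schema: Some("public".to_string()),
        name: "orders".to_string(),
    };
    // Models `FROM public.orders AS a JOIN public.orders AS b`.
    let sites = vec![
        UseSite { table: orders.clone(), alias: Some("a".to_string()) },
        UseSite { table: orders.clone(), alias: Some("b".to_string()) },
    ];
    assert_ne!(sites[0].alias, sites[1].alias);

    // Two occurrences with different aliases are still one table.
    let distinct = distinct_tables(&sites);
    assert_eq!(distinct.len(), 1);
    assert!(distinct.contains(&orders));
}
```

Had the alias been a field of `TableReference`, the derived `Hash` / `Eq` would treat the two use sites as different tables.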

## Design conventions

- Pull design: `resolve_query` collects facts (projections), callers
decide edge construction. Avoid pushing state from caller into
resolver via flag bags — instead expose helpers like
`with_filter_clause` / `with_branch_scope` for scoped, lexical
context.
- Walking-context state lives in `VisitContext` (just `scope_kind`)
— "in effect for the current visit", not "queued". Save / restore
goes through `with_context` (and the focused `with_branch_scope` /
`with_filter_clause` helpers) so the prior context is restored on
scope exit. `scope_kind` is preserved across a subquery boundary so
predicate-ness flows transitively. For owning per-query buffers
like `current_projections: Vec<…>`, `mem::replace` is used
instead.
- Wildcards (`SELECT *`, `t.*`) are not expanded at the parser
level — even with a catalog. The rigor cost (USING / NATURAL JOIN
merge, EXCLUDE / REPLACE / RENAME clauses, CTE column rename,
multi-segment qualifiers) is too high for a SQL-text-only library
to handle correctly. Wildcards contribute nothing to `reads` /
`lineage`; consumers needing per-column source → target lineage
either supply resolved query plans or do their own expansion.
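
The save / restore convention for walking context can be sketched in miniature. Everything below is an assumption made for illustration: the real `Resolver`, `VisitContext`, and scope-kind types are richer, the `ScopeKind` variants are invented, and the actual signatures of `with_context` / `with_filter_clause` may differ.

```rust
// Hypothetical scope kinds; the crate's real enum is not shown in this document.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ScopeKind {
    Projection,
    Filter,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct VisitContext {
    scope_kind: ScopeKind,
}

struct Resolver {
    context: VisitContext,
    current_projections: Vec<String>,
}

impl Resolver {
    // Runs `f` with `context` in effect, then puts the previous context
    // back on normal return.
    fn with_context<T>(&mut self, context: VisitContext, f: impl FnOnce(&mut Self) -> T) -> T {
        let saved = std::mem::replace(&mut self.context, context);
        let out = f(self);
        self.context = saved;
        out
    }

    // Focused helper for the "visit this as a filter clause" case.
    fn with_filter_clause<T>(&mut self, f: impl FnOnce(&mut Self) -> T) -> T {
        self.with_context(VisitContext { scope_kind: ScopeKind::Filter }, f)
    }

    // Owning buffers are swapped rather than copied: the nested query
    // fills a fresh buffer, and the outer buffer is swapped back in.
    fn resolve_nested(&mut self, columns: &[&str]) -> Vec<String> {
        let outer = std::mem::take(&mut self.current_projections);
        self.current_projections
            .extend(columns.iter().map(|c| c.to_string()));
        std::mem::replace(&mut self.current_projections, outer)
    }
}

fn main() {
    let mut resolver = Resolver {
        context: VisitContext { scope_kind: ScopeKind::Projection },
        current_projections: vec!["outer_col".to_string()],
    };

    // Inside the closure the filter context is in effect...
    let seen = resolver.with_filter_clause(|r| r.context.scope_kind);
    assert_eq!(seen, ScopeKind::Filter);
    // ...and the prior context is back afterwards.
    assert_eq!(resolver.context.scope_kind, ScopeKind::Projection);

    let inner = resolver.resolve_nested(&["a", "b"]);
    assert_eq!(inner, vec!["a".to_string(), "b".to_string()]);
    assert_eq!(resolver.current_projections, vec!["outer_col".to_string()]);
}
```

Because the context is passed lexically through a closure, a walker cannot forget the restore step on the normal return path, which is the property a flag bag on the caller side does not give.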

## Code conventions

- Keep changes small and scoped. Preserve public API compatibility
unless an API change is intentional, and update doc comments when
it changes.
- **Public items deserve rustdoc** (`///` on items, `//!` on
modules / crates). State purpose, contract, edge cases, and
include examples where useful — rustdoc is the published API
surface and shows up in `cargo doc`, docs.rs, and IDE hovers.
Length is fine when it earns it.
- **Inline `//` comments**: keep them concise and well-structured.
Add a short example when it clarifies.
- Prefer private modules; export through explicit re-exports in
`lib.rs`.
- Avoid `bool` or ambiguous `Option` parameters in new public APIs.
Prefer enums, named methods, or small option structs.
- Avoid growing large modules. Split before a file becomes
unscannable.
- Keep `sqlparser-rs` AST `match` arms exhaustive in the resolver
and extractors — wildcard arms silently hide newly added variants.
- Public enums are **exhaustive (no `#[non_exhaustive]`) while pre-1.0**
(`StatementKind` / `ColumnLineageKind` / `ColumnTarget` /
`TableLevelDiagnosticKind` / `ColumnLevelDiagnosticKind`). Adding a
variant is therefore a breaking change on purpose — pre-1.0 that
rides a `0.x` bump and forces consumers to re-acknowledge the new
case rather than silently hitting a wildcard arm. Add
`#[non_exhaustive]` at the 1.0 freeze (removing it later is
non-breaking; adding it is breaking, so the 1.0 boundary is the
place). Keep internal `match`es exhaustive regardless.
- Diagnostics are split by extraction granularity:
`TableLevelDiagnostic` (only `UnsupportedStatement`) vs
`ColumnLevelDiagnostic` (adds `WildcardSuppressed` /
`AmbiguousColumn` / `UnresolvedColumn`). The resolver produces the
column-level superset; table-level surfaces project it down via
`ColumnLevelDiagnostic::to_table_level` (exhaustive match, so a new
column kind forces a table-level decision).
- For unsupported SQL, accumulate diagnostics instead of `?`-bailing
mid-walk. Reserve hard errors for genuinely unrecoverable
conditions.
- Tests: compare whole values (`assert_eq!(ops.reads, vec![...])`)
over field-by-field assertions. Use a layered helper convention
— `extract` → `extract_with(dialect)` → `extract_with_catalog(
dialect, catalog)` — so callsites stay terse and new parameters
fall through cleanly.
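
As a rough illustration of the exhaustive-match rule applied to the diagnostic split, the sketch below projects column-level kinds down to table level. The variant names come from the conventions above; the payload-free enums, placing `to_table_level` on the kind rather than on `ColumnLevelDiagnostic`, and the `Option` return type are assumptions made for this example.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TableLevelDiagnosticKind {
    UnsupportedStatement,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnLevelDiagnosticKind {
    UnsupportedStatement,
    WildcardSuppressed,
    AmbiguousColumn,
    UnresolvedColumn,
}

impl ColumnLevelDiagnosticKind {
    // No `_ =>` arm on purpose: adding a variant to this enum stops the
    // match from compiling until its table-level projection is decided.
    fn to_table_level(self) -> Option<TableLevelDiagnosticKind> {
        match self {
            Self::UnsupportedStatement => Some(TableLevelDiagnosticKind::UnsupportedStatement),
            Self::WildcardSuppressed | Self::AmbiguousColumn | Self::UnresolvedColumn => None,
        }
    }
}

fn main() {
    let column_level = vec![
        ColumnLevelDiagnosticKind::WildcardSuppressed,
        ColumnLevelDiagnosticKind::UnsupportedStatement,
        ColumnLevelDiagnosticKind::UnresolvedColumn,
    ];

    // Column-only kinds drop out when projecting to the table-level surface.
    let table_level: Vec<TableLevelDiagnosticKind> = column_level
        .iter()
        .filter_map(|kind| kind.to_table_level())
        .collect();

    assert_eq!(table_level, vec![TableLevelDiagnosticKind::UnsupportedStatement]);
    assert_eq!(ColumnLevelDiagnosticKind::AmbiguousColumn.to_table_level(), None);
}
```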