This is particularly true for relationships across CTE, subquery, and
temptable workflows (i.e. most real-world sql worklows!).
Collaboration is welcome!
{sqlMiner} extracts table relationship metadata from a folder of SQL scripts. Point it at a directory, and it returns a tidy data.frame describing every JOIN, INSERT, and CTE dependency it finds — one row per relationship. Use the output to audit data lineage, map table dependencies, or spot unexpected cross-schema joins without reading every script by hand.
What the package handles:
LEFT,RIGHT,INNER,FULL OUTER, andCROSSJOINsINSERT INTO … SELECT … FROMpatterns (ETL lineage)- Chained CTEs, with a
depthcolumn that tracks nesting level - Inline subqueries (parsed recursively)
- Single-line (
--) and block (/* */) comment stripping before extraction
Install the development version from GitHub:
# install.packages("pak")
pak::pak("nottmhospitals/sqlMiner")Point sq_read_sql_files() at a folder of .sql files, then pass the
result to sq_sql_mine():
library(sqlMiner)
scripts <- sq_read_sql_files("path/to/your/sql/folder")
relationships <- sq_sql_mine(scripts)
head(relationships)#> script_name source_table target_table relationship_type
#> 1 fixture_basic_join.sql episodes patients JOIN
#> 2 fixture_comments.sql dbo.episodes dbo.patients JOIN
#> 3 fixture_ctes_chained.sql dbo.episodes dbo.staff JOIN
#> 4 fixture_ctes_chained.sql dbo.episodes cte_enriched_episodes JOIN
#> 5 fixture_ctes_chained.sql dbo.episodes dbo.diagnoses JOIN
#> 6 fixture_ctes_simple.sql dbo.episodes dbo.wards JOIN
#> join_type depth
#> 1 LEFT JOIN 1
#> 2 LEFT JOIN 1
#> 3 LEFT JOIN 1
#> 4 LEFT JOIN 3
#> 5 LEFT JOIN 1
#> 6 LEFT JOIN 1
Each row is one relationship. The key columns are:
| Column | Description |
|---|---|
script_name |
The .sql file the relationship came from |
source_table |
The primary read table (the FROM clause) |
target_table |
The table being joined to, or written to for INSERTs |
relationship_type |
"JOIN" or "INSERT" |
join_type |
"LEFT JOIN", "INNER JOIN", etc. (NA for INSERTs) |
depth |
CTE nesting level — base tables = 1, each CTE layer adds 1 |
source_is_cte / target_is_cte |
Whether the table is a CTE defined in the same script |
left_keys / right_keys |
List columns of joining key expressions — source side and target side respectively (NA for INSERTs or unresolvable conditions) |
library(sqlMiner)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.5.2
fixture_dir <- system.file("extdata", package = "sqlMiner")
scripts <- sq_read_sql_files(fixture_dir)
relationships <- suppressMessages(sq_sql_mine(scripts))Quickly see which scripts are the most complex:
relationships |>
count(script_name, sort = TRUE)
#> script_name n
#> 1 fixture_join_types.sql 5
#> 2 fixture_insert_into.sql 4
#> 3 fixture_ctes_chained.sql 3
#> 4 fixture_select_into.sql 3
#> 5 fixture_subquery_nested.sql 3
#> 6 fixture_ctes_simple.sql 2
#> 7 fixture_multistatement.sql 2
#> 8 fixture_subquery_simple.sql 2
#> 9 fixture_union_all.sql 2
#> 10 fixture_basic_join.sql 1
#> 11 fixture_comments.sql 1
#> 12 fixture_group_by.sql 1
#> 13 fixture_identifier_quoting.sql 1
#> 14 fixture_insert_cte_source.sql 1
#> 15 fixture_schema_prefix.sql 1
#> 16 fixture_table_hints.sql 1Which tables appear most often as JOIN targets across all scripts?
relationships |>
filter(relationship_type == "JOIN") |>
count(target_table, sort = TRUE) |>
head(10)
#> target_table n
#> 1 dbo.patients 10
#> 2 dbo.staff 5
#> 3 dbo.wards 3
#> 4 dbo.diagnoses 2
#> 5 patients 2
#> 6 cte_enriched_episodes 1
#> 7 dbo.departments 1
#> 8 diagnoses 1
#> 9 icd10_codes 1
#> 10 recent_admissions 1Find every script that writes to a table (ETL population):
relationships |>
filter(relationship_type == "INSERT") |>
select(script_name, source_table, target_table, relationship_type)
#> script_name source_table target_table
#> 1 fixture_insert_cte_source.sql cte_ready_episodes dbo.fact_episodes
#> 2 fixture_insert_into.sql dbo.episodes dbo.fact_episodes
#> 3 fixture_select_into.sql dbo.episodes dbo.staging_episodes
#> relationship_type
#> 1 INSERT
#> 2 INSERT
#> 3 INSERTIsolate FULL OUTER JOINs, which often signal data-quality checks or
reconciliation logic:
relationships |>
filter(join_type == "FULL OUTER JOIN") |>
select(script_name, source_table, target_table, on_condition)
#> script_name source_table target_table on_condition
#> 1 fixture_join_types.sql episodes diagnoses e.episode_id = d.episode_idleft_keys and right_keys are list columns — each element is a
character vector of the column expressions that link source_table to
target_table. left_keys is always the source (FROM) side;
right_keys is always the target (JOIN) side, regardless of how the
condition was written in the SQL.
Terminology
| Term | Definition |
|---|---|
| join predicate | A condition that references columns from both sides of the join, e.g. e.patient_id = p.patient_id |
| filter predicate | A condition that references only one side, e.g. p.active = 1 — dropped silently |
| equi-predicate | A join predicate using = — these populate left_keys / right_keys |
| non-equi predicate | A join predicate using >, <, BETWEEN, etc. — excluded from key columns, visible in on_condition |
Rules applied during parsing
- Both
AND- andOR-connected equi-predicates are included (all represent potential linking columns). - Non-equi predicates and filter predicates (constants, literals) are silently dropped.
- When a function call wraps a column — e.g.
COALESCE(e.id, 0)— the raw expression is stored as-is. - When the full condition cannot be resolved (no equi-predicates, or no
aliases to identify sides),
left_keysandright_keysareNA. The rawon_conditionis always available as a fallback. INSERTrows always haveNAfor both key columns.
Inspecting keys
relationships |>
filter(relationship_type == "JOIN") |>
select(source_table, target_table, left_keys, right_keys) |>
head(5)
#> source_table target_table left_keys right_keys
#> 1 episodes patients patient_id patient_id
#> 2 dbo.episodes dbo.patients patient_id patient_id
#> 3 dbo.episodes dbo.staff a.consul.... staff_id
#> 4 dbo.episodes cte_enriched_episodes p.patient_id patient_id
#> 5 dbo.episodes dbo.diagnoses e.episode_id episode_idCompound join keys — use tidyr::unnest() to get one row per key
pair:
library(tidyr)
relationships |>
filter(relationship_type == "JOIN", !is.na(left_keys)) |>
select(script_name, source_table, target_table, left_keys, right_keys) |>
unnest(cols = c(left_keys, right_keys)) |>
head(10)
#> # A tibble: 10 × 5
#> script_name source_table target_table left_keys right_keys
#> <chr> <chr> <chr> <chr> <chr>
#> 1 fixture_basic_join.sql episodes patients patient_… patient_id
#> 2 fixture_comments.sql dbo.episodes dbo.patients patient_… patient_id
#> 3 fixture_ctes_chained.sql dbo.episodes dbo.staff a.consul… staff_id
#> 4 fixture_ctes_chained.sql dbo.episodes cte_enriche… p.patien… patient_id
#> 5 fixture_ctes_chained.sql dbo.episodes dbo.diagnos… e.episod… episode_id
#> 6 fixture_ctes_simple.sql dbo.episodes dbo.wards ward_id ward_id
#> 7 fixture_ctes_simple.sql dbo.episodes recent_admi… p.patien… patient_id
#> 8 fixture_group_by.sql dbo.episodes dbo.departm… e.depart… departmen…
#> 9 fixture_identifier_quoting.sql dbo.episodes dbo.patients patient_… patient_id
#> 10 fixture_insert_into.sql dbo.episodes dbo.patients patient_… patient_idScripts with depth > 1 reference CTEs defined in the same file. Depth
tracks how many CTE layers deep a reference goes:
relationships |>
filter(depth > 1) |>
select(script_name, source_table, target_table, target_is_cte, depth) |>
arrange(desc(depth))
#> script_name source_table target_table target_is_cte
#> 1 fixture_ctes_chained.sql dbo.episodes cte_enriched_episodes TRUE
#> 2 fixture_subquery_nested.sql dbo.episodes dbo.patients FALSE
#> 3 fixture_ctes_simple.sql dbo.episodes recent_admissions TRUE
#> 4 fixture_subquery_nested.sql __subquery__ dbo.staff FALSE
#> 5 fixture_subquery_simple.sql dbo.episodes dbo.patients FALSE
#> depth
#> 1 3
#> 2 3
#> 3 2
#> 4 2
#> 5 2Spot joins that cross schema boundaries (a common governance concern):
relationships |>
filter(relationship_type == "JOIN") |>
mutate(
source_schema = sub("\\..*", "", source_table),
target_schema = sub("\\..*", "", target_table)
) |>
filter(
source_schema != target_schema,
grepl("\\.", source_table),
grepl("\\.", target_table)
) |>
select(script_name, source_table, target_table, join_type)
#> script_name source_table target_table join_type
#> 1 fixture_schema_prefix.sql dbo.episodes staging.patients LEFT JOINsq_script_dependencies() summarises the external tables a single
script reads from and writes to. CTE names are excluded — they are
internal to the script.
sq_script_dependencies(relationships, "fixture_insert_into.sql")
#> table_name role
#> 1 dbo.episodes reads
#> 2 dbo.patients reads
#> 3 dbo.staff reads
#> 4 dbo.diagnoses reads
#> 5 dbo.fact_episodes writessq_table_usage() finds every script that references a given table and
describes its role in each relationship.
sq_table_usage(relationships, "dbo.episodes")
#> script_name role
#> 1 fixture_comments.sql source
#> 2 fixture_ctes_chained.sql source
#> 3 fixture_ctes_simple.sql source
#> 4 fixture_group_by.sql source
#> 5 fixture_identifier_quoting.sql source
#> 6 fixture_insert_into.sql source
#> 7 fixture_multistatement.sql source
#> 8 fixture_schema_prefix.sql source
#> 9 fixture_select_into.sql source
#> 10 fixture_subquery_nested.sql source
#> 11 fixture_subquery_simple.sql source
#> 12 fixture_table_hints.sql source
#> 13 fixture_union_all.sql source