sqlMiner

In development - results are not yet reliable

This is particularly true for relationships across CTE, subquery, and temptable workflows (i.e. most real-world sql worklows!).
Collaboration is welcome!

{sqlMiner} extracts table relationship metadata from a folder of SQL scripts. Point it at a directory, and it returns a tidy data.frame describing every JOIN, INSERT, and CTE dependency it finds — one row per relationship. Use the output to audit data lineage, map table dependencies, or spot unexpected cross-schema joins without reading every script by hand.

What the package handles:

LEFT, RIGHT, INNER, FULL OUTER, and CROSS JOINs
INSERT INTO … SELECT … FROM patterns (ETL lineage)
Chained CTEs, with a depth column that tracks nesting level
Inline subqueries (parsed recursively)
Single-line (--) and block (/* */) comment stripping before extraction

Installation

Install the development version from GitHub:

# install.packages("pak")
pak::pak("nottmhospitals/sqlMiner")

Hello world

Point sq_read_sql_files() at a folder of .sql files, then pass the result to sq_sql_mine():

library(sqlMiner)

scripts <- sq_read_sql_files("path/to/your/sql/folder")
relationships <- sq_sql_mine(scripts)

head(relationships)

#>                script_name source_table          target_table relationship_type
#> 1   fixture_basic_join.sql     episodes              patients              JOIN
#> 2     fixture_comments.sql dbo.episodes          dbo.patients              JOIN
#> 3 fixture_ctes_chained.sql dbo.episodes             dbo.staff              JOIN
#> 4 fixture_ctes_chained.sql dbo.episodes cte_enriched_episodes              JOIN
#> 5 fixture_ctes_chained.sql dbo.episodes         dbo.diagnoses              JOIN
#> 6  fixture_ctes_simple.sql dbo.episodes             dbo.wards              JOIN
#>   join_type depth
#> 1 LEFT JOIN     1
#> 2 LEFT JOIN     1
#> 3 LEFT JOIN     1
#> 4 LEFT JOIN     3
#> 5 LEFT JOIN     1
#> 6 LEFT JOIN     1

Each row is one relationship. The key columns are:

Column	Description
`script_name`	The `.sql` file the relationship came from
`source_table`	The primary read table (the `FROM` clause)
`target_table`	The table being joined to, or written to for INSERTs
`relationship_type`	`"JOIN"` or `"INSERT"`
`join_type`	`"LEFT JOIN"`, `"INNER JOIN"`, etc. (`NA` for INSERTs)
`depth`	CTE nesting level — base tables = 1, each CTE layer adds 1
`source_is_cte` / `target_is_cte`	Whether the table is a CTE defined in the same script
`left_keys` / `right_keys`	List columns of joining key expressions — source side and target side respectively (`NA` for INSERTs or unresolvable conditions)

Extended examples

library(sqlMiner)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.5.2

fixture_dir <- system.file("extdata", package = "sqlMiner")
scripts <- sq_read_sql_files(fixture_dir)
relationships <- suppressMessages(sq_sql_mine(scripts))

Count relationships per script

Quickly see which scripts are the most complex:

relationships |>
  count(script_name, sort = TRUE)
#>                       script_name n
#> 1          fixture_join_types.sql 5
#> 2         fixture_insert_into.sql 4
#> 3        fixture_ctes_chained.sql 3
#> 4         fixture_select_into.sql 3
#> 5     fixture_subquery_nested.sql 3
#> 6         fixture_ctes_simple.sql 2
#> 7      fixture_multistatement.sql 2
#> 8     fixture_subquery_simple.sql 2
#> 9           fixture_union_all.sql 2
#> 10         fixture_basic_join.sql 1
#> 11           fixture_comments.sql 1
#> 12           fixture_group_by.sql 1
#> 13 fixture_identifier_quoting.sql 1
#> 14  fixture_insert_cte_source.sql 1
#> 15      fixture_schema_prefix.sql 1
#> 16        fixture_table_hints.sql 1

Find the most frequently joined tables

Which tables appear most often as JOIN targets across all scripts?

relationships |>
  filter(relationship_type == "JOIN") |>
  count(target_table, sort = TRUE) |>
  head(10)
#>             target_table  n
#> 1           dbo.patients 10
#> 2              dbo.staff  5
#> 3              dbo.wards  3
#> 4          dbo.diagnoses  2
#> 5               patients  2
#> 6  cte_enriched_episodes  1
#> 7        dbo.departments  1
#> 8              diagnoses  1
#> 9            icd10_codes  1
#> 10     recent_admissions  1

Audit INSERT lineage

Find every script that writes to a table (ETL population):

relationships |>
  filter(relationship_type == "INSERT") |>
  select(script_name, source_table, target_table, relationship_type)
#>                     script_name       source_table         target_table
#> 1 fixture_insert_cte_source.sql cte_ready_episodes    dbo.fact_episodes
#> 2       fixture_insert_into.sql       dbo.episodes    dbo.fact_episodes
#> 3       fixture_select_into.sql       dbo.episodes dbo.staging_episodes
#>   relationship_type
#> 1            INSERT
#> 2            INSERT
#> 3            INSERT

Filter by join type

Isolate FULL OUTER JOINs, which often signal data-quality checks or reconciliation logic:

relationships |>
  filter(join_type == "FULL OUTER JOIN") |>
  select(script_name, source_table, target_table, on_condition)
#>              script_name source_table target_table                on_condition
#> 1 fixture_join_types.sql     episodes    diagnoses e.episode_id = d.episode_id

Understanding join keys

left_keys and right_keys are list columns — each element is a character vector of the column expressions that link source_table to target_table. left_keys is always the source (FROM) side; right_keys is always the target (JOIN) side, regardless of how the condition was written in the SQL.

Terminology

Term	Definition
join predicate	A condition that references columns from both sides of the join, e.g. `e.patient_id = p.patient_id`
filter predicate	A condition that references only one side, e.g. `p.active = 1` — dropped silently
equi-predicate	A join predicate using `=` — these populate `left_keys` / `right_keys`
non-equi predicate	A join predicate using `>`, `<`, `BETWEEN`, etc. — excluded from key columns, visible in `on_condition`

Rules applied during parsing

Both AND- and OR-connected equi-predicates are included (all represent potential linking columns).
Non-equi predicates and filter predicates (constants, literals) are silently dropped.
When a function call wraps a column — e.g. COALESCE(e.id, 0) — the raw expression is stored as-is.
When the full condition cannot be resolved (no equi-predicates, or no aliases to identify sides), left_keys and right_keys are NA. The raw on_condition is always available as a fallback.
INSERT rows always have NA for both key columns.

Inspecting keys

relationships |>
  filter(relationship_type == "JOIN") |>
  select(source_table, target_table, left_keys, right_keys) |>
  head(5)
#>   source_table          target_table    left_keys right_keys
#> 1     episodes              patients   patient_id patient_id
#> 2 dbo.episodes          dbo.patients   patient_id patient_id
#> 3 dbo.episodes             dbo.staff a.consul....   staff_id
#> 4 dbo.episodes cte_enriched_episodes p.patient_id patient_id
#> 5 dbo.episodes         dbo.diagnoses e.episode_id episode_id

Compound join keys — use tidyr::unnest() to get one row per key pair:

library(tidyr)

relationships |>
  filter(relationship_type == "JOIN", !is.na(left_keys)) |>
  select(script_name, source_table, target_table, left_keys, right_keys) |>
  unnest(cols = c(left_keys, right_keys)) |>
  head(10)
#> # A tibble: 10 × 5
#>    script_name                    source_table target_table left_keys right_keys
#>    <chr>                          <chr>        <chr>        <chr>     <chr>     
#>  1 fixture_basic_join.sql         episodes     patients     patient_… patient_id
#>  2 fixture_comments.sql           dbo.episodes dbo.patients patient_… patient_id
#>  3 fixture_ctes_chained.sql       dbo.episodes dbo.staff    a.consul… staff_id  
#>  4 fixture_ctes_chained.sql       dbo.episodes cte_enriche… p.patien… patient_id
#>  5 fixture_ctes_chained.sql       dbo.episodes dbo.diagnos… e.episod… episode_id
#>  6 fixture_ctes_simple.sql        dbo.episodes dbo.wards    ward_id   ward_id   
#>  7 fixture_ctes_simple.sql        dbo.episodes recent_admi… p.patien… patient_id
#>  8 fixture_group_by.sql           dbo.episodes dbo.departm… e.depart… departmen…
#>  9 fixture_identifier_quoting.sql dbo.episodes dbo.patients patient_… patient_id
#> 10 fixture_insert_into.sql        dbo.episodes dbo.patients patient_… patient_id

Inspect CTE depth

Scripts with depth > 1 reference CTEs defined in the same file. Depth tracks how many CTE layers deep a reference goes:

relationships |>
  filter(depth > 1) |>
  select(script_name, source_table, target_table, target_is_cte, depth) |>
  arrange(desc(depth))
#>                   script_name source_table          target_table target_is_cte
#> 1    fixture_ctes_chained.sql dbo.episodes cte_enriched_episodes          TRUE
#> 2 fixture_subquery_nested.sql dbo.episodes          dbo.patients         FALSE
#> 3     fixture_ctes_simple.sql dbo.episodes     recent_admissions          TRUE
#> 4 fixture_subquery_nested.sql __subquery__             dbo.staff         FALSE
#> 5 fixture_subquery_simple.sql dbo.episodes          dbo.patients         FALSE
#>   depth
#> 1     3
#> 2     3
#> 3     2
#> 4     2
#> 5     2

Cross-schema joins

Spot joins that cross schema boundaries (a common governance concern):

relationships |>
  filter(relationship_type == "JOIN") |>
  mutate(
    source_schema = sub("\\..*", "", source_table),
    target_schema = sub("\\..*", "", target_table)
  ) |>
  filter(
    source_schema != target_schema,
    grepl("\\.", source_table),
    grepl("\\.", target_table)
  ) |>
  select(script_name, source_table, target_table, join_type)
#>                 script_name source_table     target_table join_type
#> 1 fixture_schema_prefix.sql dbo.episodes staging.patients LEFT JOIN

What tables does a script depend on?

sq_script_dependencies() summarises the external tables a single script reads from and writes to. CTE names are excluded — they are internal to the script.

sq_script_dependencies(relationships, "fixture_insert_into.sql")
#>          table_name   role
#> 1      dbo.episodes  reads
#> 2      dbo.patients  reads
#> 3         dbo.staff  reads
#> 4     dbo.diagnoses  reads
#> 5 dbo.fact_episodes writes

How is a table used across all scripts?

sq_table_usage() finds every script that references a given table and describes its role in each relationship.

sq_table_usage(relationships, "dbo.episodes")
#>                       script_name   role
#> 1            fixture_comments.sql source
#> 2        fixture_ctes_chained.sql source
#> 3         fixture_ctes_simple.sql source
#> 4            fixture_group_by.sql source
#> 5  fixture_identifier_quoting.sql source
#> 6         fixture_insert_into.sql source
#> 7      fixture_multistatement.sql source
#> 8       fixture_schema_prefix.sql source
#> 9         fixture_select_into.sql source
#> 10    fixture_subquery_nested.sql source
#> 11    fixture_subquery_simple.sql source
#> 12        fixture_table_hints.sql source
#> 13          fixture_union_all.sql source

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
.github		.github
R		R
inst/extdata		inst/extdata
man		man
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
_pkgdown.yml		_pkgdown.yml
sqlMiner.Rproj		sqlMiner.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sqlMiner

In development - results are not yet reliable

Installation

Hello world

Extended examples

Count relationships per script

Find the most frequently joined tables

Audit INSERT lineage

Filter by join type

Understanding join keys

Inspect CTE depth

Cross-schema joins

What tables does a script depend on?

How is a table used across all scripts?

About

Licenses found

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

sqlMiner

In development - results are not yet reliable

Installation

Hello world

Extended examples

Count relationships per script

Find the most frequently joined tables

Audit INSERT lineage

Filter by join type

Understanding join keys

Inspect CTE depth

Cross-schema joins

What tables does a script depend on?

How is a table used across all scripts?

About

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages