Skip to content

nottmhospitals/sqlMiner

Repository files navigation

sqlMiner

In development - results are not yet reliable

This is particularly true for relationships across CTE, subquery, and temptable workflows (i.e. most real-world sql worklows!).
Collaboration is welcome!

{sqlMiner} extracts table relationship metadata from a folder of SQL scripts. Point it at a directory, and it returns a tidy data.frame describing every JOIN, INSERT, and CTE dependency it finds — one row per relationship. Use the output to audit data lineage, map table dependencies, or spot unexpected cross-schema joins without reading every script by hand.

What the package handles:

  • LEFT, RIGHT, INNER, FULL OUTER, and CROSS JOINs
  • INSERT INTO … SELECT … FROM patterns (ETL lineage)
  • Chained CTEs, with a depth column that tracks nesting level
  • Inline subqueries (parsed recursively)
  • Single-line (--) and block (/* */) comment stripping before extraction

Installation

Install the development version from GitHub:

# install.packages("pak")
pak::pak("nottmhospitals/sqlMiner")

Hello world

Point sq_read_sql_files() at a folder of .sql files, then pass the result to sq_sql_mine():

library(sqlMiner)

scripts <- sq_read_sql_files("path/to/your/sql/folder")
relationships <- sq_sql_mine(scripts)

head(relationships)
#>                script_name source_table          target_table relationship_type
#> 1   fixture_basic_join.sql     episodes              patients              JOIN
#> 2     fixture_comments.sql dbo.episodes          dbo.patients              JOIN
#> 3 fixture_ctes_chained.sql dbo.episodes             dbo.staff              JOIN
#> 4 fixture_ctes_chained.sql dbo.episodes cte_enriched_episodes              JOIN
#> 5 fixture_ctes_chained.sql dbo.episodes         dbo.diagnoses              JOIN
#> 6  fixture_ctes_simple.sql dbo.episodes             dbo.wards              JOIN
#>   join_type depth
#> 1 LEFT JOIN     1
#> 2 LEFT JOIN     1
#> 3 LEFT JOIN     1
#> 4 LEFT JOIN     3
#> 5 LEFT JOIN     1
#> 6 LEFT JOIN     1

Each row is one relationship. The key columns are:

Column Description
script_name The .sql file the relationship came from
source_table The primary read table (the FROM clause)
target_table The table being joined to, or written to for INSERTs
relationship_type "JOIN" or "INSERT"
join_type "LEFT JOIN", "INNER JOIN", etc. (NA for INSERTs)
depth CTE nesting level — base tables = 1, each CTE layer adds 1
source_is_cte / target_is_cte Whether the table is a CTE defined in the same script
left_keys / right_keys List columns of joining key expressions — source side and target side respectively (NA for INSERTs or unresolvable conditions)

Extended examples

library(sqlMiner)
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.5.2

fixture_dir <- system.file("extdata", package = "sqlMiner")
scripts <- sq_read_sql_files(fixture_dir)
relationships <- suppressMessages(sq_sql_mine(scripts))

Count relationships per script

Quickly see which scripts are the most complex:

relationships |>
  count(script_name, sort = TRUE)
#>                       script_name n
#> 1          fixture_join_types.sql 5
#> 2         fixture_insert_into.sql 4
#> 3        fixture_ctes_chained.sql 3
#> 4         fixture_select_into.sql 3
#> 5     fixture_subquery_nested.sql 3
#> 6         fixture_ctes_simple.sql 2
#> 7      fixture_multistatement.sql 2
#> 8     fixture_subquery_simple.sql 2
#> 9           fixture_union_all.sql 2
#> 10         fixture_basic_join.sql 1
#> 11           fixture_comments.sql 1
#> 12           fixture_group_by.sql 1
#> 13 fixture_identifier_quoting.sql 1
#> 14  fixture_insert_cte_source.sql 1
#> 15      fixture_schema_prefix.sql 1
#> 16        fixture_table_hints.sql 1

Find the most frequently joined tables

Which tables appear most often as JOIN targets across all scripts?

relationships |>
  filter(relationship_type == "JOIN") |>
  count(target_table, sort = TRUE) |>
  head(10)
#>             target_table  n
#> 1           dbo.patients 10
#> 2              dbo.staff  5
#> 3              dbo.wards  3
#> 4          dbo.diagnoses  2
#> 5               patients  2
#> 6  cte_enriched_episodes  1
#> 7        dbo.departments  1
#> 8              diagnoses  1
#> 9            icd10_codes  1
#> 10     recent_admissions  1

Audit INSERT lineage

Find every script that writes to a table (ETL population):

relationships |>
  filter(relationship_type == "INSERT") |>
  select(script_name, source_table, target_table, relationship_type)
#>                     script_name       source_table         target_table
#> 1 fixture_insert_cte_source.sql cte_ready_episodes    dbo.fact_episodes
#> 2       fixture_insert_into.sql       dbo.episodes    dbo.fact_episodes
#> 3       fixture_select_into.sql       dbo.episodes dbo.staging_episodes
#>   relationship_type
#> 1            INSERT
#> 2            INSERT
#> 3            INSERT

Filter by join type

Isolate FULL OUTER JOINs, which often signal data-quality checks or reconciliation logic:

relationships |>
  filter(join_type == "FULL OUTER JOIN") |>
  select(script_name, source_table, target_table, on_condition)
#>              script_name source_table target_table                on_condition
#> 1 fixture_join_types.sql     episodes    diagnoses e.episode_id = d.episode_id

Understanding join keys

left_keys and right_keys are list columns — each element is a character vector of the column expressions that link source_table to target_table. left_keys is always the source (FROM) side; right_keys is always the target (JOIN) side, regardless of how the condition was written in the SQL.

Terminology

Term Definition
join predicate A condition that references columns from both sides of the join, e.g. e.patient_id = p.patient_id
filter predicate A condition that references only one side, e.g. p.active = 1 — dropped silently
equi-predicate A join predicate using = — these populate left_keys / right_keys
non-equi predicate A join predicate using >, <, BETWEEN, etc. — excluded from key columns, visible in on_condition

Rules applied during parsing

  • Both AND- and OR-connected equi-predicates are included (all represent potential linking columns).
  • Non-equi predicates and filter predicates (constants, literals) are silently dropped.
  • When a function call wraps a column — e.g. COALESCE(e.id, 0) — the raw expression is stored as-is.
  • When the full condition cannot be resolved (no equi-predicates, or no aliases to identify sides), left_keys and right_keys are NA. The raw on_condition is always available as a fallback.
  • INSERT rows always have NA for both key columns.

Inspecting keys

relationships |>
  filter(relationship_type == "JOIN") |>
  select(source_table, target_table, left_keys, right_keys) |>
  head(5)
#>   source_table          target_table    left_keys right_keys
#> 1     episodes              patients   patient_id patient_id
#> 2 dbo.episodes          dbo.patients   patient_id patient_id
#> 3 dbo.episodes             dbo.staff a.consul....   staff_id
#> 4 dbo.episodes cte_enriched_episodes p.patient_id patient_id
#> 5 dbo.episodes         dbo.diagnoses e.episode_id episode_id

Compound join keys — use tidyr::unnest() to get one row per key pair:

library(tidyr)

relationships |>
  filter(relationship_type == "JOIN", !is.na(left_keys)) |>
  select(script_name, source_table, target_table, left_keys, right_keys) |>
  unnest(cols = c(left_keys, right_keys)) |>
  head(10)
#> # A tibble: 10 × 5
#>    script_name                    source_table target_table left_keys right_keys
#>    <chr>                          <chr>        <chr>        <chr>     <chr>     
#>  1 fixture_basic_join.sql         episodes     patients     patient_… patient_id
#>  2 fixture_comments.sql           dbo.episodes dbo.patients patient_… patient_id
#>  3 fixture_ctes_chained.sql       dbo.episodes dbo.staff    a.consul… staff_id  
#>  4 fixture_ctes_chained.sql       dbo.episodes cte_enriche… p.patien… patient_id
#>  5 fixture_ctes_chained.sql       dbo.episodes dbo.diagnos… e.episod… episode_id
#>  6 fixture_ctes_simple.sql        dbo.episodes dbo.wards    ward_id   ward_id   
#>  7 fixture_ctes_simple.sql        dbo.episodes recent_admi… p.patien… patient_id
#>  8 fixture_group_by.sql           dbo.episodes dbo.departm… e.depart… departmen…
#>  9 fixture_identifier_quoting.sql dbo.episodes dbo.patients patient_… patient_id
#> 10 fixture_insert_into.sql        dbo.episodes dbo.patients patient_… patient_id

Inspect CTE depth

Scripts with depth > 1 reference CTEs defined in the same file. Depth tracks how many CTE layers deep a reference goes:

relationships |>
  filter(depth > 1) |>
  select(script_name, source_table, target_table, target_is_cte, depth) |>
  arrange(desc(depth))
#>                   script_name source_table          target_table target_is_cte
#> 1    fixture_ctes_chained.sql dbo.episodes cte_enriched_episodes          TRUE
#> 2 fixture_subquery_nested.sql dbo.episodes          dbo.patients         FALSE
#> 3     fixture_ctes_simple.sql dbo.episodes     recent_admissions          TRUE
#> 4 fixture_subquery_nested.sql __subquery__             dbo.staff         FALSE
#> 5 fixture_subquery_simple.sql dbo.episodes          dbo.patients         FALSE
#>   depth
#> 1     3
#> 2     3
#> 3     2
#> 4     2
#> 5     2

Cross-schema joins

Spot joins that cross schema boundaries (a common governance concern):

relationships |>
  filter(relationship_type == "JOIN") |>
  mutate(
    source_schema = sub("\\..*", "", source_table),
    target_schema = sub("\\..*", "", target_table)
  ) |>
  filter(
    source_schema != target_schema,
    grepl("\\.", source_table),
    grepl("\\.", target_table)
  ) |>
  select(script_name, source_table, target_table, join_type)
#>                 script_name source_table     target_table join_type
#> 1 fixture_schema_prefix.sql dbo.episodes staging.patients LEFT JOIN

What tables does a script depend on?

sq_script_dependencies() summarises the external tables a single script reads from and writes to. CTE names are excluded — they are internal to the script.

sq_script_dependencies(relationships, "fixture_insert_into.sql")
#>          table_name   role
#> 1      dbo.episodes  reads
#> 2      dbo.patients  reads
#> 3         dbo.staff  reads
#> 4     dbo.diagnoses  reads
#> 5 dbo.fact_episodes writes

How is a table used across all scripts?

sq_table_usage() finds every script that references a given table and describes its role in each relationship.

sq_table_usage(relationships, "dbo.episodes")
#>                       script_name   role
#> 1            fixture_comments.sql source
#> 2        fixture_ctes_chained.sql source
#> 3         fixture_ctes_simple.sql source
#> 4            fixture_group_by.sql source
#> 5  fixture_identifier_quoting.sql source
#> 6         fixture_insert_into.sql source
#> 7      fixture_multistatement.sql source
#> 8       fixture_schema_prefix.sql source
#> 9         fixture_select_into.sql source
#> 10    fixture_subquery_nested.sql source
#> 11    fixture_subquery_simple.sql source
#> 12        fixture_table_hints.sql source
#> 13          fixture_union_all.sql source

About

Discovers Database Relationships From Your Production SQL Code

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages