DuckDB and R
for (CPRD)
Data Analysis

Dr Pietà-Georgina ‘Georgie’ Schofield 2026-01-21

Motivation
- CPRD Aurum Data
Problem
- Find Index Age Distribution by Dementia Type
A Solution
- duckdb, tidyverse, DBI and dbplyr
Example ETL
What Happened to T(ransform)
- Data Quality Issues
Example
- dbplyr
- tibble
- ggplot2
- Uh oh!
This is “Lazy Evaluation”

chain dbplyr then collect()
- Step 1.
- Step 1.
- Warning
- Step 2.
- Step 3.
- Step 3.
- Replot
Discussion
Thank You
pietas@liverpool.ac.uk
https://github.com/PietaSchofield

Motivation

CPRD Aurum Data

250000 dementia patient records from CPRD Aurum
13GB of data in 9 compressed files (each ~2GB)
Unzips to 82GB of data
- patient: 249961 records in 1 file
- practice: 1483 records in 1 file
- staff: 1336287 records in 1 file
- consultation: 120456179 records in 10 files
- observation: 425006899 records in 43 file
- referral: 2826888 records in 1 file
- problem: 14501709 records in 1 file(s)
- drugissue: 197222175 records in 35

Problem

Find Index Age Distribution by Dementia Type

You have a laptop and R and limited memory
You need to find index date for a number of codes groups representing different dementia types
R freezes or just crashes if you try to load 425 Million Records from 43 files
You could load each in turn and filter save but…
- What about patients spanning more than one file
- At some point you need to realise there are data quality issues to sort out
- Repeat the whole process

A Solution

`duckdb`, `tidyverse`, `DBI` and `dbplyr`

tidyverse: (tibble, dplyr, ggplot2, lubridate etc.)
- Other alternative object paradigms exist data.tables but I am at home in the tidyverse
duckdb: is a column orientated local file based sql database engine
- specially tuned for analytics (OLAP) over transaction processing (OLTP)
- similar but much better data science alternative to SQLite
DBI: connect R to databases
- dbplyr permits use of tidyverse functions of database tables rather than tibbles
- can use SQL queries that is your preference.

Example ETL

Extract

“CPRD fob-holder” has done that and you now have your 9 zipped files (13GB)
Copy them from CPRDGeneralinfo drive to a University asset you have read/write access to
- options university desktop / laptop / active data storage network drive (preferable)
- do not copy to unencrypted non_password protected devices (eg. penstick, personal laptops)
Unzip the compressed files into the 93 tab delimited text files (82GB)

Load

Install the rtrhd R package (GitHub)

install.packages("remotes")
remotes::install_github("PietaSchofield/rtrhd")

Get a copy of the CPRD Aurum/EMIS data-dictionary file from CPRDGeneralinfo drive
Then run load_with_data_dictionary (works for EMIS, GOLD, HES, ONS…)

rtrhd::load_with_data_dictionary(
  ddFile=file.path(...), # location of data dictionary file
  dbf=file.path(...), # location of duckdb database file to be created
  datadir=file.path(...), # location the text files that were unzipped 
  dset="..." # a prefix for tablenames eg. 'emis' or 'aurum' or 'gold'
)

Have a cup of coffee (go for lunch)

>
INFO [2026-01-16 16:54:35] emis_patient: 249961 records loaded from 1 file(s)
INFO [2026-01-16 16:54:36] emis_practice: 1483 records loaded from 1 file(s)
INFO [2026-01-16 16:54:37] emis_staff: 1336287 records loaded from 1 file(s)
INFO [2026-01-16 16:58:22] emis_consultation: 120456179 records loaded from 10 file(s)
INFO [2026-01-16 17:17:25] emis_observation: 425006899 records loaded from 43 file(s)
INFO [2026-01-16 17:17:29] emis_referral: 2826888 records loaded from 1 file(s)
INFO [2026-01-16 17:17:52] emis_problem: 14501709 records loaded from 1 file(s)
INFO [2026-01-16 17:27:34] emis_drugissue: 197222175 records loaded from 35 file(s)
>

18GB duckdb file created (you can delete the 83GB of text files now)
761,601,581 records ingested in 1978 seconds (~33 minutes)
Still think you need a database server?

Add the Code-list

coi <- file.path(.projLoc,"dementia_codes.xlsx") |>
  readxl::read_excel(col_type="text") |>
  mutate(
    preferred_term=iconv(preferred_term, 
                         from="UTF-8", to="ASCII//TRANSLIT"),
    proposed_group=iconv(proposed_group, 
                         from="UTF-8", to="ASCII//TRANSLIT")
 ) 

rtrhd::load_table(dbf=dbname,dataset=coi,tab_name="coi_emis_dementia")

rtrhd::list_db(dbname)

The Database

tablename	fields	records
coi_emis_dementia	medcodeid, snomedctconceptid, preferred_term, proposed_group, n_synonyms	671
emis_consultation	patid, consid, pracid, consdate, enterdate, staffid, conssourceid, cprdconstype, consmedcodeid	120456179
emis_drugissue	patid, issueid, pracid, probobsid, drugrecid, issuedate, enterdate, staffid, prodcodeid, dosageid, quantity, quantunitid, duration, estnhscost	197222175
emis_observation	patid, consid, pracid, obsid, obsdate, enterdate, staffid, parentobsid, medcodeid, value, numunitid, obstypeid, numrangelow, numrangehigh, probobsid	425006899
emis_patient	patid, pracid, usualgpstaffid, gender, yob, mob, emis_ddate, regstartdate, patienttypeid, regenddate, acceptable, cprd_ddate	249961
emis_practice	pracid, lcd, uts, region	1483
emis_problem	patid, obsid, pracid, parentprobobsid, probenddate, expduration, lastrevdate, lastrevstaffid, parentprobrelid, probstatusid, signid	14501709
emis_referral	patid, obsid, pracid, refsourceorgid, reftargetorgid, refurgencyid, refservicetypeid, refmodeid	2826888
emis_staff	staffid, pracid, jobcatid	1336287

What Happened to T(ransform)

Data Quality Issues

There will be lots of data quality issues you need to understand.
It is routinely collected healthcare data afterall
Data quality filters such practice
- Up To Standard Date (uts) (but this is not currently supplied, false confidence?)
- Last Collected Date (lcd)
These are much better understood if
- You can easily explore your data
- You can run many look-see queries
- You can visualise the results

Example

`dbplyr`

# connect to the database
dbcon <- DBI::dbConnect(duckdb::duckdb(),dbname,read_only=T)
# build the query
dementia_trends <- tbl(dbcon,"emis_observation") |>
  inner_join(tbl(dbcon,"coi_emis_dementia"), by = "medcodeid") |>
  group_by(patid,proposed_group) |> window_order(obsdate) |>
  summarise(
    index_date=first(obsdate),
    .groups="drop") |>
  inner_join(tbl(dbcon,"emis_patient") |>
             transmute(patid,yob,gender,regstartdate), by = "patid") |>
  mutate(index_year=year(index_date),index_age=index_year-yob) |>
# pull the data to R
  collect()
# disconnect the database
DBI::dbDisconnect(dbcon)

`tibble`

patid	proposed_group	yob	gender	index_year	index_age
********	Unspecified/Other dementia	1956	2	2024	68
********	Unspecified/Other dementia	1966	2	2018	52
********	Vascular	1961	2	2025	64
********	Unspecified/Other dementia	1949	2	2024	75
********	Unspecified/Other dementia	1948	1	2017	69
********	Other/Check	1947	1	2023	76
********	Unspecified/Other dementia	1956	1	2018	62
********	Unspecified/Other dementia	1936	2	2018	82
********	Unspecified/Other dementia	1946	2	2020	74
********	Unspecified/Other dementia	1942	2	2014	72

`ggplot2`

dementia_trends_plot <- dementia_trends %>%
  ggplot(aes(x = index_age, fill = proposed_group)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Age at First Dementia Diagnosis by Type (250K patients 425M Observations)",
    x = "Age at Index Diagnosis",
    y = "Density"
  ) +
  theme_minimal()

Uh oh!

Dementia trends plot

This is “Lazy Evaluation”

chain `dbplyr` then collect()

Step 1.

dbcon <- DBI::dbConnect(duckdb::duckdb(), dbname, read_only=T)
dementia_trends <- tbl(dbcon, "emis_observation") |>
  inner_join(tbl(dbcon, "coi_emis_dementia"), by = "medcodeid") |>
  inner_join(tbl(dbcon, "emis_patient"), by = c("patid","pracid")) |>
  transmute(patid,pracid,yob,sex=gender,regstart=regstartdate,regend=regenddate,
            obsid,obsdate,enterdate,medcodeid,dementia_type=proposed_group)

Step 1.

> is(dementia_trends)
[1] “tbl_duckdb_connection”
> dementia_trends |> mutate(patid=“****”,regstart=“****”,regend=“*****”) |> head()
# Source: SQL [?? x 11]
# Database: DuckDB 1.4.3 user@Linux 6.18.4-1-MANJARO:R 4.5.2//home/user/Projects/demrfmr/.data/duckdb/demrisk.duckdb
patid pracid yob sex regstart regend obsid obsdate enterdate medcodeid dementia_type
<chr> <int> <int> <int> <chr> <chr> <chr> <date> <date> <chr> <chr>
1 **** 20002 1970 1 **** ***** 17757124480 2022-11-03 2022-11-03 1867041000006112 Unspecified/Other dementia
2 **** 20002 1958 1 **** ***** 1158071968 2014-11-27 2014-11-27 1867041000006112 Unspecified/Other dementia
3 **** 20002 1961 1 **** ***** 23115694179 2024-06-17 2024-06-17 1867041000006112 Unspecified/Other dementia
4 **** 20002 1940 2 **** ***** 10295072833 2020-03-10 2020-03-10 1150521000000117 Unspecified/Other dementia
5 **** 20002 1929 2 **** ***** 7348443459 2019-06-07 2019-06-18 45046017 Alzheimera??s
6 **** 20002 1920 2 **** ***** 1161784698 2017-01-13 2017-01-13 2345931000000111 Unspecified/Other dementia
> dementia_trends |> tally()
# Source: SQL [?? x 1]
# Database: DuckDB 1.4.3 user@Linux 6.18.4-1-MANJARO:R 4.5.2//home/user/Projects/demrfmr/.data/duckdb/demrisk.duckdb
n
<dbl>
1 787996

Warning

Many R display/formatting functions will silently trigger collect() on lazy queries:
For example will all trigger collect() and potentially crash R:
- kable(huge_lazy_query) # knitr::kable()
- DT::datatable(huge_lazy_query) # DT package
- print(huge_lazy_query) # sometimes, depending on print method
- View(huge_lazy_query) # RStudio viewer

Step 2.

dementia_trends <- dementia_trends |>
  inner_join(tbl(dbcon, "emis_practice") |> transmute(pracid, lcd), 
    by= "pracid") |>
  filter(
    obsdate >= regstart,
    obsdate <= coalesce(lcd,coalesce(regend, as.Date('2024-12-31')))
  ) |> 
  group_by(patid, sex, yob, dementia_type) |>
  window_order(obsdate) |>
  summarise(
    index_date = first(obsdate),
    .groups = "drop"
  )

Step 3.

dementia_trends <- dementia_trends |>
  mutate(
    index_year = year(index_date),
    index_age = index_year - yob
  ) |>
  # Additional quality filters in DuckDB
  filter(
    index_age >= 0,
    index_age <= 110
  ) |>
  collect() |>
  mutate( dementia_type = gsub("a\\?\\?","'",dementia_type))
DBI::dbDisconnect(dbcon)

Step 3.

Replot

Dementia trends plot

Discussion

What This Talk Was Not

This wasn’t a talk about how to clean your data e.g. dodgy dates
Answer: DEPENDS ON YOUR RESEARCH QUESTION
- Prevalence study? → Maybe keep patient, drop bad record
- Disease progression study? → Whole patient sequence compromised
- Healthcare utilization? → Registration period matters more
An exhaustive tutorial on R, tidyverse, dbplyr or duckdb
Dissing data.tables or python or sql any other tool/language

What This Talk Was

To show the potential of duckdb, tidyverse and dbplyr
- for efficient data exploration
- for effective data visualisation
“This talk was brought to you by…”

# CRAN packages
install.packages(c("duckdb","tidyverse","dbplyr","remotes","rmarkdown"))

# My R Tools for Routine Healthcare Data (rtrhd) package
remotes::install_github("PietaSchofield/rtrhd")

WARNING on Manjaro Linux duckdb takes AGES!! to install … no really go for dinner, watch a movie contemplate, life choices long,… but on Windows and Mac I understand it is almost instant

Take Home Message

I loaded the data it took 33 minutes
To render these slides using rmarkdown::render
- live from the database
- counting rows in each of 9 tables
- queries processing 425Million records twice
- drawing two ggplots
- < 15 seconds

Thank You
pietas@liverpool.ac.uk
https://github.com/PietaSchofield

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
README_files/figure-gfm		README_files/figure-gfm
.gitignore		.gitignore
LICENSE		LICENSE
README.html		README.html
README.md		README.md
index.Rmd		index.Rmd
meta.yaml		meta.yaml

Folders and files

Latest commit

History

Repository files navigation

DuckDB and Rfor (CPRD)Data Analysis

Motivation

CPRD Aurum Data

Problem

Find Index Age Distribution by Dementia Type

A Solution

duckdb, tidyverse, DBI and dbplyr

Example ETL

Extract

Load

Have a cup of coffee (go for lunch)

Add the Code-list

The Database

What Happened to T(ransform)

Data Quality Issues

Example

dbplyr

tibble

ggplot2

Uh oh!

This is “Lazy Evaluation” chain dbplyr then collect()

Step 1.

Step 1.

Warning

Step 2.

Step 3.

Step 3.

Replot

Discussion

What This Talk Was Not

What This Talk Was

Take Home Message

Thank You pietas@liverpool.ac.ukhttps://github.com/PietaSchofield

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

DuckDB and R
for (CPRD)
Data Analysis

`duckdb`, `tidyverse`, `DBI` and `dbplyr`

`dbplyr`

`tibble`

`ggplot2`

This is “Lazy Evaluation”

chain `dbplyr` then collect()

Thank You
pietas@liverpool.ac.uk
https://github.com/PietaSchofield

Packages