Update attrition by ablack3 · Pull Request #363 · darwin-eu-dev/TreatmentPatterns

ablack3 · 2026-03-31T22:14:53Z

update attrition recording. preserve person id. remove inequality full join in favor of inner join followed by filter.

…l join in favor of inner join followed by filter.

Copilot

Pull request overview

This PR updates how attrition is recorded and how cohort/event records are joined when constructing treatment histories, with the intent to preserve the original person identifier and to replace a non-equi full_join() with an equi-inner_join() plus post-filtering.

Changes:

Remove subject_id_origin handling and treat subject_id/personId as the preserved original ID (character).
Replace non-equi full_join() between events/targets with inner_join(personId) followed by a date-window filter(), and add a new attrition entry for “no event records”.
Re-number attrition reason_ids across pathway construction steps and adjust cohort table assertions accordingly.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
R/TreatmentPatterns-package.R	Removes `subject_id_origin` from declared global variables.
R/exportPatientLevel.R	Stops adding `subject_id_origin` to patient-level export; returns table unchanged.
R/constructPathways.R	Reworks event/target join strategy and updates attrition reason IDs (adds a new attrition step).
R/computePathways.R	Updates attrition reason ID and changes `subject_id` type assertion to character.
R/CDMInterface.R	Casts `subject_id` to character, changes person join key casting, and adds attrition entries after cohort filtering steps.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-31T22:20:35Z

R/constructPathways.R

+  # Attrition: persons with no event records (may appear as 'None' paths if nonePaths = TRUE)
+  attrCounts <- fetchAttritionCounts(andromeda, "treatmentHistoryJoined")
+  appendAttrition(
+    toAdd = data.frame(
+      number_records = attrCounts$nRecords,
+      number_subjects = attrCounts$nSubjects,
+      reason_id = 4,
+      reason = "Subjects with no event records (may appear as 'None' paths)"
+    ),
+    andromeda = andromeda
+  )


The new attrition step for “Subjects with no event records” is computed from treatmentHistoryJoined, but that table is created via inner_join(). Subjects with zero event records will not appear in treatmentHistoryJoined, so this attrition count cannot be correct. Compute this attrition from the target cohort population instead (e.g., target persons anti-joined against event persons), and/or compute “no events within window” after the window filter.

Copilot · 2026-03-31T22:20:35Z

R/constructPathways.R

+  # Step 1: Equi-join on person to match events to targets
+  andromeda$treatmentHistoryJoined <- dplyr::inner_join(
    x = andromeda$eventCohorts,
    y = andromeda$targetCohorts,
    by = dplyr::join_by(
-      personId == personId,
-      subject_id_origin == subject_id_origin,
-      y$indexDate <= x$startDate,
-      x$startDate <= y$endDate,
+      personId == personId
    ), suffix = c("Event", "Target")
  )

+  # Attrition: persons with no event records (may appear as 'None' paths if nonePaths = TRUE)
+  attrCounts <- fetchAttritionCounts(andromeda, "treatmentHistoryJoined")
+  appendAttrition(
+    toAdd = data.frame(
+      number_records = attrCounts$nRecords,
+      number_subjects = attrCounts$nSubjects,
+      reason_id = 4,
+      reason = "Subjects with no event records (may appear as 'None' paths)"
+    ),
+    andromeda = andromeda
+  )
+
+  # Step 2: Filter to events within the time window
+  andromeda[[sprintf("cohortTable_%s", targetCohortId)]] <- andromeda$treatmentHistoryJoined %>%
+    dplyr::filter(
+      .data$indexDate <= .data$startDateEvent,
+      .data$startDateEvent <= .data$endDateTarget
+    )


Joining targets/events on personId only and then filtering by dates can create a very large intermediate (cartesian within person when there are multiple targets and many events), which can severely impact runtime and memory on real CDM-scale data. Prefer pushing the date constraints into the join (non-equi join) or otherwise reduce the join size before materializing the combined table.

Copilot · 2026-03-31T22:20:35Z

R/constructPathways.R

+  # Step 1: Equi-join on person to match events to targets
+  andromeda$treatmentHistoryJoined <- dplyr::inner_join(
    x = andromeda$eventCohorts,
    y = andromeda$targetCohorts,
    by = dplyr::join_by(
-      personId == personId,
-      subject_id_origin == subject_id_origin,
-      y$indexDate <= x$startDate,
-      x$startDate <= y$endDate,
+      personId == personId
    ), suffix = c("Event", "Target")
  )



andromeda$treatmentHistoryJoined is persisted in the returned Andromeda even though it is only used as an intermediate for the window filter. This can significantly bloat the returned object (and disk usage) for large runs. Consider removing it after cohortTable_<target> is created, or storing it only behind a debug/diagnostics option.

Copilot · 2026-03-31T22:20:36Z

R/CDMInterface.R

      number_records = sum(n),
      number_subjects = length(n),
      reason_id = 2,
+      reason = "Persons with no record in target cohort",


The attrition reason says “Persons with no record in target cohort”, but the preceding filter keeps only subjects that DO have at least one target cohort record. This makes the attrition output misleading; rename the reason to reflect the step (e.g., “After filtering to persons with >=1 target cohort record”).

Suggested change

reason = "Persons with no record in target cohort",

reason = "After filtering to persons with >=1 target cohort record",

Copilot · 2026-03-31T22:20:36Z

R/CDMInterface.R

+  # Filter to persons with at least one target cohort record
  targetId <- as.numeric(targetCohortIds)
-  
+
  andromeda[[andromedaTableName]] <- andromeda[[andromedaTableName]] %>%
    dplyr::mutate(cohort_definition_id = as.numeric(.data$cohort_definition_id)) %>%
    dplyr::group_by(.data$subject_id) %>%
    dplyr::filter(any(.data$cohort_definition_id %in% targetId, na.rm = TRUE)) %>%
    dplyr::ungroup()


The target-cohort presence filter is applied before the minEraDuration filter. After removing short eras, some subjects may no longer have any target records, which contradicts the intent of the target filter and can skew attrition reporting. Apply minEraDuration filtering before the target filter or re-run the target filter after minEraDuration.

Copilot · 2026-03-31T22:20:36Z

R/computePathways.R

  assertions <- checkmate::makeAssertCollection()
  checkmate::assertIntegerish(cohortTableHead$cohort_definition_id, add = assertions)
-  checkmate::assertIntegerish(cohortTableHead$subject_id, add = assertions)
+  checkmate::assertCharacter(cohortTableHead$subject_id, add = assertions)


Changing subject_id validation to character implies personId is now character throughout outputs. This is a breaking API change for downstream consumers and also conflicts with existing tests that expect numeric/integer personId (e.g., tests/testthat/test-computePathways.R around the filterTreatments test). Either preserve numeric personId (and add a separate origin/id column) or update the public contract + tests accordingly.

Suggested change

checkmate::assertCharacter(cohortTableHead$subject_id, add = assertions)

checkmate::assertIntegerish(cohortTableHead$subject_id, add = assertions)

Copilot · 2026-03-31T22:20:36Z

R/exportPatientLevel.R

-    dplyr::select("personId", "subject_id_origin") %>%
-    dplyr::inner_join(table, by = dplyr::join_by(personId == personId)) %>%
-    dplyr::distinct()
+  # personId already contains the original subject_id (as character)


addOriginId() no longer adds the original ID column to the exported patient-level CSV, which changes the treatment_history.csv schema (and will likely change ncol() expectations in tests/testthat/test-exportPatientLevel.R). If the intent is to drop subject_id_origin, update the export contract/tests; otherwise keep exporting a dedicated origin identifier column separate from personId.

Suggested change

# personId already contains the original subject_id (as character)

# Ensure a dedicated origin identifier column is present for export.

# personId already contains the original subject_id (as character),

# so if subject_id_origin is missing we create it from personId.

if (!"subject_id_origin" %in% colnames(table)) {

table <- dplyr::mutate(table, subject_id_origin = .data$personId)

}

codecov · 2026-04-01T02:53:39Z

Codecov Report

❌ Patch coverage is 98.11321% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
R/constructPathways.R	96.29%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

mvankessel-EMC · 2026-04-01T08:41:37Z

@ablack3 I'm fully behind the extra attrition steps. Could you split that out into a separate PR? I still have to think about the other changes. Edit; also you could you let Copilot re-review after the changes you made on its initial feedback?

update attrition recording. preserve person id. remove inequality ful…

8321269

…l join in favor of inner join followed by filter.

ablack3 requested review from Copilot and mvankessel-EMC March 31, 2026 22:15

Copilot started reviewing on behalf of ablack3 March 31, 2026 22:15 View session

Copilot AI reviewed Mar 31, 2026

View reviewed changes

ablack3 mentioned this pull request Apr 1, 2026

Feature request: attrition for "not in any target cohorts" #361

Open

ablack3 marked this pull request as draft April 1, 2026 02:05

address pr comments

2c48424

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update attrition#363

Update attrition#363
ablack3 wants to merge 2 commits intodevelopfrom
issue_362_attrition

ablack3 commented Mar 31, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

Copilot AI Mar 31, 2026

Uh oh!

codecov bot commented Apr 1, 2026

Uh oh!

mvankessel-EMC commented Apr 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	reason = "Persons with no record in target cohort",
	reason = "After filtering to persons with >=1 target cohort record",

	checkmate::assertCharacter(cohortTableHead$subject_id, add = assertions)
	checkmate::assertIntegerish(cohortTableHead$subject_id, add = assertions)

-  # personId already contains the original subject_id (as character)
+  # Ensure a dedicated origin identifier column is present for export.
+  # personId already contains the original subject_id (as character),
+  # so if subject_id_origin is missing we create it from personId.
+  if (!"subject_id_origin" %in% colnames(table)) {
+    table <- dplyr::mutate(table, subject_id_origin = .data$personId)
+  }

Conversation

ablack3 commented Mar 31, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Apr 1, 2026

Codecov Report

Uh oh!

mvankessel-EMC commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mvankessel-EMC commented Apr 1, 2026 •

edited

Loading